# Universal approximation theorem

In the mathematical theory of artificial neural networks, the universal approximation theorem states that feedforward neural networks constructed of artificial neurons can approximate real-valued continuous functions on compact subsets of $\mathbb {R} ^{n}$ , to arbitrary accuracy. The theorem thus implies that simple neural networks can in principle be applied to nearly any problem, as they can approximate essentially any function of interest.

Early versions of the theorem considered networks of arbitrary width. In particular, these were established by George Cybenko in 1989, Kurt Hornik in 1991, and Moshe Leshno et al. in 1993. A simple general formulation was given by Allan Pinkus in 1999, which is the version stated here. Later versions considered the 'dual' problem for networks of arbitrary depth. In particular, these were established by Zhou Lu et al. in 2017 and by Boris Hanin and Mark Sellke in 2018. A simple general formulation was given by Patrick Kidger and Terry Lyons in 2020, which is the version stated here.

Several extensions of the theorem exist, such as to discontinuous activation functions, alternative network architectures, other topologies, and noncompact domains.

## Formal statements

The classical arbitrary width version of the theorem may be stated as follows.

**Universal approximation theorem (arbitrary width).** Let $\varphi :\mathbb {R} \to \mathbb {R}$ be any continuous function (called the activation function). Let $K\subseteq \mathbb {R} ^{n}$ be compact. The space of real-valued continuous functions on $K$ is denoted by $C(K)$. Let ${\mathcal {M}}$ denote the space of functions of the form

$F(x)=\sum _{i=1}^{N}v_{i}\varphi \left(w_{i}^{T}x+b_{i}\right),$ where $N\in \mathbb {N}$, the $v_{i},b_{i}\in \mathbb {R}$ are real constants, and the $w_{i}\in \mathbb {R} ^{n}$ are real vectors, for $i=1,\ldots ,N$.

Then, if and only if $\varphi$ is nonpolynomial, the following statement is true: given any $\varepsilon >0$ and any $f\in C(K)$ , there exists $F\in {\mathcal {M}}$ such that

$|F(x)-f(x)|<\varepsilon$ for all $x\in K$ .

In other words, ${\mathcal {M}}$ is dense in $C(K)$ with respect to the uniform norm if and only if $\varphi$ is nonpolynomial.

This theorem extends straightforwardly to networks with any fixed number of hidden layers: the theorem implies that the first layer can approximate any desired function, and that later layers can approximate the identity function. Thus any fixed-depth network may approximate any continuous function, and this version of the theorem applies to networks with bounded depth and arbitrary width.
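As a concrete illustration of the arbitrary width case, a one-hidden-layer ReLU network can reproduce the piecewise-linear interpolant of a continuous function on an interval, and refining the grid of knots drives the uniform error to zero; ReLU is nonpolynomial, so the theorem applies. The following NumPy sketch is illustrative (the function name and construction are not from the source):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shallow_relu_interpolant(f, a, b, num_neurons):
    """One-hidden-layer ReLU network F(x) = y_0 + sum_i v_i * relu(x - t_i)
    reproducing the piecewise-linear interpolant of f on [a, b].
    (Illustrative construction, one hidden neuron per interior knot.)"""
    t = np.linspace(a, b, num_neurons + 1)    # knots t_0 < ... < t_N
    y = f(t)
    slopes = np.diff(y) / np.diff(t)          # slope on each segment
    v = np.diff(slopes, prepend=0.0)          # slope changes -> output weights

    def F(x):
        x = np.asarray(x, dtype=float)
        # hidden layer: relu(x - t_i); output layer: affine combination
        return y[0] + relu(x[..., None] - t[:-1]) @ v

    return F

# Example: approximate sin on the compact set [0, pi] with 64 hidden neurons.
F = shallow_relu_interpolant(np.sin, 0.0, np.pi, 64)
xs = np.linspace(0.0, np.pi, 501)
max_error = np.max(np.abs(F(xs) - np.sin(xs)))
```

With 64 neurons the uniform error on this grid is already small; doubling the width roughly quarters it, consistent with the $O(h^{2})$ error of piecewise-linear interpolation of a smooth function.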

The 'dual' version of the theorem considers networks of bounded width and arbitrary depth. Unlike the previous theorem, it gives sufficient but not necessary conditions.

**Universal approximation theorem (arbitrary depth).** Let $\varphi :\mathbb {R} \to \mathbb {R}$ be any nonaffine continuous function which is continuously differentiable at at least one point, with nonzero derivative at that point. Let $K\subseteq \mathbb {R} ^{n}$ be compact. The space of real vector-valued continuous functions on $K$ is denoted by $C(K;\mathbb {R} ^{m})$. Let ${\mathcal {N}}$ denote the space of feedforward neural networks with $n$ input neurons, $m$ output neurons, and an arbitrary number of hidden layers, each with $n+m+2$ neurons, such that every hidden neuron has activation function $\varphi$ and every output neuron has the identity as its activation function. Then, given any $\varepsilon >0$ and any $f\in C(K;\mathbb {R} ^{m})$, there exists $F\in {\mathcal {N}}$ such that

$|F(x)-f(x)|<\varepsilon$ for all $x\in K$ .

In other words, ${\mathcal {N}}$ is dense in $C(K;\mathbb {R} ^{m})$ with respect to the uniform norm.
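The networks in ${\mathcal {N}}$ have a simple computational description: every hidden layer has the same fixed width $n+m+2$ and applies $\varphi$, while the output layer is affine (identity activation). The following NumPy sketch shows only this structure, with illustrative names and untrained random weights, not an approximation procedure:

```python
import numpy as np

def deep_narrow_forward(x, weights, phi):
    """Forward pass of a fixed-width feedforward network: every hidden layer
    applies the activation phi; the final layer is affine (identity output)."""
    h = x
    for W, b in weights[:-1]:
        h = phi(h @ W + b)   # hidden layers, all of the same width
    W, b = weights[-1]
    return h @ W + b         # identity activation on the output neurons

# Structural example for n = 2 inputs and m = 1 output: hidden width
# n + m + 2 = 5, as in the arbitrary depth theorem; depth is unconstrained.
# Weights are random here, purely to exercise the forward pass.
rng = np.random.default_rng(0)
n, m, width, depth = 2, 1, 5, 4
dims = [n] + [width] * depth + [m]
weights = [(rng.normal(size=(d0, d1)), rng.normal(size=d1))
           for d0, d1 in zip(dims[:-1], dims[1:])]
x = rng.normal(size=(10, n))                  # a batch of 10 input points
y = deep_narrow_forward(x, weights, np.tanh)  # tanh is nonaffine and C^1
```

Here $\tanh$ satisfies the hypotheses of the theorem (nonaffine, continuously differentiable with nonzero derivative at 0), so networks of this shape are dense in $C(K;\mathbb {R} ^{m})$ as the depth grows.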

Certain necessary conditions for the bounded width, arbitrary depth case have been established, but there is still a gap between the known sufficient and necessary conditions.

The arbitrary depth case admits nonaffine polynomial activation functions, such as $\varphi (z)=z^{2}$, which are specifically excluded from the arbitrary width case. This is an example of a qualitative difference between (particular interpretations of) deep and shallow neural networks.
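This difference can be checked directly with the nonaffine polynomial activation $\varphi (z)=z^{2}$: a one-hidden-layer network with this activation is always a polynomial of degree at most 2 in its input, no matter how wide it is, whereas composing just two squaring layers already represents $x^{4}$ exactly. A small NumPy check (illustrative, not from the source):

```python
import numpy as np

square = lambda z: z * z   # polynomial activation phi(z) = z^2 (nonaffine)

xs = np.linspace(-1.0, 1.0, 201)

# Shallow: F(x) = sum_i v_i * phi(w_i x + b_i) is a polynomial of degree
# at most 2 in x, regardless of the number of neurons (weights arbitrary).
rng = np.random.default_rng(0)
w, b, v = rng.normal(size=(3, 50))           # 50 hidden neurons
shallow = square(np.outer(xs, w) + b) @ v
p2 = np.polyfit(xs, shallow, 2)              # a degree-2 fit is already exact
shallow_residual = np.max(np.abs(shallow - np.polyval(p2, xs)))

# Deep: two composed squaring layers represent x^4 exactly, which no
# one-hidden-layer network with this activation can do on [-1, 1].
deep = square(square(xs))
```

The degree-2 fit recovers the shallow network's output to machine precision, confirming it is quadratic, while the depth-2 composition matches $x^{4}$ exactly.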