Operator Learning of Lipschitz Operators:
An Information-Theoretic Perspective

Samuel Lanthaler, California Institute of Technology

(Date: June 26, 2024)

Abstract.

Operator learning based on neural operators has emerged as a promising paradigm for the data-driven approximation of operators, mapping between infinite-dimensional Banach spaces. Despite significant empirical progress, our theoretical understanding regarding the efficiency of these approximations remains incomplete. This work addresses the parametric complexity of neural operator approximations for the general class of Lipschitz continuous operators. Motivated by recent findings on the limitations of specific architectures, termed the curse of parametric complexity, we here adopt an information-theoretic perspective. Our main contribution establishes lower bounds on the metric entropy of Lipschitz operators in two approximation settings: uniform approximation over a compact set of input functions, and approximation in expectation, with input functions drawn from a probability measure. It is shown that these entropy bounds imply that, regardless of the activation function used, neural operator architectures attaining an approximation accuracy $\epsilon$ must have a size that is exponentially large in $\epsilon^{-1}$ . The size of architectures is here measured by counting the number of encoded bits necessary to store the given model in computational memory. The results of this work elucidate fundamental trade-offs and limitations in operator learning.

1. Introduction

Operators mapping between infinite-dimensional Banach spaces of functions are ubiquitous in the natural sciences and engineering. They often appear in connection with physical models expressed as a set of partial differential equations, where operators of interest frequently arise from associated forward and inverse problems, e.g. mapping initial data to the solution at a later time, or identifying external forcing terms from (partial) knowledge of the solution.

Operator learning has emerged as a new paradigm for the data-driven approximation of such operators. Popular operator learning frameworks build on the success of neural networks, but generalize this notion to the infinite-dimensional context of operator approximation, resulting in so-called neural operators. These neural operator architectures define parametric operators, whose parameters are tuned to approximate an underlying operator of interest.

While there is a very rapidly growing body of empirical work demonstrating the great potential, and practical utility, of such data-driven approaches, many open questions remain in our understanding of the theoretical underpinnings of this field, see e.g. [30] for a recent review and references therein.

First theoretical insights into specific architectures, and their underlying approximation mechanisms, can be gained by studying universal approximation, i.e. the ability to approximate very general classes of operators. The study of universal approximation of neural operators dates back at least three decades, to early work on operator networks by Chen and Chen [12]. Due to the recent rise in the popularity of operator learning and the introduction of a number of novel state-of-the-art frameworks, this early work has been complemented by a number of papers in recent years, demonstrating similar universal approximation properties for various architectures; e.g. DeepONets [41, 36], PCA-Net [6, 34], Fourier neural operator [29] and general neural operators [31, 35], as well as multiple other architectures [54, 25, 23, 10, 11].

Universal approximation implies that there are no fundamental obstructions to operator learning with a given framework, and usually requires identification of basic approximation mechanisms that can be leveraged by a given architecture. However, to determine whether operator learning can be achieved efficiently, a refined quantitative analysis is required. In such quantitative analysis, one often distinguishes between parametric complexity, relating the required model size to the achieved accuracy, and sample efficiency, relating the number of required training samples to the achieved accuracy. The focus of the present work is on parametric complexity. For research relevant to the data complexity of operator learning, we mention, for example, [3, 4, 28, 44, 5].

A general class of operators for which efficient approximation is possible, in terms of the required number of tunable parameters, are so-called holomorphic operators. Research into the approximation of holomorphic operators goes back to the seminal work of Cohen, DeVore and Schwab [14, 15], where it was shown that this class of operators can be efficiently approximated by generalized polynomial expansions. More recently, these results have been extended to neural network and neural operator approximation in a series of works [51, 47, 52, 22, 2, 43], demonstrating that similar rates can be achieved by neural operators.

Other classes of operators for which efficient convergence rates have been derived are operator Barron spaces [27] and (operator) reproducing kernel Hilbert spaces (RKHS) [37, 46]. Alternative settings, such as parametric PDEs with low-dimensional latent structure are, for example, explored in [33, 38, 20].

Apart from these specific classes of operators, efficient approximation has also been established via a case-by-case analysis for several PDE solution operators [17, 29, 36, 34, 43, 21]. These results identify a number of individual operators of interest which can be efficiently approximated by certain operator learning frameworks. Despite this progress, a general theory encompassing all these examples has yet to emerge.

A very general class of operators of interest are Lipschitz operators. Approximation theory of relevance to such a general class of operators has been developed e.g. in [40, 21, 50, 48, 32]. All of these works aim to bound the number of tunable parameters (model size) in terms of the accuracy that can be achieved.

The present work will focus on deriving lower complexity bounds for the class of Lipschitz continuous operators $\mathcal{G}:\mathcal{D}\to\mathbb{R}$ , defined on an infinite-dimensional domain $\mathcal{D}$ and taking values in $\mathbb{R}$ (nonlinear Lipschitz functionals). Semantically, no distinction will be made between ‘functional’ and ‘operator’, since all lower bounds established for functionals continue to hold when considering operators with infinite-dimensional output spaces – the latter containing (infinitely many) copies of $\mathbb{R}$ .

In addition to the aforementioned literature on neural operator approximation theory, the present work also takes inspiration from the information-theoretic point of view on neural network approximation theory in a finite-dimensional setting, pioneered in the works [8, 9, 49, 53, 19], as well as notions of stable approximation [18, 13]. In the present work, the underlying ideas will be applied and extended to the infinite-dimensional context of operator learning.

The main motivation for this work are two recent results, established in [34] and [50] respectively, both applicable to the general setting of Lipschitz operators. A one-paragraph summary of the results in [34] and [50] is as follows:

(i) The first result [34] shows that certain neural operator architectures, based on ReLU activations, suffer from a curse of parametric complexity: under certain assumptions on the input functions, there exist Lipschitz continuous operators which can only be approximated to accuracy $\epsilon$ if the number of tunable parameters is exponential in $\epsilon^{-1}$ ; more precisely, the number of parameters must be at least as large as $C\exp(c\epsilon^{-\gamma})$ with problem-dependent constants $C,c,\gamma>0$ .

(ii) The second result [50] shows that, under similar assumptions on the input functions, neural operator architectures based on super-expressive activation functions can approximate general Lipschitz operators to accuracy $\epsilon$ with algebraically bounded parameter count; the number of parameters is upper bounded by $C\epsilon^{-\gamma}$ , for problem-dependent $C,\gamma>0$ .

While the first result, viewed in isolation, appears to hint at fundamental limitations to the development of operator learning theory on the general class of Lipschitz operators, due to the identified “curse”, the second result shows rigorously that this curse can be circumvented with a suitable choice of activation.

The aim of the present work is to examine the apparent dichotomy between these two results in detail. To this end, we explore the curse of parametric complexity from an information-theoretic perspective. As a result, we will uncover the fundamental information-theoretic character of the curse of parametric complexity, and identify the relevant trade-offs that are possible when parametric complexity is measured by the number of (real-valued) parameters as in [34, 50].

Main contributions

This work makes the following main contributions:

• We propose an information-theoretic perspective of operator learning, based on the relation between bit-encoding and Kolmogorov metric entropy; this provides an alternative to the prevalent analysis in the literature, which has focused on estimating the required number of real-valued parameters.

• For the model class of Lipschitz operators, we derive lower bounds on the metric entropy in two settings: one pertaining to uniform approximation, the other to approximation in expectation.

• These bounds imply, in either setting, that an exponentially large number of encoding bits is required to store the weights of any architecture achieving accuracy $\epsilon$ on the model class. This result holds independently of the activation function that is chosen.

• We use topological arguments to show that even generic operators can only be approximated with exponentially increasing complexity; when applied to the Fourier neural operator (FNO), this implies that the approximation of a generic Lipschitz operator, to accuracy $\epsilon$ , requires a number of tunable parameters exponential in $\epsilon^{-1}$ .

Overview

The remainder of this paper is organized as follows. In Section 2, we state the main results of this work, as they pertain to operator learning with neural operator architectures. This section contains the main conceptual contributions of this work and reviews the link between bit-encoding and Kolmogorov entropy. Several technical details are left to Sections 3 and 4; in Section 3, we derive lower bounds on the Kolmogorov metric entropy of the set of $1$ -Lipschitz operators in both a sup-norm and $L^{p}$ -norm approximation setting. In particular, we show that the metric $\epsilon$ -entropy increases exponentially with $\epsilon^{-1}$ , implying a general curse of parametric complexity for bit-encoded architectures. This is the first main technical contribution of this work. Approximation rates for generic operators are the subject of Section 4, where we first formulate the operator approximation problem in an abstract Banach space setting, and then use topological arguments to relate approximation rates of generic elements of a model class to the metric entropy of this class. This is the second main technical contribution of this work. Finally, Section 5 contains concluding remarks.

2. Main Results

This section contains a summary of the main results of this work, applied to the specific setting of operator learning. Several of these results are based on more general, abstract propositions which are included in subsequent Sections 3 and 4. To aid readability, we leave most technical details to these latter sections. The aim of this section is instead to explain the main ideas underlying our analysis, and their implications for operator learning. Recurring notation, to be introduced and discussed in the following, is summarized in Table 1.

Notation	Meaning
$\mathcal{G}:\mathcal{D}\subset\mathcal{X}\to\mathcal{Y}$	Nonlinear operator with domain $\mathcal{D}$
$\mathcal{X}$ , $\mathcal{Y}$	(Input/output) Banach spaces
$\mathcal{K}\subset\mathcal{X}$	Compact subset of inputs
$\mathcal{D}$	Operator domain, $\mathcal{D}=\mathcal{K}$ or $\mathcal{D}=\mathcal{X}$
$\mu\in\mathcal{P}(\mathcal{X})$	Probability measure on $\mathcal{X}$
$\mathrm{Lip}_{1}(\mathcal{D})$	Real-valued $1$ -Lipschitz operators, $\mathcal{G}:\mathcal{D}\to\mathbb{R}$
$\mathsf{\bm{V}}$	Banach space of operators, $\mathsf{\bm{V}}=C(\mathcal{K})$ or $\mathsf{\bm{V}}=L^{p}(\mu)$
$\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$	Compact subset of $\mathsf{\bm{V}}$ , e.g. $\mathsf{\bm{A}}=\mathrm{Lip}_{1}(\mathcal{D})$

Table 1. Recurring notation and definitions for operator learning.

2.1. Operator approximation by neural operators

We begin the discussion of our main results by proposing an encoder-decoder point of view on operator learning, where the encoder and decoder are implicitly defined by a given architecture. We then define approximation errors of interest and discuss two common measures to quantify the “complexity” of a given architecture. The first counts the number of tunable, real-valued parameters in the architecture. The second goes one step further, and requires specification of a bit-encoding of all parameters, i.e. encoding by a sequence of 0’s and 1’s. To fix intuition, this bit-encoding can be loosely interpreted as the representation of the parameters on computing hardware. The complexity of a bitwise-encoded architecture is measured by the number of bits required to represent it. As will be explained, this provides a link to fundamental information-theoretic concepts such as the Kolmogorov metric entropy of our model class.

2.1.1. Approximation theoretic setting

Assume we are given input and output spaces $\mathcal{X}$ , $\mathcal{Y}$ . A neural operator defines a parametrized mapping $\Phi:\mathcal{X}\times\mathbb{R}^{q}\to\mathcal{Y}$ , where $\theta\in\mathbb{R}^{q}$ are tunable parameters. Specification of $\theta$ defines an operator, $\Phi({\,\cdot\,};\theta):\mathcal{X}\to\mathcal{Y}$ . In practice, the training of a neural operator results in an optimized parameter choice $\theta_{\mathcal{G}}$ for given $\mathcal{G}:\mathcal{X}\to\mathcal{Y}$ and an approximation $\mathcal{G}\approx\Phi({\,\cdot\,};\theta_{\mathcal{G}})$ .

Model class $\mathrm{Lip}_{1}(\mathcal{D})$

In the following, we will consider a model class of $1$ -Lipschitz operators, restricting attention to the case of real-valued outputs, $\mathcal{Y}=\mathbb{R}$ :

Definition 2.1 (Model class $\mathrm{Lip}_{1}$ ).

Let $(\mathcal{D},d)$ be a metric space. We define $\mathrm{Lip}_{1}(\mathcal{D})$ as the set consisting of all $1$ -Lipschitz continuous mappings $\mathcal{G}:\mathcal{D}\to\mathbb{R}$ with $\|\mathcal{G}\|_{\mathrm{Lip}}\leq 1$ , where we define the $\|{\,\cdot\,}\|_{\mathrm{Lip}}$ -norm as follows:

(2.1)

\left\{\begin{aligned} \|\mathcal{G}\|_{\mathrm{Lip}}&=\max\Big\{\textstyle\sup_{u\in\mathcal{D}}|\mathcal{G}(u)|,\,\mathrm{Lip}(\mathcal{G})\Big\},\\ \mathrm{Lip}(\mathcal{G})&=\sup_{u\neq v}\frac{|\mathcal{G}(u)-\mathcal{G}(v)|}{d(u,v)}.\end{aligned}\right.
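The norm in (2.1) can be illustrated numerically. The following is a minimal sketch, under the assumption that inputs are discretized as finite-dimensional vectors and that $d$ is the Euclidean distance; the function name `lip_norm_on_sample` is hypothetical. Since both suprema are taken only over a finite sample, the returned value is a lower bound on $\|\mathcal{G}\|_{\mathrm{Lip}}$ , not the norm itself.

```python
import numpy as np

def lip_norm_on_sample(G, points):
    """Lower bound on ||G||_Lip from (2.1), using only the given sample points.

    Assumption: points are vectors and d(u, v) is the Euclidean distance.
    """
    values = np.array([G(u) for u in points])
    sup_abs = float(np.max(np.abs(values)))      # sup_u |G(u)| over the sample
    lip = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist = float(np.linalg.norm(points[i] - points[j]))
            if dist > 0.0:                       # skip coincident points
                lip = max(lip, abs(values[i] - values[j]) / dist)
    return max(sup_abs, lip)

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(50, 8))    # 50 discretized inputs (illustrative)

# A bounded 1-Lipschitz functional: the norm, clipped at 1.
G = lambda u: min(1.0, float(np.linalg.norm(u)))
estimate = lip_norm_on_sample(G, points)
print(estimate)
```

Because this `G` is bounded by 1 and 1-Lipschitz, any such sample-based estimate cannot exceed 1, consistent with $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{D})$ .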

As described in the introduction, the goal of operator learning is to approximate $\mathcal{G}:\mathcal{D}\to\mathbb{R}$ by a neural operator $\Phi:\mathcal{D}\times\mathbb{R}^{q}\to\mathbb{R}$ . In this work, we aim to relate the approximation accuracy $\epsilon$ to the required model size of $\Phi$ . We will focus on two settings, where either (i) $\mathcal{D}=\mathcal{K}\subset\mathcal{X}$ is a compact subset of a Banach space and the metric is the sup-norm over $\mathcal{K}$ , or (ii) $\mathcal{D}=\mathcal{X}$ is a Banach space and the metric is induced by the $L^{p}(\mu)$ -norm with respect to a probability measure $\mu$ on $\mathcal{X}$ (cp. Table 2).

Approximation spaces and norms

To quantify the accuracy achieved in this approximation task, we have to define a distance between operators. To this end, we will consider a Banach space of operators $\mathsf{\bm{V}}$ , allowing for an embedding $\mathrm{Lip}_{1}(\mathcal{D})\subset\mathsf{\bm{V}}$ . Throughout, we will consider one of the following two settings. In the first setting, we aim to approximate $\mathcal{G}$ over a compact domain $\mathcal{D}=\mathcal{K}\subset\mathcal{X}$ :

Setting 2.2 (Uniform approximation).

If $\mathcal{G}:\mathcal{K}\to\mathbb{R}$ is an operator with compact domain $\mathcal{K}\subset\mathcal{X}$ , we will study its uniform approximation over $\mathcal{K}$ , i.e. we take $\mathsf{\bm{V}}=C(\mathcal{K})$ to be the space of continuous operators, metrized by the sup-norm:

(2.2)

\|\mathcal{G}\|_{C(\mathcal{K})}=\sup_{u\in\mathcal{K}}|\mathcal{G}(u)|.

A common special case of this setting is the case where $\mathcal{K}\subset\mathcal{X}$ is defined by a smoothness constraint, as illustrated by the following example:

Example 2.3.

Let $D\subset\mathbb{R}^{d}$ be a bounded domain. An example of the setting above is the case of Lipschitz operators $\mathcal{G}:\mathcal{K}\subset L^{2}(D)\to\mathbb{R}$ , with

\mathcal{K}={\left\{u\in H^{s}(D)\,\middle|\,\|u\|_{H^{s}(D)}\leq C\right\}},

a set defined by a Sobolev smoothness constraint for $s>0$ . Here, $\mathcal{X}=L^{2}(D)$ .

In the second setting, we aim to approximate $\mathcal{G}$ over the entire Banach space $\mathcal{D}=\mathcal{X}$ , but with respect to a (Bochner) $L^{p}(\mu)$ -norm:

Setting 2.4 (Approximation in expectation).

If $\mathcal{G}:\mathcal{X}\to\mathbb{R}$ is an operator with unbounded domain $\mathcal{X}$ , a separable Banach space, then we will assume that inputs are drawn at random from a probability measure $\mu\in\mathcal{P}(\mathcal{X})$ . In this case, we fix $p\in[1,\infty)$ and take $\mathsf{\bm{V}}=L^{p}(\mu)$ as the space of $\mu$ -measurable operators with finite $p$ -th moment. $L^{p}(\mu)$ is metrized by the Bochner $L^{p}$ -norm,

(2.3)

\|\mathcal{G}\|_{L^{p}(\mu)}=\mathbb{E}_{u\sim\mu}\left[|\mathcal{G}(u)|^{p}\right]^{1/p}.
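As an illustration of (2.3), the $L^{p}(\mu)$ -norm can be estimated by Monte Carlo sampling. The sketch below rests on assumptions that are not part of the setting above: $\mu$ is taken to be a centered Gaussian measure on coefficient sequences truncated to `dim` entries, with standard deviations decaying like $1/k$ ; the helper `lp_norm_mc` is hypothetical.

```python
import numpy as np

def lp_norm_mc(G, samples, p):
    """Monte Carlo estimate of (E_{u~mu} |G(u)|^p)^(1/p) from i.i.d. samples of mu."""
    values = np.array([abs(G(u)) ** p for u in samples])
    return float(np.mean(values) ** (1.0 / p))

rng = np.random.default_rng(0)
dim = 32                                  # truncation dimension (assumption)
std = 1.0 / np.arange(1, dim + 1)         # decaying coefficient scales (assumption)
samples = rng.standard_normal((2000, dim)) * std

# A bounded 1-Lipschitz functional, used purely as a test case.
G = lambda u: min(1.0, float(np.linalg.norm(u)))
norm_1 = lp_norm_mc(G, samples, p=1)
norm_2 = lp_norm_mc(G, samples, p=2)
print(norm_1, norm_2)
```

Since $\mu$ is a probability measure, the estimates are nondecreasing in $p$ when computed from the same samples, and they are bounded by $\sup_{u}|\mathcal{G}(u)|$ .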

Operator domain	Operator class	Approximation space	Norm
$\mathcal{K}\subset\mathcal{X}$ compact	$\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{K})$	$\mathsf{\bm{V}}=C(\mathcal{K})$	sup-norm
$\mathcal{X}$ Banach	$\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{X})$	$\mathsf{\bm{V}}=L^{p}(\mu)$	$L^{p}(\mu)$ -norm

Table 2. Operator approximation settings

Measures of complexity: Counting parameters versus bits

We will distinguish two ways of measuring the “complexity” of a neural operator $\Phi({\,\cdot\,};\theta)$ : one based on the number of tunable (real-valued) parameters, the other requiring bit-encoding (or quantization) of the parameters.

A first intuitive notion of complexity is the minimal number of tunable parameters required to reach approximation accuracy $\epsilon$ , i.e. the parameter dimension $q$ of a neural operator $\Phi:\mathcal{D}\times\mathbb{R}^{q}\to\mathcal{Y}$ . As mentioned in the introduction, this point of view has been prevalent in the development of approximation theory for operator learning. However, depending on the type of activation function that is used, vastly different conclusions can be reached with this definition of complexity. This fact is well-known in the finite-dimensional setting: for example, it has been shown [42] that there exist smooth, sigmoidal activation functions for which a neural network of fixed size can approximate any continuous function to arbitrary accuracy, i.e. approximation accuracy $\epsilon$ can be reached with a number of parameters $q=O(1)$ .

In practical implementations, real-valued parameters can only be digitally represented to finite accuracy. This observation has led a number of authors [8, 9, 49, 53, 19] to analyze neural network approximation from a bit-encoding perspective. In this approach, the continuous parameters $\theta\in\mathbb{R}^{q}$ are replaced by quantized parameters $\theta\in\Theta$ , where $\Theta\subset\mathbb{R}^{q}$ is a finite set. If the number of elements is bounded, say $|\Theta|=2^{B}$ for some $B\in\mathbb{N}$ , then we can identify $\Theta\simeq\{0,1\}^{B}$ , i.e. each element in the set $\Theta$ is encoded by a string of $B$ bits. Taking this information-theoretic point of view, it is possible to derive (lower) complexity bounds that are independent of the activation function.
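The identification $\Theta\simeq\{0,1\}^{B}$ can be made concrete with a simple quantization scheme. The following sketch assumes one particular choice, which the text does not prescribe: each of the $q$ parameters is rounded to a uniform grid of $2^{b}$ levels on $[-R,R]$ , so that $B=qb$ . The names `quantize`, `to_bits` and `from_bits` are hypothetical.

```python
import numpy as np

def quantize(theta, bits_per_param, radius):
    """Round each entry of theta to the nearest point of a uniform grid on [-radius, radius]."""
    grid = np.linspace(-radius, radius, 2 ** bits_per_param)
    idx = np.abs(theta[:, None] - grid[None, :]).argmin(axis=1)
    return idx, grid[idx]

def to_bits(idx, bits_per_param):
    """Concatenate fixed-width binary codes of the grid indices into one bit string."""
    return "".join(format(int(i), f"0{bits_per_param}b") for i in idx)

def from_bits(bitstring, bits_per_param, radius):
    """Invert to_bits: recover the quantized parameter vector from its bit string."""
    grid = np.linspace(-radius, radius, 2 ** bits_per_param)
    idx = [int(bitstring[k:k + bits_per_param], 2)
           for k in range(0, len(bitstring), bits_per_param)]
    return grid[idx]

theta = np.array([0.3, -0.7, 0.05])            # q = 3 real-valued parameters
idx, theta_q = quantize(theta, bits_per_param=4, radius=1.0)
code = to_bits(idx, bits_per_param=4)          # B = q * b = 12 bits
print(code, theta_q)
```

In this scheme the finite set $\Theta$ is the product grid, with $|\Theta|=2^{qb}$ elements, and the rounding error per parameter is at most half the grid spacing.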

2.2. Encoder-decoder view of neural operators

Given the discussion of the last paragraph, we now outline an encoder-decoder point of view on neural operators, emphasizing the difference between “counting parameters” and “counting (encoding) bits”.

Counting parameters

Let $\Phi:\mathcal{D}\times\mathbb{R}^{q}\to\mathbb{R}$ be a neural operator architecture. To explain our intuition, we temporarily assume the existence of, and fix, an optimal parameter choice $\theta_{\mathcal{G}}\in\mathbb{R}^{q}$ for each $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{D})$ , so that

(2.4)

\theta_{\mathcal{G}}\in\operatorname*{argmin}_{\theta\in\mathbb{R}^{q}}\|\mathcal{G}-\Phi({\,\cdot\,};\theta)\|_{\mathsf{\bm{V}}},\quad\forall\,\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{D}),

with respect to the relevant norm of interest on the space of operators $\mathsf{\bm{V}}\supset\mathrm{Lip}_{1}(\mathcal{D})$ . The corresponding encoder is then given by

(2.5)

\mathcal{E}:\mathrm{Lip}_{1}(\mathcal{D})\to\mathbb{R}^{q},\quad\mathcal{G}\mapsto\theta_{\mathcal{G}}.

The corresponding decoder is

(2.6)

\mathcal{D}:\mathbb{R}^{q}\to\mathsf{\bm{V}},\quad\theta\mapsto\Phi({\,\cdot\,};\theta).

In this way, the operator learning architecture $\Phi$ induces a natural encoder/decoder pair on the relevant space of operators, and we are interested in bounds on the encoding error, either for individual $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{D})$ , i.e.

(2.7)

\mathrm{Err}(\mathcal{G};\Phi)_{\mathsf{\bm{V}}}=\inf_{\theta\in\mathbb{R}^{q}}\|\mathcal{G}-\Phi({\,\cdot\,};\theta)\|_{\mathsf{\bm{V}}},

or in a minimax sense, i.e.

(2.8)

\mathrm{Err}(\mathrm{Lip}_{1}(\mathcal{D});\Phi)_{\mathsf{\bm{V}}}=\sup_{\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{D})}\inf_{\theta\in\mathbb{R}^{q}}\|\mathcal{G}-\Phi({\,\cdot\,};\theta)\|_{\mathsf{\bm{V}}}.

Given a desired approximation accuracy $\epsilon>0$ , either in the sense (2.7) or (2.8), one quantity of interest is the required “complexity” of any architecture $\Phi$ achieving this accuracy. The above point of view is consistent with estimates on the required number of parameters $q$ .

Counting bits

As discussed before, the number of parameters $q$ is not a suitable measure of complexity when results independent of the activation are sought. Therefore, we now assume that the parameters $\theta\in\mathbb{R}^{q}$ are encoded by $B$ bits. This defines a subset $\Theta\subset\mathbb{R}^{q}$ consisting of $|\Theta|=2^{B}$ elements. Each $\theta\in\Theta$ is in correspondence with its bit-encoding $[\theta]\in\{0,1\}^{B}$ . Thus, upon associating with any $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{D})$ the optimal $\theta_{\mathcal{G}}\in\Theta$ , the continuum encoder (2.5) is now replaced by a bitwise-encoder,

(2.9)

\mathfrak{E}:\mathrm{Lip}_{1}(\mathcal{D})\to\{0,1\}^{B},\quad\mathcal{G}\mapsto[\theta_{\mathcal{G}}],

with bitwise-decoder,

(2.10)

\mathfrak{D}:\{0,1\}^{B}\to\mathsf{\bm{V}},\quad[\theta]\mapsto\Phi({\,\cdot\,};\theta).

The individual and minimax errors, (2.7) and (2.8), have the following bit-encoded counterparts,

(2.11)

\mathrm{Err}(\mathcal{G};\Phi,\Theta)_{\mathsf{\bm{V}}}=\inf_{\theta\in\Theta}\|\mathcal{G}-\Phi({\,\cdot\,};\theta)\|_{\mathsf{\bm{V}}},

and

(2.12)

\mathrm{Err}(\mathrm{Lip}_{1}(\mathcal{D});\Phi,\Theta)_{\mathsf{\bm{V}}}=\sup_{\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{D})}\inf_{\theta\in\Theta}\|\mathcal{G}-\Phi({\,\cdot\,};\theta)\|_{\mathsf{\bm{V}}}.

In the present work, we will focus on such a bit-encoding point of view, but mention that there are close links between these two points of view, if the mapping $\theta\mapsto\Phi({\,\cdot\,};\theta)$ possesses some stability properties. Specifically, this link will be used to derive lower complexity bounds for the Fourier neural operator in Section 2.6.

2.3. Information-theoretic notions

The relevance of the bit-encoding point of view is that it relates directly to the (Kolmogorov) metric entropy of the underlying model class $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ and allows results to be derived which are independent of specifics of the architecture such as the choice of activation function. Thus bit-encoding enables analysis relating directly to intrinsic topological properties of $\mathsf{\bm{A}}$ .

Minimax code-length

Abstracting further our previous discussion, we make the following formal definition of abstract bitwise encoder/decoder pairs:

Definition 2.5 (Abstract bitwise encoder/decoder pairs).

Given a compact subset $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ of a Banach space $\mathsf{\bm{V}}$ , we denote by $\mathrm{Enc}_{B}(\mathsf{\bm{A}};\mathsf{\bm{V}})$ the set of all bitwise encoder/decoder pairs $(\mathfrak{E},\mathfrak{D})$ of length $B$ , i.e. all pairs of mappings $\mathfrak{E}:\mathsf{\bm{A}}\to\{0,1\}^{B}$ and $\mathfrak{D}:\{0,1\}^{B}\to\mathsf{\bm{V}}$ .

Following [9], for $\epsilon>0$ , we also introduce the minimax code length $\mathcal{L}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}$ of a compact set $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ as the minimal number of bits $B$ for which there exists an (abstract) encoder/decoder pair $(\mathfrak{E},\mathfrak{D})\in\mathrm{Enc}_{B}(\mathsf{\bm{A}};\mathsf{\bm{V}})$ such that

\sup_{\mathcal{G}\in\mathsf{\bm{A}}}\|\mathcal{G}-\mathfrak{D}\circ\mathfrak{E}(\mathcal{G})\|_{\mathsf{\bm{V}}}\leq\epsilon.

That is,

(2.15)

\mathcal{L}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}:=\min{\left\{B\in\mathbb{N}\,\middle|\,\begin{gathered}\exists\,(\mathfrak{E},\mathfrak{D})\in\mathrm{Enc}_{B}(\mathsf{\bm{A}};\mathsf{\bm{V}})\text{ s.t. }\\ \textstyle\sup_{\mathcal{G}\in\mathsf{\bm{A}}}\|\mathcal{G}-\mathfrak{D}\circ\mathfrak{E}(\mathcal{G})\|_{\mathsf{\bm{V}}}\leq\epsilon\end{gathered}\right\}}.

Kolmogorov metric entropy

Given a metric space $(\mathsf{\bm{V}},d)$ , an element $g\in\mathsf{\bm{V}}$ and $r>0$ , we denote by

\overline{B_{r}}(g):={\left\{f\in\mathsf{\bm{V}}\,\middle|\,d(g,f)\leq r\right\}},

the closed ball of radius $r$ . We now make the following definition for the covering number and (Kolmogorov) metric entropy:

Definition 2.6 (Covering number and metric entropy).

Let $(\mathsf{\bm{V}},d)$ be a metric space. For $\epsilon>0$ , the $\epsilon$ -covering number of a set $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ , denoted $\mathcal{N}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}$ , is the smallest integer $N\in\mathbb{N}$ , such that $\mathsf{\bm{A}}$ can be covered by $N$ closed balls of radius $\epsilon$ , i.e.

(2.16)

\mathcal{N}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}:=\min{\left\{N\in\mathbb{N}\,\middle|\,\exists\,g_{1},\dots,g_{N}\in\mathsf{\bm{V}},\text{ s.t. }\mathsf{\bm{A}}\subset\textstyle{\bigcup_{j=1}^{N}}\overline{B_{\epsilon}}(g_{j})\right\}}.

We note that the subscript $\mathsf{\bm{V}}$ is used as a shorthand for $(\mathsf{\bm{V}},d)$ , with the relevant metric $d$ implied. The metric entropy of $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ is defined as the logarithm (to base $2$ ) of the covering number, i.e.

(2.17)

\mathcal{H}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}=\log_{2}\mathcal{N}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}.
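For a finite point cloud, Definition 2.6 can be explored numerically. The sketch below uses a greedy construction, which is an assumption of this illustration and not a method from the text: it returns some valid $\epsilon$ -cover, so its size is only an upper bound on the covering number (2.16), and its base-2 logarithm an upper bound on the metric entropy (2.17). The names `greedy_cover` and `metric_entropy_upper` are hypothetical, and the Euclidean distance stands in for the metric $d$ .

```python
import math
import numpy as np

def greedy_cover(points, eps):
    """Select centers so that every point lies within distance eps of some center."""
    centers = []
    for x in points:
        # x becomes a new center only if no existing closed eps-ball contains it
        if all(np.linalg.norm(x - c) > eps for c in centers):
            centers.append(x)
    return centers

def metric_entropy_upper(points, eps):
    """log2 of the greedy cover size: an upper bound on the metric entropy of the point set."""
    return math.log2(len(greedy_cover(points, eps)))

# Toy set A: 101 equispaced points of the unit interval, viewed as 1-d vectors.
A = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
centers = greedy_cover(A, eps=0.1)
print(len(centers), metric_entropy_upper(A, eps=0.1))
```

The greedy centers are pairwise more than $\epsilon$ apart, which is why their number stays controlled; computing the exact minimum in (2.16) is a harder combinatorial problem that this sketch does not attempt.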

Link between minimax code-length and metric entropy

The minimax code-length and metric entropy introduced in the previous paragraphs are linked by the following fundamental result [16, Rmk. 5.10]:

Proposition 2.7.

Let $\mathsf{\bm{V}}$ be a Banach space, and let $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ be compact. Then the metric entropy of $\mathsf{\bm{A}}$ provides a lower bound on the minimax code length:

(2.18)

\mathcal{L}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}\geq\mathcal{H}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}.

Proof.

Let $\epsilon>0$ be given. Let $(\mathfrak{E},\mathfrak{D})$ be a bitwise encoder/decoder pair with $B=\mathcal{L}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}$ bits, achieving reconstruction error at most $\epsilon$ on $\mathsf{\bm{A}}$ . The image of $\mathfrak{D}:\{0,1\}^{B}\to\mathsf{\bm{V}}$ contains at most $N=2^{B}$ elements, $\mathcal{G}_{1},\dots,\mathcal{G}_{N}$ . Since, for any $\mathcal{G}\in\mathsf{\bm{A}}$ , the specific choice $\mathfrak{D}\circ\mathfrak{E}(\mathcal{G})$ belongs to the image of $\mathfrak{D}$ , it follows that

\sup_{\mathcal{G}\in\mathsf{\bm{A}}}\inf_{n=1,\dots,N}\|\mathcal{G}-\mathcal{G}_{n}\|\leq\sup_{\mathcal{G}\in\mathsf{\bm{A}}}\|\mathcal{G}-\mathfrak{D}\circ\mathfrak{E}(\mathcal{G})\|\leq\epsilon.

Thus, $\mathsf{\bm{A}}\subset\bigcup_{n=1}^{N}\overline{B_{\epsilon}}(\mathcal{G}_{n})$ , implying that the covering number of $\mathsf{\bm{A}}$ is bounded by

\mathcal{N}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}\leq N=2^{B}.

Taking logarithms and recalling that $B=\mathcal{L}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}$ yields the claim. ∎

In particular, Proposition 2.7 implies that if $\mathcal{H}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}>B$ , then there cannot exist a bit-encoder-decoder pair $(\mathfrak{E},\mathfrak{D})\in\mathrm{Enc}_{B}(\mathsf{\bm{A}};\mathsf{\bm{V}})$ achieving uniform decoding accuracy $\epsilon$ over $\mathsf{\bm{A}}$ . Conversely, if $(\mathfrak{E},\mathfrak{D})$ is an encoder-decoder pair (2.9), (2.10) associated with a bit-encoded neural operator $\Phi:\mathcal{D}\times\Theta\to\mathbb{R}$ with $|\Theta|\leq 2^{B}$ , and if the following minimax approximation bound holds,

\sup_{\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{D})}\inf_{\theta\in\Theta}\|\mathcal{G}-\Phi({\,\cdot\,};\theta)\|_{\mathsf{\bm{V}}}\leq\epsilon,

then $B\geq\mathcal{H}(\mathrm{Lip}_{1}(\mathcal{D});\epsilon)_{\mathsf{\bm{V}}}$ .
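The counting argument behind Proposition 2.7 can be checked on a toy example. The sketch below is an illustration under assumptions of our own: the "model class" is a random point cloud in the unit square rather than a class of operators, the dictionary is a $4\times 4$ grid of centers, and `make_codec` is a hypothetical helper. It builds a bitwise encoder/decoder pair from the dictionary and then verifies that the image of the decoder, which has at most $2^{B}$ elements, is an $\epsilon$ -cover.

```python
import itertools
import math
import numpy as np

def make_codec(centers):
    """Bitwise encoder/decoder pair built from a finite dictionary of centers."""
    n_bits = max(1, math.ceil(math.log2(len(centers))))

    def encode(x):
        # index of the nearest center, written as a fixed-width bit string
        j = int(np.argmin([np.linalg.norm(x - c) for c in centers]))
        return format(j, f"0{n_bits}b")

    def decode(bits):
        # bit strings beyond the dictionary size are mapped to the last center
        return centers[min(int(bits, 2), len(centers) - 1)]

    return encode, decode, n_bits

rng = np.random.default_rng(0)
A = rng.uniform(0.0, 1.0, size=(200, 2))             # toy stand-in for the model class
ticks = 0.125 + 0.25 * np.arange(4)
centers = [np.array([a, b]) for a in ticks for b in ticks]

encode, decode, B = make_codec(centers)
worst_error = max(np.linalg.norm(x - decode(encode(x))) for x in A)

# Image of the decoder over all 2^B bit strings: at most 2^B distinct elements.
image = {tuple(decode("".join(bits))) for bits in itertools.product("01", repeat=B)}
print(B, worst_error, len(image))
```

Here every point of the unit square lies within $0.125\sqrt{2}$ of a grid center, so the pair achieves that accuracy with $B=4$ bits; conversely, no pair with fewer bits than the metric entropy at that accuracy could do so.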

2.4. Information-theoretic minimax bounds

As a consequence of Proposition 2.7, we can derive a lower bound on the required number of bits $B$ to achieve the minimax bound (2.12) by estimating the entropy of $\mathrm{Lip}_{1}(\mathcal{D})\subset\mathsf{\bm{V}}$ . As mentioned before, we will consider two settings, corresponding to uniform approximation of $\mathcal{G}$ over a compact set $\mathcal{K}$ (the setting $\mathcal{D}=\mathcal{K}$ ) and approximation with respect to a Bochner $L^{p}(\mu)$ -norm for probability measure $\mu$ (the setting $\mathcal{D}=\mathcal{X}$ ).

Uniform approximation

We now consider $\mathcal{K}\subset\mathcal{X}$ a compact set of input functions, and operators belonging to $\mathrm{Lip}_{1}(\mathcal{K})\subset C(\mathcal{K})$ (cp. Setting 2.2). This corresponds to the choice $\mathcal{D}=\mathcal{K}$ , $\mathsf{\bm{A}}=\mathrm{Lip}_{1}(\mathcal{K})$ , $\mathsf{\bm{V}}=C(\mathcal{K})$ , in the discussion of the previous section. We then have the following result:

Theorem 2.8.

Let $\mathcal{X}$ be a Banach space. Let $\mathcal{K}\subset\mathcal{X}$ be a compact set of input functions, and assume that the metric entropy of $\mathcal{K}$ satisfies the lower bound, $\mathcal{H}(\mathcal{K};\epsilon)_{\mathcal{X}}\geq c_{\alpha}\epsilon^{-1/\alpha}$ for $\alpha>0$ . There exists a constant $c>0$ , independent of $\epsilon$ , such that the following holds: If $\Phi:\mathcal{K}\times\Theta\to\mathbb{R}$ is a quantized neural operator architecture, satisfying

\sup_{\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{K})}\inf_{\theta\in\Theta}\|\mathcal{G}-\Phi({\,\cdot\,};\theta)\|_{C(\mathcal{K})}\leq\epsilon,

and if $|\Theta|\leq 2^{B}$ , i.e. if the parameters of $\Phi$ can be encoded by $B$ bits, then

B\geq\exp(c\epsilon^{-1/\alpha}).

Proof.

The claim follows from the relation between the minimax code-length and the metric entropy of $\mathrm{Lip}_{1}(\mathcal{K})\subset C(\mathcal{K})$ , stated in the above Proposition 2.7, and the following general bound on $\mathcal{H}(\mathrm{Lip}_{1}(\mathcal{K}),\epsilon)_{C(\mathcal{K})}$ :

\mathcal{H}(\mathrm{Lip}_{1}(\mathcal{K});\epsilon)_{C(\mathcal{K})}\geq 2^{% \mathcal{H}(\mathcal{K},6\epsilon)_{\mathcal{X}}}.

This bound will be shown in Section 3.2, Proposition 3.2. Assuming this bound, then by assumption on $\mathcal{K}$ , we have $2^{\mathcal{H}(\mathcal{K},6\epsilon)_{\mathcal{X}}}\geq\exp(c\epsilon^{-1/% \alpha})$ for constant $c>0$ . ∎

If $\mathcal{X}$ is a function space, then compact subsets $\mathcal{K}\subset\mathcal{X}$ are commonly defined by a smoothness constraint, and this partly motivates our assumption on $\mathcal{K}$ in the last theorem. The following example is illustrative.

Example 2.9.

Let $D\subset\mathbb{R}^{d}$ be a bounded domain. Let $\mathcal{X}=L^{2}(D)$ . An example of the setting outlined above is the case of Lipschitz operators $\mathcal{G}:\mathcal{K}\to\mathbb{R}$ , with

\mathcal{K}={\left\{u\in H^{s}(D)\,\middle|\,\|u\|_{H^{s}(D)}\leq C\right\}},

defined by a Sobolev smoothness constraint for $C,s>0$ . In this case, it is well-known that the metric entropy satisfies $\mathcal{H}(\mathcal{K};\epsilon)_{\mathcal{X}}\gtrsim\epsilon^{-d/s}$ , i.e. the assumptions of Theorem 2.8 hold with $\alpha=s/d$ .
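To get a feeling for the size of the bound in Theorem 2.8, the following minimal sketch evaluates $\log_{2}$ of the lower bound $\exp(c\epsilon^{-1/\alpha})$ for the Sobolev setting of Example 2.9, where $\alpha=s/d$. The theorem does not specify the constant $c$; the value `c=1.0`, the function name `log2_bit_lower_bound` and the chosen values of $s$, $d$ and $\epsilon$ are illustrative assumptions only.

```python
import math


def log2_bit_lower_bound(eps, alpha, c=1.0):
    """Return log2 of exp(c * eps**(-1/alpha)), the lower bound on the bit count B.

    The constant c is NOT given by the theorem; c = 1.0 is a placeholder.
    Working with log2(B) avoids floating-point overflow for small eps.
    """
    if eps <= 0 or alpha <= 0:
        raise ValueError("eps and alpha must be positive")
    return c * eps ** (-1.0 / alpha) / math.log(2.0)


# Sobolev ball in H^s(D) with D in R^d: alpha = s / d (Example 2.9).
s, d = 2.0, 2.0  # hypothetical smoothness and dimension
for eps in (0.5, 0.1, 0.05, 0.01):
    bound = log2_bit_lower_bound(eps, s / d)
    print(f"eps = {eps:<5} log2(B) >= {bound:.1f}")
```

Halving $\epsilon$ multiplies $\log_{2}B$ by $2^{1/\alpha}$, so the required number of bits itself grows exponentially in $\epsilon^{-1/\alpha}$.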

Approximation in expectation

Another commonly studied setting concerns the approximation in expectation (cp. Setting 2.4). Here, we consider $1$ -Lipschitz mappings $\mathcal{G}:\mathcal{X}\to\mathbb{R}$ defined on a separable Hilbert space $\mathcal{X}$ . We fix a probability measure $\mu$ on $\mathcal{X}$ and consider inputs as random draws $u\sim\mu$ . To derive quantitative lower bounds, we will need to make minimal structural assumptions on $\mu$ .

Assumption 2.10.

There exists an orthonormal basis $e_{1},e_{2},\dots$ of $\mathcal{X}$ , probability space $(\Omega,\mathbb{P})$ and summable coefficients $\lambda_{1}\geq\lambda_{2}\geq\dots$ , such that $\mu$ is the law of a random variable $u:\Omega\to\mathcal{X}$ of the form,

(2.19)

\displaystyle u(\omega)=\sum_{j=1}^{\infty}\sqrt{\lambda_{j}}Z_{j}(\omega)e_{j},\quad(\omega\in\Omega),

where $Z_{j}:\Omega\to\mathbb{R}$ are jointly independent random variables. We assume that the random variable $Z_{j}$ satisfies $\mathbb{E}|Z_{j}|^{2}=1$ , and has law $Z_{j}\sim\rho_{j}(z)\,dz$ for a probability density function $\rho_{j}:\mathbb{R}\to\mathbb{R}_{+}$ . We furthermore assume that there exists a constant $L>0$ , such that

(2.20)

\displaystyle\sup_{j\in\mathbb{N}}\|\rho_{j}\|_{L^{\infty}(\mathbb{R})}\leq L,% \quad\sqrt{\lambda_{1}}\leq L.

A concrete, and widely considered, example satisfying Assumption 2.10 is the case of a Gaussian probability measure $\mu$ with prescribed mean and covariance operator. In this case, $\lambda_{j}$ are the eigenvalues of the covariance operator, $e_{j}$ the corresponding eigenfunctions, and the random variables $Z_{j}\sim\rho_{j}$ have standard Gaussian distribution.
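The Gaussian example can be sketched numerically by truncating the expansion (2.19) after finitely many modes. The snippet below is a minimal illustration under stated assumptions: the decay $\sqrt{\lambda_{j}}=j^{-\alpha}$ with $\alpha=1$ (so that $\lambda_{j}$ is summable), the truncation level `num_modes`, and the helper names are all hypothetical choices, and a function is represented only by its coefficients in the basis $e_{1},e_{2},\dots$. With these choices, (2.20) holds with $L=1$, since the standard Gaussian density is bounded by $1/\sqrt{2\pi}$ and $\sqrt{\lambda_{1}}=1$.

```python
import math
import random


def sample_coefficients(num_modes, alpha, rng):
    """Draw the coefficients sqrt(lambda_j) * Z_j, j = 1..num_modes, of u in (2.19).

    Assumes sqrt(lambda_j) = j**(-alpha) and standard Gaussian Z_j.
    """
    return [j ** (-alpha) * rng.gauss(0.0, 1.0) for j in range(1, num_modes + 1)]


def squared_norm(coeffs):
    """Squared norm of u, computed from its coefficients in an orthonormal basis."""
    return sum(c * c for c in coeffs)


rng = random.Random(0)  # fixed seed for reproducibility
num_modes, alpha, num_samples = 50, 1.0, 2000

# E||u||^2 equals the sum of lambda_j, because E|Z_j|^2 = 1.
trace = sum(j ** (-2.0 * alpha) for j in range(1, num_modes + 1))
mean_sq_norm = sum(
    squared_norm(sample_coefficients(num_modes, alpha, rng))
    for _ in range(num_samples)
) / num_samples

# Largest value of the standard Gaussian density, relevant for the bound (2.20).
density_sup = 1.0 / math.sqrt(2.0 * math.pi)
print(trace, mean_sq_norm, density_sup)
```

The empirical mean of $\|u\|^{2}$ should be close to $\sum_{j}\lambda_{j}$, which is a quick sanity check of the normalization $\mathbb{E}|Z_{j}|^{2}=1$.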

Theorem 2.11.

Let $\mathcal{X}$ be a separable Hilbert space of input functions. Let $\mu\in\mathcal{P}(\mathcal{X})$ be a probability measure satisfying Assumption 2.10. Assume that the coefficients $\sqrt{\lambda_{j}}\gtrsim j^{-\alpha}$ as $j\to\infty$ , where $\alpha>0$ . Then there exists a constant $c>0$ , independent of $\epsilon$ , such that the following holds: If $\Phi:\mathcal{X}\times\Theta\to\mathbb{R}$ is a quantized neural operator architecture, satisfying

\sup_{\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{X})}\inf_{\theta\in\Theta}\|\mathcal{G}-\Phi({\,\cdot\,};\theta)\|_{L^{p}(\mu)}\leq\epsilon,

and if $|\Theta|\leq 2^{B}$ , i.e. if the parameters of $\Phi$ can be encoded by $B$ bits, then

B\geq\exp(c\epsilon^{-1/(\alpha+1)}).

Proof.

Similarly to the uniform case, the present claim again follows from the relation between the minimax code-length and the metric entropy of $\mathrm{Lip}_{1}(\mathcal{X})\subset L^{p}(\mu)$ of Proposition 2.7, together with the following general bound on $\mathcal{H}(\mathrm{Lip}_{1}(\mathcal{X}),\epsilon)_{L^{p}(\mu)}$ :

\mathcal{H}(\mathrm{Lip}_{1}(\mathcal{X});\epsilon)_{L^{p}(\mu)}\geq\exp(c% \epsilon^{-1/(\alpha+1)}).

This lower entropy bound will be derived in Section 3.3, Proposition 3.6. ∎

Thus, an exponential number of encoding bits is also needed in an $L^{p}(\mu)$ -setting. Theorem 2.11 shows that the approximation of Lipschitz operators in expectation is not “qualitatively” easier than uniform approximation of such operators over a compact set of input functions.

2.5. Approximation of generic Lipschitz operators

Theorems 2.8 and 2.11 show that operator learning architectures that can approximate arbitrary $1$ -Lipschitz operators to accuracy $\epsilon$ have exponential memory requirements; any (bit-encoded) implementation of such an architecture will require a number of bits that is exponential in $\epsilon^{-1}$ . The reason for this is that the space of Lipschitz operators is exponentially large in a fundamental information-theoretic sense quantified by the metric entropy.

However, this minimax bound applies to the approximation of the entire class $\mathrm{Lip}_{1}(\mathcal{D})$ by a single architecture, and does not necessarily imply that it is impossible to approximate individual $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{D})$ efficiently. At first sight, it could appear that arguments based on the metric entropy cannot be used to gain any insight into this refined question. Indeed, if we fix an individual $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{D})$ , then the metric entropy of the singleton-set $\mathsf{\bm{A}}=\{\mathcal{G}\}$ is trivially $=0$ , and the minimax code length (2.15) is $=1$ for any value of the accuracy $\epsilon$ , since the trivial decoder $\mathfrak{D}({\,\cdot\,})\equiv\mathcal{G}$ reproduces $\mathcal{G}$ exactly, with vanishing approximation error, $\epsilon=0$ . Thus, while entropy arguments give insights into the (concurrent) approximation of the set $\mathrm{Lip}_{1}(\mathcal{D})$ , they seemingly have no immediate implications for the approximation of individual $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{D})$ .

Despite these facts, the results below will show that a refined analysis based on the concept of metric entropy is nevertheless possible; in the uniform and $L^{p}$ -settings of the previous section, a fixed sequence of bit-encoded architectures $\{\Phi_{n}\}_{n\in\mathbb{N}}$ , with at most $n$ bits, can approximate generic elements $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{D})$ at best at a logarithmic rate, $\mathrm{Err}(\mathcal{G};\Phi_{n},\Theta_{n})\lesssim\log(n)^{-\gamma}$ for fixed $\gamma>0$ . Before stating our result, we briefly recall the notion of a generic element of a (compact) metric space (see Appendix A for further remarks, and [45, Chap. 8] for an in-depth discussion):

Definition 2.12 (Topologically generic properties).

Let $(\mathsf{\bm{A}},d)$ be a compact metric space. A subset $\mathsf{\bm{R}}\subset\mathsf{\bm{A}}$ is called residual, if it is equal to a countable intersection of sets, each of whose interior is dense in $\mathsf{\bm{A}}$ . The complement of a residual set is a meagre set. A property $P$ is called generic, if the set

\mathsf{\bm{R}}:={\left\{\mathcal{G}\in\mathsf{\bm{A}}\,\middle|\,\mathcal{G}% \text{ satisfies }P\right\}}\subset\mathsf{\bm{A}},

is residual.

Under the assumption that $(\mathsf{\bm{A}},d)$ is compact, the Baire category theorem (cp. Appendix A) implies that any residual set $\mathsf{\bm{R}}$ is dense in $\mathsf{\bm{A}}$ . Furthermore, the intersection $\mathsf{\bm{R}}=\bigcap_{j=1}^{\infty}\mathsf{\bm{R}}_{j}$ of countably many residual sets $\mathsf{\bm{R}}_{1},\mathsf{\bm{R}}_{2},\dots$ is itself residual, and hence still dense. In this sense, a topologically generic property is somewhat analogous to a property that holds with probability $1$ in a probabilistic sense. Thus, a generic property is often thought of as a property that is satisfied by “almost every” element of $\mathsf{\bm{A}}$ .

We can now state our main results on the approximation of generic operators $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{D})$ . In the uniform setting (cp. Setting 2.2), we have:

Proposition 2.13 (Uniform approximation of generic operators).

Let $\mathcal{X}$ be a Banach space of input functions. Let $\mathcal{K}\subset\mathcal{X}$ be compact, and assume that the metric entropy $\mathcal{H}(\mathcal{K};\epsilon)_{\mathcal{X}}\gtrsim\epsilon^{-1/\alpha}$ for $\alpha>0$ . Let $\{\Phi_{n}:\mathcal{K}\times\Theta_{n}\to\mathbb{R}\}_{n\in\mathbb{N}}$ be a sequence of bit-encoded neural operator architectures, with quantized parameter set $|\Theta_{n}|\leq 2^{n}$ . Then generic $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{K})$ cannot be approximated by $\{\Phi_{n}\}$ at a convergence rate better than $\log(n)^{-\alpha}$ ; more precisely, for any sequence $\epsilon_{n}=o(\log(n)^{-\alpha})$ , there is a residual subset $\mathsf{\bm{R}}\subset\mathrm{Lip}_{1}(\mathcal{K})$ , consisting of operators $\mathcal{G}\in\mathsf{\bm{R}}$ , for which

\inf_{\theta\in\Theta_{n}}\|\mathcal{G}-\Phi_{n}({\,\cdot\,};\theta)\|_{C(% \mathcal{K})}\not=O(\epsilon_{n}),\quad(n\to\infty).

Proof.

We let $\mathsf{\bm{V}}:=C(\mathcal{K})$ and $\mathsf{\bm{A}}:=\mathrm{Lip}_{1}(\mathcal{K})$ . We note that $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ is a compact, convex subset. We then consider the sequence of subsets $\Sigma_{n}\subset C(\mathcal{K})$ , defined by all possible realizations,

\Sigma_{n}:={\left\{\Phi_{n}({\,\cdot\,};\theta)\,\middle|\,\theta\in\Theta_{n% }\right\}}.

By assumption, $|\Sigma_{n}|=|\Theta_{n}|\leq 2^{n}$ . By Proposition 3.2, to be proved in Section 3.2, we have $\mathcal{H}(\mathsf{\bm{A}},\epsilon)_{\mathsf{\bm{V}}}\geq\exp(c\epsilon^{-1/\alpha})$ . The claim of Proposition 2.13 then follows, as a special case, from the abstract result of Proposition 4.2 to be derived in Section 4. ∎

A similar result holds for approximation of Lipschitz operators in an $L^{p}(\mu)$ sense, as shown in the following proposition (cp. Setting 2.4):

Proposition 2.14 (Approximation of generic operators in expectation).

Let $\mathcal{X}$ be a separable Hilbert space of input functions. Let $\mu\in\mathcal{P}(\mathcal{X})$ be a probability measure satisfying Assumption 2.10. Assume that the coefficients $\lambda_{j}\gtrsim j^{-2\alpha}$ as $j\to\infty$ , where $\alpha>0$ . Let $\{\Phi_{n}:\mathcal{X}\times\Theta_{n}\to\mathbb{R}\}_{n\in\mathbb{N}}$ be a sequence of bit-encoded neural operator architectures, with quantized parameter set $|\Theta_{n}|\leq 2^{n}$ . Then generic $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{X})$ cannot be approximated by $\{\Phi_{n}\}$ at a convergence rate better than $\log(n)^{-(\alpha+1)}$ ; more precisely, for any sequence $\epsilon_{n}=o(\log(n)^{-(\alpha+1)})$ , there is a residual subset $\mathsf{\bm{R}}\subset\mathrm{Lip}_{1}(\mathcal{X})$ , such that for any $\mathcal{G}\in\mathsf{\bm{R}}$ ,

\inf_{\theta\in\Theta_{n}}\|\mathcal{G}-\Phi_{n}({\,\cdot\,};\theta)\|_{L^{p}(% \mu)}\not=O(\epsilon_{n}),\quad(n\to\infty).

Proof.

We let $\mathsf{\bm{V}}:=L^{p}(\mu)$ and $\mathsf{\bm{A}}:=\mathrm{Lip}_{1}(\mathcal{X})$ . We note that $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ is a compact, convex subset. We consider the subsets $\Sigma_{n}\subset\mathsf{\bm{V}}$ , defined by all possible realizations,

\Sigma_{n}:={\left\{\Phi_{n}({\,\cdot\,};\theta)\,\middle|\,\theta\in\Theta_{n% }\right\}}.

By assumption, $|\Sigma_{n}|=|\Theta_{n}|\leq 2^{n}$ . By Proposition 3.6, to be proved in Section 3.3, we have $\mathcal{H}(\mathsf{\bm{A}},\epsilon)_{\mathsf{\bm{V}}}\geq\exp(c\epsilon^{-1/% (\alpha+1)})$ . The claim of Proposition 2.14 then follows, as a special case, from the abstract result of Proposition 4.2 to be derived in Section 4. ∎

Remark 2.15.

The notion of a residual subset $\mathsf{\bm{R}}\subset\mathrm{Lip}_{1}(\mathcal{D})$ in Propositions 2.13 and 2.14 is to be understood with respect to the subspace topology on $\mathrm{Lip}_{1}(\mathcal{D})$ , induced by the $C(\mathcal{K})$ and $L^{p}(\mu)$ -norms, respectively.

2.6. Approximation of generic Lipschitz operators by FNO

The results of the previous section are formulated abstractly for an unspecified sequence of quantized neural operator architectures $\{\Phi_{n}\}$ . To conclude the discussion of our main results, we illustrate some implications of these results for a concrete operator learning framework, the Fourier neural operator [39].

We note that although the derivation of these results will rely on Propositions 2.13 and 2.14, the ultimate statement of the theorems will be in terms of the number of tunable real-valued parameters of FNO, without bit-encoding. Thus, the gap between the bit-encoded parameters and real-valued parameters point of view can be bridged in this case.

In preparation to stating these theorems for FNO, we briefly describe a specific setting to which FNO is applicable, and recall the FNO architecture. This is followed by the statement of a novel theorem establishing a curse of (exponential) parametric complexity for the FNO, in the uniform approximation setting.

FNO case study

As a case study, we consider Fourier neural operators (FNO), approximating a relevant class of $1$ -Lipschitz operators,

\mathcal{G}:\mathcal{K}\subset L^{2}(D;\mathbb{R}^{d_{\mathrm{in}}})\to\mathbb% {R},

mapping square-integrable input functions to the reals (or equivalently, to a space of constant-valued functions). Here $\mathcal{K}$ is a compact subset of $L^{2}(D;\mathbb{R}^{d_{\mathrm{in}}})$ , consisting of square-integrable functions $u:D\to\mathbb{R}^{d_{\mathrm{in}}}$ . We wish to approximate such a $1$ -Lipschitz operator $\mathcal{G}$ uniformly over the compact set $\mathcal{K}$ .

In the following, we will usually write $L^{2}(D)$ instead of $L^{2}(D;\mathbb{R}^{d_{\mathrm{in}}})$ , where for simplicity and due to certain restrictions of the FNO architecture, the underlying domain $D=\mathbb{T}^{d}$ is taken to be the $1$ -periodic torus $\mathbb{T}^{d}\simeq[0,1]^{d}$ in $d$ spatial dimensions, where in typical applications, $d\in\{1,2,3\}$ . Prototypical examples of relevant $\mathcal{K}$ are $\mathcal{K}=\mathcal{U}(H^{s}(\mathbb{T}^{d}))$ , where

\mathcal{U}(H^{s}(\mathbb{T}^{d}))={\left\{u\in H^{s}(\mathbb{T}^{d})\,\middle% |\,\|u\|_{H^{s}}\leq 1\right\}},

denotes the unit ball in the Sobolev space $H^{s}(\mathbb{T}^{d})$ with smoothness $s>0$ . The question to be addressed is how many tunable parameters $q$ are needed to approximate generic $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{K})_{L^{2}(D)}$ to a prescribed accuracy $\epsilon$ ?

FNO architecture

We here recall the general notion of Fourier neural operators [39]. Let $\mathcal{X}=\mathcal{X}(D;\mathbb{R}^{d_{\mathrm{in}}})$ and $\mathcal{Y}=\mathcal{Y}(D;\mathbb{R}^{d_{\mathrm{out}}})$ be two Banach function spaces, consisting of functions $u:D\to\mathbb{R}^{d_{\mathrm{in}}}$ and $w:D\to\mathbb{R}^{d_{\mathrm{out}}}$ , respectively. A Fourier neural operator (FNO) defines a nonlinear operator

\Phi_{\mathrm{FNO}}:\mathcal{X}(D;\mathbb{R}^{d_{\mathrm{in}}})\to\mathcal{Y}(% D;\mathbb{R}^{d_{\mathrm{out}}}),

mapping between these spaces. By definition of the FNO architecture, such $\Phi_{\mathrm{FNO}}$ takes the form

(2.21)

\displaystyle\Phi_{\mathrm{FNO}}(u;\theta)=Q\circ\mathcal{L}_{L}\circ\dots\circ\mathcal{L}_{1}\circ P(u),

where $P:\mathcal{X}\to\mathcal{V}$ , $u(x)\mapsto Pu(x)$ is a linear lifting layer, $Q:\mathcal{V}\to\mathcal{Y}$ , $v(x)\mapsto Qv(x)$ is a linear projection layer, and the $\mathcal{L}_{\ell}:\mathcal{V}(D;\mathbb{R}^{{d_{c}}})\to\mathcal{V}(D;\mathbb{R}^{{d_{c}}})$ are the hidden layers, mapping between hidden states $v\mapsto\mathcal{L}_{\ell}(v)\in\mathcal{V}(D;\mathbb{R}^{{d_{c}}})$ . The hidden states are vector-valued functions with ${d_{c}}$ components, $v:D\to\mathbb{R}^{{d_{c}}}$ , belonging to a Banach function space $\mathcal{V}(D;\mathbb{R}^{d_{c}})$ . Here, the “channel width” ${d_{c}}$ is a hyperparameter of the architecture. Each hidden layer $\mathcal{L}_{\ell}$ is of the form

\mathcal{L}_{\ell}(v)(x):=\sigma\big{(}Wv(x)+Kv(x)+b\big{)}

where $W\in\mathbb{R}^{{d_{c}}\times{d_{c}}}$ is a matrix multiplying $v(x)$ pointwise, and $b\in\mathbb{R}^{d_{c}}$ is a bias. $K$ is a non-local operator of the form

v(x)\mapsto(Kv)(x):=\mathcal{F}^{-1}\big{(}\widehat{P}_{k}\mathcal{F}v(k)\big{% )}(x),

with $\mathcal{F}$ (and $\mathcal{F}^{-1}$ ) the Fourier transform (and its inverse). The matrix $\widehat{P}_{k}\in\mathbb{C}^{{d_{c}}\times{d_{c}}}$ is a tunable Fourier multiplier indexed by $k\in\mathbb{Z}^{d}$ . It is assumed that $\widehat{P}_{k}\equiv 0$ for $|k|_{\ell^{\infty}}\geq\kappa$ , i.e. for wavenumbers $k$ above a specified Fourier cut-off parameter $\kappa$ . This Fourier cut-off $\kappa$ is a second hyperparameter of the FNO architecture. We collect the values for different $k\in\mathbb{Z}^{d}$ , $|k|_{\ell^{\infty}}<\kappa$ , in a tensor $\widehat{P}=\{\widehat{P}_{k}\}_{|k|_{\ell^{\infty}}<\kappa}\in\mathbb{C}^{(2% \kappa-1)^{d}\times{d_{c}}\times{d_{c}}}$ , which acts on the Fourier coefficients $\widehat{v}(k)=\mathcal{F}(v)(k)$ , by

(\widehat{P}\widehat{v})(k)_{i}:=\sum_{j=1}^{d_{c}}\widehat{P}_{k,ij}\widehat{v}_{j}(k),\quad(k\in\mathbb{Z}^{d},\;|k|_{\ell^{\infty}}<\kappa).
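The action of a single hidden layer can be made concrete by a small one-dimensional sketch on a uniform periodic grid. This is an illustration under stated assumptions, not the reference implementation of FNO: it uses a naive discrete Fourier transform instead of an FFT, normalizes the forward transform by $1/N$, keeps only the real part of the inverse transform, and the names `fourier_multiplier` and `fno_layer` as well as the choice $\sigma=\tanh$ are hypothetical.

```python
import cmath
import math


def fourier_multiplier(v, P_hat, kappa):
    """Apply (Kv)(x) = F^{-1}(P_hat_k F v(k))(x) on a uniform 1D periodic grid.

    v:     list of d_c channels, each a list of N real samples v_j(x_n)
    P_hat: dict mapping each wavenumber k with |k| < kappa to a d_c x d_c matrix
    """
    d_c, N = len(v), len(v[0])
    out = [[0.0] * N for _ in range(d_c)]
    for k in range(-kappa + 1, kappa):
        # Fourier coefficient of every channel at wavenumber k (naive DFT).
        v_hat = [
            sum(v[j][n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)) / N
            for j in range(d_c)
        ]
        # Channel mixing: (P_hat v_hat)(k)_i = sum_j P_hat[k][i][j] * v_hat_j(k).
        w_hat = [sum(P_hat[k][i][j] * v_hat[j] for j in range(d_c)) for i in range(d_c)]
        # Inverse transform restricted to the retained modes (real part only).
        for i in range(d_c):
            for n in range(N):
                out[i][n] += (w_hat[i] * cmath.exp(2j * math.pi * k * n / N)).real
    return out


def fno_layer(v, W, P_hat, b, kappa, sigma=math.tanh):
    """One hidden layer: sigma(W v(x) + (Kv)(x) + b), evaluated at every grid point."""
    d_c, N = len(v), len(v[0])
    Kv = fourier_multiplier(v, P_hat, kappa)
    return [
        [sigma(sum(W[i][j] * v[j][n] for j in range(d_c)) + Kv[i][n] + b[i])
         for n in range(N)]
        for i in range(d_c)
    ]


# Usage: one channel, identity multiplier on the modes |k| < 2.
N = 8
v = [[math.cos(2.0 * math.pi * n / N) for n in range(N)]]
P_id = {k: [[1.0]] for k in (-1, 0, 1)}
Kv = fourier_multiplier(v, P_id, kappa=2)  # reproduces v, since v is band-limited
out = fno_layer(v, W=[[0.0]], P_hat=P_id, b=[0.0], kappa=2)
```

With the identity multiplier, $K$ acts as a projection onto the retained Fourier modes, so the band-limited input $\cos(2\pi x)$ is reproduced up to rounding.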

The resulting FNO architecture depends on the channel width ${d_{c}}$ , Fourier cut-off parameter $\kappa$ and depth $L$ . We collect all tunable parameters in a vector $\theta\in\mathbb{R}^{q}$ . Any parameter $\theta\in\mathbb{R}^{q}$ can be decomposed layer-wise, as

\theta=(\theta_{L+1},\theta_{L},\dots,\theta_{1},\theta_{0}),

where

\theta_{\ell}={\left\{W^{(\ell)}_{ij},\widehat{P}_{k,ij}^{(\ell)},\widehat{b}_% {k}^{(\ell)}\,\middle|\,i,j=1,\dots,{d_{c}},\,|k|<\kappa,\,k\in\mathbb{Z}^{d}% \right\}},

collects the parameters of the $\ell$ -th hidden layer, for $1\leq\ell\leq L$ . We denote by $\theta_{0}={\left\{P_{ij}\,\middle|\,i,j=1,\dots,{d_{c}}\right\}}$ the parameters of the lifting $P$ and by $\theta_{L+1}={\left\{Q_{ij}\,\middle|\,i,j=1,\dots,{d_{c}}\right\}}$ the parameters of the projection $Q$ . Assuming that $d_{\mathrm{in}},d_{\mathrm{out}}\leq d_{c}$ , the dimension of $\theta\in\mathbb{R}^{q}$ satisfies,

(2.22)

\displaystyle q={d_{c}}d_{\mathrm{in}}+L({d_{c}}^{2}+(2\kappa)^{d}{d_{c}}^{2}+% {d_{c}})+{d_{c}}d_{\mathrm{out}}\leq 5(2\kappa)^{d}L{d_{c}}^{2}\leq 5q.

Consistent with practical implementations, it is generally assumed that the hidden channel dimension of the FNO is at least as large as both the input and output dimensions $d_{\mathrm{in}},d_{\mathrm{out}}$ . We include a list of hyperparameters in Table 3 to help clarify notation.
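The parameter count (2.22) and its two-sided bound can be checked directly. The following sketch evaluates $q$ term by term; the function name `fno_parameter_count` and the hyperparameter values in the usage example are hypothetical and chosen only for illustration.

```python
def fno_parameter_count(d_in, d_out, d_c, kappa, depth, dim):
    """Evaluate q from (2.22) for channel width d_c, cut-off kappa and depth L."""
    modes = (2 * kappa) ** dim                           # (2 kappa)^d Fourier modes
    lifting = d_c * d_in                                 # parameters of P
    hidden = depth * (d_c**2 + modes * d_c**2 + d_c)     # W, P_hat and b per layer
    projection = d_c * d_out                             # parameters of Q
    return lifting + hidden + projection


# Hypothetical 2D configuration with d_in, d_out <= d_c, as assumed in the text.
d_in, d_out, d_c, kappa, depth, dim = 1, 1, 32, 12, 4, 2
q = fno_parameter_count(d_in, d_out, d_c, kappa, depth, dim)
upper = 5 * (2 * kappa) ** dim * depth * d_c**2
print(q, upper, 5 * q)
```

Each of the five summands in (2.22) is bounded by $(2\kappa)^{d}L{d_{c}}^{2}$ when $d_{\mathrm{in}},d_{\mathrm{out}}\leq d_{c}$, and this quantity is itself one of the contributions to $q$, which explains both inequalities.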

Remark 2.16.

Since we are interested in a restricted class of operators $\mathcal{G}:L^{2}(D)\to\mathbb{R}$ , with real-valued outputs, we will replace the general output layer $\mathcal{Q}:\mathcal{V}(D;\mathbb{R}^{d_{c}})\to\mathcal{Y}(D;\mathbb{R}^{d_{% \mathrm{out}}})$ by a spatially averaged, real-valued version $\widetilde{\mathcal{Q}}:\mathcal{V}(D;\mathbb{R}^{d_{c}})\to\mathbb{R}$ ,

\widetilde{\mathcal{Q}}v:=\fint_{D}\mathcal{Q}v(x)\,dx.

This does not affect the parameter-count, while ensuring real-valued outputs. We will refer to this as an output-averaged FNO.

In passing and in connection with the last remark, we mention relevant work considering variants of FNO for finite-dimensional input and/or output spaces [24], where similar alterations to the original FNO architecture have been studied in greater detail.

Symbol	Meaning
${d_{c}}$	channel width
$\kappa$	Fourier cut-off
$L$	depth
$q$	total number of parameters
$M$	parameter bound, $\|\theta\|_{\ell^{\infty}}\leq M$

Table 3. Summary of (hyper-)parameters of the FNO architecture. This notation is used throughout this subsection.

Generic curse of parametric complexity for FNO

Our main theorem will be based on Proposition 2.13, and establishes a generic curse of parametric complexity for FNO. In contrast to the aforementioned proposition, this theorem holds at the level of continuous real-valued parameters $\theta\in\mathbb{R}^{q}$ , without requiring specification of a bit-encoding. Instead, we assume a mild bound on the parameters $\theta\in\mathbb{R}^{q}$ . We note that similar assumptions have been considered in the recent work [28], to define relevant approximation spaces of FNO. To this end, we make the following definition:

Definition 2.17.

Given an operator $\mathcal{G}:L^{2}(D)\to\mathbb{R}$ and $\gamma>0$ , we will say that $\mathcal{G}$ can be approximated by FNO at a logarithmic rate $\gamma$ , if there exists a sequence $\{\Phi_{q}\}_{q\in\mathbb{N}}$ of output-averaged FNO architectures $\Phi_{q}:L^{2}(D)\times\mathbb{R}^{q}\to\mathbb{R}$ with at most $q$ tunable parameters, and a sequence of parameters $\theta_{q}\in\mathbb{R}^{q}$ , satisfying the bound

\|\theta_{q}\|_{\ell^{\infty}}\leq\exp(q),

and

\|\mathcal{G}-\Phi_{q}({\,\cdot\,};\theta_{q})\|_{C(\mathcal{K})}=O(\log(q)^{-% \gamma}),\quad(q\to\infty).

Remark 2.18.

The specific upper bound on the weights, $\|\theta_{q}\|_{\ell^{\infty}}\leq\exp(q)$ , is here chosen for simplicity. For the following discussion, it could readily be replaced by a more general upper bound, $\|\theta_{q}\|_{\ell^{\infty}}\leq c_{1}\exp(c_{2}q^{c_{3}})$ for fixed constants $c_{1}$ , $c_{2}$ , $c_{3}$ , without affecting the main conclusions.

We can now state our main result for FNO:

Theorem 2.19.

Let $\mathcal{K}\subset L^{2}(D)$ be compact. Assume that the metric entropy of $\mathcal{K}$ satisfies an algebraic lower bound, $\mathcal{H}(\mathcal{K};\epsilon)_{L^{2}(D)}\gtrsim\epsilon^{-1/\alpha}$ for some $\alpha>0$ . Consider FNO with a fixed Lipschitz continuous activation function $\sigma$ . Then generic $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{K})$ cannot be approximated by FNO at a logarithmic rate $\gamma$ , for any $\gamma>\alpha$ .

Thus, loosely speaking and under mild growth assumptions on the weights, the approximation of generic $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{K})$ to accuracy $\epsilon>0$ , requires an FNO architecture with exponentially many tunable parameters in $\epsilon^{-1}$ .

The following corollary is obtained by taking $\mathcal{K}=\mathcal{U}(H^{s}(\mathbb{T}^{d}))$ as the unit ball in a Sobolev space $H^{s}(\mathbb{T}^{d})$ for $s>0$ , and with $\mathbb{T}^{d}$ the $d$ -dimensional periodic torus:

Corollary 2.20.

Let $s>0$ , and denote $\mathcal{K}=\mathcal{U}(H^{s}(\mathbb{T}^{d}))$ . Then generic $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{K})$ cannot be approximated by FNO at logarithmic rate $\gamma$ , for any $\gamma>s/d$ .

Proof of Theorem 2.19.

Fix $\gamma>\alpha$ . We wish to show that generic $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{K})$ cannot be approximated at logarithmic rate $\gamma$ . Proof of this claim will make use of the following lemma:

Lemma 2.21 (FNO quantization lemma).

Fix a Lipschitz continuous activation function $\sigma$ . Let $\gamma>0$ . For any $q\in\mathbb{N}$ , there exists a quantized neural operator $\widetilde{\Phi}_{n_{q}}:L^{2}(D)\times\{0,1\}^{n_{q}}\to\mathbb{R}$ with $2^{n_{q}}$ quantized parameter values, where $n_{q}\asymp q^{m}$ , $m=d+6$ , such that for any output-averaged FNO $\Phi_{q}$ with activation $\sigma$ and at most $q$ tunable parameters, we have

\sup_{\theta\in{[-M_{q},M_{q}]}^{q}}\inf_{[\theta]\in\{0,1\}^{n_{q}}}\|\Phi_{q% }({\,\cdot\,};\theta)-\widetilde{\Phi}_{n_{q}}({\,\cdot\,};[\theta])\|_{C(% \mathcal{K})}\leq\log(q)^{-\gamma}.

where $M_{q}:=\exp(q)$ .

Brief sketch of proof.

The detailed proof of this lemma is included in Appendix B; in short, the proof relies on two observations: (i) all possible FNO architectures with at most $q$ parameters can be encapsulated by a “super” FNO-architecture $\widehat{\Phi}({\,\cdot\,};\theta)$ with a number of parameters that is bounded algebraically in $q$ for fixed algebraic exponent, and (ii) quantization of this super-architecture with an algebraically bounded number of bits is possible, since the map** $\theta\mapsto\widehat{\Phi}({\,\cdot\,};\theta)$ has at least a weak form of stability (Lipschitz continuity) over the relevant range of parameters $\theta$ , and a Lipschitz constant that grows at a sufficiently slow rate as a function of $q$ . ∎

By Lemma 2.21, there exists $m\in\mathbb{N}$ , a sequence $n_{q}\asymp q^{m}$ , and a sequence of quantized neural operators, $\widetilde{\Phi}_{n_{q}}:L^{2}(D)\times\{0,1\}^{n_{q}}\to\mathbb{R}$ , such that

\sup_{\theta\in[-M_{q},M_{q}]^{q}}\inf_{[\theta]\in\{0,1\}^{n_{q}}}\|\Phi_{q}(% {\,\cdot\,};\theta)-\widetilde{\Phi}_{n_{q}}({\,\cdot\,};[\theta])\|_{C(% \mathcal{K})}\leq\log(q)^{-\gamma}.

Associated with this subsequence $n_{q}\to\infty$ , we now define an (abstract) sequence of bit-encoded neural operators for arbitrary $n\in\mathbb{N}$ ; specifically, we define $\widetilde{\Phi}_{n}({\,\cdot\,};{\,\cdot\,}):L^{2}(D)\times\{0,1\}^{n}\to\mathbb{R}$ , by

\widetilde{\Phi}_{n}({\,\cdot\,};[\theta]_{n}):=\widetilde{\Phi}_{n_{q}}({\,% \cdot\,};[\theta]_{n_{q}}),\quad[\theta]_{n}\in\{0,1\}^{n},

where $n_{q}$ is chosen maximal such that $n_{q}\leq n$ , and $[\theta]_{n_{q}}$ are the first $n_{q}\leq n$ bits of $[\theta]_{n}$ (the values of the remaining bits are simply ignored). We note that since $n_{q}\asymp q^{m}$ , we have $\log(q)^{-\gamma}\asymp\log(n_{q})^{-\gamma}$ . Furthermore, for arbitrary fixed operator $\mathcal{G}$ , we note that the decay

\displaystyle\inf_{[\theta]\in\{0,1\}^{n_{q}}}\|\mathcal{G}-\widetilde{\Phi}_{% n_{q}}({\,\cdot\,};[\theta])\|_{C(\mathcal{K})}\lesssim\log(n_{q})^{-\gamma},

along the specified subsequence $n_{q}\asymp q^{m}$ also implies the error decay

(2.23)

\displaystyle\inf_{[\theta]\in\{0,1\}^{n}}\|\mathcal{G}-\widetilde{\Phi}_{n}({% \,\cdot\,};[\theta])\|_{C(\mathcal{K})}\lesssim\log(n)^{-\gamma},

along the full sequence $n\in\mathbb{N}$ , as $n\to\infty$ . This is immediate from the definition of $\widetilde{\Phi}_{n}$ and the fact that $n_{q}\leq n<n_{q+1}$ does not leave exponential gaps between subsequent $n_{q}$ , since $1\leq n_{q+1}/n_{q}\asymp(q+1)^{m}/q^{m}=O(1)$ ; in particular, this implies that $\log(n_{q})\sim\log(n_{q+1})\sim\log(n)$ .
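The extension from the subsequence $n_{q}$ to arbitrary $n$ can be illustrated numerically. The sketch below assumes, for simplicity, that $n_{q}=q^{m}$ exactly (the lemma only gives $n_{q}\asymp q^{m}$) and takes $m=7$, corresponding to $m=d+6$ with $d=1$; the helper name `largest_admissible_q` is hypothetical. It selects the maximal $q$ with $n_{q}\leq n$ and compares $\log(n_{q})$ with $\log(n)$.

```python
import math


def largest_admissible_q(n, m):
    """Largest integer q >= 1 with q**m <= n (assumes n >= 1 and n_q = q**m)."""
    q = 1
    while (q + 1) ** m <= n:
        q += 1
    return q


m = 7  # m = d + 6 with d = 1
for n in (10**3, 10**6, 10**9):
    q = largest_admissible_q(n, m)
    n_q = q**m
    # Only the first n_q of the n available bits are used; the rest are ignored.
    ratio = math.log(n_q) / math.log(n)
    print(f"n = {n:>10}  q = {q:>2}  n_q = {n_q:>10}  log(n_q)/log(n) = {ratio:.3f}")
```

Since $n_{q+1}/n_{q}$ stays bounded, the ratio $\log(n_{q})/\log(n)$ tends to $1$, so a logarithmic rate along $n_{q}$ carries over to the full sequence.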

By Proposition 2.13, the set of operators $\mathsf{\bm{M}}\subset\mathrm{Lip}_{1}(\mathcal{K})$ which can be approximated by such a sequence $\{\widetilde{\Phi}_{n}\}$ , at logarithmic rate $\gamma$ , is meagre (its complement is residual). To conclude the argument, it therefore suffices to show that if $\mathcal{G}$ can be approximated by FNO at logarithmic rate $\gamma$ , then $\mathcal{G}\in\mathsf{\bm{M}}$ . This then implies that the set of operators that can be approximated by FNO at logarithmic rate $\gamma$ is a subset of $\mathsf{\bm{M}}$ , and hence is itself meagre.

To this end, assume that $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{K})$ is approximated by FNO at logarithmic rate $\gamma$ . By definition, there exists a sequence of FNOs, $\Phi_{q}:L^{2}(D)\times\mathbb{R}^{q}\to\mathbb{R}$ , such that,

\inf_{\theta\in[-M_{q},M_{q}]^{q}}\|\mathcal{G}-\Phi_{q}({\,\cdot\,};\theta)\|% _{C(\mathcal{K})}=O(\log(q)^{-\gamma}).

By the triangle inequality,

\begin{aligned}
\inf_{[\theta]\in\{0,1\}^{n_{q}}}\|\mathcal{G}-\widetilde{\Phi}_{n_{q}}({\,\cdot\,};[\theta])\|_{C(\mathcal{K})}
&\leq\|\mathcal{G}-\Phi_{q}({\,\cdot\,};\theta_{q})\|_{C(\mathcal{K})}+\inf_{[\theta]\in\{0,1\}^{n_{q}}}\|\Phi_{q}({\,\cdot\,};\theta_{q})-\widetilde{\Phi}_{n_{q}}({\,\cdot\,};[\theta])\|_{C(\mathcal{K})}\\
&\leq O(\log(q)^{-\gamma})+O(\log(n_{q})^{-\gamma})=O(\log(n_{q})^{-\gamma}),
\end{aligned}

along the specified sequence $n_{q}\to\infty$ . By (2.23), this implies that

\inf_{[\theta]\in\{0,1\}^{n}}\|\mathcal{G}-\widetilde{\Phi}_{n}({\,\cdot\,};[\theta])\|_{C(\mathcal{K})}=O(\log(n)^{-\gamma}),

along the entire sequence $n\to\infty$ , and hence $\mathcal{G}\in\mathsf{\bm{M}}$ , i.e. $\mathcal{G}$ belongs to the meagre set of operators which can be approximated by the sequence $\{\widetilde{\Phi}_{n}\}$ at logarithmic rate $\gamma$ .

We have shown that any operator $\mathcal{G}$ that is approximated by FNO at logarithmic rate $\gamma$ belongs to the meagre set $\mathsf{\bm{M}}$ . Hence, the set of operators that is approximated by FNO at logarithmic rate $\gamma$ is itself meagre, and its complement $\mathsf{\bm{R}}=\mathrm{Lip}_{1}(\mathcal{K})\setminus\mathsf{\bm{M}}$ is residual. We conclude that generic operators $\mathcal{G}\in\mathrm{Lip}_{1}(\mathcal{K})$ , belonging to $\mathsf{\bm{R}}$ , cannot be approximated at logarithmic rate $\gamma>\alpha$ . ∎

3. The metric entropy of Lipschitz operators

In the present section, we provide lower bounds on the metric entropy of Lipschitz operators in two general settings; the first pertains to the sup-norm over a compact set of inputs, the second is of relevance to the approximation with respect to the Bochner $L^{p}$ -norm with respect to a probability measure on the input space. After briefly recalling the relation between covering and packing numbers, we proceed to consider the sup-norm setting in Section 3.2 and the $L^{p}$ -setting in Section 3.3.

3.1. Entropy, covering and packing

We recall from Definition 2.6 that the metric entropy $\mathcal{H}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}$ of a subset $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ is defined by $\mathcal{H}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}=\log_{2}\mathcal{N}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}$ ; here, $\mathcal{N}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}$ denotes the covering number of $\mathsf{\bm{A}}$ , which is defined as the smallest number of closed balls of radius $\epsilon$ needed to cover $\mathsf{\bm{A}}$ . We also recall the closely related notion of a packing number:

Definition 3.1 (Packing number).

Let $(\mathsf{\bm{V}},d)$ be a metric space. The packing number of a subset $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ , denoted $\mathcal{M}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}$ , is the largest integer $M\in\mathbb{N}$ for which there exist elements $u_{1},\dots,u_{M}\in\mathsf{\bm{A}}$ , with pairwise distance $d(u_{j},u_{k})\geq\epsilon$ , for all distinct $j,k\in\{1,\dots,M\}$ .

With our definitions, the following inequalities between covering and packing numbers are elementary: For any subset $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ , we have

(3.1)

\displaystyle\mathcal{M}(\mathsf{\bm{A}};3\epsilon)_{\mathsf{\bm{V}}}\leq% \mathcal{N}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}\leq\mathcal{M}(\mathsf% {\bm{A}};\epsilon)_{\mathsf{\bm{V}}}.

We mention that, if the covering number is defined by open balls, the factor $3$ in the first term could have been replaced by $2$ . With our closed definition, any factor $>2$ would do – we here choose $3$ for simplicity.
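The second inequality in (3.1) rests on the observation that a maximal $\epsilon$-separated set is automatically an $\epsilon$-cover: a point at distance $>\epsilon$ from all selected centers could otherwise be added to the packing. The following sketch checks this on a finite metric space; the point set (integers $0,\dots,100$ with the absolute-value distance), the value of $\epsilon$ and the helper names are illustrative assumptions. Note that the greedy procedure produces a maximal packing, not necessarily one of maximum cardinality.

```python
def greedy_maximal_packing(points, eps, dist):
    """Greedily select points with pairwise distance >= eps until none can be added."""
    centers = []
    for p in points:
        if all(dist(p, c) >= eps for c in centers):
            centers.append(p)
    return centers


def is_cover(points, centers, eps, dist):
    """Check that closed eps-balls around the centers cover all points."""
    return all(any(dist(p, c) <= eps for c in centers) for p in points)


def dist(a, b):
    return abs(a - b)


points = list(range(101))  # finite stand-in for the set A
eps = 7
centers = greedy_maximal_packing(points, eps, dist)
covered = is_cover(points, centers, eps, dist)
print(len(centers), covered)
```

Because the selected centers form a cover by closed $\epsilon$-balls, the covering number is at most the size of this packing, which in turn is at most the packing number.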

3.2. Uniform approximation

We are here interested in the uniform setting (Setting 2.2), i.e. the uniform approximation of a (real-valued) mapping $\mathcal{G}:\mathcal{K}\to\mathbb{R}$ over a compact domain $\mathcal{K}\subset\mathcal{X}$ .

As pointed out before, given the link between minimax code-length and metric entropy, we are interested in estimating the metric entropy of $\mathrm{Lip}_{1}(\mathcal{K})$ for $\mathcal{K}$ a compact metric space. The following proposition relates the metric entropy of $\mathrm{Lip}_{1}(\mathcal{K})\subset\mathsf{\bm{V}}$ to that of $\mathcal{K}$ , when $\mathsf{\bm{V}}=C(\mathcal{K})$ is metrized by the sup-norm:

Proposition 3.2.

Let $(\mathcal{K},d)$ be a metric space. Let $\epsilon\in(0,1/3]$ . The metric entropy of $\mathrm{Lip}_{1}(\mathcal{K})\subset C(\mathcal{K})$ is lower bounded by

(3.2)

\displaystyle\mathcal{H}(\mathrm{Lip}_{1}(\mathcal{K}),\epsilon)_{C(\mathcal{K% })}\geq 2^{\mathcal{H}(\mathcal{K};6\epsilon)_{\mathcal{X}}}.

Proposition 3.2 shows that the space of $1$ -Lipschitz functions on a compact metric space has exponentially larger entropy than the underlying space.

Proof.

Let $\epsilon\in(0,1/3]$ be given. Let $N=\mathcal{N}(\mathcal{K};6\epsilon)_{\mathcal{X}}$ . Since the covering number lower bounds the packing number (cf. (3.1)), there exist $N$ elements $u_{1},\dots,u_{N}\in\mathcal{K}$ , with pairwise distance $\geq 6\epsilon$ . Let

\psi_{j}(u):=\max(3\epsilon-d(u,u_{j}),0),\quad j=1,\dots,N,

denote “hat” functions centered at $u_{j}$ , and non-vanishing only on $B_{3\epsilon}(u_{j})\subset\mathcal{K}$ . We note that each $\psi_{j}$ is $1$ -Lipschitz, satisfies $\|\psi_{j}\|_{C(\mathcal{K})}=3\epsilon$ , and the supports of $\psi_{j}$ are essentially disjoint.

We now consider the set of Lipschitz functions $f:\mathcal{K}\to\mathbb{R}$ of the form,

f_{\sigma}(u)=\sum_{j=1}^{N}\sigma_{j}\psi_{j}(u),\quad\sigma=(\sigma_{1},% \dots,\sigma_{N})\in\{0,1\}^{N}.

These functions satisfy $\|f_{\sigma}\|_{C(\mathcal{K})}\leq 3\epsilon\leq 1$ , and $\mathrm{Lip}(f_{\sigma})\leq\max_{j=1,\dots,N}\mathrm{Lip}(\psi_{j})=1$ , for all choices of $\sigma$ . Furthermore, if $\sigma,\sigma^{\prime}\in\{0,1\}^{N}$ are two distinct elements, say with $\sigma_{j_{0}}\neq\sigma^{\prime}_{j_{0}}$ , then it is straightforward to show that $\|f_{\sigma}-f_{\sigma^{\prime}}\|_{C(\mathcal{K})}\geq\|\psi_{j_{0}}\|_{C(% \mathcal{K})}=3\epsilon$ .

Thus, we have shown that there exist $2^{N}=|\{0,1\}^{N}|$ functions $f_{\sigma}\in\mathrm{Lip}_{1}(\mathcal{K})$ , with pairwise $C(\mathcal{K})$ -distance $\geq 3\epsilon$ . In particular, this implies that the packing number $\mathcal{M}(\mathrm{Lip}_{1}(\mathcal{K});3\epsilon)_{C(\mathcal{K})}\geq 2^{N}$ , and by the inequality (3.1) between packing- and covering-numbers, this now implies that

\mathcal{N}(\mathrm{Lip}_{1}(\mathcal{K});\epsilon)_{C(\mathcal{K})}\geq% \mathcal{M}(\mathrm{Lip}_{1}(\mathcal{K});3\epsilon)_{C(\mathcal{K})}\geq 2^{N}.

The claim follows by taking logarithms and recalling that $N=\mathcal{N}(\mathcal{K};6\epsilon)=2^{\mathcal{H}(\mathcal{K};6\epsilon)_{% \mathcal{X}}}$ . ∎
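The packing construction in the proof can be checked on a small example. The sketch below is an illustration under stated assumptions, not part of the argument: it takes $\mathcal{K}$ to be a grid in $[0,1]$ with the metric $|u-v|$ , fixes $\epsilon=1/30$ , and uses exact rational arithmetic; the names `psi`, `f_sigma` and `sup_dist` are hypothetical helpers.

```python
from fractions import Fraction
from itertools import combinations, product

eps = Fraction(1, 30)
K = [Fraction(i, 60) for i in range(61)]          # finite stand-in for the compact set
centers = [Fraction(j, 5) for j in range(6)]      # pairwise distance >= 6*eps = 1/5


def psi(u, c):
    """Hat function max(3*eps - d(u, c), 0) centered at c."""
    return max(3 * eps - abs(u - c), Fraction(0))


def f_sigma(sigma):
    """Values of f_sigma = sum_j sigma_j * psi_j on the grid K."""
    return [sum(s * psi(u, c) for s, c in zip(sigma, centers)) for u in K]


def sup_dist(f, g):
    """Sup-norm distance between two functions given by their values on K."""
    return max(abs(a - b) for a, b in zip(f, g))


family = {sigma: f_sigma(sigma) for sigma in product((0, 1), repeat=len(centers))}

# Smallest pairwise sup-norm distance within the family of 2^N functions.
min_separation = min(sup_dist(family[s], family[t]) for s, t in combinations(family, 2))

# Every f_sigma is 1-Lipschitz on K.
lipschitz_ok = all(
    abs(f[i] - f[k]) <= abs(K[i] - K[k])
    for f in family.values()
    for i, k in combinations(range(len(K)), 2)
)
```

With $N=6$ centers this produces $2^{6}=64$ functions whose pairwise distance is at least $3\epsilon$ , matching the count used in the proof.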

We conclude this section with several corollaries of Proposition 3.2.

Corollary 3.3 (Lipschitz functions on finite-dimensional domains).

If $D\subset\mathbb{R}^{d}$ is a compact domain in Euclidean space, then

\mathcal{H}(\mathrm{Lip}_{1}(D);\epsilon)\gtrsim\epsilon^{-d}.

Proof.

It is a well-known fact that

\mathcal{N}(D;\epsilon)\gtrsim\epsilon^{-d},

with an implied constant depending on the dimension $d$ and the volume of $D$ ; for example, this follows from a simple volume argument for an $\epsilon$ -covering $D\subset\bigcup_{n=1}^{N}\overline{B_{\epsilon}}(x_{n})$ , which yields

\mathrm{vol}(D)\leq\mathrm{vol}\left(\bigcup_{n=1}^{N}\overline{B_{\epsilon}}(% x_{n})\right)\leq N\mathrm{vol}(\overline{B_{\epsilon}})=NC_{d}\epsilon^{d}\;% \Rightarrow\;N\geq\frac{\mathrm{vol}(D)}{C_{d}\epsilon^{d}}.

The claim thus follows from Proposition 3.2. ∎

Corollary 3.4 (Lipschitz functionals on Sobolev spaces).

Let $D\subset\mathbb{R}^{d}$ be a compact domain in Euclidean space. Let $\mathcal{K}=\mathcal{U}(W^{s,p}(D))$ be the unit ball in the space of Sobolev functions possessing $s>0$ weak derivatives in $L^{p}(D)$ , considered as a subset of $L^{p}(D)$ . Then there exists a constant $c>0$ , such that

\mathcal{H}(\mathrm{Lip}_{1}(\mathcal{K});\epsilon)\gtrsim\exp(c\epsilon^{-d/s% }).

Proof.

The metric entropy of $\mathcal{U}(W^{s,p}(D))$ with respect to the $L^{p}$ -norm is lower bounded by [7]:

\mathcal{H}(\mathcal{K};\epsilon)_{L^{p}(D)}\gtrsim\epsilon^{-d/s}.

The claim thus follows from Proposition 3.2. ∎

Corollary 3.5 (Lipschitz functionals on Hölder spaces).

Let $D\subset\mathbb{R}^{d}$ be a compact domain in Euclidean space. Let $\mathcal{K}=\mathcal{U}(C^{s}(D))$ be the unit ball in the space of Hölder continuous functions of order $s>0$ , considered as a subset of $C(D)$ . Then there exists a constant $c>0$ , such that

\mathcal{H}(\mathrm{Lip}_{1}(\mathcal{K});\epsilon)\gtrsim\exp(c\epsilon^{-d/s% }).

Proof.

The metric entropy of $\mathcal{U}(C^{s}(D))$ with respect to the sup-norm is lower bounded by [26]:

\mathcal{H}(\mathcal{K};\epsilon)_{C(D)}\gtrsim\epsilon^{-d/s}.

The claim thus follows from Proposition 3.2. ∎

3.3. Approximation in expectation

Besides the setting discussed in the previous section, which is relevant for the uniform approximation of operators over a compact set of input functions, another commonly studied setting is the approximation in expectation (cp. Setting 2.4): Here, we consider $1$ -Lipschitz mappings $\mathcal{G}:\mathcal{X}\to\mathbb{R}$ defined on a separable Hilbert space $\mathcal{X}$ . We fix a probability measure $\mu$ on $\mathcal{X}$ and consider inputs as random draws $u\sim\mu$ . We assume that $\mu$ satisfies the minimal structural Assumption 2.10; under this assumption, random draws $u\sim\mu$ can be obtained from a Karhunen-Loève-like expansion, $u=\sum_{j=1}^{\infty}\sqrt{\lambda_{j}}Z_{j}e_{j}$ .

Our aim is to find lower bounds on the metric entropy of $\mathrm{Lip}_{1}(\mathcal{X})\subset\mathsf{\bm{V}}$ , where $\mathsf{\bm{V}}=L^{p}(\mu)$ is the space of $L^{p}(\mu)$ -integrable operators. The following entropy estimate represents the main novel contribution of this section:

Proposition 3.6.

Let $\mathcal{X}$ be a separable Hilbert space, and let $\mu$ be a probability measure satisfying Assumption 2.10. Let $p\in[1,\infty)$ be given. Assume that the coefficients $\sqrt{\lambda_{j}}\gtrsim j^{-\alpha}$ as $j\to\infty$ , where $\alpha>0$ . Then the metric entropy of $\mathrm{Lip}_{1}(\mathcal{X})$ with respect to the Bochner $L^{p}(\mu)$ -norm, obeys the following lower bound: There exist constants $c,\epsilon_{0}>0$ , such that

(3.3)

\displaystyle\mathcal{H}(\mathrm{Lip}_{1}(\mathcal{X});\epsilon)_{L^{p}(\mu)}% \geq\exp\left(c\epsilon^{-1/(\alpha+1)}\right),\quad\forall\,\epsilon\in(0,% \epsilon_{0}].

Our proof of Proposition 3.6 will rely on several technical lemmas, which we state and prove below. The first lemma identifies an isometric embedding $L^{p}([0,1]^{d}){\hookrightarrow}L^{p}(\mu)$ .

Lemma 3.7.

Let $\mathcal{X}$ be a separable Hilbert space. Let $\mu\in\mathcal{P}(\mathcal{X})$ satisfy Assumption 2.10, and let $p\in[1,\infty)$ . Then for any $d\in\mathbb{N}$ , there exists an isometric embedding,

(3.4)

\displaystyle\iota_{d}:L^{p}([0,1]^{d}){\hookrightarrow}L^{p}(\mu),

such that $\iota_{d}(\mathrm{Lip}_{1}([0,1]^{d}))\subset\mathrm{Lip}_{L/\sqrt{\lambda_{d}% }}(\mathcal{X})$ , where the Lipschitz norm on $[0,1]^{d}$ is defined with respect to the $\ell^{\infty}$ -norm on $[0,1]^{d}$ .

Proof.

By assumption, $\mu\in\mathcal{P}(\mathcal{X})$ is the law of a random field $u:\Omega\to\mathcal{X}$ of the form,

(3.5)

\displaystyle u(\omega)=\sum_{j=1}^{\infty}\sqrt{\lambda_{j}}Z_{j}(\omega)e_{j},

with $Z_{j}$ independent, $Z_{j}\sim\rho_{j}(z)\,dz$ . To construct the claimed isometry, we define $F_{j}(z):=\int_{-\infty}^{z}\rho_{j}(\zeta)\,d\zeta$ as the cumulative distribution function of $\rho_{j}$ . We recall that $F_{j}(Z_{j})\sim\mathcal{U}(0,1)$ is uniformly distributed on $[0,1]$ . Furthermore, we clearly have $\mathrm{Lip}(F_{j})=\|\rho_{j}\|_{L^{\infty}(\mathbb{R})}\leq L$ , where the last bound is by Assumption 2.10.

Given $u\in\mathcal{X}$ , we denote by $u_{j}:=\langle e_{j},u\rangle_{\mathcal{X}}$ the coefficients of $u$ with respect to the orthonormal basis $\{e_{j}\}$ . Using the CDFs introduced above, $F_{j}:\mathbb{R}\to[0,1]$ , we now define a mapping,

\iota_{d}:L^{p}([0,1]^{d})\to L^{p}(\mu),\quad(\iota_{d}f)(u):=f(F_{1}(u_{1}/% \sqrt{\lambda_{1}}),\dots,F_{d}(u_{d}/\sqrt{\lambda_{d}})).

To see that this is well-defined, we note that, using the expansion of the random field (3.5), $u_{j}/\sqrt{\lambda_{j}}=Z_{j}$ , and hence

(\iota_{d}f)(u)=f(F_{1}(Z_{1}),\dots,F_{d}(Z_{d})),\quad\text{for }u\sim\mu,

and we once again remind ourselves that $F_{j}(Z_{j})\sim\mathcal{U}(0,1)$ is uniformly distributed on $[0,1]$ , and that the $Z_{j}$ are independent by assumption. Thus, it follows that

	$\displaystyle\mathbb{E}_{u\sim\mu}\|(\iota_{d}f)(u)\|^{p}$	$\displaystyle=\mathbb{E}\|f(F_{1}(Z_{1}),\dots,F_{d}(Z_{d}))\|^{p}$
		$\displaystyle=\int_{[0,1]^{d}}\|f(x_{1},\dots,x_{d})\|^{p}\,dx$
		$\displaystyle=\|f\|_{L^{p}([0,1]^{d})}^{p}.$

Thus, $\|\iota_{d}f\|_{L^{p}(\mu)}=\|f\|_{L^{p}([0,1]^{d})}$ . This shows that $\iota_{d}:L^{p}([0,1]^{d})\to L^{p}(\mu)$ is an isometry as claimed. To verify that $\iota_{d}(\mathrm{Lip}_{1}([0,1]^{d}))\subset\mathrm{Lip}_{L/\sqrt{\lambda_{d}% }}(\mathcal{X})$ , we note that

h_{d}:(\mathcal{X},\|{\,\cdot\,}\|_{\mathcal{X}})\to([0,1]^{d},\ell^{\infty}),% \quad u\mapsto(F_{1}(u_{1}/\sqrt{\lambda_{1}}),\dots,F_{d}(u_{d}/\sqrt{\lambda% _{d}})),

has Lipschitz constant bounded by

\mathrm{Lip}(h_{d})\leq\max_{j=1,\dots,d}\frac{\mathrm{Lip}(F_{j})}{\sqrt{% \lambda_{j}}}\leq\frac{L}{\sqrt{\lambda_{d}}}.

Thus, for any $f\in\mathrm{Lip}_{1}([0,1]^{d})=\mathrm{Lip}_{1}(([0,1]^{d},\ell^{\infty}))$ ,

\mathrm{Lip}(\iota_{d}f)=\mathrm{Lip}(f\circ h_{d})\leq\mathrm{Lip}(f)\mathrm{% Lip}(h_{d})\leq\frac{L}{\sqrt{\lambda_{d}}}.

Furthermore, we also have $\|\iota_{d}f\|_{C(\mathcal{X})}\leq\|f\|_{C([0,1]^{d})}\leq 1$ . This shows that

\|\iota_{d}f\|_{\mathrm{Lip}}=\max\left\{\|\iota_{d}f\|_{C(\mathcal{X})},% \mathrm{Lip}(\iota_{d}f)\right\}\leq\max\left\{1,\frac{L}{\sqrt{\lambda_{d}}}% \right\}=\frac{L}{\sqrt{\lambda_{d}}}.

Here, we have made use of the choice $L>\sqrt{\lambda_{1}}\geq\sqrt{\lambda_{d}}$ (cp. (2.20)) in the last equality. This concludes our proof. ∎
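The isometry property of $\iota_{d}$ can be illustrated by a Monte Carlo experiment. The sketch below is only a sanity check under assumptions that are not taken from the text: the $Z_{j}$ are standard Gaussian, $d=2$ , $p=2$ , the values in `sqrt_lam` are arbitrary, and the test function is $f(x)=x_{1}+x_{2}$ , for which $\|f\|_{L^{2}([0,1]^{2})}^{2}=7/6$ .

```python
import math
import random


def normal_cdf(z):
    """CDF F_j of a standard Gaussian (assumed law of Z_j in this sketch)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


sqrt_lam = [1.0, 0.5]   # hypothetical values of sqrt(lambda_1), sqrt(lambda_2)
p = 2


def f(x):
    """Test function on [0,1]^2 with exact L^2 norm squared equal to 7/6."""
    return x[0] + x[1]


def iota_f(u):
    """(iota_d f)(u) = f(F_1(u_1 / sqrt(lambda_1)), ..., F_d(u_d / sqrt(lambda_d)))."""
    return f([normal_cdf(uj / s) for uj, s in zip(u, sqrt_lam)])


rng = random.Random(0)
n_samples = 100_000
total = 0.0
for _ in range(n_samples):
    # First d coefficients of u = sum_j sqrt(lambda_j) Z_j e_j.
    u = [s * rng.gauss(0.0, 1.0) for s in sqrt_lam]
    total += abs(iota_f(u)) ** p

mc_norm_p = total / n_samples   # Monte Carlo estimate of ||iota_d f||_{L^p(mu)}^p
exact_norm_p = 7.0 / 6.0        # ||f||_{L^p([0,1]^2)}^p for p = 2
```

The estimate agrees with the exact value up to sampling error, independently of the chosen `sqrt_lam`, because the rescaling by $\sqrt{\lambda_{j}}$ is undone before the CDFs are applied.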

As a consequence of Lemma 3.7, we have:

Corollary 3.8.

Under the assumptions of Lemma 3.7, we have

(3.6)

\displaystyle\mathcal{H}(\mathrm{Lip}_{1}(\mathcal{X});\epsilon)_{L^{p}(\mu)}% \geq\mathcal{H}\left(\mathrm{Lip}_{1}([0,1]^{d});\frac{L\epsilon}{\sqrt{% \lambda_{d}}}\right)_{L^{p}([0,1]^{d})},

for any $d\in\mathbb{N}$ .

Proof.

We recall the existence of an isometric embedding $\iota_{d}:L^{p}([0,1]^{d})\to L^{p}(\mu)$ from Lemma 3.7, with $\iota_{d}(\mathrm{Lip}_{1}([0,1]^{d}))\subset\mathrm{Lip}_{L/\sqrt{\lambda_{d}}}(\mathcal{X})$ . It follows that

	$\displaystyle\mathcal{N}(\mathrm{Lip}_{1}(\mathcal{X});\epsilon)_{L^{p}(\mu)}$	$\displaystyle=\mathcal{N}(\mathrm{Lip}_{L/\sqrt{\lambda_{d}}}(\mathcal{X});L% \epsilon/\sqrt{\lambda_{d}})_{L^{p}(\mu)}$
		$\displaystyle\geq\mathcal{N}(\iota_{d}(\mathrm{Lip}_{1}([0,1]^{d}));L\epsilon/% \sqrt{\lambda_{d}})_{L^{p}(\mu)}$
		$\displaystyle=\mathcal{N}(\mathrm{Lip}_{1}([0,1]^{d});L\epsilon/\sqrt{\lambda_% {d}})_{L^{p}([0,1]^{d})}.$

Taking logarithms, the claimed inequality between the metric entropy follows. ∎

The proof of Proposition 3.6 will furthermore make use of the following result in the finite-dimensional setting:

Lemma 3.9.

Let $p\in[1,\infty)$ be given. For $d\in\mathbb{N}$ , consider $\mathrm{Lip}_{1}([0,1]^{d})\subset L^{p}([0,1]^{d})$ . Then there exists a constant $c>0$ , independent of $d$ , such that we have the following lower bound on the metric entropy:

(3.7)

\displaystyle\mathcal{H}(\mathrm{Lip}_{1}([0,1]^{d});\epsilon)_{L^{p}([0,1]^{d% })}\geq\frac{1}{8}\left(\frac{c}{d\epsilon}\right)^{d},\quad\forall\,\epsilon% \in\left(0,\frac{c}{d}\right].

Proof.

Since the Hölder inequality implies, for any $p\in[1,\infty)$ , that $\|f\|_{L^{1}([0,1]^{d})}\leq\|f\|_{L^{p}([0,1]^{d})}$ , it follows that any covering of $\mathrm{Lip}_{1}([0,1]^{d})$ by $\epsilon$ -balls with respect to the $L^{p}$ -norm, also gives rise to a covering of $\mathrm{Lip}_{1}([0,1]^{d})$ with respect to the $L^{1}$ -norm (with the same centers). In particular, this implies that

\mathcal{N}(\mathrm{Lip}_{1}([0,1]^{d});\epsilon)_{L^{p}([0,1]^{d})}\geq% \mathcal{N}(\mathrm{Lip}_{1}([0,1]^{d});\epsilon)_{L^{1}([0,1]^{d})},

and we only need to establish (3.7) for $p=1$ .

For $\lambda\in(0,1)$ , define $\phi_{\lambda}:[0,1]^{d}\to\mathbb{R}_{+}$ as a composition $g_{\lambda}\circ\|{\,\cdot\,}\|_{\ell^{\infty}}$ , where $g_{\lambda}:\mathbb{R}\to\mathbb{R}$ is a piecewise linear function (approximately $g_{\lambda}\approx 1_{[0,1]}$ ) with values,

g_{\lambda}(x):=\begin{cases}0,&(x\notin[0,1]),\\ 1,&(x\in[\lambda/2,1-\lambda/2]),\end{cases}

and $g_{\lambda}$ interpolates linearly between $0$ and $1$ on $[0,\lambda/2]$ , and from $1$ to $0$ on $[1-\lambda/2,1]$ . By construction, $g_{\lambda}$ is $2/\lambda$ -Lipschitz. Since $x\mapsto\|x\|_{\ell^{\infty}}$ is $1$ -Lipschitz, it follows that $\mathrm{Lip}(\phi_{\lambda})=\mathrm{Lip}(g_{\lambda}\circ\|{\,\cdot\,}\|_{% \ell^{\infty}})\leq 2/\lambda$ . Clearly, smaller $\lambda$ leads to a larger Lipschitz constant. However, by construction of $\phi_{\lambda}$ , we have $\phi_{\lambda}\geq 1_{[\lambda/2,1-\lambda/2]^{d}}$ . In particular, this implies that $\|\phi_{\lambda}\|_{L^{1}}\geq(1-\lambda)^{d}$ . Thus, smaller $\lambda$ increases the $L^{1}$ -norm of $\phi_{\lambda}$ .

Given $N\in\mathbb{N}$ , we now subdivide $[0,1]^{d}$ into $N^{d}$ cubes of equal length, indexed by $j\in[N]^{d}$ , where $[N]^{d}=\{1,\dots,N\}^{d}$ . For any multi-index $j\in[N]^{d}$ , we define $\phi_{\lambda,j}(x)$ as a rescaled and translated copy of $\phi_{\lambda}$ , such that the support of $\phi_{\lambda,j}$ coincides with the $j$ -th cube. In particular, by construction of $\phi_{\lambda}$ , this implies that

(3.8)		$\displaystyle\|\phi_{\lambda,j}\|_{L^{1}([0,1]^{d})}$	$\displaystyle\geq(1-\lambda)^{d}N^{-d},$
(3.9)		$\displaystyle\mathrm{Lip}(\phi_{\lambda,j})$	$\displaystyle\leq 2N\lambda^{-1}.$

We also note that the $\phi_{\lambda,j}$ have essentially disjoint supports. For $\sigma\in\{-1,1\}^{[N]^{d}}$ , we now define

f_{\sigma}(x)=\frac{\lambda}{2N}\sum_{j\in[N]^{d}}\sigma_{j}\phi_{\lambda,j}(x).

The factor in front of the sum ensures that $\mathrm{Lip}(f_{\sigma})\leq 1$ . Furthermore, we also note that $\|f_{\sigma}\|_{C([0,1]^{d})}\leq\lambda/2N\leq 1$ for any choice of $\lambda\in(0,1)$ and $N\in\mathbb{N}$ . In particular, we have $f_{\sigma}\in\mathrm{Lip}_{1}([0,1]^{d})$ , for any choice of $\sigma$ . We finally observe that, due to the disjoint supports of the $\phi_{\lambda,j}$ , we have, for any $\sigma,\sigma^{\prime}\in\{-1,1\}^{[N]^{d}}$ ,

	$\displaystyle\|f_{\sigma}-f_{\sigma^{\prime}}\|_{L^{1}([0,1]^{d})}$	$\displaystyle=\frac{\lambda}{2N}\sum_{j\in[N]^{d}}|\sigma_{j}-\sigma^{\prime}_{j}|\,\|\phi_{\lambda,j}\|_{L^{1}([0,1]^{d})}$
		$\displaystyle\geq\lambda(1-\lambda)^{d}N^{-1}\frac{\#\{\sigma_{j}\neq\sigma_{j}^{\prime}\}}{N^{d}}.$

The last quotient is the fraction of entries in which $\sigma$ and $\sigma^{\prime}$ differ. It turns out that there exists a subset $\Xi\subset\{-1,1\}^{[N]^{d}}$ , such that any $\sigma\neq\sigma^{\prime}$ belonging to $\Xi$ differ on a substantial fraction of their components; more precisely, as noted in [1] as a result of the Gilbert-Varshamov bound, there exists a subset $\Xi\subset\{-1,1\}^{[N]^{d}}$ satisfying that any two distinct elements $\sigma,\sigma^{\prime}\in\Xi$ , differ on at least a fourth of their coordinates,

(3.10)

\displaystyle\frac{\#\{\sigma_{j}\neq\sigma_{j}^{\prime}\}}{N^{d}}\geq\frac{1}% {4},\qquad\forall\,\sigma,\sigma^{\prime}\in\Xi,\;\sigma\neq\sigma^{\prime},

and the cardinality of $\Xi$ is lower bounded by,

(3.11)

\displaystyle\#\Xi\geq\exp(N^{d}/8)\geq 2^{N^{d}/8}.
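For very small sign vectors, a set $\Xi$ with the separation property (3.10) can be found by brute force. The following sketch is purely illustrative and is not the Gilbert-Varshamov argument itself: it uses a greedy search over $\{-1,1\}^{n}$ with $n=8$ playing the role of $N^{d}$ , and the function name `greedy_code` is a hypothetical helper.

```python
import math
from itertools import product


def hamming(a, b):
    """Number of coordinates in which two sign vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_code(n, min_dist):
    """Greedily collect sign vectors in {-1, 1}^n with pairwise Hamming distance >= min_dist."""
    code = []
    for word in product((-1, 1), repeat=n):
        if all(hamming(word, c) >= min_dist for c in code):
            code.append(word)
    return code


n = 8                                   # plays the role of N^d
Xi = greedy_code(n, math.ceil(n / 4))   # distinct elements differ on >= n/4 coordinates
```

The greedy set is far larger than $\exp(n/8)$ for this small $n$ ; the bound (3.11) is what remains guaranteed as $n$ grows.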

This implies that for any two $\sigma\neq\sigma^{\prime}$ in $\Xi$ , we have

\|f_{\sigma}-f_{\sigma^{\prime}}\|_{L^{1}([0,1]^{d})}\geq\frac{1}{4N}\lambda(1% -\lambda)^{d}.

Optimizing the right-hand side over $\lambda\in(0,1)$ , we set $\lambda=1/(1+d)$ to obtain,

\|f_{\sigma}-f_{\sigma^{\prime}}\|_{L^{1}([0,1]^{d})}\geq\frac{1}{4(d+1)N}% \frac{1}{(1+1/d)^{d}}\geq\frac{1}{4e(d+1)N}\geq\frac{1}{8edN},

where we used that Euler's number satisfies $e\geq(1+1/d)^{d}$ , and the fact that $d\geq 1$ implies $d+1\leq 2d$ in the last bound.
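The choice $\lambda=1/(1+d)$ and the resulting chain of bounds can be verified numerically. The sketch below is a simple check on a grid of $\lambda$ values; the function names and the grid resolution are assumptions made for illustration.

```python
import math


def separation_factor(lam, d):
    """lam * (1 - lam)**d, the lam-dependent factor in the L^1 separation bound."""
    return lam * (1.0 - lam) ** d


def check_choice(d, grid=1000):
    """Compare the value at lam = 1/(d+1) with the best value on a uniform grid in (0, 1)."""
    lam_star = 1.0 / (d + 1)
    value = separation_factor(lam_star, d)
    best_on_grid = max(separation_factor(k / grid, d) for k in range(1, grid))
    return value, best_on_grid


results = {d: check_choice(d) for d in (1, 2, 5, 20, 100)}
```

In each case the value at $\lambda=1/(1+d)$ is not exceeded on the grid, and it dominates $1/(e(d+1))$ , which in turn dominates $1/(2ed)$ .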

Taking into account the bound (3.11), it follows that the packing number $\mathcal{M}(\mathrm{Lip}_{1}([0,1]^{d});\epsilon)$ , satisfies the lower bound,

\log_{2}\mathcal{M}(\mathrm{Lip}_{1}([0,1]^{d});(\beta_{d}N)^{-1})\geq N^{d}/8% ,\quad\forall\,N\in\mathbb{N},

where we have defined $\beta_{d}=8ed$ . Given $\epsilon\in(0,\beta_{d}^{-1}]$ , we can find $N\in\mathbb{N}$ , such that

(\beta_{d}N)^{-1}\geq\epsilon\geq(2\beta_{d}N)^{-1}.

Since the packing number is non-increasing in $\epsilon$ , and since $N\geq(2\beta_{d}\epsilon)^{-1}$ , it follows that

	$\displaystyle\log_{2}\mathcal{M}(\mathrm{Lip}_{1}([0,1]^{d});\epsilon)$	$\displaystyle\geq\log_{2}\mathcal{M}(\mathrm{Lip}_{1}([0,1]^{d});(\beta_{d}N)^{-1})$
		$\displaystyle\geq\frac{N^{d}}{8}\geq\frac{1}{8}\left(2\beta_{d}\epsilon\right)^{-d}.$

We conclude that

\log_{2}\mathcal{M}(\mathrm{Lip}_{1}([0,1]^{d});\epsilon)\geq\frac{1}{8}\left(2\beta_{d}\epsilon\right)^{-d},\quad\forall\,\epsilon\in(0,\beta_{d}^{-1}].

This lower bound on the packing number holds for any dimension $d\in\mathbb{N}$ . We can now use the general relation $\mathcal{N}(A;\epsilon)\geq\mathcal{M}(A;3\epsilon)$ between the covering- and packing-numbers (3.1), to conclude that,

\mathcal{H}(\mathrm{Lip}_{1}([0,1]^{d});\epsilon)=\log_{2}\mathcal{N}(\mathrm{Lip}_{1}([0,1]^{d});\epsilon)\geq\frac{1}{8}\left(6\beta_{d}\epsilon\right)^{-d},\quad\forall\,\epsilon\in(0,(3\beta_{d})^{-1}],

where $\beta_{d}=8ed$ . This proves the claim with $c=1/(48e)$ , i.e.

\log_{2}\mathcal{N}(\mathrm{Lip}_{1}([0,1]^{d});\epsilon)\geq\frac{1}{8}\left(\frac{c}{d\epsilon}\right)^{d},\quad\forall\,\epsilon\in\left(0,\frac{c}{d}\right].

∎

Assuming the results of Corollary 3.8 and Lemma 3.9, we can now prove Proposition 3.6.

Proof of Proposition 3.6.

Combining the lower bound (3.6) and (3.7), we obtain that for any $d\in\mathbb{N}$ ,

\displaystyle\log_{2}\mathcal{N}(\mathrm{Lip}_{1}(\mathcal{X});\epsilon)_{L^{p% }(\mu)}\geq\frac{1}{8}\left(\frac{c\sqrt{\lambda_{d}}}{Ld\epsilon}\right)^{d}% \geq\left(\frac{c\sqrt{\lambda_{d}}}{8Ld\epsilon}\right)^{d},

provided that $\epsilon\leq\frac{c\sqrt{\lambda_{d}}}{Ld}$ . Since $\lambda_{d}\gtrsim d^{-2\alpha}$ by assumption, and since $c$ and $L$ are constants independent of $d$ , it thus follows that there exist $c_{1},c_{2}>0$ , independent of $d$ , such that

(3.12)

\displaystyle\log_{2}\mathcal{N}(\mathrm{Lip}_{1}(\mathcal{X});\epsilon)_{L^{p% }(\mu)}\geq\left(\frac{c_{1}}{d^{1+\alpha}\epsilon}\right)^{d},\quad\text{if $% \epsilon\leq c_{2}d^{-(1+\alpha)}$.}

The idea is now to choose $d=d(\epsilon)\sim\epsilon^{-1/(\alpha+1)}$ , such that the term inside the parentheses is lower bounded by $e^{\beta}$ for some fixed $\beta>0$ , implying that the right hand side is $\gtrsim(e^{\beta})^{d}=\exp(\beta d)\gtrsim\exp(c\epsilon^{-1/(\alpha+1)})$ for some constant $c>0$ . This then leads to the claimed lower bound. We now proceed to provide the details of the required argument.

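We first fix $\beta=-\log(c_{2}/c_{1})$ ; decreasing $c_{2}$ if necessary, which does not affect the validity of (3.12), we may assume that $c_{2}<c_{1}$ and hence $\beta>0$ . By definition,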

(3.13)

\displaystyle e^{-\beta}=c_{2}/c_{1}.

We next define

(3.14)

\displaystyle\epsilon_{0}=c_{1}e^{-\beta}=c_{2}.

Since $c_{1},c_{2}$ are independent of $d$ , it follows that also $\beta$ and $\epsilon_{0}$ are independent of $d$ .

For any $\epsilon\in(0,\epsilon_{0}]$ , the above choice ensures that

\epsilon\leq\epsilon_{0}\leq c_{1}e^{-\beta},

and hence there exists $d=d(\epsilon)\in\mathbb{N}$ , such that

\epsilon d^{(1+\alpha)}\leq c_{1}e^{-\beta}<\epsilon(2d)^{(1+\alpha)}.

In particular, upon rearranging the first inequality in the last display, we obtain the two equivalent formulations,

(3.15)		$\displaystyle\epsilon\leq c_{1}e^{-\beta}d^{-(1+\alpha)}=c_{2}d^{-(1+\alpha)},$
(3.16)		$\displaystyle\frac{c_{1}}{d^{(1+\alpha)}\epsilon}\geq e^{\beta}.$

while the second bound $c_{1}e^{-\beta}<\epsilon(2d)^{(1+\alpha)}$ implies,

(3.17)

\displaystyle\beta d\geq c\epsilon^{-1/(\alpha+1)},\quad\text{where }c:=\frac{\beta}{2}\left[\frac{c_{1}}{e^{\beta}}\right]^{1/(\alpha+1)}.

With this choice of $d=d(\epsilon)$ , equation (3.15) guarantees that the estimate in (3.12) applies to all $\epsilon\in(0,\epsilon_{0}]$ . This in turn implies that

	$\displaystyle\mathcal{H}(\mathrm{Lip}_{1}(\mathcal{X});\epsilon)_{L^{p}(\mu)}$	$\displaystyle=\log_{2}\mathcal{N}(\mathrm{Lip}_{1}(\mathcal{X});\epsilon)_{L^{% p}(\mu)}$
		$\displaystyle\overset{(3.12)}{\geq}\left(\frac{c_{1}}{d^{(1+\alpha)}\epsilon}\right)^{d}$
		$\displaystyle\overset{(3.16)}{\geq}e^{\beta d}$
		$\displaystyle\overset{(3.17)}{\geq}\exp\left(c\epsilon^{-1/(\alpha+1)}\right),$

for all $\epsilon\in(0,\epsilon_{0}]$ . This is the claimed lower bound on the metric entropy. ∎
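The selection of $d=d(\epsilon)$ in the proof can be traced numerically. The sketch below uses hypothetical constants $c_{1}=1$ , $c_{2}=1/2$ and $\alpha=1$ (chosen only so that $\beta=-\log(c_{2}/c_{1})>0$ ); it picks $d$ as the integer part of $(c_{2}/\epsilon)^{1/(1+\alpha)}$ , which is one admissible choice, and reports the logarithm of the right-hand side of (3.12).

```python
import math


def choose_dimension(eps, c2, alpha):
    """Return d with eps * d**(1+alpha) <= c2 < eps * (2*d)**(1+alpha), for 0 < eps <= c2."""
    x = (c2 / eps) ** (1.0 / (1.0 + alpha))
    return max(1, math.floor(x))


def log_entropy_lower_bound(eps, c1, c2, alpha):
    """Natural log of (c1 / (d**(1+alpha) * eps))**d with d = d(eps), cf. (3.12)."""
    d = choose_dimension(eps, c2, alpha)
    return d * math.log(c1 / (d ** (1.0 + alpha) * eps))


# Hypothetical constants; recall that c1 * exp(-beta) = c2 by the definition of beta.
C1, C2, ALPHA = 1.0, 0.5, 1.0
BETA = -math.log(C2 / C1)

table = {
    eps: (choose_dimension(eps, C2, ALPHA), log_entropy_lower_bound(eps, C1, C2, ALPHA))
    for eps in (0.3, 1e-2, 1e-4, 1e-6)
}
```

For each listed $\epsilon$ the chosen $d$ satisfies both inequalities that define $d(\epsilon)$ , the bound is at least $e^{\beta d}$ , and $d$ grows proportionally to $\epsilon^{-1/(\alpha+1)}$ .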

4. Generic approximation results

We first discuss an abstract formulation of a general “approximation task”. Let $\mathsf{\bm{V}}$ be a Banach space (e.g. a space of operators). In a general non-linear approximation task, we are given for any $n\in\mathbb{N}$ a set $\Sigma_{n}\subset\mathsf{\bm{V}}$ over which we aim to approximate an element $f\in\mathsf{\bm{V}}$ , where we will assume that $f$ belongs to a general class $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ of interest. Considering these subsets $\Sigma_{n}\subset\mathsf{\bm{V}}$ fixed, and given a sequence $\epsilon_{n}\to 0$ , we will say that $f\in\mathsf{\bm{A}}$ can be approximated with convergence rate $\epsilon_{n}$ , if there exists a constant $M_{f}>0$ , such that

(4.1)

\displaystyle\inf_{\psi_{n}\in\Sigma_{n}}\|f-\psi_{n}\|_{\mathsf{\bm{V}}}\leq M_{f}\epsilon_{n},\quad\forall\,n\in\mathbb{N}.

Specifically, we will be most interested in the logarithmic case $\epsilon_{n}=\log(n)^{-\gamma}$ , in the following, with $\Sigma_{n}$ corresponding to all possible realizations of a fixed bit-encoded neural operator architecture (cp. the proofs of Propositions 2.13 and 2.14, respectively).

Coming back to the general abstract setting above, and given $M>0$ , we introduce a set of “efficiently approximated” elements $\mathsf{\bm{E}}_{M}\subset\mathsf{\bm{A}}$ with bound $M$ , i.e.

(4.2)

\displaystyle\mathsf{\bm{E}}_{M}:={\left\{f\in\mathsf{\bm{A}}\,\middle|\,\text{inequality (4.1) holds with constant $M_{f}=M$}\right\}}.

And we denote the set of all $f\in\mathsf{\bm{A}}$ which can be approximated at convergence rate $\epsilon_{n}$ , by elements in $\Sigma_{n}$ , by

(4.3)

\displaystyle\mathsf{\bm{E}}=\bigcup_{M>0}\mathsf{\bm{E}}_{M}={\left\{f\in\mathsf{\bm{A}}\,\middle|\,\text{there exists $M_{f}$ such that (4.1) holds}\right\}}.

Our goal is to study generically achievable approximation rates $\epsilon_{n}$ , in terms of the complexity of $\mathsf{\bm{A}}$ , as measured by its metric entropy.

The following lemma will be fundamental to our analysis:

Lemma 4.1.

Let $\mathsf{\bm{V}}$ be a Banach space. Let $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ be a compact, convex subset. Let $\{\Sigma_{n}\}_{n\in\mathbb{N}}$ be a family of subsets $\Sigma_{n}\subset\mathsf{\bm{V}}$ , with $|\Sigma_{n}|\leq 2^{n}$ elements. Fix $M>0$ . If $\mathsf{\bm{E}}_{M}\subset\mathsf{\bm{A}}$ given by (4.2) has non-empty interior in the subspace topology on $\mathsf{\bm{A}}$ , then there exists a constant $\lambda>0$ , independent of $n$ , such that the metric entropy satisfies the bound,

\mathcal{H}(\mathsf{\bm{A}};\lambda\epsilon_{n})_{\mathsf{\bm{V}}}\leq n,\quad% \forall\,n\in\mathbb{N}.

Proof.

At the outset we note that by compactness, we have a uniform upper bound,

\displaystyle\sup_{f\in\mathsf{\bm{A}}}\|f\|\leq C_{\mathsf{\bm{A}}}<\infty.

Upon a simple rescaling, we may wlog assume that $C_{\mathsf{\bm{A}}}=1$ , i.e. that $\|f\|\leq 1$ for all $f\in\mathsf{\bm{A}}$ . This will be assumed in the following proof.

By assumption, the set $\mathsf{\bm{E}}_{M}$ defined by (4.2) has non-empty interior in the subspace topology on $\mathsf{\bm{A}}$ . Hence, there exist $f_{0}\in\mathsf{\bm{A}}$ and $\delta>0$ , such that

B_{\delta}(f_{0})\subset\mathsf{\bm{E}}_{M}\subset\bigcup_{\psi_{n}\in\Sigma_{% n}}\overline{B_{M\epsilon_{n}}}(\psi_{n}),

where $B_{\delta}(f_{0})={\left\{f\in\mathsf{\bm{A}}\,\middle|\,\|f-f_{0}\|<\delta% \right\}}\subset\mathsf{\bm{A}}$ is an open ball in the subspace topology on $\mathsf{\bm{A}}$ . Thus, for any $n\in\mathbb{N}$ , we obtain the following bound on the covering numbers,

(4.4)

\displaystyle\mathcal{N}(B_{\delta}(f_{0});M\epsilon_{n})\leq\mathcal{N}(% \mathsf{\bm{E}}_{M};M\epsilon_{n})\leq|\Sigma_{n}|\leq 2^{n}.

We next recall that we have wlog assumed $\sup_{f\in\mathsf{\bm{A}}}\|f\|\leq 1$ , and we recall that $\mathsf{\bm{A}}$ is convex by assumption. In particular, we next show that this implies that

\left(1-\frac{\delta}{3}\right)f_{0}+\frac{\delta}{3}\mathsf{\bm{A}}\subset B_% {\delta}(f_{0}).

To see why, let $\delta^{\prime}=\delta/3$ and fix $f\in\mathsf{\bm{A}}$ arbitrary. We need to show that $f_{\delta^{\prime}}:=(1-\delta^{\prime})f_{0}+\delta^{\prime}f\in B_{\delta}(f% _{0})$ . Since $\mathsf{\bm{A}}$ is convex, it is clear that $f_{\delta^{\prime}}\in\mathsf{\bm{A}}$ . In addition, we also have

\|f_{\delta^{\prime}}-f_{0}\|=\|(1-\delta^{\prime})f_{0}+\delta^{\prime}f-f_{0% }\|=\delta^{\prime}\|f-f_{0}\|\leq 2\delta^{\prime}=\frac{2\delta}{3}<\delta.

Hence, $f_{\delta^{\prime}}\in B_{\delta}(f_{0})$ as claimed. The inclusion, $(1-\delta/3)f_{0}+(\delta/3)\mathsf{\bm{A}}\subset B_{\delta}(f_{0})$ now implies,

(4.5)

\displaystyle\mathcal{N}(B_{\delta}(f_{0});M\epsilon_{n})\geq\mathcal{N}((% \delta/3)\mathsf{\bm{A}};M\epsilon_{n})=\mathcal{N}(\mathsf{\bm{A}};3M\epsilon% _{n}/\delta).

Combining (4.4) and (4.5), we conclude that

\mathcal{H}(\mathsf{\bm{A}};3M\epsilon_{n}/\delta)_{\mathsf{\bm{V}}}=\log_{2}% \mathcal{N}(\mathsf{\bm{A}};3M\epsilon_{n}/\delta)_{\mathsf{\bm{V}}}\leq n,% \quad\forall\,n\in\mathbb{N}.

We emphasize that $M,\delta>0$ are independent of $n$ in the above argument. In particular, the claim of the lemma holds with constant $\lambda=3M/\delta>0$ . ∎

Proposition 4.2 (Exponential scaling).

Let $\mathsf{\bm{V}}$ be a Banach space. Let $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ be a compact, convex subset. Assume that there exist constants $C,c,\gamma>0$ such that,

(4.6)

\displaystyle\mathcal{H}(\mathsf{\bm{A}};\epsilon)_{\mathsf{\bm{V}}}\geq C\exp% \left(c\epsilon^{-1/\gamma}\right),\quad\forall\,\epsilon>0.

Let $\{\Sigma_{n}\}_{n\in\mathbb{N}}$ be a family of subsets $\Sigma_{n}\subset\mathsf{\bm{V}}$ with $|\Sigma_{n}|\leq 2^{n}$ elements. Then generic elements $f\in\mathsf{\bm{A}}$ cannot be approximated by elements of $\Sigma_{n}$ at convergence rate better than $\log(n)^{-\gamma}$ ; more precisely, for any sequence $\epsilon_{n}=o(\log(n)^{-\gamma})$ , the subset $\mathsf{\bm{R}}\subset\mathsf{\bm{A}}$ , consisting of all $f\in\mathsf{\bm{A}}$ , such that

(4.7)

\displaystyle\inf_{\psi_{n}\in\Sigma_{n}}\|f-\psi_{n}\|\not=O(\epsilon_{n}),

is residual.

Before coming to the proof of Proposition 4.2, we note that since $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ is compact, $\mathsf{\bm{A}}$ is a complete metric space in the subspace topology. In particular, the following argument, which is based on the Baire category theorem, can be applied to $\mathsf{\bm{A}}$ (cp. Appendix A for a summary).

Proof.

Let $\mathsf{\bm{R}}:=\mathsf{\bm{A}}\setminus\mathsf{\bm{E}}$ , where $\mathsf{\bm{E}}$ is defined by (4.3). Recall that $\mathsf{\bm{E}}$ is precisely the set of $f\in\mathsf{\bm{A}}$ for which there exists $M_{f}>0$ such that

\inf_{\psi_{n}\in\Sigma_{n}}\|f-\psi_{n}\|\leq M_{f}\epsilon_{n}.

In Lemma 4.1, it is shown that if $\mathsf{\bm{E}}_{M}\subset\mathsf{\bm{A}}$ has non-empty interior then there exists a constant $\lambda>0$ , such that

\log\mathcal{N}(\mathsf{\bm{A}};\lambda\epsilon_{n})\leq n,\quad\forall\,n\in% \mathbb{N}.

By assumption on $\mathsf{\bm{A}}$ , the left hand side is lower bounded by $C\exp\left(c(\lambda\epsilon_{n})^{-1/\gamma}\right)$ . Thus, if $\mathsf{\bm{E}}_{M}$ has non-empty interior, then we must have

C\exp\left(c(\lambda\epsilon_{n})^{-1/\gamma}\right)\leq n\quad\Rightarrow\quad\epsilon_{n}\gtrsim\log(n)^{-\gamma},\;\text{as }n\to\infty.

But by the assumption that $\epsilon_{n}=o(\log(n)^{-\gamma})$ , this last lower bound cannot hold, asymptotically as $n\to\infty$ . Thus, we conclude that $\mathsf{\bm{E}}_{M}\subset\mathsf{\bm{A}}$ has empty interior for any $M>0$ . We furthermore note that $\mathsf{\bm{E}}_{M}$ is closed; indeed, $\mathsf{\bm{E}}_{M}$ in (4.2) is given by,

(4.8)

\displaystyle\mathsf{\bm{E}}_{M}=\bigcap_{n=1}^{\infty}\bigcup_{\psi_{n}\in% \Sigma_{n}}\overline{B_{M\epsilon_{n}}(\psi_{n})},

where we define the closed balls (in the induced topology on $\mathsf{\bm{A}}$ ),

\overline{B_{M\epsilon_{n}}(\psi)}:={\left\{f\in\mathsf{\bm{A}}\,\middle|\,\|f% -\psi\|\leq M\epsilon_{n}\right\}}\subset\mathsf{\bm{A}}.

Therefore $\mathsf{\bm{E}}_{M}$ can be written as an intersection of a union of closed balls of radius $M\epsilon_{n}$ centered at elements $\psi\in\Sigma_{n}$ . Note that, since the set $\Sigma_{n}$ is finite by assumption, the union of these closed balls,

\mathsf{\bm{E}}_{M,n}:=\bigcup_{\psi_{n}\in\Sigma_{n}}\overline{B_{M\epsilon_{% n}}(\psi_{n})},

is closed for any $n\in\mathbb{N}$ , implying that also $\mathsf{\bm{E}}_{M}=\bigcap_{n=1}^{\infty}\mathsf{\bm{E}}_{M,n}\subset\mathsf{% \bm{A}}$ is closed as an intersection of closed sets.

To conclude the proof, we simply note that $\mathsf{\bm{E}}=\bigcup_{M\in\mathbb{N}}\mathsf{\bm{E}}_{M}$ can be written as a countable union, for integer $M\in\mathbb{N}$ , of closed subsets with empty interior $\mathsf{\bm{E}}_{M}$ . In particular, this implies that $\mathsf{\bm{E}}$ is itself meagre by the Baire category theorem. We conclude that the complement $\mathsf{\bm{R}}:=\mathsf{\bm{A}}\setminus\mathsf{\bm{E}}$ , consisting of all $f\in\mathsf{\bm{A}}$ for which

\inf_{\psi_{n}\in\Sigma_{n}}\|f-\psi_{n}\|\not=O(\epsilon_{n}),

is residual. This completes the proof. ∎
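The quantitative step in the proof, namely that $C\exp(c(\lambda\epsilon_{n})^{-1/\gamma})\leq n$ forces $\epsilon_{n}\gtrsim\log(n)^{-\gamma}$ , can be made concrete by inverting the inequality. The sketch below is an illustration with hypothetical default constants ( $C=c=\gamma=\lambda=1$ ); the function names are not from the text.

```python
import math


def min_bits(eps, C=1.0, c=1.0, gamma=1.0, lam=1.0):
    """Smallest n compatible with C * exp(c * (lam * eps)**(-1/gamma)) <= n."""
    return C * math.exp(c * (lam * eps) ** (-1.0 / gamma))


def best_rate(n, C=1.0, c=1.0, gamma=1.0, lam=1.0):
    """Smallest eps_n compatible with the same inequality (requires n > C)."""
    return (math.log(n / C) / c) ** (-gamma) / lam


bits_at_tenth = min_bits(0.1)       # exp(10) with the default constants
bits_at_twentieth = min_bits(0.05)  # exp(20) with the default constants
```

With these constants, halving the target accuracy from $0.1$ to $0.05$ squares the required number of bits, which is the exponential dependence on $\epsilon^{-1}$ expressed by the proposition.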

A similar result can also be derived under the assumption of an algebraic scaling. This may be of relevance for generic function approximation by neural networks, and hence we mention it here, in passing.

Proposition 4.3 (Algebraic scaling).

Let $\mathsf{\bm{V}}$ be a Banach space. Let $\mathsf{\bm{A}}\subset\mathsf{\bm{V}}$ be a compact, convex subset. Assume that there exist constants $C,\gamma>0$ such that,

(4.9)

\displaystyle\log\mathcal{N}(\mathsf{\bm{A}};\epsilon)\geq C\epsilon^{-1/% \gamma},\quad\forall\,\epsilon>0.

Let $\{\Sigma_{n}\}_{n\in\mathbb{N}}$ be a family of subsets $\Sigma_{n}\subset\mathsf{\bm{V}}$ with $|\Sigma_{n}|\leq 2^{n}$ elements. Then generic elements $f\in\mathsf{\bm{A}}$ cannot be approximated by elements of $\Sigma_{n}$ at convergence rate better than $n^{-\gamma}$ ; more precisely, for any sequence $\epsilon_{n}=o(n^{-\gamma})$ , the subset $\mathsf{\bm{R}}\subset\mathsf{\bm{A}}$ , such that,

(4.10)

\displaystyle\inf_{\psi_{n}\in\Sigma_{n}}\|f-\psi_{n}\|\not=O(\epsilon_{n}),% \quad\forall\,f\in\mathsf{\bm{R}},

is residual.

Proof.

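Let $\mathsf{\bm{R}}:=\mathsf{\bm{A}}\setminus\mathsf{\bm{E}}$ , where $\mathsf{\bm{E}}$ is defined by (4.3). Recall that $\mathsf{\bm{E}}$ is precisely the set of $f\in\mathsf{\bm{A}}$ for which there exists $M_{f}>0$ such that

\inf_{\psi_{n}\in\Sigma_{n}}\|f-\psi_{n}\|\leq M_{f}\epsilon_{n}.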

In Lemma 4.1, it is shown that if $\mathsf{\bm{E}}_{M}\subset\mathsf{\bm{A}}$ has non-empty interior then there exists a constant $\lambda>0$ , such that

\log\mathcal{N}(\mathsf{\bm{A}};\lambda\epsilon_{n})\leq n,\quad\forall\,n\in% \mathbb{N}.

By assumption on $\mathsf{\bm{A}}$ , the left hand side is lower bounded by $C(\lambda\epsilon_{n})^{-1/\gamma}$ . Thus, if $\mathsf{\bm{E}}_{M}$ has non-empty interior, then we must have

C(\lambda\epsilon_{n})^{-1/\gamma}\leq n\quad\Rightarrow\quad\epsilon_{n}% \gtrsim n^{-\gamma},\;\text{as }n\to\infty.

Since $\epsilon_{n}=o(n^{-\gamma})$ by assumption, this is not the case. Thus, we conclude that $\mathsf{\bm{E}}_{M}\subset\mathsf{\bm{A}}$ has empty interior for any $M>0$ . Arguing as in the proof of Proposition 4.2, it follows that $\mathsf{\bm{E}}$ is meagre, and hence $\mathsf{\bm{R}}=\mathsf{\bm{A}}\setminus\mathsf{\bm{E}}$ is residual. ∎
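In the algebraic case the same inversion gives a polynomial rather than exponential bit count. The following sketch inverts $C(\lambda\epsilon_{n})^{-1/\gamma}\leq n$ ; the default constants ( $C=\lambda=1$ , $\gamma=1/2$ ) and the function names are hypothetical choices for illustration.

```python
def min_bits_algebraic(eps, C=1.0, gamma=0.5, lam=1.0):
    """Smallest n compatible with C * (lam * eps)**(-1/gamma) <= n."""
    return C * (lam * eps) ** (-1.0 / gamma)


def best_rate_algebraic(n, C=1.0, gamma=0.5, lam=1.0):
    """Smallest eps_n compatible with the same inequality."""
    return (n / C) ** (-gamma) / lam


bits_coarse = min_bits_algebraic(0.1)   # accuracy 0.1
bits_fine = min_bits_algebraic(0.05)    # accuracy 0.05
```

Here halving the target accuracy multiplies the required number of bits by $2^{1/\gamma}$ , a fixed factor, in contrast to the exponential scaling of Proposition 4.2.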

5. Conclusion

Operator learning is a new paradigm for the data-driven approximation of operators. Popular operator learning frameworks extend and generalize neural networks to this infinite-dimensional setting. While there are numerous papers demonstrating the potential and practical utility of proposed neural operator architectures, our understanding of the precise conditions under which operator learning is practically feasible remains limited.

This paper makes a contribution to the mathematical underpinnings of this field, by providing an information-theoretic perspective on the curse of parametric complexity (a scaling-limit of the curse of dimensionality) identified in [34]. In particular, it is shown that this curse poses a fundamental limitation to operator learning on general spaces of Lipschitz operators. Bit-encoding (storing in memory) any neural operator architecture, which is capable of achieving approximation accuracy $\epsilon$ for general $1$ -Lipschitz continuous and real-valued operators, requires a number of bits that is exponential in $\epsilon^{-1}$ . It is shown that this is true not only when measuring the approximation error in the sup-norm over compact sets of input functions, but also when measuring the error in the $L^{p}(\mu)$ -norm with respect to a probability measure satisfying certain structural assumptions. The assumptions are met for widely considered $\mu$ , including the case of a Gaussian random field with at most algebraically decreasing eigenvalues of the covariance. These results rely on minimax analysis and, in contrast to prior work [34], are independent of the employed activation function in the architecture.

Going beyond such minimax analysis, we furthermore study the approximation of individual Lipschitz operators by a sequence of neural operator architectures. Such a sequence would e.g. be obtained when increasing the width, depth or other hyperparameters at a pre-defined rate as the model is scaled up. In this setting, we address the following question: “At which rate can the approximation error along such a sequence decrease, as a function of the total number of bit-encoded parameters?” Using topological arguments based on Baire category, we establish a quantitative relation between the metric entropy of the set of $1$ -Lipschitz operators, and the best approximation-rate that can be achieved along such a sequence for generic $1$ -Lipschitz operators; as a consequence of the exponential increase in metric $\epsilon$ -entropy of the set of $1$ -Lipschitz operators, it is shown that achievable approximation rates are at most logarithmic as a function of the required encoding bits.

Finally, this abstract analysis leads to a concrete result on the approximation of generic Lipschitz operators by Fourier neural operators (FNO). Our results imply that for generic $1$ -Lipschitz operators, and under mild assumptions on the tunable parameters, there cannot exist a sequence of FNO approximations which approximates the underlying operator at a rate that decays faster than logarithmic in the number of real-valued parameters. To obtain this result, mild bounds on the growth of the parameters of FNO approximants are assumed; specifically, the size of individual parameters is assumed to be exponentially bounded by the total number of parameters, as the model size is scaled up.

The results of this work should be compared and contrasted with the recent work [50], which shows the surprising result that there exist (non-standard) neural operator architectures capable of approximating Lipschitz continuous operators to accuracy $\epsilon$ , with a number of real-valued tunable parameters $q$ growing only algebraically with $\epsilon^{-1}$ . The analysis of the present work indicates that a practical implementation of such architectures on computing hardware, with parameters encoded by a total of $B$ bits, will require $B$ to be exponentially large in $\epsilon^{-1}$ . In fact, if each parameter is encoded by $b_{1}$ bits, then a lower bound of the following form is to be expected:

qb_{1}\geq C\exp(c\epsilon^{-\gamma}),

for fixed constants $C,c,\gamma>0$ independent of $\epsilon$ . In particular, if $q\lesssim\epsilon^{-\lambda}$ grows at most algebraically, as in the construction of [50], then the number of encoding bits $b_{1}$ per parameter must necessarily grow exponentially. Thus, the only trade-off that appears possible from an information-theoretic perspective is to reduce the number of parameters $q$ at the expense of the required number of bits per parameter $b_{1}$ , or vice versa. In turn, the required number of encoding bits is intimately linked to the stability of the mapping $\theta\mapsto\Phi({\,\cdot\,};\theta)$ from parameters $\theta$ to the corresponding realization of the neural operator $\Phi({\,\cdot\,};\theta)$ ; an exponentially growing number of bits $b_{1}$ is only required if either the parameter-to-realization mapping is very unstable, e.g. having a very large Lipschitz constant, or the optimal parameters themselves are very large. Here, “large” means that either the Lipschitz constant or the $\ell^{\infty}$ -norm of the parameters grows exponentially with $\epsilon^{-1}$ .
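To get a feeling for this trade-off, the following minimal Python sketch evaluates the lower bound $qb_{1}\geq C\exp(c\epsilon^{-\gamma})$ numerically. The function names, the constant `K` in the assumed parameter budget $q\leq K\epsilon^{-\lambda}$ , and all default values are hypothetical placeholders chosen for illustration only; the analysis above does not provide numerical values for these constants.

```python
import math


def min_total_bits(eps, C=1.0, c=1.0, gamma=1.0):
    """Lower bound C * exp(c * eps**(-gamma)) on the total bit count q * b_1.

    C, c, gamma are placeholder constants (assumed values, not from the paper).
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    return C * math.exp(c * eps ** (-gamma))


def min_bits_per_parameter(eps, K=1.0, lam=1.0, C=1.0, c=1.0, gamma=1.0):
    """Lower bound on b_1 when q is at most K * eps**(-lam) (algebraic growth).

    K and lam describe a hypothetical algebraic parameter budget.
    """
    q_max = K * eps ** (-lam)
    return min_total_bits(eps, C, c, gamma) / q_max


def best_accuracy(total_bits, C=1.0, c=1.0, gamma=1.0):
    """Smallest eps compatible with a budget of total_bits under the same bound.

    Solving total_bits >= C * exp(c * eps**(-gamma)) for eps gives
    eps >= (c / log(total_bits / C))**(1 / gamma), a logarithmic rate.
    """
    if total_bits <= C:
        return math.inf  # the bound cannot be met for any eps > 0
    return (c / math.log(total_bits / C)) ** (1.0 / gamma)
```

With these placeholder constants, halving $\epsilon$ multiplies the admissible parameter count by a fixed factor $2^{\lambda}$ , whereas the required bits per parameter grow exponentially; `best_accuracy` makes the resulting logarithmic dependence of $\epsilon$ on the bit budget explicit.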

The results of this work underline the fundamental character of the curse of parametric complexity identified in [34] from the point of view of information theory. In addition, it is here shown that this curse persists even when the sup-norm (uniform approximation of the underlying operator) is replaced by an a priori much weaker $L^{p}$ -norm (approximation in expectation). This considerably constrains the generality with which approximation theory for operator learning, guaranteeing efficient approximation by neural operators at algebraic convergence rates, can be developed. A complete or partial characterization of the relevant mathematical properties and structures enabling efficient operator approximation, would be highly desirable. The results presented in this work demonstrate rigorously that one has to go beyond Lipschitz operators to achieve this.

Acknowledgments

The author would like to thank Andrew M. Stuart and Nikola B. Kovachki for interesting discussions which have led to this work. This work has been supported by funding from the Swiss National Science Foundation through Postdoc.Mobility grant P500PT-206737.

References

[1] E. M. Achour, A. Foucault, S. Gerchinovitz, and F. Malgouyres. A general approximation lower bound in $L^{p}$ norm, with applications to feed-forward neural networks. In Advances in Neural Information Processing Systems, 2022.
[2] B. Adcock, S. Brugiapaglia, N. Dexter, and S. Moraga. On efficient algorithms for computing near-best polynomial approximations to high-dimensional, Hilbert-valued functions from limited samples. arXiv preprint arXiv:2203.13908, 2022.
[3] B. Adcock, N. Dexter, and S. Moraga. Optimal approximation of infinite-dimensional holomorphic functions. arXiv preprint arXiv:2305.18642, 2023.
[4] B. Adcock, N. Dexter, and S. Moraga. Optimal approximation of infinite-dimensional holomorphic functions II: recovery from iid pointwise samples. arXiv preprint arXiv:2310.16940, 2023.
[5] J. A. L. Benitez, T. Furuya, F. Faucher, A. Kratsios, X. Tricoche, and M. V. de Hoop. Out-of-distributional risk bounds for neural operators with applications to the Helmholtz equation. arXiv preprint arXiv:2301.11509, 2023.
[6] K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart. Model reduction and neural networks for parametric PDEs. The SMAI Journal of Computational Mathematics, 7:121–157, 2021.
[7] M. S. Birman and M. Z. Solomyak. Approximation of functions of the $W_{p}^{\alpha}$ -classes by piece-wise-polynomial functions. Doklady Akademii Nauk, 171(5):1015–1018, 1966.
[8] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Memory-optimal neural network approximation. In Wavelets and Sparsity XVII, volume 10394, pages 157–168. SPIE, 2017.
[9] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45, 2019.
[10] J. Castro. The Kolmogorov infinite dimensional equation in a Hilbert space via deep learning methods. Journal of Mathematical Analysis and Applications, 527(2):127413, 2023.
[11] J. Castro, C. Muñoz, and N. Valenzuela. The Calderón’s problem via DeepONets. arXiv preprint arXiv:2212.08941, 2022.
[12] T. Chen and H. Chen. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks, 6(4):911–917, 1995.
[13] A. Cohen, R. DeVore, G. Petrova, and P. Wojtaszczyk. Optimal stable nonlinear approximation. Foundations of Computational Mathematics, 22(3):607–648, 2022.
[14] A. Cohen, R. DeVore, and C. Schwab. Convergence rates of best n-term Galerkin approximations for a class of elliptic SPDEs. Foundations of Computational Mathematics, 10(6):615–646, 2010.
[15] A. Cohen, R. DeVore, and C. Schwab. Analytic regularity and polynomial approximation of parametric and stochastic elliptic PDEs. Analysis and Applications, 9(01):11–47, 2011.
[16] S. Dahlke, F. De Mari, P. Grohs, and D. Labate. Harmonic and applied analysis. Appl. Numer. Harmon. Anal, 2015.
[17] B. Deng, Y. Shin, L. Lu, Z. Zhang, and G. E. Karniadakis. Convergence rate of DeepONets for learning operators arising from advection-diffusion equations. arXiv preprint arXiv:2102.10621, 2021.
[18] R. DeVore, B. Hanin, and G. Petrova. Neural network approximation. Acta Numerica, 30:327–444, 2021.
[19] D. Elbrächter, D. Perekrestenko, P. Grohs, and H. Bölcskei. Deep neural network approximation theory. IEEE Transactions on Information Theory, 67(5):2581–2623, 2021.
[20] N. R. Franco, S. Fresca, A. Manzoni, and P. Zunino. Approximation bounds for convolutional neural networks in operator learning. Neural Networks, 161:129–141, 2023.
[21] L. Galimberti, A. Kratsios, and G. Livieri. Designing universal causal deep learning models: The case of infinite-dimensional dynamical systems from stochastic analysis. arXiv preprint arXiv:2210.13300, 2022.
[22] L. Herrmann, C. Schwab, and J. Zech. Neural and GPC operator surrogates: Construction and expression rate bounds. arXiv preprint arXiv:2207.04950, 2022.
[23] N. Hua and W. Lu. Basis operator network: A neural network-based model for learning nonlinear operators via neural basis. Neural Networks, 164:21–37, 2023.
[24] D. Z. Huang, N. H. Nelsen, and M. Trautner. An operator learning perspective on parameter-to-observable maps. arXiv preprint arXiv:2402.06031, 2024.
[25] P. Jin, S. Meng, and L. Lu. MIONet: Learning multiple-input operators via tensor product. SIAM Journal on Scientific Computing, 44(6):A3490–A3514, 2022.
[26] A. N. Kolmogorov and V. M. Tikhomirov. $\epsilon$ -entropy and $\epsilon$ -capacity of sets in functional spaces. Amer. Math. Soc. Transl. Ser. 2, 17, 1961.
[27] Y. Korolev. Two-layer neural networks with values in a Banach space. SIAM Journal on Mathematical Analysis, 54(6):6358–6389, 2022.
[28] N. B. Kovachki, S. Lanthaler, and H. Mhaskar. Data complexity estimates for operator learning, 2024.
[29] N. B. Kovachki, S. Lanthaler, and S. Mishra. On universal approximation and error bounds for Fourier neural operators. Journal of Machine Learning Research, 22(1), 2021.
[30] N. B. Kovachki, S. Lanthaler, and A. M. Stuart. Operator learning: Algorithms and analysis. In Numerical Analysis meets Machine Learning, Handbook of Numerical Analysis. Elsevier, 2024.
[31] N. B. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar. Neural operator: Learning maps between function spaces with applications to PDEs. Journal of Machine Learning Research, 24(89), 2023.
[32] A. Kratsios, T. Furuya, J. A. L. Benitez, M. Lassas, and M. de Hoop. Mixture of experts soften the curse of dimensionality in operator learning, 2024.
[33] G. Kutyniok, P. Petersen, M. Raslan, and R. Schneider. A theoretical analysis of deep neural networks and parametric PDEs. Constructive Approximation, 55(1):73–125, 2022.
[34] S. Lanthaler. Operator learning with PCA-Net: Upper and lower complexity bounds. Journal of Machine Learning Research, 24(318), 2023.
[35] S. Lanthaler, Z. Li, and A. M. Stuart. The nonlocal neural operator: Universal approximation. arXiv preprint arXiv:2304.13221, 2023.
[36] S. Lanthaler, S. Mishra, and G. E. Karniadakis. Error estimates for DeepONets: A deep learning framework in infinite dimensions. Transactions of Mathematics and Its Applications, 6(1), 2022.
[37] S. Lanthaler and N. H. Nelsen. Error bounds for learning with vector-valued random features. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[38] Z. Lei, L. Shi, and C. Zeng. Solving parametric partial differential equations with deep rectified quadratic unit neural networks. Journal of Scientific Computing, 93(3):80, 2022.
[39] Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. M. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. In Ninth International Conference on Learning Representations, 2021.
[40] H. Liu, H. Yang, M. Chen, T. Zhao, and W. Liao. Deep nonparametric estimation of operators between infinite dimensional spaces. Journal of Machine Learning Research, 25(24):1–67, 2024.
[41] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021.
[42] V. Maiorov and A. Pinkus. Lower bounds for approximation by MLP neural networks. Neurocomputing, 25(1):81–91, 1999.
[43] C. Marcati and C. Schwab. Exponential convergence of deep operator networks for elliptic partial differential equations. SIAM Journal on Numerical Analysis, 61(3):1513–1545, 2023.
[44] H. N. Mhaskar and N. Hahm. Neural networks for functional approximation and system identification. Neural Computation, 9(1):143–159, 1997.
[45] J. Munkres. Topology. Pearson Education Limited, 2nd edition, 2014.
[46] N. H. Nelsen and A. M. Stuart. The random feature model for input-output maps between Banach spaces. SIAM Journal on Scientific Computing, 43(5):A3212–A3243, 2021.
[47] J. A. Opschoor, C. Schwab, and J. Zech. Exponential ReLU DNN expression of holomorphic maps in high dimension. Constructive Approximation, 55(1):537–582, 2022.
[48] D. Patel, D. Ray, M. R. Abdelmalik, T. J. Hughes, and A. A. Oberai. Variationally mimetic operator networks. Computer Methods in Applied Mechanics and Engineering, 419:116536, 2024.
[49] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
[50] C. Schwab, A. Stein, and J. Zech. Deep operator network approximation rates for Lipschitz operators. arXiv preprint arXiv:2307.09835, 2023.
[51] C. Schwab and J. Zech. Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in UQ. Analysis and Applications, 17(01):19–55, 2019.
[52] C. Schwab and J. Zech. Deep learning in high dimension: Neural network expression rates for analytic functions in ${L}^{2}(\mathbb{R}^{d},\gamma_{d})$ . SIAM/ASA Journal on Uncertainty Quantification, 11(1):199–234, 2023.
[53] F. Voigtlaender and P. Petersen. Approximation in $L^{p}(\mu)$ with deep ReLU neural networks. In 2019 13th International conference on Sampling Theory and Applications (SampTA), pages 1–4. IEEE, 2019.
[54] Z. Zhang, L. Tat, and H. Schaeffer. BelNet: Basis enhanced learning, a mesh-free neural operator. Proceedings of the Royal Society A, 479, 2023.

Appendix A A short summary of Baire category

In this appendix, we recall the Baire category theorem from general topology. For a more thorough discussion of this result, and its connections to other topological concepts, we refer to the textbook [45, Chap. 8].

Let $X$ be a topological space. Let $A\subset X$ be a subset. We recall that the interior of $A$ is defined as the union of all open sets of $X$ that are contained in $A$ . The set $A$ is said to have empty interior if $A$ contains no open set of $X$ other than the empty set. Equivalently, $A$ is said to have empty interior if the complement of $A$ is dense in $X$ . We then have the following definition [45, Chap. 8, p. 293]:

Definition A.1.

A space $X$ is said to be a Baire space if the following condition holds: Given any countable collection $\{A_{n}\}$ of closed sets of $X$ each of which has empty interior in $X$ , their union $\bigcup_{n}A_{n}$ also has empty interior in $X$ .

This definition can equivalently be stated in terms of open sets [45, Lemma 48.1]:

Lemma A.2.

$X$ is a Baire space if and only if given any countable collection $\{U_{n}\}$ of open sets in $X$ , each of which is dense in $X$ , their intersection $\bigcap_{n}U_{n}$ is also dense in $X$ .

The following Baire category theorem [45, Thm. 48.2] exposes many examples of Baire spaces encountered in applications:

Theorem A.3 (Baire category theorem).

If $X$ is a compact Hausdorff space or a complete metric space, then $X$ is a Baire space.

Appendix B Proof of the quantization lemma

The goal of this appendix is to prove the FNO quantization lemma, Lemma 2.21, stated in Section 2.6.

Proof of Lemma 2.21.

Let $\Phi_{q}$ be an output-averaged FNO with at most $q$ tunable parameters. We first note that the depth of $\Phi_{q}$ can only take the values $L\in\{1,\dots,q\}$ . For each possible value of the depth, we now consider the maximally connected output-averaged FNO architecture $\widehat{\Phi}_{q}^{(L)}$ of depth $L$ , obtained by setting $\kappa,{d_{c}}=q$ in each layer. This maximally connected FNO architecture has at most

\widehat{q}^{(L)}\leq 5(2\kappa)^{d}L{d_{c}}^{2}\leq 5\cdot 2^{d}q^{d+3},

tunable parameters. For later reference, we note that

Observation 1: Any output-averaged FNO $\Phi_{q}({\,\cdot\,};\theta)$ with depth $L$ and at most $q$ parameters can be represented by a specific choice of the weights of $\widehat{\Phi}_{q}^{(L)}({\,\cdot\,};\widehat{\theta})$ . In fact, this only requires zero-padding $\theta$ to obtain $\widehat{\theta}$ .

Our main goal is to suitably quantize $\widehat{\Phi}_{q}^{(L)}$ , and then define a quantized neural operator architecture $\widetilde{\Phi}_{n_{q}}$ with $n_{q}$ bits which can represent all quantized $\widehat{\Phi}_{q}^{(L)}$ for $L=1,\dots,q$ by specific setting of its bitwise-encoded parameters.

It follows from [28, Proposition D.15], with a minimal extension to allow for $\sigma(0)\neq 0$ , that the Lipschitz constant of the mapping,

R_{q}^{(L)}:\;\left\{\begin{aligned} [-M_{q},M_{q}]^{\widehat{q}}&\to C(\mathcal{K}),\\ \theta&\mapsto\widehat{\Phi}_{q}^{(L)}({\,\cdot\,};\theta),\end{aligned}\right.

and with $[-M_{q},M_{q}]^{\widehat{q}}$ metrized by the $\ell^{\infty}$ -norm, can be bounded by

\mathrm{Lip}(R_{q}^{(L)})\leq(L+2)(2{d_{c}}M_{q})^{L+2}\left(C+(2\kappa)^{d/2}\right).

Here, $C>0$ is a constant depending only on $d$ and $\mathcal{K}$ . In particular, there exists a (larger) constant $C=C(d,\mathcal{K})$ , such that

\mathrm{Lip}(R_{q}^{(L)})\leq(Cq)^{Cq}=\exp(Cq\log(Cq)).

We quantize $\widehat{\Phi}_{q}^{(L)}$ for $\theta\in[-M_{q},M_{q}]^{\widehat{q}}$ by subdividing each coordinate direction by equidistant points of separation $\sim\log(q)^{-\gamma}/\exp(Cq\log(Cq))$ . Denote the resulting discrete set of points by $\Theta^{(L)}_{q}\subset\mathbb{R}^{\widehat{q}}$ . We note that this subdivision requires at most,

O\left(\,\big{\{}M_{q}\log(q)^{\gamma}\exp(Cq\log(Cq))\big{\}}^{\widehat{q}}\,\right)

many quantization points, which can be encoded by

O\Big{(}\,\widehat{q}\,\log\big{(}\,M_{q}\log(q)^{\gamma}\exp(Cq\log(Cq))\,\big{)}\,\Big{)}

many bits. Since $\widehat{q}=O(q^{d+3})$ , $\log(M_{q}\log(q)^{\gamma})=O(q)$ and $\log(\exp(Cq\log(Cq)))=O(q^{2})$ , it follows that the number of required bits is

O\left(q^{d+6}\right),

i.e. $\log_{2}|\Theta^{(L)}_{q}|=O(q^{d+6})$ . The implied constant here is independent of $L$ . In the following, we denote $m:=d+6$ . In particular, we conclude that there exists a constant $C>0$ , independent of $q$ , such that

\max_{L=1,\dots,q}\log_{2}|\Theta^{(L)}_{q}|\leq Cq^{m}.
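The exponent in this bit count can be traced through the three estimates quoted above. The following is only a sketch: constants are not tracked, and the change of base of the logarithm is absorbed into the implied constants.

```latex
\begin{aligned}
\log_{2}|\Theta^{(L)}_{q}|
&\lesssim \widehat{q}\,\Big[\log\big(M_{q}\log(q)^{\gamma}\big)+Cq\log(Cq)\Big]
\\
&\lesssim q^{d+3}\,\big[\,q+q^{2}\,\big]
\\
&\lesssim q^{d+5}\;\leq\;q^{d+6}.
\end{aligned}
```

Under these estimates, the exponent $m=d+6$ used in the sequel is a safe upper bound, uniformly in the depth $L\in\{1,\dots,q\}$ .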

We also note that, by construction, for any $\theta\in[-M_{q},M_{q}]^{\widehat{q}}$ , there exists $\theta^{\prime}\in\Theta^{(L)}_{q}$ , such that

\|\theta-\theta^{\prime}\|_{\ell^{\infty}}\leq\frac{\log(q)^{-\gamma}}{\exp(Cq\log(Cq))}.

It follows that for any $\theta\in[-M_{q},M_{q}]^{\widehat{q}}$ , there exists $\theta^{\prime}\in\Theta^{(L)}_{q}$ , such that

\begin{aligned}
\|\widehat{\Phi}^{(L)}_{q}({\,\cdot\,};\theta)-\widehat{\Phi}^{(L)}_{q}({\,\cdot\,};\theta^{\prime})\|_{C(\mathcal{K})}
&\leq\mathrm{Lip}(R_{q}^{(L)})\,\|\theta-\theta^{\prime}\|_{\ell^{\infty}}\\
&\leq\exp(Cq\log(Cq))\frac{\log(q)^{-\gamma}}{\exp(Cq\log(Cq))}\\
&=\log(q)^{-\gamma}.
\end{aligned}

Thus,

(B.1)

\displaystyle\sup_{\theta\in[-M_{q},M_{q}]^{\widehat{q}}}\min_{\theta^{\prime}\in\Theta^{(L)}_{q}}\|\widehat{\Phi}^{(L)}_{q}({\,\cdot\,};\theta)-\widehat{\Phi}^{(L)}_{q}({\,\cdot\,};\theta^{\prime})\|_{C(\mathcal{K})}\leq\log(q)^{-\gamma}.
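The quantization step used above, rounding each coordinate of $\theta\in[-M_{q},M_{q}]^{\widehat{q}}$ to a uniform grid and counting the bits needed to index the grid, can be illustrated by a minimal Python sketch. The function names `quantize_uniform` and `bits_for_grid` and the example values are hypothetical; the sketch only demonstrates the $\ell^{\infty}$ rounding bound and the bit count, and does not implement an FNO.

```python
import math


def quantize_uniform(theta, M, delta):
    """Round each coordinate of theta in [-M, M] to a uniform grid.

    The grid has spacing at most delta, so the rounding error in the
    sup-norm is at most delta / 2. Returns the grid indices, the rounded
    values, and the number of grid points per coordinate.
    """
    if M <= 0 or delta <= 0:
        raise ValueError("M and delta must be positive")
    intervals = math.ceil(2 * M / delta)  # number of grid cells per coordinate
    step = 2 * M / intervals              # actual spacing, step <= delta
    indices, quantized = [], []
    for t in theta:
        k = round((t + M) / step)         # nearest grid index
        k = min(max(k, 0), intervals)     # clamp to the grid
        indices.append(k)
        quantized.append(-M + k * step)
    return indices, quantized, intervals + 1


def bits_for_grid(n_params, points_per_coordinate):
    """Bits needed to store one grid index per parameter."""
    bits_per_coordinate = max(1, (points_per_coordinate - 1).bit_length())
    return n_params * bits_per_coordinate
```

If the parameter-to-realization map has Lipschitz constant $\mathrm{Lip}$ with respect to the $\ell^{\infty}$ -norm, choosing `delta` of the order of the target accuracy divided by $\mathrm{Lip}$ keeps the induced error of the realization below that accuracy, which is the choice of grid separation made in the proof.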

Since $\log_{2}|\Theta^{(L)}_{q}|\leq Cq^{m}$ , any $\theta^{\prime}\in\Theta^{(L)}_{q}$ can be identified with a unique bit-string in $\{0,1\}^{\ell_{q}}$ , where $\ell_{q}=\lceil Cq^{m}\rceil$ . Adding $O(\log(q))$ further bits to encode the possible values of the depth parameter $L\in\{1,\dots,q\}$ , we can now define a quantized neural operator $\widetilde{\Phi}_{n_{q}}:L^{2}(D)\times\{0,1\}^{n_{q}}\to\mathbb{R}$ encoded by $n_{q}\sim\log(q)+\ell_{q}\sim Cq^{m}$ bits, in the following way: Given $[\theta]\in\{0,1\}^{n_{q}}$ , we first read off the depth parameter $L$ from the first $\lceil\log_{2}q\rceil$ bits. Removing these bits, the remaining $\ell_{q}$ bits uniquely identify $\theta^{\prime}\in\Theta_{q}^{(L)}$ , and we set

(B.2)

\displaystyle\widetilde{\Phi}_{n_{q}}({\,\cdot\,};[\theta]):=\widehat{\Phi}^{(L)}_{q}({\,\cdot\,};\theta^{\prime}).
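The decoding convention in (B.2), a depth prefix followed by the index of a quantization point, can be sketched in Python as follows. This is an illustration under assumptions: the names `encode` and `decode` are hypothetical, the quantization point $\theta^{\prime}$ is represented only by an integer index in $\{0,\dots,2^{\ell_{q}}-1\}$ , and one prefix bit is used even when $q=1$ .

```python
def _depth_bits(q):
    """Number of prefix bits needed to store a depth L in {1, ..., q}."""
    return max(1, (q - 1).bit_length())


def encode(depth, index, q, ell):
    """Pack (depth, index) into a bit-string of length _depth_bits(q) + ell."""
    if ell < 1:
        raise ValueError("ell must be at least 1")
    if not 1 <= depth <= q:
        raise ValueError("depth must lie in {1, ..., q}")
    if not 0 <= index < 2 ** ell:
        raise ValueError("index must fit into ell bits")
    prefix = format(depth - 1, f"0{_depth_bits(q)}b")  # depth stored as L - 1
    payload = format(index, f"0{ell}b")                # index of theta'
    return prefix + payload


def decode(bits, q, ell):
    """Read off the depth from the prefix, then the index from the remainder."""
    n_prefix = _depth_bits(q)
    if len(bits) != n_prefix + ell:
        raise ValueError("unexpected bit-string length")
    depth = int(bits[:n_prefix], 2) + 1
    index = int(bits[n_prefix:], 2)
    return depth, index
```

For example, `decode(encode(3, 5, q=8, ell=4), q=8, ell=4)` recovers the pair `(3, 5)`. The total length of the bit-string is the prefix length plus $\ell_{q}$ , in line with $n_{q}\sim\log(q)+\ell_{q}$ .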

Thus, $\widetilde{\Phi}_{n_{q}}$ is a neural operator architecture with parameters encoded by $n_{q}\asymp q^{m}$ bits. By our definition (B.2), any neural operator belonging to the set

{\left\{\widehat{\Phi}^{(L)}_{q}({\,\cdot\,};\theta^{\prime})\,\middle|\,L\in\{1,\dots,q\},\;\theta^{\prime}\in\Theta^{(L)}_{q}\right\}},

can be represented exactly by suitable choice of $[\theta]\in\{0,1\}^{n_{q}}$ . And thus, by (B.1), we have

(B.3)

\displaystyle\sup_{L=1,\dots,q}\sup_{\theta\in[-M_{q},M_{q}]^{\widehat{q}}}\min_{[\theta]\in\{0,1\}^{n_{q}}}\|\widehat{\Phi}^{(L)}_{q}({\,\cdot\,};\theta)-\widetilde{\Phi}_{n_{q}}({\,\cdot\,};[\theta])\|_{C(\mathcal{K})}\leq\log(q)^{-\gamma}.

We finally note that any neural operator architecture $\Phi_{q}$ with at most $q$ parameters is represented as $\Phi_{q}({\,\cdot\,};\theta)=\widehat{\Phi}_{q}^{(L)}({\,\cdot\,};\widehat{\theta})$ for suitably chosen $\widehat{\theta}=\widehat{\theta}(\theta)$ (see Observation 1, above). In fact, this only involves zero-padding of the weights $\theta$ . In particular, if $\theta\in[-M_{q},M_{q}]^{q}$ , then $\widehat{\theta}\in[-M_{q},M_{q}]^{\widehat{q}}$ .

From (B.3), it follows that

(B.4)

\displaystyle\sup_{\theta\in[-M_{q},M_{q}]^{q}}\min_{[\theta]\in\{0,1\}^{n_{q}}}\|\Phi_{q}({\,\cdot\,};\theta)-\widetilde{\Phi}_{n_{q}}({\,\cdot\,};[\theta])\|_{C(\mathcal{K})}\leq\log(q)^{-\gamma},

as claimed. This concludes the proof. ∎

\begin{aligned}
\mathbb{E}_{u\sim\mu}\|(\iota_{d}f)(u)\|^{p}
&=\mathbb{E}\|f(F_{1}(Z_{1}),\dots,F_{d}(Z_{d}))\|^{p}\\
&=\int_{[0,1]^{d}}\|f(x_{1},\dots,x_{d})\|^{p}\,dx\\
&=\|f\|_{L^{p}([0,1]^{d})}^{p}.
\end{aligned}
