Operator Learning of Lipschitz Operators:
An Information-Theoretic Perspective
Abstract.
Operator learning based on neural operators has emerged as a promising paradigm for the data-driven approximation of operators, mapping between infinite-dimensional Banach spaces. Despite significant empirical progress, our theoretical understanding regarding the efficiency of these approximations remains incomplete. This work addresses the parametric complexity of neural operator approximations for the general class of Lipschitz continuous operators. Motivated by recent findings on the limitations of specific architectures, termed the curse of parametric complexity, we here adopt an information-theoretic perspective. Our main contribution establishes lower bounds on the metric entropy of Lipschitz operators in two approximation settings: uniform approximation over a compact set of input functions, and approximation in expectation, with input functions drawn from a probability measure. It is shown that these entropy bounds imply that, regardless of the activation function used, neural operator architectures attaining an approximation accuracy $\epsilon$ must have a size that is exponentially large in $\epsilon^{-1}$. The size of architectures is here measured by counting the number of encoded bits necessary to store the given model in computational memory. The results of this work elucidate fundamental trade-offs and limitations in operator learning.
1. Introduction
Operators mapping between infinite-dimensional Banach spaces of functions are ubiquitous in the natural sciences and engineering. They often appear in connection with physical models expressed as a set of partial differential equations, where operators of interest frequently arise from associated forward and inverse problems, e.g. mapping initial data to the solution at a later time, or identifying external forcing terms from (partial) knowledge of the solution.
Operator learning has emerged as a new paradigm for the data-driven approximation of such operators. Popular operator learning frameworks build on the success of neural networks, but generalize this notion to the infinite-dimensional context of operator approximation, resulting in so-called neural operators. These neural operator architectures define parametric operators, whose parameters are tuned to approximate an underlying operator of interest.
While there is a very rapidly growing body of empirical work demonstrating the great potential, and practical utility, of such data-driven approaches, many open questions remain in our understanding of the theoretical underpinnings of this field, see e.g. [30] for a recent review and references therein.
First theoretical insights into specific architectures, and their underlying approximation mechanisms, can be gained by studying universal approximation, i.e. the ability to approximate very general classes of operators. The study of universal approximation of neural operators dates back at least three decades, to early work on operator networks by Chen and Chen [12]. Due to the recent rise in the popularity of operator learning and the introduction of a number of novel state-of-the-art frameworks, this early work has been complemented by a number of papers in recent years, demonstrating similar universal approximation properties for various architectures; e.g. DeepONets [41, 36], PCA-Net [6, 34], Fourier neural operator [29] and general neural operators [31, 35], as well as multiple other architectures [54, 25, 23, 10, 11].
Universal approximation implies that there are no fundamental obstructions to operator learning with a given framework, and usually requires identification of basic approximation mechanisms that can be leveraged by a given architecture. However, to determine whether operator learning can be achieved efficiently, a refined quantitative analysis is required. In such quantitative analysis, one often distinguishes between parametric complexity, relating the required model size to the achieved accuracy, and sample efficiency, relating the number of required training samples to the achieved accuracy. The focus of the present work is on parametric complexity. For research relevant to the data complexity of operator learning, we mention, for example, [3, 4, 28, 44, 5].
A general class of operators for which efficient approximation is possible, in terms of the required number of tunable parameters, are so-called holomorphic operators. Research into the approximation of holomorphic operators goes back to the seminal work of Cohen, DeVore and Schwab [14, 15], where it was shown that this class of operators can be efficiently approximated by generalized polynomial expansions. More recently, these results have been extended to neural network and neural operator approximation in a series of works [51, 47, 52, 22, 2, 43], demonstrating that similar rates can be achieved by neural operators.
Other classes of operators for which efficient convergence rates have been derived are operator Barron spaces [27] and (operator) reproducing kernel Hilbert spaces (RKHS) [37, 46]. Alternative settings, such as parametric PDEs with low-dimensional latent structure, are explored, for example, in [33, 38, 20].
Apart from these specific classes of operators, efficient approximation has also been established via a case-by-case analysis for several PDE solution operators [17, 29, 36, 34, 43, 21]. These results identify a number of individual operators of interest which can be efficiently approximated by certain operator learning frameworks. Despite this progress, a general theory encompassing all these examples has yet to emerge.
A very general class of operators of interest are Lipschitz operators. Approximation theory of relevance to such a general class of operators has been developed e.g. in [40, 21, 50, 48, 32]. All of these works aim to bound the number of tunable parameters (model size) in terms of the accuracy that can be achieved.
The present work will focus on deriving lower complexity bounds for the class of Lipschitz continuous operators, defined on an infinite-dimensional domain and taking values in $\mathbb{R}$ (nonlinear Lipschitz functionals). Semantically, no distinction will be made between ‘functional’ and ‘operator’, since all lower bounds established for functionals continue to hold when considering operators with infinite-dimensional output spaces – the latter containing (infinitely many) copies of $\mathbb{R}$.
In addition to the aforementioned literature on neural operator approximation theory, the present work also takes inspiration from the information-theoretic point of view on neural network approximation theory in a finite-dimensional setting, pioneered in the works [8, 9, 49, 53, 19], as well as notions of stable approximation [18, 13]. In the present work, the underlying ideas will be applied and extended to the infinite-dimensional context of operator learning.
The main motivation for this work comes from two recent results, established in [34] and [50] respectively, both applicable to the general setting of Lipschitz operators. A one-paragraph summary of the results in [34] and [50] is as follows:
(i) The first result [34] shows that certain neural operator architectures, based on ReLU activations, suffer from a curse of parametric complexity: under certain assumptions on the input functions, there exist Lipschitz continuous operators which can only be approximated to accuracy $\epsilon$ if the number of tunable parameters is exponential in $\epsilon^{-1}$; more precisely, the number of parameters must be at least as large as with problem-dependent constants .
(ii) The second result in [50] shows that, under similar assumptions on the input functions, neural operator architectures based on super-expressive activation functions can approximate general Lipschitz operators to accuracy $\epsilon$, with algebraically bounded parameter count; the number of parameters is upper bounded by , for problem-dependent .
While the first result, viewed in isolation, appears to hint at fundamental limitations to the development of operator learning theory on the general class of Lipschitz operators, due to the identified “curse”, the second result shows rigorously that this curse can be circumvented with a suitable choice of activation.
The aim of the present work is to examine the apparent dichotomy between these two results in detail. To this end, we explore the curse of parametric complexity from an information-theoretic perspective. As a result, we will uncover the fundamental information-theoretic character of the curse of parametric complexity, and identify the relevant trade-offs that are possible when parametric complexity is measured by the number of (real-valued) parameters as in [34, 50].
Main contributions
This work makes the following main contributions:
• We propose an information-theoretic perspective on operator learning, based on the relation between bit-encoding and Kolmogorov metric entropy; this provides an alternative to the prevalent analysis in the literature, which has focused on estimating the required number of real-valued parameters.
• For the model class of Lipschitz operators, we derive lower bounds on the metric entropy in two settings: one pertaining to uniform approximation, the other to approximation in expectation.
• These bounds imply, in either setting, that an exponentially large number of encoding bits is required to store the weights of any architecture achieving accuracy $\epsilon$ on the model class. This result holds independently of the activation function that is chosen.
• We use topological arguments to show that even generic operators can only be approximated with exponentially increasing complexity; when applied to FNO this implies that the approximation of a generic Lipschitz operator, to accuracy $\epsilon$, requires a number of tunable parameters exponential in $\epsilon^{-1}$.
Overview
The remainder of this paper is organized as follows. In Section 2, we state the main results of this work, as they pertain to operator learning with neural operator architectures. This section contains the main conceptual contributions of this work and reviews the link between bit-encoding and Kolmogorov entropy. Several technical details are left to Sections 3 and 4; in Section 3, we derive lower bounds on the Kolmogorov metric entropy of the set of -Lipschitz operators in both a sup-norm and -norm approximation setting. In particular, we show that the metric -entropy increases exponentially with , implying a general curse of parametric complexity for bit-encoded architectures. This is the first main technical contribution of this work. Approximation rates for generic operators are the subject of Section 4, where we first formulate the operator approximation problem in an abstract Banach space setting, and then use topological arguments to relate approximation rates of generic elements of a model class to the metric entropy of this class. This is the second main technical contribution of this work. Finally, Section 5 contains concluding remarks.
2. Main Results
This section contains a summary of the main results of this work, applied to the specific setting of operator learning. Several of these results are based on more general, abstract propositions which are included in subsequent Sections 3 and 4. To aid readability, we leave most technical details to these latter sections. The aim of this section is instead to explain the main ideas underlying our analysis, and their implications for operator learning. Recurring notation, to be introduced and discussed in the following, is summarized in Table 1.
Notation | Meaning |
---|---|
 | Nonlinear operator with domain |
, | (Input/output) Banach spaces |
 | Compact subset of inputs |
 | Operator domain, or |
 | Probability measure on |
 | Real-valued -Lipschitz operators, |
 | Banach space of operators, or |
 | Compact subset of , e.g. |
2.1. Operator approximation by neural operators
We begin the discussion of our main results by proposing an encoder-decoder point of view on operator learning, where the encoder and decoder are implicitly defined by a given architecture. We then define approximation errors of interest and discuss two common measures to quantify the “complexity” of a given architecture. The first counts the number of tunable, real-valued parameters in the architecture. The second goes one step further, and requires specification of a bit-encoding of all parameters, i.e. encoding by a sequence of 0’s and 1’s. To fix intuition, this bit-encoding can be loosely interpreted as the representation of the parameters on computing hardware. The complexity of a bitwise-encoded architecture is measured by the number of bits required to represent it. As will be explained, this provides a link to fundamental information-theoretic concepts such as the Kolmogorov metric entropy of our model class.
2.1.1. Approximation theoretic setting
Assume we are given input and output spaces , . A neural operator defines a parametrized mapping , where are tunable parameters. Specification of defines an operator, . In practice, the training of a neural operator results in an optimized parameter choice for given and an approximation .
Model class
In the following, we will consider a model class of -Lipschitz operators, restricting attention to the case of real-valued outputs, :
Definition 2.1 (Model class ).
Let be a metric space. We define as the set consisting of all -Lipschitz continuous mappings with , where we define the -norm as follows:
(2.1) |
As described in the introduction, the goal of operator learning is to approximate by a neural operator . In this work, we aim to relate the approximation accuracy to the required model size of . We will focus on two settings, where either (i) is a compact subset of a Banach space and the metric is the sup-norm over , or (ii) is a Banach space and the metric is induced by the -norm with respect to a probability measure on (cp. Table 2).
Approximation spaces and norms
To measure the approximation accuracy of this approximation task, we have to define a distance between operators. To this end, we will consider a Banach space of operators , allowing for an embedding . Throughout, we will consider one of the following two settings. In the first setting, we aim to approximate over a compact domain :
Setting 2.2 (Uniform approximation).
If is an operator with compact domain , we will study its uniform approximation over , i.e. we take to be the space of continuous operators, metrized by the sup-norm:
(2.2) |
A common special case of this setting is the case where is defined by a smoothness constraint, as illustrated by the following example:
Example 2.3.
Let be a bounded domain. An example of the setting above is the case of Lipschitz operators , with
a set defined by a Sobolev smoothness constraint for . Here, .
In the second setting, we aim to approximate over the entire Banach space , but with respect to a (Bochner) -norm:
Setting 2.4 (Approximation in expectation).
If is an operator with unbounded domain a separable Banach space, then we will assume that inputs are drawn at random from a probability measure . In this case, we fix and take as the space of -measurable operators with finite -th norm. is metrized by the Bochner -norm,
(2.3) |
Operator domain | Operator class | Approximation space | Norm |
---|---|---|---|
compact | | | sup-norm |
Banach | | | $L^p$-norm |
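For concreteness, the two norms of Settings 2.2 and 2.4 can be written out as follows. This is a sketch in assumed notation: the symbols $\mathcal{G}$ (operator), $K$ (compact set of inputs), $\mu$ (probability measure on the input space) and the range $1 \le p < \infty$ are placeholders chosen here for illustration.

```latex
% Assumed notation: G is a real-valued operator, K a compact set of inputs,
% mu a probability measure on the input space, and 1 <= p < infinity.
\[
  \|\mathcal{G}\|_{C(K)} = \sup_{u \in K} |\mathcal{G}(u)|,
  \qquad
  \|\mathcal{G}\|_{L^p(\mu)} = \Big( \mathbb{E}_{u \sim \mu} \, |\mathcal{G}(u)|^p \Big)^{1/p}.
\]
```

Since $\mu$ is a probability measure, the second norm is bounded by the first whenever $\mu$ is supported in $K$.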
Measures of complexity: Counting parameters versus bits
We will distinguish two ways of measuring the “complexity” of neural operator : one based on the number of tunable (real-valued) parameters, the other requiring bit-encoding (or quantization) of the parameters.
A first intuitive notion of complexity is the minimal number of tunable parameters required to reach approximation accuracy $\epsilon$, i.e. the parameter dimension of a neural operator . As mentioned in the introduction, this point of view has been prevalent in the development of approximation theory for operator learning. As explained previously, depending on the type of activation function that is used, vastly different conclusions can be reached with this definition of complexity. This fact is well-known in the finite-dimensional setting: for example, it has been shown [42] that there exist smooth, sigmoidal activation functions for which a neural network of fixed size can approximate an arbitrary continuous function to arbitrary accuracy, i.e. approximation accuracy $\epsilon$ can be reached with a number of parameters $O(1)$.
In practical implementations, real-valued parameters can only be digitally represented to finite accuracy. This observation has led a number of authors [8, 9, 49, 53, 19] to analyze neural network approximation from a bit-encoding perspective. In this approach, the continuous parameters are replaced by quantized parameters , where is a finite set. If the number of elements is bounded, say for some , then we can identify , i.e. each element in the set is encoded by a string of bits. Taking this information-theoretic point of view, it is possible to derive (lower) complexity bounds that are independent of the activation function.
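The quantization step described above can be illustrated by a minimal sketch. It assumes a uniform grid on a bounded interval with a fixed number of bits per parameter; the function names `quantize`, `dequantize` and `to_bitstring`, the bound and the bit budget are hypothetical choices for illustration and are not taken from the cited works.

```python
import numpy as np

def quantize(theta, bits_per_param, bound):
    """Round each parameter to a uniform grid on [-bound, bound]; return integer codes."""
    levels = 2 ** bits_per_param
    step = 2.0 * bound / (levels - 1)
    clipped = np.clip(theta, -bound, bound)
    codes = np.rint((clipped + bound) / step).astype(np.int64)
    return codes, step

def dequantize(codes, step, bound):
    """Map integer codes back to grid points in [-bound, bound]."""
    return codes * step - bound

def to_bitstring(codes, bits_per_param):
    """Concatenate the fixed-width binary representation of all codes."""
    return "".join(format(int(c), f"0{bits_per_param}b") for c in codes)

rng = np.random.default_rng(0)
theta = rng.uniform(-1.0, 1.0, size=5)            # hypothetical real-valued parameters
codes, step = quantize(theta, bits_per_param=8, bound=1.0)
bitstring = to_bitstring(codes, 8)                # 5 parameters * 8 bits = 40 bits
theta_hat = dequantize(codes, step, 1.0)
max_error = float(np.max(np.abs(theta - theta_hat)))
```

With this scheme the total code length is the number of parameters times the bits per parameter, and the rounding error per parameter is at most half the grid spacing.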
2.2. Encoder-decoder view of neural operators
Given the discussion of the last paragraph, we now outline an encoder-decoder point of view on neural operators, emphasizing the difference between “counting parameters” and “counting (encoding) bits”.
Counting parameters
Let be a neural operator architecture. To explain our intuition, we temporarily assume the existence of, and fix an optimal parameter choice for each , so that
(2.4) |
with respect to the relevant norm of interest on the space of operators . The corresponding encoder is then given by
(2.5) |
The corresponding decoder is
(2.6) |
In this way, the operator learning architecture induces a natural encoder/decoder pair on the relevant space of operators, and we are interested in bounds on the encoding error, either for individual , i.e.
(2.7) |
or in a minimax sense, i.e.
(2.8) |
Given a desired approximation accuracy , either in the sense (2.7) or (2.8), one quantity of interest is the required “complexity” of any architecture achieving this accuracy. The above point of view is consistent with estimates on the required number of parameters .
Counting bits
As discussed before, the number of parameters is not a suitable measure of complexity when results independent of the activation are sought. Therefore, we now assume that the parameters are encoded by bits. This defines a subset consisting of elements. Each is in correspondence with its bit-encoding . Thus, upon associating with any the optimal , the continuum encoder (2.5) is now replaced by a bitwise-encoder,
(2.9) |
with bitwise-decoder,
(2.10) |
The individual and minimax errors, (2.7) and (2.8), have the following bit-encoded counterparts,
(2.11) |
and
(2.12) |
In the present work, we will focus on such a bit-encoding point of view, but mention that there are close links between these two points of view, if the mapping possesses some stability properties. Specifically, this link will be used to derive lower complexity bounds for the Fourier neural operator in Section 2.6.
2.3. Information-theoretic notions
The relevance of the bit-encoding point of view is that it relates directly to the (Kolmogorov) metric entropy of the underlying model class and allows results to be derived which are independent of specifics of the architecture such as the choice of activation function. Thus bit-encoding enables analysis relating directly to intrinsic topological properties of .
Minimax code-length
Abstracting further our previous discussion, we make the following formal definition of abstract bitwise encoder/decoder pairs:
Definition 2.5 (Abstract bitwise encoder/decoder pairs).
Given a compact subset of a Banach space , we denote by the set of all bitwise encoder/decoder pairs of length , i.e. all pairs of mappings and .
Following [9], for , we also introduce the minimax code length of a compact set as the minimal number of bits for which there exists an (abstract) encoder/decoder pair such that
That is,
(2.15) |
Kolmogorov metric entropy
Given a metric space , element and , we denote by
the closed ball of radius . We now make the following definition for the covering number and (Kolmogorov) metric entropy:
Definition 2.6 (Covering number and metric entropy).
Let be a metric space. For , the -covering number of a set , denoted , is the smallest integer , such that can be covered by closed balls of radius , i.e.
(2.16) |
We note that the subscript is used as a shorthand for , with the relevant metric implied. The metric entropy of is defined as the logarithm (to base ) of the covering number, i.e.
(2.17) |
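To make Definition 2.6 concrete, the following sketch estimates the covering number and the metric entropy of a finite point cloud by a greedy procedure. The greedy count is only an upper bound on the covering number of the sampled points, and the point cloud (a grid on the unit interval) and the function names are hypothetical choices for illustration.

```python
import math
import numpy as np

def greedy_cover(points, eps):
    """Return indices of centers whose closed eps-balls cover all rows of `points`."""
    uncovered = np.ones(len(points), dtype=bool)
    centers = []
    while uncovered.any():
        i = int(np.argmax(uncovered))             # first uncovered point becomes a center
        centers.append(i)
        dist = np.linalg.norm(points - points[i], axis=1)
        uncovered &= dist > eps                   # points within eps of the center are covered
    return centers

def entropy_upper_bound(points, eps):
    """Base-2 logarithm of the greedy cover size (upper bound on the entropy of the cloud)."""
    return math.log2(len(greedy_cover(points, eps)))

grid = np.linspace(0.0, 1.0, 101).reshape(-1, 1)  # sample of the unit interval
estimates = {eps: entropy_upper_bound(grid, eps) for eps in (0.5, 0.1, 0.01)}
```

For a one-dimensional set the estimate grows like the logarithm of the inverse radius; the sets of operators considered in this work have much faster entropy growth.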
Link between minimax code-length and metric entropy
The minimax code-length and metric entropy introduced in the previous paragraphs are linked by the following fundamental result [16, Rmk. 5.10]:
Proposition 2.7.
Let be a Banach space, and let be compact. Then the metric entropy of provides a lower bound on the minimax code length:
(2.18) |
Proof.
Let be given. Let be a bitwise encoder/decoder pair with bits, achieving reconstruction error at most on . The image of contains at most elements, . Since, for any , the specific choice belongs to the image of , it follows that
Thus, , implying that the covering number of is bounded by
Taking logarithms and recalling that yields the claim. ∎
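The counting argument of the proof can be summarized in a single chain of implications. The notation below is assumed for illustration: $(E, D)$ denotes the encoder/decoder pair with $L$ bits, $\bar{B}_\epsilon(y)$ the closed ball of radius $\epsilon$ around $y$, $\mathcal{N}(K;\epsilon)$ the covering number and $\mathcal{H}(K;\epsilon)$ the metric entropy of the compact set $K$.

```latex
% Assumed notation: D maps bit strings of length L to candidate reconstructions.
\[
  K \subset \bigcup_{y \in D(\{0,1\}^L)} \bar{B}_\epsilon(y)
  \;\Longrightarrow\;
  \mathcal{N}(K;\epsilon) \le \# D(\{0,1\}^L) \le 2^L
  \;\Longrightarrow\;
  \mathcal{H}(K;\epsilon) = \log_2 \mathcal{N}(K;\epsilon) \le L .
\]
```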
In particular, Proposition 2.7 implies that if , then there cannot exist a bit-encoder-decoder pair achieving uniform decoding accuracy over . Conversely, if is an encoder-decoder pair (2.9), (2.10) associated with a bit-encoded neural operator with , and if the following minimax approximation bound holds,
this implies that .
2.4. Information-theoretic minimax bounds
As a consequence of Proposition 2.7, we can derive a lower bound on the required number of bits to achieve the minimax bound (2.12) by estimating the entropy of . As mentioned before, we will consider two settings, corresponding to uniform approximation of over a compact set (the setting ) and approximation with respect to a Bochner -norm for probability measure (the setting ).
Uniform approximation
We now consider a compact set of input functions, and operators belonging to (cp. Setting 2.2). This corresponds to the choice , , , in the discussion of the previous section. We then have the following result:
Theorem 2.8.
Let be a Banach space. Let be a compact set of input functions, and assume that the metric entropy of satisfies the lower bound, for . There exists a constant , independent of , such that the following holds: If is a quantized neural operator architecture, satisfying
and if , i.e. if the parameters of can be encoded by bits, then
Proof.
If is a function space, then compact subsets are commonly defined by a smoothness constraint, and this partly motivates our assumption on in the last theorem. The following example is illustrative.
Example 2.9.
Let be a bounded domain. Let . An example of the setting outlined above is the case of Lipschitz operators , with
defined by a Sobolev smoothness constraint for . In this case, it is well-known that the metric entropy satisfies , i.e. the assumptions of Theorem 2.8 hold with .
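One way to see why classes of Lipschitz functionals have large metric entropy is a classical sign-pattern construction, sketched below on a toy example. This is an illustration of a standard argument and not necessarily the construction used in the proofs of this work; the points, the value of `eps` and all names are hypothetical. Given $n$ inputs that are pairwise at least $2\epsilon$ apart, every assignment of values $\pm\epsilon$ is 1-Lipschitz on these inputs, and two different assignments differ by $2\epsilon$ in the sup-norm.

```python
import itertools
import numpy as np

def is_one_lipschitz(points, values):
    """Check |f(x_i) - f(x_j)| <= |x_i - x_j| for all pairs of sample points."""
    gaps = np.abs(points[:, None] - points[None, :])
    diffs = np.abs(values[:, None] - values[None, :])
    return bool(np.all(diffs <= gaps + 1e-12))

eps = 0.1
points = np.arange(5) * 2 * eps                   # 5 inputs, pairwise distance >= 2*eps
patterns = [
    eps * np.array(signs)
    for signs in itertools.product((-1.0, 1.0), repeat=len(points))
]

all_lipschitz = all(is_one_lipschitz(points, v) for v in patterns)
min_separation = min(
    float(np.max(np.abs(v - w)))                  # sup-norm distance on the sample points
    for v, w in itertools.combinations(patterns, 2)
)
bits_needed = int(np.log2(len(patterns)))         # one bit per separated input
```

Any cover by balls of radius smaller than $\epsilon$ must place distinct sign patterns in distinct balls, so at least $n$ bits are needed. Heuristically, the number of such well-separated inputs is itself governed by the metric entropy of the compact input set, which is how an exponential lower bound can arise.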
Approximation in expectation
Another commonly studied setting concerns the approximation in expectation (cp. Setting 2.4). Here, we consider -Lipschitz mappings defined on a separable Hilbert space . We fix a probability measure on and consider inputs as random draws . To derive quantitative lower bounds, we will need to make minimal structural assumptions on .
Assumption 2.10.
There exists an orthonormal basis of , probability space and summable coefficients , such that is the law of a random variable of the form,
(2.19) |
where are jointly independent random variables. We assume that the random variable satisfies , and has law for a probability density function . We furthermore assume that there exists a constant , such that
(2.20) |
A concrete, and widely considered, example satisfying Assumption 2.10 is the case of a Gaussian probability measure with prescribed mean and covariance operator. In this case, are the eigenvalues of the covariance operator, the corresponding eigenfunctions, and the random variables have standard Gaussian distribution.
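The following sketch shows how inputs of the form (2.19) could be sampled in the Gaussian case, and how an approximation error in expectation could be estimated by Monte Carlo. Several choices are assumptions made for illustration only: the expansion is truncated to 32 modes, the mean is zero, the coefficient sequence `coeffs` has a hypothetical algebraic decay, and inputs are represented by their coefficients in the orthonormal basis.

```python
import numpy as np

def sample_inputs(n_samples, coeffs, rng):
    """Draw truncated expansions: row i holds the basis coefficients coeffs[j] * xi_j."""
    xi = rng.standard_normal((n_samples, len(coeffs)))   # independent standard Gaussians
    return xi * coeffs

def lp_error(G, G_approx, samples, p=2):
    """Monte Carlo estimate of the L^p distance between two functionals under the sampled law."""
    diff = np.abs(G(samples) - G_approx(samples))
    return float(np.mean(diff ** p) ** (1.0 / p))

rng = np.random.default_rng(0)
coeffs = np.arange(1, 33, dtype=float) ** -2.0           # hypothetical decay, 32 modes
samples = sample_inputs(10_000, coeffs, rng)

# Toy 1-Lipschitz functional (the norm) and an approximation using only 4 modes.
G = lambda u: np.linalg.norm(u, axis=1)
G_approx = lambda u: np.linalg.norm(u[:, :4], axis=1)
err = lp_error(G, G_approx, samples)
```

In this toy example the error is controlled by the energy in the discarded modes, which decays with the coefficient sequence.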
Theorem 2.11.
Let be a Banach space of input functions. Let be a probability measure satisfying Assumption 2.10. Assume that the coefficients as , where . Then there exists a constant , independent of , such that the following holds: If is a quantized neural operator architecture, satisfying
and if , i.e. if the parameters of can be encoded by bits, then
Proof.
Thus, an exponential number of encoding bits is also needed in an -setting. Theorem 2.11 shows that the approximation of Lipschitz operators in expectation is not “qualitatively” easier than uniform approximation of such operators over a compact set of input functions.
2.5. Approximation of generic Lipschitz operators
Theorems 2.8 and 2.11 show that operator learning architectures that can approximate arbitrary -Lipschitz operators to accuracy have exponential memory requirements; any (bit-encoded) implementation of such an architecture will require a number of bits that is exponential in . The reason for this is that the space of Lipschitz operators is exponentially large in a fundamental information-theoretic sense quantified by the metric entropy.
However, this minimax bound applies to the approximation of the entire class by a single architecture, and does not necessarily imply that it is impossible to approximate individual efficiently. At first sight, it could appear that arguments based on the metric entropy cannot be used to gain any insight into this refined question; indeed, if we fix individual , then the metric entropy of the singleton-set is trivially $0$, and the minimax code length (2.15) is $0$ for any value of the accuracy , since the trivial decoder reproduces exactly, with vanishing approximation error, . Thus, while entropy arguments give insights into the (concurrent) approximation of the set , they seemingly have no immediate implications for the approximation of individual .
Despite these facts, the results below will show that a refined analysis based on the concept of metric entropy is nevertheless possible; in the uniform and -settings of the previous section, a fixed sequence of bit-encoded architectures , with at most bits, can approximate generic elements at best at a logarithmic rate, for fixed . Before stating our result, we briefly recall the notion of a generic element of a (compact) metric space (see Appendix A for further remarks, and [45, Chap. 8] for an in-depth discussion):
Definition 2.12 (Topologically generic properties).
Let be a compact metric space. A subset is called residual, if it is equal to a countable intersection of sets, each of whose interior is dense in . The complement of a residual set is a meagre set. A property is called generic, if the set
is residual.
Under the assumption that is compact, the Baire category theorem (cp. Appendix A) implies that any residual set is dense in . Furthermore, the intersection of countably many residual sets is itself residual, and hence still dense. In this sense, a topologically generic property is somewhat analogous to a property that holds with probability in a probabilistic sense. Thus, a generic property is often thought of as a property that is satisfied by “almost every” element of .
We can now state our main results on the approximation of generic operators . In the uniform setting (cp. Setting 2.2), we have:
Proposition 2.13 (Uniform approximation of generic operators).
Let be a Banach space of input functions. Let be compact, and assume that the metric entropy for . Let be a sequence of bit-encoded neural operator architectures, with quantized parameter set . Then generic cannot be approximated by at a convergence rate better than ; more precisely, for any sequence , there is a residual subset , consisting of operators , for which
.
Proof.
We let and . We note that is a compact, convex subset. We then consider the sequence of subsets , defined by all possible realizations,
By assumption, . By Proposition 3.2, to be proved in Section 3.2, we have . The claim of Proposition 2.13 then follows, as a special case, from the abstract result of Proposition 4.2 to be derived in Section 4. ∎
A similar result holds for approximation of Lipschitz operators in an sense, as shown in the following proposition (cp. Setting 2.4):
Proposition 2.14 (Approximation of generic operators in expectation).
Let be a Banach space of input functions. Let be a probability measure satisfying Assumption 2.10. Assume that the coefficients as , where . Let be a sequence of bit-encoded neural operator architectures, with quantized parameter set . Then generic cannot be approximated by at a convergence rate better than ; more precisely, for any sequence , there is a residual subset , such that for any ,
.
Proof.
We let and . We note that is a compact, convex subset. We consider the subsets , defined by all possible realizations,
By assumption, . By Proposition 3.6, to be proved in Section 3.3, we have . The claim of Proposition 2.14 then follows, as a special case, from the abstract result of Proposition 4.2 to be derived in Section 4. ∎
2.6. Approximation of generic Lipschitz operators by FNO
The results of the previous section are formulated abstractly for an unspecified sequence of quantized neural operator architectures . To conclude the discussion of our main results, we illustrate some implications of these results for a concrete operator learning framework, the Fourier neural operator [39].
We note that although the derivation of these results will rely on Propositions 2.13 and 2.14, the ultimate statement of the theorems will be in terms of the number of tunable real-valued parameters of FNO, without bit-encoding. Thus, the gap between the bit-encoded parameters and real-valued parameters point of view can be bridged in this case.
In preparation for stating these theorems for FNO, we briefly describe a specific setting to which FNO is applicable, and recall the FNO architecture. This is followed by the statement of a novel theorem establishing a curse of (exponential) parametric complexity for the FNO, in the uniform approximation setting.
FNO case study
As a case study, we consider Fourier neural operators (FNO), approximating a relevant class of -Lipschitz operators,
mapping square-integrable input functions to the reals (or equivalently, to a space of constant-valued functions). Here is a compact subset of , consisting of square-integrable functions . We wish to approximate such -Lipschitz operator , uniformly over the compact set .
In the following, we will usually write instead of , where for simplicity and due to certain restrictions of the FNO architecture, the underlying domain is taken to be the -periodic torus in spatial dimensions, where in typical applications, . Prototypical examples of relevant are , where
denotes the unit ball in the Sobolev space with smoothness . The question to be addressed is: how many tunable parameters are needed to approximate generic to a prescribed accuracy $\epsilon$?
FNO architecture
We here recall the general notion of Fourier neural operators [39]. Let and be two Banach function spaces, consisting of functions and , respectively. A Fourier neural operator (FNO) defines a nonlinear operator
mapping between these spaces. By definition of the FNO architecture, such takes the form
(2.21) |
where , is a linear lifting layer, , is a linear projection layer, and the are the hidden layers, mapping between hidden states . The hidden states are vector-valued functions with components, , belonging to a Banach function space . Here, the “channel width” is a hyperparameter of the architecture. Each hidden layer is of the form
where is a matrix multiplying pointwise, and is a bias. is a non-local operator of the form
with (and ) the Fourier transform (and its inverse). The matrix is a tunable Fourier multiplier indexed by . It is assumed that for , i.e. for wavenumbers above a specified Fourier cut-off parameter . This Fourier cut-off is a second hyperparameter of the FNO architecture. We collect the values for different , , in a tensor , which acts on the Fourier coefficients , by
The resulting FNO architecture depends on the channel width , Fourier cut-off parameter and depth . We collect all tunable parameters in a vector . Any parameter can be decomposed layer-wise, as
where
collects the parameters of the -th hidden layer, for . We denote by the parameters of the projection and by the parameters of lifting . Assuming that , the dimension of satisfies,
(2.22) |
Consistent with practical implementations, it is generally assumed that the hidden channel dimension of the FNO is at least as large as both the input and output dimensions . We include a list of hyperparameters in Table 3 to clarify the notation.
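To make the structure of a hidden layer concrete, the following is a minimal NumPy sketch of a single layer on a one-dimensional periodic grid. It is an illustration under stated assumptions, not the reference implementation: the names `fno_hidden_layer`, `layer_param_count`, `W`, `b`, `P`, `d_c` and `k_max` are hypothetical, the bias is taken to be constant in space, and the real FFT is used so that only non-negative wavenumbers are stored. The parameter count refers to the numbers stored by this sketch only and is not meant to reproduce (2.22).

```python
import numpy as np

def fno_hidden_layer(v, W, b, P, k_max, act=np.tanh):
    """One hidden layer v -> act(W v + b + K v) on a uniform periodic 1D grid.

    v: (d_c, N) real samples of the hidden state on N grid points
    W: (d_c, d_c) matrix applied pointwise, b: (d_c,) bias (assumed constant in space)
    P: (k_max + 1, d_c, d_c) complex Fourier multipliers for wavenumbers 0..k_max
    """
    d_c, N = v.shape
    v_hat = np.fft.rfft(v, axis=1)                      # Fourier coefficients, (d_c, N//2 + 1)
    out_hat = np.zeros_like(v_hat)
    # Apply the multiplier P[k] to each retained mode; modes above k_max are set to zero.
    out_hat[:, : k_max + 1] = np.einsum("kij,jk->ik", P, v_hat[:, : k_max + 1])
    Kv = np.fft.irfft(out_hat, n=N, axis=1)             # back to physical space
    return act(W @ v + b[:, None] + Kv)

def layer_param_count(d_c, k_max):
    """Real numbers stored by this sketch: W, b, and complex P (two reals per entry)."""
    return d_c * d_c + d_c + 2 * (k_max + 1) * d_c * d_c

# Sanity check: with W = 0, b = 0, identity multipliers and identity activation,
# a band-limited input is reproduced by the non-local part.
d_c, N, k_max = 3, 32, 4
x = np.arange(N) / N
v = np.stack([np.cos(2 * np.pi * (j + 1) * x) for j in range(d_c)])
P_id = np.tile(np.eye(d_c, dtype=complex), (k_max + 1, 1, 1))
out = fno_hidden_layer(v, np.zeros((d_c, d_c)), np.zeros(d_c), P_id, k_max, act=lambda z: z)
```

In this sketch the number of stored values per layer grows quadratically in the channel width and linearly in the number of retained modes, which is the qualitative dependence on the two hyperparameters discussed above.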
Remark 2.16.
Since we are interested in a restricted class of operators , with real-valued outputs, we will replace the general output layer by a spatially averaged, real-valued version ,
This does not affect the parameter-count, while ensuring real-valued outputs. We will refer to this as an output-averaged FNO.
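As a small illustration of the output-averaging step, the sketch below applies a linear projection pointwise and then averages over the grid. The function name `output_averaged` and the representation of the projection as a weight vector `q` are assumptions made for this example; the averaging reuses the projection weights and introduces no additional parameters.

```python
def output_averaged(v, q):
    """Average the pointwise projection sum_j q[j] * v[j][i] over all grid points i.

    v: list of d_c channels, each a list of N grid samples of the last hidden state
    q: list of d_c projection weights (hypothetical stand-in for the projection layer)
    """
    n_points = len(v[0])
    total = sum(q[j] * v[j][i] for j in range(len(q)) for i in range(n_points))
    return total / n_points

v = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]   # d_c = 2 channels, 4 grid points
q = [2.0, -1.0]
y = output_averaged(v, q)                           # a single real number
```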
In passing and in connection with the last remark, we mention relevant work considering variants of FNO for finite-dimensional input and/or output spaces [24], where similar alterations to the original FNO architecture have been studied in greater detail.
Symbol | Meaning |
---|---|
channel width | |
Fourier cut-off | |
depth | |
total number of parameters | |
parameter bound, |
Generic curse of parametric complexity for FNO
Our main theorem will be based on Proposition 2.13, and establishes a generic curse of parametric complexity for FNO. In contrast to the aforementioned proposition, this theorem holds at the level of continuous real-valued parameters , without requiring specification of a bit-encoding. Instead, we assume a mild bound on the parameters . We note that similar assumptions have been considered in the recent work [28], to define relevant approximation spaces of FNO. To this end, we make the following definition:
Definition 2.17.
Given an operator and , we will say that can be approximated by FNO at a logarithmic rate , if there exists a sequence of output-averaged FNO architectures with at most tunable parameters, and a sequence of parameters , satisfying the bound
and
Remark 2.18.
The specific upper bound on the weights, , is here chosen for simplicity. For the following discussion, it could readily be replaced by a more general upper bound, for fixed constants , , , without affecting the main conclusions.
We can now state our main result for FNO:
Theorem 2.19.
Let be compact. Assume that the metric entropy of satisfies an algebraic lower bound, for some . Consider FNO with a fixed Lipschitz continuous activation function . Then generic cannot be approximated by FNO at a logarithmic rate , for any .
Thus, loosely speaking and under mild growth assumptions on the weights, the approximation of generic to accuracy , requires an FNO architecture with exponentially many tunable parameters in .
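To illustrate why failure of a logarithmic rate translates into exponentially many parameters, consider the following sketch. It assumes an error bound of the form C * log(n)^(-alpha) with n the number of tunable parameters; this form, the constant `C` and the helper name `params_needed` are assumptions made for illustration.

```python
import math

def params_needed(eps, alpha, C=1.0):
    """Smallest real n with C * log(n) ** (-alpha) <= eps, i.e. n = exp((C / eps) ** (1 / alpha))."""
    return math.exp((C / eps) ** (1.0 / alpha))

# Halving the target accuracy repeatedly: the required size grows exponentially in 1/eps.
accuracies = [0.5, 0.25, 0.125]
sizes = [params_needed(eps, alpha=1.0) for eps in accuracies]
```

Even if such a logarithmic rate were achievable, the required number of parameters would already be exponential in a power of the inverse accuracy; the theorem states that for generic operators and sufficiently large rates, not even this is possible.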
The following corollary is obtained by taking as the unit ball in a Sobolev space for , and with the -dimensional periodic torus:
Corollary 2.20.
Let , and denote . Then generic cannot be approximated by FNO at logarithmic rate , for any .
Proof of Theorem 2.19.
Fix . We wish to show that generic cannot be approximated at logarithmic rate . Proof of this claim will make use of the following lemma:
Lemma 2.21 (FNO quantization lemma).
Fix a Lipschitz continuous activation function . Let . For any , there exists a quantized neural operator with quantized parameter values, where , , such that for any output-averaged FNO with activation and at most tunable parameters, we have
where .
Brief sketch of proof.
The detailed proof of this lemma is included in Appendix B; in short, the proof relies on two observations: (i) all possible FNO architectures with at most parameters can be encapsulated by a “super” FNO-architecture with a number of parameters that is bounded algebraically in for fixed algebraic exponent, and (ii) quantization of this super-architecture with an algebraically bounded number of bits is possible, since the mapping has at least a weak form of stability (Lipschitz continuity) over the relevant range of parameters , and a Lipschitz constant that grows at a sufficiently slow rate as a function of . ∎
By Lemma 2.21, there exists , a sequence , and a sequence of quantized neural operators, , such that
Associated with this subsequence , we now define an (abstract) sequence of bit-encoded neural operators for arbitrary ; specifically, we define , by
where is chosen maximal such that , and are the first bits of (the values of the remaining bits are simply ignored). We note that since , we have . Furthermore, for arbitrary fixed operator , we note that the decay
along the specified subsequence also implies the error decay
(2.23) |
along the full sequence , as . This is immediate from the definition of and the fact that does not leave exponential gaps between subsequent , since ; in particular, this implies that .
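The extension from the subsequence to all bit-lengths can be sketched as follows. The helper `decode_with_prefix`, the list `sizes` of available bit-lengths and the list `decoders` are hypothetical names used only for this illustration: given an arbitrary bit string, the maximal available bit-length not exceeding its length is selected, and only that many leading bits are decoded.

```python
from bisect import bisect_right

def decode_with_prefix(bits, sizes, decoders):
    """Decode a bit string of arbitrary length with the largest admissible architecture.

    sizes: increasing list of bit-lengths n_1 < n_2 < ... of the given subsequence
    decoders: decoders[k] maps exactly sizes[k] bits to a model (here: any Python object)
    """
    k = bisect_right(sizes, len(bits)) - 1      # maximal k with sizes[k] <= len(bits)
    if k < 0:
        raise ValueError("fewer bits than the smallest architecture requires")
    return decoders[k](bits[: sizes[k]])        # the remaining bits are ignored

# Toy example: a "model" is just the index k together with the integer encoded by the bits.
sizes = [4, 8, 16]
decoders = [lambda b, k=k: (k, int(b, 2)) for k in range(len(sizes))]
result = decode_with_prefix("1010110011", sizes, decoders)   # 10 bits, so the first 8 are used
```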
By Proposition 2.13, the set of operators which can be approximated by such a sequence , at logarithmic rate , is meagre (its complement is residual). To conclude the argument, it therefore suffices to show that if can be approximated by FNO at logarithmic rate , then . This then implies that the set of operators that can be approximated by FNO at logarithmic rate is a subset of , and hence is itself meagre.
To this end, assume that is approximated by FNO at logarithmic rate . By definition, there exists a sequence of FNOs, , such that,
By the triangle inequality,
along the specified sequence . By (2.23), this implies that
along the entire sequence , and hence , i.e. belongs to the meagre set of operators which can be approximated by the sequence at logarithmic rate .
We have shown that any operator that is approximated by FNO at logarithmic rate belongs to the meagre set . Hence, the set of operators that is approximated by FNO at logarithmic rate is itself meagre, and its complement is residual. We conclude that generic operators , belonging to , cannot be approximated at logarithmic rate . ∎
3. The metric entropy of Lipschitz operators
In the present section, we provide lower bounds on the metric entropy of Lipschitz operators in two general settings; the first pertains to the sup-norm over a compact set of inputs, the second is of relevance to the approximation with respect to the Bochner -norm with respect to a probability measure on the input space. After briefly recalling the relation between covering and packing numbers, we proceed to consider the sup-norm setting in Section 3.2 and the -setting in Section 3.3.
3.1. Entropy, covering and packing
We recall from Definition 2.6 that the metric entropy of a subset is defined by ; here, denotes the covering number of , which is defined as the smallest number of closed balls needed to cover . We also recall the closely related notion of a packing number:
Definition 3.1 (Packing number).
Let be a metric space. The packing number of a subset , denoted , is the largest integer for which there exist elements , with pairwise distance , for all distinct .
With our definitions, the following inequalities between covering and packing numbers are elementary: For any subset , we have
(3.1) |
We mention that, if the covering number is defined by open balls, the factor in the first term could have been replaced by . With our closed definition, any factor would do – we here choose for simplicity.
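The mechanism behind the bound of the covering number by the packing number can be checked numerically on a finite point cloud: a maximal separated set is automatically a covering by closed balls of the same radius. The sketch below is an illustration only; the function names and the choice of random points in the unit square are assumptions.

```python
import math
import random

def greedy_packing(points, eps):
    """Greedily select points with pairwise distance > eps; the result is maximal."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers

def is_cover(points, centers, eps):
    """Check that closed eps-balls around the centers cover all points."""
    return all(any(math.dist(p, c) <= eps for c in centers) for p in points)

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(500)]
centers = greedy_packing(points, 0.2)
covered = is_cover(points, centers, 0.2)
```

Any point that was not selected must lie within distance eps of an earlier center, which is exactly why the separated set is also a covering.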
3.2. Uniform approximation
We are here interested in the uniform setting (Setting 2.2), i.e. the uniform approximation of a (real-valued) mapping over a compact domain .
As pointed out before, given the link between minimax code-length and metric entropy, we are interested in estimating the metric entropy of for a compact metric space. The following proposition relates the metric entropy of to that of , when is metrized by the sup-norm:
Proposition 3.2.
Let be a metric space. Let . The metric entropy of is lower bounded by
(3.2) |
Proposition 3.2 shows that the space of -Lipschitz functions on a compact metric space has exponentially larger entropy than the underlying space.
Proof.
Let be given. Let . Since the covering number lower bounds the packing number (cf. (3.1)), there exist elements , with pairwise distance . Let
denote “hat” functions centered at , and non-vanishing only on . We note that each is -Lipschitz, satisfies , and the supports of are essentially disjoint.
We now consider the set of Lipschitz functions of the form,
These functions satisfy , and , for all choices of . Furthermore, if are two distinct elements, say with , then it is straightforward to show that .
Thus, we have shown that there exist functions , with pairwise -distance . In particular, this implies that the packing number , and by the inequality (3.1) between packing- and covering-numbers, this now implies that
The claim follows by taking logarithms and recalling that . ∎
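The construction in the proof can be reproduced numerically in the simplest case of the unit interval. In the sketch below, the normalization of the hat functions (height eps, centers at pairwise distance 2 * eps) is an assumption chosen for illustration and need not match the constants used in the proof; the point is that all sign combinations yield 1-Lipschitz functions that are uniformly separated in the sup-norm.

```python
import itertools

eps = 0.1
centers = [eps + 2 * eps * j for j in range(5)]        # 0.1, 0.3, ..., 0.9

def hat(x, c):
    """1-Lipschitz hat function of height eps, supported on [c - eps, c + eps]."""
    return max(0.0, eps - abs(x - c))

def f(sigma, x):
    """Signed sum of hat functions with (essentially) disjoint supports."""
    return sum(s * hat(x, c) for s, c in zip(sigma, centers))

signs = list(itertools.product((-1, 1), repeat=len(centers)))   # 2**5 sign patterns
grid = [i / 200 for i in range(201)]                             # contains all centers

def sup_dist(s, t):
    return max(abs(f(s, x) - f(t, x)) for x in grid)

min_dist = min(sup_dist(s, t) for s, t in itertools.combinations(signs, 2))
max_slope = max(abs(f(signs[7], grid[i + 1]) - f(signs[7], grid[i])) * 200 for i in range(200))
```

With M centers one obtains 2**M functions at pairwise sup-distance 2 * eps, so the logarithm of the packing number of the function class is at least of the order of the packing number of the underlying domain.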
We conclude this section with several corollaries of Proposition 3.2.
Corollary 3.3 (Lipschitz functions on finite-dimensional domains).
If is a compact domain in Euclidean space, then
Proof.
It is a well-known fact that
with an implied constant depending on the dimension and the volume of ; for example, this can be seen from a simple volume argument for an -covering , which yields
The claim thus follows from Proposition 3.2. ∎
Corollary 3.4 (Lipschitz functionals on Sobolev spaces).
Let be a compact domain in Euclidean space. Let be the unit ball in the space of Sobolev functions possessing weak derivatives in , considered as a subset of . Then there exists a constant , such that
Proof.
Corollary 3.5 (Lipschitz functionals on Hölder spaces).
Let be a compact domain in Euclidean space. Let be the unit ball in the space of Hölder continuous functions of order , considered as a subset of . Then there exists a constant , such that
3.3. Approximation in expectation
Besides the setting discussed in the previous section, which is relevant for the uniform approximation of operators over a compact set of input functions, another commonly studied setting is the approximation in expectation (cp. Setting 2.4): Here, we consider -Lipschitz mappings defined on a separable Hilbert space . We fix a probability measure on and consider inputs as random draws . We assume that satisfies the minimal structural Assumption 2.10; under this assumption, random draws can be obtained from a Karhunen-Loève-like expansion, .
Our aim is to find lower bounds on the metric entropy of , where is the space of -integrable operators. The following entropy estimate represents the main novel contribution of this section:
Proposition 3.6.
Let be a separable Hilbert space, and let be a probability measure satisfying Assumption 2.10. Let be given. Assume that the coefficients as , where . Then the metric entropy of with respect to the Bochner -norm, obeys the following lower bound: There exist constants , such that
(3.3) |
Our proof of Proposition 3.6 will rely on several technical lemmas, which we state and prove below. The first lemma identifies an isometric embedding .
Lemma 3.7.
Let be a separable Hilbert space. Let satisfy Assumption 2.10, and let . Then for any , there exists an isometric embedding,
(3.4) |
such that , where the Lipschitz norm on is defined with respect to the -norm on .
Proof.
By assumption, is the law of a random field of the form,
(3.5) |
with independent, . To construct the claimed isometry, we define as the cumulative distribution function of . We recall that is uniformly distributed. Furthermore, we clearly have , where the last bound is by Assumption 2.10.
Given , we define the coefficients of with respect to the orthonormal basis . Using the CDFs introduced above, , we now define a mapping,
To see that this is well-defined, we note that, using the expansion of the random field (3.5), , and hence
and we once again remind ourselves that is uniformly distributed on , and that the are independent by assumption. Thus, it follows that
Thus, . This shows that is an isometry as claimed. To verify that , we note that
has Lipschitz constant bounded by
Thus, for any ,
Furthermore, we also have . This shows that
Here, we have made use of the choice (cp. (2.20)) in the last inequality. This concludes our proof. ∎
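The key probabilistic fact used in this proof, namely that applying the cumulative distribution function of a continuous random variable to that variable yields a uniformly distributed variable, can be checked empirically. The sketch below uses a standard Gaussian purely as an example; the choice of distribution, the sample size and the names are assumptions, and Assumption 2.10 is not restricted to this case.

```python
import math
import random

def gaussian_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rng = random.Random(0)
u = [gaussian_cdf(rng.gauss(0.0, 1.0)) for _ in range(20000)]

mean_u = sum(u) / len(u)                              # close to 1/2 for a uniform variable
var_u = sum((x - mean_u) ** 2 for x in u) / len(u)    # close to 1/12 for a uniform variable
```

Applying this transform coordinate-wise to independent coefficients is what allows functions on the unit cube, equipped with the Lebesgue measure, to be transported to functions of the random input.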
As a consequence of Lemma 3.7, we have:
Corollary 3.8.
Proof.
We recall the existence of an isometric embedding from Lemma 3.7, with . It follows that
Taking logarithms, the claimed inequality between the metric entropy follows. ∎
The proof of Proposition 3.6 will furthermore make use of the following result in the finite-dimensional setting:
Lemma 3.9.
Let be given. For , consider . Then there exists a constant , independent of , such that we have the following lower bound on the metric entropy:
(3.7) |
Proof.
Since the Hölder inequality implies, for any , that , it follows that any covering of by -balls with respect to the -norm, also gives rise to a covering of with respect to the -norm (with the same centers). In particular, this implies that
and we only need to establish (3.7) for .
For , define as a composition , where is a piecewise linear function (approximately ) with values,
and interpolates linearly between and on , and from to on . By construction, is -Lipschitz. Since is -Lipschitz, it follows that . Clearly, smaller leads to a larger Lipschitz constant. However, by construction of , we have . In particular, this implies that . Thus, smaller increases the -norm of .
Given , we now subdivide into cubes of equal length, indexed by , where . For any multi-index , we define as a rescaled and translated copy of , such that the support of coincides with the -th cube. In particular, by construction of , this implies that
(3.8) | ||||
(3.9) |
We also note that the have essentially disjoint supports. For , we now define
The factor in front of the sum ensures that . Furthermore, we also note that for any choice of and . In particular, we have , for any choice of . We finally observe that, due to the disjoint supports of the , we have, for any ,
The last quotient is the fraction of entries in which and differ. It turns out that there exists a subset , such that any belonging to differ on a substantial fraction of their components; more precisely, as noted in [1] as a result of the Gilbert-Varshamov bound, there exists a subset satisfying that any two distinct elements , differ on at least a fourth of their coordinates,
(3.10) |
and the cardinality of is lower bounded by,
(3.11) |
This implies that for any two in , we have
Optimizing the right-hand side over , we set to obtain,
where we used that the Euler constant and the fact that implies in the last bound.
Taking into account the bound (3.11), it follows that the packing number , satisfies the lower bound,
where we have defined . Given , we can find , such that
It follows that
We conclude that
This lower bound on the packing number holds for any dimension . We can now use the general relation between the covering- and packing-numbers (3.1), to conclude that,
where . This proves the claim with , i.e.
∎
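The combinatorial ingredient of the preceding proof, a large family of sign vectors that pairwise differ on at least a fourth of their coordinates, can be illustrated by a brute-force greedy search in small dimension. This is only a sketch: the greedy procedure and the name `greedy_separated_subset` are assumptions for illustration and are not the argument of [1], and no claim is made that the greedy family attains the cardinality bound (3.11).

```python
import itertools

def hamming(a, b):
    """Number of coordinates in which two sign vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def greedy_separated_subset(m):
    """Greedy subset of {-1, 1}^m whose elements pairwise differ in at least m/4 coordinates."""
    chosen = []
    for cand in itertools.product((-1, 1), repeat=m):
        if all(hamming(cand, c) >= m / 4 for c in chosen):
            chosen.append(cand)
    return chosen

code = greedy_separated_subset(8)   # exhaustive search over all 2**8 sign vectors
```

Since the functions built from two such sign vectors differ on a fixed fraction of the cubes, their distance in the integral norm is bounded below uniformly, which is what turns the separated family into a packing.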
Proof of Proposition 3.6.
Combining the lower bound (3.6) and (3.7), we obtain that for any ,
provided that . Since by assumption, and since and are constants independent of , it thus follows that there exist , independent of , such that
(3.12) |
The idea is now to choose , such that the term inside the parentheses is lower bounded by for some fixed , implying that the right hand side is for some constant . This then leads to the claimed lower bound. We now proceed to provide the details of the required argument.
We first fix , such that
(3.13) |
We next define
(3.14) |
Since are independent of , it follows that also and are independent of .
For any , the above choice ensures that
and hence there exists a unique , such that
In particular, upon rearranging the first inequality in the last display, we obtain the two equivalent formulations,
(3.15) | |||
(3.16) |
while the second bound implies,
(3.17) |
With this choice of , equation (3.15) guarantees that the estimate in (3.12) applies to all . This in turn implies that
for all . This is the claimed lower bound on the metric entropy. ∎
4. Generic approximation results
We first discuss an abstract formulation of a general “approximation task”. Let be a Banach space (e.g. a space of operators). In a general non-linear approximation task, we are given for any a set over which we aim to approximate an element , where we will assume that belongs to a general class of interest. Considering these subsets fixed, and given a sequence , we will say that can be approximated with convergence rate , if there exists a constant , such that
(4.1) |
Specifically, we will be most interested in the logarithmic case , in the following, with corresponding to all possible realizations of a fixed bit-encoded neural operator architecture (cp. the proofs of Propositions 2.13 and 2.14, respectively).
Coming back to the general abstract setting above, and given , we introduce a set of “efficiently approximated” elements with bound , i.e.
(4.2) |
And we denote the set of all which can be approximated at convergence rate , by elements in , by
(4.3) |
Our goal is to study generically achievable approximation rates , in terms of the complexity of , as measured by its metric entropy.
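As a toy instance of this abstract approximation task, consider approximating real numbers in the unit interval by a codebook with 2**n elements. The example below is purely illustrative: the codebook of dyadic midpoints and the function names are assumptions, and the set being approximated is one-dimensional, so its metric entropy grows only logarithmically and an exponential convergence rate is achievable. The results of this section concern the opposite regime, where the entropy of the class grows so fast that only logarithmic rates are possible.

```python
def codebook(n):
    """All 2**n midpoints of the dyadic intervals of length 2**-n in [0, 1]."""
    return [(2 * j + 1) / 2 ** (n + 1) for j in range(2 ** n)]

def best_error(a, n):
    """Best approximation error of a by elements of the n-bit codebook."""
    return min(abs(a - k) for k in codebook(n))

errors = [best_error(0.3, n) for n in range(1, 11)]   # bounded by 2**-(n+1) for each n
```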
The following lemma will be fundamental to our analysis:
Lemma 4.1.
Let be a Banach space. Let be a compact, convex subset. Let be a family of subsets , with elements. Fix . If given by (4.2) has non-empty interior in the subspace topology on , then there exists a constant , independent of , such that the metric entropy satisfies the bound,
Proof.
At the outset we note that by compactness, we have a uniform upper bound,
Upon a simple rescaling, we may wlog assume that , i.e. that for all . This will be assumed in the following proof.
Our next goal is to show that, for any , the set defined by (4.8) has empty interior. For the sake of contradiction, assume that does not have empty interior. Then there exists and , such that
where is an open ball in the subspace topology on . Thus, for any , we obtain the following bound on the covering numbers,
(4.4) |
We next recall that we have wlog assumed , and we recall that is convex by assumption. In particular, we next show that this implies that
To see why, let and fix arbitrary. We need to show that . Since is convex, it is clear that . In addition, we also have
Hence, as claimed. The inclusion, now implies,
(4.5) |
Combining (4.4) and (4.5), we conclude that
We emphasize that are independent of in the above argument. In particular, the claim of the lemma holds with constant . ∎
Proposition 4.2 (Exponential scaling).
Let be a Banach space. Let be a compact, convex subset. Assume that there exist constants such that,
(4.6) |
Let be a family of subsets with elements. Then generic elements cannot be approximated by elements of at convergence rate better than ; more precisely, for any sequence , the subset , consisting of all , such that
(4.7) |
is residual.
Before coming to the proof of Proposition 4.2, we note that since is compact, is a complete metric space in the subspace topology. In particular, the following argument, which is based on the Baire category theorem, can be applied to (cp. Appendix A for a summary).
Proof.
Let , where is defined by (4.3). Recall that is precisely the set of for which there exists such that
In Lemma 4.1, it is shown that if has non-empty interior then there exists a constant , such that
By assumption on , the left hand side is lower bounded by . Thus, if has non-empty interior, then we must have
But by the assumption that , this last lower bound cannot hold, asymptotically as . Thus, we conclude that has empty interior for any . We furthermore note that is closed; indeed, in (4.2) is given by,
(4.8) |
where we define the closed balls (in the induced topology on ),
Therefore can be written as an intersection of a union of closed balls of radius centered at elements . Note that, since the set is finite by assumption, the union of these closed balls,
is closed for any , implying that also is closed as an intersection of closed sets.
To conclude the proof, we simply note that can be written as a countable union, for integer , of closed subsets with empty interior . In particular, this implies that is itself meagre by the Baire category theorem. We conclude that the complement , consisting of all for which
is residual. This completes the proof. ∎
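The heuristic behind the logarithmic rate can be summarized by the following counting argument, written here for uniform approximation of the whole set rather than for generic elements. The notation is an assumption made for this sketch: $K_n$ is taken to have at most $2^n$ elements, the metric entropy $\mathcal{H}$ is taken as the base-2 logarithm of the covering number $\mathcal{N}$, and the entropy lower bound is assumed to be of the form $\exp(c\,\epsilon^{-\gamma})$ with constants $c, \gamma > 0$.

```latex
\begin{align*}
A \subset \bigcup_{k \in K_n} \bar{B}_{\epsilon}(k), \quad |K_n| \le 2^n
&\;\Longrightarrow\; \mathcal{N}(A;\epsilon) \le 2^n
\;\Longrightarrow\; \mathcal{H}(A;\epsilon) = \log_2 \mathcal{N}(A;\epsilon) \le n, \\
\mathcal{H}(A;\epsilon) \ge \exp\!\left(c\,\epsilon^{-\gamma}\right)
&\;\Longrightarrow\; n \ge \exp\!\left(c\,\epsilon^{-\gamma}\right)
\;\Longrightarrow\; \epsilon \ge \left(\frac{c}{\log n}\right)^{1/\gamma}.
\end{align*}
```

Under these assumptions, the achievable accuracy with $n$ bits decays at most like a negative power of $\log n$; the proposition above upgrades this worst-case observation to a statement about generic elements via the Baire category theorem.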
A similar result can also be derived under the assumption of an algebraic scaling. This may be of relevance for generic function approximation by neural networks, and hence we mention it here, in passing.
Proposition 4.3 (Algebraic scaling).
Let be a Banach space. Let be a compact, convex subset. Assume that there exist constants such that,
(4.9) |
Let be a family of subsets with elements. Then generic elements cannot be approximated by elements of at convergence rate better than ; more precisely, for any sequence , the subset , such that,
(4.10) |
is residual.
Proof.
Let , where is defined by (4.3). Recall that is precisely the set of for which there exists such that
In Lemma 4.1, it is shown that if has non-empty interior then there exists a constant , such that
By assumption on , the left hand side is lower bounded by . Thus, if has non-empty interior, then we must have
By assumption, , this is not the case. Thus, we conclude that has empty interior for any . Thus, arguing as in the proof of Proposition 4.2 it follows that is meagre, and hence is residual. ∎
5. Conclusion
Operator learning is a new paradigm for the data-driven approximation of operators. Popular operator learning frameworks extend and generalize neural networks to this infinite-dimensional setting. While there are numerous papers demonstrating the potential and practical utility of proposed neural operator architectures, our understanding of the precise conditions under which operator learning is practically feasible remains limited.
This paper makes a contribution to the mathematical underpinnings of this field, by providing an information-theoretic perspective on the curse of parametric complexity (a scaling-limit of the curse of dimensionality) identified in [34]. In particular, it is shown that this curse poses a fundamental limitation to operator learning on general spaces of Lipschitz operators. Bit-encoding (storing in memory) any neural operator architecture, which is capable of achieving approximation accuracy for general -Lipschitz continuous and real-valued operators, requires a number of bits that is exponential in . It is shown that this is true not only when measuring the approximation error in the sup-norm over compact sets of input functions, but also when measuring the error in the -norm with respect to a probability measure satisfying certain structural assumptions. The assumptions are met for widely considered , including the case of a Gaussian random field with at most algebraically decreasing eigenvalues of the covariance. These results rely on minimax analysis and, in contrast to prior work [34], are independent of the employed activation function in the architecture.
Going beyond such minimax analysis, we furthermore study the approximation of individual Lipschitz operators by a sequence of neural operator architectures. Such a sequence would e.g. be obtained when increasing the width, depth or other hyperparameters at a pre-defined rate as the model is scaled up. In this setting, we address the following question: “At which rate can the approximation error along such a sequence decrease, as a function of the total number of bit-encoded parameters?” Using topological arguments based on Baire category, we establish a quantitative relation between the metric entropy of the set of -Lipschitz operators, and the best approximation-rate that can be achieved along such a sequence for generic -Lipschitz operators; as a consequence of the exponential increase in metric -entropy of the set of -Lipschitz operators, it is shown that achievable approximation rates are at most logarithmic as a function of the required encoding bits.
Finally, this abstract analysis leads to a concrete result on the approximation of generic Lipschitz operators by Fourier neural operators. Our results imply that for generic -Lipschitz operators, and under mild assumptions on the tunable parameters, there cannot exist a sequence of FNO approximations whose approximation error decays faster than logarithmically in the number of real-valued parameters. To obtain this result, mild bounds on the growth of the parameters of FNO approximants are assumed; specifically, the size of each individual parameter is assumed to be at most exponential in the total number of parameters, as the model size is scaled up.
The results of this work should be compared and contrasted with the recent work [50], which shows the surprising result that there exist (non-standard) neural operator architectures capable of approximating Lipschitz continuous operators to accuracy , with a number of real-valued tunable parameters growing only algebraically with . The analysis of the present work indicates that a practical implementation of such architectures on computing hardware, and with parameters encoded by a total of bits will require to be exponentially large in . In fact, if each parameter is encoded by bits, then a lower bound of the following form is to be expected:
for fixed constants independent of . In particular, if grows at most algebraically, as in the construction [50], then the number of encoding bits per parameter must necessarily grow exponentially. Thus, the only trade-off that appears possible from an information-theoretic perspective is to reduce the number of parameters at the expense of the required number of bits per parameter , or vice versa. In turn, the required number of encoding bits is intimately linked to the stability of the mapping from parameters to the corresponding realization of the neural operator ; an exponentially growing number of bits is only required if the parameter-to-realization mapping is either very unstable, e.g. having very large Lipschitz constant, or if the optimal parameters themselves are very large. Here, “large” means that either the Lipschitz constant or the -norm of the parameters grows exponentially with .
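The trade-off between the number of parameters and the number of bits per parameter can be made tangible with a small computation. The sketch assumes that the total number of bits (parameters times bits per parameter) is bounded below by exp(c * eps^(-gamma)); this form, the constants `c` and `gamma`, and the function name are assumptions for illustration and are not taken from the displayed bound.

```python
import math

def min_bits_per_parameter(eps, num_params, c=1.0, gamma=1.0):
    """Lower bound on bits per parameter if total bits >= exp(c * eps ** -gamma)."""
    return math.exp(c * eps ** (-gamma)) / num_params

# Algebraically many parameters, here eps ** -2, leave an exponentially growing
# number of bits to be carried by each individual parameter.
accuracies = [0.5, 0.25, 0.125]
bits = [min_bits_per_parameter(eps, num_params=eps ** -2) for eps in accuracies]
```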
The results of this work underline the fundamental character of the curse of parametric complexity identified in [34] from the point of view of information theory. In addition, it is here shown that this curse persists even when the sup-norm (uniform approximation of the underlying operator) is replaced by an a priori much weaker -norm (approximation in expectation). This considerably constrains the generality with which approximation theory for operator learning, guaranteeing efficient approximation by neural operators at algebraic convergence rates, can be developed. A complete or partial characterization of the relevant mathematical properties and structures enabling efficient operator approximation, would be highly desirable. The results presented in this work demonstrate rigorously that one has to go beyond Lipschitz operators to achieve this.
Acknowledgments
The author would like to thank Andrew M. Stuart and Nikola B. Kovachki for interesting discussions which have led to this work. This work has been supported by funding from the Swiss National Science Foundation through Postdoc.Mobility grant P500PT-206737.
References
- [1] E. M. Achour, A. Foucault, S. Gerchinovitz, and F. Malgouyres. A general approximation lower bound in L^p norm, with applications to feed-forward neural networks. In Advances in Neural Information Processing Systems, 2022.
- [2] B. Adcock, S. Brugiapaglia, N. Dexter, and S. Moraga. On efficient algorithms for computing near-best polynomial approximations to high-dimensional, Hilbert-valued functions from limited samples. arXiv preprint arXiv:2203.13908, 2022.
- [3] B. Adcock, N. Dexter, and S. Moraga. Optimal approximation of infinite-dimensional holomorphic functions. arXiv preprint arXiv:2305.18642, 2023.
- [4] B. Adcock, N. Dexter, and S. Moraga. Optimal approximation of infinite-dimensional holomorphic functions II: recovery from i.i.d. pointwise samples. arXiv preprint arXiv:2310.16940, 2023.
- [5] J. A. L. Benitez, T. Furuya, F. Faucher, A. Kratsios, X. Tricoche, and M. V. de Hoop. Out-of-distributional risk bounds for neural operators with applications to the Helmholtz equation. arXiv preprint arXiv:2301.11509, 2023.
- [6] K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart. Model reduction and neural networks for parametric PDEs. The SMAI Journal of Computational Mathematics, 7:121–157, 2021.
- [7] M. S. Birman and M. Z. Solomyak. Approximation of functions of the W_p^α-classes by piece-wise-polynomial functions. Doklady Akademii Nauk, 171(5):1015–1018, 1966.
- [8] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Memory-optimal neural network approximation. In Wavelets and Sparsity XVII, volume 10394, pages 157–168. SPIE, 2017.
- [9] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45, 2019.
- [10] J. Castro. The Kolmogorov infinite dimensional equation in a Hilbert space via deep learning methods. Journal of Mathematical Analysis and Applications, 527(2):127413, 2023.
- [11] J. Castro, C. Muñoz, and N. Valenzuela. The Calderón’s problem via DeepONets. arXiv preprint arXiv:2212.08941, 2022.
- [12] T. Chen and H. Chen. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks, 6(4):911–917, 1995.
- [13] A. Cohen, R. DeVore, G. Petrova, and P. Wojtaszczyk. Optimal stable nonlinear approximation. Foundations of Computational Mathematics, 22(3):607–648, 2022.
- [14] A. Cohen, R. DeVore, and C. Schwab. Convergence rates of best N-term Galerkin approximations for a class of elliptic SPDEs. Foundations of Computational Mathematics, 10(6):615–646, 2010.
- [15] A. Cohen, R. DeVore, and C. Schwab. Analytic regularity and polynomial approximation of parametric and stochastic elliptic PDEs. Analysis and Applications, 9(01):11–47, 2011.
- [16] S. Dahlke, F. De Mari, P. Grohs, and D. Labate. Harmonic and applied analysis. Appl. Numer. Harmon. Anal, 2015.
- [17] B. Deng, Y. Shin, L. Lu, Z. Zhang, and G. E. Karniadakis. Convergence rate of DeepONets for learning operators arising from advection-diffusion equations. arXiv preprint arXiv:2102.10621, 2021.
- [18] R. DeVore, B. Hanin, and G. Petrova. Neural network approximation. Acta Numerica, 30:327–444, 2021.
- [19] D. Elbrächter, D. Perekrestenko, P. Grohs, and H. Bölcskei. Deep neural network approximation theory. IEEE Transactions on Information Theory, 67(5):2581–2623, 2021.
- [20] N. R. Franco, S. Fresca, A. Manzoni, and P. Zunino. Approximation bounds for convolutional neural networks in operator learning. Neural Networks, 161:129–141, 2023.
- [21] L. Galimberti, A. Kratsios, and G. Livieri. Designing universal causal deep learning models: The case of infinite-dimensional dynamical systems from stochastic analysis. arXiv preprint arXiv:2210.13300, 2022.
- [22] L. Herrmann, C. Schwab, and J. Zech. Neural and GPC operator surrogates: Construction and expression rate bounds. arXiv preprint arXiv:2207.04950, 2022.
- [23] N. Hua and W. Lu. Basis operator network: A neural network-based model for learning nonlinear operators via neural basis. Neural Networks, 164:21–37, 2023.
- [24] D. Z. Huang, N. H. Nelsen, and M. Trautner. An operator learning perspective on parameter-to-observable maps. arXiv preprint arXiv:2402.06031, 2024.
- [25] P. Jin, S. Meng, and L. Lu. MIONet: Learning multiple-input operators via tensor product. SIAM Journal on Scientific Computing, 44(6):A3490–A3514, 2022.
- [26] A. N. Kolmogorov and V. M. Tikhomirov. $\varepsilon$-entropy and $\varepsilon$-capacity of sets in functional spaces. Amer. Math. Soc. Transl. Ser. 2, 17, 1961.
- [27] Y. Korolev. Two-layer neural networks with values in a Banach space. SIAM Journal on Mathematical Analysis, 54(6):6358–6389, 2022.
- [28] N. B. Kovachki, S. Lanthaler, and H. Mhaskar. Data complexity estimates for operator learning, 2024.
- [29] N. B. Kovachki, S. Lanthaler, and S. Mishra. On universal approximation and error bounds for Fourier neural operators. Journal of Machine Learning Research, 22(1), 2021.
- [30] N. B. Kovachki, S. Lanthaler, and A. M. Stuart. Operator learning: Algorithms and analysis. In Numerical Analysis meets Machine Learning, Handbook of Numerical Analysis. Elsevier, 2024.
- [31] N. B. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar. Neural operator: Learning maps between function spaces with applications to PDEs. Journal of Machine Learning Research, 24(89), 2023.
- [32] A. Kratsios, T. Furuya, J. A. L. Benitez, M. Lassas, and M. de Hoop. Mixture of experts soften the curse of dimensionality in operator learning, 2024.
- [33] G. Kutyniok, P. Petersen, M. Raslan, and R. Schneider. A theoretical analysis of deep neural networks and parametric PDEs. Constructive Approximation, 55(1):73–125, 2022.
- [34] S. Lanthaler. Operator learning with PCA-Net: Upper and lower complexity bounds. Journal of Machine Learning Research, 24(318), 2023.
- [35] S. Lanthaler, Z. Li, and A. M. Stuart. The nonlocal neural operator: Universal approximation. arXiv preprint arXiv:2304.13221, 2023.
- [36] S. Lanthaler, S. Mishra, and G. E. Karniadakis. Error estimates for DeepONets: A deep learning framework in infinite dimensions. Transactions of Mathematics and Its Applications, 6(1), 2022.
- [37] S. Lanthaler and N. H. Nelsen. Error bounds for learning with vector-valued random features. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- [38] Z. Lei, L. Shi, and C. Zeng. Solving parametric partial differential equations with deep rectified quadratic unit neural networks. Journal of Scientific Computing, 93(3):80, 2022.
- [39] Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. M. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. In Ninth International Conference on Learning Representations, 2021.
- [40] H. Liu, H. Yang, M. Chen, T. Zhao, and W. Liao. Deep nonparametric estimation of operators between infinite dimensional spaces. Journal of Machine Learning Research, 25(24):1–67, 2024.
- [41] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021.
- [42] V. Maiorov and A. Pinkus. Lower bounds for approximation by MLP neural networks. Neurocomputing, 25(1):81–91, 1999.
- [43] C. Marcati and C. Schwab. Exponential convergence of deep operator networks for elliptic partial differential equations. SIAM Journal on Numerical Analysis, 61(3):1513–1545, 2023.
- [44] H. N. Mhaskar and N. Hahm. Neural networks for functional approximation and system identification. Neural Computation, 9(1):143–159, 1997.
- [45] J. Munkres. Topology. Pearson Education Limited, 2nd edition, 2014.
- [46] N. H. Nelsen and A. M. Stuart. The random feature model for input-output maps between Banach spaces. SIAM Journal on Scientific Computing, 43(5):A3212–A3243, 2021.
- [47] J. A. Opschoor, C. Schwab, and J. Zech. Exponential ReLU DNN expression of holomorphic maps in high dimension. Constructive Approximation, 55(1):537–582, 2022.
- [48] D. Patel, D. Ray, M. R. Abdelmalik, T. J. Hughes, and A. A. Oberai. Variationally mimetic operator networks. Computer Methods in Applied Mechanics and Engineering, 419:116536, 2024.
- [49] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
- [50] C. Schwab, A. Stein, and J. Zech. Deep operator network approximation rates for Lipschitz operators. arXiv preprint arXiv:2307.09835, 2023.
- [51] C. Schwab and J. Zech. Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in UQ. Analysis and Applications, 17(01):19–55, 2019.
- [52] C. Schwab and J. Zech. Deep learning in high dimension: Neural network expression rates for analytic functions in $L^2(\mathbb{R}^d, \gamma_d)$. SIAM/ASA Journal on Uncertainty Quantification, 11(1):199–234, 2023.
- [53] F. Voigtlaender and P. Petersen. Approximation in $L^p(\mu)$ with deep ReLU neural networks. In 2019 13th International Conference on Sampling Theory and Applications (SampTA), pages 1–4. IEEE, 2019.
- [54] Z. Zhang, W. T. Leung, and H. Schaeffer. BelNet: Basis enhanced learning, a mesh-free neural operator. Proceedings of the Royal Society A, 479, 2023.
Appendix A A short summary of Baire category
In this appendix, we recall the Baire category theorem from general topology. For a more thorough discussion of this result, and its connections to other topological concepts, we refer to the textbook [45, Chap. 8].
Let $X$ be a topological space. Let $A \subset X$ be a subset. We recall that the interior of $A$ is defined as the union of all open sets of $X$ that are contained in $A$. The set $A$ is said to have empty interior if $A$ contains no open set of $X$ other than the empty set. Equivalently, $A$ has empty interior if the complement of $A$ is dense in $X$. We then have the following definition [45, Chap. 8, p. 293]:
Definition A.1.
A space $X$ is said to be a Baire space if the following condition holds: Given any countable collection $\{A_n\}$ of closed sets of $X$, each of which has empty interior in $X$, their union $\bigcup_n A_n$ also has empty interior in $X$.
This definition can equivalently be stated in terms of open sets [45, Lemma 48.1]:
Lemma A.2.
$X$ is a Baire space if and only if, given any countable collection $\{U_n\}$ of open sets in $X$, each of which is dense in $X$, their intersection $\bigcap_n U_n$ is also dense in $X$.
The following Baire category theorem [45, Thm. 48.2] provides many examples of Baire spaces encountered in applications:
Theorem A.3 (Baire category theorem).
If $X$ is a compact Hausdorff space or a complete metric space, then $X$ is a Baire space.
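As a brief illustration of why the hypotheses of Theorem A.3 matter, the following standard example (not taken from the text above) shows that the rationals $\mathbb{Q}$, with the topology inherited from $\mathbb{R}$, fail to be a Baire space, whereas $\mathbb{R}$ itself is one.

```latex
% Each singleton {q} is closed in Q and has empty interior in Q,
% since Q has no isolated points. There are countably many of them, and
\[
  \mathbb{Q} \;=\; \bigcup_{q \in \mathbb{Q}} \{q\}.
\]
% The union is all of Q, which has non-empty interior in itself.
% Hence Q violates Definition A.1 and is not a Baire space.
%
% In contrast, R is a complete metric space, so Theorem A.3 applies:
% for closed sets A_n with empty interior in R,
\[
  \mathbb{R} \;\neq\; \bigcup_{n \in \mathbb{N}} A_n ,
\]
% because the right-hand side has empty interior while R does not.
```

The contrast between $\mathbb{Q}$ and $\mathbb{R}$ shows that completeness cannot simply be dropped from the metric-space case of the theorem.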
Appendix B Proof of the quantization lemma
The goal of this appendix is to prove the FNO quantization lemma; see Lemma 2.21 for its statement.
Proof of Lemma 2.21.
Let be an output-averaged FNO with at most tunable parameters. We first note that the depth of can only take the values . For each possible value of the depth, we now consider the maximally connected output-averaged FNO architecture of depth , obtained by setting in each layer. This maximally connected FNO architecture has at most
tunable parameters. For later reference, we note that
Observation 1: Any output-averaged FNO with depth and at most parameters can be represented by a specific choice of the weights of . In fact, this only requires zero-padding to obtain .
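The zero-padding argument in Observation 1 can be illustrated on a toy example. The sketch below is not an FNO layer: it assumes a single dense linear layer, and all names (`matvec`, `zero_pad_matrix`, `zero_pad_vector`, `W_small`, `W_big`) are hypothetical. It only shows that embedding a small weight matrix into a larger, zero-filled one leaves the computed map unchanged on zero-padded inputs.

```python
def matvec(W, x):
    """Apply a dense weight matrix (given as a list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def zero_pad_matrix(W, rows, cols):
    """Embed W into the top-left corner of a rows x cols zero matrix."""
    padded = [[0.0] * cols for _ in range(rows)]
    for i, row in enumerate(W):
        for j, w in enumerate(row):
            padded[i][j] = w
    return padded


def zero_pad_vector(x, size):
    """Append zeros to x until it has the requested length."""
    return list(x) + [0.0] * (size - len(x))


# Hypothetical small layer with 2 outputs and 3 inputs.
W_small = [[1.0, -2.0, 0.5],
           [0.0, 3.0, 1.5]]
x = [2.0, 1.0, 4.0]

# Hypothetical "maximally connected" layer of width 5:
# the small weights are padded with zeros, and so is the input.
W_big = zero_pad_matrix(W_small, 5, 5)
y_small = matvec(W_small, x)
y_big = matvec(W_big, zero_pad_vector(x, 5))
```

Here the first two entries of `y_big` coincide with `y_small`, and the remaining entries are zero, so the larger layer reproduces the smaller one.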
Our main goal is to suitably quantize , and then define a quantized neural operator architecture with bits which can represent all quantized for by a specific setting of its bitwise-encoded parameters.
It follows from [28, Proposition D.15], with a minimal extension to allow for , that the Lipschitz constant of the mapping,
and with metrized by the -norm, can be bounded by
Here, is a constant depending only on and . In particular, there exists a (larger) constant , such that
We quantize for by subdividing each coordinate direction by equidistant points of separation . Denote the resulting discrete set of points by . We note that this subdivision requires at most,
many quantization points, which can be encoded by
many bits. Since , and , it follows that the number of required bits is
i.e. . The implied constant here is independent of . In the following, we denote . In particular, we conclude that there exists a constant , independent of , such that
We also note that, by construction, for any , there exists , such that
It follows that for any , there exists , such that
Thus,
(B.1)
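The quantization step used above can be sketched in code. The following is a minimal illustration, assuming that the parameters lie in a cube $[-B, B]^q$ and that distances are measured in the sup-norm; the names `quantize`, `dequantize`, `bits_required`, `bound` and `delta` are hypothetical and do not appear in the proof. It rounds each coordinate to an equidistant grid of spacing `delta` and counts the bits needed to store the grid indices.

```python
import math


def quantize(theta, bound, delta):
    """Round each coordinate of theta to the nearest point of an equidistant
    grid starting at -bound with spacing delta; return the grid indices."""
    # Enough points so that the grid covers the whole interval [-bound, bound].
    n_points = math.ceil(2 * bound / delta) + 1
    indices = []
    for t in theta:
        k = round((t + bound) / delta)
        # Clamp in case a coordinate lies slightly outside [-bound, bound].
        indices.append(min(max(k, 0), n_points - 1))
    return indices, n_points


def dequantize(indices, bound, delta):
    """Map grid indices back to the corresponding grid points."""
    return [-bound + k * delta for k in indices]


def bits_required(n_points, n_params):
    """Bits needed to store one grid index per parameter."""
    return n_params * math.ceil(math.log2(n_points))


# Hypothetical values, chosen only for illustration.
theta = [0.3, -0.9, 0.55]
bound, delta = 1.0, 0.25

indices, n_points = quantize(theta, bound, delta)
theta_hat = dequantize(indices, bound, delta)
max_error = max(abs(a - b) for a, b in zip(theta, theta_hat))
total_bits = bits_required(n_points, len(theta))
```

For coordinates inside $[-B, B]$, the rounding error per coordinate is at most `delta / 2`, so refining the grid trades a smaller parameter perturbation against a bit count that grows only logarithmically in `1 / delta`.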
Since , any can be identified with a unique bit-string in , where . Adding an additional number of bits to encode the possible values of the depth parameter , we can now define a quantized neural operator encoded by bits, in the following way: Given , we first read off the length parameter from the first bits. Removing these bits, the remaining bits uniquely identify , and we set
(B.2)
Thus, is a neural operator architecture with parameters encoded by bits. By our definition (B.2), any neural operator belonging to the set
can be represented exactly by suitable choice of . And thus, by (B.1), we have
(B.3)
We finally note that any neural operator architecture with at most parameters is represented as for suitably chosen (see Observation 1, above). In fact, this only involves zero-padding of the weights . In particular, if , then .
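The bit-level encoding used in the proof (a header storing the depth, followed by the quantized parameters) can be sketched as follows. This is an illustrative toy under stated assumptions, not the construction of the lemma itself: the helper names `to_bits`, `encode` and `decode`, as well as the bit widths `depth_bits` and `coord_bits`, are hypothetical.

```python
def to_bits(value, width):
    """Fixed-width binary representation of a non-negative integer."""
    return format(value, "0{}b".format(width))


def encode(depth, indices, depth_bits, coord_bits):
    """Concatenate a depth header with fixed-width grid indices."""
    header = to_bits(depth, depth_bits)
    body = "".join(to_bits(k, coord_bits) for k in indices)
    return header + body


def decode(bitstring, depth_bits, coord_bits):
    """Read the depth from the first bits, then the grid indices."""
    depth = int(bitstring[:depth_bits], 2)
    body = bitstring[depth_bits:]
    indices = [int(body[i:i + coord_bits], 2)
               for i in range(0, len(body), coord_bits)]
    return depth, indices


# Hypothetical example: depth 3 stored in 3 bits, three grid indices
# stored in 4 bits each.
depth, indices = 3, [5, 0, 6]
depth_bits, coord_bits = 3, 4

code = encode(depth, indices, depth_bits, coord_bits)
decoded_depth, decoded_indices = decode(code, depth_bits, coord_bits)
```

Because every field has a fixed width, the decoder needs no separators: the total length is `depth_bits + len(indices) * coord_bits`, and decoding recovers the depth and the quantized parameters exactly.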