Last Update: July 1, 2024
On Statistical Rates and Provably Efficient Criteria
of Latent Diffusion Transformers (DiTs)
Jerry Yao-Chieh Hu†∗ Weimin Wu†∗ Zhuoru Li‡ Zhao Song♭ Han Liu†§
∗ These authors contributed equally to this work.
† Department of Computer Science, Northwestern University, Evanston, IL 60208 USA
‡ School of Mathematical Science, Fudan University, Yangpu, Shanghai 200433, China
♭ Adobe Research, Seattle, WA 98103, USA
§ Department of Statistics and Data Science, Northwestern University, Evanston, IL 60208 USA
We investigate the statistical and computational limits of latent Diffusion Transformers (DiTs) under the low-dimensional linear latent space assumption. Statistically, we study the universal approximation and sample complexity of the DiTs score function, as well as the distribution recovery property of the initial data. Specifically, under mild data assumptions, we derive an approximation error bound for the score network of latent DiTs, which is sub-linear in the latent space dimension. Additionally, we derive the corresponding sample complexity bound and show that the data distribution generated from the estimated score function converges toward a proximate area of the original one. Computationally, we characterize the hardness of both forward inference and backward computation of latent DiTs, assuming the Strong Exponential Time Hypothesis (SETH). For forward inference, we identify efficient criteria for all possible latent DiTs inference algorithms and showcase our theory by pushing the efficiency toward almost-linear time inference. For backward computation, we leverage the low-rank structure within the gradient computation of DiTs training for possible algorithmic speedup. Specifically, we show that such speedup achieves almost-linear time latent DiTs training by casting the DiTs gradient as a series of chained low-rank approximations with bounded error. Under the low-dimensional assumption, we show that the convergence rate and the computational efficiency are both dominated by the dimension of the subspace, suggesting that latent DiTs have the potential to bypass the challenges associated with the high dimensionality of initial data.
Contents
- 1 Introduction
- 2 Background
- 3 Statistical Rates of Latent DiTs with Subspace Data Assumption
- 4 Provably Efficient Criteria
- 5 Discussion and Conclusion
- A More Discussion on Low-Dimensional Linear Latent Space
- B Nomenclature Table
- C Related Works
- D Supplementary Theoretical Background
- E More Background and Auxiliary Lemmas: Universal Approximation of Transformers via Piecewise Approximation
- E.1 Piecewise-constant Function Approximates Compact-Supported Continuous Function
- E.2 Modified Transformer Approximates Piece-Wise Constant Function
- E.3 Standard Transformers Approximate Modified Transformers
- E.4 All Together: Standard Transformers Approximate Compact-Supported Continuous Functions
- E.5 Supplementary Proofs
- F Proofs of Section 3
- G Proofs of Section 4
1 Introduction
We investigate the statistical and computational limits of latent diffusion transformers (DiTs), assuming the data is supported on an unknown low-dimensional linear subspace. This analysis is not only practical but also timely. On one hand, DiTs have demonstrated revolutionary success in generative AI and digital creation by using Transformers as score networks (Esser et al., 2024; Ma et al., 2024; Chen et al., 2024; Mo et al., 2023; Peebles and Xie, 2023). On the other hand, they require significant computational resources (Liu et al., 2024), making them challenging to train outside of specialized industrial labs. Therefore, it is natural to ask whether it is possible to make them lighter and faster without sacrificing performance. Answering these questions requires a fundamental understanding of the DiT architecture. This work provides a timely theoretical analysis of the fundamental limits of DiT architecture, aided by the analytical feasibility provided by the low-dimensional data assumption.
Empirically, Latent Diffusion is a go-to design for effectiveness and computational efficiency (Rombach et al., 2022; Liu et al., 2021; Pope et al., 2021; Su and Wu, 2018). Theoretically, it is capable of hosting the assumption of a low-dimensional data structure (see Assumption 2.1 for a formal definition) for detailed analytical characterization (Chen et al., 2023a; Bortoli, 2022). In essence, diffusion models with low-dimensional data structures manifest a natural lower-dimensional diffusion process through an encoder/decoder within a robust and informative latent representation feature space (Rombach et al., 2022; Pope et al., 2021). Such lower-dimensional diffusion improves computational efficiency by reducing data complexity without sacrificing essential information (Liu et al., 2021). With this assumption, Chen et al. (2023a) decompose the score function of U-Net-based diffusion models into on-support and orthogonal components. This decomposition allows for the characterization of the distinct behaviors of the two components: the on-support component facilitates latent distribution learning, while the orthogonal component facilitates subspace recovery.
In our work, we utilize the low-dimensional data structure assumption to explore the statistical and computational limits of latent DiTs. Our analysis includes the characterizations of statistical rates and provably efficient criteria. Statistically, we pose two questions and provide a theory to characterize the statistical rates of latent DiTs under the assumption of low-dimensional data:
Question 1.
What is the approximation limit of using transformers to approximate the DiT score function, particularly in the low-dimensional data subspace?
Question 2.
How accurate is the estimation limit for such a score estimator in practical training scenarios? With the score estimator, how well can diffusion transformers recover the data distribution?
Computationally, the primary challenge of DiT lies in the transformer blocks’ quadratic complexity. This computational burden applies to both inference and training, even with latent diffusion. Thus, it is essential to design algorithms and methods that circumvent this quadratic dependence on the latent DiT sequence length. However, there are no formal results to support and characterize such algorithms. To address this gap, we pose the following questions and provide a fundamental theory to fully characterize the complexity of latent DiT under the low-dimensional linear subspace data assumption:
Question 3.
Is it possible to improve the time complexity with a bounded approximation error for both forward and backward passes? What is the computational limit for such an improvement?
Contributions.
We study the fundamental limits of latent DiT. Our contributions are threefold:
- Score Approximation. We address Question 1 by characterizing the approximation limit of matching the DiT score function with a transformer-based score estimator. Specifically, under mild data assumptions, we derive an approximation error bound for the score network, sub-linear in the latent space dimension (Theorem 3.1). These results not only explain the expressiveness of latent DiT (under mild assumptions) but also provide guidance for the structural configuration of the score network for practical implementations (Theorem 3.1).
- Score and Distribution Estimation. We address Question 2 by exploring the limitations of score and distribution estimations of latent DiTs in practical training scenarios. Specifically, we provide a sample complexity bound for score estimation (Corollary 3.1.1), using a norm-based covering number bound of the transformer architecture. Additionally, we show that the learned score estimator is able to recover the initial data distribution (Corollary 3.1.2).
- Provably Efficient Criteria and Existence of Almost Linear Time Algorithms. We address Question 3 by providing provably efficient criteria for latent DiTs in both forward inference and backward computation/training. For forward inference, we characterize all possible efficient DiT algorithms using a norm-based efficiency threshold for both conditional and unconditional generation (Proposition 4.1). Efficient algorithms, including almost-linear time algorithms (Proposition 4.2), are possible only below this threshold. For backward computation, we prove the existence of almost-linear time DiT training algorithms (Theorem 4.1) by utilizing the inherent low-rank structure in DiT gradients through a chained low-rank approximation.
Interestingly, both our statistical and computational results (C1-3) are dominated by the subspace dimension under the low-dimensional assumption, suggesting that latent DiT can potentially bypass the challenges associated with the high dimensionality of initial data.
Organization.
Section 2 includes background on score decomposition and Transformer-based score networks. Section 3 presents the statistical rates of DiTs. Section 4 provides provably efficient criteria. We defer discussions of related works to Appendix C due to space constraints.
Notations.
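We use lower-case letters to denote vectors, e.g., $z$; $\|z\|_2$ and $\|z\|_\infty$ denote its Euclidean norm and infinity norm, respectively. We use upper-case letters to denote matrices, e.g., $Z$; $\|Z\|_2$, $\|Z\|_{\mathrm{op}}$, and $\|Z\|_F$ denote the 2-norm, operator norm, and Frobenius norm, respectively. $\|Z\|_{p,q}$ denotes the $(p,q)$-norm, where the $p$-norm is over columns and the $q$-norm is over rows. Given a function $f$, let $\|f\|_{L^2} := (\int \|f(x)\|_2^2 \, dx)^{1/2}$ and $\|f\|_{\mathrm{Lip}} := \sup_{x \neq y} \|f(x) - f(y)\|_2 / \|x - y\|_2$. With a distribution $P$, we denote $\|f\|_{L^2(P)} := (\int \|f(x)\|_2^2 \, dP(x))^{1/2}$ as the $L^2(P)$ norm. Let $f_\sharp P$ be a pushforward measure, i.e., for any measurable $\Omega$, $(f_\sharp P)(\Omega) = P(f^{-1}(\Omega))$. We use $\psi$ for (conditional) Gaussian density functions.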
2 Background
This section reviews the ideas we built on, including an overview of diffusion models (Section 2.1), the score decomposition under the linear latent space assumption (Section 2.2), and the transformer backbone in DiT (Section 2.3).
2.1 Score-Matching Denoising Diffusion Models
We briefly review the forward process, the backward process, and score matching in diffusion models.
Forward and Backward Process.
In the forward process, diffusion models gradually add noise to the original data , and . Let denote the noisy data at time , with marginal distribution and density as and . The conditional distribution follows , where , , and is a nondecreasing weighting function. In practice, the forward process terminates at a large enough such that is close to . In the backward process, we obtain by reversing the forward process. The generation of depends on the score function . Since the score function is unknown in practice, we use a score estimator to replace , where is usually a neural network with parameters . See Section D.1 for the details.
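As a minimal sketch of the forward noising step described above, the following NumPy snippet draws noisy samples from a Gaussian transition kernel. It assumes the standard variance-preserving Ornstein-Uhlenbeck forward process with a constant weighting function equal to 1, so that the mean scaling is exp(-t/2) and the variance is 1 - exp(-t). The names `beta`, `sigma2`, and `forward_sample` are illustrative and not taken from the source.

```python
import numpy as np

def beta(t):
    """Mean scaling of the forward kernel (assumes constant weight w(t) = 1)."""
    return np.exp(-t / 2.0)

def sigma2(t):
    """Variance of the forward kernel under the same assumption."""
    return 1.0 - np.exp(-t)

def forward_sample(x0, t, rng):
    """Draw x_t ~ N(beta(t) * x0, sigma2(t) * I) for a batch x0 of shape (n, D)."""
    noise = rng.standard_normal(x0.shape)
    return beta(t) * x0 + np.sqrt(sigma2(t)) * noise
```

Under this assumption `beta(t)**2 + sigma2(t) == 1`, so data with unit variance keeps unit variance at every time, and the marginal approaches a standard Gaussian as `t` grows.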
Score Matching.
To estimate the score function, we use the following loss
where is the weight function, and is a small value to stabilize training and prevent the score function from blowing up (Vahdat et al., 2021). However, it is hard to compute with available data samples. Therefore, we minimize the equivalent denoising score matching objective
(2.1) |
where is the transition kernel, then .
To train the parameters in the score estimator , we use the empirical version of (2.1). We select i.i.d. data samples , and sample time uniformly from interval . Given , we sample from . The empirical loss is
(2.2) |
For convenience of notation, we denote population loss .
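The empirical objective can be sketched as a Monte Carlo estimate. The snippet below is a hedged illustration, not the authors' implementation: it assumes the same constant-weight Gaussian kernel as a standard variance-preserving process, samples one time per data point uniformly from `[t0, T]`, and regresses a candidate score onto the gradient of the log transition kernel. The function name `dsm_loss` and its arguments are hypothetical.

```python
import numpy as np

def dsm_loss(score_fn, x0, t0, T, rng):
    """Monte Carlo denoising score matching loss for a batch x0 of shape (n, D).

    Assumes the kernel x_t | x_0 ~ N(b(t) x_0, s2(t) I) with b = exp(-t/2)
    and s2 = 1 - exp(-t); t0 > 0 plays the role of the early-stopping time.
    """
    n = x0.shape[0]
    t = rng.uniform(t0, T, size=(n, 1))        # one time per sample
    b = np.exp(-t / 2.0)
    s2 = 1.0 - np.exp(-t)
    xt = b * x0 + np.sqrt(s2) * rng.standard_normal(x0.shape)
    target = -(xt - b * x0) / s2               # grad of log transition kernel
    resid = score_fn(xt, t) - target
    return float(np.mean(np.sum(resid ** 2, axis=1)))
```

As a sanity check, when every data point is the origin the true score is `-x / s2(t)`, and plugging it in drives this loss to zero, while a constant zero score does not.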
2.2 Score Decomposition in Linear Latent Space
In this part, we review the score decomposition in (Chen et al., 2023a). We assume that the -dimensional input data is supported on a -dimensional subspace, where .
Assumption 2.1 (Low-Dimensional Linear Latent Space).
Data point can be written as , where is an unknown matrix with orthonormal columns. The latent variable follows the distribution with a density function .
Remark 2.1.
By “Linear Latent Space,” we mean that each entry of a given latent vector is a linear combination of the corresponding input, i.e., . This is also known as the “low-dimensional data” assumption in the literature (Chen et al., 2023a).
Based on the low-dimensional data structure assumption, we have the following score decomposition theory: on-support score and orthogonal score .
Lemma 2.1 (Score Decomposition, Lemma 1 of (Chen et al., 2023a)).
Let data follow Assumption 2.1. The decomposition of score function is
(2.3) |
where , is the Gaussian density function of , and . We restate the proof in Section D.2 for completeness.
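The decomposition can be checked numerically in a toy case. The sketch below assumes the lemma takes the form used by Chen et al. (2023a, Lemma 1): an on-support term `B @ latent_score(B.T @ x)` plus an orthogonal term `-(I - B B^T) x / sigma2`. It further assumes a standard Gaussian latent and the constant-weight forward kernel, for which the full score is available in closed form. The names `B`, `latent_score`, and `sigma2` are illustrative.

```python
import numpy as np

def decomposed_score(x, B, latent_score, sigma2):
    """On-support score plus orthogonal score for data x = B h (B has orthonormal columns)."""
    D = B.shape[0]
    on_support = B @ latent_score(B.T @ x)
    orthogonal = -(np.eye(D) - B @ B.T) @ x / sigma2
    return on_support + orthogonal

def gaussian_full_score(x, B, t):
    """Closed-form score of x_t when the latent h is standard Gaussian (toy assumption)."""
    D = B.shape[0]
    beta2, sigma2 = np.exp(-t), 1.0 - np.exp(-t)
    cov = beta2 * B @ B.T + sigma2 * np.eye(D)   # covariance of x_t
    return -np.linalg.solve(cov, x)
```

In this toy case the noisy latent stays standard Gaussian, so its score is simply `lambda h: -h`, and the two functions agree up to floating-point error.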
Additionally, our theoretical analysis relies on the following two assumptions, as in (Chen et al., 2023a).
Assumption 2.2 (Tail Behavior of ).
The density function is twice continuously differentiable. Moreover, there exist positive constants such that when , the density function .
Assumption 2.3 (-Lipschitz of ).
The on-support score function is -Lipschitz in for any .
2.3 Score Network and Transformers
In this part, we introduce the score network architecture and Transformers. Transformers are the backbone of the score network in DiT. By Assumption 2.1, with .
(Latent) Score Network.
Following (Chen et al., 2023a), we rearrange (2.3) into
(2.4) |
We use to approximate , and a neural network to approximate . We adopt the following score network class for diffusion in latent space (i.e., in )
(2.5) |
where the columns in are orthogonal, is a neural network. In our work, we focus on the diffusion transformers (DiTs), i.e., using Transformer for (Peebles and Xie, 2023).
Transformers.
A Transformer block consists of a self-attention layer and a feed-forward layer, with both layers having skip connection. We use to denote a Transformer block. Here and are the number of heads and head size in self-attention layer, and is the hidden dimension in feed-forward layer. Let be the model input, then we have the model output
(2.6) | ||||
(2.7) |
where .
In our work, we use Transformer networks with positional encoding . We define the Transformer networks as the composition of Transformer blocks
For example, the following is a Transformer network consisting of blocks and positional encoding
(2.8) |
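The block structure described in this section can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the paper's exact parameterization: it follows the column-token convention of Yun et al. (2020), with input of shape `(d, L)`, column-wise softmax, ReLU feed-forward layers, and skip connections on both layers. The weight names and tuple layouts are hypothetical.

```python
import numpy as np

def softmax_columns(S):
    """Column-wise softmax with the usual max-shift for numerical stability."""
    S = S - S.max(axis=0, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=0, keepdims=True)

def attention_layer(Z, heads):
    """Multi-head self-attention with skip connection; Z has shape (d, L)."""
    out = Z.copy()
    for W_O, W_V, W_K, W_Q in heads:           # shapes (d,s), (s,d), (s,d), (s,d)
        scores = (W_K @ Z).T @ (W_Q @ Z)       # (L, L)
        out = out + W_O @ (W_V @ Z) @ softmax_columns(scores)
    return out

def feed_forward_layer(Z, W1, b1, W2, b2):
    """Token-wise two-layer ReLU network with skip connection."""
    hidden = np.maximum(W1 @ Z + b1[:, None], 0.0)
    return Z + W2 @ hidden + b2[:, None]

def transformer_block(Z, heads, ff):
    return feed_forward_layer(attention_layer(Z, heads), *ff)

def transformer_network(Z, E, blocks):
    """Add positional encoding E, then compose the blocks in order."""
    H = Z + E
    for heads, ff in blocks:
        H = transformer_block(H, heads, ff)
    return H
```

Because both layers carry skip connections, a network whose weights are all zero returns `Z + E` unchanged, which is a convenient sanity check.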
3 Statistical Rates of Latent DiTs with Subspace Data Assumption
In this section, we analyze the statistical rates of latent DiTs. Section 3.1 introduces the class of latent DiT score networks. In Section 3.2, we prove the approximation limit of matching the DiT score function with the score network class, and characterize the structural configuration of the score network when a specified approximation error is required. Following this, in Section 3.3, utilizing the characterized structural configuration, we prove the score and distribution estimation for latent DiTs.
3.1 DiT Score Network Class
In this part, we give the details of the DiT score network class used in our analysis. In (2.5), is a network with Transformer as the backbone, and denotes the input data. Following (Peebles and Xie, 2023), DiT uses the time point to calculate the scale and shift values in the Transformer backbone, and it transforms an input image into a sequential version. To achieve the transformation, we introduce a reshape layer.
Definition 3.1 (DiT Reshape Layer ).
Let be a reshape layer that transforms the -dimensional input into a matrix. Specifically, for any image input, converts it into a sequence representation with feature dimension (where ) and sequence length . Besides, we define the corresponding reverse reshape (flatten) layer as the inverse of . By , are associative w.r.t. their input.
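A reshape layer and its inverse can be sketched as below. The snippet assumes the latent dimension equals the product of the feature dimension `d` and the sequence length `L`, and that consecutive entries of the latent vector form one token. The token ordering is an assumption made for illustration, since any fixed bijection would serve the same purpose.

```python
import numpy as np

def reshape_layer(h, d, L):
    """Map a flat latent vector of length d * L to a (d, L) token matrix."""
    h = np.asarray(h)
    if h.size != d * L:
        raise ValueError("latent dimension must equal d * L")
    return h.reshape(L, d).T       # column j holds token j

def reverse_reshape_layer(Z):
    """Flatten a (d, L) token matrix back to a vector; inverse of reshape_layer."""
    return np.asarray(Z).T.reshape(-1)
```

For example, `reverse_reshape_layer(reshape_layer(h, d, L))` returns `h` for any vector `h` of length `d * L`.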
To simplify the self-attention block in (2.6), let and .
Definition 3.2 (Transformer Network Class ).
We define the Transformer network class as
- Model architecture with blocks: ;
- Model output bound: ;
- Parameter bound in : , , , , ;
- Parameter bound in : ;
- Lipschitz of : .
Definition 3.3 (DiT Score Network Class ).
We denote as the DiT score network class in (2.5), replacing with , and is from the Transformer class .
3.2 Score Approximation of DiT
Here, we explore the approximation limit of the latent DiT score network class under the linear latent space assumption. Recall that is the distribution of , is the variance of , is the dimension of the latent space, is the sequence length of the transformer input, is the stopping time in the forward process, is the early stopping time in the backward process, and is the Lipschitz coefficient of the on-support score function. Then we have the following Theorem 3.1.
Theorem 3.1 (Score Approximation of DiT).
For any approximation error and any data distribution under Assumptions 2.1, 2.2 and 2.3, there exists a DiT score network from (defined in Definition 3.2), where , such that for any , we have:
where , and the upper bound of hyperparameters in are
Proof Sketch.
Our proof is built on the key observation that there is a tail behavior of the low-dimensional latent variable distribution (Assumption 2.2). Recall that , where (defined in (2.4)). By taking , our aim reduces to constructing a transformer network to approximate . To achieve this, we first approximate with a compact-supported continuous function, based on the tail behavior of . Then we construct a transformer to approximate the compact-supported continuous function using the universal approximation capacity of transformers (Yun et al., 2020). See Section F.1 for a detailed proof. ∎
Intuitively, Theorem 3.1 indicates the capability of the transformer-based score network to approximate the score function with precise guarantees. Furthermore, Theorem 3.1 provides empirical guidance for the design choices of the score network when a specified approximation error is required.
Remark 3.1 (Comparing with Existing Works).
Theoretical analysis of DiTs is limited. Previous works that do not specify the model architecture assume that the score estimator is well-approximated (Benton et al., 2024; Wibisono et al., 2024). To the best of our knowledge, this work is the first to present an approximation theory for DiTs, offering the estimation theory in Corollaries 3.1.1 and 3.1.2 based on the estimated score network, rather than a perfectly trained one.
Remark 3.2 (Latent Dimension Dependency).
Theorem 3.1 suggests that the approximation capacity and Transformer network size primarily depend on the latent variable dimension . This indicates that DiTs can potentially bypass the challenges associated with the high dimensionality of initial data by transforming input data into a low-dimensional latent variable.
3.3 Score Estimation and Distribution Estimation
Besides score approximation capability, Theorem 3.1 also characterizes the structural configuration of the score network for any specific precision, e.g., , etc. This characterization enables further analysis of the performance of the score network in practical scenarios. In Corollary 3.1.1, we provide a sample complexity bound for score estimation. In Corollary 3.1.2, we show that the learned score estimator is able to recover the initial data distribution.
Score Estimation.
To derive a sample complexity for score estimation using , we rewrite the score matching objective in (2.2) as .
Corollary 3.1.1 shows that as sample size , converges to .
Corollary 3.1.1 (Score Estimation of DiT).
Under Assumptions 2.1, 2.2 and 2.3, we choose as in Theorem 3.1 using and . With probability , we have
(3.1) |
where hides the factor about .
Proof.
See Section F.2 for a detailed proof. ∎
Intuitively, Corollary 3.1.1 shows a sample complexity bound for score estimation in practice.
Remark 3.3 (Comparing with Existing Works).
(Zhu et al., 2023) provides a sample complexity for simple ReLU-based diffusion models under the assumption of an accurate score estimator. To the best of our knowledge, we are the first to provide a sample complexity for DiTs, based on the learned score network in Theorem 3.1 and the quantization (piece-wise approximation) approach for transformer universality (Yun et al., 2020).
Remark 3.4.
Corollary 3.1.1 reports an explicit result on sample complexity bounds for score estimation of latent DiTs: a double exponential factor in the first term. We remark that this arises because the required depth is , and the norm of the required weight parameters is , as shown in Theorem 3.1, assuming the universality of transformers requires dense layers (Yun et al., 2020). This motivates us to rethink transformer universality and explore new proof techniques for DiTs, which we leave for future work.
Definition 3.4.
For later convenience, we define .
Distribution Estimation.
In practice, DiTs generate data using the discretized version with step size , see Section D.1 for details. Let be the distribution generated by in Corollary 3.1.1. Let and be the distribution and density function of on-support latent variable at . We have the following results for distribution estimation.
Corollary 3.1.2 (Distribution Estimation of DiT, Modified From Theorem 3 of (Chen et al., 2023a)).
Let , where is the minimum eigenvalue of . With the estimated DiT score network in Corollary 3.1.1, we have the following with probability .
- (i) The accuracy to recover the subspace is .
- (ii) denotes the pushforward distribution. Under the conditions , and step size , there exists an orthogonal matrix such that we have the following upper bound for the total variation distance
  (3.2)
  where hides the factor about , and .
- (iii) For the generated data distribution , the orthogonal pushforward is , where for a constant .
Proof.
See Section F.3 for a detailed proof. ∎
Intuitively, Corollary 3.1.2 shows the estimation results including 3 parts: (i) The accuracy to recover the subspace . (ii) The estimation error between and . (iii) The vanishing behavior of in the orthogonal space. These three parts indicate that the learned score estimator is capable of recovering the initial data distribution. Notably, Corollary 3.1.2 is agnostic to details of .
Remark 3.5 (Comparing with Existing Works).
Oko et al. (2023) analyze the distribution estimation under the assumption that the initial density is supported on and smooth in the boundary. Our Assumption 2.2 demonstrates greater practical relevance. This suggests that our method of distribution estimation aligns more closely with empirical realities.
Remark 3.6 (Subspace Recovery Accuracy).
Part (i) of Corollary 3.1.2 confirms that the subspace is learned by DiTs. The error is proportional to the sample complexity for score estimation and depends on the minimum eigenvalue of the covariance of .
4 Provably Efficient Criteria
Here, we analyze the computational limits of latent DiTs under low-dimensional linear subspace data assumption (i.e., Assumption 2.1). The hardness of DiT models ties to both forward and backward passes of the score network in Definition 3.3. We characterize them separately.
4.1 Computational Limits of Backward Computation
Following Section 2, suppose we have i.i.d. data samples , and time uniformly sampled from . For each data , we sample from . Let be the inverse transformation of , and denote . We rewrite the empirical denoising score-matching loss (2.2) as
(4.1) |
For efficiency, it suffices to focus on just transformer attention heads of the DiT score network due to their dominating quadratic time complexity in both passes. Thus, we consider only a single layer attention for , to simplify our analysis. Further, we consider the following simplifications:
- (S0) To prove the hardness of (4.1) for both full gradient descent and stochastic mini-batch gradient descent methods, it suffices to consider training on a single data point.
- (S1) For the convenience of our analysis, we consider the following expression for the attention mechanism. Let . Let be attention weights such that , and . We write the attention mechanism of hidden size and sequence length as
  (4.2)
  with . Here, is the entry-wise exponential function, i.e., for any matrix , converts a vector into a diagonal matrix with the vector’s entries on the diagonal, and is the length- all-ones vector.
- (S2) Since multiplication is linear in weight while - multiplication is exponential in weights, we only need to focus on the gradient update of - multiplication. Therefore, for efficiency analysis of the gradient, it is equivalent to analyze a reduced problem with fixed .
- (S3) To focus on the DiT, we consider the low-dimensional linear encoder to be pretrained and to not participate in gradient computation. This aligns with common practice (Rombach et al., 2022) and is justified by the trivial computation cost due to the linearity of . (Footnote 1: The gradient computation is linear in and hence the computation w.r.t. is cheap and upper-bounded by time in a straightforward way.)
- (S4)
Therefore, we simplify the objective of training DiT into
Definition 4.1 (Training Generic DiT Loss).
Given and following (S4), training a DiT with loss on a single data point is formulated as
(4.4) |
Here .
Remark 4.1 (Conditional and Unconditional Generation).
is generic. If , Definition 4.1 reduces to cross-attention in DiT score net (for conditional generation). If , Definition 4.1 reduces to self-attention in DiT score net (for unconditional vanilla generation).
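The attention expression in (S1), a normalized entry-wise exponential of a bilinear score followed by a value projection, can be sketched as below. This is a hedged illustration with hypothetical argument names: `W` stands for the combined query-key weights and `W_ov` for the combined output-value weights, and the three inputs are kept separate so that one function covers both cases of Remark 4.1.

```python
import numpy as np

def attention(X_query, X_key, X_value, W, W_ov):
    """Compute D^{-1} exp(X_query W X_key^T) X_value W_ov.

    exp is entry-wise, and D = diag(exp(.) 1) normalizes each row.
    Passing the same matrix three times gives self-attention; passing a
    different query input gives a cross-attention-style computation.
    """
    A = np.exp(X_query @ W @ X_key.T)          # (L_query, L_key), entry-wise exp
    D = np.diag(A @ np.ones(A.shape[1]))       # row sums on the diagonal
    # inv(D) mirrors the formula; dividing each row by its sum is cheaper in practice.
    return np.linalg.inv(D) @ A @ X_value @ W_ov
```

Forming `A` explicitly costs time quadratic in the sequence length, which is the bottleneck the efficiency analysis in this section targets.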
We introduce the next problem to characterize all possible gradient computations of optimizing (4.4).
Problem 1 (Approximate DiT Gradient Computation ()).
Given . Let . Assume all numerical values are in -bits encoding. Let loss function follow Definition 4.1. The problem of approximating gradient computation of optimizing empirical DiT loss (4.4) is to find an approximated gradient matrix such that . Here, for any matrix .
In this work, we aim to investigate the computational limits of all possible efficient algorithms of ADiTGC with . Yet, the explicit gradient of DiT denoising score matching loss (4.4) is too complicated to characterize ADiTGC. To combat this, we make the following observations.
- (O1) Let , , and such that for (with ).
- (O2) Vectorization of . For ease of presentation, we use notation flexibly, letting denote both a matrix in and a vector in in the following analysis. This practice does not affect correctness. The context in which is used should clarify whether it refers to a matrix or a vector. Explicit vectorization follows Definition D.1.
- (O3) Linearity of . By linearity of , we treat as a matrix in acting on vector .
Therefore, we have , such that its gradient involves . From above, we only need to focus on proving the computation time and error control of term for gradient w.r.t . Luckily, with tools from fine-grained complexity theory (Alman and Song, 2023) and tensor trick (see Section D.3), we prove the existence of almost-linear time algorithms for Problem 1 in the next theorem. Let for any matrix following Definition D.1.
Theorem 4.1 (Existence of Almost-Linear Time Algorithms for ADiTGC).
Suppose all numerical values are in -bits encoding. Let . There exists a time algorithm to solve (i.e., Problem 1) with loss from Definition 4.1 up to accuracy. In particular, this algorithm outputs gradient matrices such that .
Proof Sketch.
Our proof is built on the key observation that there exist low-rank structures within the DiT training gradients. Using the tensor trick (Diao et al., 2019, 2018) and computational hardness results of attention (Hu et al., 2024c; Alman and Song, 2023), we approximate DiT training gradients with a series of low-rank approximations and carefully match the multiplication dimensions so that the computation of forms a chained low-rank approximation. We complete the proof by demonstrating that this approximation is bounded by a error and requires only almost-linear time. See Section G.2 for a detailed proof. ∎
Remark 4.2.
We remark that Theorem 4.1 is dominated by the relation between and , hence by the subspace dimension (see Assumption 2.1). A smaller makes Theorem 4.1 more likely to hold.
4.2 Computational Limits of Forward Inference
Since the inference of score-matching diffusion models is a forward pass of the trained score estimator , the computational hardness of DiT ties to the transformer-based score network,
(4.5) |
following notation in Definition 4.1. For inference, we study the following approximation problem. Notably, by Remark 4.1, (4.5) subsumes both conditional and unconditional DiT inferences.
Problem 2 (Approximate DiT Inference ).
Let and . Given , and with guarantees that , and , we aim to study an approximation problem , that approximates with a vector (with ) such that . Here, for any matrix .
By (O2) and (O3), we observe that Problem 2 is just a special case of (Alman and Song, 2023). Hence, we characterize all possible efficient algorithms for ADiTI with the next proposition.
Proposition 4.1 (Norm-Based Efficiency Phase Transition).
Let , and with . Assuming SETH (Hypothesis 1), for every , there are constants such that: there is no -time (sub-quadratic) algorithm for the problem .
Remark 4.3.
Proposition 4.1 suggests an efficiency threshold for the upper bound of , , . Only below this threshold are efficient algorithms for Problem 2 possible.
Moreover, there exist almost-linear time DiT inference algorithms following (Alman and Song, 2023).
Proposition 4.2 (Almost-Linear Time DiT Inference).
Assuming SETH, the DiT inference problem can be solved in time.
Remark 4.4.
Proposition 4.2 is a special case of Proposition 4.1 under the efficiency threshold.
Remark 4.5.
Propositions 4.2 and 4.1 are dominated by the relation between and , hence by the subspace dimension . A smaller makes Propositions 4.2 and 4.1 more likely to hold.
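To give some intuition for why bounded entries make sub-quadratic attention possible, the following sketch replaces the entry-wise exponential with a degree-2 Taylor polynomial, which factors through an explicit feature map. This is an illustrative stand-in and not the polynomial construction of Alman and Song (2023). The function names and the choice of degree are assumptions made for the example.

```python
import numpy as np

def taylor_features(X):
    """Degree-2 feature map with phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2."""
    L, d = X.shape
    second = np.einsum("li,lj->lij", X, X).reshape(L, d * d) / np.sqrt(2.0)
    return np.concatenate([np.ones((L, 1)), X, second], axis=1)

def exact_attention(Q, K, V):
    """Reference computation that forms the full (L, L) matrix exp(Q K^T)."""
    A = np.exp(Q @ K.T)
    return (A @ V) / A.sum(axis=1, keepdims=True)

def low_rank_attention(Q, K, V):
    """Approximate attention without forming any (L, L) matrix."""
    PQ, PK = taylor_features(Q), taylor_features(K)
    numer = PQ @ (PK.T @ V)          # (L, r) @ (r, d), with r = 1 + d + d^2
    denom = PQ @ PK.sum(axis=0)      # approximate row sums of exp(Q K^T)
    return numer / denom[:, None]
```

For fixed feature dimension the cost of `low_rank_attention` grows linearly with the sequence length. The approximation is accurate only when the inner products between rows of `Q` and `K` are small, which loosely mirrors the role of the norm bound in Proposition 4.1.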
5 Discussion and Conclusion
We explore the fundamental limits of latent DiTs with 3 key contributions. First, we prove that transformers are universal approximators for the score functions in DiTs (Theorem 3.1), with approximation capacity and model size dependent only on the latent dimension, suggesting DiTs can handle high-dimensional data challenges. Second, we show that Transformer-based score estimators converge to the true score function (Corollary 3.1.1), ensuring the generated data distribution closely approximates the original (Corollary 3.1.2). Third, we provide provably efficient criteria (Proposition 4.1) and prove the existence of almost-linear time algorithms for forward inference (Proposition 4.2) and backward computation (Theorem 4.1). These results highlight the potential of latent DiTs to achieve both computational efficiency and robust performance in practical scenarios.
Limitations and Future Direction. As discussed in Remark 3.4, the double exponential factor in our explicit sample complexity bound (Corollary 3.1.1) suggests a possible gap in our understanding of transformer universality and its interplay with the DiT architecture. This motivates us to rethink transformer universality and explore new proof techniques for DiTs, which we leave for future work. Besides, due to its formal nature, this work does not provide immediate practical implementations. However, we expect that our findings provide valuable insights for future diffusion generative models.
Broader Impact
This theoretical work aims to shed light on the foundations of diffusion generative models and is not anticipated to have negative social impacts.
Acknowledgments
JH would like to thank Minshuo Chen, Sophia Pi, Yibo Wen, Tim Tsz-Kit Lau, Chenwei Xu, Dino Feng and Andrew Chen for enlightening discussions on related topics, and the Red Maple Family for support.
JH is partially supported by the Walter P. Murphy Fellowship. HL is partially supported by NIH R01LM1372201. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.
Appendix
Appendix A More Discussion on Low-Dimensional Linear Latent Space
Our analysis is based on the low-dimensional linear latent space assumption. Here we give a further discussion of it in light of our theoretical results.
The low-dimensional data structure in Assumption 2.1 indicates a robust and informative latent representation feature space. Besides, it improves computational efficiency by reducing data complexity without sacrificing essential information. This is consistent with the analysis in our work. Similar to the results under Assumption 2.1 (), it is easy to see that our theoretical results hold in two other settings: and .
• Statistically, for score approximation, score estimation, and distribution estimation, the upper bound depends on the dimension of the latent variable , rather than . A smaller allows for a reduced model size to achieve a specified approximation error compared to a larger one (Theorem 3.1). Additionally, with a smaller , both score and distribution estimation errors are reduced relative to scenarios with a larger one (Corollary 3.1.1 and Corollary 3.1.2).
• Computationally, a smaller benefits the provably efficient criteria (Proposition 4.1) and the almost-linear time algorithms for forward inference (Proposition 4.2) and backward computation (Theorem 4.1).
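As a small illustration of the low-dimensional linear latent structure discussed above, the following sketch builds data on a linear subspace and checks that projecting back recovers the latent variable. The names `D`, `d0`, `B`, and `h` are illustrative assumptions for the ambient dimension, the latent dimension, a matrix with orthonormal columns, and the latent variable; the Gaussian latent distribution is chosen only for convenience.

```python
import numpy as np

def sample_linear_latent(n, D, d0, seed=0):
    """Draw n points x = B h on a d0-dimensional subspace of R^D (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    # Reduced QR of a Gaussian matrix gives a D x d0 matrix with orthonormal columns.
    B, _ = np.linalg.qr(rng.standard_normal((D, d0)))
    h = rng.standard_normal((n, d0))   # latent variables, one per row
    x = h @ B.T                        # ambient data, one per row
    return x, h, B

x, h, B = sample_linear_latent(n=100, D=64, d0=4)
h_rec = x @ B   # applying B^T recovers h because B^T B = I
```

Although each data point has 64 coordinates, the data matrix has rank 4, which is the sense in which quantities tied to the latent dimension can be much smaller than those tied to the ambient one.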
Appendix B Nomenclature Table
We summarize our notations in the following table for easy reference.
Symbol | Description |
Euclidean norm, where is a vector | |
Infinity norm, where is a vector | |
2-norm, where is a matrix | |
Operator norm, where is a matrix | |
Frobenius norm, where is a matrix | |
-norm, where is a matrix | |
-norm, where is a function | |
-norm, where is a function and is a distribution | |
Lipschitz-norm, where is a function | |
Pushforward measure, where is a function and is a distribution | |
Sample size | |
Data point in original data space, | |
Latent variable in low-dimensional subspace, | |
The density function of | |
The matrix with orthonormal columns to transform to , where | |
Stopping time in forward process of Diffusion model | |
Stopping time in backward process of Diffusion model | |
Discretized step size in backward process | |
The density function of for at time | |
The density function of at time | |
(Conditional) Gaussian density function | |
Input dimension of each token in the Transformer network of DiT | |
Token length in the Transformer network of DiT | |
Sequence input of Transformer network in DiT, where | |
Position encoding, where | |
Reshape layer in DiT, | |
The orthonormal matrix to approximate , where |
Appendix C Related Works
Diffusion (Ho et al., 2020) and score-based generative models (Song and Ermon, 2019) have been particularly successful as generative models of images, video and biomedical data (Nichol et al., 2021; Ramesh et al., 2022; Liu et al., 2024; Zhou et al., 2024a, b; Wang et al., 2024a, b). There are two popular research directions in this area. Empirically, diffusion transformers (DiTs) (Peebles and Xie, 2023) have emerged as a significant advancement, effectively combining the strengths of transformer architectures and diffusion-based approaches. Theoretically, the development of the approximation theory for diffusion models supports their practical success, providing a theoretical framework for understanding and enhancing their effectiveness in various applications (Chen et al., 2023a).
Organization.
In the following, we first discuss recent developments in DiTs. Then, we discuss the main technique of our statistical results: the universality (universal approximation) of transformer. Next, we discuss recent theoretical developments in diffusion generative models. Lastly, we discuss other aspects of transformer in foundation models beyond diffusion models.
Diffusion Transformers.
Recently, transformer-based diffusion models have garnered significant attention in research. The U-ViT model (Bao et al., 2022) incorporates transformer blocks into a U-net architecture, treating all inputs as tokens. In contrast, DiT (Peebles and Xie, 2023) utilizes a straightforward, non-hierarchical transformer structure. Models like MDT (Gao et al., 2023a) and MaskDiT (Zheng et al., 2023) improve the training efficiency of DiT by applying a masking strategy.
Universality and Memory Capacity of Transformers.
The universality of transformers refers to their ability to serve as universal approximators. This means that transformers can theoretically model any sequence-to-sequence function to a desired degree of accuracy. Yun et al. (2020) establish that transformers can universally approximate sequence-to-sequence functions by stacking numerous layers of feed-forward functions and self-attention functions. In a different approach, Jiang and Li (2023) affirm the universality of transformers by utilizing the Kolmogorov-Arnold representation theorem. Most recently, Kajitsuka and Sato (2023) show that transformers with one self-attention layer are universal approximators.
The memory capacity of a transformer is a practical measure to test the theoretical results of the transformer’s universality, by ensuring the model can handle necessary context and dependencies. By memory capacity, we refer to the minimal set of parameters such that the model (i.e., transformer) approximates all input-output pairs in the training dataset with a bounded error. Several works address the memory capacity of transformers. Kim et al. (2022) show that transformers with parameters are sufficient to memorize length- and dimension- sequence-to-sequence data points by constructing a contextual mapping with attention layers. Mahdavi et al. (2023) show that a multi-head-attention with heads is able to memorize examples under a linear independence data assumption. Kajitsuka and Sato (2023) show that a single layer transformer with parameters is able to memorize length- and dimension- sequence-to-sequence data points by utilizing the connection between the softmax function and Boltzmann operator. Wang et al. (2023) extend the results of (Yun et al., 2020) to prompt tuning and discuss the memorization of only the last token of each data sequence. Another line of research establishes a different kind of memory capacity for transformers by connecting transformer attention with dense associative memory models (modern Hopfield models) (Hu et al., 2024a, b, c, 2023; Wu et al., 2024a, b; Ramsauer et al., 2020). Notably, they define memory capacity as the smallest number of (length- and dimension-) data points the model (transformer attention) is able to store and derive exponential-in- high-probability capacity lower bounds.
Our work is motivated by and builds on (Yun et al., 2020) to bridge the transformer’s function approximation ability with data distribution estimation. While we do not address the memorization of DiTs (or diffusion models in general), recent studies on dense associative models suggest viewing pretrained diffusion generative models as associative memory models (Hoover et al., 2023; Ambrogioni, 2023). We plan to explore this aspect in future work.
Theories of Diffusion Models.
In addition to empirical success, there have been several theoretical analyses of diffusion models. Chen et al. (2023a) study score approximation, estimation, and distribution recovery of U-Net based diffusion models. Benton et al. (2024) provide convergence bounds linear in data dimensions, assuming accurate score function approximation. Zhu et al. (2023); Wibisono et al. (2024) provide statistical sample complexity bounds for score-matching under similar assumptions. Oko et al. (2023) analyze the distribution estimation under the assumption that the initial density is supported on and smooth in the boundary.
Among these works, our work is built on and closest to (Chen et al., 2023a), as both assume the data has a low-dimensional structure. However, our work differs in three key aspects. First, beyond the simple ReLU networks considered in (Chen et al., 2023a), we provide the first score approximation analysis for DiTs with a transformer-based score estimator. Second, our work is the first to provide the statistical rates of DiTs (score and distribution estimation) based on transformer universality (Yun et al., 2020) and norm-based covering number bound (Edelman et al., 2022), supporting the practical success of DiTs (Esser et al., 2024; Ma et al., 2024). Lastly, our work provides the first comprehensive analysis of the computational limits and all possible efficient DiT algorithms/methods for both forward inference and backward training. This offers timely insights into the empirical computational inefficiency of DiTs (Liu et al., 2024) and guidance for future DiT architectures.
Transformers in Foundation Models: Transformer-Based Pretrained Models.
Transformer-based pretrained models utilize attention mechanisms to process sequential data, enabling the learning of contextual relationships for tasks like natural language understanding and generation. These models encompass three types: encoder-based, decoder-based, and diffusion transformers. Encoder-based transformers, such as DNABERT (Zhou et al., 2024c, 2023; Ji et al., 2021), employ bidirectional attention to extract feature representations. DNABERT shows great potential to capture complex patterns of genome sequences and improve tasks such as gene prediction. Decoder-based transformers generate output sequences from encoded information using unidirectional attention, such as ChatGPT (Lagler et al., 2013; Floridi and Chiriatti, 2020; Brown et al., 2020) for natural language. The diffusion transformers generate a sequence toward a target distribution, such as Sora (Liu et al., 2024) and Videofusion (Luo et al., 2023) for video generation and DecompDiff (Guan et al., 2024) for drug design. In our paper, we present an early exploration of the statistical and computational limits of diffusion transformer models.
Appendix D Supplementary Theoretical Background
In this section, we provide further background. We detail the forward and backward processes of diffusion models in Section D.1, and we give the proof of the score decomposition in Section D.2.
D.1 Diffusion Models
Forward Process.
Diffusion models gradually add noise to the original data in the forward process. We describe the forward process as the following SDE
(D.1) |
where , is a standard Brownian motion, and is a nondecreasing weighting function. Let and denote the marginal distribution and density of . The conditional distribution follows , where and . In practice, (D.1) terminates at a large enough such that is close to .
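The closed-form conditional distribution of the forward process can be sketched as follows. This is a minimal illustration that assumes the variance-preserving form common in this line of work, with a constant weighting function `g`, mean scaling `alpha(t) = exp(-g t / 2)`, and conditional variance `1 - alpha(t)^2`; these specific formulas and function names are assumptions, not a restatement of (D.1).

```python
import numpy as np

def alpha(t, g=1.0):
    """Mean scaling alpha(t) = exp(-0.5 * g * t) for a constant weight g (assumed form)."""
    return np.exp(-0.5 * g * t)

def noise_var(t, g=1.0):
    """Conditional variance 1 - alpha(t)^2 under the same assumption."""
    return 1.0 - alpha(t, g) ** 2

def forward_sample(x0, t, rng, g=1.0):
    """Sample X_t given X_0 = x0 from N(alpha(t) x0, noise_var(t) I)."""
    z = rng.standard_normal(np.shape(x0))
    return alpha(t, g) * x0 + np.sqrt(noise_var(t, g)) * z

rng = np.random.default_rng(0)
x_t = forward_sample(np.ones(8), t=2.0, rng=rng)
```

Under this assumed form, the mean scaling decays to zero and the variance approaches one as the time grows, which matches the statement that the terminal marginal is close to a standard Gaussian.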
Backward Process.
We obtain the backward process by reversing (D.1). The backward process satisfies
where the score function is the gradient of the log probability density function of , and is a reversed Brownian motion. However, and are both unknown in (D.1). To resolve this, we first use a score estimator to replace , where is usually a neural network with parameters . Second, we replace by the standard Gaussian distribution. Consequently, we obtain the following SDE
(D.2) |
In practice, we use discrete schemes of (D.2) to generate data, following (Song and Ermon, 2019). We use to denote the discretization step size, and for , we have
(D.3) |
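The discretized backward process can be sketched with an Euler-Maruyama scheme. The example below is a hedged illustration, not the paper's exact scheme: it assumes a constant weighting function equal to one, a reverse drift of the form `0.5 * x + score(x, t)`, and a toy Gaussian data distribution whose score is known in closed form, so no neural score estimator is involved. The names `backward_sample`, `gaussian_score`, `t0`, and `eta` are illustrative.

```python
import numpy as np

def gaussian_score(x, t, sigma0_sq):
    """Exact score of the time-t marginal when the data is N(0, sigma0_sq I) and g = 1."""
    a2 = np.exp(-t)                       # alpha(t)^2 under the assumed forward process
    var_t = a2 * sigma0_sq + (1.0 - a2)   # marginal variance at forward time t
    return -x / var_t

def backward_sample(score, T, t0, eta, n, dim, seed=0):
    """Euler-Maruyama discretization of the reverse SDE, started from N(0, I)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, dim))     # standard Gaussian replaces the unknown P_T
    num_steps = int(round((T - t0) / eta))
    for k in range(num_steps):
        t = T - k * eta                   # forward time matching backward time k * eta
        drift = 0.5 * x + score(x, t)
        x = x + eta * drift + np.sqrt(eta) * rng.standard_normal((n, dim))
    return x

samples = backward_sample(lambda x, t: gaussian_score(x, t, 0.25),
                          T=5.0, t0=0.01, eta=0.01, n=20000, dim=1)
```

With the exact score, the empirical variance of `samples` should land near the data variance 0.25, up to the early-stopping time `t0`, the step size `eta`, and sampling noise. Replacing `gaussian_score` with a learned estimator is where the score estimation error analyzed in this paper would enter.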
D.2 Proof of Lemma 2.1
Here we restate the proof of (Chen et al., 2023a, Lemma 1) for completeness.
Proof.
Recall by Assumption 2.1 with , and .
Then we write the score function as
(By log-derivative) | ||||
(By plugging in ) | ||||
(By interchanging with ) |
where the last equality holds since is continuously differentiable in .
Plugging (D.5) into ((By log-derivative)), we have
We then decompose the above score function by projecting of into , i.e., replacing with :
Absorbing the factor of into the Gaussian kernel , we have
To further simplify , we decompose as
( is in while is orthogonal to ) | ||||
(since has orthonormal columns) |
where both and are Gaussian.
Plugging into , we obtain
where .
Notably, depends only on the projected data . Therefore, we are able to replace with . The benefit is that the dimension of the first input in is much smaller.
Lastly, by denoting such that , we arrive at
() |
This completes the proof. ∎
D.3 Preliminaries: Strong Exponential Time Hypothesis (SETH) and Tensor Trick
Here we present the ideas we built upon for Section 4.
Strong Exponential Time Hypothesis (SETH). Impagliazzo and Paturi (2001) introduce the Strong Exponential Time Hypothesis (SETH) as a stronger form of the $\mathsf{P} \neq \mathsf{NP}$ conjecture. It suggests that our current best algorithms are optimal and is a popular conjecture for proving fine-grained lower bounds for a wide variety of algorithmic problems (Cygan et al., 2016; Williams, 2018).
Hypothesis 1 (SETH).
For every $\epsilon > 0$, there is a positive integer $k \geq 3$ such that $k$-$\mathsf{SAT}$ on formulas with $n$ variables cannot be solved in $\mathcal{O}(2^{(1-\epsilon)n})$ time, even by a randomized algorithm.
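For intuition about the baseline that SETH refers to, the sketch below checks satisfiability by exhaustive search over all assignments, which takes time exponential in the number of variables. The function name and the clause encoding (signed integers for literals) are illustrative choices, not part of the hypothesis.

```python
from itertools import product

def brute_force_sat(num_vars, clauses):
    """Exhaustive satisfiability check over all 2^num_vars assignments.

    Each clause is a list of nonzero ints: literal v means variable v is True,
    literal -v means variable v is False (variables are numbered from 1).
    Returns a satisfying assignment as a tuple of bools, or None.
    """
    for assignment in product([False, True], repeat=num_vars):
        if all(any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment
    return None

# (x1 or x2) and (not x1 or x2) and (not x2 or x3)
solution = brute_force_sat(3, [[1, 2], [-1, 2], [-2, 3]])
```

SETH asserts, informally, that no algorithm improves on this exponential base by a constant factor in the exponent for all clause widths simultaneously.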
Tensor Trick for Computing Gradients. The tensor trick (Diao et al., 2019, 2018) is an instrument to compute complicated gradients in a clean and tractable fashion. We start with some definitions.
Definition D.1 (Vectorization).
For any matrix , we define such that for all and .
Definition D.2 (Matrixization).
For any vector , we define such that for all and , namely .
Definition D.3 (Kronecker Product).
Let and . We define the Kronecker product of and as such that , is equal to with .
Definition D.4 (Sub-Block of a Tensor).
For any and , let . For any , we define be the -th sub-block of .
To showcase the tensor trick, let’s consider a (single data point) attention following (Gao et al., 2023b, c). Setting and , we have
(D.6) |
Proposition D.1 (Definition 4.7 of (Gao et al., 2023b)).
By Definition D.3 and Definition D.4, we identify for all , with and . Therefore, for each and , it holds .
The elegance of Proposition D.1 emerges when we vectorize the weights into vectors , making the gradient computations (e.g., and ) more tractable by avoiding complex matrix or tensor derivatives. This approach systematically simplifies the handling of chain-rule terms in the gradient computation of losses like .
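The identity behind the tensor trick can be checked numerically. The sketch below assumes a row-major vectorization and a bilinear attention-logit form `A1 @ W @ A2.T`; the names `A1`, `A2`, `W`, `L`, and `d` are illustrative, and the index convention may differ from the one in Definition D.1.

```python
import numpy as np

def vec(X):
    """Row-major vectorization: vec(X)[i * cols + j] = X[i, j] (0-indexed)."""
    return X.reshape(-1)

def mat(x, rows, cols):
    """Inverse of vec: reshape a length rows*cols vector back into a matrix."""
    return x.reshape(rows, cols)

rng = np.random.default_rng(0)
L, d = 5, 3
A1 = rng.standard_normal((L, d))
A2 = rng.standard_normal((L, d))
W = rng.standard_normal((d, d))

# Matrix form of the logits versus the Kronecker (tensor-trick) form.
logits_matrix = A1 @ W @ A2.T            # shape (L, L)
logits_kron = np.kron(A1, A2) @ vec(W)   # shape (L * L,)
```

The two forms agree entry by entry, so a loss written in terms of `vec(W)` only involves a fixed matrix `np.kron(A1, A2)` acting on a vector, which is what makes the chain rule easier to track.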
Appendix E More Background and Auxiliary Lemmas: Universal Approximation of Transformers via Piecewise Approximation
Here, we review the universal approximation of Transformers following (Yun et al., 2020). Our goal is to reproduce the results of (Yun et al., 2020) and use or modify them as auxiliary lemmas for the proofs of Section 3 (i.e., Appendix F).
We start with their central result, and the rest of the section aims to prove it.
Lemma E.1 (Universal Approximation of Transformers, Theorem 3 of (Yun et al., 2020)).
Let . For any given compact-supported continuous function , there exists a Transformer network such that we have
Proof Overview.
We use the following proof strategy:
• Step 1. We show that piecewise-constant functions are able to approximate compact-supported continuous functions in Section E.1.
• Step 2. We define modified self-attention and feed-forward layers to construct the modified transformer. We show that the modified transformer is able to approximate piecewise-constant functions in Section E.2.
• Step 3. We show that the normal transformer is able to approximate the modified transformer in Section E.3.
Below, we provide details of Step 1 in Section E.1, Step 2 in Section E.2, and Step 3 in Section E.3. Then we give a summary of our results in Section E.4.
E.1 Piecewise-constant Function Approximates Compact-Supported Continuous Function
In this subsection, we show that piecewise-constant functions are able to approximate compact-supported continuous functions.
We start with the definition of the compact-supported continuous functions of interest.
Assumption E.1.
Without loss of generality, we assume that the target function in discussion is supported on . We denote the set of -supported continuous functions as .
We introduce the notion of grid and cube for the compact support .
Definition E.1 (Grid and Cube with Width ).
Given a grid width , let denote the set of grids within . For a grid point , we denote its associated cube as
We introduce the notion of piecewise-constant function class w.r.t. the -supported continuous function class .
Definition E.2 (Piecewise-Constant Function Class).
Let denote the piecewise-constant function of grid width , and denote the indicator function. For each , and any matrix , we define the piecewise-constant function class as
(E.1) |
We recall that for a given sequence-to-sequence function , we have
We approximate the compact-supported function with the piecewise-constant function in the next lemma.
Lemma E.2.
(Lemma 8 of (Yun et al., 2020)) For any given and , we can find a such that there exists a satisfying .
Proof.
See Section E.5.2 for a detailed proof. ∎
E.2 Modified Transformer Approximates Piece-Wise Constant Function
In this subsection, we define modified self-attention and feed-forward layers to construct the modified transformers. We use the modified transformers to approximate piecewise-constant function.
Definition E.3 (Modified Transformer Networks).
The modification of transformer networks includes two modifications from normal transformer networks :
• Modified attention layer: Replace operator with operator .
• Modified feed-forward layer: Replace with activation function . Here, denotes the set of all piecewise linear functions with at most three pieces, at least one of which is constant.
We approximate with these modified transformer networks as follows.
Lemma E.3 (Modified from Proposition 4 of (Yun et al., 2020)).
For each , there exists a such that .
Proof Sketch.
Given , we have the grid , and the cube for . Our proof follows two steps:
• Quantization. For all , we quantize it to a finite set:
  – If , we quantize it to the element .
  – If , we quantize it to an element out of .
• Mapping. For any , we map it to the desired output .
For quantization, we achieve this by a series of modified feed-forward layers. We show this in Section E.2.1.
For mapping, we follow two steps:
• For any , we use a “contextual mapping” (defined in Definition E.4), which maps all the elements in and to different values. Then we use a series of modified self-attention layers to achieve the “contextual mapping.” We show this in Section E.2.2.
Definition E.4 (Contextual Mapping).
Consider a finite set . A map defines a contextual mapping if the map satisfies the following:
  – For any , the entries in are all distinct.
  – For any , all entries of and are distinct.
• For any , we use a series of modified feed-forward layers to map to . We show this in Section E.2.3.
∎
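The two defining properties of a contextual mapping can be checked mechanically on a finite input set. The following sketch is a toy checker with hypothetical names (`is_contextual_mapping`, `tokenwise`, `contextual`); the example maps are illustrations only and are unrelated to the attention-based construction used in the proof.

```python
import numpy as np

def is_contextual_mapping(q, inputs):
    """Check the two properties of a contextual mapping on a finite input set.

    q maps a (d, n) array to a length-n vector; inputs is a list of pairwise
    different (d, n) arrays.
    """
    outputs = [np.asarray(q(L)).ravel() for L in inputs]
    # Property 1: entries of q(L) are pairwise distinct for every L.
    for out in outputs:
        if len(set(out.tolist())) != out.size:
            return False
    # Property 2: outputs of two different inputs share no entry.
    seen = set()
    for out in outputs:
        values = set(out.tolist())
        if seen & values:
            return False
        seen |= values
    return True

inputs = [np.array([[0, 1]]), np.array([[1, 0]])]
tokenwise = lambda L: L[0]                                   # ignores context
contextual = lambda L: L[0] + 10 * (2 * L[0, 0] + L[0, 1])   # adds a sequence ID
```

The token-wise map fails because the two sequences reuse the same token values, while adding a sequence-dependent offset separates every entry of one sequence from every entry of the other.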
Remark E.1.
E.2.1 Quantization by Modified Feed-forward Layers
We use a series of modified feed-forward layers in to quantize an input to an element in a grid:
where is a number large enough to be determined later. We achieve this via two steps.
• Step 1: Map the element out of to .
We use to represent the standard unit vector where the -th element is . For the -th row of , we define the following feed-forward layer to achieve our aim.
Definition E.5 (Feed-forward Layer 1).
The vector acts as the weight parameters and acts as the activation function in the feed-forward layer.
(E.2)
We take as an example to give the specific calculation. We denote , then we have
In the first row of , the above layer transforms the element that is out of to .
We stack the above layers together for . If the element of is out of , the series of layers maps it to .
• Step 2: Map the element in to .
For the -th row of , we take respectively, and define the following layer.
Definition E.6 (Feed-forward Layer 2).
The vector acts as the weight parameters and acts as the activation function in the feed-forward layer.
(E.3)
We take as an example, and give the specific calculation.
In the first row of , the above layer transforms the element in to .
We stack the above layers together for and . If the element of is in , the series of layers maps it to .
Combining the above two parts, we achieve our goal with feed-forward layers. We denote the series of layers as .
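The net effect of the two quantization steps can be sketched in a few lines. This is an illustration of the input-output behavior only, not the layer-by-layer feed-forward construction: it assumes in-range entries lie in the half-open interval `[0, 1)` and are rounded down to a grid of width `delta`, and that out-of-range entries are sent to a single sentinel `out_value`. Both the interval convention and the names are assumptions.

```python
import numpy as np

def quantize(X, delta, out_value):
    """Net effect of the quantization layers (a sketch under the stated assumptions).

    Entries in [0, 1) are rounded down to the grid {0, delta, ..., 1 - delta};
    all other entries are sent to out_value.
    """
    X = np.asarray(X, dtype=float)
    inside = (X >= 0.0) & (X < 1.0)
    snapped = np.floor(X / delta) * delta
    return np.where(inside, snapped, out_value)

quantized = quantize([0.1, 0.3, 0.99, -0.2, 1.5], delta=0.25, out_value=-100.0)
```

After this step every input takes values in a finite set, which is what allows the later layers to treat the problem as a finite lookup.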
E.2.2 Contextual Mapping by Modified Self-attention Layers
In our attention layers, we use the following positional encoding .
(E.4) |
According to Section E.2.1, the output of is in the grid . For any in this grid, the first column of is in
and the second column is in
For the other columns, the results are similar.
For , we use the following notation:
Then we define the grid as follows.
Definition E.7 (Grid ).
is in the grid:
Next, we show that the modified attention layer computes contextual mapping (Definition E.4) for . For , we use the following notation:
Lemma E.4 (Modified from Lemma 6 of (Yun et al., 2020)).
We consider the following subset of :
Assume that and . Then, there exists a function composed of modified attention layers (Definition E.3), a vector , and two constants (), such that satisfies the following properties:
1. For any , the entries of are all distinct.
2. For any different , all entries of , are distinct.
3. For any , all the entries of are in .
4. For any , all the entries of are outside .
Proof.
See Section E.5.3 for a detailed proof. ∎
E.2.3 Map to the Desired Output by Modified Feed-forward Layers
Next, we show that a series of feed-forward layers maps the output of the modified attention layers to the desired output of function .
Lemma E.5 (Lemma 7 of (Yun et al., 2020)).
There exists a function composed of modified feed-forward layers, such that
Proof.
See Section E.5.4 for a detailed proof. ∎
From the above conclusions, we have the following lemma for the required number of layers in the modified transformer.
Lemma E.6 ((Yun et al., 2020)).
From the proof of Lemma E.3, if we want to achieve a approximation error by the modified transformer, we need modified feed-forward layers in , modified self-attention layers in , and modified feed-forward layers in .
Proof.
By the proof of Lemma E.3, we complete the proof. ∎
E.3 Standard Transformers Approximate Modified Transformers
In this subsection, we show that standard neural network layers are able to approximate the modified self-attention layers and the modified feed-forward layers (Definition E.3). We have the following Lemma E.7.
Lemma E.7 (Lemma 9 of (Yun et al., 2020)).
For each and any , there exists such that .
Proof.
See Section E.5.5 for a detailed proof. ∎
E.4 All Together: Standard Transformers Approximate Compact-Supported Continuous Functions
E.5 Supplementary Proofs
Here we first present two preliminaries: selective shift operation and bijective column ID mapping in Section E.5.1 to proceed with our proof. Then we show the proof of Lemma E.2 in Section E.5.2, proof of Lemma E.4 in Section E.5.3, proof of Lemma E.5 in Section E.5.4, and proof of Lemma E.7 in Section E.5.5.
E.5.1 Preliminaries
We give the definition of two preliminaries: selective shift operation and bijective column ID mapping.
Selective Shift Operation.
This operation refers to shifting certain entries of the input selectively.
To achieve this, we consider the following function .
(E.5) |
where , , , and is a vector to be determined.
To see the output, we consider the -th column of :
• If , it calculates of ;
• If , it calculates of .
With , all rows of except the first row are zero. We consider the -th entry of the first row in , which is denoted as . Then for all , we have
From this observation, we define a function parametrized by and , where .
(E.6) |
Then we have
We define an attention layer of the form . For any column , if , its first coordinate is shifted up by , while all the other coordinates stay untouched. We call this the selective shift operation, because we can choose and to shift certain entries of the input selectively.
Bijective Column ID Mapping.
We consider the input (Definition E.7). We use
(E.7) |
For any , we have the following two conclusions:
• If for all , i.e., , then we have
(E.8)
The map from to is a bijection.
• If there exists such that , then
(E.9)
We say that gives the “column ID” for each possible value of .
Remark E.3 (Illustration of Bijection Property).
For the bijection property, we give the following illustration. Let and . If and , we deduce
(E.10) |
Because , then there exists a , such that and . We have
However,
This contradicts (E.10). Thus we prove the property of bijection.
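The bijection property of a column ID map can be verified by enumeration on a small grid. The sketch below assumes, following the construction in (Yun et al., 2020), a weight vector of the form `(1, 1/delta, ..., 1/delta^(d-1))` and a grid `{0, delta, ..., 1 - delta}^d` without the positional-encoding shift used in this section; the function name `column_ids` is hypothetical.

```python
import itertools
import numpy as np

def column_ids(d, delta):
    """Map every grid column v in {0, delta, ..., 1 - delta}^d to the scalar u^T v."""
    base = round(1.0 / delta)
    # Assumed weight vector u = (1, 1/delta, ..., 1/delta^(d-1)).
    u = np.array([float(base) ** i for i in range(d)])
    levels = [k * delta for k in range(base)]
    return {v: float(u @ np.array(v)) for v in itertools.product(levels, repeat=d)}

ids = column_ids(d=3, delta=0.25)
```

Each ID equals `delta` times a base-`1/delta` number whose digits are the grid indices of the column, so distinct columns receive distinct IDs. This digit-expansion argument is the same idea as the contradiction in the remark above.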
E.5.2 Proof of Lemma E.2
Proof of Lemma E.2.
We restate the proof from (Yun et al., 2020) for completeness.
By the nature of the compact-supported continuous function, is uniformly continuous.
Because is equivalent to when the number of entries is finite, we have the following by the definition of uniform continuity.
For any , there exists a , such that for any , and , we have .
Then we perform the following steps following Definitions E.1 and E.2:
• We create a grid by choosing grid width , and cube with respect to .
• For any grid point , we define to be the center point of the cube .
• We define a piecewise-constant function .
Then for any , we have . According to the uniform continuity, we derive
This implies that and completes the proof. ∎
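The construction in this proof can be illustrated numerically: evaluate the target function at the center of the cube containing each input and observe how the error shrinks with the grid width. The sketch below uses a sup-norm error on a one-dimensional test grid for simplicity; the test function `f` and the helper names are illustrative assumptions.

```python
import numpy as np

def piecewise_constant(f, delta):
    """Return f_bar that is constant on each cube of width delta in [0, 1]^m.

    On each cube, f_bar takes the value of f at the cube's center point.
    """
    def f_bar(x):
        x = np.asarray(x, dtype=float)
        # Index of the cube containing x (the point 1.0 is assigned to the last cube).
        idx = np.minimum(np.floor(x / delta), round(1.0 / delta) - 1)
        center = (idx + 0.5) * delta
        return f(center)
    return f_bar

def sup_error(f, delta, num=2001):
    """Approximate the sup-norm error of f_bar on a 1-D test grid in [0, 1]."""
    xs = np.linspace(0.0, 1.0, num)
    f_bar = piecewise_constant(f, delta)
    return max(abs(f(np.array([x])) - f_bar(np.array([x]))) for x in xs)

f = lambda x: float(np.sin(3.0 * x).sum())   # 3-Lipschitz test function
errors = {delta: sup_error(f, delta) for delta in (0.1, 0.01)}
```

Because the test function is 3-Lipschitz and every point lies within half a grid width of its cube center, the error is at most `1.5 * delta`, so refining the grid drives the error toward zero, as uniform continuity guarantees in general.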
E.5.3 Proof of Lemma E.4
We give the proof of Lemma E.4 by constructing the network to satisfy the requirements.
Proof of Lemma E.4.
Recalling the selective shift operation in Section E.5.1, the overall idea of the construction includes two steps:
• Step 1: For each , we stack attention layers. We use the attention layer as
(E.11)
for in the increasing order. The total number of layers is . These layers cast to different entries required by Property 1 of Lemma E.4.
• Step 2: We add an extra single-head attention layer with attention part
(E.12)
This layer achieves a global shifting and casts different to unique elements required by Property 2 of Lemma E.4.
The two operations together map and to different sets, as required by Properties 3-4 of Lemma E.4. The bounds and are then calculated.
Then, we give detailed proof by showing the impact of the two steps and verifying the four properties of Lemma E.4. We achieve this by making a category division of :
• Category 1: , all entries in the point are between and .
• Category 2: , the point has at least one entry that equals .
Let , and recall that for any in (E.8).
Category 1.
We denote , then we have . The first layers sweep the set and apply selective shift operation on each element in the set. This means that selective shift operation will be applied to first, then , and then , and so on, regardless of the specific values of ’s.
• First Shift Operation. In the first selective shift operation with going through , the -th entry of (e.g., ) is shifted by the operation, while the other entries are left untouched. The updated value is
Therefore, after the operation, the output of the layer is
We have
Then we deduce , because
(By (E.8))
(By and (E.8))
Thus, after updating,
and the new minimum is .
• Second Shift Operation. In the second selective shift operation with going through , the -th entry of (e.g., ) is shifted by the operation, while the other entries are left untouched. The updated value is
Therefore, after the operation, the output of the layer is
We have
Then we deduce , because
(By and )
Thus, after updating,
and the new minimum is .
• Repeating the Process. By repeating this process, we show that the -th shift operation shifts by , and we have
We deduce holds for all , because
where the last inequality holds because
Therefore, after the -th selective shift operation, is the new maximum among and is the new minimum.
• After Shift Operations. After the whole shift operations, the input is mapped to a new point , where and . For the lower and upper bound of , we have the following lemma.
Lemma E.8 (Lemma 10 of (Yun et al., 2020)).
satisfies the following bounds:
Also, the mapping from to is one-to-one.
• Global Shifting by the Last Layer. We note that after the above shift operations, there is another attention layer with attention part . Since , what it does to is that it adds the following to each entry in the first row of :
The output of this layer is defined to be the function .
Now, in summary, for any , , and , we have
For any and ,
Next, we check Property 1, Property 2, and Property 3 of Lemma E.4.
- •
• Checking Property 2 of Lemma E.4. Note that the upper bound on from Lemma E.8 also holds for other ’s, so for all , we have
Now, from Lemma E.8, two different map to different and , and they differ at least by . This means that two intervals
are guaranteed to be disjoint, so the entries of and are all distinct.
Now, we finish showing that the map we constructed using attention layers implements a contextual mapping on .
• Checking Property 3 of Lemma E.4. With and Lemma E.8, we show that for any , we have
This proves that all are between and , where
Category 2. Now we check Property 4 of Lemma E.4. For the input points , note that the point has at least one entry that equals . Let , and recall that whenever a column has an entry that equals , we have . Without loss of generality, assume that .
Because the selective shift operation is applied to each element of , not to negative values, we have , never gets shifted upwards, and remains the minimum for the whole time.
• All ’s Are Negative. When all ’s are negative, selective shift operation never shifts the input , thus . Recall that for all . The last layer with attention part adds to each entry in the first row of , making remain negative. Therefore, satisfies for all .
• Not All ’s Are Negative. Now consider the case where at least one is positive. Suppose that there are positive and satisfies . Thus selective shift operation does not affect , where , but it shifts by
(By (E.9))
(By )
The next shift operations shift by an even larger amount, so at the end of the first layers, we have , while for all .
Then, we shift by the last layer. The last layer with attention part acts differently for negative and positive ’s. (i). For negative ’s, it adds the following to :
This term pushes them further to the negative side. (ii). For positive ’s, it adds
Thus they are all greater than or equal to .
Note that
Then the final output satisfies , for all . This completes the verification of Property 4 of Lemma E.4.
In conclusion, we need layers of modified self-attention to obtain our approximation. This completes the proof. ∎
E.5.4 Proof of Lemma E.5
Proof of Lemma E.5.
We restate the proof from (Yun et al., 2020) for completeness.
Note that , so the output of has finite number of distinct real values. Let be the upper bound of all these possible values. By construction of , .
Construct the Layers: if .
According to Lemma E.4, for all , we have if , and if . Due to this property, we add the following feed-forward layer:
Definition E.8 (Feed-forward Layer 3).
The vectors and act as the weight parameters and acts as the activation function in the feed-forward layer.
(E.13) |
• Case for . We have , so all the entries of the input are shifted by , and become strictly negative.
• Case for . We have , so the output stays the same as the .
With the input , if , then , so the output stays the same as the input. If , then , so all the entries of the input are shifted by , and become strictly negative.
Next, we map those negative entries to zero. For , we add the following layer:
Definition E.9 (Feed-forward Layer 4).
The vectors and act as the weight parameters and acts as the activation function in the feed-forward layer.
(E.14) |
After these layers, the output for is a zero matrix, while the output for remains .
Construct the Layers: if .
Each different is mapped to unique numbers , which are at least apart from each other. We map each unique number to the corresponding output column as follows. We choose one , for each , , we add the following feed-forward layer.
Definition E.10 (Feed-forward Layer 5).
The vectors and act as the weight parameters and acts as the activation function in the feed-forward layer.
(E.15) | ||||
(E.16) |
• Case for . Recall that the input of this layer is . If is a zero matrix, which is the case for , we have . Then . Since , the output remains the same as .
• Case for . Consider the input is , where is not equal to . According to Property 2 of Lemma E.4, given a , differs from by at least . Then we have
Thus the input is left untouched.
If , then
Thus we shift the -th column of to
In other words, this layer maps the column to , without affecting any other columns.
We infer from the above that we need one layer for each unique value of for each . Note that there are such numbers, so we use layers to finish our construction. ∎
E.5.5 Proof of Lemma E.7
Proof of Lemma E.7.
We restate the proof from (Yun et al., 2020) for completeness.
The proof follows two steps: (i) Approximate the modified self-attention layers. (ii) Approximate the modified feed-forward layers.
-
•
Step 1: Approximate the Modified Self-Attention Layers.
We achieve this by approximating the operator with the operator . Given a matrix , we have
The operator is the only difference between the normal and the modified self-attention layers. We approximate the modified self-attention layer in by the normal self-attention layer with the same number of heads and head size .
-
•
Step 2: Approximate the Modified Feed-Forward Layers.
We achieve this by approximating the activation function in with four functions. From Definition E.3, we recall that denotes three-piecewise functions with at least a constant piece. We consider the following :
where , and .
We approximate by composed of four functions:
As , we approximate using . The activation function is the only difference between the normal and modified feed-forward layers. We approximate the modified feed-forward layer in by the normal one.
Thus, for any , there exists a function to approximate .
This completes the proof. ∎
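The limiting argument in Step 1 can be checked numerically. The sketch below assumes the two operators are the column-wise hardmax and softmax, as in (Yun et al., 2020), and that the approximation is obtained by scaling the input with a large factor; the names `softmax`, `hardmax`, and `lam`, and the toy matrix `Z`, are illustrative and not part of the original construction.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)    # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hardmax(z, axis=0):
    # One-hot indicator of the maximum; assumes a unique maximum per column.
    return (z == z.max(axis=axis, keepdims=True)).astype(float)

# Toy input with a unique maximum in every column (an assumption of this sketch).
Z = np.array([[0.2, 1.0],
              [0.9, -0.5],
              [0.1, 0.3]])

# Largest entry-wise gap between softmax(lam * Z) and hardmax(Z) as lam grows.
errors = [np.abs(softmax(lam * Z) - hardmax(Z)).max() for lam in (1.0, 10.0, 100.0)]
```

The gap shrinks as the scaling factor grows, which is the sense in which a normal self-attention layer approximates the modified one.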
Appendix F Proofs of Section 3
Our proof is motivated by the approximation and estimation theory of U-Net-based diffusion models in (Chen et al., 2023a). We use the universal approximation capability in Appendix E and the covering number of transformer networks to proceed with our proof. Specifically, we derive the approximation error bound in Section F.1 and the corresponding sample complexity bound in Section F.2. Then we show that the data distribution generated from the estimated score function converges toward a proximate area of the original one in Section F.3.
F.1 Proof of Theorem 3.1
Here we present some auxiliary theoretical results in Section F.1.1 to prepare our main proof of Theorem 3.1. Then we derive the approximation error bound of DiTs (i.e., the proof of Theorem 3.1) in Section F.1.2.
F.1.1 Auxiliary Lemmas for Theorem 3.1.
We restate some auxiliary lemmas and their proofs here from (Chen et al., 2023a) for later convenience.
Lemma F.1 (Lemma 16 of (Chen et al., 2023a)).
Consider a probability density function for and constant . Let be a fixed radius. Then it holds
Lemma F.2 (Lemma 2 of (Chen et al., 2023a)).
Lemma F.3 (Theorem 1 of (Chen et al., 2023a)).
Lemma F.4 (Lemma 10 of (Chen et al., 2020b)).
For any given , and -Lipschitz function defined on , there exists a continuous function constructed from trapezoid functions such that
Moreover, the Lipschitz continuity of is bounded by
F.1.2 Main Proof of Theorem 3.1
Proof of Theorem 3.1.
With , we note that in (2.4)
(F.1) |
We proceed as follows:
-
•
Step 1. Approximate with a compact-supported continuous function .
-
•
Step 2. Approximate with a Transformer network.
Step 1. Approximate with a Compact-supported Continuous Function . Here we partition into a compact subset and its complement , where is to be determined later. We approximate on the two subsets separately and then prove ’s continuity. This step achieves an estimation error of between and . We show the main proof here.
- •
-
•
Approximation on . On , we approximate one coordinate at a time, where . We first rescale the input by and , so that the transformed input space is . We implement such a transformation by a single feed-forward layer.
By Assumption 2.3, on-support score is -Lipschitz in . This implies is -Lipschitz in . When taking the transformed inputs, becomes -Lipschitz in ; so is each coordinate . Here we take .
Besides, is -Lipschitz with respect to , where
We have a coarse upper bound for in Lemma F.3. We repeat it here for convenience
In conclusion, each is Lipschitz continuous, so we apply Lemma F.4 to find approximating each coordinate. We concatenate the ’s together and construct . According to the construction in Lemma F.4, for any given , we achieve
Considering the input rescaling (i.e., and ), we obtain:
-
–
The constructed function is Lipschitz continuous in , i.e., for any and , it holds
(F.2) -
–
The function is also Lipschitz in , i.e., for any and , it holds
Because the construction of is based on trapezoid functions, we have for . So the two parts of can be joined together. To be more specific, the above Lipschitz continuity in extends to the whole .
-
–
-
•
Approximation Error Analysis under Norm. The approximation error of can be decomposed into two terms:
The second term on the right-hand side above has already been bounded with the selection of :
The first term is bounded by:
So we obtain
If we substitute with , we obtain that the approximation error of is .
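The coordinate-wise construction in Step 1 can be illustrated in one dimension. The sketch below is not the construction of Lemma F.4 itself: it swaps in hat functions (trapezoids with a zero-width top) on a uniform grid over the unit interval, and the test function, the Lipschitz constant `lip`, and the grid size `num_cells` are assumptions made for illustration only.

```python
import numpy as np

def hat_interpolant(f, num_cells):
    """Piecewise-linear interpolant of f on a uniform grid over [0, 1]."""
    grid = np.linspace(0.0, 1.0, num_cells + 1)
    vals = f(grid)
    return lambda x: np.interp(x, grid, vals)

lip = 3.0
f = lambda x: np.abs(np.sin(lip * x))      # a 3-Lipschitz toy target on [0, 1]
num_cells = 64
f_bar = hat_interpolant(f, num_cells)

x = np.linspace(0.0, 1.0, 10001)           # dense evaluation grid
sup_error = np.max(np.abs(f(x) - f_bar(x)))

# Finite-difference slopes of the interpolant, used to check that it
# inherits the Lipschitz constant of the target.
slopes = np.abs(np.diff(f_bar(x)) / np.diff(x))
```

In this toy setting the sup-norm error is at most the Lipschitz constant times the cell width, and the interpolant is no steeper than the target, mirroring the two properties that Lemma F.4 provides for each coordinate.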
Step 2. Approximate by a Transformer. This step is based on the universal approximation of transformers for compact-supported continuous functions in Lemma E.1. Following (Peebles and Xie, 2023), DiT uses the time point to calculate the scale and shift values in the Transformer backbone, and it transforms an input picture into a sequential version. We omit the time point in the notation of the Transformer network in DiT. Recalling the reshape layer in Definition 3.1, we consider using to approximate , where .
-
•
Overall Approximation Error. With Lemma E.1, we approximate with , and denote
We have
(F.3) Along with Step 1, we obtain
The constructed approximator to is , whose approximation error is
-
•
Settling-down of Hyperparameters. We settle down the hyperparameters to configure our network here. We refer to Section E.2 for some of the following calculations.
Then we have
(F.12) | ||||
(By setting according to Section E.4) |
and
(F.13) | ||||
(By setting according to Section E.4) |
This completes the proof. ∎
F.2 Proof of Corollary 3.1.1
Here we present the auxiliary theoretical results about the covering number of transformer networks in Section F.2.1 to prepare our main proof of Corollary 3.1.1. The results are based on Theorem A.17 of (Edelman et al., 2022). Then we derive the sample complexity bound of DiTs (i.e., the proof of Corollary 3.1.1) in Section F.2.2.
F.2.1 Auxiliary Lemmas for Corollary 3.1.1
Lemma F.5 (Lemma 15 of (Chen et al., 2023a)).
Let be a bounded function class, i.e., there exists a constant such that any . Let be i.i.d. random variables. For any , and , we have
Now, we give the definition of the covering number as follows.
Definition F.1 (Covering Number).
Consider a function class and a data distribution . Given n data points sampled from , the covering number is the smallest size of a collection (a cover) such that for any , there exists satisfying
Further, we define the covering number with respect to the data distribution as
Then we give the covering number of the transformer networks.
Lemma F.6 (Modified from Theorem A.17 of (Edelman et al., 2022)).
Let represent the class of functions of -layer transformer blocks satisfying the norm bound for matrix and Lipschitz property for feed-forward layers. Then for all data points we have
where .
Remark F.1.
We modify (Edelman et al., 2022, Theorem A.17) in seven aspects:
-
1.
We do not consider the last linear layer in the model: converting each column vector of the Transformer output to a scalar. Therefore, we ignore the item related to the last linear layer in (Edelman et al., 2022, Theorem A.17).
-
2.
We do not consider the normalization layer in our model. Because the normalization layer in the original proof of only applies , ignoring this layer does not change the result.
-
3.
Our activation function is , so we replace the Lipschitz upper bound of the activation function with 1.
-
4.
Since we consider the positional encoding (E.4) in our work, we replace the upper bound for the inputs with the upper bound . Besides, for multi-layer Transformers, the original conclusion in (Edelman et al., 2022, Theorem A.17) assumes that the upper bound for the -norm of the inputs is 1, so we add the upper bound for the inputs in Lemma F.6.
-
5.
We use (2.7) as the feed-forward layer, including two linear layers and a residual layer. Thus, in Lemma F.6, we replace the original upper bound for the norm of the weight matrix with the upper bound for the norm of . In the following, we use to estimate the log-covering number, so we ignore the item for here for convenience. The same holds for the self-attention layer.
-
6.
We use multi-head attention, so we add the number of heads in our result, similar to (Edelman et al., 2022, Theorem A.12).
-
7.
In our work, we use Transformer , i.e., .
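To make Definition F.1 concrete, the following sketch builds a greedy cover for a toy function class evaluated at sampled points. It is only an illustration: the class of linear maps, the sample points `xs`, the radius `eps`, and the use of the largest deviation over the sample as the distance are all assumptions of this sketch, and a greedy cover gives an upper bound on the covering number rather than its exact value.

```python
import numpy as np

def greedy_cover(values, eps):
    """Return indices of a greedy eps-cover of the rows of `values`.

    values[i, j] holds f_i(x_j); the distance between two functions is the
    largest absolute difference over the sample points.
    """
    centers = []
    uncovered = np.ones(len(values), dtype=bool)
    while uncovered.any():
        c = int(np.argmax(uncovered))          # first function not yet covered
        centers.append(c)
        dist = np.max(np.abs(values - values[c]), axis=1)
        uncovered &= dist > eps                # keep only functions still far away
    return centers

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=50)           # n sampled data points
slopes = np.linspace(-1.0, 1.0, 201)           # toy class f_w(x) = w * x
values = slopes[:, None] * xs[None, :]
eps = 0.1
centers = greedy_cover(values, eps)

# Distance from every function to its nearest center, to confirm coverage.
nearest = np.min(
    np.max(np.abs(values[:, None, :] - values[centers][None, :, :]), axis=2),
    axis=1,
)
```

The size of `centers` grows as `eps` shrinks, which is the dependence that the log-covering number in Lemma F.6 quantifies for transformer networks.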
F.2.2 Proof of Corollary 3.1.1
Proof of Corollary 3.1.1.
Our proof is built on (Chen et al., 2023a, Appendix B.2). Firstly, for one data sample, we define the empirical score matching loss objective (2.1) as follows
Then we define .
We denote
For any , following (Chen et al., 2023a, Appendix B.2), we have the following for term with probability ,
where is a constant, and will be determined later.
We set , then we have
with probability .
Following the upper bound of other two terms and the proof details in (Chen et al., 2023a, Appendix B.2), we have
(F.14) |
with probability .
Covering Number of .
The next step is to calculate the covering number of . consists of two components: (i) Matrix with orthonormal columns; (ii) Network function . Suppose we have and such that and , where . Then we evaluate
(F.15) |
where upper bounds the Lipschitz constant of .
For set , its -covering number is ((Chen et al., 2020a, Lemma 8)). The -covering number of needs a further discussion as there is a reshaping process in our network. For the input reshaped from to , we have
and
Thus we can follow the covering number property for sequence-to-sequence transformer , i.e., Lemma F.6 and get the following -covering number
where
According to the hyperparameter estimates in Section F.1.2, including (F.12) and (F.13), we derive the following with (Section E.4) and (Theorem 3.1):
(F.16) | ||||
We consider that each element of the input data is within as shown in Appendix E.
Recall that , then we get the log-covering number of ,
Following (Chen et al., 2023a, Appendix B.2), then the log-covering number of is
(By (F.2.2)) | ||||
(By (F.2.2)) | ||||
(By ) | ||||
(By ignoring the log factors) | ||||
Substituting the log-covering number into (F.2.2), we have
(F.17) |
Recall the following parameters,
-
•
-
•
-
•
: probability error
-
•
: approximation error
-
•
: sample size
-
•
-
•
: feature dimension
-
•
: sequence length
-
•
-
•
: Lipschitz coefficient
Thus, the final bound is
Thus, we complete the proof of Corollary 3.1.1. ∎
F.3 Proof of Corollary 3.1.2
Our proof is built on (Chen et al., 2023a, Appendix C). The main difference between our work and (Chen et al., 2023a) is our score estimation error from Corollary 3.1.1. Consequently, only the subspace error and the total variation distance differ from (Chen et al., 2023a, Theorem 3).
Proof Sketch of (i).
We show that if the orthogonal score increases significantly, the mismatch between the column span of and will be greatly amplified. Therefore, an accurate score network estimator forces and to align with each other.
Proof Sketch of (ii).
We conduct the proof via 2 steps:
-
•
Step 1: Total Variation Distance Bound. We obtain the discrete result from the continuous-time generated distribution by adding discretization error (Chen et al., 2023a, Lemma 4). It suffices to bound the divergence between the following two stochastic processes:
-
–
For the ground-truth backward process, consider and the following SDE:
Denote the marginal distribution of the ground-truth process as .
-
–
For the learned process, consider and the following SDE:
where and is an orthogonal matrix. Following the notation in (Chen et al., 2023a), we use to denote the marginal distribution of . We first calculate the latent score matching error, i.e., the error between and . Then, we adopt Girsanov’s Theorem (Chen et al., 2023b) and bound the difference in the KL divergence of the above two processes to derive the score-matching error bound.
-
–
-
•
Step 2: Wasserstein-2 Distance Bound. We use the same technique as (Chen et al., 2023a, Theorem 3).
Proof Sketch of (iii).
We derive item (iii) by solving the orthogonal backward process of the diffusion model.
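The discretized backward process referenced in the sketches above can be simulated in a toy setting. The sketch below assumes a one-dimensional Gaussian data distribution and the forward process dX = -X/2 dt + dW, so that the score is available in closed form; the values of `data_var`, `T`, `num_steps`, and `n` are illustrative choices, and the exact score stands in for the learned score network.

```python
import numpy as np

rng = np.random.default_rng(0)
data_var = 0.25            # variance of the toy data distribution N(0, data_var)
T, num_steps, n = 5.0, 1000, 20000
eta = T / num_steps        # step size of the discretized backward process

def marginal_var(t):
    # Variance of X_t under dX = -X/2 dt + dW started from N(0, data_var).
    return np.exp(-t) * data_var + 1.0 - np.exp(-t)

def score(x, t):
    # Exact score of the Gaussian marginal N(0, marginal_var(t)).
    return -x / marginal_var(t)

y = rng.standard_normal(n)             # start from the standard Gaussian prior
for k in range(num_steps):
    t = T - k * eta                    # forward time matching this backward step
    drift = 0.5 * y + score(y, t)
    y = y + eta * drift + np.sqrt(eta) * rng.standard_normal(n)

sample_var = y.var()                   # should be close to data_var
```

With the exact score, the remaining gap between `sample_var` and `data_var` comes from the step size, the finite horizon, and sampling noise; replacing `score` with an estimate adds the score-matching error that the total variation bound accounts for.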
Next, we present the auxiliary theoretical results in Section F.3.1 to prepare our main proof of Corollary 3.1.2. Then we give detailed proof of Corollary 3.1.2 in Section F.3.2.
F.3.1 Auxiliary Lemmas
Here we include a few auxiliary lemmas from (Chen et al., 2023a) without proofs. Recall the definition of Lipschitz norm: for a given function , .
Lemma F.7 (Lemma 3 of (Chen et al., 2023a)).
Assume that the following holds
where denotes the smallest eigenvalue. We denote
We set and . Suppose we have
Then we have
and there exists an orthogonal matrix , such that:
Lemma F.8 (Lemma 4 of (Chen et al., 2023a)).
Assume that is sub-Gaussian, and are Lipschitz in both and . Assume we have the latent score matching error bound
Then we have the following latent distribution estimation error for the undiscretized backward SDE
Furthermore, we have the following latent distribution estimation error for the discretized backward SDE
where
and is the step size in the backward process.
Lemma F.9 (Lemma 6 of (Chen et al., 2023a)).
Consider the following discretized SDE with step size satisfying
where . Then when and , we have with .
Lemma F.10 (Lemma 10 in (Chen et al., 2023a)).
Assume that is -Lipschitz. Then we have .
F.3.2 Main Proof of Corollary 3.1.2
Proof.
Recall
-
•
Proof of (i). In Lemma F.7, we replace with and set by Lemma F.10. We then have
Substituting the score estimation error in Corollary 3.1.1 and into the bound above, we deduce
where .
We note that is large enough for to satisfy where .
-
•
Proof of (ii). Lemma F.7 and Lemma F.10 imply that
where
Through algebraic calculation, we get
With and Lemma F.8, we obtain
As we choose time step , we obtain
By definition, . This completes the proof of the total variation distance in (3.2).
For the Wasserstein-2 distance , we bound it using the same technique as (Chen et al., 2023b, Lemma 16). Specifically, our proof only requires a finite second moment of , which holds under Assumption 2.2. As a result, we have
-
•
Proof of (iii). We apply Lemma F.9 due to our score decomposition. With the marginal distribution at time and observing , we obtain the last property.
This completes the proof. ∎
Appendix G Proofs of Section 4
Our proofs are motivated by the observation of low-rank gradient decomposition in transformer-like models (Alman and Song, 2024a; Gu et al., 2024). With our simplifications and observations made in Section 4, we utilize the fine-grained complexity results of transformer and attention (Hu et al., 2024c; Alman and Song, 2024b, 2023) and the tensor trick (Lemma D.1 and (Diao et al., 2019, 2018)) to proceed with our proofs. Specifically, we approximate DiT training gradients with a series of low-rank approximations in Sections G.1.1, G.1.2 and G.1.3, and carefully match the multiplication dimensions so that the computation of forms a chained low-rank approximation in Section G.2.
G.1 Auxiliary Theoretical Results for Theorem 4.1
Here we present some auxiliary theoretical results to prepare our main proof of Theorem 4.1 (existence of almost-linear time algorithms for ADITGC).
G.1.1 Low-Rank Decomposition of DiT Gradients
We start with some definitions. Recall that and denotes the vectorization of following Definition D.1.
Definition G.1.
Let be two matrices. Suppose . Define as an sub-block of . There are such sub-blocks in total. For each , define the function by .
Definition G.2.
Let be two matrices. Suppose . Define as an sub-block of . There are such sub-blocks in total. For every index , consider the function defined by .
Definition G.3.
Suppose that and are defined as in Definitions G.2 and G.1, respectively. For a fixed , consider the function defined by
Define as the matrix where the -th row is .
Definition G.4.
For every , define the function by
Here, denotes the matrix representation of , and represents the -th column of . Define as the matrix where the -th column is .
Definition G.5.
For each , we denote as the normalized vector defined by Definition G.3. For each , is defined as per Definition G.4. For every pair , define the function by
where is the element at the position of the matrix . has matrix form
With the tensor trick (Section D.3), we compute the gradient of the DiT loss as follows:
(G.1) |
(G.1) presents a neat decomposition of . Each term is easy enough to handle. Thus, we arrive at the following lemma. Let and be the -th row and -th column of matrix .
Lemma G.1 (Low-Rank Decomposition of DiT Gradient).
Let matrix and loss function follow Definition 4.1, and . It holds
(G.2) |
Proof.
Let and be the -th row and -th column of matrix .
With DiT loss Definition 4.1, we have
(By Definition G.5) | ||||
(By Definition G.3) | ||||
(By chain rule) |
For each , we have
Therefore, for each , we have
(By Definition G.1) | ||||
(By entry-wise product rule) | ||||
(By Definition G.1 again) |
Similarly,
(By Definition G.2) | ||||
(By entry-wise product rule) | ||||
(By Definition G.1 again) |
Putting all together, we have
where
This completes the proof. ∎
Observe (G.2) carefully. We see that (I) is diagonal and (II) is low-rank. This provides a hint for algorithmic speedup through low-rank approximation: If we approximate the other parts with low-rank approximation and carefully match the multiplication dimensions, we might formulate the computation of as a chained low-rank approximation.
Surprisingly, such an approach makes computing (G.2) possible in almost-linear time. To proceed, we further decompose (G.2) according to the chain rule in the next lemma, and then conduct the approximation term by term.
To facilitate our proof, we introduce the following notation.
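The benefit of a diagonal part plus a low-rank part can be seen in a small example. The sketch below assumes that the two parts take the softmax-Jacobian form, a diagonal matrix built from a normalized vector minus the outer product of that vector with itself, which is the structure that appears in the cited gradient analyses (Alman and Song, 2024a); the vectors `f` and `v` and the helper `jacobian_apply` are illustrative names, not notation from (G.2).

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def jacobian_apply(f, v):
    # (diag(f) - f f^T) v computed in O(n), without forming the n x n matrix.
    return f * v - f * (f @ v)

rng = np.random.default_rng(0)
n = 512
f = softmax(rng.standard_normal(n))            # normalized vector (assumed form)
v = rng.standard_normal(n)

dense = (np.diag(f) - np.outer(f, f)) @ v      # O(n^2) reference computation
fast = jacobian_apply(f, v)                    # O(n) structured computation
```

Both routes return the same vector, but the structured one never materializes a quadratic-size matrix, which is what makes chaining such factors compatible with an almost-linear running time.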
Definition G.6 ().
Define as specified in Definition G.5 and as described in Definition G.4. Define by
In addition, denotes the -th row of , transposed, making it an vector.
Definition G.7 (,, ).
For each index , we define as follows:
We define such that forms the -th row of . In addition, for every index , we define as
such that .
allows us to express in a neat form:
Lemma G.2.
G.1.2 Low-Rank Approximations of Building Blocks I
The definitions of , , , and Lemma G.2 show that the DiT training gradient involves entry-wise products of , , and . Therefore, if we approximate these with inner-dimension-matched low-rank approximations, computing itself becomes a low-rank approximation. In the following sections, we present low-rank approximations for , , and .
Lemma G.3 (Approximate , Modified from (Alman and Song, 2023)).
Let and . Let , and with follows Definitions G.1, G.2, G.5 and G.3. If ,, then there exist two matrices such that . In addition, it takes time to construct and .
Proof.
By (Alman and Song, 2023, Theorem 3), we complete the proof. ∎
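The idea behind such low-rank factors can be sketched with a low-degree polynomial. The example below is a simplification, not the construction of (Alman and Song, 2023): it swaps in a degree-2 Taylor expansion of the exponential and an explicit feature map, and the matrices `Q` and `K`, the entry bound 0.3, and the helper `taylor_features` are assumptions made for illustration.

```python
import numpy as np

def taylor_features(X):
    """Map rows x to phi(x) with phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2."""
    n, d = X.shape
    ones = np.ones((n, 1))
    quad = np.einsum('ia,ib->iab', X, X).reshape(n, d * d) / np.sqrt(2.0)
    return np.hstack([ones, X, quad])

rng = np.random.default_rng(0)
n, d = 256, 4
Q = rng.uniform(-0.3, 0.3, size=(n, d))    # bounded entries keep the error small
K = rng.uniform(-0.3, 0.3, size=(n, d))

U1, U2 = taylor_features(Q), taylor_features(K)   # each is n x (1 + d + d^2)
exact = np.exp(Q @ K.T)                            # entry-wise exponential, n x n
approx = U1 @ U2.T                                 # rank at most 1 + d + d^2
max_err = np.abs(exact - approx).max()
```

Here the rank of the approximation depends only on the small dimension, not on the sequence length, and the entry-wise error is controlled by the bound on the entries; this is the same trade-off that the bounded-entry condition in the lemma governs.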
Lemma G.4 (Approximate ).
Assume all numerical values are in bits. Let and follow Definition G.5. There exist two matrices such that .
Proof of Lemma G.4.
(By Definition G.5) | ||||
(By (Alman and Song, 2023, Theorem 3)) |
∎
Lemma G.5 (Approximate ).
Let , follow Definition G.5 and let (follow Definition G.6). There exist two matrices such that . In addition, it takes time to construct .
G.1.3 Low-Rank Approximations of Building Blocks II
Now, we use the low-rank approximations of to construct low-rank approximations for .
Lemma G.6 (Approximate ).
Let . Suppose approximates such that , and approximates the such that . Then there exist two matrices such that . In addition, it takes time to construct .
Proof of Lemma G.6.
By tensor trick, we construct , as tensor products of and , respectively, while preserving their low-rank structures. Then, we show the low-rank approximation of with bounded error by Lemma G.3 and Lemma G.5.
Let be column-wise Kronecker product such that for .
Let and denote matrix-multiplication approximations to and , respectively.
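The low-rank-preserving role of this product can be verified directly. The sketch below assumes that the product pairs the columns of its two factors row by row, so that row i of the result is the Kronecker product of the i-th rows; the helper name `rowwise_kron` and the random factors are illustrative.

```python
import numpy as np

def rowwise_kron(A, B):
    """Row i of the result is kron(A[i], B[i]); shape (n, k1 * k2)."""
    n = A.shape[0]
    return np.einsum('ia,ib->iab', A, B).reshape(n, -1)

rng = np.random.default_rng(0)
n, k1, k2 = 64, 3, 5
U1, V1 = rng.standard_normal((n, k1)), rng.standard_normal((n, k1))
U2, V2 = rng.standard_normal((n, k2)), rng.standard_normal((n, k2))

# Entry-wise product of two low-rank matrices, formed explicitly.
hadamard = (U1 @ V1.T) * (U2 @ V2.T)

# The same matrix written as a single low-rank product of rank at most k1 * k2.
factored = rowwise_kron(U1, U2) @ rowwise_kron(V1, V2).T
```

The identity shows why an entry-wise product of two low-rank approximations is again low-rank, with the rank multiplying rather than growing with the sequence length.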
Lemma G.7 (Approximate ).
Let . Let follow Definition G.7 such that its -th column is for each . Suppose approximates the such that , and approximates the such that . Then there exist matrices such that . In addition, it takes time to construct .
Proof of Lemma G.7.
From Definition G.7,
For (I), we show its low-rank approximation by observing the low-rank-preserving property of the multiplication between and (from Lemma G.3 and Lemma G.5). For (II), we show its low-rank approximation by the low-rank structure of and (I).
Part (I).
We define a function such that the -th component for all . Let denote the approximation of via decomposing into and :
(G.6) |
for all . This allows us to write with denoting a diagonal matrix with diagonal entries being components of .
Part (II).
With , we approximate with as follows.
Since has low rank representation, and is a diagonal matrix, has low-rank representation by definition. Thus, we set with and . Then, we bound the approximation error
(By triangle inequality) | ||||
Computationally, computing takes time by . Once we have precomputed, (G.6) only takes time for each . Thus, the total time is . Since and takes time to construct and also takes time, and takes time to construct. This completes the proof. ∎
G.2 Proof of Theorem 4.1
Proof of Theorem 4.1.
By the definitions of matrices , and (Definition G.7), we have
By Lemma G.2, we have
(G.7) |
To show the existence of algorithms for DiT backward computation Problem 1, we prove fast low-rank approximations for and as follows.
Let denote the approximations to , respectively.
Therefore, the total running time for is .
For the same reason (by Lemma G.7), the total running time for is .
Lastly, we have
(By Lemma G.2) | ||||
(By definition, for any matrix ) | ||||
(By Definition G.7 and triangle inequality) | ||||
(By the sub-multiplicative property of ) | ||||
(By Lemma G.6 and Lemma G.7) |
Set . We complete the proof. ∎
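The effect of matching multiplication dimensions can be illustrated with a toy computation. The sketch below assumes that the gradient has a sandwich form, two tall matrices on the outside and a matrix with a diagonal-times-low-rank representation in the middle, as in the cited works; the names `A1`, `A2`, `U`, `V`, and `r`, and all sizes, are assumptions of this sketch rather than the quantities in (G.7).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 8, 6
A1, A2 = rng.standard_normal((n, d)), rng.standard_normal((n, d))
U, V = rng.standard_normal((n, k)), rng.standard_normal((n, k))   # low-rank factors
r = rng.standard_normal(n)                                        # diagonal entries

# Slow route: materialise the n x n matrix diag(r) U V^T, quadratic in n.
slow = A1.T @ (r[:, None] * (U @ V.T)) @ A2

# Fast route: keep every factor thin, so the cost is O(n k d), linear in n.
left = A1.T @ (r[:, None] * U)       # d x k
right = V.T @ A2                     # k x d
fast = left @ right                  # d x d
```

Both routes give the same small matrix, and only the order of multiplication differs; the chained low-rank approximation in the proof relies on always choosing the order that avoids a quadratic-size intermediate.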
References
- Alman and Song [2023] Josh Alman and Zhao Song. Fast attention requires bounded entries. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023.
- Alman and Song [2024a] Josh Alman and Zhao Song. The fine-grained complexity of gradient computation for training large language models. arXiv preprint arXiv:2402.04497, 2024a.
- Alman and Song [2024b] Josh Alman and Zhao Song. How to capture higher-order correlations? generalizing matrix softmax attention to kronecker computation. In The Twelfth International Conference on Learning Representations (ICLR), 2024b.
- Ambrogioni [2023] Luca Ambrogioni. In search of dispersed memories: Generative diffusion models are associative memory networks. arXiv preprint arXiv:2309.17290, 2023.
- Bao et al. [2022] Fan Bao, Chongxuan Li, Yue Cao, and Jun Zhu. All are worth words: a vit backbone for score-based diffusion models. In NeurIPS 2022 Workshop on Score-Based Methods, 2022.
- Benton et al. [2024] Joe Benton, Valentin De Bortoli, Arnaud Doucet, and George Deligiannidis. Nearly d-linear convergence bounds for diffusion models via stochastic localization. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
- Bortoli [2022] Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research, 2022. ISSN 2835-8856.
- Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chen et al. [2024] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
- Chen et al. [2020a] Minshuo Chen, Xingguo Li, and Tuo Zhao. On generalization bounds of a family of recurrent neural networks. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (AISTATS), volume 108, pages 1233–1243, 2020a.
- Chen et al. [2020b] Minshuo Chen, Wenjing Liao, Hongyuan Zha, and Tuo Zhao. Distribution approximation and statistical estimation guarantees of generative adversarial networks. arXiv preprint arXiv:2002.03938, 2020b.
- Chen et al. [2023a] Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. In International Conference on Machine Learning (ICML), pages 4672–4712. PMLR, 2023a.
- Chen et al. [2023b] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations (ICLR), 2023b.
- Cygan et al. [2016] Marek Cygan, Holger Dell, Daniel Lokshtanov, Dániel Marx, Jesper Nederlof, Yoshio Okamoto, Ramamohan Paturi, Saket Saurabh, and Magnus Wahlström. On problems as hard as cnf-sat. ACM Transactions on Algorithms (TALG), 12(3):1–24, 2016.
- Diao et al. [2018] Huaian Diao, Zhao Song, Wen Sun, and David Woodruff. Sketching for kronecker product regression and p-splines. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1299–1308. PMLR, 2018.
- Diao et al. [2019] Huaian Diao, Rajesh Jayaram, Zhao Song, Wen Sun, and David Woodruff. Optimal sketching for kronecker product regression and low rank approximation. Advances in neural information processing systems (NeurIPS), 32, 2019.
- Edelman et al. [2022] Benjamin L Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning (ICML), pages 5793–5831. PMLR, 2022.
- Esser et al. [2024] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
- Floridi and Chiriatti [2020] Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020.
- Gao et al. [2023a] Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion transformer is a strong image synthesizer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23164–23173, 2023a.
- Gao et al. [2023b] Yeqi Gao, Zhao Song, Weixin Wang, and Junze Yin. A fast optimization view: Reformulating single layer attention in llm based on tensor and svm trick, and solving it in matrix multiplication time. arXiv preprint arXiv:2309.07418, 2023b.
- Gao et al. [2023c] Yeqi Gao, Zhao Song, and Shenghao Xie. In-context learning for attention scheme: from single softmax regression to multiple softmax regression via a tensor trick. arXiv preprint arXiv:2307.02419, 2023c.
- Gu et al. [2024] Jiuxiang Gu, Yingyu Liang, Zhenmei Shi, Zhao Song, and Yufa Zhou. Tensor attention training: Provably efficient learning of higher-order transformers. arXiv preprint arXiv:2405.16411, 2024.
- Guan et al. [2024] Jiaqi Guan, Xiangxin Zhou, Yuwei Yang, Yu Bao, Jian Peng, Jianzhu Ma, Qiang Liu, Liang Wang, and Quanquan Gu. Decompdiff: diffusion models with decomposed priors for structure-based drug design. arXiv preprint arXiv:2403.07902, 2024.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Hoover et al. [2023] Benjamin Hoover, Hendrik Strobelt, Dmitry Krotov, Judy Hoffman, Zsolt Kira, and Duen Horng Chau. Memory in plain sight: A survey of the uncanny resemblances between diffusion models and associative memories. arXiv preprint arXiv:2309.16750, 2023.
- Hu et al. [2023] Jerry Yao-Chieh Hu, Donglin Yang, Dennis Wu, Chenwei Xu, Bo-Yu Chen, and Han Liu. On sparse modern hopfield model. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023.
- Hu et al. [2024a] Jerry Yao-Chieh Hu, Pei-Hsuan Chang, Haozheng Luo, Hong-Yu Chen, Weijian Li, Wei-Po Wang, and Han Liu. Outlier-efficient hopfield layers for large transformer-based models. In Forty-first International Conference on Machine Learning (ICML), 2024a.
- Hu et al. [2024b] Jerry Yao-Chieh Hu, Bo-Yu Chen, Dennis Wu, Feng Ruan, and Han Liu. Nonparametric modern hopfield models. arXiv preprint arXiv:2404.03900, 2024b.
- Hu et al. [2024c] Jerry Yao-Chieh Hu, Thomas Lin, Zhao Song, and Han Liu. On computational limits of modern hopfield models: A fine-grained complexity analysis. In Forty-first International Conference on Machine Learning (ICML), 2024c.
- Impagliazzo and Paturi [2001] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-sat. Journal of Computer and System Sciences, 62(2):367–375, 2001.
- Ji et al. [2021] Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V Davuluri. Dnabert: pre-trained bidirectional encoder representations from transformers model for dna-language in genome. Bioinformatics, 37(15):2112–2120, 2021.
- Jiang and Li [2023] Haotian Jiang and Qianxiao Li. Approximation theory of transformer networks for sequence modeling. arXiv preprint arXiv:2305.18475, 2023.
- Kajitsuka and Sato [2023] Tokio Kajitsuka and Issei Sato. Are transformers with one layer self-attention using low-rank weight matrices universal approximators? arXiv preprint arXiv:2307.14023, 2023.
- Kim et al. [2022] Junghwan Kim, Michelle Kim, and Barzan Mozafari. Provable memorization capacity of transformers. In The Eleventh International Conference on Learning Representations (ICLR), 2022.
- Lagler et al. [2013] Klemens Lagler, Michael Schindelegger, Johannes Böhm, Hana Krásná, and Tobias Nilsson. Gpt2: Empirical slant delay model for radio space geodetic techniques. Geophysical research letters, 40(6):1069–1073, 2013.
- Liu et al. [2024] Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024.
- Liu et al. [2021] Zhonghua Liu, Yue Lu, Zhihui Lai, Weihua Ou, and Kaibing Zhang. Robust sparse low-rank embedding for image dimension reduction. Applied Soft Computing, 113:107907, 2021.
- Luo et al. [2023] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10209–10218. IEEE, 2023.
- Ma et al. [2024] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. arXiv preprint arXiv:2401.08740, 2024.
- Mahdavi et al. [2023] Sadegh Mahdavi, Renjie Liao, and Christos Thrampoulidis. Memorization capacity of multi-head attention in transformers. arXiv preprint arXiv:2306.02010, 2023.
- Mo et al. [2023] Shentong Mo, Enze Xie, Ruihang Chu, Lanqing Hong, Matthias Niessner, and Zhenguo Li. Dit-3d: Exploring plain diffusion transformers for 3d shape generation. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023.
- Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
- Oko et al. [2023] Kazusato Oko, Shunta Akiyama, and Taiji Suzuki. Diffusion models are minimax optimal distribution estimators. In International Conference on Machine Learning (ICML), pages 26517–26582. PMLR, 2023.
- Peebles and Xie [2023] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023.
- Pope et al. [2021] Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. arXiv preprint arXiv:2104.08894, 2021.
- Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
- Ramsauer et al. [2020] Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlovic, Geir Kjetil Sandve, et al. Hopfield networks is all you need. arXiv preprint arXiv:2008.02217, 2020.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pages 10684–10695, 2022.
- Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems (NeurIPS), 32, 2019.
- Su and Wu [2018] Bing Su and Ying Wu. Learning low-dimensional temporal representations. In International Conference on Machine Learning (ICML), pages 4761–4770. PMLR, 2018.
- Vahdat et al. [2021] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 11287–11302, 2021.
- Wang et al. [2024a] Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, and Quanquan Gu. Diffusion language models are versatile protein learners. arXiv preprint arXiv:2402.18567, 2024a.
- Wang et al. [2024b] Yan Wang, Lihao Wang, Yuning Shen, Yiqun Wang, Huizhuo Yuan, Yue Wu, and Quanquan Gu. Protein conformation generation via force-guided se (3) diffusion models. arXiv preprint arXiv:2403.14088, 2024b.
- Wang et al. [2023] Yihan Wang, Jatin Chauhan, Wei Wang, and Cho-Jui Hsieh. Universality and limitations of prompt tuning. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023.
- Wibisono et al. [2024] Andre Wibisono, Yihong Wu, and Kaylee Yingxi Yang. Optimal score estimation via empirical bayes smoothing. arXiv preprint arXiv:2402.07747, 2024.
- Williams [2018] Virginia Vassilevska Williams. On some fine-grained questions in algorithms and complexity. In Proceedings of the International Congress of Mathematicians: Rio de Janeiro 2018, pages 3447–3487. World Scientific, 2018.
- Wu et al. [2024a] Dennis Wu, Jerry Yao-Chieh Hu, Teng-Yun Hsiao, and Han Liu. Uniform memory retrieval with larger capacity for modern hopfield models. In Forty-first International Conference on Machine Learning (ICML), 2024a.
- Wu et al. [2024b] Dennis Wu, Jerry Yao-Chieh Hu, Weijian Li, Bo-Yu Chen, and Han Liu. STanhop: Sparse tandem hopfield model for memory-enhanced time series prediction. In The Twelfth International Conference on Learning Representations (ICLR), 2024b.
- Yun et al. [2020] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations (ICLR), 2020.
- Zheng et al. [2023] Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.
- Zhou et al. [2024a] Xiangxin Zhou, Xiwei Cheng, Yuwei Yang, Yu Bao, Liang Wang, and Quanquan Gu. Decompopt: Controllable and decomposed diffusion models for structure-based molecular optimization. arXiv preprint arXiv:2403.13829, 2024a.
- Zhou et al. [2024b] Xiangxin Zhou, Dongyu Xue, Ruizhe Chen, Zaixiang Zheng, Liang Wang, and Quanquan Gu. Antigen-specific antibody design via direct energy-based preference optimization. arXiv preprint arXiv:2403.16576, 2024b.
- Zhou et al. [2023] Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, and Han Liu. Dnabert-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006, 2023.
- Zhou et al. [2024c] Zhihan Zhou, Weimin Wu, Harrison Ho, Jiayi Wang, Lizhen Shi, Ramana V Davuluri, Zhong Wang, and Han Liu. Dnabert-s: Learning species-aware dna embedding with genome foundation models. ArXiv, 2024c.
- Zhu et al. [2023] Zhenyu Zhu, Francesco Locatello, and Volkan Cevher. Sample complexity bounds for score-matching: Causal discovery and generative modeling. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023.