We investigate the statistical and computational limits of latent Diffusion Transformers (DiTs) under the low-dimensional linear latent space assumption. Statistically, we study the universal approximation and sample complexity of the DiTs score function, as well as the distribution recovery property of the initial data. Specifically, under mild data assumptions, we derive an approximation error bound for the score network of latent DiTs, which is sub-linear in the latent space dimension. Additionally, we derive the corresponding sample complexity bound and show that the data distribution generated from the estimated score function converges toward a proximate area of the original one. Computationally, we characterize the hardness of both forward inference and backward computation of latent DiTs, assuming the Strong Exponential Time Hypothesis (SETH). For forward inference, we identify efficient criteria for all possible latent DiTs inference algorithms and showcase our theory by pushing the efficiency toward almost-linear time inference. For backward computation, we leverage the low-rank structure within the gradient computation of DiTs training for possible algorithmic speedup. Specifically, we show that such speedup achieves almost-linear time latent DiTs training by casting the DiTs gradient as a series of chained low-rank approximations with bounded error. Under the low-dimensional assumption, we show that the convergence rate and the computational efficiency are both dominated by the dimension of the subspace, suggesting that latent DiTs have the potential to bypass the challenges associated with the high dimensionality of initial data.

1 Introduction

We investigate the statistical and computational limits of latent diffusion transformers (DiTs), assuming the data is supported on an unknown low-dimensional linear subspace. This analysis is not only practical but also timely. On one hand, DiTs have demonstrated revolutionary success in generative AI and digital creation by using Transformers as score networks (Esser et al., 2024; Ma et al., 2024; Chen et al., 2024; Mo et al., 2023; Peebles and Xie, 2023). On the other hand, they require significant computational resources (Liu et al., 2024), making them challenging to train outside of specialized industrial labs. Therefore, it is natural to ask whether it is possible to make them lighter and faster without sacrificing performance. Answering these questions requires a fundamental understanding of the DiT architecture. This work provides a timely theoretical analysis of the fundamental limits of DiT architecture, aided by the analytical feasibility provided by the low-dimensional data assumption.

Empirically, Latent Diffusion is a go-to design for effectiveness and computational efficiency (Rombach et al., 2022; Liu et al., 2021; Pope et al., 2021; Su and Wu, 2018). Theoretically, it can accommodate the assumption of low-dimensional data structure (see Assumption 2.1 for the formal definition) for detailed analytical characterization (Chen et al., 2023a; Bortoli, 2022). In essence, diffusion models with low-dimensional data structures manifest a natural lower-dimensional diffusion process through encoder/decoder within a robust and informative latent representation feature space (Rombach et al., 2022; Pope et al., 2021). Such lower-dimensional diffusion improves computational efficiency by reducing data complexity without sacrificing essential information (Liu et al., 2021). With this assumption, Chen et al. (2023a) decompose the score function of U-Net based diffusion models into on-support and orthogonal components. This decomposition allows for the characterization of the distinct behaviors of the two components: the on-support component facilitates latent distribution learning, while the orthogonal component facilitates subspace recovery.

In our work, we utilize the low-dimensional data structure assumption to explore the statistical and computational limits of latent DiTs. Our analysis includes characterizations of statistical rates and provably efficient criteria. Statistically, we pose two questions and provide a theory to characterize the statistical rates of latent DiTs under the low-dimensional data assumption:

Question 1.

What is the approximation limit of using transformers to approximate the DiT score function, particularly in the low-dimensional data subspace?

Question 2.

How accurate is the estimation limit for such a score estimator in practical training scenarios? With the score estimator, how well can diffusion transformers recover the data distribution?

Computationally, the primary challenge of DiT lies in the transformer blocks’ quadratic complexity. This computational burden applies to both inference and training, even with latent diffusion. Thus, it is essential to design algorithms and methods to circumvent this $\Omega(L^{2})$ cost, where $L$ is the latent DiT sequence length. However, there are no formal results to support and characterize such algorithms. To address this gap, we pose the following question and provide a fundamental theory to fully characterize the complexity of latent DiT under the low-dimensional linear subspace data assumption:

Question 3.

Is it possible to improve the $\Omega(L^{2})$ time complexity with a bounded approximation error for both forward and backward passes? What is the computational limit for such an improvement?

Contributions.

We study the fundamental limits of latent DiT. Our contributions are threefold:

•

Score Approximation. We address Question 1 by characterizing the approximation limit of matching the DiT score function with a transformer-based score estimator. Specifically, under mild data assumptions, we derive an approximation error bound for the score network, sub-linear in the latent space dimension (Theorem 3.1). These results not only explain the expressiveness of latent DiT (under mild assumptions) but also provide guidance for the structural configuration of the score network for practical implementations (Theorem 3.1).
•

Score and Distribution Estimation. We address Question 2 by exploring the limitations of score and distribution estimations of latent DiTs in practical training scenarios. Specifically, we provide a sample complexity bound for score estimation (Corollary 3.1.1), using a norm-based covering number bound for the transformer architecture. Additionally, we show that the learned score estimator is able to recover the initial data distribution (Corollary 3.1.2).
•

Provably Efficient Criteria and Existence of Almost Linear Time Algorithms. We address Question 3 by providing provably efficient criteria for latent DiTs in both forward inference and backward computation/training. For forward inference, we characterize all possible efficient DiT algorithms using a norm-based efficiency threshold for both conditional and unconditional generation (Proposition 4.1). Efficient algorithms, including almost-linear time algorithms (Proposition 4.2), are possible only below this threshold. For backward computation, we prove the existence of almost-linear time DiT training algorithms (Theorem 4.1) by utilizing the inherent low-rank structure in DiT gradients through a chained low-rank approximation.

Interestingly, both our statistical and computational results (C1-3) are dominated by the subspace dimension under the low-dimensional assumption, suggesting that latent DiT can potentially bypass the challenges associated with the high dimensionality of initial data.

Organization.

Section 2 includes background on score decomposition and Transformer-based score networks. Section 3 presents the statistical rates of DiTs. Section 4 provides provably efficient criteria. We defer discussions of related works to Appendix C due to space constraints.

Notations.

We use lower case letters to denote vectors, e.g., $z\in\mathbb{R}^{D}$. $\norm{z}_{2}$ and $\norm{z}_{\infty}$ denote its Euclidean norm and infinity norm respectively. We use upper case letters to denote matrices, e.g., $Z\in\mathbb{R}^{d\times L}$. $\norm{Z}_{2}$, $\norm{Z}_{\rm op}$, and $\norm{Z}_{F}$ denote the $2$-norm, operator norm and Frobenius norm respectively. $\norm{Z}_{p,q}$ denotes the $p,q$-norm where the $p$-norm is over columns and $q$-norm is over rows. Given a function $f$, let $\norm{f(x)}_{L^{2}}\coloneqq(\int\norm{f(x)}_{2}^{2}\differential x)^{1/2}$, and $\norm{f(\cdot)}_{Lip}=\sup_{x\neq y}(\norm{f(x)-f(y)}_{2}/\norm{x-y}_{2})$. With a distribution $P$, we denote $\norm{f}_{L^{2}(P)}=(\int_{P}\norm{f(x)}_{2}^{2}\differential x)^{1/2}$ as the $L^{2}(P)$ norm. Let $f_{\sharp}P$ be a pushforward measure, i.e., for any measurable $\Omega$, $(f_{\sharp}P)(\Omega)=P(f^{-1}(\Omega))$. We use $\psi$ for (conditional) Gaussian density functions.

2 Background

This section reviews the ideas we build on, including an overview of diffusion models (Section 2.1), the score decomposition under the linear latent space assumption (Section 2.2), and the transformer backbone in DiT (Section 2.3).

2.1 Score-Matching Denoising Diffusion Models

We briefly review the forward process, the backward process, and score matching in diffusion models.

Forward and Backward Process.

In the forward process, diffusion models gradually add noise to the original data $x_{0}\in\mathbb{R}^{D}$, where $x_{0}\sim P_{0}$. Let $x_{t}$ denote the noisy data at time $t$, with marginal distribution $P_{t}$ and density $p_{t}$. The conditional distribution $P(x_{t}|x_{0})$ follows $N(\beta(t)x_{0},\sigma(t)I_{D})$, where $\beta(t)={\exp}(-\int_{0}^{t}w(s)\mathrm{d}s/2)$, $\sigma(t)=1-\beta^{2}(t)$, and $w(t)>0$ is a nondecreasing weighting function. In practice, the forward process terminates at a large enough $T$ such that $P_{T}$ is close to $N(0,I_{D})$. In the backward process, we obtain $y_{t}$ by reversing the forward process. The generation of $y_{t}$ depends on the score function $\nabla\log p_{t}(\cdot)$. Since this score function is unknown in practice, we replace $\nabla\log p_{t}(\cdot)$ with a score estimator $s_{W}(\cdot,t)$, which is usually a neural network with parameters $W$. See Section D.1 for details.
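As an illustration of the forward process, the following minimal sketch draws $x_{t}$ from $N(\beta(t)x_{0},\sigma(t)I_{D})$. It assumes the constant weighting $w(s)=1$, so that $\beta(t)=e^{-t/2}$ and $\sigma(t)=1-e^{-t}$; the function names and the toy initial data are hypothetical and serve only as an example.

```python
import numpy as np

def beta(t):
    # beta(t) = exp(-int_0^t w(s) ds / 2), with the assumed choice w(s) = 1.
    return np.exp(-t / 2.0)

def sigma(t):
    # sigma(t) = 1 - beta(t)^2 is the conditional variance (not the std).
    return 1.0 - beta(t) ** 2

def forward_sample(x0, t, rng):
    """Draw x_t ~ N(beta(t) x0, sigma(t) I_D) for a batch x0 of shape (n, D)."""
    noise = rng.standard_normal(x0.shape)
    return beta(t) * x0 + np.sqrt(sigma(t)) * noise

rng = np.random.default_rng(0)
# Toy initial data (an assumption for illustration): shifted, scaled Gaussian.
x0 = rng.standard_normal((20000, 4)) * 3.0 + 1.0
# For a large terminal time T, the marginal P_T is close to N(0, I_D).
xT = forward_sample(x0, t=20.0, rng=rng)
print(xT.mean(), xT.var())
```

Under these assumptions, the empirical mean and variance of `xT` should be close to 0 and 1, consistent with $P_{T}\approx N(0,I_{D})$.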

Score Matching.

To estimate the score function, we use the following loss

\displaystyle\min_{W}\int_{T_{0}}^{T}\gamma(t)\mathbb{E}_{x_{t}\sim P_{t}}\left[\norm{s_{W}(x_{t},t)-\nabla\log p_{t}(x_{t})}_{2}^{2}\right]\differential t,

where $\gamma(t)$ is the weight function, and $T_{0}$ is a small value to stabilize training and prevent the score function from blowing up (Vahdat et al., 2021). However, it is hard to compute $\nabla\log p_{t}(\cdot)$ with available data samples. Therefore, we minimize the equivalent denoising score matching objective

\displaystyle\min_{W}\int_{T_{0}}^{T}\gamma(t)\mathbb{E}_{x_{0}\sim P_{0}}\left[\mathbb{E}_{x_{t}|x_{0}}\left[\left\|s_{W}(x_{t},t)-\nabla_{x_{t}}\log\psi_{t}(x_{t}\mid x_{0})\right\|_{2}^{2}\right]\right]\differential t,

(2.1)

where $\psi_{t}(x_{t}|x_{0})$ is the transition kernel, so that $\nabla_{x_{t}}\log\psi_{t}(x_{t}|x_{0})=\left(\beta(t)x_{0}-x_{t}\right)/\sigma(t)$.
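The denoising objective (2.1) can be illustrated at a single fixed time $t$ with a Monte Carlo estimate. The sketch below assumes $w(s)=1$ (so $\beta(t)=e^{-t/2}$, $\sigma(t)=1-e^{-t}$) and toy data $x_{0}\sim N(0,I_{D})$, for which the true score is $-x$; the helper names and the deliberately wrong comparison score are hypothetical.

```python
import numpy as np

def beta(t):
    return np.exp(-t / 2.0)          # assumes w(s) = 1

def sigma(t):
    return 1.0 - np.exp(-t)          # conditional variance

def dsm_target(x0, xt, t):
    # grad_{x_t} log psi_t(x_t | x0) = (beta(t) x0 - x_t) / sigma(t)
    return (beta(t) * x0 - xt) / sigma(t)

def dsm_loss(score_fn, x0, t, rng):
    """Monte Carlo estimate of the inner expectation of (2.1) at a fixed t."""
    xt = beta(t) * x0 + np.sqrt(sigma(t)) * rng.standard_normal(x0.shape)
    resid = score_fn(xt, t) - dsm_target(x0, xt, t)
    return np.mean(np.sum(resid ** 2, axis=1))

D, n, t = 3, 5000, 1.0
x0 = np.random.default_rng(0).standard_normal((n, D))   # toy P_0 = N(0, I_D)

# With P_0 = N(0, I_D), P_t = N(0, I_D) and the true score is -x.
loss_true = dsm_loss(lambda x, t: -x, x0, t, np.random.default_rng(1))
loss_wrong = dsm_loss(lambda x, t: -2.0 * x, x0, t, np.random.default_rng(1))
print(loss_true, loss_wrong)
```

In this toy setting the true score attains the smaller loss, and its value is close to $D\beta^{2}(t)/\sigma(t)$, the irreducible part of the denoising objective.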

To train the parameters $W$ in the score estimator $s_{W}(\cdot,t)$ , we use the empirical version of (2.1). We select $n$ i.i.d. data samples $\{x_{0,i}\}_{i=1}^{n}\sim P_{0}$ , and sample time $t_{i}$ $(1\leq i\leq n)$ uniformly from interval $[T_{0},T]$ . Given $x_{0,i}$ , we sample $x_{t_{i}}$ from $N(\beta(t_{i})x_{0,i},\sigma(t_{i})I_{D})$ . The empirical loss is

\displaystyle\min_{W}\ \widehat{\mathcal{L}}(W)=\frac{1}{n}\sum_{i=1}^{n}\norm{s_{W}(x_{t_{i}},t_{i})-x_{0,i}}_{2}^{2}.

(2.2)

For notational convenience, we denote the population loss by $\mathcal{L}(W)=\mathbb{E}_{P_{0}}[\widehat{\mathcal{L}}(W)]$.

2.2 Score Decomposition in Linear Latent Space

In this part, we review the score decomposition in (Chen et al., 2023a). We assume that the $D$-dimensional input data $x$ is supported on a $d_{0}$-dimensional subspace, where $d_{0}\leq D$.

Assumption 2.1 (Low-Dimensional Linear Latent Space).

Data point $x$ can be written as $x=Bh$ , where $B\in\mathbb{R}^{D\times d_{0}}$ is an unknown matrix with orthonormal columns. The latent variable $h\in\mathbb{R}^{d_{0}}$ follows the distribution $P_{h}$ with a density function $p_{h}$ .

Remark 2.1.

By “Linear Latent Space,” we mean that each entry of a given latent vector is a linear combination of the entries of the corresponding input, i.e., $h=B^{\top}x$. This is also known as the “low-dimensional data” assumption in the literature (Chen et al., 2023a).

Based on the low-dimensional data structure assumption, we have the following score decomposition theory: on-support score $s_{+}(B^{\top}x,t)$ and orthogonal score $s_{-}(x,t)$ .

Lemma 2.1 (Score Decomposition, Lemma 1 of (Chen et al., 2023a)).

Let data $x=Bh$ follow Assumption 2.1. The decomposition of score function $\nabla\log p_{t}(x)$ is

\displaystyle\nabla\log p_{t}(x)=\underbrace{B\nabla\log p_{t}^{h}(\bar{h})}_{s_{+}(\bar{h},t)}\underbrace{-\left(I_{D}-BB^{\top}\right)x/\sigma(t)}_{s_{-}(x,t)},\quad\bar{h}=B^{\top}x,

(2.3)

where $p_{t}^{h}(\bar{h})\coloneqq\int\psi_{t}(\bar{h}|h)p_{h}(h)\differential h$ , $\psi_{t}(\cdot|h)$ is the Gaussian density function of $N(\beta(t)h,\sigma(t)I_{d_{0}})$ , $\beta(t)=e^{-t/2}$ and $\sigma(t)=1-e^{-t}$ . We restate the proof in Section D.2 for completeness.
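The decomposition (2.3) can be checked numerically in a case where everything is available in closed form. The sketch below assumes a Gaussian latent $h\sim N(0,s^{2}I_{d_{0}})$, which is an illustrative choice rather than part of the lemma. Then $x_{t}\sim N(0,\beta^{2}(t)s^{2}BB^{\top}+\sigma(t)I_{D})$ and $p_{t}^{h}$ is the density of $N(0,(\beta^{2}(t)s^{2}+\sigma(t))I_{d_{0}})$, so both sides of (2.3) can be evaluated directly.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d0, t = 6, 2, 0.7
s2 = 2.5                                  # assumed latent variance: h ~ N(0, s2 * I)

# Random B with orthonormal columns via a reduced QR factorization.
B, _ = np.linalg.qr(rng.standard_normal((D, d0)))

beta2 = np.exp(-t)                        # beta(t)^2 with beta(t) = exp(-t/2)
sigma = 1.0 - np.exp(-t)                  # sigma(t)

x = rng.standard_normal(D)                # arbitrary evaluation point

# Left-hand side: score of N(0, beta^2 s2 B B^T + sigma I) is -Cov^{-1} x.
cov = beta2 * s2 * (B @ B.T) + sigma * np.eye(D)
score_full = -np.linalg.solve(cov, x)

# Right-hand side of (2.3): on-support plus orthogonal component.
h_bar = B.T @ x
s_plus = B @ (-h_bar / (beta2 * s2 + sigma))          # B * grad log p_t^h(h_bar)
s_minus = -((np.eye(D) - B @ B.T) @ x) / sigma        # -(I - B B^T) x / sigma(t)

print(np.allclose(score_full, s_plus + s_minus))
```

The on-support term depends on $x$ only through $\bar{h}=B^{\top}x$, while the orthogonal term is linear in $x$ and scales as $1/\sigma(t)$, which is the source of the blow-up as $t\to 0$.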

Additionally, our theoretical analysis is based on the following two assumptions, as in (Chen et al., 2023a).

Assumption 2.2 (Tail Behavior of $P_{h}$ ).

The density function $p_{h}>0$ is twice continuously differentiable. Moreover, there exist positive constants $A_{0},A_{1},A_{2}$ such that when $\norm{h}_{2}\geq A_{0}$ , the density function $p_{h}(h)\leq(2\pi)^{-d_{0}/2}A_{1}{\exp}(-A_{2}\|h\|_{2}^{2}/2)$ .

Assumption 2.3 ( $L_{s_{+}}$ -Lipschitz of $s_{+}(\bar{h},t)$ ).

The on-support score function $s_{+}(\bar{h},t)$ is $L_{s_{+}}$ -Lipschitz in $\bar{h}\in\mathbb{R}^{d_{0}}$ for any $t\in[0,T]$ .

2.3 Score Network and Transformers

In this part, we introduce the score network architecture and Transformers. Transformers are the backbone of the score network in DiT. By Assumption 2.1, $\bar{h}=B^{\top}x\in\mathbb{R}^{d_{0}}$ with $d_{0}<D$ .

(Latent) Score Network.

Following (Chen et al., 2023a), we rearrange (2.3) into

\displaystyle\nabla\log p_{t}(x)=B(\underbrace{\sigma(t)\nabla\log p_{t}^{h}(B^{\top}x)+B^{\top}x}_{\coloneqq q(B^{\top}x,t):\mathbb{R}^{d_{0}}\times[T_{0},T]\to\mathbb{R}^{d_{0}}})/\sigma(t)-x/\sigma(t).

(2.4)

We use $W_{B}\in\mathbb{R}^{D\times d_{0}}$ to approximate $B\in\mathbb{R}^{D\times d_{0}}$ , and a neural network $f(W_{B}^{\top}x,t)$ to approximate $q(B^{\top}x,t)$ . We adopt the following score network class for diffusion in latent space (i.e., in $h\in\mathbb{R}^{d_{0}}$ )

\displaystyle\mathcal{S}=\left\{s_{W}(x,t)=W_{B}f(W_{B}^{\top}x,t)/\sigma(t)-x/\sigma(t),\ W=\{W_{B},f\}\right\},

(2.5)

where the columns of $W_{B}$ are orthogonal and $f:\mathbb{R}^{d_{0}}\times[T_{0},T]\rightarrow\mathbb{R}^{d_{0}}$ is a neural network. In our work, we focus on diffusion transformers (DiTs), i.e., using a Transformer for $f$ (Peebles and Xie, 2023).
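A minimal sketch of the score network form in (2.5) is given below. The network $f$ is replaced by a hypothetical closed-form stand-in `toy_f`, valid only for the illustrative case $h\sim N(0,I_{d_{0}})$, where $q(\bar{h},t)=\sigma(t)(-\bar{h})+\bar{h}=e^{-t}\bar{h}$; it is not a Transformer and is used only to show how $W_{B}$ and $f$ enter the score.

```python
import numpy as np

def sigma(t):
    # sigma(t) = 1 - exp(-t), the choice used in Lemma 2.1.
    return 1.0 - np.exp(-t)

def score_network(x, t, W_B, f):
    """Evaluate s_W(x, t) = W_B f(W_B^T x, t) / sigma(t) - x / sigma(t), as in (2.5)."""
    h = W_B.T @ x                          # project onto the d0-dimensional latent space
    return (W_B @ f(h, t) - x) / sigma(t)

def toy_f(h, t):
    # Hypothetical stand-in for the Transformer f: the exact q(h, t)
    # when the latent variable is standard Gaussian (an assumption).
    return np.exp(-t) * h

rng = np.random.default_rng(0)
D, d0, t = 5, 2, 0.4
W_B, _ = np.linalg.qr(rng.standard_normal((D, d0)))    # orthonormal columns
x = rng.standard_normal(D)
s = score_network(x, t, W_B, toy_f)
print(s)
```

With $W_{B}=B$ and $f=q$, this expression coincides with the true score in (2.4), which is why the approximation problem reduces to approximating $q$ on the latent space.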

Transformers.

A Transformer block consists of a self-attention layer and a feed-forward layer, each with a skip connection. We use $\tau^{r,m,l}:\mathbb{R}^{d\times L}\rightarrow\mathbb{R}^{d\times L}$ to denote a Transformer block. Here $r$ and $m$ are the number of heads and the head size in the self-attention layer, and $l$ is the hidden dimension in the feed-forward layer. Let $X\in\mathbb{R}^{d\times L}$ be the model input. Then the model output is

\displaystyle{\rm Attn}(X)=X+\sum\nolimits_{i=1}^{r}W_{O}^{i}W_{V}^{i}X\cdot\mathop{\rm{Softmax}}\left(\left(W_{K}^{i}X\right)^{\mathsf{T}}W_{Q}^{i}X\right),

(2.6)

\displaystyle{\rm FF}\circ{\rm Attn}(X)={\rm Attn}(X)+W_{2}\cdot{\rm ReLU}(W_{1}\cdot{\rm Attn}(X)+b_{1}\mathds{1}_{L}^{\mathsf{T}})+b_{2}\mathds{1}_{L}^{\mathsf{T}},

(2.7)

where $W_{K}^{i},W_{Q}^{i},W_{V}^{i}\in\mathbb{R}^{m\times d}$, $W_{O}^{i}\in\mathbb{R}^{d\times m}$, $W_{1}\in\mathbb{R}^{l\times d}$, $W_{2}\in\mathbb{R}^{d\times l}$, $b_{1}\in\mathbb{R}^{l}$, and $b_{2}\in\mathbb{R}^{d}$.
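The block in (2.6) and (2.7) can be sketched in a few lines of NumPy. The sketch assumes that the softmax acts column-wise on the $L\times L$ matrix (the text does not state the axis), and it uses random weights with $r=2$, $m=1$, $l=4$ purely for illustration; the function and dictionary names are hypothetical.

```python
import numpy as np

def softmax_cols(A):
    # Column-wise softmax (assumed convention): each column sums to one.
    A = A - A.max(axis=0, keepdims=True)   # shift for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def attn(X, W_O, W_V, W_K, W_Q):
    """(2.6): X + sum_i W_O^i W_V^i X Softmax((W_K^i X)^T W_Q^i X); X has shape (d, L)."""
    out = X.copy()
    for Wo, Wv, Wk, Wq in zip(W_O, W_V, W_K, W_Q):
        out = out + Wo @ Wv @ X @ softmax_cols((Wk @ X).T @ (Wq @ X))
    return out

def ff(Z, W1, b1, W2, b2):
    """(2.7): Z + W2 ReLU(W1 Z + b1 1^T) + b2 1^T, applied to every column."""
    return Z + W2 @ np.maximum(W1 @ Z + b1[:, None], 0.0) + b2[:, None]

def block(X, params):
    return ff(attn(X, *params["attn"]), *params["ff"])

rng = np.random.default_rng(0)
d, L, r, m, l = 4, 9, 2, 1, 4
X = rng.standard_normal((d, L))
params = {
    "attn": (
        [rng.standard_normal((d, m)) for _ in range(r)],   # W_O^i
        [rng.standard_normal((m, d)) for _ in range(r)],   # W_V^i
        [rng.standard_normal((m, d)) for _ in range(r)],   # W_K^i
        [rng.standard_normal((m, d)) for _ in range(r)],   # W_Q^i
    ),
    "ff": (rng.standard_normal((l, d)), np.zeros(l),
           rng.standard_normal((d, l)), np.zeros(d)),
}
Y = block(X, params)
print(Y.shape)
```

Because both layers carry skip connections, setting all weights to zero reduces the block to the identity map; the explicit $L\times L$ softmax matrix is the source of the quadratic cost discussed in Section 4.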

In our work, we use Transformer networks with positional encoding $E\in\mathbb{R}^{d\times L}$ . We define the Transformer networks as the composition of Transformer blocks

\displaystyle\mathcal{T}_{P}^{r,m,l}=\{f_{\mathcal{T}}:\mathbb{R}^{d\times L}\rightarrow{\mathbb{R}^{d\times L}}\mid f_{\mathcal{T}}\text{ is a composition of blocks }\tau^{r,m,l}\text{'s}\}.

For example, the following is a Transformer network consisting of $K$ blocks and a positional encoding

\displaystyle f_{\mathcal{T}}(X)={\rm FF}^{(K)}\circ{\rm Attn}^{(K)}\circ% \cdots{\rm FF}^{(1)}\circ{\rm Attn}^{(1)}(X+E).

(2.8)

3 Statistical Rates of Latent DiTs with Subspace Data Assumption

In this section, we analyze the statistical rates of latent DiTs. Section 3.1 introduces the class of latent DiT score networks. In Section 3.2, we prove the approximation limit of matching the DiT score function with the score network class, and characterize the structural configuration of the score network when a specified approximation error is required. Following this, in Section 3.3, utilizing the characterized structural configuration, we prove the score and distribution estimation for latent DiTs.

3.1 DiT Score Network Class

In this part, we give the details of the DiT score network class used in our analysis. In (2.5), $f$ is a network with a Transformer as the backbone, and $(h,t)\in\mathbb{R}^{d_{0}}\times[T_{0},T]$ denotes the input data. Following (Peebles and Xie, 2023), DiT uses the time point $t$ to calculate the scale and shift values in the Transformer backbone, and it transforms an input image into a sequence. To perform this transformation, we introduce a reshape layer.

Definition 3.1 (DiT Reshape Layer $R(\cdot)$ ).

Let $R(\cdot):\mathbb{R}^{d_{0}}\to\mathbb{R}^{d\times L}$ be a reshape layer that transforms the $d_{0}$ -dimensional input into a $d\times L$ matrix. Specifically, for any $d_{0}=i\times i$ image input, $R(\cdot)$ converts it into a sequence representation with feature dimension $d\coloneqq p^{2}$ (where $p\geq 2$ ) and sequence length $L\coloneqq\left(i/p\right)^{2}$ . Besides, we define the corresponding reverse reshape (flatten) layer $R^{-1}(\cdot):\mathbb{R}^{d\times L}\to\mathbb{R}^{d_{0}}$ as the inverse of $R(\cdot)$ . By $d_{0}=dL$ , $R,R^{-1}$ are associative w.r.t. their input.
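One possible realization of the reshape layer $R(\cdot)$ and its inverse is the standard patch split sketched below. The patch ordering (row-major over the grid of patches, row-major within each patch) is an assumption, since Definition 3.1 fixes only the shapes; the function names are hypothetical.

```python
import numpy as np

def reshape_R(h, i, p):
    """Map a flattened i x i input (length d0 = i*i) to a (d, L) matrix
    with d = p*p and L = (i/p)**2, one flattened p x p patch per column."""
    assert h.size == i * i and i % p == 0
    g = i // p                                   # patches per side
    img = h.reshape(i, i)
    # Axes (a, u, b, v) index pixel (a*p + u, b*p + v); reorder to (a, b, u, v).
    patches = img.reshape(g, p, g, p).transpose(0, 2, 1, 3)
    return patches.reshape(g * g, p * p).T       # shape (d, L)

def reshape_R_inv(X, i, p):
    """Inverse (flatten) layer: map a (d, L) matrix back to a length-d0 vector."""
    g = i // p
    patches = X.T.reshape(g, g, p, p).transpose(0, 2, 1, 3)   # back to (a, u, b, v)
    return patches.reshape(i * i)

i, p = 4, 2
h = np.arange(i * i, dtype=float)                # toy 4 x 4 "image", flattened
X = reshape_R(h, i, p)
print(X.shape)
print(np.array_equal(reshape_R_inv(X, i, p), h))
```

Both maps only permute entries, so $d_{0}=dL$ holds by construction and $R^{-1}\circ R$ is the identity.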

To simplify the self-attention block in (2.6), let $W_{OV}^{i}=W_{O}^{i}W_{V}^{i}$ and $W_{KQ}^{i}=(W_{K}^{i})^{\mathsf{T}}W_{Q}^{i}$ .

Definition 3.2 (Transformer Network Class $\mathcal{T}_{p}^{r,m,l}$ ).

We define the Transformer network class as

\displaystyle\mathcal{T}_{p}^{r,m,l}(K,C_{\mathcal{T}},C_{OV}^{2,\infty},C_{OV},C_{KQ}^{2,\infty},C_{KQ},C_{F}^{2,\infty},C_{F},C_{E},L_{\mathcal{T}}),\ \text{satisfying the constraints}

•

Model architecture with $K$ blocks: $f_{\mathcal{T}}(X)={\rm FF}^{(K)}\circ{\rm Attn}^{(K)}\circ\cdots\circ{\rm FF}^{(1)}\circ{\rm Attn}^{(1)}(X)$;
•

Model output bound: $\sup_{X}\norm{f_{\mathcal{T}}(X)}_{2}\leq C_{\mathcal{T}}$ ;
•

Parameter bound in ${\rm Attn^{(i)}}$ : $\norm{(W_{OV}^{i})^{\top}}_{2,\infty}\leq C_{OV}^{2,\infty}$ , $\norm{(W_{OV}^{i})^{\top}}_{2}\leq C_{OV}$ , $\norm{W_{KQ}^{i}}_{2,\infty}\leq C_{KQ}^{2,\infty}$ , $\norm{W_{KQ}^{i}}_{2}\leq C_{KQ}$ , $\norm{E^{\top}}_{2,\infty}\leq C_{E},\forall i\in[K]$ ;
•

Parameter bound in ${\rm FF^{(i)}}$: $\norm{W_{j}^{i}}_{2,\infty}\leq C_{F}^{2,\infty},\norm{W_{j}^{i}}_{2}\leq C_{F},\forall j\in[2],i\in[K]$;
•

Lipschitz of $f_{\mathcal{T}}$: $\norm{f_{\mathcal{T}}(X_{1})-f_{\mathcal{T}}(X_{2})}_{F}\leq L_{\mathcal{T}}\norm{X_{1}-X_{2}}_{F},\forall X_{1},X_{2}\in\mathbb{R}^{d\times L}$.

Definition 3.3 (DiT Score Network Class $\mathcal{S}_{\mathcal{T}_{p}^{r,m,l}}$ ).

We denote $\mathcal{S}_{\mathcal{T}_{p}^{r,m,l}}$ as the DiT score network class in (2.5), replacing $f$ with ${R^{-1}\circ f_{\mathcal{T}}\circ R}$ , and $f_{\mathcal{T}}$ is from the Transformer class $\mathcal{T}_{p}^{r,m,l}$ .

3.2 Score Approximation of DiT

Here, we explore the approximation limit of the latent DiT score network class $\mathcal{S}_{\mathcal{T}_{p}^{r,m,l}}$ under the linear latent space assumption. Recall that $P_{t}$ is the distribution of $x_{t}$, $\sigma(t)$ is the variance of $P(x_{t}|x_{0})$, $d_{0}$ is the dimension of the latent space, $L$ is the sequence length of the transformer input, $T$ is the stopping time in the forward process, $T_{0}$ is the early stopping time in the backward process, and $L_{s_{+}}$ is the Lipschitz coefficient of the on-support score function. Then we have the following Theorem 3.1.

Theorem 3.1 (Score Approximation of DiT).

For any approximation error $\epsilon>0$ and any data distribution $P_{0}$ under Assumptions 2.1, 2.2 and 2.3, there exists a DiT score network $s_{\widehat{W}}$ from $\mathcal{S}_{\mathcal{T}_{p}^{2,1,4}}$ (defined in Definition 3.3), where $\widehat{W}=\{\widehat{W}_{B},\widehat{f}_{\mathcal{T}}\}$, such that for any $t\in[T_{0},T]$, we have:

\displaystyle\norm{s_{\widehat{W}}(\cdot,t)-\nabla\log p_{t}(\cdot)}_{L^{2}(P_{t})}\leq\epsilon\cdot\sqrt{d_{0}}/\sigma(t),

where $\sigma(t)=1-e^{-t}$, and the upper bounds on the hyperparameters of $\mathcal{S}_{\mathcal{T}_{p}^{2,1,4}}$ are

\displaystyle K=\mathcal{O}(\epsilon^{-2L}),\quad C_{\mathcal{T}}=\mathcal{O}\left(d_{0}L_{s_{+}}\sqrt{d_{0}\log(d_{0}/T_{0})+\log(1/\epsilon)}\right),
\displaystyle C_{OV}^{2,\infty}=(1/\epsilon)^{\mathcal{O}(1)},\quad C_{OV}=(1/\epsilon)^{\mathcal{O}(1)},\quad C_{KQ}^{2,\infty}=(1/\epsilon)^{\mathcal{O}(1)},\quad C_{KQ}=(1/\epsilon)^{\mathcal{O}(1)},
\displaystyle C_{E}=\mathcal{O}(L^{3/2}),\quad C_{F}^{2,\infty}=(1/\epsilon)^{\mathcal{O}(1)},\quad C_{F}=(1/\epsilon)^{\mathcal{O}(1)},\quad L_{\mathcal{T}}=\mathcal{O}\left(d_{0}L_{s_{+}}\right).

Proof Sketch.

Our proof builds on the tail behavior of the low-dimensional latent variable distribution $P_{h}$ (Assumption 2.2). Recall that $\nabla\log p_{t}(x)=Bq(\bar{h},t)/\sigma(t)-x/\sigma(t)$, where $\bar{h}=B^{\top}x$ (defined in (2.4)). By taking $\widehat{W}_{B}=B$, our aim reduces to constructing a transformer network that approximates $q(\bar{h},t)$. To achieve this, we first approximate $q(\bar{h},t)$ with a compactly supported continuous function, based on the tail behavior of $P_{h}$. Then we construct a transformer to approximate this compactly supported continuous function using the universal approximation capacity of transformers (Yun et al., 2020). See Section F.1 for a detailed proof. ∎

Intuitively, Theorem 3.1 indicates the capability of the transformer-based score network to approximate the score function with precise guarantees. Furthermore, Theorem 3.1 provides empirical guidance for the design choices of the score network when a specified approximation error is required.

Remark 3.1 (Comparing with Existing Works).

Theoretical analysis of DiTs is limited. Previous works that do not specify the model architecture assume that the score estimator is well-approximated (Benton et al., 2024; Wibisono et al., 2024). To the best of our knowledge, this work is the first to present an approximation theory for DiTs, offering the estimation theory in Corollaries 3.1.1 and 3.1.2 based on the estimated score network, rather than a perfectly trained one.

Remark 3.2 (Latent Dimension Dependency).

Theorem 3.1 suggests that the approximation capacity and Transformer network size primarily depend on the latent variable dimension $d_{0}=d\times L$ . This indicates that DiTs can potentially bypass the challenges associated with the high dimensionality of initial data by transforming input data into a low-dimensional latent variable.

3.3 Score Estimation and Distribution Estimation

Besides score approximation capability, Theorem 3.1 also characterizes the structural configuration of the score network for any specific precision, e.g., $K,C_{E},C_{F}$, etc. This characterization enables further analysis of the performance of the score network in practical scenarios. In Corollary 3.1.1, we provide a sample complexity bound for score estimation. In Corollary 3.1.2, we show that the learned score estimator is able to recover the initial data distribution.

Score Estimation.

To derive a sample complexity for score estimation using $\mathcal{S}_{\mathcal{T}_{p}^{2,1,4}}$, we rewrite the score matching objective in (2.2) as $\widehat{W}\in\mathop{\mathrm{argmin}}_{s_{W}\in\mathcal{S}_{\mathcal{T}_{p}^{2,1,4}}}\widehat{\mathcal{L}}(s_{W}),\ \widehat{W}=\{\widehat{W}_{B},\widehat{f}_{\mathcal{T}}\}$.

Corollary 3.1.1 shows that as the sample size $n\rightarrow\infty$, $s_{\widehat{W}}(\cdot,t)$ converges to $\nabla\log p_{t}(\cdot)$.

Corollary 3.1.1 (Score Estimation of DiT).

Under Assumptions 2.1, 2.2 and 2.3, we choose $\mathcal{S}_{\mathcal{T}_{p}^{2,1,4}}$ as in Theorem 3.1 using $\epsilon\in(0,1)$ and $L>1$. With probability $1-1/\mathrm{poly}(n)$, we have

\displaystyle\frac{1}{T-T_{0}}\int_{T_{0}}^{T}\norm{s_{\widehat{W}}(\cdot,t)-\nabla\log p_{t}(\cdot)}_{L^{2}(P_{t})}\differential t=\widetilde{\mathcal{O}}\left(\frac{1}{n^{1/2}}\frac{T}{T_{0}}\cdot 2^{(1/\epsilon)^{2L}}+\frac{1}{T_{0}T}\epsilon^{2}+\frac{1}{n}\right),

(3.1)

where $\widetilde{\mathcal{O}}$ hides factors depending on $D,d_{0},d,L_{s_{+}},\log n$.

Proof.

See Section F.2 for a detailed proof. ∎

Intuitively, Corollary 3.1.1 shows a sample complexity bound for score estimation in practice.

Remark 3.3 (Comparing with Existing Works).

Zhu et al. (2023) provide a sample complexity for simple ReLU-based diffusion models under the assumption of an accurate score estimator. To the best of our knowledge, we are the first to provide a sample complexity for DiTs, based on the learned score network in Theorem 3.1 and the quantization (piece-wise approximation) approach for transformer universality (Yun et al., 2020).

Remark 3.4.

Corollary 3.1.1 reports an explicit sample complexity bound for score estimation of latent DiTs, with a double exponential factor $2^{(1/\epsilon)^{2L}}$ in the first term. This factor arises because the required depth $K$ is $\mathcal{O}(\epsilon^{-2L})$ and the norm of the required weight parameters is $(1/\epsilon)^{\mathcal{O}(1)}$, as shown in Theorem 3.1, assuming the universality of transformers requires dense layers (Yun et al., 2020). This motivates us to rethink transformer universality and explore new proof techniques for DiTs, which we leave for future work.

Definition 3.4.

For later convenience, we define $\xi(n,\epsilon,L):=\frac{1}{n^{1/2}}\frac{T}{T_{0}}\cdot 2^{(1/\epsilon)^{2L}}+\frac{1}{T_{0}T}\epsilon^{2}+\frac{1}{n}$.
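To get a feel for the double exponential factor in $\xi(n,\epsilon,L)$, the following sketch evaluates the first term in log space, since $2^{(1/\epsilon)^{2L}}$ overflows floating point almost immediately. It ignores all constants and logarithmic factors hidden by $\widetilde{\mathcal{O}}$ and treats $T$ and $T_{0}$ as fixed inputs, so the numbers are only indicative; the function names and the example values of $T$ and $T_{0}$ are hypothetical.

```python
import math

def log2_first_term(n, eps, L, T, T0):
    """log2 of (1/sqrt(n)) * (T/T0) * 2^{(1/eps)^{2L}}, computed in log space."""
    return -0.5 * math.log2(n) + math.log2(T / T0) + (1.0 / eps) ** (2 * L)

def min_log2_n(eps, L, T, T0):
    """Smallest log2(n) at which the first term drops to 1 (hidden factors ignored)."""
    return 2.0 * (math.log2(T / T0) + (1.0 / eps) ** (2 * L))

T, T0 = 8.0, 0.5        # illustrative values, not taken from the text
for L in (2, 3, 4):
    print(L, min_log2_n(0.5, L, T, T0))
```

For $\epsilon=1/2$ and $L=2$, the exponent is $(1/\epsilon)^{2L}=16$, so with $T/T_{0}=16$ the first term reaches 1 only at $n=2^{40}$; increasing $L$ to 4 pushes this to $n=2^{520}$.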

Distribution Estimation.

In practice, DiTs generate data using the discretized version with step size $\mu$ , see Section D.1 for details. Let $\widehat{P}_{T_{0}}$ be the distribution generated by $s_{\widehat{W}}$ in Corollary 3.1.1. Let $P_{T_{0}}^{h}$ and $p_{T_{0}}^{h}$ be the distribution and density function of on-support latent variable $\bar{h}$ at $T_{0}$ . We have the following results for distribution estimation.

Corollary 3.1.2 (Distribution Estimation of DiT, Modified From Theorem 3 of (Chen et al., 2023a)).

Let $T=\mathcal{O}(\log n),T_{0}=\mathcal{O}(\min\{c_{0},1/L_{s_{+}}\})$ , where $c_{0}$ is the minimum eigenvalue of $\mathbb{E}_{P_{h}}[hh^{\top}]$ . With the estimated DiT score network $s_{\widehat{W}}$ in Corollary 3.1.1, we have the following with probability $1-1/\mathrm{poly}(n)$ .

(i)

The accuracy to recover the subspace $B$ is $\norm{W_{B}W_{B}^{\top}-BB^{\top}}_{F}^{2}=\widetilde{\mathcal{O}}\left(T_{0}\xi(n,\epsilon,L)/c_{0}\right)$.

(ii)

Let $(W_{B}U)^{\top}_{\sharp}\widehat{P}_{T_{0}}$ denote the pushforward distribution. Suppose ${\sf KL}(P_{h}||N(0,I_{d_{0}}))<\infty$ and the step size satisfies $\mu\leq\xi(n,\epsilon,L)\cdot T_{0}^{2}/(d_{0}\sqrt{\log d_{0}})$. Then there exists an orthogonal matrix $U\in\mathbb{R}^{d_{0}\times d_{0}}$ such that the total variation distance satisfies

\displaystyle{\sf TV}(P_{T_{0}}^{h},(W_{B}U)^{\top}_{\sharp}\widehat{P}_{T_{0}})=\widetilde{\mathcal{O}}(\sqrt{\xi(n,\epsilon,L)}),

(3.2)

where $\widetilde{\mathcal{O}}$ hides factors depending on $D,d_{0},d,L_{s_{+}},\log n$, and $T-T_{0}$.

(iii)

For the generated data distribution $\widehat{P}_{T_{0}}$ , the orthogonal pushforward $(I-W_{B}W_{B}^{\top})_{\sharp}\widehat{P}_{T_{0}}$ is ${N}(0,\Sigma)$ , where $\Sigma\preceq aT_{0}I$ for a constant $a>0$ .

Proof.

See Section F.3 for a detailed proof. ∎

Intuitively, Corollary 3.1.2 shows the estimation results including 3 parts: (i) The accuracy to recover the subspace $B$ . (ii) The estimation error between $\widehat{P}_{T_{0}}$ and $P_{T_{0}}^{h}$ . (iii) The vanishing behavior of $\widehat{P}_{T_{0}}$ in the orthogonal space. These three parts indicate that the learned score estimator is capable of recovering the initial data distribution. Notably, Corollary 3.1.2 is agnostic to details of $\xi(n,\epsilon,L)$ .

Remark 3.5 (Comparing with Existing Works).

Oko et al. (2023) analyze the distribution estimation under the assumption that the initial density is supported on $[-1,1]^{D}$ and smooth in the boundary. Our Assumption 2.2 demonstrates greater practical relevance. This suggests that our method of distribution estimation aligns more closely with empirical realities.

Remark 3.6 (Subspace Recovery Accuracy).

Part (i) of Corollary 3.1.2 confirms that the subspace is learned by DiTs. The error is proportional to the sample complexity for score estimation and depends on the minimum eigenvalue of the covariance of $P_{h}$.

4 Provably Efficient Criteria

Here, we analyze the computational limits of latent DiTs under low-dimensional linear subspace data assumption (i.e., Assumption 2.1). The hardness of DiT models ties to both forward and backward passes of the score network in Definition 3.3. We characterize them separately.

4.1 Computational Limits of Backward Computation

Following Section 2, suppose we have $n$ i.i.d. data samples $\{x_{0,i}\}_{i=1}^{n}\sim P_{0}$, and times $t_{i_{0}}$ $(1\leq i\leq n)$ uniformly sampled from $[T_{0},T]$. For each data point $x_{0,i}\in\mathbb{R}^{D}$, we sample $x_{t_{i_{0}}}\in\mathbb{R}^{D}$ from $N(\beta(t_{i_{0}})x_{0,i},\sigma(t_{i_{0}})I_{D})$. Let $(W_{A}R^{-1}(\cdot))^{\dagger}$ be the inverse transformation of $W_{A}R^{-1}(\cdot)$, and denote $Y_{0,i}\coloneqq(W_{A}R^{-1})^{\dagger}(x_{0,i})\in\mathbb{R}^{d\times L}$. We rewrite the empirical denoising score-matching loss (2.2) as

\displaystyle\frac{1}{n}\sum_{i=1}^{n}\Big\|W_{A}R^{-1}(f_{\mathcal{T}}(R(\underbrace{W_{A}^{\top}x_{t_{i_{0}}}}_{d_{0}\times 1})))-x_{0,i}\Big\|_{F}^{2}=\frac{1}{n}\sum_{i=1}^{n}\Big\|\underbrace{W_{A}}_{D\times d_{0}}R^{-1}\big(\overbrace{f_{\mathcal{T}}(R(\underbrace{W_{A}^{\top}x_{t_{i_{0}}}}_{d_{0}\times 1}))}^{d\times L}-\underbrace{Y_{0,i}}_{d\times L}\big)\Big\|_{F}^{2}.

(4.1)

For efficiency, it suffices to focus on just transformer attention heads of the DiT score network due to their dominating quadratic time complexity in both passes. Thus, we consider only a single layer attention for $f_{\mathcal{T}}$ , to simplify our analysis. Further, we consider the following simplifications:

(S0)

To prove the hardness of (4.1) for both full gradient descent and stochastic mini-batch gradient descent methods, it suffices to consider training on a single data point.

(S1)

For the convenience of our analysis, we consider the following expression for the attention mechanism. Let $X,Y\in\mathbb{R}^{d\times L}$. Let $W_{K},W_{Q},W_{V}\in\mathbb{R}^{s\times d}$ be attention weights such that $Q=W_{Q}X\in\mathbb{R}^{s\times L}$, $K=W_{K}X\in\mathbb{R}^{s\times L}$ and $V=W_{V}X\in\mathbb{R}^{s\times L}$. We write the attention mechanism of hidden size $s$ and sequence length $L$ as

\displaystyle{\rm Att}(X)=\underbrace{(W_{O}W_{V}X)}_{V\text{ multiplication}}\underbrace{D^{-1}\exp(X^{\mathsf{T}}W_{K}^{\mathsf{T}}W_{Q}X)}_{K\text{-}Q\text{ multiplication}}\in\mathbb{R}^{d\times L},

(4.2)

with $D\coloneqq\mathop{\rm{diag}}\left(\exp(X^{\mathsf{T}}W_{K}^{\mathsf{T}}W_{Q}X)\mathds{1}_{L}\right)\in\mathbb{R}^{L\times L}$. Here, $\exp(\cdot)$ is the entry-wise exponential function, i.e., $\exp(A)_{i,j}=\exp(A_{i,j})$ for any matrix $A$, $\mathop{\rm{diag}}\left(\cdot\right)$ converts a vector into a diagonal matrix with the vector's entries on the diagonal, and $\mathds{1}_{L}$ is the length-$L$ all-ones vector.

(S2)

Since the $V$ multiplication is linear in the weights while the $K$-$Q$ multiplication is exponential in the weights, we only need to focus on the gradient update of the $K$-$Q$ multiplication. Therefore, for the efficiency analysis of the gradient, it is equivalent to analyze a reduced problem with fixed $W_{O}W_{V}X=\text{const.}$

(S3)

To focus on the DiT, we consider the low-dimensional linear encoder $W_{A}$ to be pretrained and to not participate in gradient computation. This aligns with common practice (Rombach et al., 2022) and is justified by the trivial computation cost due to the linearity of $W_{A}$. (Footnote: The gradient computation is linear in $W_{A}$, and hence the computation w.r.t. $W_{A}$ is cheap and upper-bounded by $L\cdot\mathrm{poly}(d)$ time in a straightforward way.)

(S4)

To further simplify, we introduce $A_{1},A_{2},A_{3}\in\mathbb{R}^{d\times L}$ and $W\in\mathbb{R}^{d\times d}$ via

		$\displaystyle\Big\|W_{A}R^{-1}\big(f_{\mathcal{T}}(\underbrace{R(W_{A}^{\top}x_{t_{i_{0}}})}_{\coloneqq X\in\mathbb{R}^{d\times L}})-\underbrace{Y_{0,i}}_{\coloneqq Y\in\mathbb{R}^{d\times L}}\big)\Big\|_{F}^{2}$		(By (S0), (S1) and (S2))
	$\displaystyle=$	$\displaystyle\Big\|W_{A}R^{-1}\big(\underbrace{W_{O}W_{V}}_{\coloneqq W_{OV}\in\mathbb{R}^{d\times d}}\underbrace{X}_{\coloneqq A_{3}\in\mathbb{R}^{d\times L}}D^{-1}\exp\big(\underbrace{X^{\mathsf{T}}}_{\coloneqq A_{1}^{\top}\in\mathbb{R}^{L\times d}}\underbrace{W_{K}^{\mathsf{T}}W_{Q}}_{\coloneqq W\in\mathbb{R}^{d\times d}}\underbrace{X}_{\coloneqq A_{2}\in\mathbb{R}^{d\times L}}\big)-Y\big)\Big\|_{F}^{2}.$		(4.3)

Notably, $A_{1},A_{2},A_{3},X,Y$ are constants w.r.t. training the above loss with gradient updates.

Therefore, we simplify the objective of training DiT into

Definition 4.1 (Training Generic DiT Loss).

Given $A_{1},A_{2},A_{3},Y\in\mathbb{R}^{d\times L}$ and $W_{OV},W\in\mathbb{R}^{d\times d}$ following (S4), training a DiT with $\ell_{2}$ loss on a single data point $X,Y\in\mathbb{R}^{d\times L}$ is formulated as

\displaystyle\min_{W}\ \mathcal{L}_{0}(W)=\min_{W}\ \frac{1}{2}\Big\|W_{A}R^{-1}\big(W_{OV}A_{3}D^{-1}\exp(A_{1}^{\top}WA_{2})-Y\big)\Big\|_{F}^{2}.

(4.4)

Here $D\coloneqq\mathop{\rm{diag}}(\exp(A_{1}^{\top}WA_{2}){\mathds{1}}_{L})\in\mathbb{R}^{L\times L}$.
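As a concrete reading of Definition 4.1, the following NumPy sketch evaluates $\mathcal{L}_{0}(W)$ directly in quadratic time. It is only an illustration under stated assumptions: the function name `dit_loss` is hypothetical, $R^{-1}$ is taken to be a plain row-major flatten of a $d\times L$ matrix into $\mathbb{R}^{dL}$ (the paper does not fix this convention here), and `W_A` is any $D\times d_{0}$ matrix.

```python
import numpy as np

def dit_loss(W, W_OV, A1, A2, A3, Y, W_A):
    """Naive evaluation of the loss in (4.4); forms the full L x L matrix."""
    S = A1.T @ W @ A2                      # L x L pre-softmax scores A_1^T W A_2
    E = np.exp(S)                          # entry-wise exponential
    D_inv = 1.0 / E.sum(axis=1)            # diagonal of D^{-1}, D = diag(exp(S) 1_L)
    out = (W_OV @ A3) @ (D_inv[:, None] * E)   # d x L: W_OV A_3 D^{-1} exp(S)
    # Assumption: R^{-1} is a row-major flatten into R^{dL}; W_A then lifts to R^D.
    resid = W_A @ (out - Y).reshape(-1)
    return 0.5 * float(np.sum(resid ** 2))
```

With $W=0$ the normalized matrix $D^{-1}\exp(\cdot)$ is uniform, so every output column equals the column average of $W_{OV}A_{3}$; this gives a quick sanity check of the shapes and the normalization.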

Remark 4.1 (Conditional and Unconditional Generation).

$\mathcal{L}_{0}$ is generic. If $A_{1}\neq A_{2}\in\mathbb{R}^{d\times L}$ , Definition 4.1 reduces to cross-attention in DiT score net (for conditional generation). If $A_{1}=A_{2}\in\mathbb{R}^{d\times L}$ , Definition 4.1 reduces to self-attention in DiT score net (for unconditional vanilla generation).

We introduce the next problem to characterize all possible gradient computations of optimizing (4.4).

Problem 1 (Approximate DiT Gradient Computation ( $\textsc{ADiTGC}(L,d,\Gamma,\epsilon)$ )).

Let $\epsilon>0$. Given $A_{1},A_{2},A_{3},Y\in\mathbb{R}^{d\times L}$, assume all numerical values are in $\mathcal{O}(\log(L))$-bits encoding, and let the loss function $\mathcal{L}_{0}$ follow Definition 4.1. The problem of approximating the gradient computation of optimizing the empirical DiT loss (4.4) is to find an approximate gradient matrix $\tilde{G}^{(W)}\in\mathbb{R}^{d\times d}$ such that $\big\|\underline{\tilde{G}}^{(W)}-\partialderivative{\mathcal{L}_{0}}{\underline{W}}\big\|_{\max}\leq\epsilon$. Here, $\norm{A}_{\max}\coloneqq\max_{i,j}\absolutevalue{A_{ij}}$ for any matrix $A$.

In this work, we aim to investigate the computational limits of all possible efficient algorithms of ADiTGC with $\epsilon=1/\mathrm{poly}(L)$ . Yet, the explicit gradient of DiT denoising score matching loss (4.4) is too complicated to characterize ADiTGC. To combat this, we make the following observations.

(O1)

Let $g_{1}(\cdot)\coloneqq W_{A}R^{-1}(\cdot):\mathbb{R}^{d\times L}\to\mathbb{R}^{d_{0}}$, $g_{2}(\cdot)\coloneqq{\rm Att}(\cdot):\mathbb{R}^{d\times L}\to\mathbb{R}^{d\times L}$, and $g_{3}(\cdot)\coloneqq R(W_{A}^{\top}\cdot):\mathbb{R}^{D}\to\mathbb{R}^{d\times L}$ such that $g_{3}(x)=X$ for $x\in\mathbb{R}^{D}$ (with $D>d_{0}=dL$).
(O2)

Vectorization of $f_{\mathcal{T}}$. For ease of presentation, we use $f_{\mathcal{T}}$ flexibly to denote both a matrix in $\mathbb{R}^{d\times L}$ and a vector in $\mathbb{R}^{dL}$ in the following analysis. This practice does not affect correctness, and the context in which $f_{\mathcal{T}}$ is used clarifies whether it refers to a matrix or a vector. Explicit vectorization follows Definition D.1.
(O3)

Linearity of $g_{1}$ . By linearity of $W_{A}R^{-1}(\cdot)$ , we treat $g_{1}$ as a matrix in $\mathbb{R}^{d_{0}\times dL}$ acting on vector $f_{\mathcal{T}}(\cdot)\in\mathbb{R}^{dL}$ .

Therefore, we have $\mathcal{L}_{0}=\norm{g_{1}\cdot\left[g_{2}(g_{3})-Y\right]}_{2}^{2}$, such that its gradient involves $\derivative{\mathcal{L}_{0}}{W}=g_{1}\derivative{g_{2}}{W}$. From the above, we only need to focus on the computation time and error control of the term $\derivative{g_{2}}{W}$ for the gradient w.r.t. $W$. Fortunately, with tools from fine-grained complexity theory (Alman and Song, 2023) and the tensor trick (see Section D.3), we prove the existence of almost-linear time algorithms for Problem 1 in the next theorem. Let $\operatorname{vec}(W)\coloneqq\underline{W}$ for any matrix $W$ following Definition D.1.
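To make the object approximated in Problem 1 concrete, the sketch below computes the exact gradient of a simplified loss by the chain rule, in quadratic time. It is a naive baseline, not the almost-linear time algorithm of the next theorem. The assumptions are ours: $W_{A}$ has orthonormal columns and $R^{-1}$ is a reshape, so the outer map $g_{1}$ preserves the Frobenius norm and can be dropped; `C` stands for the fixed product $W_{OV}A_{3}$; the names `attention_loss` and `attention_grad` are hypothetical.

```python
import numpy as np

def _row_softmax(S):
    """D^{-1} exp(S) with D = diag(exp(S) 1_L), i.e. row-wise normalization."""
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def attention_loss(W, A1, A2, C, Y):
    """0.5 * || C D^{-1} exp(A1^T W A2) - Y ||_F^2 (outer isometry dropped)."""
    P = _row_softmax(A1.T @ W @ A2)
    return 0.5 * float(np.sum((C @ P - Y) ** 2))

def attention_grad(W, A1, A2, C, Y):
    """Exact gradient w.r.t. W by the chain rule; costs O(L^2 d) time."""
    P = _row_softmax(A1.T @ W @ A2)          # L x L
    G = C.T @ (C @ P - Y)                    # dLoss/dP, L x L
    # Jacobian of a row-wise softmax applied to G, one row at a time.
    dS = P * (G - np.sum(G * P, axis=1, keepdims=True))
    return A1 @ dS @ A2.T                    # d x d, since S = A1^T W A2
```

A finite-difference comparison on small random inputs is an easy way to check such a baseline before comparing it against any approximate gradient in the $\norm{\cdot}_{\max}$ sense of Problem 1.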

Theorem 4.1 (Existence of Almost-Linear Time Algorithms for ADiTGC).

Suppose all numerical values are in $\mathcal{O}(\log L)$-bits encoding. Let $\max(\|W_{OV}A_{3}\|_{\max},\norm{W_{K}A_{1}}_{\max},\norm{W_{Q}A_{2}}_{\max})\leq\Gamma$. There exists an $L^{1+o(1)}$ time algorithm to solve $\textsc{ADiTGC}(L,d=\mathcal{O}(\log L),\Gamma=o(\sqrt{\log L}))$ (i.e., Problem 1) with loss $\mathcal{L}_{0}$ from Definition 4.1 up to $1/\mathrm{poly}(L)$ accuracy. In particular, this algorithm outputs a gradient matrix $\tilde{G}^{(W)}\in\mathbb{R}^{d\times d}$ such that $\big\|\underline{\tilde{G}}^{(W)}-\partialderivative{\mathcal{L}_{0}}{\underline{W}}\big\|_{\max}\leq 1/\mathrm{poly}(L)$.

Proof Sketch.

Our proof is built on the key observation that there exist low-rank structures within the DiT training gradients. Using the tensor trick (Diao et al., 2019, 2018) and computational hardness results of attention (Hu et al., 2024c; Alman and Song, 2023), we approximate DiT training gradients with a series of low-rank approximations and carefully match the multiplication dimensions so that the computation of $\derivative{g_{2}}{\underline{W}}$ forms a chained low-rank approximation. We complete the proof by demonstrating that this approximation is bounded by a $1/\mathrm{poly}(L)$ error and requires only almost-linear time. See Section G.2 for a detailed proof. ∎

Remark 4.2.

We remark that Theorem 4.1 is dominated by the relation between $L$ and $d$, hence by the subspace dimension $d_{0}=dL$ (see Assumption 2.1). A smaller $d_{0}$ makes Theorem 4.1 more likely to hold.

4.2 Computational Limits of Forward Inference

Since the inference of score-matching diffusion models is a forward pass of the trained score estimator $s_{W}$ , the computational hardness of DiT ties to the transformer-based score network,

\displaystyle s_{W}(A_{1},A_{2},A_{3})=W_{A}R^{-1}\big(\underbrace{W_{OV}A_{3}}_{d\times L}\underbrace{D^{-1}}_{L\times L}\exp\big(\underbrace{A_{1}^{\top}W_{K}^{\top}}_{L\times s}\underbrace{W_{Q}A_{2}}_{s\times L}\big)\big),

(4.5)

following notation in Definition 4.1. For inference, we study the following approximation problem. Notably, by Remark 4.1, (4.5) subsumes both conditional and unconditional DiT inferences.

Problem 2 (Approximate DiT Inference $\textsc{ADiTI}(L,d,B,\delta_{F})$ ).

Let $\delta_{F}>0$ and $B>0$. Given $A_{1},A_{2},A_{3}\in\mathbb{R}^{d\times L}$ and $W_{OV},W_{K},W_{Q}\in\mathbb{R}^{d\times d}$ with guarantees that $\norm{W_{OV}A_{3}}_{\infty}\leq B$, $\norm{W_{K}A_{1}}_{\infty}\leq B$ and $\norm{W_{Q}A_{2}}_{\infty}\leq B$, we study the approximation problem $\textsc{ADiTI}(L,d,B,\delta_{F})$, which approximates $s_{W}(A_{1},A_{2},A_{3})$ with a vector $\tilde{z}\in\mathbb{R}^{d_{0}}$ (with $d_{0}=d\cdot L$) such that $\norm{\tilde{z}-W_{A}R^{-1}\left(W_{OV}A_{3}D^{-1}\exp(A_{1}^{\top}W_{K}^{\top}W_{Q}A_{2})\right)}_{\max}\leq\delta_{F}$. Here, $\norm{A}_{\max}\coloneqq\max_{i,j}\absolutevalue{A_{ij}}$ for any matrix $A$.

By (O2) and (O3), we observe that Problem 2 is a special case of the problem studied by Alman and Song (2023). Hence, we characterize all possible efficient algorithms for ADiTI with the next proposition.

Proposition 4.1 (Norm-Based Efficiency Phase Transition).

Let $\norm{W_{Q}A_{2}}_{\infty}\leq B$, $\norm{W_{K}A_{1}}_{\infty}\leq B$ and $\norm{W_{OV}A_{3}}_{\infty}\leq B$ with $B=\mathcal{O}(\sqrt{\log L})$. Assuming SETH (Hypothesis 1), for every $q>0$, there are constants $C,C_{a},C_{b}>0$ such that there is no $O(L^{2-q})$-time (sub-quadratic) algorithm for the problem $\textsc{ADiTI}(L,d=C\log L,B=C_{b}\sqrt{\log L},\delta_{F}=L^{-C_{a}})$.

Remark 4.3.

Proposition 4.1 suggests an efficiency threshold for the upper bound of $\norm{W_{K}A_{1}}_{\infty}$ , $\norm{W_{Q}A_{2}}_{\infty}$ , $\norm{W_{OV}A_{3}}_{\infty}$ . Only below this threshold are efficient algorithms for Problem 2 possible.

Moreover, there exist almost-linear time DiT inference algorithms following (Alman and Song, 2023).

Proposition 4.2 (Almost-Linear Time DiT Inference).

Assuming SETH, the DiT inference problem $\textsc{ADiTI}(L,d=\mathcal{O}(\log L),B=o(\sqrt{\log L}),\delta_{F}=1/\mathrm{poly}(L))$ can be solved in $L^{1+o(1)}$ time.

Remark 4.4.

Proposition 4.2 is a special case of Proposition 4.1 under the efficiency threshold.

Remark 4.5.

Propositions 4.2 and 4.1 are dominated by the relation between $L$ and $d$ , hence by the subspace dimension $d_{0}=dL$ . A smaller $d_{0}$ makes Propositions 4.2 and 4.1 more likely to hold.
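The mechanism behind almost-linear time inference in the bounded-norm regime can be illustrated with a small NumPy sketch: when the entries of $A_{1}^{\top}W_{K}^{\top}W_{Q}A_{2}$ are small, $\exp(\cdot)$ is well approximated by a low-degree polynomial, which yields a factorization $\exp(K^{\top}Q)\approx U_{1}U_{2}^{\top}$ and avoids forming any $L\times L$ matrix. This is a simplified illustration under our own assumptions, not the construction of Alman and Song (2023): it uses plain truncated Taylor features built from tensor powers, drops the outer map $W_{A}R^{-1}$, and all function names are hypothetical. Here `C`, `K`, `Q` stand for $W_{OV}A_{3}$, $W_{K}A_{1}$, $W_{Q}A_{2}$.

```python
import numpy as np
from math import factorial

def taylor_features(M, degree):
    """Map columns of M (d x L) to rows phi (L x r) such that
    <phi(u), phi(v)> = sum_{k<=degree} <u, v>^k / k!  (truncated exp series)."""
    d, L = M.shape
    feats = [np.ones((L, 1))]
    power = np.ones((L, 1))
    for k in range(1, degree + 1):
        # Row-wise Kronecker product: k-th tensor power of each column of M.
        power = np.einsum("ip,iq->ipq", power, M.T).reshape(L, -1)
        feats.append(power / np.sqrt(factorial(k)))
    return np.concatenate(feats, axis=1)

def exact_attention(C, K, Q):
    """C D^{-1} exp(K^T Q), forming the full L x L matrix (quadratic in L)."""
    E = np.exp(K.T @ Q)
    return (C / E.sum(axis=1)[None, :]) @ E

def lowrank_attention(C, K, Q, degree=6):
    """Same quantity via exp(K^T Q) ~ U1 U2^T; cost is linear in L for fixed r."""
    U1 = taylor_features(K, degree)          # L x r
    U2 = taylor_features(Q, degree)          # L x r
    row_sums = U1 @ U2.sum(axis=0)           # approximates exp(K^T Q) 1_L
    return ((C / row_sums[None, :]) @ U1) @ U2.T
```

The feature dimension is $r=\sum_{k\leq g}d^{k}$ for degree $g$, so the sketch is only economical for small $d$ and small entries; this mirrors, informally, why the efficiency criteria are stated in terms of $d$ and the norm bound $B$.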

5 Discussion and Conclusion

We explore the fundamental limits of latent DiTs with 3 key contributions. First, we prove that transformers are universal approximators for the score functions in DiTs (Theorem 3.1), with approximation capacity and model size dependent only on the latent dimension, suggesting DiTs can handle high-dimensional data challenges. Second, we show that Transformer-based score estimators converge to the true score function (Corollary 3.1.1), ensuring the generated data distribution closely approximates the original (Corollary 3.1.2). Third, we provide provably efficient criteria (Proposition 4.1) and prove the existence of almost-linear time algorithms for forward inference (Proposition 4.2) and backward computation (Theorem 4.1). These results highlight the potential of latent DiTs to achieve both computational efficiency and robust performance in practical scenarios.

Limitations and Future Direction. As discussed in Remark 3.4, the double exponential factor in our explicit sample complexity bound (Corollary 3.1.1) suggests a possible gap in our understanding of transformer universality and its interplay with the DiT architecture. This motivates us to rethink transformer universality and explore new proof techniques for DiTs, which we leave for future work. Besides, due to its formal nature, this work does not provide immediate practical implementations. However, we expect that our findings provide valuable insights for future diffusion generative models.

Broader Impact

This theoretical work aims to shed light on the foundations of diffusion generative models and is not anticipated to have negative social impacts.

Acknowledgments

JH would like to thank Minshuo Chen, Sophia Pi, Yibo Wen, Tim Tsz-Kit Lau, Chenwei Xu, Dino Feng and Andrew Chen for enlightening discussions on related topics, and the Red Maple Family for support.

JH is partially supported by the Walter P. Murphy Fellowship. HL is partially supported by NIH R01LM1372201. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.

Appendix


Appendix A More Discussion on Low-Dimensional Linear Latent Space

Our analysis is based on the low-dimensional linear latent space assumption. Here we discuss it further in light of our theoretical results.

The low-dimensional data structure in Assumption 2.1 indicates a robust and informative latent feature space. Besides, it improves computational efficiency by reducing data complexity without sacrificing essential information. This is consistent with the analysis in our work. Similar to the results under Assumption 2.1 ($d_{0}<D$), it is easy to see that our theoretical results also hold in the two other settings: $d_{0}=D$ and $d_{0}>D$.

•

Statistically, for score approximation, score estimation, and distribution estimation, the upper bounds depend on the latent dimension $d_{0}$ rather than the ambient dimension $D$. A smaller $d_{0}$ allows a smaller model to achieve a specified approximation error than a larger one does (Theorem 3.1). Additionally, with a smaller $d_{0}$, both score and distribution estimation errors are reduced relative to scenarios with a larger one (Corollary 3.1.1 and Corollary 3.1.2).
•

Computationally, a smaller $d_{0}$ benefits the provably efficient criteria (Proposition 4.1) and the almost-linear time algorithms for forward inference (Proposition 4.2) and backward computation (Theorem 4.1).

Appendix B Nomenclature Table

We summarize our notations in the following table for easy reference.

Table 1: Mathematical Notations and Symbols

Symbol	Description
$\norm{z}_{2}$	Euclidean norm, where $z$ is a vector
$\norm{z}_{\infty}$	Infinity norm, where $z$ is a vector
$\norm{Z}_{2}$	2-norm, where $Z$ is a matrix
$\norm{Z}_{\rm op}$	Operator norm, where $Z$ is a matrix
$\norm{Z}_{F}$	Frobenius norm, where $Z$ is a matrix
$\norm{Z}_{p,q}$	$p,q$ -norm, where $Z$ is a matrix
$\norm{f(x)}_{L^{2}}$	$L^{2}$ -norm, where $f$ is a function
$\norm{f(x)}_{L^{2}(P)}$	$L^{2}(P)$ -norm, where $f$ is a function and $P$ is a distribution
$\norm{f(\cdot)}_{Lip}$	Lipschitz-norm, where $f$ is a function
$f_{\sharp}P$	Pushforward measure, where $f$ is a function and $P$ is a distribution
$n$	Sample size
$x$	Data point in original data space, $x\in\mathbb{R}^{D}$
$h$	Latent variable in low-dimensional subspace, $h\in\mathbb{R}^{d_{0}}$
$p_{h}$	The density function of $h$
$B$	The matrix with orthonormal columns to transform $h$ to $x$ , where $B\in\mathbb{R}^{D\times d_{0}}$
$\bar{h}$	$\bar{h}=B^{\top}x$
$T$	Stopping time in forward process of Diffusion model
$T_{0}$	Stopping time in backward process of Diffusion model
$\mu$	Discretized step size in backward process
$p_{t}(\cdot)$	The density function of $x$ at time $t$
$p_{t}^{h}(\cdot)$	The density function of $\bar{h}$ at time $t$
$\psi$	(Conditional) Gaussian density function
$d$	Input dimension of each token in the Transformer network of DiT
$L$	Token length in the Transformer network of DiT
$X$	Sequence input of Transformer network in DiT, where $X\in\mathbb{R}^{d\times L}$
$E$	Position encoding, where $E\in\mathbb{R}^{d\times L}$
$R(\cdot)$	Reshape layer in DiT, $R(\cdot):\mathbb{R}^{d_{0}}\to\mathbb{R}^{d\times L}$
$W_{B}$	The orthonormal matrix to approximate $B$ , where $W_{B}\in\mathbb{R}^{D\times d_{0}}$

Appendix C Related Works

Diffusion (Ho et al., 2020) and score-based generative models (Song and Ermon, 2019) have been particularly successful as generative models of images, video and biomedical data (Nichol et al., 2021; Ramesh et al., 2022; Liu et al., 2024; Zhou et al., 2024a, b; Wang et al., 2024a, b). There are two popular lines of work in this area. Empirically, diffusion transformers (DiTs) (Peebles and Xie, 2023) have emerged as a significant advancement, effectively combining the strengths of transformer architectures and diffusion-based approaches. Theoretically, the development of the approximation theory for diffusion models supports their practical success, providing a theoretical framework for understanding and enhancing their effectiveness in various applications (Chen et al., 2023a).

Organization.

In the following, we first discuss recent developments in DiTs. Then, we discuss the main technique of our statistical results: the universality (universal approximation) of transformer. Next, we discuss recent theoretical developments in diffusion generative models. Lastly, we discuss other aspects of transformer in foundation models beyond diffusion models.

Diffusion Transformers.

Recently, transformer-based diffusion models have garnered significant attention in research. The U-ViT model (Bao et al., 2022) incorporates transformer blocks into a U-net architecture, treating all inputs as tokens. In contrast, DiT (Peebles and Xie, 2023) utilizes a straightforward, non-hierarchical transformer structure. Models like MDT (Gao et al., 2023a) and MaskDiT (Zheng et al., 2023) improve the training efficiency of DiT by applying a masking strategy.

Universality and Memory Capacity of Transformers.

The universality of transformers refers to their ability to serve as universal approximators. This means that transformers can, in theory, model any sequence-to-sequence function to a desired degree of accuracy. Yun et al. (2020) establish that transformers can universally approximate sequence-to-sequence functions by stacking numerous layers of feed-forward functions and self-attention functions. In a different approach, Jiang and Li (2023) affirm the universality of transformers by utilizing the Kolmogorov-Arnold representation theorem. Most recently, Kajitsuka and Sato (2023) show that transformers with one self-attention layer are universal approximators.

The memory capacity of a transformer is a practical measure to test the theoretical results of the transformer’s universality, by ensuring the model can handle necessary context and dependencies. By memory capacity, we refer to the minimal set of parameters such that the model (i.e., transformer) approximates all input-output pairs in the training dataset with a bounded error. Several works address the memory capacity of transformers. Kim et al. (2022) show that transformers with $\tilde{O}(d+L+\sqrt{NL})$ parameters are sufficient to memorize $N$ length-$L$ and dimension-$d$ sequence-to-sequence data points by constructing a contextual mapping with $\mathcal{O}(L)$ attention layers. Mahdavi et al. (2023) show that a multi-head-attention with $h$ heads is able to memorize $\mathcal{O}(hL)$ examples under a linear independence data assumption. Kajitsuka and Sato (2023) show that a single layer transformer with $\mathcal{O}(NLd+d^{2})$ parameters is able to memorize $N$ length-$L$ and dimension-$d$ sequence-to-sequence data points by utilizing the connection between the softmax function and Boltzmann operator. Wang et al. (2023) extend the results of (Yun et al., 2020) to prompt tuning and discuss the memorization of only the last token of each data sequence. Another line of research establishes a different kind of memory capacity for transformers by connecting transformer attention with dense associative memory models (modern Hopfield models) (Hu et al., 2024a, b, c, 2023; Wu et al., 2024a, b; Ramsauer et al., 2020). Notably, they define memory capacity as the smallest number of (length-$L$ and dimension-$d$) data points the model (transformer attention) is able to store and derive exponential-in-$d$ high-probability capacity lower bounds.

Our work is motivated by and builds on (Yun et al., 2020) to bridge the transformer’s function approximation ability with data distribution estimation. While we do not address the memorization of DiTs (or diffusion models in general), recent studies on dense associative models suggest viewing pretrained diffusion generative models as associative memory models (Hoover et al., 2023; Ambrogioni, 2023). We plan to explore this aspect in future work.

Theories of Diffusion Models.

In addition to empirical success, there have been several theoretical analyses of diffusion models. Chen et al. (2023a) study score approximation, estimation, and distribution recovery of U-Net based diffusion models. Benton et al. (2024) provide convergence bounds linear in data dimensions, assuming accurate score function approximation. Zhu et al. (2023); Wibisono et al. (2024) provide statistical sample complexity bounds for score-matching under similar assumptions. Oko et al. (2023) analyze the distribution estimation under the assumption that the initial density is supported on $[-1,1]^{D}$ and smooth in the boundary.

Among these works, our work is built on and closest to (Chen et al., 2023a), as both assume the data has a low-dimensional structure. However, our work differs in three key aspects. First, beyond the simple ReLU networks considered in (Chen et al., 2023a), we provide the first score approximation analysis for DiTs with a transformer-based score estimator. Second, our work is the first to provide the statistical rates of DiTs (score and distribution estimation) based on transformer universality (Yun et al., 2020) and a norm-based covering number bound (Edelman et al., 2022), supporting the practical success of DiTs (Esser et al., 2024; Ma et al., 2024). Lastly, our work provides the first comprehensive analysis of the computational limits and all possible efficient DiT algorithms/methods for both forward inference and backward training. This offers timely insights into the empirical computational inefficiency of DiTs (Liu et al., 2024) and guidance for future DiT architectures.

Transformers in Foundation Models: Transformer-Based Pretrained Models.

Transformer-based pretrained models utilize attention mechanisms to process sequential data, enabling the learning of contextual relationships for tasks like natural language understanding and generation. These models encompass three types: encoder-based, decoder-based, and diffusion transformers. Encoder-based transformers, such as DNABERT (Zhou et al., 2024c, 2023; Ji et al., 2021), employ bidirectional attention to extract feature representations. DNABERT shows great potential to capture complex patterns of genome sequences and improve tasks such as gene prediction. Decoder-based transformers generate output sequences from encoded information using unidirectional attention, such as ChatGPT (Lagler et al., 2013; Floridi and Chiriatti, 2020; Brown et al., 2020) for natural language. The diffusion transformers generate a sequence toward a target distribution, such as Sora (Liu et al., 2024) and Videofusion (Luo et al., 2023) for video generation and DecompDiff (Guan et al., 2024) for drug design. In our paper, we present an early exploration of the statistical and computational limits of diffusion transformer models.

Appendix D Supplementary Theoretical Background

In this section, we provide some further background. We show the details about the forward and backward process in Diffusion Models in Section D.1. Besides, we give the details of the proof about the score decomposition in Section D.2.

D.1 Diffusion Models

Forward Process.

Diffusion models gradually add noise to the original data in the forward process. We describe the forward process as the following SDE

\displaystyle\differential x_{t}=-\frac{1}{2}w(t)x_{t}\differential t+\sqrt{w(t)}\differential B_{t},\quad x_{t}\in\mathbb{R}^{D},

(D.1)

where $x_{0}\sim P_{0}$, $(B_{t})_{t\geq 0}$ is a standard Brownian motion, and $w(t)>0$ is a nondecreasing weighting function. Let $P_{t}$ and $p_{t}$ denote the marginal distribution and density of $x_{t}$. The conditional distribution $P(x_{t}|x_{0})$ follows $N(\beta(t)x_{0},\sigma(t)I_{D})$, where $\beta(t)=\exp(-\int_{0}^{t}w(s)\differential s/2)$ and $\sigma(t)=1-\beta^{2}(t)$. In practice, (D.1) terminates at a large enough $T$ such that $P_{T}$ is close to $N(0,I_{D})$.
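Because the transition kernel is Gaussian in closed form, $x_{t}$ can be sampled directly from $x_{0}$ without simulating the SDE. The sketch below does this under the simplifying assumption of a constant weighting function $w(t)\equiv w$ (our choice for illustration; the paper allows any nondecreasing $w$), in which case $\beta(t)=e^{-wt/2}$. The names `beta_sigma` and `sample_xt` are hypothetical.

```python
import numpy as np

def beta_sigma(t, w=1.0):
    """beta(t) and sigma(t) for a constant weighting function w(t) = w (assumption)."""
    beta = np.exp(-0.5 * w * t)        # beta(t) = exp(-(1/2) * int_0^t w ds)
    return beta, 1.0 - beta ** 2       # sigma(t) = 1 - beta(t)^2 is a variance

def sample_xt(x0, t, rng, w=1.0):
    """Draw x_t ~ N(beta(t) x0, sigma(t) I_D) from the forward transition kernel."""
    beta, sigma = beta_sigma(t, w)
    return beta * x0 + np.sqrt(sigma) * rng.standard_normal(x0.shape)
```

For large $t$ the pair $(\beta(t),\sigma(t))$ approaches $(0,1)$, which is the sense in which $P_{T}$ is close to $N(0,I_{D})$.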

Backward Process.

We obtain the backward process $y_{t}:=x_{T-t}$ by reversing (D.1). The backward process satisfies

\displaystyle\differential y_{t}=\left[\frac{1}{2}w(T-t)y_{t}+w(T-t)\nabla\log p_{T-t}(y_{t})\right]\differential t+\sqrt{w(T-t)}\differential\bar{B}_{t},

where the score function $\nabla\log p_{t}(\cdot)$ is the gradient of log probability density function of $x_{t}$ , and $\bar{B}_{t}$ is a reversed Brownian motion. However, $\nabla\log p_{t}(\cdot)$ and $P_{T}$ are both unknown in (D.1). To resolve this, we use a score estimator $s_{W}(\cdot,t)$ to replace $\nabla\log p_{t}(\cdot)$ , where $s_{W}(\cdot,t)$ is usually a neural network with parameters $W$ . Secondly, we replace $P_{T}$ by the standard Gaussian distribution. Consequently, we obtain the following SDE

\displaystyle\differential y_{t}=\left[\frac{1}{2}w(T-t)y_{t}+w(T-t)s_{W}(y_{t},T-t)\right]\differential t+\sqrt{w(T-t)}\differential\bar{B}_{t},\quad y_{0}\sim N(0,I_{D}).

(D.2)

In practice, we use discrete schemes of (D.2) to generate data, following (Song and Ermon, 2019). We use $\mu>0$ to denote the discretization step size, and for $t\in[k\mu,(k+1)\mu]$, we have

\displaystyle\differential y_{t}^{\leftarrow}=\left[\frac{1}{2}w(T-t)y_{k\mu}^{\leftarrow}+w(T-t)s_{W}(y_{k\mu}^{\leftarrow},T-k\mu)\right]\differential t+\sqrt{w(T-t)}\differential\bar{B}_{t}.

(D.3)
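A minimal sketch of the discretized backward process (D.3) follows, assuming a constant weighting function $w(t)\equiv w$ so that integrating (D.3) over one step of length $\mu$ gives exactly an Euler-Maruyama update. To keep the example self-contained, the learned estimator $s_{W}$ is replaced by the exact score of Gaussian data $x_{0}\sim N(0,s^{2}I_{D})$, for which $p_{t}=N(0,(\beta^{2}(t)s^{2}+\sigma(t))I_{D})$; this stand-in and the names `reverse_sample` and `gaussian_score` are our assumptions.

```python
import numpy as np

def gaussian_score(s2, w=1.0):
    """Exact score of p_t when x_0 ~ N(0, s2 * I) and w(t) = w (illustrative stand-in for s_W)."""
    def score(x, t):
        beta2 = np.exp(-w * t)                    # beta(t)^2
        return -x / (beta2 * s2 + 1.0 - beta2)    # -x / Var(x_t)
    return score

def reverse_sample(score, n, D, T=5.0, T0=0.01, mu=0.01, w=1.0, seed=0):
    """Run (D.3) from y_0 ~ N(0, I_D) until forward time T0 (early stopping)."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal((n, D))
    n_steps = int(round((T - T0) / mu))
    for k in range(n_steps):
        t_fwd = T - k * mu                        # forward time at the start of step k
        drift = 0.5 * w * y + w * score(y, t_fwd) # drift frozen at y_{k mu}
        y = y + drift * mu + np.sqrt(w * mu) * rng.standard_normal(y.shape)
    return y
```

Calling `reverse_sample(gaussian_score(4.0), n=20000, D=2)` should return samples whose empirical variance is close to the variance of $x_{T_{0}}$, up to discretization and Monte Carlo error.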

D.2 Proof of Lemma 2.1

Here we restate the proof of (Chen et al., 2023a, Lemma 1) for completeness.

Proof.

Recall $x=Bh$ by Assumption 2.1 with $x\in\mathbb{R}^{D}$ , $B\in\mathbb{R}^{D\times d_{0}}$ and $h\in\mathbb{R}^{d_{0}}$ .

By the forward process (D.1), we have

\displaystyle p_{t}(x)=\int\psi_{t}(x\mid Bh)p_{h}(h)\differential h,

(D.4)

where

\displaystyle\psi_{t}(x\mid Bh)=[2\pi\sigma(t)]^{-D/2}\exp\left(-\frac{\norm{\beta(t)Bh-x}_{2}^{2}}{2\sigma(t)}\right),

(D.5)

is the Gaussian transition kernel.

Then we write the score function as

$\displaystyle\nabla\log p_{t}(x)$	$\displaystyle=\frac{\nabla p_{t}(x)}{p_{t}(x)}$	(By log-derivative)
	$\displaystyle=\frac{\nabla\int\psi_{t}(x\mid Bh)p_{h}(h)\differential h}{\int\psi_{t}(x\mid Bh)p_{h}(h)\differential h}$	(By plugging in $p_{t}(x)$ )
	$\displaystyle=\frac{\int\nabla\psi_{t}(x\mid Bh)p_{h}(h)\differential h}{\int\psi_{t}(x\mid Bh)p_{h}(h)\differential h},$	(By interchanging $\int$ with $\nabla$ )

where the last equality holds since $\psi_{t}(x\mid Bh)$ is continuously differentiable in $x$ .

Plugging (D.5) into the last expression above, we have

		$\displaystyle\leavevmode\nobreak\ \nabla\log p_{t}(x)$
	$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \frac{[2\pi\sigma(t)]^{-D/2}}{\int\psi_{t}(x\mid Bh)p_{h}(h)\differential h}\int\frac{1}{\sigma(t)}\left(\beta(t)Bh-x\right)\exp\left(-\frac{\norm{\beta(t)Bh-x}_{2}^{2}}{2\sigma(t)}\right)p_{h}(h)\differential h.$

We then decompose the above score function by projecting $x$ onto ${\rm Span}(B)$, i.e., replacing $-x$ with $-BB^{\top}x-(I_{D}-BB^{\top})x$:

		$\displaystyle\leavevmode\nobreak\ \nabla\log p_{t}(x)$
	$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \frac{[2\pi\sigma(t)]^{-D/2}}{\int\psi_{t}(x\mid Bh)p_{h}(h)\differential h}$
		$\displaystyle\leavevmode\nobreak\ \cdot\int\frac{1}{\sigma(t)}\Bigg[\left(\beta(t)Bh-BB^{\top}x\right)-\left(I_{D}-BB^{\top}\right)x\Bigg]\exp\left(-\frac{\norm{\beta(t)Bh-x}_{2}^{2}}{2\sigma(t)}\right)p_{h}(h)\differential h.$

Absorbing the factor $[2\pi\sigma(t)]^{-D/2}$ into the Gaussian kernel $\psi_{t}(x\mid Bh)$, we have

		$\displaystyle\leavevmode\nobreak\ \nabla\log p_{t}(x)$
	$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \frac{[2\pi\sigma(t)]^{-D/2}}{\int\psi_{t}(x\mid Bh)p_{h}(h)\differential h}\int\frac{1}{\sigma(t)}\left(\beta(t)Bh-BB^{\top}x\right)\exp\left(-\frac{\norm{\beta(t)Bh-x}_{2}^{2}}{2\sigma(t)}\right)p_{h}(h)\differential h$
		$\displaystyle\leavevmode\nobreak\ -\frac{1}{\int\psi_{t}(x\mid Bh)p_{h}(h)\differential h}\left(\frac{1}{\sigma(t)}\left(I_{D}-BB^{\top}\right)x\right)\int\psi_{t}(x\mid Bh)p_{h}(h)\differential h$
	$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \underbrace{\frac{1}{\int\psi_{t}(x\mid Bh)p_{h}(h)\differential h}\int\frac{1}{\sigma(t)}\left(\beta(t)Bh-BB^{\top}x\right)\psi_{t}(x\mid Bh)p_{h}(h)\differential h}_{\coloneqq s_{+}}\underbrace{-\frac{1}{\sigma(t)}\left(I_{D}-BB^{\top}\right)x}_{\coloneqq s_{-}}.$

To further simplify $s_{+}$ , we decompose $\psi_{t}(x\mid Bh)$ as

	$\displaystyle\leavevmode\nobreak\ \psi_{t}(x\mid Bh)$
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ [2\pi\sigma(t)]^{-D/2}\exp\left(-\frac{1}{2\sigma(t)}\norm{\beta(t)Bh-x}_{2}^{2}\right)$
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ [2\pi\sigma(t)]^{-D/2}\exp\left(-\frac{1}{2\sigma(t)}\norm{\beta(t)Bh-BB^{\top}x-\left(I_{D}-BB^{\top}\right)x}_{2}^{2}\right)$
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ [2\pi\sigma(t)]^{-D/2}\exp\left(-\frac{1}{2\sigma(t)}\left(\norm{\beta(t)Bh-BB^{\top}x}_{2}^{2}+\norm{\left(I_{D}-BB^{\top}\right)x}_{2}^{2}\right)\right)$	( $B(\beta(t)h-B^{\top}x)$ is in ${\rm Span}(B)$ while $(I_{D}-BB^{\top})x$ is orthogonal to ${\rm Span}(B)$ )
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \underbrace{[2\pi\sigma(t)]^{-d_{0}/2}\exp\left(-\frac{\norm{\beta(t)h-B^{\top}x}_{2}^{2}}{2\sigma(t)}\right)}_{\coloneqq\psi_{t}\left(B^{\top}x\mid h\right)}\cdot\underbrace{[2\pi\sigma(t)]^{-(D-d_{0})/2}\exp\left(-\frac{\norm{\left(I_{D}-BB^{\top}\right)x}_{2}^{2}}{2\sigma(t)}\right)}_{\coloneqq\psi_{t}\left((I_{D}-BB^{\top})x\right)},$	(since $B$ has orthonormal columns)

where both $\psi_{t}\left(B^{\top}x\mid h\right)$ and $\psi_{t}\left((I_{D}-BB^{\top})x\right)$ are Gaussian.

Plugging $\psi_{t}(x\mid Bh)=\psi_{t}\left(B^{\top}x\mid h\right)\psi_{t}\left((I_{D}-BB^{\top})x\right)$ into $s_{+}$, we obtain

	$\displaystyle s_{+}(x,t)$	$\displaystyle=C\int\frac{1}{\sigma(t)}\left(\beta(t)Bh-BB^{\top}x\right)\psi_{t}(B^{\top}x\mid h)\psi_{t}((I_{D}-BB^{\top})x)p_{h}(h)\differential h$
		$\displaystyle=C\psi_{t}((I_{D}-BB^{\top})x)\int\frac{1}{\sigma(t)}\left(\beta(t)Bh-BB^{\top}x\right)\psi_{t}(B^{\top}x\mid h)p_{h}(h)\differential h$
		$\displaystyle=\frac{1}{\int\psi_{t}(B^{\top}x\mid h^{\prime})p_{h^{\prime}}(h^{\prime})\differential h^{\prime}}\int\frac{1}{\sigma(t)}\left(\beta(t)Bh-BB^{\top}x\right)\psi_{t}(B^{\top}x\mid h)p_{h}(h)\differential h,$

where $C\coloneqq[\psi_{t}((I_{D}-BB^{\top})x)\int\psi_{t}(B^{\top}x\mid h^{\prime})p_{h^{\prime}}(h^{\prime})\differential h^{\prime}]^{-1}$.

Notably, $s_{+}$ depends only on the projected data $B^{\top}x$ . Therefore, we are able to replace $s_{+}(x,t)$ with $s_{+}(B^{\top}x,t)$ . The benefit is that the dimension $d_{0}$ of the first input in $s_{+}(B^{\top}x,t)$ is much smaller.

Lastly, by denoting $\bar{h}=B^{\top}x$ such that $\nabla_{\bar{h}}\psi_{t}(\bar{h}\mid h)=(\beta(t)h-\bar{h})\psi_{t}(B^{\top}x% \mid h)/\sigma(t)$ , we arrive at

	$\displaystyle s_{+}(B^{\top}x,t)$	$\displaystyle=B\int\frac{\nabla_{\bar{h}}\psi_{t}(\bar{h}\mid h)p_{h}(h)}{\int% \psi_{t}(\bar{h}\mid h^{\prime})p_{h^{\prime}}(h^{\prime})\differential h^{% \prime}}\differential h$
		$\displaystyle=B\nabla\log p_{t}^{h}(B^{\top}x).$		( $p_{t}^{h}(\bar{h})\coloneqq\int\psi_{t}(\bar{h}\mid h)p_{h}(h)\differential h$ )

This completes the proof. ∎
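The key step of the proof is the orthogonal split $\norm{\beta(t)Bh-x}_{2}^{2}=\norm{\beta(t)h-B^{\top}x}_{2}^{2}+\norm{(I_{D}-BB^{\top})x}_{2}^{2}$ for $B$ with orthonormal columns. The following minimal sketch checks this identity in exact rational arithmetic. The dimensions ($D=3$, $d_{0}=2$), the matrix `B`, and the values of `beta`, `h`, and `x` are illustrative assumptions, not quantities from the paper.

```python
from fractions import Fraction as F

# Hypothetical small instance: D = 3, d0 = 2; the columns of B are orthonormal.
B = [[F(3, 5), F(0)],
     [F(4, 5), F(0)],
     [F(0),    F(1)]]

def matvec(M, v):
    """Matrix-vector product for lists of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(x * x for x in v)

beta = F(7, 10)               # stands in for beta(t) at a fixed t
h = [F(2), F(-1)]             # latent variable in R^{d0}
x = [F(1), F(-3), F(5, 2)]    # ambient point in R^D

Bh = matvec(B, h)
Btx = matvec(transpose(B), x)                     # B^T x
proj = matvec(B, Btx)                             # B B^T x
resid = [xi - pi for xi, pi in zip(x, proj)]      # (I_D - B B^T) x

lhs = sq_norm([beta * a - b for a, b in zip(Bh, x)])
rhs = sq_norm([beta * a - b for a, b in zip(h, Btx)]) + sq_norm(resid)
print(lhs == rhs)
```

Because the two squared norms add, the Gaussian kernel factorizes into a term depending on $B^{\top}x$ and a term depending on $(I_{D}-BB^{\top})x$, which is what allows the latter to cancel in $s_{+}$.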

D.3 Preliminaries: Strong Exponential Time Hypothesis (SETH) and Tensor Trick

Here we present the ideas we build upon in Section 4.

Strong Exponential Time Hypothesis (SETH). Impagliazzo and Paturi (2001) introduce the Strong Exponential Time Hypothesis (SETH) as a stronger form of the $\mathtt{P}\neq\mathtt{NP}$ conjecture. It suggests that our current best $\mathtt{SAT}$ algorithms are optimal and is a popular conjecture for proving fine-grained lower bounds for a wide variety of algorithmic problems (Cygan et al., 2016; Williams, 2018).

Hypothesis 1 (SETH).

For every $\epsilon>0$ , there is a positive integer $k\geq 3$ such that $k$ - $\mathtt{SAT}$ on formulas with $n$ variables cannot be solved in $\mathcal{O}(2^{(1-\epsilon)n})$ time, even by a randomized algorithm.

Tensor Trick for Computing Gradients. The tensor trick (Diao et al., 2019, 2018) is an instrument to compute complicated gradients in a clean and tractable fashion. We start with some definitions.

Definition D.1 (Vectorization).

For any matrix $X\in\mathbb{R}^{L\times d}$ , we define $\underline{X}\coloneqq\operatorname{vec}{(X)}\in\mathbb{R}^{Ld}$ such that $X_{i,j}=\underline{X}_{(i-1)d+j}$ for all $i\in[L]$ and $j\in[d]$ .

Definition D.2 (Matrixization).

For any vector $\underline{X}\in\mathbb{R}^{Ld}$ , we define $\mathrm{mat}(\underline{X})\in\mathbb{R}^{L\times d}$ such that $\mathrm{mat}(\underline{X})_{i,j}\coloneqq\underline{X}_{(i-1)d+j}$ for all $i\in[L]$ and $j\in[d]$ , namely $\mathrm{mat}(\cdot)=\operatorname{vec}^{-1}(\cdot)$ .

Definition D.3 (Kronecker Product).

Let $A\in\mathbb{R}^{L_{a}\times d_{a}}$ and $B\in\mathbb{R}^{L_{b}\times d_{b}}$ . We define the Kronecker product of $A$ and $B$ as $A\otimes B\in\mathbb{R}^{L_{a}L_{b}\times d_{a}d_{b}}$ such that $(A\otimes B)_{(i_{a}-1)L_{b}+i_{b},(j_{a}-1)d_{b}+j_{b}}$ is equal to $A_{i_{a},j_{a}}B_{i_{b},j_{b}}$ for all $i_{a}\in[L_{a}],j_{a}\in[d_{a}],i_{b}\in[L_{b}],j_{b}\in[d_{b}]$ .

Definition D.4 (Sub-Block of a Tensor).

For any $A\in\mathbb{R}^{L_{a}\times d_{a}}$ and $B\in\mathbb{R}^{L_{b}\times d_{b}}$ , let $\operatorname{\mathsf{A}}\coloneqq A\otimes B\in\mathbb{R}^{L_{a}L_{b}\times d_{a}d_{b}}$ . For any $\underline{j}\in[L_{a}]$ , we define $\operatorname{\mathsf{A}}_{\underline{j}}\in\mathbb{R}^{L_{b}\times d_{a}d_{b}}$ to be the $\underline{j}$ -th $L_{b}\times d_{a}d_{b}$ sub-block of $\operatorname{\mathsf{A}}$ .

Lemma D.1 (Tensor Trick (Diao et al., 2019, 2018)).

For any $A\in\mathbb{R}^{L_{a}\times d_{a}}$ , $B\in\mathbb{R}^{L_{b}\times d_{b}}$ and $X\in\mathbb{R}^{d_{a}\times d_{b}}$ , it holds $\operatorname{vec}\left(AXB^{\top}\right)=(A\otimes B)\underline{X}\in\mathbb{R}^{L_{a}L_{b}}$ .
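As a sanity check, the tensor trick can be verified numerically on a small instance. The sketch below is only an illustration under the row-major vectorization of Definition D.1 and the index convention of Definition D.3. With the shapes stated in the lemma, the conformable product is $AXB^{\top}\in\mathbb{R}^{L_{a}\times L_{b}}$, so the check is written as $\operatorname{vec}(AXB^{\top})=(A\otimes B)\underline{X}$. The helper names and the integer matrices are hypothetical.

```python
def vec(M):
    """Row-major vectorization (Definition D.1)."""
    return [entry for row in M for entry in row]

def kron(A, B):
    """Kronecker product with the index convention of Definition D.3."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]

def transpose(M):
    return [list(col) for col in zip(*M)]

A = [[1, 2], [0, -1], [3, 1]]     # L_a = 3, d_a = 2
B = [[2, 0, 1], [1, -1, 4]]       # L_b = 2, d_b = 3
X = [[1, 0, 2], [-1, 3, 1]]       # d_a x d_b

lhs = vec(matmul(matmul(A, X), transpose(B)))          # vec(A X B^T)
x_vec = vec(X)
rhs = [sum(k * x for k, x in zip(row, x_vec)) for row in kron(A, B)]
print(lhs == rhs)
```

The point of the identity is that a matrix expression that is linear in $X$ becomes an ordinary matrix-vector product in $\underline{X}$, so gradients with respect to $\underline{X}$ reduce to standard vector calculus.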

To showcase the tensor trick, we consider a single-data-point attention loss following (Gao et al., 2023b, c). Setting $D\coloneqq\mathop{\rm{diag}}\left(\exp(X^{\mathsf{T}}W_{K}^{\mathsf{T}}W_{Q}X)\mathds{1}_{L}\right)$ and $W\coloneqq W_{K}W_{Q}^{\mathsf{T}}\in\mathbb{R}^{d\times d}$ , we have

\displaystyle\mathcal{L}_{0}\coloneqq\big{\|}\underbrace{W_{V}}_{d\times d}% \underbrace{X}_{\in\mathbb{R}^{d\times L}}\underbrace{D^{-1}}_{\in\mathbb{R}^{% L\times L}}\underbrace{\exp{X^{\mathsf{T}}WX}}_{\in\mathbb{R}^{L\times L}}-% \underbrace{Y}_{\in\mathbb{R}^{d\times L}}\big{\|}_{2}^{2}.

(D.6)

Proposition D.1 (Definition 4.7 of (Gao et al., 2023b)).

By Definition D.3 and Definition D.4, we identify $D_{\underline{j},\underline{j}}\coloneqq\Braket{\exp(\operatorname{\mathsf{A}}% _{\underline{j}}\underline{W}),\mathds{1}_{L}}\in\mathbb{R}$ for all $\underline{j}\in[L]$ , with $\operatorname{\mathsf{A}}\coloneqq X\otimes X\in\mathbb{R}^{L^{2}\times d^{2}}$ and $\underline{W}\in\mathbb{R}^{d^{2}}$ . Therefore, for each $\underline{j}\in[L]$ and $\underline{i}\in[d]$ , it holds $\mathcal{L}_{0}=\sum_{\underline{j}=1}^{L}\sum_{\underline{i}=1}^{d}{\frac{1}{% 2}}\left(\Braket{D^{-1}_{\underline{j},\underline{j}}\exp(\operatorname{% \mathsf{A}}_{\underline{j}}\underline{W}),XW_{V}[\cdot,\underline{i}]}-Y_{% \underline{j},\underline{i}}\right)^{2}$ .

The elegance of Proposition D.1 emerges when we vectorize the weights into vectors $\underline{W},\underline{W}_{V}$ , making the gradient computations (e.g., $\nicefrac{{\differential\mathcal{L}_{0}}}{{\differential\underline{W}}}$ and $\nicefrac{{\differential\mathcal{L}_{0}}}{{\differential\underline{W}_{V}}}$ ) more tractable by avoiding complex matrix or tensor derivatives. This approach systematically simplifies the handling of chain-rule terms in the gradient computation of losses like $\mathcal{L}_{0}$ .

Appendix E More Background and Auxiliary Lemmas: Universal Approximation of Transformers via Piecewise Approximation

Here, we review the universal approximation of Transformers following (Yun et al., 2020). Our goal is to reproduce the results of (Yun et al., 2020) and use or modify them as auxiliary lemmas for the proofs of Section 3 (i.e., Appendix F).

We start with their central result, and the rest of the section aims to prove it.

Lemma E.1 (Universal Approximation of Transformers, Theorem 3 of (Yun et al., 2020)).

Let $\epsilon>0$ . For any given compact-supported continuous function $f:\mathbb{R}^{d\times L}\to\mathbb{R}^{d\times L}$ , there exists a Transformer network $f_{\mathcal{T}}\in\mathcal{T}_{p}^{2,1,4}$ such that we have

\displaystyle\left(\int\norm{f_{\mathcal{T}}(X)-f(X)}_{F}^{2}\differential X% \right)^{1/2}\leq\epsilon.

Proof Overview.

We use the following proof strategy:

•

Step 1. We show that piecewise-constant functions are able to approximate compact-supported continuous functions in Section E.1.
•

Step 2. We define modified self-attention and feed-forward layers to construct the modified transformer. We show that the modified transformer is able to approximate piecewise-constant functions in Section E.2.
•

Step 3. We show that the standard transformer is able to approximate the modified transformer in Section E.3.

Below, we provide details of Step 1 in Section E.1, Step 2 in Section E.2, and Step 3 in Section E.3. Then we give a summary of our results in Section E.4.

E.1 Piecewise-constant Function Approximates Compact-Supported Continuous Function

In this subsection, we show that piecewise-constant functions are able to approximate compact-supported continuous functions.

We start with the definition of the compact-supported continuous functions of interest.

Assumption E.1.

Without loss of generality, we assume that the target function in discussion is supported on $[0,1]^{d\times L}$ . We denote the set of $[0,1]^{d\times L}$ -supported continuous functions as $\mathcal{F}$ .

We introduce the notion of grid and cube for the compact support $[0,1]^{d\times L}$ .

Definition E.1 (Grid and Cube with Width $\delta$ ).

Given a grid width $\delta$ , let $\mathcal{G}_{\delta}\coloneqq\{0,\delta,\dots,1-\delta\}^{d\times L}$ denote the set of grid points within $[0,1]^{d\times L}$ . For a grid point $G=(G_{j,k})_{j\in[d],k\in[L]}\in\mathcal{G}_{\delta}$ , we denote its associated cube as

\displaystyle\mathcal{S}_{G}:=\otimes_{j=1}^{d}\otimes_{k=1}^{L}[G_{j,k},G_{j,% k}+\delta)\subset[0,1]^{d\times L}.

We introduce the notion of the piecewise-constant function class w.r.t. the $[0,1]^{d\times L}$ -supported continuous function class $\mathcal{F}$ .

Definition E.2 (Piecewise-Constant Function Class).

Let $f_{\delta}$ denote a piecewise-constant function of grid width $\delta$ , and $\mathds{1}\{\cdot\}$ denote the indicator function. For each $G\in\mathcal{G}_{\delta}$ , and any matrix $A_{G}\in\mathbb{R}^{d\times L}$ , we define the piecewise-constant function class as

\displaystyle\mathcal{F}(\delta)\coloneqq\left\{f_{\delta}:X\rightarrow\sum% \nolimits_{G\in\mathcal{G}_{\delta}}A_{G}\cdot\mathds{1}\{X\in\mathcal{S}_{G}% \},A_{G}\in\mathbb{R}^{d\times L}\right\}.

(E.1)

We recall that for a given sequence-to-sequence function $f$ , we have

\displaystyle\norm{f}_{L^{2}}:=\bigg{(}\int\norm{f(X)}_{F}^{2}\differential X% \bigg{)}^{1/2}.

We approximate compact-supported continuous functions by piecewise-constant functions via the next lemma.

Lemma E.2.

(Lemma 8 of (Yun et al., 2020)) For any given $f\in\mathcal{F}$ and $\epsilon/3>0$ , we can find a $\delta^{\star}>0$ such that there exists an $f_{\delta^{\star}}\in\mathcal{F}(\delta^{\star})$ satisfying $\norm{f-f_{\delta^{\star}}}_{L^{2}}\leq\epsilon/3$ .

Proof.

See Section E.5.2 for a detailed proof. ∎

E.2 Modified Transformer Approximates Piecewise-Constant Function

In this subsection, we define modified self-attention and feed-forward layers to construct the modified transformers. We use the modified transformers to approximate piecewise-constant functions.

Definition E.3 (Modified Transformer Networks).

The modified transformer networks $\bar{\mathcal{T}}_{p}^{r,m,l}$ differ from the standard transformer networks $\mathcal{T}_{p}^{r,m,l}$ in two ways:

•

Modified attention layer: Replace $\mathop{\rm{Softmax}}$ operator with $\mathop{\rm{Hardmax}}$ operator $\sigma_{H}(\cdot)$ .
•

Modified feed-forward layer: Replace ${\rm ReLU(\cdot)}$ with an activation function $\zeta\in\Psi$ . Here, $\Psi$ denotes the set of all piecewise linear functions with at most three pieces, at least one of which is constant.

We approximate $\mathcal{F}(\delta)$ with the modified transformer networks $\bar{\mathcal{T}}_{p}^{r,m,l}$ as follows.

Lemma E.3 (Modified from Proposition 4 of (Yun et al., 2020)).

For each $f_{\delta}\in\mathcal{F}(\delta)$ , there exists a $f_{\mathcal{T},c}\in\bar{\mathcal{T}}_{p}^{2,1,1}$ such that $\norm{f_{\delta}-f_{\mathcal{T},c}}_{L^{2}}=\mathcal{O}(\delta^{d/2})$ .

Proof Sketch.

Given $\delta$ , we have the grid $\mathcal{G}_{\delta}$ and the cube $\mathcal{S}_{G}$ for each $G\in\mathcal{G}_{\delta}$ . Our proof follows two steps:

•
Quantization. For all $X\in\mathbb{R}^{d\times L}$ , we quantize it to a finite set:
- –
  
  If $X\in\mathcal{S}_{G}\subset[0,1]^{d\times L}$ , we quantize it to the element $G\in\mathcal{G}_{\delta}$ .
- –
  
  If $X\notin[0,1]^{d\times L}$ , we quantize it to an element outside $\mathcal{G}_{\delta}$ .
•

Mapping. For any $G\in\mathcal{G}_{\delta}$ , we map it to the desired output $A_{G}$ .

For Quantization, we use a series of modified feed-forward layers. We show this in Section E.2.1.

For Mapping, we follow two steps:

•

For any $G\neq G^{\prime}\in\mathcal{G}_{\delta}$ , we use a “contextual mapping” $q_{c}(\cdot)$ (Definition E.4), under which all entries of $q_{c}(G)$ and $q_{c}(G^{\prime})$ take distinct values. We implement this contextual mapping with a series of modified self-attention layers. We show this in Section E.2.2.
Definition E.4 (Contextual Mapping).
Consider a finite set $\mathcal{G}_{\delta}\subset\mathbb{R}^{d\times L}$ . A map $q_{c}:\mathcal{G}_{\delta}\rightarrow\mathbb{R}^{1\times L}$ defines a contextual mapping if the map satisfies the following:
- –
  
  For any $G\in\mathcal{G}_{\delta}$ , the entries in $q_{c}(G)$ are all distinct.
- –
  
  For any $G\neq G^{\prime}\in\mathcal{G}_{\delta}$ , all entries of $q_{c}(G)$ and $q_{c}(G^{\prime})$ are distinct.
•

For any $G\in\mathcal{G}_{\delta}$ , we use a series of modified feed-forward layers to map $q_{c}(G)$ to $A_{G}$ . We show this in Section E.2.3.

∎

Remark E.1.

Our proof differs from (Yun et al., 2020) in one aspect: while Proposition 4 in (Yun et al., 2020) uses a transformer network without positional encoding, we add positional encoding to complete our proof.

E.2.1 Quantization by Modified Feed-forward Layers

We use a series of modified feed-forward layers in $\bar{\mathcal{T}}_{p}^{r,m,l}$ to quantize an input $X\in\mathbb{R}^{d\times L}$ to an element $G$ in a grid:

\displaystyle\{-J,0,\delta,\dots,1-\delta\}^{d\times L},

where $J>L>0$ is a sufficiently large number to be determined later. We achieve this via two steps.

•

Step 1: Map the elements outside $[0,1)$ to $-J$ .

We use $e_{i}$ to represent the standard unit vector where the $i$ -th element is $1$ . For the $i$ -th row of $X$ , we define the following feed-forward layer to achieve our aim.

Definition E.5 (Feed-forward Layer 1).

The vector $e_{i}$ acts as the weight parameters and $\zeta_{1}(\cdot)$ acts as the activation function in the feed-forward layer.

\displaystyle X\rightarrow X+e_{i}\zeta_{1}(e_{i}^{\top}X),\leavevmode\nobreak% \ \leavevmode\nobreak\ \zeta_{1}(t)=\begin{cases}-t-J&\text{for }t<0\text{ or % }t\geq 1,\\ 0&\text{otherwise}.\end{cases}

(E.2)

We take $i=1$ as an example to give the specific calculation. We denote $X=(x_{i,j})_{d\times L}$ , then we have

	$\displaystyle\leavevmode\nobreak\ {\rm FF}(X)$	$\displaystyle=X+\begin{pmatrix}1\\ 0\\ \vdots\\ 0\end{pmatrix}\begin{pmatrix}\zeta_{1}(x_{1,1}),&\zeta_{1}(x_{1,2}),&\cdots,&% \zeta_{1}(x_{1,L})\end{pmatrix}$
		$\displaystyle=X+\begin{pmatrix}\zeta_{1}(x_{1,1})&\zeta_{1}(x_{1,2})&\cdots&% \zeta_{1}(x_{1,L})\\ 0&0&\cdots&0\\ \vdots&\vdots&\vdots&\vdots\\ 0&0&\cdots&0\end{pmatrix}.$

In the first row of $X$ , the above layer maps every element outside $[0,1)$ to $-J$ .

We stack the above layers together for $i=1,2,\dots,d$ . If an element of $X$ is outside $[0,1)$ , the series of layers maps it to $-J$ .

•

Step 2: Map the elements in $[0,1)$ to $\{0,\delta,2\delta,\dots,1-\delta\}$ .

For the $i$ -th row of $X$ , we take $k=0,1,\dots,1/\delta-1$ respectively, and define the following layer.

Definition E.6 (Feed-forward Layer 2).

The vector $e_{i}$ acts as the weight parameters and $\zeta_{2}(\cdot)$ acts as the activation function in the feed-forward layer.

\displaystyle X\rightarrow X+e_{i}\zeta_{2}(e_{i}^{\top}X-k\delta\mathds{1}_{L}^{\top}),\leavevmode\nobreak\ \leavevmode\nobreak\ \zeta_{2}(t)=\begin{cases}0&t<0\text{ or }t\geq\delta,\\ -t&0\leq t<\delta.\end{cases}

(E.3)

We take $i=1,k=1$ as an example, and give the specific calculation.

	$\displaystyle{\rm FF}(X)$	$\displaystyle=X+\begin{pmatrix}1\\ 0\\ \vdots\\ 0\end{pmatrix}\begin{pmatrix}\zeta_{2}(x_{1,1}-\delta)&\zeta_{2}(x_{1,2}-% \delta)&\cdots&\zeta_{2}(x_{1,L}-\delta)\end{pmatrix}$
		$\displaystyle=X+\begin{pmatrix}\zeta_{2}(x_{1,1}-\delta)&\zeta_{2}(x_{1,2}-% \delta)&\cdots&\zeta_{2}(x_{1,L}-\delta)\\ 0&0&\cdots&0\\ \vdots&\vdots&\vdots&\vdots\\ 0&0&\cdots&0\end{pmatrix}.$

In the first row of $X$ , the above layer maps every element in $[\delta,2\delta)$ to $\delta$ .

We stack the above layers together for $i=1,2,\dots,d$ and $k=0,1,\dots,1/\delta-1$ . If an element of $X$ is in $[k\delta,(k+1)\delta)$ , the series of layers maps it to $k\delta$ .

Combining the above two steps, we achieve our goal with $d/\delta+d$ feed-forward layers. We denote the composition of these $d/\delta+d$ layers as $f_{\mathcal{T},c1}$ .
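The two quantization steps can be sketched as follows. This is a minimal illustration of the residual layers in (E.2) and (E.3) in exact rational arithmetic, not an implementation of a transformer; the function `quantize`, the choice $\delta=1/4$, $J=10$, and the input matrix are hypothetical.

```python
from fractions import Fraction as F

def zeta1(t, J):
    """Activation of (E.2): -t - J outside [0, 1) and 0 inside,
    so the residual update x + zeta1(x) equals -J outside [0, 1)."""
    return -t - J if (t < 0 or t >= 1) else 0

def zeta2(t, delta):
    """Activation of (E.3): -t on [0, delta) and 0 elsewhere."""
    return -t if 0 <= t < delta else 0

def quantize(X, delta, J):
    """Apply the d layers of Step 1 and the d/delta layers of Step 2."""
    X = [row[:] for row in X]
    d, steps, layers = len(X), int(1 / delta), 0
    for i in range(d):                      # Step 1: one layer per row
        X[i] = [x + zeta1(x, J) for x in X[i]]
        layers += 1
    for i in range(d):                      # Step 2: one layer per (row, k)
        for k in range(steps):
            X[i] = [x + zeta2(x - k * delta, delta) for x in X[i]]
            layers += 1
    return X, layers

delta, J = F(1, 4), 10
X = [[F(3, 10), F(-1, 2), F(99, 100)],
     [F(1, 4),  F(7, 5),  F(0)]]
Q, n_layers = quantize(X, delta, J)
print(Q, n_layers)
```

In this instance every entry in $[0,1)$ is rounded down to a multiple of $\delta$, every entry outside $[0,1)$ becomes $-J$, and the layer count equals $d/\delta+d=10$.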

E.2.2 Contextual Mapping by Modified Self-attention Layers

In our attention layers, we use the following positional encoding $E\in\mathbb{R}^{d\times L}$ .

\displaystyle E=\begin{pmatrix}0&1&2&\cdots&L-1\\ 0&1&2&\cdots&L-1\\ \vdots&\vdots&\vdots&&\vdots\\ 0&1&2&\cdots&L-1\end{pmatrix}.

(E.4)

According to Section E.2.1, the output of $f_{\mathcal{T},c1}$ is in the grid $\{-J,0,\delta,\dots,1-\delta\}^{d\times L}$ . For any $X$ in this grid, the first column of $X+E$ is in

\displaystyle\{-J,0,\delta,\dots,1-\delta\}^{d},

and the second column is in

\displaystyle\{-J+1,1,1+\delta,\dots,2-\delta\}^{d}.

For the other columns, the results are similar.

For $i=0,1,\dots,L-1$ , we use the following notation:

\displaystyle[i:\delta:i+1-\delta]_{J}\coloneqq\{i-J,i,i+\delta,\dots,i+1-% \delta\}.

Then we define the grid $\mathcal{G}_{\delta}^{+}$ as follows.

Definition E.7 (Grid $\mathcal{G}_{\delta}^{+}$ ).

$X+E$ is in the grid:

\displaystyle\mathcal{G}_{\delta}^{+}\coloneqq[0:\delta:1-\delta]_{J}^{d}% \times[1:\delta:2-\delta]_{J}^{d}\times\cdots\times[L-1:\delta:L-\delta]_{J}^{% d}.

Next, we show that the modified attention layer computes a contextual mapping (Definition E.4) for $\mathcal{G}_{\delta}^{+}$ . For $i=0,1,\dots,L-1$ , we use the following notation:

\displaystyle[i:\delta:i+1-\delta]\coloneqq\{i,i+\delta,i+2\delta,\dots,i+1-% \delta\}.

Lemma E.4 (Modified from Lemma 6 of (Yun et al., 2020)).

We consider the following subset of $\mathcal{G}_{\delta}^{+}$ :

\displaystyle\widetilde{\mathcal{G}}_{\delta}:=\underbrace{[0:\delta:1-\delta]% ^{d}\times[1:\delta:2-\delta]^{d}\times\cdots\times[L-1:\delta:L-\delta]^{d}}_% {L}.

Assume that $L\geq 2$ and $\delta^{-1}\geq 2$ . Then, there exists a function $f_{\mathcal{T},c2}:\mathbb{R}^{d\times L}\to\mathbb{R}^{d\times L}$ composed of $\delta^{-d}+1$ modified attention layers (Definition E.3), a vector $u\in\mathbb{R}^{d}$ , and two constants $t_{l},t_{r}\in\mathbb{R}$ ( $0<t_{l}<t_{r}$ ), such that $q_{c}(G)\coloneqq u^{\top}f_{\mathcal{T},c2}(G),G\in\mathcal{G}_{\delta}^{+}$ satisfies the following properties:

1.

For any $G\in\widetilde{\mathcal{G}}_{\delta}$ , the entries of $q_{c}(G)$ are all distinct.
2.

For any different $G,G^{\prime}\!\in\!\widetilde{\mathcal{G}}_{\delta}$ , all entries of $q_{c}(G)$ , $q_{c}(G^{\prime})$ are distinct.
3.

For any $G\in\widetilde{\mathcal{G}}_{\delta}$ , all the entries of $q_{c}(G)$ are in $[t_{l},t_{r}]$ .
4.

For any $G\in\mathcal{G}^{+}_{\delta}\setminus\widetilde{\mathcal{G}}_{\delta}$ , all the entries of $q_{c}(G)$ are outside $[t_{l},t_{r}]$ .

Proof.

See Section E.5.3 for a detailed proof. ∎

Remark E.2.

Our proof differs from (Yun et al., 2020) in one aspect: the original (Yun et al., 2020, Lemma 6) does not include positional encoding (E.4). We add (E.4) to the input of the attention layer.

E.2.3 Map to the Desired Output by Modified Feed-forward Layers

Next, we show that a series of feed-forward layers maps the output of the modified attention layers $f_{\mathcal{T},c2}$ to the desired output of the function $f_{\delta^{\star}}$ .

Lemma E.5 (Lemma 7 of (Yun et al., 2020)).

There exists a function $f_{\mathcal{T},c3}:\mathbb{R}^{d\times L}\to\mathbb{R}^{d\times L}$ composed of $\mathcal{O}(L(1/\delta)^{dL}/L!)$ modified feed-forward layers, such that

\displaystyle f_{\mathcal{T},c3}\circ f_{\mathcal{T},c2}(G)=\begin{cases}A_{G}% &\text{ if }G\in\widetilde{\mathcal{G}}_{\delta},\\ \mathbf{0}_{d\times L}&\text{ if }G\in\mathcal{G}^{+}_{\delta}\setminus% \widetilde{\mathcal{G}}_{\delta}.\end{cases}

Proof.

See Section E.5.4 for a detailed proof. ∎

From the above conclusions, we have the following lemma for the required number of layers in the modified transformer.

Lemma E.6 ((Yun et al., 2020)).

From the proof of Lemma E.3, if we want to achieve an approximation error $\mathcal{O}(\delta^{d/2})$ by the modified transformer, we need $\mathcal{O}(\delta^{-1})$ modified feed-forward layers in $f_{\mathcal{T},c1}$ , $\mathcal{O}(\delta^{-d})$ modified self-attention layers in $f_{\mathcal{T},c2}$ , and $\mathcal{O}(\delta^{-dL})$ modified feed-forward layers in $f_{\mathcal{T},c3}$ .

Proof.

By the proof of Lemma E.3, we complete the proof. ∎

E.3 Standard Transformers Approximate Modified Transformers

In this subsection, we show that standard neural network layers are able to approximate the modified self-attention layers and the modified feed-forward layers (Definition E.3). We have the following Lemma E.7.

Lemma E.7 (Lemma 9 of (Yun et al., 2020)).

For each $f_{\mathcal{T},c}\in\bar{\mathcal{T}}_{p}^{2,1,1}$ and any $\epsilon>0$ , there exists $f_{\mathcal{T}}\in\mathcal{T}_{p}^{2,1,4}$ such that $\norm{f_{\mathcal{T}}-f_{\mathcal{T},c}}_{L^{2}}\leq\epsilon/3$ .

Proof.

See Section E.5.5 for a detailed proof. ∎

E.4 All Together: Standard Transformers Approximate Compact-Supported Continuous Functions

We summarize the results of Lemmas E.2, E.3 and E.7, and thus prove Lemma E.1. Furthermore, to achieve the $\epsilon$ approximation error in Lemma E.1, we take $\delta=\mathcal{O}(\epsilon^{2/d})$ in Lemma E.3.

E.5 Supplementary Proofs

Here we first present two preliminaries, the selective shift operation and the bijective column ID mapping, in Section E.5.1. Then we give the proof of Lemma E.2 in Section E.5.2, the proof of Lemma E.4 in Section E.5.3, the proof of Lemma E.5 in Section E.5.4, and the proof of Lemma E.7 in Section E.5.5.

E.5.1 Preliminaries

We define two preliminaries: the selective shift operation and the bijective column ID mapping.

Selective Shift Operation.

This operation refers to shifting certain entries of the input selectively.

To achieve this, we consider the following function $\xi(\cdot;\cdot):\mathbb{R}^{d\times L}\rightarrow\mathbb{R}^{d\times L}$ .

\displaystyle\xi(X;b_{Q})=e_{1}u^{\top}X\sigma_{H}\left[(u^{\top}X)^{\top}(u^{\top}X-b_{Q}\mathds{1}_{L}^{\top})\right],

(E.5)

where $X\in\mathbb{R}^{d\times L}$ , $e_{1}=(1,0,0,\cdots,0)^{\top}\in\mathbb{R}^{d}$ , $b_{Q}\in\mathbb{R}$ , and $u\in\mathbb{R}^{d}$ is a vector to be determined.

To see the output, we consider the $j$ -th column of $u^{\top}X\sigma_{H}\left[(u^{\top}X)^{\top}(u^{\top}X-b_{Q}\mathds{1}_{L}^{\top})\right]$ :

•

If $u^{\top}X_{:,j}>b_{Q}$ , it calculates $\mathop{\mathrm{argmax}}$ of $u^{\top}X$ ;
•

If $u^{\top}X_{:,j}<b_{Q}$ , it calculates $\mathop{\mathrm{argmin}}$ of $u^{\top}X$ .

With $e_{1}$ , all rows of $\xi(X;b_{Q})$ except the first row are zero. We consider the $j$ -th entry of the first row in $\xi(X;b_{Q})$ , which is denoted as $\xi(X;b_{Q})_{1,j}$ . Then for all $j\in[L]$ , we have

\displaystyle\xi(X;b_{Q})_{1,j}=u^{\top}X\sigma_{H}\left[(u^{\top}X)^{\top}(u^% {\top}X_{:,j}-b_{Q})\right]=\begin{cases}\max_{k}u^{\top}X_{:,k}&\text{ if }u^% {\top}X_{:,j}>b_{Q},\\ \min_{k}u^{\top}X_{:,k}&\text{ if }u^{\top}X_{:,j}<b_{Q}.\end{cases}

From this observation, we define a function parametrized by $b_{Q}$ and $b^{\prime}_{Q}$ , where $b_{Q}<b^{\prime}_{Q}$ .

\displaystyle\xi(X;b_{Q},b^{\prime}_{Q}):=\xi(X;b_{Q})-\xi(X;b^{\prime}_{Q}).

(E.6)

Then we have

\displaystyle\xi(X;b_{Q},b^{\prime}_{Q})_{1,j}=\begin{cases}\max_{k}u^{\top}X_{:,k}-\min_{k}u^{\top}X_{:,k}&\leavevmode\nobreak\ \text{if}\leavevmode\nobreak\ b_{Q}<u^{\top}X_{:,j}<b^{\prime}_{Q},\\ 0&\leavevmode\nobreak\ \text{otherwise}.\end{cases}

We define an attention layer of the form $X\rightarrow X+\xi(X;b_{Q},b^{\prime}_{Q})$ . For any column $X_{:,j}$ , if $b_{Q}<u^{\top}X_{:,j}<b^{\prime}_{Q}$ , its first coordinate $X_{1,j}$ is shifted up by $\max_{k}u^{\top}X_{:,k}-\min_{k}u^{\top}X_{:,k}$ , while all the other coordinates stay untouched. We call this the selective shift operation, because we can choose $b_{Q}$ and $b^{\prime}_{Q}$ to shift certain entries of the input selectively.
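The selective shift operation can be sketched in a few lines. The code below is a minimal illustration of (E.5) and (E.6) with a hardmax that selects a single arg max; it assumes the maximizer is unique and that no column ID coincides with a threshold. The function names, the vector `u`, and the example matrix are hypothetical.

```python
def column_ids(X, u):
    """Compute u^T X, one scalar per column of X."""
    return [sum(ui * X[i][j] for i, ui in enumerate(u))
            for j in range(len(X[0]))]

def xi(X, u, b):
    """First row of xi(X; b) in (E.5). For query column j, the hardmax over
    scores s_k * (s_j - b) picks max_k s_k if s_j > b and min_k s_k if s_j < b."""
    s = column_ids(X, u)
    out = []
    for sj in s:
        scores = [sk * (sj - b) for sk in s]
        k_star = max(range(len(s)), key=scores.__getitem__)  # hardmax (unique max assumed)
        out.append(s[k_star])
    return out

def selective_shift(X, u, b_lo, b_hi):
    """Residual layer X -> X + xi(X; b_lo, b_hi) from (E.6); only row 1 changes."""
    lo, hi = xi(X, u, b_lo), xi(X, u, b_hi)
    shifted = [row[:] for row in X]
    shifted[0] = [x + a - c for x, a, c in zip(X[0], lo, hi)]
    return shifted

X = [[1, 0, 2],
     [0, 2, 2]]          # d = 2, L = 3
u = [1, 2]               # column IDs u^T X = [1, 4, 6]
Y = selective_shift(X, u, 3, 5)
print(Y)
```

Only the second column has its ID in $(3,5)$, so only $X_{1,2}$ is shifted, and the shift equals $\max_{k}u^{\top}X_{:,k}-\min_{k}u^{\top}X_{:,k}=5$.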

Bijective Column ID Mapping.

We consider the input $G\in\mathcal{G}^{+}_{\delta}$ (Definition E.7). We use

\displaystyle J=L+3L\delta^{-dL},\leavevmode\nobreak\ \text{and}\leavevmode% \nobreak\ u=(1,\delta^{-1},\delta^{-2},\dots,\delta^{-d+1}).

(E.7)

For any $j\in[L]$ , we have the following two conclusions:

•

If $G_{i,j}\geq 0$ for all $i\in[d]$ , i.e., $G_{:,j}\in[j-1:\delta:j-\delta]^{d}$ , then we have

\displaystyle u^{\top}G_{:,j}\in\left[\delta_{j}:\delta:\delta_{j}+\delta^{-d+% 1}-\delta\right],\leavevmode\nobreak\ \text{where}\leavevmode\nobreak\ \delta_% {j}=(j-1)\cdot\left(\frac{\delta-\delta^{-d+1}}{\delta-1}\right).

(E.8)

The map $G_{:,j}\rightarrow u^{\top}G_{:,j}$ from $[j-1:\delta:j-\delta]^{d}$ to $\left[\delta_{j}:\delta:\delta_{j}+\delta^{-d+1}-\delta\right]$ is a bijection.

•

If there exists $i\in[d]$ such that $G_{i,j}=-J+j$ , then

\displaystyle u^{\top}G_{:,j}\leq-3L\delta^{-dL}+(j-1)\cdot\left(\frac{\delta^% {-d+1}-\delta}{1-\delta}\right)+\delta^{-d+1}<0.

(E.9)

We say that $u^{\top}G_{:,j}$ gives the “column ID” for each possible value of $G_{:,j}\in[j-1:\delta:j-\delta]^{d}$ .

Remark E.3 (Illustration of Bijection Property).

For the bijection property, we give the following illustration. Let $G_{:j}=(g_{1j},g_{2j},\cdots,g_{dj})^{\top}$ and $\bar{G}_{:j}=(\bar{g}_{1j},\bar{g}_{2j},\cdots,\bar{g}_{dj})^{\top}$ . If $u^{\top}G_{:j}=u^{\top}\bar{G}_{:j}$ and $G_{:j}\neq\bar{G}_{:j}$ , we deduce

\displaystyle(g_{1j}-\bar{g}_{1j})+\delta^{-1}(g_{2j}-\bar{g}_{2j})+\cdots+% \delta^{-d+1}(g_{dj}-\bar{g}_{dj})=0.

(E.10)

Because $G_{:j}\neq\bar{G}_{:j}$ , there exists a $k\leavevmode\nobreak\ (k\leq d)$ such that $g_{kj}\neq\bar{g}_{kj}$ and $g_{ij}=\bar{g}_{ij}$ for all $i>k$ . We have

\displaystyle\absolutevalue{\delta^{-k+1}(g_{kj}-\bar{g}_{kj})}\geq\delta^{-k+% 2}.

However,

		$\displaystyle\leavevmode\nobreak\ \absolutevalue{(g_{1j}-\bar{g}_{1j})+\cdots+% \delta^{-k+2}(g_{k-1,j}-\bar{g}_{k-1,j})}$
	$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ \absolutevalue{g_{1j}-\bar{g}_{1j}}+\cdots+% \absolutevalue{\delta^{-k+2}(g_{k-1,j}-\bar{g}_{k-1,j})}$
	$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ (1-\delta)+\cdots+\delta^{-k+2}(1-\delta)$
	$\displaystyle<$	$\displaystyle\leavevmode\nobreak\ \delta^{-k+2}.$

This contradicts (E.10). Thus we prove the bijection property.
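The bijection in (E.8) can also be checked by brute force on small instances. The sketch below enumerates every column in $[j-1:\delta:j-\delta]^{d}$, computes its column ID $u^{\top}G_{:,j}$ in exact rational arithmetic, and compares the sorted IDs with the grid $[\delta_{j}:\delta:\delta_{j}+\delta^{-d+1}-\delta]$. The function name and the tested values of $d$, $\delta$, and $j$ are illustrative assumptions.

```python
from fractions import Fraction as F
from itertools import product

def column_id_check(d, delta, j):
    """Return (sorted column IDs, expected grid) for column j."""
    # Admissible entries of column j: [j-1 : delta : j-delta].
    levels = [(j - 1) + k * delta for k in range(int(1 / delta))]
    # u = (1, delta^{-1}, ..., delta^{-d+1}).
    u = [delta ** (-i) for i in range(d)]
    ids = sorted(sum(ui * gi for ui, gi in zip(u, col))
                 for col in product(levels, repeat=d))
    # delta_j from (E.8).
    delta_j = (j - 1) * (delta - delta ** (1 - d)) / (delta - 1)
    expected = [delta_j + k * delta for k in range(len(levels) ** d)]
    return ids, expected

ids_a, exp_a = column_id_check(d=2, delta=F(1, 2), j=1)
ids_b, exp_b = column_id_check(d=2, delta=F(1, 3), j=3)
ids_c, exp_c = column_id_check(d=3, delta=F(1, 2), j=2)
print(ids_a == exp_a, ids_b == exp_b, ids_c == exp_c)
```

Since each list has $\delta^{-d}$ entries and the sorted IDs coincide with the target grid, the map $G_{:,j}\rightarrow u^{\top}G_{:,j}$ is one-to-one and onto in these instances.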

E.5.2 Proof of Lemma E.2

Proof of Lemma E.2.

We restate the proof from (Yun et al., 2020) for completeness.

Since $f$ is continuous with compact support, it is uniformly continuous.

Because $\norm{\cdot}_{\infty}$ is equivalent to $\norm{\cdot}_{F}$ when the number of entries is finite, we have the following by the definition of uniform continuity.

For any $\epsilon/3>0$ , there exists a $\delta^{\star}>0$ such that for any $X,Y\in\mathbb{R}^{d\times L}$ with $\norm{X-Y}_{\infty}<\delta^{\star}$ , we have $\norm{f(X)-f(Y)}_{F}<\epsilon/3$ .

Then we perform the following steps following Definitions E.1 and E.2:

•

We create a grid $\mathcal{G}_{\delta^{\star}}$ by choosing grid width $\delta^{\star}$ , and cube $\mathcal{S}_{G}$ with respect to $G\in\mathcal{G}_{\delta^{\star}}$ .
•

For any grid point $G\in\mathcal{G}_{\delta^{\star}}$ , we define $C_{G}\in\mathcal{S}_{G}$ to be the center point of the cube $\mathcal{S}_{G}$ .
•

We define a piecewise-constant function $f_{\delta^{\star}}(X)=\sum\nolimits_{G\in\mathcal{G}_{\delta^{\star}}}f(C_{G})\mathds{1}\{X\in\mathcal{S}_{G}\}$ .

Then for any $X\in\mathcal{S}_{G}$ , we have $\norm{X-C_{G}}_{\infty}<\delta^{\star}$ . By uniform continuity, we derive

\displaystyle\norm{f(X)-f_{\delta^{\star}}(X)}_{F}=\norm{f(X)-f(C_{G})}_{F}<% \epsilon/3.

This implies that $\norm{f-f_{\delta^{\star}}}_{L^{2}}<\epsilon/3$ and completes the proof. ∎
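To illustrate the construction, the following sketch builds the cell-center piecewise-constant approximation in the simplest case $d=L=1$ and estimates the sup-norm error on a fine sample of $[0,1)$. The target $f(x)=\sin(2\pi x)$, the grid widths, and the helper names are illustrative assumptions. For a target that is Lipschitz with constant $K$, the error on each cell is at most $K\delta/2$, which is the quantitative form of the uniform-continuity argument above.

```python
import math

def piecewise_constant(f, delta_inv):
    """Return f_delta on [0, 1): constant on each cell [k*delta, (k+1)*delta)
    and equal to f at the cell center, as in the proof of Lemma E.2."""
    delta = 1.0 / delta_inv
    def f_delta(x):
        k = min(int(x * delta_inv), delta_inv - 1)   # index of the cell containing x
        return f((k + 0.5) * delta)
    return f_delta

def sup_error(f, f_delta, samples=10_000):
    """Estimate sup |f - f_delta| over a uniform sample of [0, 1)."""
    return max(abs(f(i / samples) - f_delta(i / samples))
               for i in range(samples))

def f(x):
    return math.sin(2 * math.pi * x)    # illustrative target, Lipschitz with K = 2*pi

errors = {n: sup_error(f, piecewise_constant(f, n)) for n in (4, 16, 64)}
for n, err in errors.items():
    print(n, err, math.pi / n)          # observed error vs. the bound K * delta / 2
```

Refining the grid shrinks the error in proportion to $\delta$, so any target accuracy $\epsilon/3$ is reached by a sufficiently small $\delta^{\star}$.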

E.5.3 Proof of Lemma E.4

We give the proof of Lemma E.4 by constructing the network to satisfy the requirements.

Proof of Lemma E.4.

Recalling the selective shift operation in Section E.5.1, the overall construction consists of two steps:

•

Step 1: For each $j\in[L]$ , we stack $\delta^{-d}$ attention layers. We use the attention layer as

\displaystyle\delta^{-d}\xi(\cdot;g-\delta/2,g+\delta/2),

(E.11)

for $g\in[\delta_{j}:\delta:\delta_{j}+\delta^{-d+1}-\delta]$ (see (E.8)) in increasing order. The total number of layers is $L\delta^{-d}$ . These layers map $G\in\widetilde{\mathcal{G}}_{\delta}$ to $L$ distinct entries, as required by Property 1 of Lemma E.4.

•

Step 2: We add an extra single-head attention layer with attention part

\displaystyle L\delta^{-(L+1)d-1}\xi(\cdot;0).

(E.12)

This layer performs a global shift and maps different $G\in\widetilde{\mathcal{G}}_{\delta}$ to unique elements, as required by Property 2 of Lemma E.4.

The two operations together map $\widetilde{\mathcal{G}}_{\delta}$ and $\mathcal{G}_{\delta}^{+}\setminus\widetilde{\mathcal{G}}_{\delta}$ to different sets, as required by Properties 3 and 4 of Lemma E.4. The bounds $t_{l}$ and $t_{r}$ are then computed.

We now give the detailed proof by showing the effect of the two steps and verifying the four properties of Lemma E.4. To do so, we divide $\mathcal{G}^{+}_{\delta}$ into two categories:

•

Category 1: $G\in\widetilde{\mathcal{G}}_{\delta}$ , all entries in the point $G$ are between $0$ and $L-\delta$ .
•

Category 2: $G\in\mathcal{G}^{+}_{\delta}\setminus\widetilde{\mathcal{G}}_{\delta}$ , the point $G$ has at least one entry that equals $-J$ .

Let $u=(1,\delta^{-1},\delta^{-2},\ldots,\delta^{-d+1})$ , and recall that $\delta_{j}=(j-1)(\delta-\delta^{-d+1})/(\delta-1)$ for any $j\in[L]$ in (E.8).

Category 1.

We denote $g_{j}\coloneqq u^{\top}G_{:,j}$ , then we have $g_{1}<g_{2}<\cdots<g_{L}$ . For each $j\in[L]$ , the corresponding $\delta^{-d}$ layers sweep the set $[\delta_{j}:\delta:\delta_{j}+\delta^{-d+1}-\delta]$ and apply the selective shift operation to each element of the set. This means that the selective shift operation is applied to $g_{1}$ first, then $g_{2}$ , then $g_{3}$ , and so on, regardless of the specific values of the $g_{j}$ ’s.

•

First Shift Operation. In the first selective shift operation with $g$ going through $[\delta_{1}:\delta:\delta_{1}+\delta^{-d+1}-\delta]$ , the $(1,1)$ -th entry of $G$ (i.e., $G_{1,1}$ ) is shifted by the operation, while the other entries are left untouched. The updated value $\widetilde{G}_{1,1}$ is

\displaystyle\widetilde{G}_{1,1}=G_{1,1}+\delta^{-d}\left[\max_{k}\left(u^{% \top}G_{:,k}\right)-\min_{k}\left(u^{\top}G_{:,k}\right)\right]=G_{1,1}+\delta% ^{-d}(g_{L}-g_{1}).

Therefore, after the operation, the output of the layer is

\displaystyle\begin{pmatrix}\widetilde{G}_{:,1}&G_{:,2}&\cdots&G_{:,L}\end{% pmatrix}.

We have

	$\displaystyle\widetilde{g}_{1}$	$\displaystyle\coloneqq u^{\top}\widetilde{G}_{:,1}$
		$\displaystyle=\widetilde{G}_{1,1}+\sum_{i=2}^{d}\delta^{-i+1}G_{i,1}$
		$\displaystyle=G_{1,1}+\delta^{-d}(g_{L}-g_{1})+\sum_{i=2}^{d}\delta^{-i+1}G_{i% ,1}$
		$\displaystyle=g_{1}+\delta^{-d}(g_{L}-g_{1}).$

Then we deduce $g_{L}<\widetilde{g}_{1}$ , because

$\displaystyle\widetilde{g}_{1}$	$\displaystyle=g_{1}+\delta^{-d}(g_{L}-g_{1})$
	$\displaystyle\geq 0+\delta^{-d}\left[(L-1)\cdot\frac{\delta-\delta^{-d+1}}{% \delta-1}-\delta^{-d+1}+\delta\right]$	(By (E.8))
	$\displaystyle=\delta^{-d}\left[(L-1)\frac{\delta}{1-\delta}+\delta+(L-1)\frac{% \delta^{-d+1}}{1-\delta}-\delta^{-d+1}\right]$
	$\displaystyle\geq\delta^{-d}\cdot\left((L-1)\frac{\delta}{1-\delta}+\delta\right)$
	$\displaystyle=(L-1)\frac{\delta^{-d+1}}{1-\delta}+\delta^{-d+1}$
	$\displaystyle>g_{L}.$	(By $\delta<1$ and (E.8))

Thus, after updating,

\max u^{\top}\begin{pmatrix}\widetilde{G}_{:,1}&G_{:,2}&\cdots&G_{:,L}\end{% pmatrix}=\max\{\widetilde{g}_{1},g_{2},\dots,g_{L}\}=\widetilde{g}_{1},

and the new minimum is $g_{2}$ .

•

Second Shift Operation. In the second selective shift operation with $g$ going through $[\delta_{2}:\delta:\delta_{2}+\delta^{-d+1}-\delta]$ , the $(1,2)$ -th entry of $G$ (i.e., $G_{1,2}$ ) is shifted by the operation, while the other entries are left untouched. The updated value $\widetilde{G}_{1,2}$ is

	$\displaystyle\widetilde{G}_{1,2}$	$\displaystyle=G_{1,2}+\delta^{-d}(\widetilde{g}_{1}-g_{2})$
		$\displaystyle=G_{1,2}+\delta^{-d}(g_{1}-g_{2})+\delta^{-2d}(g_{L}-g_{1}).$

Therefore, after the operation, the output of the layer is

\displaystyle\begin{pmatrix}\widetilde{G}_{:,1}&\widetilde{G}_{:,2}&\cdots&G_{% :,L}\end{pmatrix}.

We have

	$\displaystyle\widetilde{g}_{2}$	$\displaystyle\coloneqq u^{\top}\widetilde{G}_{:,2}$
		$\displaystyle=g_{2}+\delta^{-d}(g_{1}-g_{2})+\delta^{-2d}(g_{L}-g_{1}).$

Then we deduce $\widetilde{g}_{1}<\widetilde{g}_{2}$ , because

		$\displaystyle g_{1}+\delta^{-d}(g_{L}-g_{1})<g_{2}+\delta^{-d}(g_{1}-g_{2})+% \delta^{-2d}(g_{L}-g_{1})$
	$\displaystyle\iff\leavevmode\nobreak\$	$\displaystyle(\delta^{-d}-1)(g_{2}-g_{1})<\delta^{-d}(\delta^{-d}-1)(g_{L}-g_{% 1}).$		(By $\delta^{-d}>1$ and $g_{L}>g_{2}$ )

Thus, after updating,

\max u^{\top}\begin{pmatrix}\widetilde{G}_{:,1}&\widetilde{G}_{:,2}&\cdots&G_{% :,L}\end{pmatrix}=\max\{\widetilde{g}_{1},\widetilde{g}_{2},\dots,g_{L}\}=% \widetilde{g}_{2},

and the new minimum is $g_{3}$ .

•

Repeating The Process. By repeating this process, we show that the $j$ -th shift operation shifts $G_{1,j}$ by $\delta^{-d}(\widetilde{g}_{j-1}-g_{j})$ , and we have

	$\displaystyle\widetilde{g}_{j}$	$\displaystyle\coloneqq u^{\top}\widetilde{G}_{:,j}$
		$\displaystyle=g_{j}+\sum_{k=1}^{j-1}\delta^{-kd}(g_{j-k}-g_{j-k+1})+\delta^{-% jd}(g_{L}-g_{1}).$

We deduce $\widetilde{g}_{j-1}<\widetilde{g}_{j}$ holds for all $2\leq j\leq L$ , because

		$\displaystyle\widetilde{g}_{j-1}<\widetilde{g}_{j}$
	$\displaystyle\iff$	$\displaystyle g_{j-1}+\sum_{k=2}^{j-1}\delta^{-kd+d}(g_{j-k}-g_{j-k+1})+\delta^{-(j-1)d}(g_{L}-g_{1})$
		$\displaystyle<g_{j}+\sum_{k=1}^{j-1}\delta^{-kd}(g_{j-k}-g_{j-k+1})+\delta^{-jd}(g_{L}-g_{1})$
	$\displaystyle\iff$	$\displaystyle\sum_{k=1}^{j-1}\delta^{-kd+d}(\delta^{-d}-1)(g_{j-k+1}-g_{j-k})<\delta^{-(j-1)d}(\delta^{-d}-1)(g_{L}-g_{1}),$

where the last inequality holds because

		$\displaystyle\sum_{k=1}^{j-1}\delta^{-kd+d}(g_{j-k+1}-g_{j-k})$
	$\displaystyle<$	$\displaystyle\delta^{-(j-1)d}\sum_{k=1}^{j-1}(g_{j-k+1}-g_{j-k})$
	$\displaystyle\leq$	$\displaystyle\delta^{-(j-1)d}(g_{L}-g_{1}).$

Therefore, after the $j$ -th selective shift operation, $\widetilde{g}_{j}$ is the new maximum among $\{\widetilde{g}_{1},\dots,\widetilde{g}_{j},g_{j+1},\dots,g_{L}\}$ and $g_{j+1}$ is the new minimum.

•

After $L$ Shift Operations. After all $L$ shift operations, the input $G$ is mapped to a new point $\widetilde{G}$ , where $u^{\top}\widetilde{G}=\begin{pmatrix}\widetilde{g}_{1}&\widetilde{g}_{2}&\dots&\widetilde{g}_{L}\end{pmatrix}$ and $\widetilde{g}_{1}<\widetilde{g}_{2}<\dots<\widetilde{g}_{L}$ . For the lower and upper bounds of $\widetilde{g}_{L}$ , we have the following lemma.

Lemma E.8 (Lemma 10 of (Yun et al., 2020)).

$\widetilde{g}_{L}=u^{\top}\widetilde{G}_{:,L}$ satisfies the following bounds:

\displaystyle\delta^{-(L-1)d+1}(\delta^{-d}-1)\leq\widetilde{g}_{L}\leq L% \delta^{-(L+1)d}.

Also, the mapping from $\begin{pmatrix}g_{1}&g_{2}&\cdots&g_{L}\end{pmatrix}$ to $\widetilde{g}_{L}$ is one-to-one.

•

Global Shifting by the Last Layer. We note that after the above $L$ shift operations, there is another attention layer with attention part $L\delta^{-(L+1)d-1}\xi(\cdot;0)$ . Since $0<\widetilde{g}_{1}<\cdots<\widetilde{g}_{L}$ , what it does to $\widetilde{G}$ is that it adds the following to each entry in the first row of $\widetilde{G}$ :

\displaystyle L\delta^{-(L+1)d-1}\max_{k}u^{\top}\widetilde{G}_{:,k}=L\delta^{% -(L+1)d-1}\widetilde{g}_{L}.

The output of this layer is defined to be the function $f_{\mathcal{T},c2}(G)$ .

Now, in summary, for any $G\in\widetilde{\mathcal{G}}_{\delta}$ , $i\in[d]$ , and $j\in[L]$ , we have

	$\displaystyle f_{\mathcal{T},c2}(G)_{i,j}$	$\displaystyle=\begin{cases}G_{1,j}+\delta_{j}^{+}&\text{ if }i=1,\\ G_{i,j}&\text{ if }2\leq i\leq d,\end{cases}$
		$\displaystyle\leavevmode\nobreak\ \text{where}\leavevmode\nobreak\ \delta_{j}^% {+}=\sum_{k=1}^{j-1}\delta^{-kd}(g_{j-k}-g_{j-k+1})+\delta^{-jd}(g_{L}-g_{1})+% L\delta^{-(L+1)d-1}\widetilde{g}_{L}.$

For any $G\in\widetilde{\mathcal{G}}_{\delta}$ and $j\in[L]$ ,

u^{\top}f_{\mathcal{T},c2}(G)_{:,j}=\widetilde{g}_{j}+L\delta^{-(L+1)d-1}% \widetilde{g}_{L}.
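The chain of selective shifts described above can be simulated exactly with rational arithmetic. The following is a minimal sketch, assuming illustrative values $\delta=1/4$ , $d=2$ , $L=3$ and sorted inputs $g_{1}<g_{2}<g_{3}$ ; the function names are ours and do not appear in the source. It checks the closed form for $\widetilde{g}_{j}$ and the ordering $g_{L}<\widetilde{g}_{1}<\dots<\widetilde{g}_{L}$ , and it omits the final global shift.

```python
from fractions import Fraction


def selective_shifts(g, delta, d):
    """Apply the L selective shifts to sorted values g_1 < ... < g_L.

    The j-th shift adds delta^{-d} * (current max - current min) to the
    current minimum g_j; before the first shift the maximum is g_L.
    """
    scale = delta ** (-d)
    current_max = g[-1]
    shifted = []
    for g_j in g:
        g_tilde = g_j + scale * (current_max - g_j)
        shifted.append(g_tilde)
        # The proof shows the freshly shifted value becomes the new maximum.
        current_max = g_tilde
    return shifted


def closed_form(g, delta, d, j):
    """Closed-form expression for g~_j, with j 1-indexed as in the text."""
    total = g[j - 1]
    for k in range(1, j):
        total += delta ** (-k * d) * (g[j - k - 1] - g[j - k])
    total += delta ** (-j * d) * (g[-1] - g[0])
    return total


# Illustrative values only (assumptions, not taken from the paper).
delta, d = Fraction(1, 4), 2
g = [Fraction(1, 4), Fraction(1), Fraction(3)]
g_tilde = selective_shifts(g, delta, d)
print(g_tilde)
```

Using `Fraction` avoids floating-point rounding, so the recursion and the closed form can be compared with exact equality.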

Next, we check Properties 1, 2, and 3 of Lemma E.4.

•

Checking Property 1 of Lemma E.4. Given any $G\in\widetilde{\mathcal{G}}_{\delta}$ , we have already proved that

\displaystyle\widetilde{g}_{1}<\widetilde{g}_{2}<\dots<\widetilde{g}_{L},

so they are all distinct.

•

Checking Property 2 of Lemma E.4. Note that the upper bound on $\widetilde{g}_{L}$ from Lemma E.8 also holds for other $\widetilde{g}_{j}$ ’s, so for all $j\in[L]$ , we have

\displaystyle L\delta^{-(L+1)d-1}\widetilde{g}_{L}\leq u^{\top}f_{\mathcal{T},% c2}(G)_{:,j}<L\delta^{-(L+1)d-1}\widetilde{g}_{L}+L\delta^{-(L+1)d}.

Now, from Lemma E.8, two different $G,G^{\prime}\in\widetilde{\mathcal{G}}_{\delta}$ map to different $\widetilde{g}_{L}$ and $\widetilde{g}^{\prime}_{L}$ , and they differ at least by $\delta$ . This means that two intervals

	$\displaystyle\leavevmode\nobreak\ [L\delta^{-(L+1)d-1}\widetilde{g}_{L},L% \delta^{-(L+1)d-1}\widetilde{g}_{L}+L\delta^{-(L+1)d}),$
	$\displaystyle\leavevmode\nobreak\ [L\delta^{-(L+1)d-1}\widetilde{g}^{\prime}_{% L},L\delta^{-(L+1)d-1}\widetilde{g}^{\prime}_{L}+L\delta^{-(L+1)d}),$

are guaranteed to be disjoint, so the entries of $u^{\top}f_{\mathcal{T},c2}(G)$ and $u^{\top}f_{\mathcal{T},c2}(G^{\prime})$ are all distinct.

This shows that the map $f_{\mathcal{T},c2}(\cdot)$ we constructed using $(1/\delta)^{d}+1$ attention layers implements a contextual mapping on $\widetilde{\mathcal{G}}_{\delta}$ .

•

Checking Property 3 of Lemma E.4. With $u^{\top}f_{\mathcal{T},c2}(G)_{:,j}\in[L\delta^{-(L+1)d-1}\widetilde{g}_{L},L% \delta^{-(L+1)d-1}\widetilde{g}_{L}+L\delta^{-(L+1)d})$ and Lemma E.8, we show that for any $G\in\widetilde{\mathcal{G}}_{\delta}$ , we have

	$\displaystyle\leavevmode\nobreak\ u^{\top}f_{\mathcal{T},c2}(G)_{:,j}\geq L% \delta^{-2(L+1)d}(\delta^{-d}-1),$
	$\displaystyle\leavevmode\nobreak\ u^{\top}f_{\mathcal{T},c2}(G)_{:,j}<L^{2}% \delta^{-2(L+1)d-1}+L\delta^{-(L+1)d}.$

This proves that all $u^{\top}f_{\mathcal{T},c2}(G)_{:,j}$ are between $t_{l}$ and $t_{r}$ , where

	$\displaystyle\leavevmode\nobreak\ t_{l}=L\delta^{-2(L+1)d}(\delta^{-d}-1),$
	$\displaystyle\leavevmode\nobreak\ t_{r}=L^{2}\delta^{-2(L+1)d-1}+L\delta^{-(L+% 1)d}.$

Category 2. Now we check Property 4 of Lemma E.4. For the input points $G\in\mathcal{G}^{+}_{\delta}\setminus\widetilde{\mathcal{G}}_{\delta}$ , note that the point $G$ has at least one entry that equals $-J+k,k\in[L-1]$ . Let $g_{j}\coloneqq u^{\top}G_{:,j}$ , and recall that whenever a column $G_{:,j}$ has an entry that equals $-J+k,k\in[L-1]$ , we have $g_{j}<0$ . Without loss of generality, assume that $g_{1}<0$ .

Because the selective shift operation is applied only to values in $[0:\delta:\delta_{L}+\delta^{-d+1}-\delta]$ and not to negative values, we have $\min_{k}u^{\top}G_{:,k}=g_{1}<0$ , and $g_{1}$ is never shifted upwards, so it remains the minimum throughout.

•

All $g_{j}$ ’s Are Negative. When all $g_{j}$ ’s are negative, the selective shift operation never shifts the input $G$ , thus $\widetilde{G}=G$ . Recall that $u^{\top}\widetilde{G}_{:,j}<0$ for all $j\in[L]$ . The last layer with attention part $L\delta^{-(L+1)d-1}\xi(\cdot;0)$ adds $L\delta^{-(L+1)d-1}\min_{k}u^{\top}\widetilde{G}_{:,k}<0$ to each entry in the first row of $\widetilde{G}$ , so every $u^{\top}\widetilde{G}_{:,j}$ remains negative. Therefore, $f_{\mathcal{T},c2}(G)$ satisfies $u^{\top}f_{\mathcal{T},c2}(G)_{:,j}<0<t_{l}$ for all $j\in[L]$ .

•

Not All $g_{j}$ ’s Are Negative. Now consider the case where at least one $g_{j}$ is positive. Suppose that $k$ of them are positive and satisfy $g_{i_{1}}<g_{i_{2}}<\cdots<g_{i_{k}}$ . Then the selective shift operation does not affect $g_{i}$ for $i\in[L]\setminus\{i_{1},\dots,i_{k}\}$ , but it shifts $g_{i_{1}}$ by

	$\displaystyle\leavevmode\nobreak\ \delta^{-d}(\max_{k}u^{\top}G_{:,k}-\min_{k}% u^{\top}G_{:,k})$
$\displaystyle\geq$	$\displaystyle\leavevmode\nobreak\ \delta^{-d}(2L\delta^{-dL}-(L-1)\frac{\delta% ^{-d+1}-\delta}{1-\delta}-\delta^{-d+1}+(i_{k}-1)\frac{\delta^{-d+1}-\delta}{1% -\delta})$	(By (E.9))
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \delta^{-d}(3L\delta^{-dL}-\delta^{-d+1}-(L-% i_{k})\frac{\delta^{-d+1}-\delta}{1-\delta})$
$\displaystyle\geq$	$\displaystyle\leavevmode\nobreak\ \delta^{-d}\cdot 2L\delta^{-dL}$	(By $\delta^{-1}\geq 2$ )
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ 2L\delta^{-(L+1)d}.$

The next shift operations shift $g_{i_{2}},\dots,g_{i_{k}}$ by an even larger amount, so at the end of the first $L(1/\delta)^{d}$ layers, we have $2L\delta^{-(L+1)d}\leq\widetilde{g}_{i_{1}}\leq\dots\leq\widetilde{g}_{i_{k}}$ , while $\widetilde{g}_{j}<0$ for all $j\in[L]\setminus\{i_{1},\dots,i_{k}\}$ .

Then, we shift $G$ by the last layer. The last layer with attention part $L\delta^{-(L+1)d-1}\xi(\cdot;0)$ acts differently for negative and positive $\widetilde{g}_{j}$ ’s. (i). For negative $\widetilde{g}_{j}$ ’s, it adds the following to $\widetilde{g}_{j},j\in[L]\setminus\{i_{1},\dots,i_{k}\}$ :

\displaystyle L\delta^{-(L+1)d-1}\min_{k}u^{\top}\widetilde{G}_{:,k}=L\delta^{% -(L+1)d-1}g_{1}<0.

This term pushes them further to the negative side. (ii). For positive $\widetilde{g}_{i}$ ’s, it adds

\displaystyle L\delta^{-(L+1)d-1}\max_{k}u^{\top}\widetilde{G}_{:,k}=L\delta^{-(L+1)d-1}\widetilde{g}_{i_{k}}\geq 2L^{2}\delta^{-2(L+1)d-1}.

Thus they are all greater than or equal to $2L^{2}\delta^{-2(L+1)d-1}$ .

Note that

\displaystyle 2L^{2}\delta^{-2(L+1)d-1}>t_{r},\leavevmode\nobreak\ \text{where% }\leavevmode\nobreak\ t_{r}=L^{2}\delta^{-2(L+1)d-1}+L\delta^{-(L+1)d}.

Hence the final output $f_{\mathcal{T},c2}(G)$ satisfies $u^{\top}f_{\mathcal{T},c2}(G)_{:,j}\notin[t_{l},t_{r}]$ for all $j\in[L]$ . This completes the verification of Property 4 of Lemma E.4.

In conclusion, we need $\mathcal{O}(L\delta^{-d})$ modified self-attention layers to obtain our approximation. This completes the proof. ∎

E.5.4 Proof of Lemma E.5

Proof of Lemma E.5.

We restate the proof from (Yun et al., 2020) for completeness.

Note that $|\mathcal{G}^{+}_{\delta}|=(1/\delta+1)^{dL}<\infty$ , so the output of $f_{\mathcal{T},c2}(\mathcal{G}^{+}_{\delta})$ takes a finite number of distinct real values. Let $M$ be an upper bound on all these possible values. By construction of $f_{\mathcal{T},c2}$ , $M>0$ .

Construct the Layers: $f_{\mathcal{T},c3}(f_{\mathcal{T},c2}(G))=\mathbf{0}_{d\times L}$ if $G\in\mathcal{G}^{+}_{\delta}\setminus\widetilde{\mathcal{G}}_{\delta}$ .

According to Lemma E.4, for all $j\in[L]$ , we have $u^{\top}f_{\mathcal{T},c2}(G)_{:,j}\in[t_{l},t_{r}]$ if $G\in\widetilde{\mathcal{G}}_{\delta}$ , and $u^{\top}f_{\mathcal{T},c2}(G)_{:,j}\notin[t_{l},t_{r}]$ if $G\in\mathcal{G}^{+}_{\delta}\setminus\widetilde{\mathcal{G}}_{\delta}$ . Due to this property, we add the following feed-forward layer:

Definition E.8 (Feed-forward Layer 3).

The vectors $u$ and $\mathds{1}_{L}$ act as the weight parameters and $\zeta_{3}(\cdot)$ acts as the activation function in the feed-forward layer.

\displaystyle X\rightarrow X-(M+1)\mathds{1}_{L}\zeta_{3}(u^{\top}X),% \leavevmode\nobreak\ \leavevmode\nobreak\ \zeta_{3}(t)=\begin{cases}0&\text{ % if }t\in[t_{l},t_{r}]\\ 1&\text{ if }t\notin[t_{l},t_{r}].\end{cases}

(E.13)

•

Case for $G\in\mathcal{G}^{+}_{\delta}\setminus\widetilde{\mathcal{G}}_{\delta}$ . We have $\zeta_{3}(u^{\top}f_{\mathcal{T},c2}(G))=\mathds{1}_{L}^{\top}$ , so all the entries of the input are shifted by $-M-1$ , and become strictly negative.
•

Case for $G\in\widetilde{\mathcal{G}}_{\delta}$ . We have $\zeta_{3}(u^{\top}f_{\mathcal{T},c2}(G))=\mathbf{0}_{L}^{\top}$ , so the output stays the same as $f_{\mathcal{T},c2}(G)$ .

With the input $f_{\mathcal{T},c2}(G)$ , if $G\in\widetilde{\mathcal{G}}_{\delta}$ , then $\zeta_{3}(u^{\top}f_{\mathcal{T},c2}(G))=\mathbf{0}_{L}^{\top}$ , so the output stays the same as the input. If $G\in\mathcal{G}^{+}_{\delta}\setminus\widetilde{\mathcal{G}}_{\delta}$ , then $\zeta_{3}(u^{\top}f_{\mathcal{T},c2}(G))=\mathds{1}_{L}^{\top}$ , so all the entries of the input are shifted by $-M-1$ , and become strictly negative.

Next, we map those negative entries to zero. For $i=1,2,\cdots,d$ , we add the following layer:

Definition E.9 (Feed-forward Layer 4).

The vectors $u$ and $e_{i}$ act as the weight parameters and $\zeta_{4}(\cdot)$ acts as the activation function in the feed-forward layer.

\displaystyle X\rightarrow X+e_{i}\zeta_{4}((e_{i})^{\top}X),\leavevmode% \nobreak\ \leavevmode\nobreak\ \zeta_{4}(t)=\begin{cases}-t&\text{ if }t<0\\ 0&\text{ if }t\geq 0.\end{cases}

(E.14)

After these $d$ layers, the output for $G\in\mathcal{G}^{+}_{\delta}\setminus\widetilde{\mathcal{G}}_{\delta}$ is a zero matrix, while the output for $G\in\widetilde{\mathcal{G}}_{\delta}$ remains $f_{\mathcal{T},c2}(G)$ .
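The combined effect of Feed-forward Layers 3 and 4 can be sketched in a few lines. This is a minimal illustration in plain Python, assuming a small $2\times 2$ input and made-up values for $u$ , $t_{l}$ , $t_{r}$ , and $M$ (none of them taken from the construction); the helper names are hypothetical.

```python
def layer3(X, u, t_l, t_r, M):
    """X -> X - (M + 1) * 1 * zeta_3(u^T X), evaluated column by column."""
    d, L = len(X), len(X[0])
    out = [row[:] for row in X]
    for j in range(L):
        score = sum(u[i] * X[i][j] for i in range(d))
        if not (t_l <= score <= t_r):  # zeta_3 equals 1 outside [t_l, t_r]
            for i in range(d):
                out[i][j] -= M + 1
    return out


def layer4(X, i):
    """X -> X + e_i * zeta_4(e_i^T X): negative entries of row i become 0."""
    out = [row[:] for row in X]
    out[i] = [0 if v < 0 else v for v in out[i]]
    return out


def mask_out_of_range(X, u, t_l, t_r, M):
    """Layer 3 followed by one Layer 4 per row, as in the text."""
    out = layer3(X, u, t_l, t_r, M)
    for i in range(len(X)):
        out = layer4(out, i)
    return out


# Illustrative input: column 0 has score 5 (inside [4, 6]),
# column 1 has score 17 (outside), and M = 6 bounds all entries.
X = [[1, 5], [2, 6]]
masked = mask_out_of_range(X, u=[1, 2], t_l=4, t_r=6, M=6)
print(masked)
```

In the proof all columns of a given $G$ fall on the same side of the test; the sketch applies the test per column only to show both branches in one call.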

Construct the Layers: $f_{\mathcal{T},c3}(f_{\mathcal{T},c2}(G))=A_{G}$ if $G\in\widetilde{\mathcal{G}}_{\delta}$ .

Each different $G$ is mapped to $L$ unique numbers $u^{\top}f_{\mathcal{T},c2}(G)$ , which are at least $\delta$ apart from each other. We map each unique number to the corresponding output column as follows. We fix one $\bar{G}\in\widetilde{\mathcal{G}}_{\delta}$ and, for each $u^{\top}f_{\mathcal{T},c2}(\bar{G})_{:,j}$ with $j\in[L]$ , add the following feed-forward layer.

Definition E.10 (Feed-forward Layer 5).

The vectors $u$ and $e_{i}$ act as the weight parameters and $\zeta_{5}(\cdot)$ acts as the activation function in the feed-forward layer.

	$\displaystyle X\rightarrow$	$\displaystyle X+\left((A_{\bar{G}})_{:,j}-f_{\mathcal{T},c2}({\bar{G}})_{:,j}% \right)\zeta_{5}(u^{\top}X-u^{\top}f_{\mathcal{T},c2}(\bar{G})_{:,j}\mathds{1}% _{L}^{\top}),$		(E.15)
		$\displaystyle\zeta_{5}(t)=\begin{cases}1&-\delta/2\leq t<\delta/2,\\ 0&\leavevmode\nobreak\ \text{others}.\end{cases}$		(E.16)

•

Case for $G\in\mathcal{G}^{+}_{\delta}\setminus\widetilde{\mathcal{G}}_{\delta}$ . Recall that the input $X$ of this layer is $f_{\mathcal{T},c2}({G})$ . If $X$ is a zero matrix, which is the case for $G\in\mathcal{G}^{+}_{\delta}\setminus\widetilde{\mathcal{G}}_{\delta}$ , we have $u^{\top}X=\mathbf{0}_{L}^{\top}$ . Then $u^{\top}X-u^{\top}f_{\mathcal{T},c2}({\bar{G}})_{:,j}\mathds{1}_{L}^{\top}<-t_% {l}\mathds{1}_{L}$ . Since $t_{l}>\delta/2$ , the output remains the same as $X$ .

•

Case for $G\in\widetilde{\mathcal{G}}_{\delta}$ . Consider the input $X=f_{\mathcal{T},c2}(G)$ , where $G\in\widetilde{\mathcal{G}}_{\delta}$ is not equal to $\bar{G}$ . According to Property 2 of Lemma E.4, given a $j\in[L]$ , each $u^{\top}f_{\mathcal{T},c2}(G)_{:,k}$ , $k\in[L]$ , differs from $u^{\top}f_{\mathcal{T},c2}({\bar{G}})_{:,j}$ by at least $\delta$ . Then we have

\zeta_{5}(u^{\top}f_{\mathcal{T},c2}(G)-u^{\top}f_{\mathcal{T},c2}({\bar{G}})_% {:,j}\mathds{1}_{L}^{\top})=\mathbf{0}_{L}^{\top}.

Thus the input is left untouched.

If $G=\bar{G}$ , then

\zeta_{5}(u^{\top}f_{\mathcal{T},c2}(G)-u^{\top}f_{\mathcal{T},c2}({\bar{G}})_% {:,j}\mathds{1}_{L}^{\top})=(e_{j})^{\top}.

Thus we shift the $j$ -th column of $f_{\mathcal{T},c2}(G)$ to

\displaystyle f_{\mathcal{T},c2}(G)_{:,j}+((A_{\bar{G}})_{:,j}-f_{\mathcal{T},% c2}({\bar{G}})_{:,j})=f_{\mathcal{T},c2}(G)_{:,j}+((A_{G})_{:,j}-f_{\mathcal{T% },c2}(G)_{:,j})=(A_{G})_{:,j}.

In other words, this layer maps the column $f_{\mathcal{T},c2}(G)_{:,j}$ to $(A_{G})_{:,j}$ , without affecting any other columns.

We infer from the above that we need one layer for each unique value of $u^{\top}f_{\mathcal{T},c2}(G)_{:,j}$ for each $G\in\widetilde{\mathcal{G}}_{\delta}$ . Note that there are $\mathcal{O}(\delta^{-dL})$ such numbers, so we use $\mathcal{O}(\delta^{-dL})$ layers to finish our construction. ∎
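A single Feed-forward Layer 5 acts like a keyed column replacement: it rewrites only the column whose score $u^{\top}X_{:,k}$ lies within $\delta/2$ of the stored key. The sketch below illustrates this behavior under assumed toy values for $u$ , $\delta$ , the key column, and the replacement column; the function name and all numbers are hypothetical.

```python
def layer5(X, u, key_col, new_col, delta):
    """X -> X + (new_col - key_col) * zeta_5(u^T X - u^T key_col * 1^T)."""
    d, L = len(X), len(X[0])
    key = sum(u[i] * key_col[i] for i in range(d))
    out = [row[:] for row in X]
    for k in range(L):
        t = sum(u[i] * X[i][k] for i in range(d)) - key
        if -delta / 2 <= t < delta / 2:  # zeta_5 equals 1 only on this window
            for i in range(d):
                out[i][k] += new_col[i] - key_col[i]
    return out


# Illustrative values: column scores are 5 and 11, and the key matches column 1.
X = [[1.0, 3.0], [2.0, 4.0]]
replaced = layer5(X, u=[1.0, 2.0], key_col=[3.0, 4.0],
                  new_col=[-1.0, 0.5], delta=0.5)
print(replaced)
```

Because distinct scores are at least $\delta$ apart, at most one column can fall inside the window, which is why stacking one such layer per key never rewrites a column twice.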

E.5.5 Proof of Lemma E.7

Proof of Lemma E.7.

We restate the proof from (Yun et al., 2020) for completeness.

The proof follows two steps: (i) Approximate the modified self-attention layers. (ii) Approximate the modified feed-forward layers.

•

Step 1: Approximate the Modified Self-Attention Layers.

We achieve this by approximating the $\mathop{\rm{Softmax}}$ operator $\sigma_{S}$ with the $\mathop{\rm{Hardmax}}$ operator $\sigma_{H}$ . Given a matrix $X\in\mathbb{R}^{d\times L}$ , we have

\sigma_{S}(\lambda X)\rightarrow\sigma_{H}(X),\quad\text{as}\quad\lambda% \rightarrow\infty.

The operator is the only difference between the normal and the modified self-attention layers. We approximate the modified self-attention layer in $\bar{\mathcal{T}}_{p}^{r,m,l}$ by the normal self-attention layer with the same number of heads $r$ and head size $m$ .

•

Step 2: Approximate the Modified Feed-Forward Layers.

We achieve this by approximating the activation function in $\Psi$ with four ${\rm ReLU}$ functions. From Definition E.3, we recall that $\Psi$ denotes the class of three-piece piecewise linear functions with at least one constant piece. We consider the following $\zeta\in\Psi$ :

\zeta(x)=\begin{cases}b_{1}&\text{ if }x<c_{1},\\ a_{2}x+b_{2}&\text{ if }c_{1}\leq x<c_{2},\\ a_{3}x+b_{3}&\text{ if }c_{2}\leq x,\end{cases}

where $a_{2},a_{3},b_{1},b_{2},b_{3},c_{1},c_{2}\in\mathbb{R}$ , and $c_{1}<c_{2}$ .

We approximate $\zeta(x)$ by $\widetilde{\zeta}(x)$ composed of four ${\rm ReLU}$ functions:

	$\displaystyle\widetilde{\zeta}(x)=$	$\displaystyle b_{1}+\frac{a_{2}c_{1}+b_{2}-b_{1}}{\epsilon}{\rm ReLU}(x-c_{1}+\epsilon)+\left(a_{2}-\frac{a_{2}c_{1}+b_{2}-b_{1}}{\epsilon}\right){\rm ReLU}(x-c_{1})$
		$\displaystyle+\left(\frac{a_{3}c_{2}+b_{3}-a_{2}(c_{2}-\epsilon)-b_{2}}{\epsilon}-a_{2}\right){\rm ReLU}(x-c_{2}+\epsilon)$
		$\displaystyle+\left(a_{3}-\frac{a_{3}c_{2}+b_{3}-a_{2}(c_{2}-\epsilon)-b_{2}}{\epsilon}\right){\rm ReLU}(x-c_{2})$
	$\displaystyle=$	$\displaystyle\begin{cases}b_{1}&\text{ if }x<c_{1}-\epsilon,\\ (a_{2}c_{1}+b_{2}-b_{1})(x-c_{1})/\epsilon+a_{2}c_{1}+b_{2}&\text{ if }c_{1}-\epsilon\leq x<c_{1},\\ a_{2}x+b_{2}&\text{ if }c_{1}\leq x<c_{2}-\epsilon,\\ (a_{3}c_{2}+b_{3}-a_{2}(c_{2}-\epsilon)-b_{2})(x-c_{2})/\epsilon+a_{3}c_{2}+b_{3}&\text{ if }c_{2}-\epsilon\leq x<c_{2},\\ a_{3}x+b_{3}&\text{ if }c_{2}\leq x.\end{cases}$

Outside the two transition intervals $[c_{1}-\epsilon,c_{1})$ and $[c_{2}-\epsilon,c_{2})$ , $\widetilde{\zeta}$ coincides with $\zeta$ , so $\widetilde{\zeta}(x)$ approximates $\zeta(x)$ as $\epsilon\rightarrow 0$ . The activation function is the only difference between the normal and modified feed-forward layers. We approximate the modified feed-forward layer in $\bar{\mathcal{T}}_{p}^{r,m,l}$ by the normal one.
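Both approximation steps can be checked numerically. The sketch below is a minimal illustration under assumed toy parameters (the vector, the temperature $\lambda=50$ , and the coefficients of $\zeta$ are our choices, not values from the paper): it shows that $\sigma_{S}(\lambda v)$ is close to the hardmax indicator for large $\lambda$ , and that the four-ReLU function $\widetilde{\zeta}$ agrees with $\zeta$ away from the two transition intervals.

```python
import math


def softmax(v, lam=1.0):
    """Softmax of lam * v, computed stably by subtracting the maximum."""
    m = max(v)
    e = [math.exp(lam * (x - m)) for x in v]
    s = sum(e)
    return [x / s for x in e]


def relu(x):
    return max(x, 0.0)


def zeta(x, a2, a3, b1, b2, b3, c1, c2):
    """Three-piece target activation with a constant first piece."""
    if x < c1:
        return b1
    if x < c2:
        return a2 * x + b2
    return a3 * x + b3


def zeta_tilde(x, a2, a3, b1, b2, b3, c1, c2, eps):
    """Four-ReLU approximation of zeta with transition width eps."""
    j1 = a2 * c1 + b2 - b1                    # gap bridged just below c1
    j2 = a3 * c2 + b3 - a2 * (c2 - eps) - b2  # gap bridged just below c2
    return (b1
            + j1 / eps * relu(x - c1 + eps)
            + (a2 - j1 / eps) * relu(x - c1)
            + (j2 / eps - a2) * relu(x - c2 + eps)
            + (a3 - j2 / eps) * relu(x - c2))


# Illustrative parameters (assumptions for this sketch only).
params = dict(a2=1.0, a3=0.0, b1=0.0, b2=2.0, b3=5.0, c1=0.0, c2=1.0)
print(softmax([1.0, 3.0, 2.0], lam=50.0))
for x in (-1.0, 0.5, 2.0):
    print(x, zeta(x, **params), zeta_tilde(x, eps=0.25, **params))
```

The three sample points lie in $(-\infty,c_{1}-\epsilon)$ , $[c_{1},c_{2}-\epsilon)$ , and $[c_{2},\infty)$ respectively, which are exactly the regions where the two functions are expected to coincide.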

Thus, for any $f_{\mathcal{T},c}\in\bar{\mathcal{T}}_{p}^{2,1,1}$ , there exists a function $f_{\mathcal{T}}\in\mathcal{T}_{p}^{2,1,4}$ to approximate $f_{\mathcal{T},c}$ .

This completes the proof. ∎

Appendix F Proofs of Section 3

Our proof is motivated by the approximation and estimation theory of U-Net-based diffusion models in (Chen et al., 2023a). We use the universal approximation capability established in Appendix E and the covering number of transformer networks to proceed with our proof. Specifically, we derive the approximation error bound in Section F.1 and the corresponding sample complexity bound in Section F.2. Then we show that the data distribution generated from the estimated score function converges toward a proximate area of the original one in Section F.3.

F.1 Proof of Theorem 3.1

Here we present some auxiliary theoretical results in Section F.1.1 to prepare our main proof of Theorem 3.1. Then we derive the approximation error bound of DiTs (i.e., the proof of Theorem 3.1) in Section F.1.2.

F.1.1 Auxiliary Lemmas for Theorem 3.1.

We restate some auxiliary lemmas and their proofs here from (Chen et al., 2023a) for later convenience.

Lemma F.1 (Lemma 16 of (Chen et al., 2023a)).

Consider a probability density function $p_{h}(h)=\exp(-C\norm{h}_{2}^{2}/2)$ for $h\in\mathbb{R}^{d_{0}}$ and constant $C>0$ . Let $r_{h}>0$ be a fixed radius. Then it holds

	$\displaystyle\int_{\norm{h}_{2}>r_{h}}p_{h}(h)\differential h\leq\frac{2d_{0}% \pi^{d_{0}/2}}{C\Gamma(d_{0}/2+1)}r_{h}^{d_{0}-2}\exp(-Cr_{h}^{2}/2),$
	$\displaystyle\int_{\norm{h}_{2}>r_{h}}\norm{h}_{2}^{2}p_{h}(h)\differential h% \leq\frac{2d_{0}\pi^{d_{0}/2}}{C\Gamma(d_{0}/2+1)}r_{h}^{d_{0}}\exp(-Cr_{h}^{2% }/2).$
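
As a quick sanity check of the first bound (a worked special case of ours, not part of the cited proof), take $d_{0}=2$ . In polar coordinates the tail integral has a closed form:

	$\displaystyle\int_{\norm{h}_{2}>r_{h}}\exp(-C\norm{h}_{2}^{2}/2)\differential h=2\pi\int_{r_{h}}^{\infty}\rho\exp(-C\rho^{2}/2)\differential\rho=\frac{2\pi}{C}\exp(-Cr_{h}^{2}/2),$

while the right-hand side of the lemma evaluates to $\frac{2\cdot 2\cdot\pi}{C\Gamma(2)}r_{h}^{0}\exp(-Cr_{h}^{2}/2)=\frac{4\pi}{C}\exp(-Cr_{h}^{2}/2)$ , so in this case the first bound holds with a factor of two to spare.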

Lemma F.2 (Lemma 2 of (Chen et al., 2023a)).

Suppose Assumption 2.2 holds and $q$ is defined as:

\displaystyle q(\bar{h},t)=\int\frac{h\psi_{t}(\bar{h}|h)p_{h}(h)}{\int\psi_{t% }(\bar{h}|h)p_{h}(h)\differential h}\differential h,\quad\bar{h}=B^{\top}x.

Given $\epsilon>0$ , with $r_{h}=c\left(\sqrt{d_{0}\log(d_{0}/T_{0})+\log(1/\epsilon)}\right)$ for an absolute constant $c$ , it holds

\displaystyle\norm{q(\bar{h},t)\mathds{1}\{\norm{\bar{h}}_{2}\geq r_{h}\}}_{L^% {2}(P_{t})}\leq\epsilon,\leavevmode\nobreak\ \text{for}\leavevmode\nobreak\ t% \in[T_{0},T].

Lemma F.3 (Theorem 1 of (Chen et al., 2023a)).

We denote

\displaystyle\tau(r_{h})=\sup_{t\in[T_{0},T]}\sup_{\bar{h}\in[0,r_{h}]^{d}}% \norm{\frac{\partial}{\partial t}q(\bar{h},t)}_{2}.

With $q(\bar{h},t)=\int h\psi_{t}(\bar{h}|h)p_{h}(h)/(\int\psi_{t}(\bar{h}|h)p_{h}(h)\differential h)\differential h$ and $p_{h}$ satisfying Assumption 2.2, we have the following coarse upper bound for $\tau(r_{h})$ :

\displaystyle\tau(r_{h})=\mathcal{O}\left(\frac{1+\beta^{2}(t)}{\beta(t)}\left% (L_{s_{+}}+\frac{1}{\sigma(t)}\right)\sqrt{d_{0}}r_{h}\right)=\mathcal{O}\left% (e^{T/2}L_{s_{+}}r_{h}\sqrt{d_{0}}\right).

Lemma F.4 (Lemma 10 of (Chen et al., 2020b)).

For any given $\epsilon>0$ and any $L$ -Lipschitz function $g$ defined on $[0,1]^{d_{0}}$ , there exists a continuous function $\bar{f}$ constructed from trapezoid functions such that

\displaystyle\norm{g-\bar{f}}_{\infty}\leq\epsilon.

Moreover, the Lipschitz continuity of $\bar{f}$ is bounded by

\displaystyle\left\lvert\bar{f}(x)-\bar{f}(y)\right\rvert\leq 10d_{0}L\norm{x-y}_{2}\quad\text{for any}\quad x,y\in[0,1]^{d_{0}}.

F.1.2 Main Proof of Theorem 3.1

Proof of Theorem 3.1.

With $\nabla\log p_{t}^{h}\left(\bar{h}\right)=B^{\top}s_{+}(\bar{h},t)$ , we note that in (2.4)

\displaystyle q(\bar{h},t)=\sigma(t)\nabla\log p_{t}^{h}\left(\bar{h}\right)+B% ^{\top}x=\sigma(t)B^{\top}(s_{+}(\bar{h},t)+x).

(F.1)

We proceed as follows:

•

Step 1. Approximate $q(\bar{h},t)$ with a compact-supported continuous function $\bar{f}(\bar{h},t)$ .
•

Step 2. Approximate $\bar{f}(\bar{h},t)$ with a Transformer network.

Step 1. Approximate $q(\bar{h},t)$ with a Compact-supported Continuous Function $\bar{f}(\bar{h},t)$ . Here we partition $\mathbb{R}^{d_{0}}$ into a compact subset $H_{1}:=\{\bar{h}|\norm{\bar{h}}_{2}\leq r_{h}\}$ and its complement $H_{2}$ , where $r_{h}$ is to be determined later. We approximate $q(\bar{h},t)$ on the two subsets separately, and then prove the continuity of $\bar{f}$ . This step achieves an approximation error of $\sqrt{d_{0}}\epsilon$ between $q(\bar{h},t)$ and $\bar{f}(\bar{h},t)$ . We show the main proof here.

•

Approximation on $H_{2}\times[T_{0},T]$ . For any $\epsilon>0$ , we take $r_{h}=c(\sqrt{d_{0}\log(d_{0}/T_{0})-\log\epsilon})$ . We obtain from Lemma F.2 that

\displaystyle\norm{q(\bar{h},t)\mathds{1}\{\norm{\bar{h}}_{2}\geq r_{h}\}}_{L^% {2}(P_{t})}\leq\epsilon\quad\text{for}\quad t\in[T_{0},T].

So we set $\bar{f}(\bar{h},t)=0$ on $H_{2}\times[T_{0},T]$ .

•

Approximation on $H_{1}\times[T_{0},T]$ . On $H_{1}\times[T_{0},T]$ , we approximate $q(\bar{h},t)$ coordinate by coordinate, where $q(\bar{h},t)=[q_{1}(\bar{h},t),q_{2}(\bar{h},t),\cdots,q_{d_{0}}(\bar{h},t)]$ . We first rescale the input by $y^{\prime}=(\bar{h}+r_{h}\mathds{1})/(2r_{h})$ and $t^{\prime}=t/T$ , so that the transformed input space is $[0,1]^{d_{0}}\times[T_{0}/T,1]$ . We implement such a transformation by a single feed-forward layer.

By Assumption 2.3, on-support score $s_{+}(\bar{h},t)$ is $L_{s_{+}}$ -Lipschitz in $\bar{h}$ . This implies $q(\bar{h},t)$ is $(1+L_{s_{+}})$ -Lipschitz in $\bar{h}$ . When taking the transformed inputs, $g(y^{\prime},t^{\prime})=q(2r_{h}y^{\prime}-r_{h}\mathds{1},Tt^{\prime})$ becomes $2r_{h}(1+L_{s_{+}})$ -Lipschitz in $y^{\prime}$ ; so is each coordinate $g_{k}(y^{\prime},t)$ . Here we take $L_{h}=1+L_{s_{+}}$ .

Besides, $g(y^{\prime},t^{\prime})$ is $T\tau(r_{h})$ -Lipschitz with respect to $t^{\prime}$ , where

\displaystyle\tau(r_{h})=\sup_{t\in[T_{0},T]}\sup_{\bar{h}\in[0,r_{h}]^{d}}% \norm{\frac{\partial}{\partial t}q(\bar{h},t)}_{2}.

We have a coarse upper bound for $\tau(r_{h})$ in Lemma F.3. We repeat it here for convenience

\displaystyle\tau(r_{h})=\mathcal{O}\left(\frac{1+\beta^{2}(t)}{\beta(t)}\left% (L_{s_{+}}+\frac{1}{\sigma(t)}\right)\sqrt{d_{0}}r_{h}\right)=\mathcal{O}\left% (e^{T/2}L_{s_{+}}r_{h}\sqrt{d_{0}}\right).

In conclusion, each $g_{k}(y^{\prime},t)$ is Lipschitz continuous, so we apply Lemma F.4 to obtain $\bar{f}_{k}(y^{\prime},t)$ approximating each coordinate. We concatenate the $\bar{f}_{k}$ ’s and construct $\bar{f}=[\bar{f}_{1},\dots,\bar{f}_{d_{0}}]^{\top}$ . According to the construction in Lemma F.4, for any given $\epsilon$ , we achieve

\displaystyle\sup_{y^{\prime},t^{\prime}\in[0,1]^{d}\times[T_{0}/T,1]}\norm{\bar{f}(y^{\prime},t^{\prime})-g(y^{\prime},t^{\prime})}_{\infty}\leq\epsilon.

Considering the input rescaling (i.e., $\bar{h}\to y^{\prime}$ and $t\to t^{\prime}$ ), we obtain:

–

The constructed function is Lipschitz continuous in $\bar{h}$ , i.e., for any $\bar{h}_{1},\bar{h}_{2}\in H_{1}$ and $t\in[T_{0},T]$ , it holds

\displaystyle\norm{\bar{f}(\bar{h}_{1},t)-\bar{f}(\bar{h}_{2},t)}_{\infty}\leq 10d_{0}L_{h}\norm{\bar{h}_{1}-\bar{h}_{2}}_{2}.

(F.2)

–

The function is also Lipschitz in $t$ , i.e., for any $t_{1},t_{2}\in[T_{0},T]$ and $\norm{\bar{h}}_{2}\leq r_{h}$ , it holds

\displaystyle\norm{\bar{f}(\bar{h},t_{1})-\bar{f}(\bar{h},t_{2})}_{\infty}\leq 10\tau(r_{h})\norm{t_{1}-t_{2}}_{2}.

Because the construction of $\bar{f}(\bar{h},t)$ is based on trapezoid functions, we have $\bar{f}(\bar{h},t)=0$ for $\norm{\bar{h}}_{2}=r_{h},\forall t\in[T_{0},T]$ . So the two parts of $\bar{f}(\bar{h},t)$ can be joined together continuously. More specifically, the above Lipschitz continuity in $\bar{h}$ extends to the whole $\mathbb{R}^{d_{0}}$ .

•

Approximation Error Analysis under $L^{2}$ Norm. Since $\bar{f}=0$ on $H_{2}$ , the triangle inequality bounds the $L^{2}$ approximation error of $\bar{f}$ by two terms:

\displaystyle\norm{q(\bar{h},t)-\bar{f}(\bar{h},t)}_{L^{2}(P_{t}^{h})}\leq\norm{(q(\bar{h},t)-\bar{f}(\bar{h},t))\mathds{1}\{\norm{\bar{h}}_{2}<r_{h}\}}_{L^{2}(P_{t}^{h})}+\norm{q(\bar{h},t)\mathds{1}\{\norm{\bar{h}}_{2}>r_{h}\}}_{L^{2}(P_{t}^{h})}.

The second term on the right-hand side above has already been bounded with the selection of $r_{h}$ :

\displaystyle\norm{q(\bar{h},t)\mathds{1}\{\norm{\bar{h}}_{2}>r_{h}\}}_{L^{2}(P_{t}^{h})}\leq\epsilon.

The first term is bounded by:

\displaystyle\norm{(q(\bar{h},t)-\bar{f}(\bar{h},t))\mathds{1}\{\norm{\bar{h}}_{2}<r_{h}\}}_{L^{2}(P_{t}^{h})}\leq\sqrt{d_{0}}\sup_{y^{\prime},t^{\prime}\in[0,1]^{d}\times[T_{0}/T,1]}\norm{\bar{f}(y^{\prime},t^{\prime})-g(y^{\prime},t^{\prime})}_{\infty}\leq\sqrt{d_{0}}\epsilon.

So we obtain

\displaystyle\norm{q(\bar{h},t)-\bar{f}(\bar{h},t)}_{L^{2}(P_{t}^{h})}\leq(\sqrt{d_{0}}+1)\epsilon.

Substituting $\epsilon$ with $\epsilon/2$ and using $(\sqrt{d_{0}}+1)/2\leq\sqrt{d_{0}}$ for $d_{0}\geq 1$ , we obtain that the approximation error of $\bar{f}(\bar{h},t)$ is at most $\sqrt{d_{0}}\epsilon$ .
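The input rescaling used in the approximation on $H_{1}\times[T_{0},T]$ is a simple affine map, and its effect on Lipschitz constants can be seen directly. The sketch below is a minimal illustration with assumed values $r_{h}=2$ and $T=4$ (chosen for exact arithmetic, not taken from the paper); the function names are hypothetical.

```python
import math


def rescale(h, t, r_h, T):
    """Map (h, t) with entries of h in [-r_h, r_h] to y' in [0,1]^{d0}, t' = t/T."""
    y = [(h_i + r_h) / (2 * r_h) for h_i in h]
    return y, t / T


def unscale(y, t_prime, r_h, T):
    """Inverse map: h = 2 r_h y' - r_h 1 and t = T t'."""
    h = [2 * r_h * y_i - r_h for y_i in y]
    return h, T * t_prime


# Illustrative values (assumptions for this sketch only).
r_h, T = 2.0, 4.0
h1, h2, t = [1.0, -2.0], [-1.0, 0.0], 1.0
y1, t_prime = rescale(h1, t, r_h, T)
y2, _ = rescale(h2, t, r_h, T)

# Distances in h are 2 r_h times distances in y', which is why an
# L-Lipschitz function of h becomes 2 r_h L-Lipschitz in y'.
ratio = math.dist(h1, h2) / math.dist(y1, y2)
print(y1, t_prime, ratio)
```

The same reasoning with $t=Tt^{\prime}$ gives the factor $T$ in the Lipschitz constant with respect to the rescaled time.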

Step 2. Approximate $\bar{f}(\bar{h},t)$ by a Transformer. This step is based on the universal approximation of transformers for the compact-supported continuous function in Lemma E.1. Following (Peebles and Xie, 2023), DiT uses the time point $t$ to calculate the scale and shift values in the Transformer backbone, and it transforms an input image into a sequence. We omit the time point $t$ from the notation of the Transformer network in DiT. Recalling the reshape layer $R(\cdot)$ in Definition 3.1, we use $f(\cdot):={R^{-1}\circ f_{\mathcal{T}}\circ R}(\cdot)$ to approximate $\bar{f}_{t}(\cdot):=\bar{f}(\cdot,t)$ , where $f_{\mathcal{T}}\in\mathcal{T}_{p}^{2,1,4}$ .

•

Overall Approximation Error. With Lemma E.1, we approximate $\bar{f}_{t}(\cdot)$ with $\widehat{f}(\cdot):={R^{-1}\circ\widehat{f}_{\mathcal{T}}\circ R}(\cdot)$ , and denote

\displaystyle H=R(\bar{h}).

We have

$\displaystyle\norm{\bar{f}_{t}(\bar{h})-\widehat{f}(\bar{h})}_{L^{2}(P_{t}^{h})}$	$\displaystyle=\left(\int_{P_{t}^{h}}\norm{\bar{f}_{t}(\bar{h})-\widehat{f}(% \bar{h})}_{2}^{2}\differential h\right)^{1/2}$
	$\displaystyle=\left(\int_{P_{t}^{h}}\norm{R\circ\bar{f}_{t}\circ R^{-1}(H)-R% \circ\widehat{f}\circ R^{-1}(H)}_{F}^{2}\differential h\right)^{1/2}$
	$\displaystyle=\left(\int_{P_{t}^{h}}\norm{R\circ\bar{f}_{t}\circ R^{-1}(H)-% \widehat{f}_{\mathcal{T}}(H)}_{F}^{2}\differential h\right)^{1/2}$
	$\displaystyle\leq\epsilon.$	(F.3)

Along with Step 1, we obtain

\displaystyle\norm{q(\bar{h},t)-\widehat{f}(\bar{h})}_{L^{2}(P_{t}^{h})}\leq% \norm{q(\bar{h},t)-\bar{f}(\bar{h},t)}_{L^{2}(P_{t}^{h})}+\norm{\bar{f}(\bar{h% },t)-\widehat{f}(\bar{h})}_{L^{2}(P_{t}^{h})}\leq(1+\sqrt{d_{0}})\epsilon.

The constructed approximator to $\nabla\log p_{t}(x)$ is $s_{\widehat{W}}=(B\widehat{f}(B^{\top}x,t)-x)/\sigma(t)$ , whose approximation error is

\displaystyle\norm{\nabla\log p_{t}(\cdot)-s_{\widehat{W}}(\cdot,t)}_{L^{2}(P_% {t})}\leq\frac{1+\sqrt{d_{0}}}{\sigma(t)}\epsilon,\quad\forall t\in[T_{0},T].

•

Settling-down of Hyperparameters. We now fix the hyperparameters that configure our network, referring to Section E.2 for some of the following calculations.

We have

	$\displaystyle C_{F}^{2,\infty}$	$\displaystyle=\mathcal{O}\left(\sqrt{\sum_{i=0}^{d-1}\delta^{-2i}}\right)=% \mathcal{O}\left(\delta^{-d}\right)$		(F.12)
		$\displaystyle=(1/\epsilon)^{\mathcal{O}(1)}.$		(By setting $\delta=\mathcal{O}(\epsilon^{2/d})$ according to Section E.4)

and

	$\displaystyle C_{F}$	$\displaystyle=\sup_{\norm{x}_{2}=1}\norm{W_{1}x}_{2}=\mathcal{O}\left(\delta^{% -d}\right)$		(F.13)
		$\displaystyle=(1/\epsilon)^{\mathcal{O}(1)}.$		(By setting $\delta=\mathcal{O}(\epsilon^{2/d})$ according to Section E.4)
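
For reference, the score approximator $s_{\widehat{W}}(x,t)=(B\widehat{f}(B^{\top}x,t)-x)/\sigma(t)$ constructed in this proof is just an encode, transform, decode pipeline around the latent network. The sketch below illustrates the data flow only: `f_hat` is a hypothetical stand-in for the trained Transformer, `sigma_t` is passed in as a plain number, and the matrix $B$ and input are toy values of ours.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def score_from_latent(B, f_hat, x, t, sigma_t):
    """Evaluate (B f_hat(B^T x, t) - x) / sigma_t."""
    h_bar = matvec(transpose(B), x)      # project onto the latent subspace
    lifted = matvec(B, f_hat(h_bar, t))  # map the latent output back
    return [(l - x_i) / sigma_t for l, x_i in zip(lifted, x)]


# Toy setup: B spans the first two coordinates of R^3 (orthonormal columns).
B = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]


def f_hat(h, t):
    # Hypothetical placeholder network: halves the latent input.
    return [0.5 * v for v in h]


score = score_from_latent(B, f_hat, x=[2.0, 4.0, 6.0], t=1.0, sigma_t=2.0)
print(score)
```

The third coordinate of the output equals $-x_{3}/\sigma(t)$ regardless of `f_hat`, reflecting that the component of $x$ orthogonal to the subspace is handled by the $-x/\sigma(t)$ term alone.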

This completes the proof. ∎

F.2 Proof of Corollary 3.1.1

Here we present the auxiliary theoretical results about the covering number of transformer networks in Section F.2.1 to prepare our main proof of Corollary 3.1.1. These results are based on Theorem A.17 of (Edelman et al., 2022). Then we derive the sample complexity bound of DiTs (i.e., the proof of Corollary 3.1.1) in Section F.2.

F.2.1 Auxiliary Lemmas for Corollary 3.1.1

Lemma F.5 (Lemma 15 of (Chen et al., 2023a)).

Let $\mathcal{G}$ be a bounded function class, i.e., there exists a constant $b$ such that any $g\in\mathcal{G}$ maps $\mathbb{R}^{d_{0}}$ to $[0,b]$ . Let $z_{1},z_{2},\cdots,z_{n}\in\mathbb{R}^{d_{0}}$ be i.i.d. random variables. For any $\delta\in(0,1)$ , $a\in(0,1]$ , and $c>0$ , we have

	$\displaystyle\mathbb{P}\left(\sup_{g\in\mathcal{G}}\frac{1}{n}\sum_{i=1}^{n}g(z_{i})-(1+a)\mathbb{E}\left[g(z)\right]>\frac{(1+3/a)b}{3n}\log\frac{\mathcal{N}(c,\mathcal{G},\norm{\cdot}_{\infty})}{\delta}+(2+a)c\right)\leq\delta,$
	$\displaystyle\mathbb{P}\left(\sup_{g\in\mathcal{G}}\mathbb{E}\left[g(z)\right]-\frac{1+a}{n}\sum_{i=1}^{n}g(z_{i})>\frac{(1+6/a)b}{3n}\log\frac{\mathcal{N}(c,\mathcal{G},\norm{\cdot}_{\infty})}{\delta}+(2+a)c\right)\leq\delta.$

Now, we give the definition of the covering number as follows.

Definition F.1 (Covering Number).

Consider a function class $\mathcal{F}$ and a data distribution $P$ . Sample $n$ data points $\{X_{i}\}_{i=1}^{n}$ from $P$ . Then the covering number $\mathcal{N}(\epsilon,\mathcal{F},\{X_{i}\}_{i=1}^{n},\norm{\cdot})$ is the smallest size of a collection (a cover) $\mathcal{C}\subseteq\mathcal{F}$ such that for any $f\in\mathcal{F}$ , there exists $\widehat{f}\in\mathcal{C}$ satisfying

\displaystyle\max_{i}\norm{f(X_{i})-\widehat{f}(X_{i})}\leq\epsilon.

Further, we define the covering number with respect to the data distribution as

\displaystyle\mathcal{N}(\epsilon,\mathcal{F},\norm{\cdot})=\sup_{\{X_{i}\}_{i% =1}^{n}\sim P}\mathcal{N}(\epsilon,\mathcal{F},\{X_{i}\}_{i=1}^{n},\norm{\cdot% }).
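To make Definition F.1 concrete, the following sketch builds a greedy $\epsilon$-cover for a finite, scalar-valued function family evaluated on sample points. The greedy cover is a valid cover with centers inside the family, so its size only upper bounds the covering number; it is not claimed to be minimal. The helper name `greedy_cover_size` and the toy family $f_{w}(x)=\tanh(wx)$ are hypothetical choices for illustration.

```python
import numpy as np

def greedy_cover_size(values: np.ndarray, eps: float) -> int:
    """Size of a greedy eps-cover under the metric max_i |f(X_i) - g(X_i)|.

    values[k, i] holds f_k(X_i) for the k-th function of a finite family.
    """
    uncovered = np.ones(values.shape[0], dtype=bool)
    size = 0
    while uncovered.any():
        center = values[np.argmax(uncovered)]           # first uncovered function becomes a center
        dist = np.max(np.abs(values - center), axis=1)  # sup over the sample points
        uncovered &= dist > eps                         # everything within eps is now covered
        size += 1
    return size

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=50)        # sample points X_1, ..., X_n
weights = np.linspace(-1.0, 1.0, 201)      # toy family f_w(x) = tanh(w * x)
values = np.tanh(np.outer(weights, X))
for eps in (0.5, 0.1, 0.02):
    print(eps, greedy_cover_size(values, eps))
```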

Then we give the covering number of the transformer networks.

Lemma F.6 (Modified from Theorem A.17 of (Edelman et al., 2022)).

Let $\mathcal{T}_{p}^{r,m,l}(K,C_{\mathcal{T}},C_{OV}^{2,\infty},C_{OV},C_{KQ}^{2,\infty},C_{KQ},C_{F}^{2,\infty},C_{F},C_{E},L_{\mathcal{T}})$ represent the class of functions of $K$ -layer transformer blocks satisfying the norm bounds for the weight matrices and the Lipschitz property for the feed-forward layers. Then for all data points $X$ with $\norm{X}_{2,\infty}\leq C_{X}$ we have

		$\displaystyle\log\mathcal{N}(\epsilon_{c},\mathcal{T}_{p}^{r,m,l}(K,C_{% \mathcal{T}},C_{OV}^{2,\infty},C_{OV},C_{KQ}^{2,\infty},C_{KQ},C_{F}^{2,\infty% },C_{F},C_{E},L_{\mathcal{T}}),\norm{\cdot}_{2})$
	$\displaystyle\leq$	$\displaystyle\frac{\log(nL)}{\epsilon_{c}^{2}}\cdot\left(\sum_{i=1}^{K}\alpha_{i}^{\frac{2}{3}}\left(d^{\frac{2}{3}}\left(C_{F}^{2,\infty}\right)^{\frac{4}{3}}+d^{\frac{2}{3}}\left(2(C_{F})^{2}C_{OV}C_{KQ}^{2,\infty}\right)^{\frac{2}{3}}+\tau m^{\frac{2}{3}}\left((C_{F})^{2}C_{OV}^{2,\infty}\right)^{\frac{2}{3}}\right)\right)^{3},$

where $\alpha_{i}\coloneqq\prod_{j<i}(C_{F})^{2}C_{OV}(1+4C_{KQ})(C_{X}+C_{E})$ .
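The right-hand side of Lemma F.6 can be evaluated directly, which helps to see how quickly it grows with the depth $K$. The sketch below is a plain transcription of the bound under two assumptions of ours: the constants are the same in every layer, so the product over $j<i$ contributes the same factor $i-1$ times, and the empty product for $i=1$ equals 1. The function name `log_covering_bound` is hypothetical.

```python
import math

def log_covering_bound(eps_c, n, L, d, m, tau, K,
                       C_F_2inf, C_F, C_OV_2inf, C_OV, C_KQ_2inf, C_KQ, C_X, C_E):
    """Evaluate the upper bound on the log-covering number stated in Lemma F.6."""
    per_layer = (d ** (2 / 3) * C_F_2inf ** (4 / 3)
                 + d ** (2 / 3) * (2 * C_F ** 2 * C_OV * C_KQ_2inf) ** (2 / 3)
                 + tau * m ** (2 / 3) * (C_F ** 2 * C_OV_2inf) ** (2 / 3))
    # One factor of the product defining alpha_i (assumed identical across layers).
    growth = C_F ** 2 * C_OV * (1 + 4 * C_KQ) * (C_X + C_E)
    total = sum((growth ** (i - 1)) ** (2 / 3) * per_layer for i in range(1, K + 1))
    return math.log(n * L) / eps_c ** 2 * total ** 3

unit = dict(C_F_2inf=1.0, C_F=1.0, C_OV_2inf=1.0, C_OV=1.0,
            C_KQ_2inf=1.0, C_KQ=1.0, C_X=1.0, C_E=1.0)
for K in (1, 2, 4):
    print(K, log_covering_bound(0.1, n=1000, L=8, d=4, m=1, tau=2, K=K, **unit))
```

Even with all norm bounds set to 1, the factor $\alpha_{i}$ grows geometrically in $i$, which is the source of the exponential dependence on $K$ in the sample complexity.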

Remark F.1.

We modify (Edelman et al., 2022, Theorem A.17) in seven aspects:

1.

We do not consider the last linear layer in the model, which converts each column vector of the Transformer output to a scalar. Therefore, we ignore the term related to the last linear layer in (Edelman et al., 2022, Theorem A.17).
2.

We do not consider the normalization layer in our model. Because the original proof only uses the property $\norm{\Pi_{\rm norm}(X_{1})-\Pi_{\rm norm}(X_{2})}_{2,\infty}\leq\norm{X_{1}-X_{2}}_{2,\infty}$ of the normalization layer, ignoring this layer does not change the result.
3.

Our activation function is ${\rm ReLU}$ , so we replace the Lipschitz upper bound of the activation function by 1.
4.

We consider the positional encoding (E.4) in our work, so we replace the upper bound $C_{X}$ for the inputs with $C_{X}+C_{E}$ . Besides, for multi-layer Transformers, the original conclusion in (Edelman et al., 2022, Theorem A.17) assumes that the $2,\infty$ -norm of the inputs is upper bounded by 1, so we add the upper bound for the inputs in Lemma F.6.
5.

We use (2.7) as the feed-forward layer, which includes two linear layers and a residual connection. Thus, in Lemma F.6, we replace the original upper bound for the norm of the weight matrix with the upper bound for the norm of $I_{d}+W_{2}W_{1}$ . In the following, we use $\mathcal{O}$ to estimate the log-covering number, so we ignore the term for $I_{d}$ here for convenience. The same applies to the self-attention layer.
6.

We use multi-head attention, so we add the number of heads $\tau$ in our result, similar to (Edelman et al., 2022, Theorem A.12).
7.

In our work, we use Transformer $\mathcal{T}_{p}^{2,1,4}$ , i.e., $\tau=2,m=1$ .

F.2.2 Proof of Corollary 3.1.1

Proof of Corollary 3.1.1.

Our proof is built on (Chen et al., 2023a, Appendix B.2). Firstly, for one data sample, we define the empirical score matching loss objective (2.1) as follows

\displaystyle\ell(x;s_{\widehat{W}})=\frac{1}{T-T_{0}}\int_{T_{0}}^{T}\mathbb{% E}_{x_{t}|x_{0}=x}[\norm{\nabla_{x_{t}}\log\psi_{t}(x_{t}|x_{0})-s_{\widehat{W% }}(x_{t},t)}_{2}^{2}]\differential t.

Then we define $\mathcal{L}(s_{\widehat{W}})=\mathbb{E}_{x\sim P_{0}}\left[\ell(x;s_{\widehat{% W}})\right]$ .

Following (Chen et al., 2023a, Appendix B.2), for any $a\in(0,1)$ , we have

\displaystyle\mathcal{L}(s_{\widehat{W}})\leq\underbrace{\mathcal{L}^{\rm trunc% }(s_{\widehat{W}})-(1+a)\widehat{\mathcal{L}}^{\rm trunc}(s_{\widehat{W}})}_{(% I)}+\underbrace{\mathcal{L}(s_{\widehat{W}})-\mathcal{L}^{\rm trunc}(s_{% \widehat{W}})}_{(II)}+(1+a)\underbrace{\inf_{s_{W}\in\mathcal{S}_{\rm NN}}% \widehat{\mathcal{L}}(s_{W})}_{(III)}.

where

\displaystyle\mathcal{L}^{\rm trunc}(s_{\widehat{W}})\coloneqq\mathbb{E}_{x% \sim P_{0}}\left[\ell^{\rm trunc}(x;s_{\widehat{W}})\right]=\mathbb{E}_{x\sim P% _{0}}\left[\ell(x;s_{\widehat{W}})\mathds{1}\{\norm{x}_{2}\leq r_{x}\}\right],% \leavevmode\nobreak\ r_{x}>B.

We denote

	$\displaystyle\leavevmode\nobreak\ \eta$	$\displaystyle\coloneqq 4C_{\mathcal{T}}(C_{\mathcal{T}}+r_{x})(r_{x}/D)^{D-2}% \exp(-r_{x}^{2}/\sigma(t))/(T_{0}(T-T_{0})),$
	$\displaystyle\leavevmode\nobreak\ r_{x}$	$\displaystyle\coloneqq\mathcal{O}\left(\sqrt{d_{0}\log d_{0}+\log C_{\mathcal{% T}}+\log(n/\bar{\delta})}\right).$

For any $\bar{\delta}>0$ , following (Chen et al., 2023a, Appendix B.2), we have the following for term $(I)$ with probability $1-\bar{\delta}$ ,

\displaystyle(I)=\mathcal{O}\left(\frac{(1+3/a)(C_{\mathcal{T}}^{2}+r_{x}^{2})% }{nT_{0}(T-T_{0})}\log\frac{\mathcal{N}\left(\frac{(T-T_{0})(\iota-\eta)}{(C_{% \mathcal{T}}+r_{x})\log(T/T_{0})},\mathcal{S}_{\mathcal{T}_{p}^{2,1,4}},\norm{% \cdot}_{2}\right)}{\bar{\delta}}+(2+a)c\right).

where $c>0$ is a constant, and $\iota>0$ will be determined later.

We set $\iota=1/(n^{1/4}T_{0}(T-T_{0}))$ , then we have

\displaystyle(I)=\mathcal{O}\left(\frac{(1+3/a)\left(C_{\mathcal{T}}^{2}+r_{x}% ^{2}\right)}{nT_{0}(T-T_{0})}\log\frac{\mathcal{N}\left((n(C_{\mathcal{T}}+r_{% x})T_{0}\log(T/T_{0}))^{-1},\mathcal{S}_{\mathcal{T}_{p}^{2,1,4}},\norm{\cdot}% _{2}\right)}{\bar{\delta}}+\frac{1}{n}\right),

with probability $1-\bar{\delta}$ .

Following the upper bounds on the other two terms and the proof details in (Chen et al., 2023a, Appendix B.2), we have

		$\displaystyle\leavevmode\nobreak\ \frac{1}{T-T_{0}}\int_{T_{0}}^{T}\norm{s_{% \widehat{W}}(\cdot,t)-\nabla\log p_{t}(\cdot)}_{L^{2}(P_{t})}^{2}\differential t$
	$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \mathcal{O}\left(\frac{\left(C_{\mathcal{T}}% ^{2}+r_{x}^{2}\right)}{\epsilon^{2}nT_{0}(T-T_{0})}\log\frac{\mathcal{N}\left(% (n(C_{\mathcal{T}}+r_{x})T_{0}\log(T/T_{0}))^{-1},\mathcal{S}_{\mathcal{T}_{p}% ^{2,1,4}},\norm{\cdot}_{2}\right)}{\bar{\delta}}+\frac{1}{n}+\frac{d_{0}^{2}}{% T_{0}(T-T_{0})}\epsilon^{2}\right),$		(F.14)

with probability $1-3\bar{\delta}$ .

Covering Number of $\mathcal{S}_{\mathcal{T}_{p}^{2,1,4}}$ .

The next step is to calculate the covering number of $\mathcal{S}_{\mathcal{T}_{p}^{2,1,4}}$ . The class $\mathcal{S}_{\mathcal{T}_{p}^{2,1,4}}$ consists of two components: (i) a matrix $W_{B}$ with orthonormal columns; (ii) a network function $f_{\mathcal{T}}$ . Suppose we have $W_{B1},W_{B2}$ and $f_{1},f_{2}$ such that $\norm{W_{B1}-W_{B2}}_{F}\leq\delta_{1}$ and $\sup_{\norm{x}_{2}\leq 3r_{x}+\sqrt{D\log D},t\in[T_{0},T]}\norm{f_{1}(x,t)-f_{2}(x,t)}_{2}\leq\delta_{2}$ , where $f_{1}=R^{-1}\circ f_{\mathcal{T}1}\circ R$ and $f_{2}=R^{-1}\circ f_{\mathcal{T}2}\circ R$ . Then we evaluate

		$\displaystyle\quad\sup_{\norm{x}_{2}\leq 3r_{x}+\sqrt{D\log D},t\in[T_{0},T]}% \norm{s_{W_{B1},f_{\mathcal{T}1}}(x,t)-s_{W_{B2},f_{\mathcal{T}2}}(x,t)}_{2}$
		$\displaystyle=\frac{1}{\sigma(t)}\sup_{\norm{x}_{2}\leq 3r_{x}+\sqrt{D\log D},% t\in[T_{0},T]}\norm{W_{B1}f_{1}(W_{B1}^{\top}x,t)-W_{B2}f_{2}(W_{B2}^{\top}x,t% )}_{2}$
		$\displaystyle\leq\frac{1}{\sigma(t)}\sup_{\norm{x}_{2}\leq 3r_{x}+\sqrt{D\log D% },t\in[T_{0},T]}\Bigg{(}\norm{W_{B1}f_{1}(W_{B1}^{\top}x,t)-W_{B1}f_{1}(W_{B2}% ^{\top}x,t)}_{2}$
		$\displaystyle\quad+\norm{W_{B1}f_{1}(W_{B2}^{\top}x,t)-W_{B1}f_{2}(W_{B2}^{% \top}x,t)}_{2}+\norm{W_{B1}f_{2}(W_{B2}^{\top}x,t)-W_{B2}f_{2}(W_{B2}^{\top}x,% t)}_{2}\Bigg{)}$
		$\displaystyle\leq\frac{1}{\sigma(t)}\left(L_{\mathcal{T}}\delta_{1}\sqrt{d_{0}% }(3r_{x}+\sqrt{D\log D})+\delta_{2}+\delta_{1}K\right),$		(F.15)

where $L_{\mathcal{T}}$ upper bounds the Lipschitz constant of $f_{\mathcal{T}}$ .

For the set $\{W_{B}\in\mathbb{R}^{D\times d_{0}}:\norm{W_{B}}_{\rm 2}\leq 1\}$ , its $\delta_{1}$ -covering number is $\left(1+2\sqrt{d_{0}}/\delta_{1}\right)^{Dd_{0}}$ (Chen et al., 2020a, Lemma 8). The $\delta_{2}$ -covering number of $f$ needs further discussion, as there is a reshaping process in our network. For the input reshaped from $\bar{h}\in\mathbb{R}^{d_{0}}$ to $H\in\mathbb{R}^{d\times L}$ , we have

\displaystyle\norm{\bar{h}}_{2}\leq r_{x}\Longleftrightarrow\norm{H}_{F}\leq r_{x},

and

\displaystyle\sup_{\norm{\bar{h}}_{2}\leq 3r_{x}+\sqrt{D\log D},t\in[T_{0},T]}\norm{f_{1}(\bar{h},t)-f_{2}(\bar{h},t)}_{2}\leq\delta_{2}

\displaystyle\Longleftrightarrow

\displaystyle\sup_{\norm{H}_{F}\leq 3r_{x}+\sqrt{D\log D},t\in[T_{0},T]}\norm{f_{\mathcal{T}1}(H)-f_{\mathcal{T}2}(H)}_{2}\leq\delta_{2}.

Thus we can follow the covering number property for the sequence-to-sequence transformer $\mathcal{T}_{p}^{2,1,4}$ , i.e., Lemma F.6, and bound the logarithm of the $\delta_{2}$ -covering number by

\displaystyle\frac{\log(nL)}{\delta_{2}^{2}}\cdot\left(\sum_{i=1}^{K}\alpha_{i% }^{\frac{2}{3}}\left(d^{\frac{2}{3}}\left(C_{F}^{2,\infty}\right)^{\frac{4}{3}% }+d^{\frac{2}{3}}\left(2(C_{F})^{2}C_{OV}C_{KQ}^{2,\infty}\right)^{\frac{2}{3}% }+\tau m^{\frac{2}{3}}\left((C_{F})^{2}C_{OV}^{2,\infty}\right)^{\frac{2}{3}}% \right)\right)^{3},

where

\displaystyle\alpha_{i}\coloneqq\prod_{j<i}(C_{F})^{2}C_{OV}(1+4C_{KQ})(C_{X}+% C_{E}).

According to the parameter estimates in Section F.1.2, including (F.12) and (F.13), we derive the following with $\delta=\mathcal{O}(\epsilon^{2/d})$ (Section E.4) and $d=4$ (Theorem 3.1):

		$\displaystyle\leavevmode\nobreak\ K=\mathcal{O}\left(\epsilon^{-2L}\right),L_{% \mathcal{T}}=\mathcal{O}\left(d_{0}L_{s_{+}}\right),\leavevmode\nobreak\ C_{OV% }^{2,\infty}=\mathcal{O}(d\epsilon^{-4L}),\leavevmode\nobreak\ C_{OV}=\mathcal% {O}(\epsilon^{-4L}),$
		$\displaystyle\leavevmode\nobreak\ C_{KQ}^{2,\infty}=\mathcal{O}(\epsilon^{-4})% ,\leavevmode\nobreak\ C_{KQ}=\mathcal{O}(\epsilon^{-4}),\leavevmode\nobreak\ C% _{F}^{2,\infty}=\mathcal{O}(\epsilon^{-4}),\leavevmode\nobreak\ C_{F}=\mathcal% {O}(\epsilon^{-2}),\leavevmode\nobreak\ C_{E}=\mathcal{O}(L^{3/2}),$		(F.16)
		$\displaystyle\leavevmode\nobreak\ C_{\mathcal{T}}=\mathcal{O}\left(d_{0}L_{s_{% +}}\cdot\sqrt{d_{0}\log(d_{0}/T_{0})+\log(1/\epsilon)}\right),\leavevmode% \nobreak\ r_{x}=\mathcal{O}\left(\sqrt{d_{0}\log d_{0}+\log C_{\mathcal{T}}+% \log(n/\bar{\delta})}\right).$

We assume that each element of the input data lies within $[0,1]$ , as shown in Appendix E.

Recall that $\iota=1/(n^{1/4}T_{0}(T-T_{0}))$ , then we get the log-covering number of $\mathcal{T}_{p}^{2,1,4}$ ,

	$\displaystyle\leavevmode\nobreak\ \log\mathcal{N}\left(\iota,\mathcal{T}_{p}^{% 2,1,4},\norm{\cdot}_{2}\right)=$	$\displaystyle\leavevmode\nobreak\ \mathcal{O}\left(\frac{\epsilon^{-8K}\cdot L% ^{K}d^{2}\log(nL)}{\iota}\right)$
	$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \mathcal{O}(1)\cdot\left(\frac{2^{8K\log(L/% \epsilon)}d^{2}\log(nL)}{\iota}\right).$

Following (Chen et al., 2023a, Appendix B.2), the log-covering number of $\mathcal{S}_{\mathcal{T}_{p}^{2,1,4}}$ is

	$\displaystyle\leavevmode\nobreak\ \log\mathcal{N}\left(\iota,\mathcal{S}_{% \mathcal{T}_{p}^{2,1,4}},\norm{\cdot}_{2}\right)$
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \mathcal{O}\left(2Dd_{0}\cdot\log\left(1+% \frac{6C_{\mathcal{T}}L_{\mathcal{T}}\sqrt{d_{0}}(3r_{x}+\sqrt{D\log D})}{T_{0% }\iota}\right)+\frac{2^{8K\log(L/\epsilon)}d^{2}\log(nL)}{T_{0}^{2}\iota^{2}}\right)$	(By (F.2.2))
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \mathcal{O}\left(n^{1/2}2^{8(1/\epsilon)^{L}% \log(L/\epsilon)}Dd^{2}d_{0}^{6}L_{s_{+}}^{2}(T-T_{0})^{2}\cdot\log(nL)\right)$	(By (F.2.2))
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \mathcal{O}\left(n^{1/2}2^{(1/\epsilon)^{2L}% }Dd^{2}d_{0}^{6}L_{s_{+}}^{2}(T-T_{0})^{2}\cdot\log(nL)\right)$	(By $(1/\epsilon)^{L}\geq 8\log(L/\epsilon)$ )
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \widetilde{\mathcal{O}}\left(n^{1/2}2^{(1/% \epsilon)^{2L}}Dd^{2}d_{0}^{6}L_{s_{+}}^{2}(T-T_{0})^{2}\right)$	(By ignoring the log factors)
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \widetilde{\mathcal{O}}\left(n^{1/2}2^{(1/% \epsilon)^{2L}}Dd^{2}d_{0}^{6}L_{s_{+}}^{2}T^{2}\right).$
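The simplification of the exponent above relies on $(1/\epsilon)^{L}\geq 8\log(L/\epsilon)$ . The short check below evaluates the gap between the two sides for a few values of $\epsilon$ and $L$. We assume the logarithm is base 2 (suggested by the $2^{(\cdot)}$ form of the bound); with the natural logarithm the condition is only easier to meet when $L/\epsilon>1$. The helper name `exponent_gap` is hypothetical.

```python
import math

def exponent_gap(eps: float, L: int) -> float:
    """Return (1/eps)^L - 8 * log2(L / eps).

    A nonnegative value means 8 (1/eps)^L log2(L/eps) <= (1/eps)^{2L}.
    Assumption: the logarithm in the bound is base 2.
    """
    return (1.0 / eps) ** L - 8.0 * math.log2(L / eps)

for eps in (0.5, 0.25, 0.1, 0.01):
    for L in (2, 4):
        print(f"eps={eps:g}  L={L}  gap={exponent_gap(eps, L):.3g}")
```

For instance, the gap is negative at $\epsilon=0.5,L=2$ and positive at $\epsilon=0.1,L=2$, so the step should be read as holding for sufficiently small $\epsilon$.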

Substituting the log-covering number into (F.14), we have

	$\displaystyle\leavevmode\nobreak\ \frac{1}{T-T_{0}}\int_{T_{0}}^{T}\norm{s_{% \widehat{W}}(\cdot,t)-\nabla\log p_{t}(\cdot)}_{L^{2}(P_{t})}^{2}\differential t$
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \mathcal{O}\Big{(}\frac{C_{\mathcal{T}}^{2}+% r_{x}^{2}}{\epsilon^{2}nT_{0}(T-T_{0})}(\log(\mathcal{N})+\log(1/\bar{\delta})% )+\frac{d_{0}^{2}}{T_{0}(T-T_{0})}\epsilon^{2}+\frac{1}{n}\Big{)}$
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \mathcal{O}\Big{(}\underbrace{\frac{C_{% \mathcal{T}}^{2}+r_{x}^{2}}{\epsilon^{2}nT_{0}T}(\log(\mathcal{N})+\log(1/\bar% {\delta}))}_{\mathrm{1st\leavevmode\nobreak\ term}}+\underbrace{\frac{d_{0}^{2% }}{T_{0}T}\epsilon^{2}}_{\mathrm{2nd\leavevmode\nobreak\ term}}+\frac{1}{n}% \Big{)}.$	(F.17)

Recall the following parameters,

•

$C_{\mathcal{T}}^{2}=\mathcal{O}\left(d_{0}^{2}L_{s_{+}}^{2}\left(d_{0}\log(d_{0}/T_{0})+\log(1/\epsilon)\right)\right)$
•

$r_{x}^{2}=\mathcal{O}(d_{0}\log d_{0}+\log C_{\mathcal{T}}+\log(n/\bar{\delta}))$
•

$\bar{\delta}$ : probability error
•

$\epsilon$ : approximation error
•

$n$ : sample size
•

$T_{0}<T/2$
•

$D,d,d_{0}>1$ : feature dimension
•

$L>1$ : sequence length
•

$d_{0}=L\cdot d$
•

$L_{s_{+}}$ : Lipschitz coefficient

Ignoring the $\log$ factors and the $\mathrm{poly}(D,d,d_{0},L_{s_{+}})$ factors, the first term in (F.17) becomes

\displaystyle\frac{1}{n^{1/2}}\cdot\frac{T}{T_{0}}\cdot 2^{(1/\epsilon)^{2L}}.

The second term simplifies to

\displaystyle\frac{1}{T_{0}T}\epsilon^{2}.

Thus, the final bound is

\displaystyle\widetilde{O}\Bigg{(}\frac{1}{n^{1/2}}\frac{T}{T_{0}}\cdot 2^{(1/% \epsilon)^{2L}}+\frac{1}{T_{0}T}\epsilon^{2}+\frac{1}{n}\Bigg{)}.

Thus, we complete the proof of Corollary 3.1.1. ∎
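To get a feel for the final bound of Corollary 3.1.1, the sketch below evaluates its three terms numerically. All numerical values are arbitrary illustrations, and the function name `xi` simply mirrors the shorthand $\xi(n,\epsilon,L)$ used in the proof of Corollary 3.1.2. The first term is computed in log-space and reported as infinity when it exceeds the floating-point range, which happens quickly because of the double-exponential factor $2^{(1/\epsilon)^{2L}}$ .

```python
import math

def xi(n: float, eps: float, L: int, T: float, T0: float) -> float:
    """Evaluate (T/T0) 2^{(1/eps)^{2L}} / sqrt(n) + eps^2 / (T0 T) + 1/n."""
    exponent = (1.0 / eps) ** (2 * L)
    log2_stat = exponent + math.log2(T / T0) - 0.5 * math.log2(n)
    stat = 2.0 ** log2_stat if log2_stat < 1023 else math.inf  # guard against float overflow
    approx = eps ** 2 / (T0 * T)
    return stat + approx + 1.0 / n

for n in (1e6, 1e9, 1e12):
    print(f"n={n:.0e}  xi={xi(n, eps=0.5, L=1, T=10.0, T0=0.1):.4g}")
```

The first term decreases in $n$ while the second does not, so $\epsilon$ has to shrink with $n$ to balance the two.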

F.3 Proof of Corollary 3.1.2

Our proof is built on (Chen et al., 2023a, Appendix C). The main difference between our work and (Chen et al., 2023a) is our score estimation error from Corollary 3.1.1. Consequently, only the subspace error and the total variation distance differ from (Chen et al., 2023a, Theorem 3).

Proof Sketch of (i).

We show that if the orthogonal score increases significantly, the mismatch between the column span of $B$ and $W_{B}$ will be greatly amplified. Therefore, an accurate score network estimator forces $B$ and $W_{B}$ to align with each other.

Proof Sketch of (ii).

We conduct the proof in two steps:

•

Step 1: Total Variation Distance Bound. We obtain the discrete result from the continuous-time generated distribution $\widehat{P}_{T_{0}}$ by adding discretization error (Chen et al., 2023a, Lemma 4). It suffices to bound the divergence between the following two stochastic processes:

–

For the ground-truth backward process, consider $h_{t}^{\leftarrow}=B^{\top}y_{t}$ and the following SDE:

\displaystyle\differential h_{t}^{\leftarrow}=\left[\frac{1}{2}h_{t}^{\leftarrow}+\nabla\log p^{h}_{T-t}(h_{t}^{\leftarrow})\right]\differential t+\differential\bar{B}_{t}^{h}.

Denote the marginal distribution of the ground-truth process as $P_{T_{0}}^{h}$ .

–

For the learned process, consider ${\widetilde{h}}^{\leftarrow,r}_{t}$ and the following SDE:

\displaystyle\differential{\widetilde{h}}^{\leftarrow,r}_{t}=\left[\frac{1}{2}% {\widetilde{h}}^{\leftarrow,r}_{t}+\widetilde{s}^{h}_{f,U}({\widetilde{h}}^{% \leftarrow,r}_{t},T-t)\right]\differential t+\differential\bar{B}^{h}_{t},

where $\widetilde{s}_{f,U}^{h}(z,t)\coloneqq[U^{\top}f(Uz,t)-z]/\sigma(t)$ and $U$ is an orthogonal matrix. Following the notation in (Chen et al., 2023a), we use $(W_{B}U)_{\sharp}^{\top}\widehat{P}_{T_{0}}$ to denote the marginal distribution of $\widehat{P}_{T_{0}}$ . We first calculate the latent score matching error, i.e., the error between $\nabla\log p^{h}_{t}(h)$ and $\widetilde{s}_{f,U}^{h}(h,t)$ . Then, we adopt Girsanov’s Theorem (Chen et al., 2023b) and bound the difference in the KL divergence of the above two processes to derive the score-matching error bound.

•

Step 2: Wasserstein-2 Distance Bound. We use the same technique as (Chen et al., 2023a, Theorem 3).
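The backward SDE used in Step 1 can be simulated with an Euler-Maruyama scheme. The sketch below does so for a toy latent distribution $P_{h}=N(0,s_{0}^{2}I)$ , for which the score of the forward marginals is available in closed form; this Gaussian choice, the use of the exact score in place of a learned $\widetilde{s}^{h}_{f,U}$ , and the function name `simulate_backward` are all assumptions made for illustration. The forward process is taken to be the Ornstein-Uhlenbeck process whose time reversal has the drift $\frac{1}{2}h+\nabla\log p^{h}_{T-t}(h)$ shown above.

```python
import numpy as np

def simulate_backward(s0_sq=4.0, d0=2, T=5.0, T0=0.05, step=0.005, n_paths=20000, seed=0):
    """Euler-Maruyama simulation of the backward SDE for a Gaussian toy latent.

    Returns the samples at backward time T - T0 and the exact marginal variance at T0.
    """
    rng = np.random.default_rng(seed)

    def var(t):
        # Per-coordinate variance of the forward marginal when h_0 ~ N(0, s0_sq * I).
        return 1.0 + (s0_sq - 1.0) * np.exp(-t)

    h = rng.standard_normal((n_paths, d0))   # initialize from N(0, I) instead of P_T
    n_steps = int(round((T - T0) / step))
    for k in range(n_steps):
        t = k * step
        score = -h / var(T - t)              # exact score of the Gaussian marginal at time T - t
        noise = rng.standard_normal(h.shape)
        h = h + (0.5 * h + score) * step + np.sqrt(step) * noise
    return h, var(T0)

samples, target_var = simulate_backward()
print("empirical variance:", samples.var(), " target:", target_var)
```

The gap between the empirical and target variances combines the initialization error of order $\exp(-T)$ , the discretization error, and the Monte Carlo error, mirroring the terms in the total variation bound.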

Proof Sketch of (iii).

We derive item (iii) by solving the orthogonal backward process of the diffusion model.

Next, we present the auxiliary theoretical results in Section F.3.1 to prepare our main proof of Corollary 3.1.2. Then we give detailed proof of Corollary 3.1.2 in Section F.3.2.

F.3.1 Auxiliary Lemmas

Here we include a few auxiliary lemmas from (Chen et al., 2023a) without proofs. Recall the definition of Lipschitz norm: for a given function $f$ , $\norm{f(\cdot)}_{Lip}=\sup_{x\neq y}(\norm{f(x)-f(y)}_{2}/\norm{x-y}_{2})$ .

Lemma F.7 (Lemma 3 of (Chen et al., 2023a)).

Assume that the following holds

\displaystyle\mathbb{E}_{h\sim P_{h}}\norm{\nabla\log p_{h}(h)}_{2}^{2}\leq C_{sh},\quad\lambda_{\rm min}\left(\mathbb{E}_{h\sim P_{h}}[hh^{\top}]\right)\geq c_{0},\quad\mathbb{E}_{h\sim P_{h}}\norm{h}_{2}^{2}\leq C_{h},

where $\lambda_{\rm min}$ denotes the smallest eigenvalue. We denote

\displaystyle\bar{\mathbb{E}}[\phi(\cdot,t)]=\int_{T_{0}}^{T}\frac{1}{\sigma^{% 2}(t)}\mathbb{E}_{x\sim P_{t}}[\phi(\cdot,t)]dt.

We set $T_{0}\leq\min\{2\log(d_{0}/C_{sh}),1,2\log(c_{0}),c_{0}\}$ and $T\geq\max\{2\log(C_{h}/d_{0}),1\}$ . Suppose we have

\displaystyle\bar{\mathbb{E}}\norm{W_{B}f(W_{B}^{\top}x,t)-Bq(B^{\top}x,t)}_{2% }^{2}\leq\epsilon.

Then we have

\displaystyle\norm{W_{B}W_{B}^{\top}-BB^{\top}}_{\rm F}^{2}=\mathcal{O}(% \epsilon T_{0}/c_{0}),

and there exists an orthogonal matrix $U\in\mathbb{R}^{d_{0}\times d_{0}}$ , such that:

	$\displaystyle\quad\bar{\mathbb{E}}\norm{U^{\top}f(Uh,t)-q(h,t)}_{2}^{2}$
	$\displaystyle=\epsilon\cdot\mathcal{O}\left(1+\frac{T_{0}}{c_{0}}\left[(T-\log T_{0})d_{0}\cdot\max_{t}\norm{f(\cdot,t)}_{\rm Lip}^{2}+C_{sh}\right]+\frac{\max_{t}\norm{f(\cdot,t)}_{\rm Lip}^{2}\cdot C_{h}}{c_{0}}\right).$

Lemma F.8 (Lemma 4 of (Chen et al., 2023a)).

Assume that $P_{h}$ is sub-Gaussian, $f(h,t)$ and $\nabla\log p_{t}^{h}(h)$ are Lipschitz in both $h$ and $t$ . Assume we have the latent score matching error bound

\displaystyle\int_{T_{0}}^{T}\mathbb{E}_{h_{t}\sim P_{t}^{h}}\left\|\widetilde{s}_{f,U}^{h}\left(h_{t},t\right)-\nabla\log p_{t}^{h}\left(h_{t}\right)\right\|_{2}^{2}\differential t\leq\epsilon_{\text{latent}}(T-T_{0}).

Then we have the following latent distribution estimation error for the undiscretized backward SDE

\operatorname{TV}\left(P_{T_{0}}^{h},\widehat{P}_{T_{0}}^{h}\right)\lesssim% \sqrt{\epsilon_{\text{latent }}(T-T_{0})}+\sqrt{\mathrm{KL}\left(P_{h}\|N\left% (0,I_{d_{0}}\right)\right)}\cdot\exp(-T).

Furthermore, we have the following latent distribution estimation error for the discretized backward SDE

\operatorname{TV}\left(P_{T_{0}}^{h},\widehat{P}_{T_{0}}^{h,\mathrm{dis}}% \right)\lesssim\sqrt{\epsilon_{\text{latent}}(T-T_{0})}+\sqrt{\mathrm{KL}\left% (P_{h}\|N\left(0,I_{d_{0}}\right)\right)}\cdot\exp(-T)+\sqrt{\epsilon_{\text{% dis}}(T-T_{0})},

where

	$\displaystyle\epsilon_{\rm dis}=$	$\displaystyle\left(\frac{\max_{h}\left\|f(h,\cdot)\right\|_{\text{Lip}}}{\sigma\left(T_{0}\right)}+\frac{\max_{h,t}\left\|f(h,t)\right\|_{2}}{T_{0}^{2}}\right)^{2}\eta^{2}$
		$\displaystyle+\left(\frac{\max_{t}\left\|f(\cdot,t)\right\|_{\text{Lip}}}{\sigma\left(T_{0}\right)}\right)^{2}\eta^{2}\max\left\{\mathbb{E}\left\|h_{0}\right\|^{2},d_{0}\right\}+\eta d_{0},$

and $\eta$ is the step size in the backward process.

Lemma F.9 (Lemma 6 of (Chen et al., 2023a)).

Consider the following discretized SDE with step size $\mu$ satisfying $T-T_{0}=K_{T}\mu$

\mathrm{d}y_{t}=\left[\frac{1}{2}-\frac{1}{\sigma(T-k\mu)}\right]{y}_{k\mu}% \mathrm{d}t+\mathrm{d}{B}_{t},\text{ for }t\in[k\mu,(k+1)\mu),

where ${y}_{0}\sim\mathrm{N}(0,I)$ . Then when $T>1$ and $T_{0}+\mu\leq 1$ , we have ${y}_{T-T_{0}}\sim\mathrm{N}\left(0,\sigma^{2}I\right)$ with $\sigma^{2}\leq e\left(T_{0}+\mu\right)$ .
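Because the drift in Lemma F.9 is frozen on each interval $[k\mu,(k+1)\mu)$ , the per-coordinate variance of the process obeys an exact linear recursion, $V_{k+1}=(1+\mu a_{k})^{2}V_{k}+\mu$ with $a_{k}=\frac{1}{2}-\frac{1}{\sigma(T-k\mu)}$ . The sketch below iterates this recursion from $V_{0}=1$ and compares the result with $e(T_{0}+\mu)$ . We assume $\sigma(t)=1-e^{-t}$ , which is not restated in this section; the function name `discretized_orthogonal_variance` is hypothetical.

```python
import math

def discretized_orthogonal_variance(T=5.0, T0=0.1, mu=0.01):
    """Iterate the variance recursion of the discretized SDE in Lemma F.9.

    Assumption: sigma(t) = 1 - exp(-t). Returns (final variance, e * (T0 + mu)).
    """
    def sigma(t):
        return 1.0 - math.exp(-t)

    n_steps = int(round((T - T0) / mu))      # K_T with T - T0 = K_T * mu
    var = 1.0                                # initial condition N(0, I)
    for k in range(n_steps):
        a_k = 0.5 - 1.0 / sigma(T - k * mu)  # frozen drift coefficient on [k mu, (k+1) mu)
        var = (1.0 + mu * a_k) ** 2 * var + mu
    return var, math.e * (T0 + mu)

final_var, bound = discretized_orthogonal_variance()
print("final variance:", final_var, " bound e(T0 + mu):", bound)
```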

Lemma F.10 (Lemma 10 of (Chen et al., 2023a)).

Assume that $\nabla\log p_{h}(h)$ is $L_{h}$ -Lipschitz. Then we have $\mathbb{E}_{h\sim P_{h}}\left\|\nabla\log p_{h}(h)\right\|_{2}^{2}\leq d_{0}L_% {h}$ .
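A Gaussian example shows that the bound in Lemma F.10 can be attained: for $P_{h}=N(0,s^{2}I_{d_{0}})$ the score is $-h/s^{2}$ , its Lipschitz constant is $L_{h}=1/s^{2}$ , and $\mathbb{E}\norm{\nabla\log p_{h}(h)}_{2}^{2}=d_{0}/s^{2}=d_{0}L_{h}$ . The Monte Carlo sketch below checks this numerically; the Gaussian choice and the function name `gaussian_score_second_moment` are illustrative assumptions.

```python
import numpy as np

def gaussian_score_second_moment(s_sq=0.5, d0=8, n=200000, seed=0):
    """Monte Carlo estimate of E||score||^2 for N(0, s_sq * I), and the bound d0 * L_h."""
    rng = np.random.default_rng(seed)
    h = np.sqrt(s_sq) * rng.standard_normal((n, d0))
    score = -h / s_sq                          # score of N(0, s_sq * I); Lipschitz constant 1 / s_sq
    estimate = np.mean(np.sum(score ** 2, axis=1))
    return estimate, d0 / s_sq

estimate, bound = gaussian_score_second_moment()
print("estimate:", estimate, " d0 * L_h:", bound)
```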

F.3.2 Main Proof of Corollary 3.1.2

Proof.

Recall

\displaystyle\xi(n,\epsilon,L):=\frac{1}{n^{1/2}}\frac{T}{T_{0}}\cdot 2^{(1/% \epsilon)^{2L}}+\frac{1}{T_{0}T}\epsilon^{2}+\frac{1}{n}.

•

Proof of (i). With Lemma F.7, replacing $\epsilon$ with $\epsilon(T-T_{0})$ and setting $C_{sh}=L_{h}d_{0}$ by Lemma F.10, we have

\displaystyle\norm{W_{B}W_{B}^{\top}-BB^{\top}}_{F}^{2}=\mathcal{O}\Bigg{(}% \frac{T_{0}\xi(n,\epsilon,L)}{c_{0}}\Bigg{)}.

Substituting the score estimation error in Corollary 3.1.1 and $T=\mathcal{O}(\log n)$ into the bound above, we deduce

\displaystyle\norm{W_{B}W_{B}^{\top}-BB^{\top}}_{F}^{2}=\widetilde{\mathcal{O}}\left(\frac{1}{c_{0}}n^{-\zeta_{1}(n)}\cdot\log^{3}n\right),

where $\zeta_{1}(n)=1/2-9L^{2L}\cdot n^{(2L^{2L+1}/(37L\cdot\log n))}/(37\log n)$ .

We note that $\log n$ is large enough for $T$ to satisfy $T\geq\max\{\log(C_{h}/d_{0}+1),1\}$ , where $C_{h}\geq\mathbb{E}_{h\sim P_{h}}\norm{h}_{2}^{2}$ .

•

Proof of (ii). Lemma F.7 and Lemma F.10 imply that

\displaystyle\bar{\mathbb{E}}\norm{U^{\top}f(Uh,t)-q(h,t)}_{2}^{2}=\mathcal{O}% (\epsilon_{\text{latent}}(T-T_{0})),

where

\displaystyle\epsilon_{\text{latent}}=\epsilon\cdot\mathcal{O}\left(\frac{T_{0% }}{c_{0}}\left[(T-\log T_{0})d_{0}\cdot L_{s_{+}}^{2}+d_{0}L_{h}\right]+\frac{% L_{s_{+}}^{2}\cdot C_{h}}{c_{0}}\right).

By direct calculation, we get

	$\displaystyle\bar{\mathbb{E}}\norm{U^{\top}f(Uh,t)-q(h,t)}_{2}^{2}$	$\displaystyle=\int_{T_{0}}^{T}\mathbb{E}_{h\sim P_{t}^{h}}\norm{\frac{U^{\top}f(Uh,t)-h}{\sigma(t)}-\nabla\log p_{t}^{h}(h)}_{2}^{2}\differential t$
		$\displaystyle\leq\epsilon_{\text{latent}}(T-T_{0}).$

With $\epsilon_{\text{latent}}$ and Lemma F.8, we obtain

		$\displaystyle\leavevmode\nobreak\ {\sf TV}(P_{T_{0}}^{h},(W_{B}U)^{\top}_{% \sharp}\widehat{P}_{T_{0}}^{\rm dis})$
	$\displaystyle\lesssim$	$\displaystyle\leavevmode\nobreak\ \sqrt{\epsilon_{\text{latent}}(T-T_{0})}+\sqrt{\mathrm{KL}\left(P_{h}\|N\left(0,I_{d_{0}}\right)\right)}\exp(-T)+\sqrt{\epsilon_{\text{dis}}(T-T_{0})}$
	$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \widetilde{\mathcal{O}}\left(\frac{1}{\sqrt{% c_{0}}}\sqrt{\xi(n,\epsilon,L)}+\frac{1}{n}+\mu\frac{\sqrt{d_{0}^{2}\log d_{0}% }}{T_{0}^{2}}+\sqrt{\mu}\sqrt{d_{0}}\right).$

As we choose the time step $\mu=\mathcal{O}\left(\xi(n,\epsilon,L)\cdot T_{0}^{2}/(d_{0}\sqrt{\log d_{0}})\right)$ , we obtain

\displaystyle{\sf TV}(P_{T_{0}}^{h},(W_{B}U)^{\top}_{\sharp}\widehat{P}_{T_{0}% }^{\rm dis})=\widetilde{\mathcal{O}}\left(\sqrt{\xi(n,\epsilon,L)}\right).

By definition, $\widehat{P}_{T_{0}}^{h,{\rm dis}}=(W_{B}U)_{\sharp}^{\top}\widehat{P}_{T_{0}}^{\rm dis}$ . This completes the proof of the total variation distance bound in (3.2).

For Wasserstein-2 distance ${\sf W}_{2}(P_{T_{0}}^{h},P_{h})$ , we bound it by using the same technique as (Chen et al., 2023b, Lemma 16). Specifically, our proof only requires finite second moment of $P_{h}$ verified in Assumption 2.2. As a result, we have

\displaystyle{\sf W}_{2}(P_{T_{0}}^{h},P_{h})=\mathcal{O}\left(\sqrt{d_{0}T_{0% }}\right).

•

Proof of (iii). We apply Lemma F.9 due to our score decomposition. With the marginal distribution at time $T-T_{0}$ and observing $\mu\ll T_{0}$ , we obtain the last property.

This completes the proof. ∎

Appendix G Proofs of Section 4

Our proofs are motivated by the observation of low-rank gradient decomposition in transformer-like models (Alman and Song, 2024a; Gu et al., 2024). With our simplifications and observations made in Section 4, we utilize the fine-grained complexity results of transformers and attention (Hu et al., 2024c; Alman and Song, 2024b, 2023) and the tensor trick (Lemma D.1 and (Diao et al., 2019, 2018)) to carry out our proofs. Specifically, we approximate DiT training gradients with a series of low-rank approximations in Sections G.1.1, G.1.2 and G.1.3, and carefully match the multiplication dimensions so that the computation of $\derivative{g_{2}}{\underline{W}}$ forms a chained low-rank approximation in Section G.2.

G.1 Auxiliary Theoretical Results for Theorem 4.1

Here we present some auxiliary theoretical results to prepare our main proof of Theorem 4.1 (existence of almost-linear time algorithms for ADITGC).

G.1.1 Low-Rank Decomposition of DiT Gradients

We start with some definitions. Recall that $W\in\mathbb{R}^{d\times d}$ , and that $\underline{W}\in\mathbb{R}^{d^{2}}$ denotes the vectorization of $W$ following Definition D.1.

Definition G.1.

Let $A_{1},A_{2}\in\mathbb{R}^{d\times L}$ be two matrices. Suppose $\operatorname{\mathsf{A}}=A_{1}^{\top}\otimes A_{2}^{\top}\in\mathbb{R}^{L^{2}% \times d^{2}}$ . Define $\operatorname{\mathsf{A}}_{j_{0}}\in\mathbb{R}^{L\times d^{2}}$ as an $L\times d^{2}$ sub-block of $\operatorname{\mathsf{A}}$ . There are $L$ such sub-blocks in total. For each $j_{0}\in[L]$ , define the function $u(\underline{W})_{j_{0}}:\mathbb{R}^{d^{2}}\to\mathbb{R}^{L}$ by $u(\underline{W})_{j_{0}}:=\exp(\operatorname{\mathsf{A}}_{j_{0}}\underline{W})% \in\mathbb{R}^{L}$ .

Definition G.2.

Let $A_{1},A_{2}\in\mathbb{R}^{d\times L}$ be two matrices. Suppose $\operatorname{\mathsf{A}}=A_{1}^{\top}\otimes A_{2}^{\top}\in\mathbb{R}^{L^{2}% \times d^{2}}$ . Define $\operatorname{\mathsf{A}}_{j_{0}}\in\mathbb{R}^{L\times d^{2}}$ as an $L\times d^{2}$ sub-block of $\operatorname{\mathsf{A}}$ . There are $L$ such sub-blocks in total. For every index $j_{0}\in[L]$ , consider the function $\alpha(\underline{W})_{j_{0}}:\mathbb{R}^{d^{2}}\to\mathbb{R}$ defined by $\alpha(\underline{W})_{j_{0}}:=\langle\underbrace{\exp(\operatorname{\mathsf{A% }}_{j_{0}}\underline{W})}_{L\times 1},\underbrace{\mathds{1}_{L}}_{L\times 1}\rangle$ .

Definition G.3.

Suppose that $\alpha(\underline{W})_{j_{0}}\in\mathbb{R}$ and $u(\underline{W})_{j_{0}}\in\mathbb{R}^{L}$ are defined as in Definitions G.2 and G.1, respectively. For a fixed $j_{0}\in[L]$ , consider the function $f(\underline{W})_{j_{0}}:\mathbb{R}^{d^{2}}\rightarrow\mathbb{R}^{L}$ defined by

\displaystyle f(\underline{W})_{j_{0}}:=\underbrace{\alpha(\underline{W})_{j_{% 0}}^{-1}}_{\mathrm{scalar}}\underbrace{u(\underline{W})_{j_{0}}}_{L\times 1}.

Define $f(\underline{W})\in\mathbb{R}^{L\times L}$ as the matrix where the $j_{0}$ -th row is $(f(\underline{W})_{j_{0}})^{\top}$ .
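Definitions G.1 to G.3 describe a row-wise softmax written through the Kronecker structure $\operatorname{\mathsf{A}}=A_{1}^{\top}\otimes A_{2}^{\top}$ . The sketch below computes $u$ , $\alpha$ and $f$ block by block and can be compared against the row-wise softmax of $A_{1}^{\top}WA_{2}$ . We assume that the vectorization of Definition D.1 is row-major (NumPy's default `reshape`) and that $\operatorname{\mathsf{A}}_{j_{0}}$ is the $j_{0}$ -th block of $L$ consecutive rows; both are our reading, and the function name `attention_rows_via_kron` is hypothetical.

```python
import numpy as np

def attention_rows_via_kron(A1: np.ndarray, A2: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Compute f(W) in R^{L x L} following Definitions G.1-G.3.

    Assumption: W is vectorized in row-major order.
    """
    d, L = A1.shape
    A = np.kron(A1.T, A2.T)                 # shape (L^2, d^2)
    w = W.reshape(-1)                       # row-major vectorization of W
    F = np.empty((L, L))
    for j0 in range(L):
        A_j0 = A[j0 * L:(j0 + 1) * L]       # j0-th L x d^2 sub-block
        u = np.exp(A_j0 @ w)                # Definition G.1
        alpha = u.sum()                     # Definition G.2: <u, 1_L>
        F[j0] = u / alpha                   # Definition G.3
    return F

rng = np.random.default_rng(0)
A1, A2 = rng.standard_normal((2, 3, 5))     # d = 3, L = 5
W = rng.standard_normal((3, 3))
F = attention_rows_via_kron(A1, A2, W)
Z = A1.T @ W @ A2
reference = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
print("max deviation from row-wise softmax:", np.max(np.abs(F - reference)))
```

Forming $\operatorname{\mathsf{A}}$ explicitly costs $L^{2}d^{2}$ memory, so this is only a small-scale check of the notation and not how the almost-linear time algorithm proceeds.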

Definition G.4.

For every $i_{0}\in[d]$ , define the function $h(\underline{W}_{OV})_{i_{0}}:\mathbb{R}^{d^{2}}\rightarrow\mathbb{R}^{L}$ by

\displaystyle h(\underline{W}_{OV})_{i_{0}}:=\underbrace{A_{3}^{\top}}_{L% \times d}\underbrace{(W_{OV}^{\top})_{*,i_{0}}}_{d\times 1}.

Here, $W_{OV}\in\mathbb{R}^{d\times d}$ denotes the matrix representation of $\underline{W}_{OV}\in\mathbb{R}^{d^{2}}$ , and $(W_{OV})^{\top}_{*,i_{0}}$ represents the $i_{0}$ -th column of $W_{OV}^{\top}$ . Define $h(\underline{W}_{OV})\in\mathbb{R}^{L\times d}$ as the matrix where the $i_{0}$ -th column is $h(\underline{W}_{OV})_{i_{0}}$ .

Definition G.5.

For each $j_{0}\in[L]$ , we denote $f(\underline{W})_{j_{0}}\in\mathbb{R}^{L}$ as the normalized vector defined by Definition G.3. For each $i_{0}\in[d]$ , $h(\underline{W}_{OV})_{i_{0}}$ is defined as per Definition G.4. For every pair $(j_{0},i_{0})\in[L]\times[d]$ , define the function $c(\underline{W})_{j_{0},i_{0}}:\mathbb{R}^{d^{2}}\times\mathbb{R}^{d^{2}}% \rightarrow\mathbb{R}$ by

\displaystyle c(\underline{W})_{j_{0},i_{0}}:=\langle f(\underline{W})_{j_{0}}% ,h(\underline{W}_{OV})_{i_{0}}\rangle-Y^{\top}_{j_{0},i_{0}},

where $Y^{\top}_{j_{0},i_{0}}$ is the element at the $(j_{0},i_{0})$ position of the matrix $Y^{\top}\in\mathbb{R}^{L\times d}$ . The function $c(\cdot)$ has the matrix form

\displaystyle\underbrace{c(\underline{W})}_{L\times d}=\underbrace{f(% \underline{W})}_{L\times L}\underbrace{h(\underline{W}_{OV})}_{L\times d}-% \underbrace{Y^{\top}}_{L\times d}.

With the tensor trick (Section D.3), we compute the gradient $\derivative{g_{2}}{\underline{W}}$ of the DiT loss as follows:

\displaystyle\derivative{g_{2}}{\underline{W}}=\derivative{\underline{W}}\left% [{\frac{1}{2}}\sum_{j_{0}=1}^{L}\sum_{i_{0}=1}^{d}c_{j_{0},i_{0}}^{2}(% \underline{W})\right].

(G.1)

(G.1) presents a neat decomposition of $\derivative{g_{2}}{\underline{W}}$ . Each term is easy enough to handle. Thus, we arrive at the following lemma. Let $Z[i,\cdot]$ and $Z[\cdot,j]$ be the $i$ -th row and $j$ -th column of matrix $Z$ .

Lemma G.1 (Low-Rank Decomposition of DiT Gradient).

Let matrices $A_{1},A_{2},A_{3},W,W_{OV},Y$ and loss function $\mathcal{L}$ follow Definition 4.1, and let $\operatorname{\mathsf{A}}\coloneqq A_{1}^{\top}\otimes A_{2}^{\top}$ . It holds that

\displaystyle\derivative{g_{2}}{\underline{W}}=\sum_{j_{0}=1}^{L}\sum_{i_{0}=1}^{d}c(\underline{W})_{j_{0},i_{0}}\operatorname{\mathsf{A}}_{j_{0}}^{\top}\underbrace{\Big{(}\overbrace{\mathop{\rm{diag}}\left(f(\underline{W})_{j_{0}}\right)}^{(II)}-\overbrace{f(\underline{W})_{j_{0}}f(\underline{W})_{j_{0}}^{\top}}^{(III)}\Big{)}}_{(I)}h(\underline{W}_{OV})_{i_{0}}.

(G.2)
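The decomposition (G.2) can be checked numerically against finite differences of the loss $\frac{1}{2}\sum_{j_{0},i_{0}}c(\underline{W})_{j_{0},i_{0}}^{2}$ from (G.1). The sketch below assumes a row-major vectorization of $W$ , takes $h(\underline{W}_{OV})=A_{3}^{\top}W_{OV}^{\top}$ as in Definition G.4, and uses small random inputs; the function names are hypothetical. It evaluates the double sum naively and is meant only as a correctness check, not as the efficient algorithm.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)       # stabilize the exponentials
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def loss_and_grad(A1, A2, A3, W, W_OV, Y):
    """Return g_2 = 0.5 * ||c(W)||_F^2 and its gradient via the decomposition (G.2)."""
    d, L = A1.shape
    A = np.kron(A1.T, A2.T)                                  # (L^2, d^2)
    F = softmax_rows((A @ W.reshape(-1)).reshape(L, L))      # rows are f(W)_{j0}
    H = A3.T @ W_OV.T                                        # columns are h(W_OV)_{i0}
    C = F @ H - Y.T                                          # c(W), shape (L, d)
    grad = np.zeros(d * d)
    for j0 in range(L):
        f = F[j0]
        J = np.diag(f) - np.outer(f, f)                      # term (I) = (II) - (III)
        A_j0 = A[j0 * L:(j0 + 1) * L]
        for i0 in range(d):
            grad += C[j0, i0] * (A_j0.T @ (J @ H[:, i0]))
    return 0.5 * np.sum(C ** 2), grad

def numeric_grad(A1, A2, A3, W, W_OV, Y, step=1e-6):
    """Central finite differences of the loss with respect to each entry of vec(W)."""
    g = np.zeros(W.size)
    for i in range(W.size):
        E = np.zeros(W.size)
        E[i] = step
        up, _ = loss_and_grad(A1, A2, A3, W + E.reshape(W.shape), W_OV, Y)
        down, _ = loss_and_grad(A1, A2, A3, W - E.reshape(W.shape), W_OV, Y)
        g[i] = (up - down) / (2 * step)
    return g

rng = np.random.default_rng(0)
d, L = 3, 4
A1, A2, A3, Y = 0.5 * rng.standard_normal((4, d, L))
W, W_OV = 0.5 * rng.standard_normal((2, d, d))
_, analytic = loss_and_grad(A1, A2, A3, W, W_OV, Y)
numeric = numeric_grad(A1, A2, A3, W, W_OV, Y)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```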

Proof.

Let $Z[i,\cdot]$ and $Z[\cdot,j]$ be the $i$ -th row and $j$ -th column of matrix $Z$ .

With the DiT loss in Definition 4.1, for each coordinate $i\in[d^{2}]$ of $\underline{W}$ , we have

$\displaystyle\derivative{g_{2}}{\underline{W}_{i}}$	$\displaystyle={\frac{1}{2}}\sum_{j_{0}=1}^{L}\sum_{i_{0}=1}^{d}\derivative{\underline{W}_{i}}c^{2}_{j_{0},i_{0}}(\underline{W})$
	$\displaystyle=\sum_{j_{0}=1}^{L}\sum_{i_{0}=1}^{d}c(\underline{W})_{j_{0},i_{0}}\cdot\derivative{c(\underline{W})_{j_{0},i_{0}}}{\underline{W}_{i}}$
	$\displaystyle=\sum_{j_{0}=1}^{L}\sum_{i_{0}=1}^{d}c(\underline{W})_{j_{0},i_{0}}\cdot\derivative{\left\langle f(\underline{W})_{j_{0}},h(\underline{W}_{OV})_{i_{0}}\right\rangle}{\underline{W}_{i}}$	(By Definition G.5)
	$\displaystyle=\sum_{j_{0}=1}^{L}\sum_{i_{0}=1}^{d}c(\underline{W})_{j_{0},i_{0}}\cdot\left\langle\derivative{f(\underline{W})_{j_{0}}}{\underline{W}_{i}},h(\underline{W}_{OV})_{i_{0}}\right\rangle$
	$\displaystyle=\sum_{j_{0}=1}^{L}\sum_{i_{0}=1}^{d}c(\underline{W})_{j_{0},i_{0}}\cdot\left\langle\derivative{\alpha(\underline{W})_{j_{0}}^{-1}u(\underline{W})_{j_{0}}}{\underline{W}_{i}},h(\underline{W}_{OV})_{i_{0}}\right\rangle$	(By Definition G.3)
	$\displaystyle=\sum_{j_{0}=1}^{L}\sum_{i_{0}=1}^{d}c(\underline{W})_{j_{0},i_{0}}\cdot\left\langle\alpha(\underline{W})_{j_{0}}^{-1}\cdot\derivative{u(\underline{W})_{j_{0}}}{\underline{W}_{i}}+\derivative{\alpha(\underline{W})_{j_{0}}^{-1}}{\underline{W}_{i}}\cdot u(\underline{W})_{j_{0}},h(\underline{W}_{OV})_{i_{0}}\right\rangle$
	$\displaystyle=\sum_{j_{0}=1}^{L}\sum_{i_{0}=1}^{d}c(\underline{W})_{j_{0},i_{0}}\cdot\left\langle\alpha(\underline{W})_{j_{0}}^{-1}\cdot\derivative{u(\underline{W})_{j_{0}}}{\underline{W}_{i}}-\alpha(\underline{W})_{j_{0}}^{-2}\derivative{\alpha(\underline{W})_{j_{0}}}{\underline{W}_{i}}\cdot u(\underline{W})_{j_{0}},h(\underline{W}_{OV})_{i_{0}}\right\rangle.$	(By chain rule)

For each $j_{0}\in[L]$ , we have

\displaystyle\derivative{\left(\operatorname{\mathsf{A}}_{j_{0}}\underline{W}% \right)}{\underline{W}_{i_{0}}}=\operatorname{\mathsf{A}}_{j_{0}}\cdot% \derivative{\underline{W}}{\underline{W}_{i_{0}}}=\left(\operatorname{\mathsf{% A}}_{j_{0}}\right)[\cdot,i].

Therefore, for each $j_{0}\in[L]$ , we have

$\displaystyle\derivative{u(\underline{W})_{j_{0}}}{\underline{W}_{i_{0}}}$	$\displaystyle=\derivative{\exp\left(\operatorname{\mathsf{A}}_{j_{0}}% \underline{W}\right)}{\underline{W}_{i_{0}}}$	(By Definition G.1)
	$\displaystyle=\exp\left(\operatorname{\mathsf{A}}_{j_{0}}\underline{W}\right)% \odot\derivative{\operatorname{\mathsf{A}}_{j_{0}}\underline{W}}{\underline{W}% _{i_{0}}}$	(By entry-wise product rule)
	$\displaystyle=\operatorname{\mathsf{A}}_{j_{0}}[\cdot,i]\odot u(\underline{W})% _{j_{0}}.$	(By Definition G.1 again)

Similarly,

$\displaystyle\derivative{\alpha(\underline{W})_{j_{0}}}{\underline{W}_{i_{0}}}=$	$\displaystyle\leavevmode\nobreak\ \derivative{\left\langle u(\underline{W})_{j% _{0}},\mathds{1}_{L}\right\rangle}{\underline{W}_{i_{0}}}$	(By Definition G.2)
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \left\langle\operatorname{\mathsf{A}}_{j_{0}% }[\cdot,i]\odot u(\underline{W})_{j_{0}},\mathds{1}_{L}\right\rangle$	(By entry-wise product rule)
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \left\langle\operatorname{\mathsf{A}}_{j_{0}% }[\cdot,i],u(\underline{W})_{j_{0}}\right\rangle.$	(By Definition G.1 again)

Putting all together, we have

		$\displaystyle\leavevmode\nobreak\ \derivative{g_{2}(\underline{W})_{j_{0},i_{0% }}}{\underline{W}_{i_{0}}}$
	$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \left[\left\langle h(\underline{W}_{OV})_{i_% {0}},\operatorname{\mathsf{A}}_{j_{0}}[\cdot,i]\odot f(\underline{W})_{j_{0}}% \right\rangle-\left\langle h(\underline{W}_{OV})_{i_{0}},f(\underline{W})_{j_{% 0}}\right\rangle\cdot\left\langle\operatorname{\mathsf{A}}_{j_{0}}[\cdot,i],f(% \underline{W})_{j_{0}}\right\rangle\right]\cdot c(\underline{W})_{j_{0},i_{0}},$

where

		$\displaystyle\left\langle h(\underline{W}_{OV})_{i_{0}},\operatorname{\mathsf{% A}}_{j_{0}}[\cdot,i]\odot f(\underline{W})_{j_{0}}\right\rangle-\left\langle h% (\underline{W}_{OV})_{i_{0}},f(\underline{W})_{j_{0}}\right\rangle\cdot\left% \langle\operatorname{\mathsf{A}}_{j_{0}}[\cdot,i],f(\underline{W})_{j_{0}}\right\rangle$
	$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \operatorname{\mathsf{A}}_{j_{0}}^{\top}% \left(\operatorname{\mathop{\rm{diag}}}\left(f({\underline{W}})_{j_{0}}\right)% -f({\underline{W}})_{j_{0}}f({\underline{W}})_{j_{0}}^{\top}\right)h(% \underline{W}_{OV})_{i_{0}}.$

This completes the proof. ∎
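The factor $\mathop{\rm{diag}}(f)-ff^{\top}$ in (G.2) is the Jacobian of the softmax map. As a sanity check, the following minimal sketch compares this closed form against central finite differences. It is our illustration, not part of the original analysis; the helper names `softmax`, `softmax_jacobian` and `numerical_jacobian` are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax of a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_jacobian(x):
    # Closed form diag(f) - f f^T with f = softmax(x).
    f = softmax(x)
    return np.diag(f) - np.outer(f, f)

def numerical_jacobian(x, eps=1e-6):
    # Central finite differences, one input coordinate at a time.
    n = x.size
    J = np.zeros((n, n))
    for k in range(n):
        e = np.zeros(n)
        e[k] = eps
        J[:, k] = (softmax(x + e) - softmax(x - e)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
x = rng.normal(size=5)
max_gap = np.abs(softmax_jacobian(x) - numerical_jacobian(x)).max()
print(max_gap)
```

The printed gap should sit at the level of finite-difference error.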

Observe (G.2) carefully. We see that (II) is diagonal and (III) is rank-one, hence low-rank. This provides a hint for algorithmic speedup through low-rank approximation: if we approximate the other parts with low-rank approximations and carefully match the multiplication dimensions, we might formulate the computation of $\derivative{g_{2}}{\underline{W}}$ as a chained low-rank approximation.

Surprisingly, this approach makes it possible to compute (G.2) in almost-linear time. To proceed, we further decompose (G.2) according to the chain rule in the next lemma, and then conduct the approximation term by term.

To facilitate our proof, we introduce the following notation.

Definition G.6 ( $q(\cdot)$ ).

Define $c(\underline{W})\in\mathbb{R}^{L\times d}$ as specified in Definition G.5 and $h(\underline{W}_{OV})\in\mathbb{R}^{L\times d}$ as described in Definition G.4. Define $q(\underline{W})\in\mathbb{R}^{L\times L}$ by

\displaystyle q(\underline{W}):=\underbrace{c(\underline{W})}_{L\times d}\underbrace{h(\underline{W}_{OV})^{\top}}_{d\times L}.

In addition, $q(\underline{W})_{j_{0}}^{\top}$ denotes the $j_{0}$-th row of $q(\underline{W})$, so that $q(\underline{W})_{j_{0}}$ is an $L\times 1$ column vector.

Definition G.7 ( $p(\cdot)$ , $p_{1}(\cdot)$ , $p_{2}(\cdot)$ ).

For each index $j_{0}\in[L]$, we define $p(\underline{W})_{j_{0}}\in\mathbb{R}^{L}$ as follows:

\displaystyle p(\underline{W})_{j_{0}}:=\left(\mathop{\rm{diag}}(f(\underline{W})_{j_{0}})-f(\underline{W})_{j_{0}}f(\underline{W})_{j_{0}}^{\top}\right)q(\underline{W})_{j_{0}}.

We define $p(\underline{W})\in\mathbb{R}^{L\times L}$ such that $p(\underline{W})_{j_{0}}^{\top}$ forms the $j_{0}$-th row of $p(\underline{W})$. In addition, for every index $j_{0}\in[L]$, we define $p_{1}(\underline{W})_{j_{0}},p_{2}(\underline{W})_{j_{0}}\in\mathbb{R}^{L}$ as

\displaystyle p_{1}(\underline{W})_{j_{0}}\coloneqq\mathop{\rm{diag}}\left(f\left(\underline{W}\right)_{j_{0}}\right)q(\underline{W})_{j_{0}},\quad p_{2}(\underline{W})_{j_{0}}\coloneqq f\left(\underline{W}\right)_{j_{0}}f\left(\underline{W}\right)_{j_{0}}^{\top}q(\underline{W})_{j_{0}},

such that $p(\underline{W})=p_{1}(\underline{W})-p_{2}(\underline{W})$ .
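To make Definition G.7 concrete, the sketch below builds $p$ row by row from the definition and compares it with the vectorized forms $p_{1}=f\odot q$ and $p_{2}=\mathop{\rm{diag}}(r)f$, where $r_{j_{0}}=\langle f_{j_{0}},q_{j_{0}}\rangle$. It assumes the row convention of Definition G.7; the random `f` and `q` are placeholders for illustration, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 6
f = rng.random((L, L))
f /= f.sum(axis=1, keepdims=True)    # rows sum to one, like a softmax output
q = rng.normal(size=(L, L))

# Row-by-row construction, following the definition literally.
p_rows = np.stack(
    [(np.diag(f[j]) - np.outer(f[j], f[j])) @ q[j] for j in range(L)]
)

# Vectorized forms of p1 and p2.
p1 = f * q                           # diag(f_j) q_j stacked as rows
r = np.sum(f * q, axis=1)            # r_j = <f_j, q_j>
p2 = r[:, None] * f                  # f_j f_j^T q_j stacked as rows

gap = np.abs(p_rows - (p1 - p2)).max()
print(gap)
```

The vectorized forms are the ones that the later low-rank lemmas approximate.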

$p(\cdot)$ allows us to express $\derivative{g_{2}}{\underline{W}}$ in a neat form:

Lemma G.2.

Define the functions $f(\underline{W})\in\mathbb{R}^{L\times L}$, $c(\underline{W})\in\mathbb{R}^{L\times d}$, $h(\underline{W}_{OV})\in\mathbb{R}^{L\times d}$, $q(\underline{W})\in\mathbb{R}^{L\times L}$, and $p(\underline{W})\in\mathbb{R}^{L\times L}$ as specified in Definitions G.3, G.5, G.4, G.6 and G.7, respectively. Let $A_{1},A_{2}\in\mathbb{R}^{d\times L}$ be two given matrices, and define $\operatorname{\mathsf{A}}=A_{1}^{\top}\otimes A_{2}^{\top}$. Define $g_{2}$ according to (O1), and let $g_{2}(\underline{W})_{j_{0},i_{0}}$ be as described in (G.1). Then it holds that

\displaystyle\derivative{g_{2}}{\underline{W}}=\operatorname{vec}\left(A_{1}p(\underline{W})A_{2}^{\top}\right).

(G.3)

Proof.

By definition, (G.1) gives, for each $i\in[d^{2}]$ and with $\operatorname{\mathsf{A}}_{j_{0},i}\coloneqq\operatorname{\mathsf{A}}_{j_{0}}[\cdot,i]$,

		$\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}(g_{2})_{j_{0},i_{0}}}{\mathrm{d}\underline{W}_{i}}$		(G.4)
	$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ c_{j_{0},i_{0}}\cdot\Big(\underbrace{\langle f(\underline{W})_{j_{0}}\odot\operatorname{\mathsf{A}}_{j_{0},i},h(\underline{W}_{OV})_{i_{0}}\rangle}_{=\operatorname{\mathsf{A}}_{j_{0},i}^{\top}\mathop{\rm{diag}}(f(\underline{W})_{j_{0}})h(\underline{W}_{OV})_{i_{0}}}-\underbrace{\langle f(\underline{W})_{j_{0}},h(\underline{W}_{OV})_{i_{0}}\rangle\cdot\langle f(\underline{W})_{j_{0}},\operatorname{\mathsf{A}}_{j_{0},i}\rangle}_{=\operatorname{\mathsf{A}}_{j_{0},i}^{\top}f(\underline{W})_{j_{0}}f(\underline{W})_{j_{0}}^{\top}h(\underline{W}_{OV})_{i_{0}}}\Big).$		(By $\Braket{a\odot b,c}=a^{\top}\mathop{\rm{diag}}(b)c$ for $a,b,c\in\mathbb{R}^{L}$)

Therefore, (G.4) becomes

	$\displaystyle\frac{\mathrm{d}(g_{2})_{j_{0},i_{0}}}{\mathrm{d}\underline{W}_{i}}=$	$\displaystyle\leavevmode\nobreak\ c_{j_{0},i_{0}}\cdot\left(\operatorname{\mathsf{A}}_{j_{0},i}^{\top}\mathop{\rm{diag}}(f(\underline{W})_{j_{0}})h(\underline{W}_{OV})_{i_{0}}-\operatorname{\mathsf{A}}_{j_{0},i}^{\top}f(\underline{W})_{j_{0}}f(\underline{W})_{j_{0}}^{\top}h(\underline{W}_{OV})_{i_{0}}\right)$
	$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ c_{j_{0},i_{0}}\cdot\operatorname{\mathsf{A}}_{j_{0},i}^{\top}\left(\mathop{\rm{diag}}(f(\underline{W})_{j_{0}})-f(\underline{W})_{j_{0}}f(\underline{W})_{j_{0}}^{\top}\right)h(\underline{W}_{OV})_{i_{0}}.$		(G.5)

Summing (G.5) over $i_{0}\in[d]$ and using $q(\underline{W})_{j_{0}}=\sum_{i_{0}=1}^{d}c(\underline{W})_{j_{0},i_{0}}h(\underline{W}_{OV})_{i_{0}}$ (Definition G.6) gives $\sum_{j_{0}=1}^{L}\operatorname{\mathsf{A}}_{j_{0},i}^{\top}p(\underline{W})_{j_{0}}$ for the $i$-th entry of the gradient (Definition G.7). Then, stacking over $i\in[d^{2}]$ and using $\operatorname{\mathsf{A}}=A_{1}^{\top}\otimes A_{2}^{\top}$, we complete the proof. ∎
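Lemma G.2 can be checked numerically on a small instance. The sketch below assumes the loss takes the form $\frac{1}{2}\|f(\underline{W})h-Y^{\top}\|_{F}^{2}$ with $f$ the row-wise softmax of $A_{1}^{\top}WA_{2}$ and $h$ held fixed; this form is our reading of Definitions G.3 to G.5, and the names `row_softmax`, `loss`, `closed_form_grad` and `Yt` (standing for $Y^{\top}$) are hypothetical. It compares $A_{1}pA_{2}^{\top}$, in matrix rather than vectorized form, against finite differences.

```python
import numpy as np

def row_softmax(S):
    # Softmax applied independently to each row of S.
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss(W, A1, A2, h, Yt):
    f = row_softmax(A1.T @ W @ A2)
    return 0.5 * np.sum((f @ h - Yt) ** 2)

def closed_form_grad(W, A1, A2, h, Yt):
    f = row_softmax(A1.T @ W @ A2)
    c = f @ h - Yt                                   # L x d
    q = c @ h.T                                      # L x L
    p = f * q - np.sum(f * q, axis=1, keepdims=True) * f
    return A1 @ p @ A2.T                             # d x d

rng = np.random.default_rng(2)
d, L = 3, 5
A1, A2 = rng.normal(size=(d, L)), rng.normal(size=(d, L))
W = 0.3 * rng.normal(size=(d, d))
h, Yt = rng.normal(size=(L, d)), rng.normal(size=(L, d))

G = closed_form_grad(W, A1, A2, h, Yt)
G_fd = np.zeros_like(W)
eps = 1e-6
for a in range(d):
    for b in range(d):
        E = np.zeros_like(W)
        E[a, b] = eps
        G_fd[a, b] = (loss(W + E, A1, A2, h, Yt)
                      - loss(W - E, A1, A2, h, Yt)) / (2 * eps)
grad_gap = np.abs(G - G_fd).max()
print(grad_gap)
```

This dense computation costs quadratic time in $L$; the lemmas that follow replace `f`, `q` and `p` by low-rank factors.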

G.1.2 Low-Rank Approximations of Building Blocks I

The definitions of $p$ , $p_{1}$ , $p_{2}$ , and Lemma G.2 show that the DiT training gradient $\derivative{g_{2}}{\underline{W}}$ involves entry-wise products of $f$ , $q$ , and $c$ . Therefore, if we approximate these with inner-dimension-matched low-rank approximations, computing $\derivative{g_{2}}{\underline{W}}$ itself becomes a low-rank approximation. In the following sections, we present low-rank approximations for $f$ , $q$ , and $c$ .

Lemma G.3 (Approximate $f(\cdot)$ , Modified from (Alman and Song, 2023)).

Let $\Gamma=o(\sqrt{\log L})$ and $k_{1}=L^{o(1)}$. Let $A_{1},A_{2}\in\mathbb{R}^{d\times L}$, $W\in\mathbb{R}^{d\times d}$, and let $f(\underline{W})=D^{-1}\exp(A_{1}^{\top}WA_{2})$ with $D=\mathop{\rm{diag}}\left(\exp\left(A_{1}^{\top}WA_{2}\right){\mathds{1}_{L}}\right)$ follow Definitions G.1, G.2, G.3 and G.5. If $\max\big(\norm{A_{1}^{\top}W}_{\max},\norm{A_{2}}_{\max}\big)\leq\Gamma$, then there exist two matrices $U_{1},V_{1}\in\mathbb{R}^{L\times k_{1}}$ such that $\norm{U_{1}V_{1}^{\top}-f(\underline{W})}_{\max}\leq\epsilon/\mathrm{poly}(L)$. In addition, it takes $L^{1+o(1)}$ time to construct $U_{1}$ and $V_{1}$.

Proof.

By (Alman and Song, 2023, Theorem 3), we complete the proof. ∎
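To give some intuition for why bounded entries allow a low-rank factorization of $f$, the sketch below uses a truncated Taylor expansion of $\exp$ as a simple stand-in for the polynomial construction of Alman and Song (2023). It is not their construction and carries none of their guarantees. The function `taylor_features`, the matrices `Q` and `K` (playing the roles of $A_{1}^{\top}W$ and $A_{2}^{\top}$), and all sizes are assumptions made for illustration.

```python
import numpy as np
from math import factorial

def taylor_features(X, degree):
    # Rows satisfy <phi(x), phi(y)> = sum_{k <= degree} (x . y)^k / k!.
    n = X.shape[0]
    blocks = [np.ones((n, 1))]
    power = np.ones((n, 1))
    for k in range(1, degree + 1):
        # Row-wise Kronecker product: row i becomes kron(power[i], X[i]).
        power = (power[:, :, None] * X[:, None, :]).reshape(n, -1)
        blocks.append(power / np.sqrt(factorial(k)))
    return np.hstack(blocks)

rng = np.random.default_rng(3)
L, d, degree = 400, 2, 6
Q = rng.uniform(-0.7, 0.7, size=(L, d))     # bounded entries
K = rng.uniform(-0.7, 0.7, size=(L, d))

U, V = taylor_features(Q, degree), taylor_features(K, degree)
row_sums = U @ (V.T @ np.ones(L))           # normalizer, no L x L matrix formed
U1, V1 = U / row_sums[:, None], V           # f is approximated by U1 @ V1.T

S = Q @ K.T                                 # dense reference, for checking only
f_exact = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
f_gap = np.abs(U1 @ V1.T - f_exact).max()
rank = U1.shape[1]
print(rank, f_gap)
```

The factor width grows with the truncation degree, which is why the entry bound $\Gamma$ matters: larger entries need a higher degree for the same accuracy.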

Lemma G.4 (Approximate $c(\cdot)$ ).

Assume all numerical values are in $O(\log L)$ bits. Let $d=O(\log L)$ and let $c(\underline{W})\in\mathbb{R}^{L\times d}$ follow Definition G.5. Then there exist two matrices $U_{1},V_{1}\in\mathbb{R}^{L\times k_{1}}$ such that $\left\|U_{1}V_{1}^{\top}h(W_{OV})-Y^{\top}-c(\underline{W})\right\|_{\max}\leq\epsilon/\mathrm{poly}(L)$.

Proof of Lemma G.4.

$\displaystyle\left\|U_{1}V_{1}^{\top}h(W_{OV})-Y^{\top}-c(\underline{W})\right\|_{\max}$	$\displaystyle=\left\|U_{1}V_{1}^{\top}h(W_{OV})-Y^{\top}-(f(\underline{W})h(W_{OV})-Y^{\top})\right\|_{\max}$	(By Definition G.5)
	$\displaystyle=\left\|\left[U_{1}V_{1}^{\top}-f(\underline{W})\right]h(W_{OV})\right\|_{\max}$
	$\displaystyle\leq\epsilon/\mathrm{poly}(L).$	(By (Alman and Song, 2023, Theorem 3))

∎

Lemma G.5 (Approximate $q(\cdot)$ ).

Let $k_{2}=L^{o(1)}$, let $c(\cdot)\in\mathbb{R}^{L\times d}$ follow Definition G.5, and let $q(\underline{W})\coloneqq c(\underline{W})h(\underline{W}_{OV})^{\mathsf{T}}\in\mathbb{R}^{L\times L}$ follow Definition G.6. Then there exist two matrices $U_{2},V_{2}\in\mathbb{R}^{L\times k_{2}}$ such that $\left\|U_{2}V_{2}^{\top}-q(\underline{W})\right\|_{\max}\leq\epsilon/\mathrm{poly}(L)$. In addition, it takes $L^{1+o(1)}$ time to construct $U_{2},V_{2}$.

Proof of Lemma G.5.

Our proof is built on (Alman and Song, 2023, Lemma D.3).

Let $\widetilde{q}(\cdot)$ denote an approximation to $q(\cdot)$ .

By Lemma G.4, $U_{1}V_{1}^{\top}h(W_{OV})-Y^{\top}$ approximates $c(\underline{W})$ up to accuracy $\epsilon=1/\mathrm{poly}(L)$.

Thus, in line with $q(\underline{W})=c(\underline{W})h(W_{OV})^{\top}$ (Definition G.6), by setting $\widetilde{q}(\underline{W})=\left(U_{1}V_{1}^{\top}h(W_{OV})-Y^{\top}\right)h(W_{OV})^{\top}$, we find a low-rank form for $\widetilde{q}(\cdot)$:

\displaystyle\widetilde{q}(\underline{W})=U_{1}V_{1}^{\top}h(W_{OV})h(W_{OV})^{\top}-Y^{\top}h(W_{OV})^{\top}=U_{2}V_{2}^{\top},\quad U_{2}\coloneqq\left[U_{1},\,-Y^{\top}\right],\quad V_{2}\coloneqq\left[h(W_{OV})h(W_{OV})^{\top}V_{1},\,h(W_{OV})\right],

with $k_{2}=k_{1}+d$, such that

	$\displaystyle\|\widetilde{q}(\underline{W})-q(\underline{W})\|_{\max}$	$\displaystyle=\left\|\left(U_{1}V_{1}^{\top}h(W_{OV})-Y^{\top}-c(\underline{W})\right)h(W_{OV})^{\top}\right\|_{\max}$
		$\displaystyle\leq d\left\|h(W_{OV})\right\|_{\max}\left\|U_{1}V_{1}^{\top}h(W_{OV})-Y^{\top}-c(\underline{W})\right\|_{\max}$
		$\displaystyle\leq\epsilon/\mathrm{poly}(L).$

By $k_{1},d=L^{o(1)}$, computing $\underbrace{V_{1}^{\top}}_{k_{1}\times L}\underbrace{h(W_{OV})}_{L\times d}$ and then $\underbrace{h(W_{OV})}_{L\times d}\underbrace{\left(h(W_{OV})^{\top}V_{1}\right)}_{d\times k_{1}}$ takes only $L^{1+o(1)}$ time. This completes the proof. ∎
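The low-rank form of $\widetilde{q}$ can be assembled without ever building an $L\times L$ matrix. The sketch below assumes $\widetilde{q}=\widetilde{c}\,h^{\top}$ with $\widetilde{c}=U_{1}V_{1}^{\top}h-Y^{\top}$, following Definition G.6, and uses the hypothetical factor names `U2` and `V2`; `Yt` stands for $Y^{\top}$ and all inputs are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
L, d, k1 = 50, 3, 4
U1, V1 = rng.normal(size=(L, k1)), rng.normal(size=(L, k1))
h, Yt = rng.normal(size=(L, d)), rng.normal(size=(L, d))

# Dense reference (U1 V1^T h - Y^T) h^T, used for checking only.
q_dense = (U1 @ V1.T @ h - Yt) @ h.T

# Factored form: every product involves at most L * k1 * d operations.
M = V1.T @ h                         # k1 x d
U2 = np.hstack([U1, -Yt])            # L x (k1 + d)
V2 = np.hstack([h @ M.T, h])         # L x (k1 + d)

q_gap = np.abs(U2 @ V2.T - q_dense).max()
print(U2.shape, q_gap)
```

Here the factor width is $k_{1}+d$, which stays $L^{o(1)}$ whenever $k_{1}$ and $d$ do.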

G.1.3 Low-Rank Approximations of Building Blocks II

Now, we use the low-rank approximations of $f,q,c$ to construct low-rank approximations for $p_{1}(\cdot),p_{2}(\cdot),p(\cdot)$ .

Lemma G.6 (Approximate $p_{1}(\cdot)$ ).

Let $k_{1},k_{2}=L^{o(1)}$ . Suppose $U_{1},V_{1}\in\mathbb{R}^{L\times k_{1}}$ approximates $f(\underline{W})\in\mathbb{R}^{L\times L}$ such that $\left\|U_{1}V_{1}^{\top}-f(\underline{W})\right\|_{\max}\leq\epsilon/\mathrm{% poly}(L)$ , and $U_{2},V_{2}\in\mathbb{R}^{L\times k_{2}}$ approximates the $q(\underline{W})\in\mathbb{R}^{L\times L}$ such that $\left\|U_{2}V_{2}^{\top}-q(\underline{W})\right\|_{\max}\leq\epsilon/\mathrm{% poly}(L)$ . Then there exist two matrices $U_{3},V_{3}\in\mathbb{R}^{L\times k_{3}}$ such that $\left\|U_{3}V_{3}^{\top}-p_{1}(\underline{W})\right\|_{\max}\leq$ $\epsilon/\mathrm{poly}(L)$ . In addition, it takes $L^{1+o(1)}$ time to construct $U_{3},V_{3}$ .

Proof of Lemma G.6.

By tensor trick, we construct $U_{3}$ , $V_{3}$ as tensor products of $U_{1},V_{1}$ and $U_{2},V_{2}$ , respectively, while preserving their low-rank structures. Then, we show the low-rank approximation of $p_{1}(\cdot)$ with bounded error by Lemma G.3 and Lemma G.5.

Let $\oslash$ be the row-wise Kronecker product such that $(A\oslash B)[i,\cdot]\coloneqq A[i,\cdot]\otimes B[i,\cdot]$ for each $i\in[L]$, so that $A\oslash B\in\mathbb{R}^{L\times k_{1}k_{2}}$ for $A\in\mathbb{R}^{L\times k_{1}},B\in\mathbb{R}^{L\times k_{2}}$.

Let $\widetilde{f}(\underline{W})\coloneqq U_{1}V_{1}^{\mathsf{T}}$ and $\widetilde{q}(\underline{W})\coloneqq U_{2}V_{2}^{\mathsf{T}}$ denote matrix-multiplication approximations to $f(\underline{W})$ and $q(\underline{W})$ , respectively.

For ease of presentation, let $U_{3}=\overbrace{U_{1}}^{L\times k_{1}}\oslash\overbrace{U_{2}}^{L\times k_{2}}$ and $V_{3}=\overbrace{V_{1}}^{L\times k_{1}}\oslash\overbrace{V_{2}}^{L\times k_{2}}$, so that $k_{3}=k_{1}k_{2}$. It holds that

	$\displaystyle\leavevmode\nobreak\ \left\|U_{3}V_{3}^{\top}-p_{1}(\underline{W})\right\|_{\max}$
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \left\|U_{3}V_{3}^{\top}-f(\underline{W})\odot q(\underline{W})\right\|_{\max}$	(By $p_{1}(\underline{W})=f(\underline{W})\odot q(\underline{W})$)
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \left\|\left(U_{1}\oslash U_{2}\right)\left(V_{1}\oslash V_{2}\right)^{\top}-f(\underline{W})\odot q(\underline{W})\right\|_{\max}$
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \left\|\left(U_{1}V_{1}^{\top}\right)\odot\left(U_{2}V_{2}^{\top}\right)-f(\underline{W})\odot q(\underline{W})\right\|_{\max}$
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \|\widetilde{f}(\underline{W})\odot\widetilde{q}(\underline{W})-f(\underline{W})\odot q(\underline{W})\|_{\max}$
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ \underbrace{\|\widetilde{f}(\underline{W})\odot\widetilde{q}(\underline{W})-\widetilde{f}(\underline{W})\odot q(\underline{W})\|_{\max}}_{\leq\epsilon/\mathrm{poly}(L)}+\underbrace{\|\widetilde{f}(\underline{W})\odot q(\underline{W})-f(\underline{W})\odot q(\underline{W})\|_{\max}}_{\leq\epsilon/\mathrm{poly}(L)}$
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ \epsilon/\mathrm{poly}(L).$	(By Lemma G.3 and Lemma G.5)

Computationally, by $k_{1},k_{2}=L^{o(1)}$ , computing $U_{3}$ and $V_{3}$ takes $L^{1+o(1)}$ time. This completes the proof. ∎
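The key identity in this proof, $(U_{1}\oslash U_{2})(V_{1}\oslash V_{2})^{\top}=(U_{1}V_{1}^{\top})\odot(U_{2}V_{2}^{\top})$, can be verified directly. The sketch below takes $\oslash$ to act row by row, so that each row of $A\oslash B$ is the Kronecker product of the corresponding rows of $A$ and $B$; the helper name `rowwise_kron` is hypothetical and the factors are random placeholders.

```python
import numpy as np

def rowwise_kron(A, B):
    # Row i of the result is kron(A[i], B[i]); output shape is (L, kA * kB).
    return (A[:, :, None] * B[:, None, :]).reshape(A.shape[0], -1)

rng = np.random.default_rng(5)
L, k1, k2 = 40, 3, 4
U1, V1 = rng.normal(size=(L, k1)), rng.normal(size=(L, k1))
U2, V2 = rng.normal(size=(L, k2)), rng.normal(size=(L, k2))

U3, V3 = rowwise_kron(U1, U2), rowwise_kron(V1, V2)
hadamard_gap = np.abs(U3 @ V3.T - (U1 @ V1.T) * (U2 @ V2.T)).max()
print(U3.shape, hadamard_gap)
```

Building `U3` and `V3` costs $O(Lk_{1}k_{2})$ operations, which matches the $L^{1+o(1)}$ construction time stated in the lemma.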

Lemma G.7 (Approximate $p_{2}(\cdot)$ ).

Let $k_{1},k_{2},k_{4}=L^{o(1)}$. Let $p_{2}(\underline{W})\in\mathbb{R}^{L\times L}$ follow Definition G.7 such that its $j_{0}$-th row is $p_{2}(\underline{W})_{j_{0}}^{\top}$ with $p_{2}(\underline{W})_{j_{0}}=f(\underline{W})_{j_{0}}f(\underline{W})_{j_{0}}^{\top}q(\underline{W})_{j_{0}}$ for each $j_{0}\in[L]$. Suppose $U_{1},V_{1}\in\mathbb{R}^{L\times k_{1}}$ approximate $f(\underline{W})$ such that $\left\|U_{1}V_{1}^{\top}-f(\underline{W})\right\|_{\max}\leq\epsilon/\mathrm{poly}(L)$, and $U_{2},V_{2}\in\mathbb{R}^{L\times k_{2}}$ approximate $q(\underline{W})\in\mathbb{R}^{L\times L}$ such that $\left\|U_{2}V_{2}^{\top}-q(\underline{W})\right\|_{\max}\leq\epsilon/\mathrm{poly}(L)$. Then there exist matrices $U_{4},V_{4}\in\mathbb{R}^{L\times k_{4}}$ such that $\left\|U_{4}V_{4}^{\top}-p_{2}(\underline{W})\right\|_{\max}\leq\epsilon/\mathrm{poly}(L)$. In addition, it takes $L^{1+o(1)}$ time to construct $U_{4},V_{4}$.

Proof of Lemma G.7.

From Definition G.7,

\displaystyle p_{2}(\underline{W})_{j_{0}}\coloneqq\overbrace{f\left(\underline{W}\right)_{j_{0}}\underbrace{f\left(\underline{W}\right)_{j_{0}}^{\top}q(\underline{W})_{j_{0}}}_{(I)}}^{(II)}.

For (I), we show its low-rank approximation by observing the low-rank-preserving property of the multiplication between $f(\cdot)$ and $q(\cdot)$ (from Lemma G.3 and Lemma G.5). For (II), we show its low-rank approximation by the low-rank structure of $f(\cdot)$ and (I).

Part (I).

We define a function $r(\underline{W}):\mathbb{R}^{d^{2}}\to\mathbb{R}^{L}$ such that the $j_{0}$-th component is $r(\underline{W})_{j_{0}}\coloneqq\left(f(\underline{W})_{j_{0}}\right)^{\top}q(\underline{W})_{j_{0}}$ for all $j_{0}\in[L]$. Let $\widetilde{r}(\underline{W})$ denote the approximation of $r(\underline{W})$ via decomposing into $f(\cdot)$ and $q(\cdot)$:

	$\displaystyle\widetilde{r}(\underline{W})_{j_{0}}$	$\displaystyle\coloneqq\left\langle\widetilde{f}(\underline{W})_{j_{0}},\widetilde{q}(\underline{W})_{j_{0}}\right\rangle=\left(U_{1}V_{1}^{\top}\right)[j_{0},\cdot]\cdot\left[\left(U_{2}V_{2}^{\top}\right)[j_{0},\cdot]\right]^{\top}$
		$\displaystyle=U_{1}[j_{0},\cdot]\underbrace{V_{1}^{\top}}_{{k_{1}\times L}}\underbrace{V_{2}}_{{L\times k_{2}}}\left(U_{2}[j_{0},\cdot]\right)^{\top},$		(G.6)

for all $j_{0}\in[L]$. Since the $j_{0}$-th row of $p_{2}(\underline{W})$ is $p_{2}(\underline{W})_{j_{0}}^{\top}=r(\underline{W})_{j_{0}}f(\underline{W})_{j_{0}}^{\top}$ (Definition G.7), this allows us to write ${p}_{2}(\underline{W})=\mathop{\rm{diag}}({r}(\underline{W})){f}(\underline{W})$, with $\mathop{\rm{diag}}(r(\underline{W}))$ denoting the diagonal matrix whose diagonal entries are the components of $r(\underline{W})$.

Part (II).

With $r(\cdot)$, we approximate $p_{2}(\cdot)$ with $\widetilde{p}_{2}(\underline{W})=\mathop{\rm{diag}}(\widetilde{r}(\underline{W}))\widetilde{f}(\underline{W})$ as follows.

Since $\widetilde{f}(\underline{W})$ has a low-rank representation and $\mathop{\rm{diag}}(\widetilde{r}(\underline{W}))$ is a diagonal matrix, $\widetilde{p}_{2}(\cdot)$ has a low-rank representation by definition. Thus, we set $\widetilde{p}_{2}(\underline{W})=U_{4}V_{4}^{\mathsf{T}}$ with $U_{4}=\mathop{\rm{diag}}(\widetilde{r}(\underline{W}))U_{1}$ and $V_{4}=V_{1}$. Then, we bound the approximation error

	$\displaystyle\leavevmode\nobreak\ \left\|U_{4}V_{4}^{\top}-p_{2}(\underline{W})\right\|_{\max}$
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \left\|\widetilde{p}_{2}(\underline{W})-p_{2}(\underline{W})\right\|_{\max}$
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \max_{j_{0}\in[L]}\left\|{\widetilde{f}(\underline{W})_{j_{0}}\widetilde{r}(\underline{W})_{j_{0}}-f(\underline{W})_{j_{0}}r(\underline{W})_{j_{0}}}\right\|_{\max}$
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ \max_{j_{0}\in[L]}\left[\left\|\widetilde{f}(\underline{W})_{j_{0}}\widetilde{r}(\underline{W})_{j_{0}}-\widetilde{f}(\underline{W})_{j_{0}}{r}(\underline{W})_{j_{0}}\right\|_{\max}+\left\|\widetilde{f}(\underline{W})_{j_{0}}r(\underline{W})_{j_{0}}-f(\underline{W})_{j_{0}}r(\underline{W})_{j_{0}}\right\|_{\max}\right]$	(By triangle inequality)
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ \epsilon/\mathrm{poly}(L).$

Computationally, computing $V_{1}^{\top}V_{2}$ takes $L^{1+o(1)}$ time by $k_{1},k_{2}=L^{o(1)}$. Once we have $V_{1}^{\top}V_{2}$ precomputed, (G.6) only takes $O(k_{1}k_{2})$ time for each $j_{0}\in[L]$. Thus, the total time is $O\left(Lk_{1}k_{2}\right)=L^{1+o(1)}$. Since $U_{1}$ and $V_{1}$ take $L^{1+o(1)}$ time to construct and $U_{4}=\underbrace{\mathop{\rm{diag}}(\widetilde{r}(\underline{W}))}_{L\times L}\underbrace{U_{1}}_{L\times k_{1}}$ also takes $L^{1+o(1)}$ time, $U_{4}$ and $V_{4}$ take $L^{1+o(1)}$ time to construct. This completes the proof. ∎
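The two-stage computation in this proof can be sketched as follows. The sketch assumes the row convention of Definition G.7, under which the $j_{0}$-th row of $p_{2}$ is $r_{j_{0}}f_{j_{0}}^{\top}$, so the diagonal scaling is applied to the rows of the $f$-factor. The factors are random placeholders and the dense matrices are built only to check the result.

```python
import numpy as np

rng = np.random.default_rng(6)
L, k1, k2 = 60, 3, 4
U1, V1 = rng.normal(size=(L, k1)), rng.normal(size=(L, k1))
U2, V2 = rng.normal(size=(L, k2)), rng.normal(size=(L, k2))

# Precompute the small k1 x k2 matrix once; each r_j then costs O(k1 k2).
G = V1.T @ V2
r = np.einsum("ja,ab,jb->j", U1, G, U2)    # r_j = U1[j] G U2[j]^T, as in (G.6)

# Scale the rows of the f-factor: p2_tilde = diag(r) U1 V1^T.
U4, V4 = r[:, None] * U1, V1

# Dense reference, for checking only.
f_t, q_t = U1 @ V1.T, U2 @ V2.T
p2_dense = np.stack([f_t[j] * (f_t[j] @ q_t[j]) for j in range(L)])
p2_gap = np.abs(U4 @ V4.T - p2_dense).max()
print(p2_gap)
```

The factor width of `U4` and `V4` stays at $k_{1}$, so this step adds no rank beyond that of the $f$-factors.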

G.2 Proof of Theorem 4.1

Proof of Theorem 4.1.

By the definitions of matrices $p(\cdot)$ , $p_{1}(\cdot)$ and $p_{2}(\cdot)$ (Definition G.7), we have

\displaystyle p(\underline{W})=p_{1}(\underline{W})-p_{2}(\underline{W}).

By Lemma G.2, we have

\displaystyle\derivative{g_{2}}{\underline{W}}=\operatorname{vec}\left(A_{1}p(\underline{W})A_{2}^{\top}\right).

(G.7)

To show the existence of $L^{1+o(1)}$ algorithms for DiT backward computation Problem 1, we prove fast low-rank approximations for $A_{1}p_{1}(\underline{W})A_{2}^{\top}$ and $A_{1}p_{2}(\underline{W})A_{2}^{\top}$ as follows.

Let $\widetilde{p}_{1}(\underline{W}),\widetilde{p_{2}}(\underline{W})$ denote the approximations to $p_{1}(\underline{W}),p_{2}(\underline{W})$ , respectively.

By Lemma G.6, it takes $L^{1+o(1)}$ time to construct $U_{3},V_{3}\in\mathbb{R}^{L\times k_{3}}$ such that

\displaystyle A_{1}\widetilde{p}_{1}(\underline{W})A_{2}^{\top}=A_{1}U_{3}V_{3}^{\top}A_{2}^{\top}.

Then, computing $\underbrace{A_{1}}_{d\times L}\underbrace{U_{3}}_{L\times k_{3}}\underbrace{V_{3}^{\top}}_{k_{3}\times L}\underbrace{A_{2}^{\top}}_{L\times d}$ takes $L^{1+o(1)}$ time due to the fact that $d,k_{3}=L^{o(1)}$.

Therefore, total running time for $A_{1}p_{1}(\underline{W})A_{2}^{\top}$ is $L\cdot L^{o(1)}=L^{1+o(1)}$ .

For the same reason (by Lemma G.7), total running time for $A_{1}p_{2}(\underline{W})A_{2}^{\top}$ is $L\cdot L^{o(1)}=L^{1+o(1)}$ .

Lastly, we have

	$\displaystyle\leavevmode\nobreak\ \left\|\partialderivative{g_{2}}{\underline{W}}-\widetilde{G}^{(W)}\right\|_{\max}$
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \left\|\operatorname{vec}\left(A_{1}p(\underline{W})A_{2}^{\top}\right)-\operatorname{vec}\left(A_{1}\widetilde{p}(\underline{W})A_{2}^{\top}\right)\right\|_{\max}$	(By Lemma G.2)
$\displaystyle=$	$\displaystyle\leavevmode\nobreak\ \left\|\left(A_{1}p(\underline{W})A_{2}^{\top}\right)-\left(A_{1}\widetilde{p}(\underline{W})A_{2}^{\top}\right)\right\|_{\max}$	(By definition, $\norm{A}_{\max}\coloneqq\max_{i,j}\absolutevalue{A_{ij}}$ for any matrix $A$)
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ \left\|\left(A_{1}\left[p_{1}(\underline{W})-\widetilde{p}_{1}(\underline{W})\right]A_{2}^{\top}\right)\right\|_{\max}+\left\|\left(A_{1}\left[p_{2}(\underline{W})-\widetilde{p}_{2}(\underline{W})\right]A_{2}^{\top}\right)\right\|_{\max}$	(By Definition G.7 and triangle inequality)
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ \norm{A_{1}}_{\infty}\norm{A_{2}}_{\infty}\left(\left\|\left(p_{1}(\underline{W})-\widetilde{p}_{1}(\underline{W})\right)\right\|_{\max}+\left\|\left(p_{2}(\underline{W})-\widetilde{p}_{2}(\underline{W})\right)\right\|_{\max}\right)$	(By the sub-multiplicative property of $\norm{\cdot}_{\infty}$)
$\displaystyle\leq$	$\displaystyle\leavevmode\nobreak\ \epsilon/\mathrm{poly}(L).$	(By Lemma G.6 and Lemma G.7)

Set $\epsilon=1/\mathrm{poly}(L)$ . We complete the proof. ∎
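The final assembly step depends on the order of multiplication: grouping the product as $(A_{1}U)(A_{2}V)^{\top}$ avoids forming any $L\times L$ matrix. A minimal sketch follows, with random placeholders standing in for the factors of $\widetilde{p}_{1}$ and $\widetilde{p}_{2}$; the helper name `grad_from_factors` is hypothetical.

```python
import numpy as np

def grad_from_factors(A1, A2, U, V):
    # (A1 U)(A2 V)^T: two d x k products, then one d x d product, O(L d k) total.
    return (A1 @ U) @ (A2 @ V).T

rng = np.random.default_rng(7)
d, L, k = 4, 80, 6
A1, A2 = rng.normal(size=(d, L)), rng.normal(size=(d, L))
U3, V3 = rng.normal(size=(L, k)), rng.normal(size=(L, k))   # stand-ins for the p1 factors
U4, V4 = rng.normal(size=(L, k)), rng.normal(size=(L, k))   # stand-ins for the p2 factors

G_fast = grad_from_factors(A1, A2, U3, V3) - grad_from_factors(A1, A2, U4, V4)
G_dense = A1 @ (U3 @ V3.T - U4 @ V4.T) @ A2.T               # builds an L x L matrix
assembly_gap = np.abs(G_fast - G_dense).max()
print(assembly_gap)
```

Both routes give the same $d\times d$ matrix up to rounding; only the factored route keeps the cost linear in $L$.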

References

Alman and Song [2023] Josh Alman and Zhao Song. Fast attention requires bounded entries. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023.
Alman and Song [2024a] Josh Alman and Zhao Song. The fine-grained complexity of gradient computation for training large language models. arXiv preprint arXiv:2402.04497, 2024a.
Alman and Song [2024b] Josh Alman and Zhao Song. How to capture higher-order correlations? generalizing matrix softmax attention to kronecker computation. In The Twelfth International Conference on Learning Representations (ICLR), 2024b.
Ambrogioni [2023] Luca Ambrogioni. In search of dispersed memories: Generative diffusion models are associative memory networks. arXiv preprint arXiv:2309.17290, 2023.
Bao et al. [2022] Fan Bao, Chongxuan Li, Yue Cao, and Jun Zhu. All are worth words: a vit backbone for score-based diffusion models. In NeurIPS 2022 Workshop on Score-Based Methods, 2022.
Benton et al. [2024] Joe Benton, Valentin De Bortoli, Arnaud Doucet, and George Deligiannidis. Nearly d-linear convergence bounds for diffusion models via stochastic localization. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
Bortoli [2022] Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research, 2022. ISSN 2835-8856.
Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Chen et al. [2024] Junsong Chen, Jincheng YU, Chongjian GE, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
Chen et al. [2020a] Minshuo Chen, Xingguo Li, and Tuo Zhao. On generalization bounds of a family of recurrent neural networks. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (AISTATS), volume 108, pages 1233–1243, 2020a.
Chen et al. [2020b] Minshuo Chen, Wenjing Liao, Hongyuan Zha, and Tuo Zhao. Distribution approximation and statistical estimation guarantees of generative adversarial networks. arXiv preprint arXiv:2002.03938, 2020b.
Chen et al. [2023a] Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. In International Conference on Machine Learning (ICML), pages 4672–4712. PMLR, 2023a.
Chen et al. [2023b] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations (ICLR), 2023b.
Cygan et al. [2016] Marek Cygan, Holger Dell, Daniel Lokshtanov, Dániel Marx, Jesper Nederlof, Yoshio Okamoto, Ramamohan Paturi, Saket Saurabh, and Magnus Wahlström. On problems as hard as cnf-sat. ACM Transactions on Algorithms (TALG), 12(3):1–24, 2016.
Diao et al. [2018] Huaian Diao, Zhao Song, Wen Sun, and David Woodruff. Sketching for kronecker product regression and p-splines. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1299–1308. PMLR, 2018.
Diao et al. [2019] Huaian Diao, Rajesh Jayaram, Zhao Song, Wen Sun, and David Woodruff. Optimal sketching for kronecker product regression and low rank approximation. Advances in neural information processing systems (NeurIPS), 32, 2019.
Edelman et al. [2022] Benjamin L Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning (ICML), pages 5793–5831. PMLR, 2022.
Esser et al. [2024] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
Floridi and Chiriatti [2020] Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020.
Gao et al. [2023a] Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion transformer is a strong image synthesizer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23164–23173, 2023a.
Gao et al. [2023b] Yeqi Gao, Zhao Song, Weixin Wang, and Junze Yin. A fast optimization view: Reformulating single layer attention in llm based on tensor and svm trick, and solving it in matrix multiplication time. arXiv preprint arXiv:2309.07418, 2023b.
Gao et al. [2023c] Yeqi Gao, Zhao Song, and Shenghao Xie. In-context learning for attention scheme: from single softmax regression to multiple softmax regression via a tensor trick. arXiv preprint arXiv:2307.02419, 2023c.
Gu et al. [2024] Jiuxiang Gu, Yingyu Liang, Zhenmei Shi, Zhao Song, and Yufa Zhou. Tensor attention training: Provably efficient learning of higher-order transformers. arXiv preprint arXiv:2405.16411, 2024.
Guan et al. [2024] Jiaqi Guan, Xiangxin Zhou, Yuwei Yang, Yu Bao, Jian Peng, Jianzhu Ma, Qiang Liu, Liang Wang, and Quanquan Gu. Decompdiff: diffusion models with decomposed priors for structure-based drug design. arXiv preprint arXiv:2403.07902, 2024.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Hoover et al. [2023] Benjamin Hoover, Hendrik Strobelt, Dmitry Krotov, Judy Hoffman, Zsolt Kira, and Duen Horng Chau. Memory in plain sight: A survey of the uncanny resemblances between diffusion models and associative memories. arXiv preprint arXiv:2309.16750, 2023.
Hu et al. [2023] Jerry Yao-Chieh Hu, Donglin Yang, Dennis Wu, Chenwei Xu, Bo-Yu Chen, and Han Liu. On sparse modern hopfield model. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023.
Hu et al. [2024a] Jerry Yao-Chieh Hu, Pei-Hsuan Chang, Haozheng Luo, Hong-Yu Chen, Weijian Li, Wei-Po Wang, and Han Liu. Outlier-efficient hopfield layers for large transformer-based models. In Forty-first International Conference on Machine Learning (ICML), 2024a.
Hu et al. [2024b] Jerry Yao-Chieh Hu, Bo-Yu Chen, Dennis Wu, Feng Ruan, and Han Liu. Nonparametric modern hopfield models. arXiv preprint arXiv:2404.03900, 2024b.
Hu et al. [2024c] Jerry Yao-Chieh Hu, Thomas Lin, Zhao Song, and Han Liu. On computational limits of modern hopfield models: A fine-grained complexity analysis. In Forty-first International Conference on Machine Learning (ICML), 2024c.
Impagliazzo and Paturi [2001] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-sat. Journal of Computer and System Sciences, 62(2):367–375, 2001.
Ji et al. [2021] Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V Davuluri. Dnabert: pre-trained bidirectional encoder representations from transformers model for dna-language in genome. Bioinformatics, 37(15):2112–2120, 2021.
Jiang and Li [2023] Haotian Jiang and Qianxiao Li. Approximation theory of transformer networks for sequence modeling. arXiv preprint arXiv:2305.18475, 2023.
Kajitsuka and Sato [2023] Tokio Kajitsuka and Issei Sato. Are transformers with one layer self-attention using low-rank weight matrices universal approximators? arXiv preprint arXiv:2307.14023, 2023.
Kim et al. [2022] Junghwan Kim, Michelle Kim, and Barzan Mozafari. Provable memorization capacity of transformers. In The Eleventh International Conference on Learning Representations (ICLR), 2022.
Lagler et al. [2013] Klemens Lagler, Michael Schindelegger, Johannes Böhm, Hana Krásná, and Tobias Nilsson. Gpt2: Empirical slant delay model for radio space geodetic techniques. Geophysical research letters, 40(6):1069–1073, 2013.
Liu et al. [2024] Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024.
Liu et al. [2021] Zhonghua Liu, Yue Lu, Zhihui Lai, Weihua Ou, and Kaibing Zhang. Robust sparse low-rank embedding for image dimension reduction. Applied Soft Computing, 113:107907, 2021.
Luo et al. [2023] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10209–10218. IEEE, 2023.
Ma et al. [2024] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. arXiv preprint arXiv:2401.08740, 2024.
Mahdavi et al. [2023] Sadegh Mahdavi, Renjie Liao, and Christos Thrampoulidis. Memorization capacity of multi-head attention in transformers. arXiv preprint arXiv:2306.02010, 2023.
Mo et al. [2023] Shentong Mo, Enze Xie, Ruihang Chu, Lanqing Hong, Matthias Niessner, and Zhenguo Li. DiT-3D: Exploring plain diffusion transformers for 3D shape generation. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023.
Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
Oko et al. [2023] Kazusato Oko, Shunta Akiyama, and Taiji Suzuki. Diffusion models are minimax optimal distribution estimators. In International Conference on Machine Learning (ICML), pages 26517–26582. PMLR, 2023.
Peebles and Xie [2023] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023.
Pope et al. [2021] Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. arXiv preprint arXiv:2104.08894, 2021.
Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
Ramsauer et al. [2020] Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlovic, Geir Kjetil Sandve, et al. Hopfield networks is all you need. arXiv preprint arXiv:2008.02217, 2020.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems (NeurIPS), 32, 2019.
Su and Wu [2018] Bing Su and Ying Wu. Learning low-dimensional temporal representations. In International Conference on Machine Learning (ICML), pages 4761–4770. PMLR, 2018.
Vahdat et al. [2021] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 11287–11302, 2021.
Wang et al. [2024a] Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, and Quanquan Gu. Diffusion language models are versatile protein learners. arXiv preprint arXiv:2402.18567, 2024a.
Wang et al. [2024b] Yan Wang, Lihao Wang, Yuning Shen, Yiqun Wang, Huizhuo Yuan, Yue Wu, and Quanquan Gu. Protein conformation generation via force-guided SE(3) diffusion models. arXiv preprint arXiv:2403.14088, 2024b.
Wang et al. [2023] Yihan Wang, Jatin Chauhan, Wei Wang, and Cho-Jui Hsieh. Universality and limitations of prompt tuning. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023.
Wibisono et al. [2024] Andre Wibisono, Yihong Wu, and Kaylee Yingxi Yang. Optimal score estimation via empirical bayes smoothing. arXiv preprint arXiv:2402.07747, 2024.
Williams [2018] Virginia Vassilevska Williams. On some fine-grained questions in algorithms and complexity. In Proceedings of the International Congress of Mathematicians: Rio de Janeiro 2018, pages 3447–3487. World Scientific, 2018.
Wu et al. [2024a] Dennis Wu, Jerry Yao-Chieh Hu, Teng-Yun Hsiao, and Han Liu. Uniform memory retrieval with larger capacity for modern Hopfield models. In Forty-first International Conference on Machine Learning (ICML), 2024a.
Wu et al. [2024b] Dennis Wu, Jerry Yao-Chieh Hu, Weijian Li, Bo-Yu Chen, and Han Liu. STanHop: Sparse tandem Hopfield model for memory-enhanced time series prediction. In The Twelfth International Conference on Learning Representations (ICLR), 2024b.
Yun et al. [2020] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations (ICLR), 2020.
Zheng et al. [2023] Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.
Zhou et al. [2024a] Xiangxin Zhou, Xiwei Cheng, Yuwei Yang, Yu Bao, Liang Wang, and Quanquan Gu. DecompOpt: Controllable and decomposed diffusion models for structure-based molecular optimization. arXiv preprint arXiv:2403.13829, 2024a.
Zhou et al. [2024b] Xiangxin Zhou, Dongyu Xue, Ruizhe Chen, Zaixiang Zheng, Liang Wang, and Quanquan Gu. Antigen-specific antibody design via direct energy-based preference optimization. arXiv preprint arXiv:2403.16576, 2024b.
Zhou et al. [2023] Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, and Han Liu. DNABERT-2: Efficient foundation model and benchmark for multi-species genome. arXiv preprint arXiv:2306.15006, 2023.
Zhou et al. [2024c] Zhihan Zhou, Weimin Wu, Harrison Ho, Jiayi Wang, Lizhen Shi, Ramana V Davuluri, Zhong Wang, and Han Liu. DNABERT-S: Learning species-aware DNA embedding with genome foundation models. ArXiv, 2024c.
Zhu et al. [2023] Zhenyu Zhu, Francesco Locatello, and Volkan Cevher. Sample complexity bounds for score-matching: Causal discovery and generative modeling. Advances in Neural Information Processing Systems (NeurIPS), 36, 2023.

$$
\begin{aligned}
\|\widetilde{q}(\underline{W})-q(\underline{W})\|_{\max}
&=\left\|h(W_{OV})\left(U_{1}V_{1}^{\top}h(W_{OV})-Y\right)^{\top}-h(W_{OV})Y^{\top}\right\|_{\max}\\
&\leq d\left\|h(W_{OV})\right\|_{\max}\left\|U_{1}V_{1}^{\top}h(W_{OV})-Y-c(\underline{W})\right\|_{\max}\\
&\leq\epsilon/\mathrm{poly}(L).
\end{aligned}
$$

1 Introduction

Question 1.

Question 2.

Question 3.

Contributions.

Organization.

Notations.

2 Background

2.1 Score-Matching Denoising Diffusion Models

Forward and Backward Process.

Score Matching.

2.2 Score Decomposition in Linear Latent Space

Assumption 2.1 (Low-Dimensional Linear Latent Space).

Remark 2.1.

Lemma 2.1 (Score Decomposition, Lemma 1 of (Chen et al., 2023a)).

Assumption 2.2 (Tail Behavior of $P_{h}$).

2.3 Score Network and Transformers

(Latent) Score Network.

Transformers.

3 Statistical Rates of Latent DiTs with Subspace Data Assumption

3.1 DiT Score Network Class

Definition 3.1 (DiT Reshape Layer $R(\cdot)$).

Definition 3.2 (Transformer Network Class $\mathcal{T}_{p}^{r,m,l}$).

3.2 Score Approximation of DiT

Theorem 3.1 (Score Approximation of DiT).

Proof Sketch.

Remark 3.1 (Comparing with Existing Works).

Remark 3.2 (Latent Dimension Dependency).

3.3 Score Estimation and Distribution Estimation

Score Estimation.

Corollary 3.1.1 (Score Estimation of DiT).

Proof.

Remark 3.3 (Comparing with Existing Works).

Remark 3.4.

Definition 3.4.

Distribution Estimation.

Corollary 3.1.2 (Distribution Estimation of DiT, Modified From Theorem 3 of (Chen et al., 2023a)).

Proof.

Remark 3.5 (Comparing with Existing Works).

Remark 3.6 (Subspace Recovery Accuracy).

4 Provably Efficient Criteria

4.1 Computational Limits of Backward Computation

Definition 4.1 (Training Generic DiT Loss).

Remark 4.1 (Conditional and Unconditional Generation).

Problem 1 (Approximate DiT Gradient Computation ($\textsc{ADiTGC}(L,d,\Gamma,\epsilon)$)).

Theorem 4.1 (Existence of Almost-Linear Time Algorithms for ADiTGC).

Proof Sketch.

Remark 4.2.

4.2 Computational Limits of Forward Inference

Problem 2 (Approximate DiT Inference $\textsc{ADiTI}(d,L,\Gamma,\delta_{F})$).

Proposition 4.1 (Norm-Based Efficiency Phase Transition).

Remark 4.3.

Proposition 4.2 (Almost-Linear Time DiT Inference).

Remark 4.4.

Remark 4.5.

5 Discussion and Conclusion

Broader Impact

Acknowledgments

Appendix

Appendix A More Discussion on Low-Dimensional Linear Latent Space

Appendix B Nomenclature Table

Appendix C Related Works

Organization.

Diffusion Transformers.

Universality and Memory Capacity of Transformers.

Theories of Diffusion Models.

Transformers in Foundation Models: Transformer-Based Pretrained Models.

Appendix D Supplementary Theoretical Background

D.1 Diffusion Models

Forward Process.

Backward Process.

D.2 Proof of Lemma 2.1

Proof.

D.3 Preliminaries: Strong Exponential Time Hypothesis (SETH) and Tensor Trick

Hypothesis 1 (SETH).

Definition D.1 (Vectorization).

Definition D.2 (Matrixization).

Definition D.3 (Kronecker Product).

Definition D.4 (Sub-Block of a Tensor).

Lemma D.1 (Tensor Trick (Diao et al., 2019, 2018)).
