License: arXiv.org perpetual non-exclusive license
arXiv:2311.12997v2 [cs.LG] 05 Feb 2024

Compositional Capabilities of Autoregressive Transformers:
A Study on Synthetic, Interpretable Tasks

Rahul Ramesh (Computer and Information Science, University of Pennsylvania)
Ekdeep Singh Lubana (Center for Brain Science, Harvard University; Electrical Engineering and Computer Science, University of Michigan)
Mikail Khona (Physics, MIT)
Robert P. Dick (Electrical Engineering and Computer Science, University of Michigan)
Hidenori Tanaka (Center for Brain Science, Harvard University; Physics & Information Laboratories, NTT Research)

Abstract

Transformers trained on huge text corpora exhibit a remarkable set of capabilities, e.g., performing basic arithmetic. Given the inherent compositional nature of language, one can expect the model to learn to compose these capabilities, potentially yielding a combinatorial explosion of what operations it can perform on an input. Motivated by the above, we train autoregressive Transformer models on a synthetic data-generating process that involves compositions of a set of well-defined monolithic capabilities. Through a series of extensive and systematic experiments on this data-generating process, we show that: (1) autoregressive Transformers can learn compositional structures from small amounts of training data and generalize to exponentially or even combinatorially many functions; (2) generating intermediate outputs when composing functions is more effective for generalizing to new, unseen compositions than not generating any intermediate outputs; (3) biases in the order of the compositions in the training data result in Transformers that fail to compose some combinations of functions; and (4) the attention layers select which capability to apply while the feed-forward layers execute the selected capability.

1 Introduction

Large scale Transformers pretrained on huge text corpora have revolutionized machine learning in recent years (Radford et al., 2018, 2019; Brown et al., 2020; Sanh et al., 2021; Wei et al., 2021; Thoppilan et al., 2022; Touvron et al., 2023). Due to an ever-increasing interest in adopting these models in our daily lives, evaluating and predicting their capabilities has become increasingly important (Bommasani et al., 2021; Ganguli et al., 2022; Shevlane et al., 2023; Rae et al., 2021; Hoffmann et al., 2022; Tay et al., 2022; Henighan et al., 2020; Hernandez et al., 2021; Sharma & Kaplan, 2020). Motivated by this, recent works have performed extensive empirical analyses to understand the possibilities and limitations of using these models in practical tasks of interest.

Figure 1: Signatures of compositionality. ChatGPT (Bubeck et al., 2023) correctly responds to prompts that require composition of atomic arithmetic capabilities (sum, cube, square)—we argue these prompts are unlikely to be in the training data. However, the model does not always compose reliably (top-right panel). This motivates us to study the extent to which a Transformer can learn to compose its capabilities by mere pretraining on a compositional domain.

For example, such works show large language models (LLMs) can generate coherent text completions based on a provided context, perform code generation and debugging, use online APIs and tools in an automated manner, and even solve multimodal problems such as image captioning (Wei et al., 2022a; Bubeck et al., 2023; Austin et al., 2021; Chen et al., 2021; Lee et al., 2023; Liang et al., 2022; Qin et al., 2023; Liu et al., 2023; Suzgun et al., 2022; Srivastava et al., 2022). While such benchmarking of pretrained models is extremely valuable, it often focuses on evaluating rather “narrow” or “atomic” capabilities; for example, the ability to identify whether a given passage of text is biased or toxic (Gehman et al., 2020; Liang et al., 2022). However, given the compositional nature of training data (such as language), a model could learn to compose its atomic capabilities and perform complex tasks that it was never explicitly trained for. This can lead to an underestimation of the capabilities of the model; vice versa, if the model does not learn to compose, we can be certain that benchmarking for atomic capabilities is sufficient to characterize the model.

Motivated by the above, we analyze if a Transformer trained on a compositional data-generating process, without any special modifications to the usual training pipeline, can learn both relevant atomic capabilities and an ability to compose those capabilities. Bubeck et al. (2023) recently showed that LLMs exhibit “sparks” of such compositionality, e.g., generating text that merges content of varying styles or evaluating mathematical expressions through the application of a sequence of functions (Fig. 1). However, due to their black-box nature, it is unclear if an LLM actually learns to compose capabilities or merely memorizes relevant samples from its training data. Moreover, while interacting with an LLM, it can be difficult to guarantee that we are utilizing a prompt that will appropriately guide the model to use the capabilities we desire, let alone compose them.

To circumvent challenges faced with LLMs pretrained on real world data and focus on our specific motivation, “can an autoregressive Transformer trained on compositional data learn to compose its capabilities”, we choose to limit the purview of this work to a well-defined synthetic domain. This is similar in spirit to recent works that utilize synthetic datasets generated using objects like first-order logic machines, context-free grammars, linear regressors, modular arithmetic, and even board games to establish and understand phenomenology of modern neural networks (Liu et al., 2022; Allen-Zhu & Li, 2023c, a, b; Garg et al., 2022; Li et al., 2023c; Saparov & He, 2022; Chan et al., 2022; Bhattamishra et al., 2020; Zhou et al., 2023; Nanda et al., 2023a, b; Li et al., 2023a; Lubana et al., 2023; Jones, 2021). The goal of such works, including ours, is to develop interpretable demonstrations and mechanistic hypotheses that enable a characterization of the target phenomenology in a controlled setting. Accordingly, we emphasize that we do not intend to develop novel protocols for improving Transformers’ ability to compositionally generalize, but rather to demonstrate its existence and understand what drives it. Overall, we make the following contributions.

  • A minimal synthetic setup for characterizing Transformers’ ability to compose. We propose a minimal setup involving compositions of predefined functions $\mathcal{F}$ (bijections and permutations) that operate on a string of arbitrary tokens (Section 3), which allows us to precisely study the ability of Transformers to compose functions. Motivated by instruction induction and tuning in LLMs (Honovich et al., 2022; Wei et al., 2021), we instantiate a notion of “task tokens” which specify what functions are to be applied to the input string. This helps us avoid any ambiguity in task-specification (Shah et al., 2022).

  • Transformers show explosion of capabilities. We characterize the ability of a Transformer trained on our proposed setup to compositionally generalize, i.e., to apply a composition of specific functions chosen from $\mathcal{F}$ to an input string. We show that a Transformer, trained on very few compositions, can generalize to exponentially or even combinatorially many functions (Section 4.1). These functions are entirely “out-of-distribution”, i.e., the model never sees them in its training data and hence was not explicitly trained to learn them. Crucially, allowing the model to recursively process its intermediate outputs, i.e., stepwise inference (Kojima et al., 2022; Wei et al., 2022b), significantly improves compositional generalization (Sections 4.3 and C.3).

  • Characterizing limitations and mechanisms of compositionality in a Transformer. We formalize a notion of “distance” between the functions seen by the model during pretraining and the ones it is evaluated on, hence enabling a precise characterization of when the model struggles to compose (Section 4.2). As we show, the training data determines whether the Transformer generalizes to an exponential or combinatorial set of functions, which we call in-order and out-of-order generalization, respectively. Furthermore, linear probing (Tenney et al., 2019; Li et al., 2023a) and an analysis of the attention maps suggest the following mechanism for solving our task: the attention layer selects the task token and the fully connected layers compute the function corresponding to it (Section 4.4). We also prove the existence of Transformers that can compositionally generalize to our task and analyze why stepwise inference helps with it (Appendix C). Our mechanistic analysis and theoretical construction align extremely well.

2 Related Work

Capabilities in a Transformer.

Transformers pretrained on large-scale, web-crawled datasets have been shown to exhibit a slew of interesting capabilities, such as basic arithmetic, question answering, commonsense knowledge reasoning, stylistic transformation of a piece of text, and even multimodal reasoning (Radford et al., 2018, 2019; Brown et al., 2020; Bubeck et al., 2023; Wei et al., 2022a, 2021; Rae et al., 2021; Chowdhery et al., 2022; Austin et al., 2021; Chen et al., 2021; Bommasani et al., 2021). However, this generality can come at the cost of a model also learning capabilities that are undesirable (Bommasani et al., 2021; Tamkin et al., 2021; Chan et al., 2023), e.g., producing sensitive, biased, or toxic outputs (Weidinger et al., 2021; McGuffie & Newhouse, 2020; Garrido-Muñoz et al., 2021; Lin et al., 2021; Jiang et al., 2021; Abid et al., 2021; Parrish et al., 2021; Xu et al., 2021; Huang et al., 2019; Sheng et al., 2019; Gehman et al., 2020; Xu et al., 2020; Tamkin et al., 2021). This has motivated several works focused on understanding capabilities of a pretrained model, including (i) predicting capabilities of a future model, e.g., via fitting power laws to data/model scaling results (Rae et al., 2021; Hoffmann et al., 2022; Hernandez et al., 2021; Sharma & Kaplan, 2020; Arora & Goyal, 2023) and (ii) eliciting capabilities of a given model, e.g., via identification of appropriate prompts or via step-wise inference protocols such as chain-of-thought, to understand what tasks a model can be reliably used for (Liang et al., 2022; Suzgun et al., 2022; Lee et al., 2023). However, we argue that measuring a language model’s performance on benchmarks to identify the existence of a set of capabilities is bound to be insufficient for characterizing what tasks it can perform: given the compositional nature of data these models are trained on, it is possible that they learn to compose capabilities, hence learning to perform several more tasks than we explicitly train them on. 
In fact, with a related motivation, the contemporaneous work of Yu et al. (2023) designs a benchmark for evaluating a model’s ability to combine its skills.

Compositionality in neural networks.

The ability to compositionally reason has been touted as a cornerstone of human intelligence (Fodor & Lepore, 2002; Fodor & Pylyshyn, 1988; Fodor, 1975; Schulz et al., 2016). Accordingly, several works have studied the ability of a neural network to compositionally generalize, usually demonstrating a negative result, and correspondingly developing explicit strategies that help improve the model’s ability to generalize (Liška et al., 2018; Hupkes et al., 2018; Lake & Baroni, 2018; Csordás et al., 2021b, a, 2022; Ontanón et al., 2021; Lepori et al., 2023; Lewis et al., 2022; Yun et al., 2022; Okawa et al., 2023; Hosseini et al., 2022). Our work differs from prior literature in several ways. First, we do not intend to develop protocols for improving compositional generalization in a Transformer; instead, we show that Transformers can learn to compose their capabilities and perform tasks they were never explicitly trained on, with autoregressive training on tokens from a compositional data-generating process. To this end, we define a synthetic task that allows for perfect task specification and which avoids ambiguity from prompt misspecification. While similar to the compositional table lookup task used in prior work (Liška et al., 2018; Csordás et al., 2022), our task involves a much larger set of capabilities to train and test for (3125 or 4 million, depending on the setup, compared to 128 capabilities in prior work). Second, we aim to understand the extent of compositional generalization in a Transformer trained on our proposed domain, i.e., what kind of compositions does the model fail to perform and when. We define a framework to precisely characterize these failure modes and use the popular linear probing protocol for understanding model internals to show the critical role of attention layers in enabling compositionality (Li et al., 2023a). 
Finally, we analyze the impact of step-wise inference protocols, wherein intermediate outputs generated by the model are recursively passed to it as inputs, and which have recently been used for solving several challenging benchmark tasks (Suzgun et al., 2022; Wei et al., 2022b). Similar to our work, Li et al. (2023c) study step-wise inference in Transformers trained on synthetic data from a compositional data-generating process. However, there are notable differences: we show that Transformers compositionally generalize to combinatorially many new functions, and carefully controlling the training data allows us to highlight the benefit of step-wise inference. Furthermore, Li et al. (2023b) study compositionality with prompts used for in-context learning (Garg et al., 2022), while our synthetic setup avoids ambiguity in specifying the compositions. Many other works that study whether Transformers can compositionally generalize (Csordás et al., 2021a; Ontanón et al., 2021) focus on compositionality within a single forward pass, i.e., the model is not allowed to recursively process its inputs. We find the use of intermediate outputs significantly simplifies the problem and, given its popularity in practical scenarios (Kojima et al., 2022; Wei et al., 2022b), our results serve as a demonstration that inference protocols that allow Transformers to recursively refine their outputs can lead to a wide range of capabilities, especially ones that we never explicitly train the model for.

3 Formalizing capabilities and compositions

As noted by Hupkes et al. (2020), despite extensive work exploring compositionality in neural networks, the term is often used for several related concepts. To avoid ambiguity, we thus present a definition of a “compositional model” that captures our intended notion and, correspondingly, describe the data-generating process used in this work to understand Transformers’ ability to compose. Let $\mathcal{F}$ denote a set of predefined automorphisms, i.e., any given function $F$ from the set defines a map between points from its input space to the same space. This is motivated by the fact that the input and output domains of a language model are generally the same. We define an input $x$ as a combination of two strings $[x_f, x_d]$, where $x_f \in X_f^L$ is a sequence of $L$ tokens that specify a series of $L$ functions from $\mathcal{F}$, and $x_d \in X_d^K$ denotes a sequence of $K$ tokens to which the series of $L$ functions is applied. We refer to $x_f$ as task tokens and to $x_d$ as data tokens. 
For example, let $x_{F_i}$ be the identifier that denotes that function $F_i$ is applied to the data tokens and $x_{d_k}$ denote the $k^{\text{th}}$ token from the vocabulary $X_d$. Assume $L = 2$ and $k = 1$ and define a sample $x = [x_{F_1}, x_{F_2}, x_{d_1}]$. Then, a model $M: X_f^L \times X_d^K \mapsto X_d^K$ that takes $x$ as input is expected to produce the output $F_2 \circ F_1(x_{d_1})$. 
We use $[L]$ to denote the ordered set $(1, 2, \dots, L)$.

A capability, in our setup, is defined as the ability of a model to accurately represent a function $F \in \mathcal{F}$. We emphasize that we do not expect pretrained models in practice to perfectly implement an arbitrary function; however, this idealized definition affords us precision and allows us to use accuracy over a random set of inputs to claim a model possesses a certain capability. Based on this definition, we intend to understand the set of capabilities (i.e., the set of functions) that a Transformer can implement by composing them. We formalize this as follows.

Definition 3.1 (Compositionality.).

We say a model $M(\cdot)$ compositionally generalizes if, for any subset of functions $F_i \in \mathcal{F}$, where $i \in [L]$, $M\left(\left[x_{F_1}, x_{F_2}, \cdots, x_{F_L}, x_d\right]\right) = F_L \circ \dots \circ F_2 \circ F_1\left(x_d\right)$.
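To make the target of Definition 3.1 concrete, the following is a minimal sketch of how the expected output could be computed from task and data tokens. The two functions in `FUNCTIONS` and the helper names `target_output` and `compositionally_correct` are hypothetical and chosen only for illustration; they are not the functions or code used in the paper.

```python
from functools import reduce

# Hypothetical toy capabilities on tuples of tokens from {0, ..., 9}.
# Both are element-wise bijections; they are illustrative, not the paper's functions.
FUNCTIONS = {
    "F1": lambda x: tuple((t + 1) % 10 for t in x),  # shift each token by one (mod 10)
    "F2": lambda x: tuple((3 * t) % 10 for t in x),  # multiply each token by 3 (mod 10)
}


def target_output(task_tokens, data_tokens):
    """Return F_L o ... o F_2 o F_1 (x_d): the first task token is applied first."""
    return reduce(lambda x, name: FUNCTIONS[name](x), task_tokens, tuple(data_tokens))


def compositionally_correct(model, task_tokens, data_tokens):
    """Check one input against Definition 3.1; `model` is any callable M(x_f, x_d)."""
    return tuple(model(task_tokens, data_tokens)) == target_output(task_tokens, data_tokens)


# The order of the task tokens matters because these two functions do not commute.
print(target_output(["F1", "F2"], [3, 9, 0]))
print(target_output(["F2", "F1"], [3, 9, 0]))
```

Under this reading, a model compositionally generalizes when `compositionally_correct` holds for every choice of task tokens, including combinations absent from the training data.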

Figure 2: Data generating process for in-order and out-of-order compositions. (a) Each of the $L = 5$ positions is associated with $N = 4$ functions $f_i^{[l]}$, in addition to an identity function, resulting in a total of $5 \times 4 + 1 = 21$ basis functions for composition. (b) The in-order compositions select functions within the same position while (c) out-of-order compositions allow for selecting functions across positions. Each position also includes the identity function since it allows us to compute compositions of fewer than $5$ functions. In the examples presented in (c), displaced functions are surrounded by a black line, and we then count the number of displaced functions.

In practical scenarios, we would not expect the pretraining data to present a capability in all possible scenarios that it can be used in. For example, simple arithmetic tasks like multiplication are often only seen in the context of numbers with 1–3 digits in web-crawled data (Razeghi et al., 2022), which leads to an inability of the model to perform multiplication on numbers with more digits. To model this in our setup, we create a spurious correlation between a subset of the functions from $\mathcal{F}$ and the position of their identifiers in the task tokens $x_f$. Specifically, we define $\mathcal{F}^{(l)} \subset \mathcal{F}$ as the set of functions that are allowed at the position $l$ in the task tokens $x_f$. We let $|\mathcal{F}^{(l)}| = N$ for all locations $l$, i.e., $\mathcal{F}$ is partitioned into equally sized subsets and $|\mathcal{F}| = N \times L$. The notation $F_i^{(l)}$, where $i \in [N]$ and $l \in [L]$, is used to denote the $i^{\text{th}}$ possible function at position $l$. Based on the above, we define two ways to compose $L$ functions: in-order and out-of-order (see Fig. 2).

Definition 3.2 (In-order vs. out-of-order Compositions.).

Consider the composition $\widetilde{F} = F^{(l_L)} \circ \dots \circ F^{(l_2)} \circ F^{(l_1)}(\cdot)$, where $l_i \in [L]$. Denote the ordered set $(l_1, l_2, \dots, l_L)$ as $\mathtt{order}(\widetilde{F})$. If $\mathtt{order}(\widetilde{F})$ equals the ordered set $[L]$, we say $\widetilde{F}$ is an in-order composition; else, we say it is out-of-order.

Consider a model $M$ that perfectly encodes all $N \times L$ functions from the set $\mathcal{F}$. If the model can generalize to in-order compositions of these functions, then its set of capabilities will in fact grow to exponentially many functions ($N^L$, to be precise). Further, the ability to compose out-of-order can increase this set combinatorially, i.e., proportional to $(N \times L)^L$, growing even more quickly compared to the set of in-order compositions. Such an “explosion of capabilities” would imply that it is difficult to characterize the set of all tasks that a pretrained model can perform, especially since the pretraining data used for training a model is generally unknown and hence it is hard to even characterize what “atomic” capabilities the model possesses. In our experiments, we find that while Transformers can generalize to both in-order and out-of-order compositions, the pretraining dataset for enabling out-of-order generalization must exhibit some (albeit not huge) diversity (we quantify this further when discussing our results). To empirically characterize out-of-order compositions and discuss the failure modes thereof, we find it useful to define the following notion of displacement (see Fig. 2).
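As a quick arithmetic check of these growth rates, the sketch below evaluates both counts for the values $N = 4$ and $L = 5$ used in Fig. 2. The variants that include the identity function are our reading of the setup (an assumption, not a formula stated in the text); they happen to reproduce the figures of 3125 and roughly 4 million quoted in Section 2.

```python
N, L = 4, 5  # functions per position and number of positions, as in Fig. 2

# Counts as stated in the text (identity function not included).
in_order_no_identity = N ** L            # N^L
out_of_order_no_identity = (N * L) ** L  # (N x L)^L

# Assumption: adding the identity gives N + 1 choices per position for in-order
# compositions, and N * L + 1 basis functions to choose from for out-of-order ones.
in_order_with_identity = (N + 1) ** L
out_of_order_with_identity = (N * L + 1) ** L

print(in_order_no_identity, out_of_order_no_identity)
print(in_order_with_identity, out_of_order_with_identity)
```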

Definition 3.3 (Displacement.).

Let $D(s, s')$ denote the Hamming distance between two ordered sets $s$ and $s'$. Then, the displacement of a composition $\widetilde{F}$ is defined as $D(\mathtt{order}(\widetilde{F}), [L])$.
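Definitions 3.2 and 3.3 can be sketched in a few lines. The function names `displacement` and `is_in_order` are hypothetical, and the sketch assumes every position carries a position label $l_i$; how identity functions are treated when counting displaced positions is not specified in this section and is left out.

```python
def displacement(order):
    """Hamming distance between order(F~) = (l_1, ..., l_L) and [L] = (1, ..., L)."""
    reference = range(1, len(order) + 1)
    return sum(l_i != ref for l_i, ref in zip(order, reference))


def is_in_order(order):
    """A composition is in-order exactly when its displacement is zero."""
    return displacement(order) == 0


print(displacement((1, 2, 3, 4, 5)))  # every position holds a function from its own set
print(displacement((2, 1, 3, 4, 5)))  # the first two positions are swapped
print(displacement((1, 1, 3, 4, 5)))  # position 2 reuses a function from position 1
```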

3.1 Experimental Setup and Data-Generating process

Having defined our notion of compositionality in a pretrained model, we now briefly discuss the experimental setup used in this work (see Appendix A for details). Specifically, our data-generating process yields inputs consisting of a sequence of 6 data tokens, $x_d \in X_d^6$, where each token is drawn from a vocabulary of size $|X_d| = 10$. Each of the 6 elements is drawn uniformly at random, with replacement, from $X_d$. We consider two families of functions defined over these data tokens: bijections and permutations (see Fig. 10). Specifically, the set $\mathcal{F}_b$ (which we refer to as bijections) consists of all functions that apply a bijection on each of the 6 tokens in an element-wise manner. The number of such functions is the number of bijections on a single token: there are $10!$ such functions when $|X_d| = 10$. The second set is $\mathcal{F}_p$, which is the set of all permutations of 6 elements ($|\mathcal{F}_p| = 6!$). The rationale for selecting these function families is that both $\mathcal{F}_b$ and $\mathcal{F}_p$ are groups with function composition as the group operator. Consequently, the composition of two functions is also a group element.
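The two function families and their closure under composition can be illustrated as follows. This is a sketch under our own representation choices (a bijection as a lookup table over the vocabulary, a permutation as an index tuple); all helper names are hypothetical and the paper's actual implementation may differ.

```python
import math
import random

VOCAB_SIZE, NUM_TOKENS = 10, 6  # |X_d| = 10 and 6 data tokens, as in the text


def sample_data_tokens(rng):
    """Draw 6 tokens uniformly at random, with replacement, from the vocabulary."""
    return tuple(rng.randrange(VOCAB_SIZE) for _ in range(NUM_TOKENS))


def random_bijection(rng):
    """A lookup table over the vocabulary; one of the 10! bijections."""
    table = list(range(VOCAB_SIZE))
    rng.shuffle(table)
    return tuple(table)


def random_permutation(rng):
    """An index tuple over token positions; one of the 6! permutations."""
    idx = list(range(NUM_TOKENS))
    rng.shuffle(idx)
    return tuple(idx)


def apply_bijection(table, x):
    return tuple(table[t] for t in x)  # element-wise token substitution


def apply_permutation(idx, x):
    return tuple(x[i] for i in idx)  # reorder the tokens


def compose_bijections(second, first):
    """Table of `second o first`, which is again a bijection (group closure)."""
    return tuple(second[t] for t in first)


def compose_permutations(second, first):
    """Index tuple of `second o first`, which is again a permutation."""
    return tuple(first[i] for i in second)


print(math.factorial(VOCAB_SIZE), math.factorial(NUM_TOKENS))  # family sizes 10! and 6!
```

Closure is what makes every composition of functions from one family another member of that family, so a composed target is always a well-defined function of the same type.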

Figure 3: Direct vs. step-by-step prompts. The task (rainbow) and data (blue) tokens can be completed in two ways. They are followed by: (a) the intermediate outputs of the composition in the step-by-step format or (b) directly by the final result of compositions in the direct format.

We consider two formats for representing a sample (see Fig. 3). Both formats start with task tokens $x_f$ that specify the sequence of functions to compose, followed by the data tokens $x_d$. The direct prompt format follows this with the final output of the function composition, while the step-by-step prompt format follows this with all intermediate outputs of the function composition, similar to chain-of-thought and related protocols (Kojima et al., 2022; Nye et al., 2021; Wei et al., 2022b).
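The difference between the two formats can be sketched as below. The toy functions and the helper names `direct_sample` and `step_by_step_sample` are hypothetical, and the sketch assumes the token groups are simply concatenated; any separator or special tokens used in the actual data are not described in this section.

```python
# Hypothetical toy functions on tuples of tokens from {0, ..., 9}.
TOY_FUNCTIONS = {
    "F1": lambda x: tuple((t + 1) % 10 for t in x),  # shift each token by one (mod 10)
    "F2": lambda x: tuple((3 * t) % 10 for t in x),  # multiply each token by 3 (mod 10)
}


def direct_sample(task_tokens, data_tokens, functions=TOY_FUNCTIONS):
    """Task tokens, data tokens, then only the final output of the composition."""
    out = tuple(data_tokens)
    for name in task_tokens:
        out = functions[name](out)
    return list(task_tokens) + list(data_tokens) + list(out)


def step_by_step_sample(task_tokens, data_tokens, functions=TOY_FUNCTIONS):
    """Task tokens, data tokens, then every intermediate output in order."""
    sequence = list(task_tokens) + list(data_tokens)
    out = tuple(data_tokens)
    for name in task_tokens:
        out = functions[name](out)
        sequence.extend(out)  # append the output after applying each function
    return sequence


print(direct_sample(["F1", "F2"], [3, 9, 0]))
print(step_by_step_sample(["F1", "F2"], [3, 9, 0]))
```

In both formats the final tokens of the sequence hold the result of the full composition, so the two formats can be scored on the same positions.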

We also control the set of task tokens seen during training. In particular, we control compositions in the training data to either only contain in-order compositions, or also include out-of-order compositions. The training data for random contains task tokens corresponding to a random subset of the set of all possible in-order compositions. The training data for base contains task tokens where at most one position in the composition is not the identity function. For example, if we consider $N = 4$ and $L = 5$ as in Fig. 2, then base contains compositions of functions where at least four of the five positions are identity, for a total of 21 functions. The base set helps us assess whether merely learning “atomic” capabilities is sufficient to yield compositionality in a model (see Section A.2).
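The base and random training sets can be enumerated directly for $N = 4$ and $L = 5$. The following is a sketch with hypothetical token names (`"I"` for the identity, strings such as `"F2^(3)"` for $F_2^{(3)}$) and hypothetical helper names; it only reproduces the counting argument, not the paper's data pipeline.

```python
import itertools
import random

N, L = 4, 5
IDENTITY = "I"  # hypothetical token name for the identity function


def position_choices(l):
    """Identity plus the N functions allowed at position l."""
    return [IDENTITY] + [f"F{i}^({l})" for i in range(1, N + 1)]


def all_in_order():
    """Every in-order composition: one choice per position, in position order."""
    return list(itertools.product(*(position_choices(l) for l in range(1, L + 1))))


def base_compositions():
    """Compositions in which at most one position is not the identity."""
    return [c for c in all_in_order() if sum(tok != IDENTITY for tok in c) <= 1]


def random_compositions(k, seed=0):
    """A random subset of k in-order compositions (sampled without replacement)."""
    return random.Random(seed).sample(all_in_order(), k)


print(len(all_in_order()), len(base_compositions()))
```

The base set has $5 \times 4 + 1 = 21$ elements: one per non-identity function, plus the all-identity composition.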

We generate 100K samples using the process above for a given prompt format (step-by-step or direct) and with restrictions on the task tokens (in-order, out-of-order, base, random). The model is autoregressively trained on this data using the cross-entropy loss (see Appendix A). After training, we evaluate whether the model possesses a capability corresponding to a set of compositions of functions by computing the accuracy of the model completion on 1000 different data tokens. The accuracy of a completion is the average accuracy over the last 6 tokens.
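The evaluation metric described above can be sketched as follows. The function names are hypothetical, and the sketch assumes predictions and targets are available as token sequences whose last 6 entries correspond to the final output of the composition.

```python
def completion_accuracy(predicted, target, num_eval_tokens=6):
    """Fraction of the last `num_eval_tokens` positions where prediction matches target."""
    pred_tail = list(predicted)[-num_eval_tokens:]
    target_tail = list(target)[-num_eval_tokens:]
    matches = sum(p == t for p, t in zip(pred_tail, target_tail))
    return matches / num_eval_tokens


def capability_accuracy(completions):
    """Average completion accuracy over (predicted, target) pairs, e.g., 1000 inputs."""
    scores = [completion_accuracy(p, t) for p, t in completions]
    return sum(scores) / len(scores)


# Only the last 6 tokens are scored; earlier mismatches are ignored.
print(completion_accuracy([0, 1, 2, 3, 4, 5, 6, 7], [9, 9, 2, 3, 4, 5, 6, 0]))
```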

4 Results

In this section, we systematically investigate the capabilities of an autoregressive Transformer trained on synthetic tasks with compositional structure. Broadly, we would like to understand how this structure in the data manifests in the network. We focus on addressing the following questions:

  1. Do Transformers compositionally generalize to functions not present in the training data, and to what extent do they exhibit in-order and out-of-order generalization?

  2. How do properties of the training data influence in-order and out-of-order generalization?

  3. Are there differences between direct and step-by-step prompt formats?

  4. Do Transformers first learn to compose fewer functions before learning to compose many of them?

  5. What is the role of the attention and feed-forward layers?

  6. Can another popularly used architecture for autoregressive modeling, e.g., LSTMs, compositionally generalize in our setup?

We use nanoGPT (Appendix A), a Transformer with 12 layers with each Transformer block identical to the one in Vaswani et al. (2017). We use the same architecture across all our experiments in this section, but provide ablations that vary the number of layers, attention heads, and embedding dimension in Section B.1.

4.1 Combinatorial explosion and Exponential growth in capabilities

Do Transformers only generalize to functions present in the training data or do they reflect compositional structure present in data? In Fig. 4, we train on data consisting of a small subset of in-order compositions of bijections $\mathcal{F}_b$, in the step-by-step prompt format. We consider the composition of 5 functions in both Figs. 3(a) and 3(b). Each position of the composition can be one of four choices, with the four choices at different positions being different in Fig. 3(a) and the same in Fig. 3(b). In addition, any position can also be selected to be identity.

We find that Transformers can capture the compositional structure in data and generalize to exponential and combinatorial sets of functions in Figs. 3(a) and 3(b), despite being trained on an extremely small subset of function compositions. For example, a Transformer trained on 30–100 function compositions generalizes almost perfectly to 3125 unseen compositions of these functions.

Figure 4: Transformers trained on the step-by-step format can generalize to an exponential (a) or combinatorial (b) number of new functions. We plot the accuracy averaged over all compositions of $L=5$ bijections, where each position of the composition has 4+1 choices, one of them being the identity function. Each curve corresponds to training data generated by a different subset of functions, and the model is trained using the step-by-step prompt format. (a) The 5 choices of function differ across positions of the composition; there are 21 different functions, which can be composed (in-order) in 3125 different ways. (b) The 5 choices of function are identical across all 5 positions of the composition, which means there are 3125 different ways to compose them; only 1365 of them are unique. Both figures are evidence that one can train on a small number of compositions of functions (around 31-100) and generalize to exponentially (a) and combinatorially (b) many functions that would be considered “out-of-distribution”.
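One way to arrive at the figure of 1365 unique compositions in panel (b) is to note that identity entries can be dropped without changing the composed function. The sketch below counts distinct sequences that remain after dropping identities. It assumes that different sequences of non-identity bijections yield different functions (no accidental coincidences), which is an assumption of this illustration rather than a statement from the paper.

```python
from itertools import product

IDENTITY = 0  # hypothetical encoding: 0 is identity, 1..4 are the shared bijections


def canonical(comp):
    """Drop identity entries; what remains is the sequence actually applied."""
    return tuple(f for f in comp if f != IDENTITY)


all_comps = list(product(range(5), repeat=5))  # 5 choices at each of 5 positions
unique = {canonical(c) for c in all_comps}

print(len(all_comps))  # 3125
print(len(unique))     # 1365
```

The second count equals $\sum_{k=0}^{5} 4^k = 1365$, i.e., all sequences of at most five non-identity functions.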

In contrast, we note that LSTMs fail to compositionally generalize in this same setup (Section B.2), while Transformers with different numbers of layers and attention heads show compositional generalization (Section B.1). This indicates that the inductive bias of the architecture contributes to compositional generalization and that not every autoregressive model is guaranteed to succeed. We also observe that base, which serves as a null model that only trains on the atomic capabilities (or functions), does not compositionally generalize. Overall, then, we note that compositional generalization occurs with the step-by-step prompt format, provided the right architecture and training data are used.

Figure 5: The training data determines if a Transformer generalizes to an exponential (in-order generalization) or combinatorial (out-of-order generalization) number of functions. Each sub-plot uses a different subset of functions (from $\mathcal{F}_b$) to generate the training data, and we evaluate them on the combinatorial set of functions generated from 20+1 functions (one of them being identity). The x-axis varies the number of displacements and the y-axis varies the number of compositions (equivalently, the number of functions that are not identity). We make the following observations: (1) A Transformer trained on just 31 functions (top-middle) generalizes to exponentially many (nearly all 3125) compositions of functions. (2) None of the above configurations generalizes perfectly to the entire combinatorial set. They do, however, partially generalize to nearly 4 million compositions of functions. The generalization is worse if we increase the number of compositions or displacements (see Fig. 2 for a pictorial description of displacements).

4.2 In-order vs. Out-of-order generalization

How do biases in the training data influence a Transformer’s ability to compose? Are Transformers capable of both in-order and out-of-order generalization, or does it depend on the nature of the training data? For the functions in Fig. 3(a), the number of in-order compositions is $5^5 = 3125$ and the number of out-of-order compositions is a whopping $21^5 = 4084101$; essentially all of these functions are different from the ones seen in the training data. As in Section 4.1, we only consider Transformers trained with the step-by-step prompt format on functions from the set of bijections $\mathcal{F}_b$. In Fig. 5, we consider training data containing functions from base, some in-order compositions, and some out-of-order compositions. We fail to see in-order or out-of-order generalization unless the data also includes in-order or out-of-order compositions, respectively. However, a small number of in-order compositions (10 of them) or out-of-order compositions (100 of them) in the training data results in in-order generalization or limited out-of-order generalization. None of the scenarios in Fig. 5 fully generalizes to out-of-order compositions. This indicates that out-of-order compositions may require a lot more data compared to in-order compositions.
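The two counts above follow from simple closed forms, which can be cross-checked by brute force on a smaller instance. The sketch below assumes a hypothetical numbering in which function 0 is identity and functions 1..n belong to position 0, n+1..2n to position 1, and so on; `is_in_order` and `count_by_enumeration` are illustrative helpers, not names from the paper.

```python
from itertools import product


def position_of(f, n):
    """Position that owns function f (functions 1..n belong to position 0, etc.)."""
    return (f - 1) // n


def is_in_order(comp, n):
    """True if every non-identity entry sits at the position that owns it."""
    return all(f == 0 or position_of(f, n) == pos for pos, f in enumerate(comp))


def count_by_enumeration(n, length):
    """Count in-order tuples and all tuples when any function may fill any slot."""
    pool = range(n * length + 1)  # identity plus n functions per position
    comps = list(product(pool, repeat=length))
    return sum(is_in_order(c, n) for c in comps), len(comps)


# Brute force on a small instance agrees with the closed forms.
small_in_order, small_total = count_by_enumeration(n=2, length=3)
assert small_in_order == (2 + 1) ** 3
assert small_total == (2 * 3 + 1) ** 3

n, length = 4, 5
print((n + 1) ** length)           # 3125
print((n * length + 1) ** length)  # 4084101
```

Note that the larger count covers every way of filling the five slots from the 21 functions, so it includes the in-order compositions as a small subset.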

4.3 Direct vs. step-by-step compositions

Both Sections 4.1 and 4.2 discuss experiments using the step-by-step prompt format, but do these results also hold for direct prompting? Fig. 6 (left) and Fig. 15 answer this in the negative. Specifically, in Fig. 6 (left), we consider a setup identical to Fig. 3(a) and train on a different number of random functions. Transformers fail to generalize to new in-order compositions with direct prompting when we consider compositions of bijections from $\mathcal{F}_b$. We observe this failure even if we train on 2000 of the 3125 possible in-order compositions of functions, i.e., even if the data has high diversity. In contrast, in Fig. 3(a), a mere 100 compositions in the step-by-step format suffice to generalize to all possible in-order compositions.

Figure 6: Compositional generalization is less frequently seen in the direct prompt format. (Left.) We train a Transformer using the direct prompt format on 20+1 bijections with 5 compositions with 4 choices at each position. The model fails to generalize to all 3125 compositions even if it is trained on 2000 such functions. (Right.) We train a Transformer using the direct prompt format on a composition of two functions, with one function being one of 25 bijections and the other function being one of 25 permutations (totalling 625 compositions). The model is able to compose previously unseen combinations of functions when trained on 250 of these functions in this scenario.

On the other hand, we see in-order generalization if a Transformer is trained on a composition of a permutation function from $\mathcal{F}_p$ and a bijection function from $\mathcal{F}_b$. In Fig. 6 (right), we train on compositions of two functions, where one position is one of 25 bijections, and the other is one of 25 permutations. We vary the number of compositions seen in the training data and find that 250 compositions in the training data are enough for the model to generalize to all 625 possible compositions of the two functions. We note that bijections and permutations operate on orthogonal features of the input: bijections operate on the value of the token while permutations operate on the position of the token. We speculate that this is important for compositional generalization in the direct prompt format.
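To make the "value versus position" distinction concrete, the following sketch models a bijection as a token-to-token lookup table and a permutation as a reordering of positions. Under this modelling assumption the two operations commute, which is one simple way to see that they act on separate aspects of the input; the vocabulary size, sequence length, and helper names are illustrative choices, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 10, 6  # illustrative sizes

bijection = rng.permutation(vocab_size)  # lookup table: token value -> token value
positions = rng.permutation(seq_len)     # reordering of sequence positions


def apply_bijection(x):
    """Change token values, leave positions untouched."""
    return bijection[x]


def apply_permutation(x):
    """Move tokens between positions, leave values untouched."""
    return x[positions]


x = rng.integers(0, vocab_size, size=seq_len)
value_then_position = apply_permutation(apply_bijection(x))
position_then_value = apply_bijection(apply_permutation(x))
print(np.array_equal(value_then_position, position_then_value))  # True
```

Two generic bijections from the same family do not commute in this way, so their order of application matters.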

Why is compositional generalization harder for direct prompts? (Section C.3)

The ability to run multiple forward passes through the model allows us to tackle a richer class of problems (Merrill & Sabharwal, 2023). The step-by-step and direct prompt formats differ because the former allows $L$ forward passes through the model, while the latter only allows one forward pass. As a result, we expect that for the direct prompt format to enable compositional generalization, the model must compute the $L$ steps of the composition in its intermediate layers within a single forward pass. For example, consider a model that computes the functions $F$ and $G$, and is able to compositionally generalize to the function $G \circ F$. Since $G \circ F$ is computed using a single forward pass, $G$ must occur in a layer after $F$ (see also Fig. 10(b)). However, this model may not generalize to $F \circ G$, since that will require $F$ to occur after $G$ in the model’s layers. Hence, to compositionally generalize to both combinations of $F$ and $G$, a model may have to learn copies of $F$ and $G$ at multiple layers. This will likely require training data with large amounts of data diversity so that most combinations of functions are seen by the model during training itself.
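The layer-ordering argument can be illustrated with a deliberately simplified abstraction. The sketch below assumes that each layer can apply at most one function per forward pass and that we know which functions each layer has learned; both are assumptions of this toy model, and `realizable_in_one_pass` is a hypothetical helper rather than anything from the paper.

```python
def realizable_in_one_pass(composition, layer_functions):
    """Check whether a composition fits into a single forward pass.

    composition: function names in order of application (F first for G∘F).
    layer_functions: one set per layer with the functions that layer can apply.
    """
    layer = 0
    for f in composition:
        # Skip layers that cannot apply f; later functions need later layers.
        while layer < len(layer_functions) and f not in layer_functions[layer]:
            layer += 1
        if layer == len(layer_functions):
            return False
        layer += 1  # this layer is now used up
    return True


single_copy = [{"F"}, {"G"}]            # F only in layer 1, G only in layer 2
two_copies = [{"F", "G"}, {"F", "G"}]   # both functions available in both layers

print(realizable_in_one_pass(["F", "G"], single_copy))  # True:  G∘F
print(realizable_in_one_pass(["G", "F"], single_copy))  # False: F∘G
print(realizable_in_one_pass(["G", "F"], two_copies))   # True
```

In this toy picture, supporting both orders requires each function to be available at more than one depth, which mirrors the "copies at multiple layers" remark above.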

We further formalize the intuition above in Appendix C. Specifically, in Section C.3, we argue that a model trained with the direct prompt format requires more compositions in the training data, by a factor of $\mathcal{O}(L)$, compared to a model trained with the step-by-step format. In Theorem C.2, we prove that there exists an $L$-layer Transformer that can compositionally generalize with direct prompting. However, empirically, we find that even with the additional training data, the direct prompt format fails to generalize in Fig. 6 (left). This is because the existence of a solution does not guarantee that a Transformer trained with gradient descent converges to that particular minimum. The weights can instead converge to a minimum that only memorizes compositions present in the training data.

4.4 Towards a mechanistic understanding

In this section, we try to uncover the underlying mechanism for compositional generalization exhibited by Transformers in our setup, particularly for compositions of bijections in the step-by-step prompt format. Prior work on mechanistic interpretability often studies smaller neural networks to extract insights for larger networks (Nelson et al., 2021; Wang et al., 2022; Chughtai et al., 2023). The rationale relates to the universality hypothesis (Li et al., 2015; Olah et al., 2020), which states that networks of different scales are likely to learn similar functions when trained on the same data. In line with this direction, we attempt to understand a 1-layer Transformer trained on our data generating process. (Footnote 1: In fact, we use a deeper model in most experiments in the main paper to elicit maximal performance when using the direct format; the step-by-step format, as we argue in Appendix C, can generalize compositionally with fewer layers, namely one for in-order generalization.)

To develop a hypothesis for our mechanistic evaluation, we first show in Section C.1 the existence of 1-layer Transformers that can compositionally generalize to a simplified version of our task via the step-by-step prompt format. In particular, our construction uses the attention layer to copy the relevant task token, similar to an induction head (Olsson et al., 2022), and the feed-forward layer to compute a single step of the function composition. The model is run $L$ times serially, where each run computes one step of the function composition. The attention layer uses a position encoding as the key and query to determine which tokens to attend to and propagates the task token as the value.
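The division of labour in this construction can be mimicked with a toy simulation. The sketch below is not the construction from Section C.1: it replaces soft attention with a hard selection of the current task token and replaces the feed-forward layer with a lookup table per bijection. The task names `f1`, `f2`, `f3`, `id` and the sizes are hypothetical.

```python
import numpy as np


def attention_select(task_tokens, step):
    """Stand-in for attention: hard-select the task token for the current step."""
    return task_tokens[step]


def mlp_apply(tables, task, data):
    """Stand-in for the feed-forward layer: apply one bijection as a lookup table."""
    return tables[task][data]


def step_by_step(tables, task_tokens, data):
    """Run the toy 1-layer model once per step, emitting every intermediate output."""
    outputs = []
    for step in range(len(task_tokens)):
        task = attention_select(task_tokens, step)
        data = mlp_apply(tables, task, data)
        outputs.append(data)
    return outputs


rng = np.random.default_rng(0)
vocab_size = 10
tables = {name: rng.permutation(vocab_size) for name in ["f1", "f2", "f3"]}
tables["id"] = np.arange(vocab_size)

x = rng.integers(0, vocab_size, size=6)
steps = step_by_step(tables, ["f1", "id", "f3"], x)
# The final output equals the directly composed function applied to x.
assert np.array_equal(steps[-1], tables["f3"][tables["f1"][x]])
```

Because the same selection-then-lookup routine is reused at every step, any ordering of known task tokens is handled without additional parameters in this toy setting.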

We next evaluate if the theoretical construction, even though a simplification, lines up with empirical evaluations on the actual task. Specifically, we first use linear probing to understand which layers contribute to improvements in the accuracy and then visualize the attention maps to understand which tokens the model attends to.

Linear probe accuracy. In Fig. 7 (left), we use a linear probe to analyze the importance of attention layers and MLP layers. Following Geva et al. (2022), we fix the parameters of the probe to those of the last linear layer, i.e., the unembedding layer of the trained model. We use a Transformer trained on 100 random in-order compositions of 5 functions, identical to the model in Fig. 3(a). In Fig. 14, we show the results of linear probe experiments on Transformers of different sizes. Across these sizes, we note a sharp increase in accuracy right after an MLP layer, while the accuracy rarely increases after an attention layer.
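A probe with fixed unembedding weights can be sketched in a few lines. The example below assumes intermediate hidden states have already been extracted as arrays and omits any final normalization layer the real model may apply before unembedding; the synthetic hidden states are placeholders that only exercise the function and do not represent measurements.

```python
import numpy as np


def probe_accuracy(hidden, unembed, targets):
    """Decode hidden states with the model's own unembedding matrix.

    hidden: (num_tokens, d_model), unembed: (d_model, vocab_size),
    targets: (num_tokens,) integer token ids.
    """
    logits = hidden @ unembed
    return float(np.mean(logits.argmax(axis=-1) == targets))


vocab_size = 8
unembed = np.eye(vocab_size)              # placeholder unembedding (d_model == vocab_size)
targets = np.arange(vocab_size)

uninformative = np.zeros((vocab_size, vocab_size))  # state carrying no answer yet
informative = np.eye(vocab_size)[targets]           # state aligned with the answer

print(probe_accuracy(uninformative, unembed, targets))  # 0.125
print(probe_accuracy(informative, unembed, targets))    # 1.0
```

Applying the same function to the hidden state after each attention block and each MLP block gives one accuracy value per sub-layer, which is the kind of curve the probe analysis reports.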

Visualizing attention maps. Analyzing the attention maps of a 12-layer Transformer for a discernible pattern can be difficult. We hence analyze the attention maps of a 1-layer Transformer trained for step-by-step prompts, which surprisingly also exhibits in-order generalization. In Fig. 7 (right), we plot the attention map for a predefined composition of functions from the set $\mathcal{F}_b$. Keeping the task tokens fixed to the predefined composition, we sample 1000 data tokens and compute the attention map for the 1-layer model. The average of these maps is reported in the figure. We see that all data tokens attend to: (i) the task token that specifies the current function to be computed and (ii) the data token that the function is to be applied to.
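The averaging procedure can be sketched as follows, assuming per-sample query and key matrices for a single attention head are available. The random queries and keys below are placeholders used only to exercise the code; the sequence length, head dimension, and function names are assumptions of this illustration.

```python
import numpy as np


def causal_attention(q, k):
    """Single-head attention weights with a causal mask; q and k have shape (T, d)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)           # block attention to later tokens
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def mean_attention_map(queries, keys):
    """Average the (T, T) attention maps over all sampled inputs."""
    return np.mean([causal_attention(q, k) for q, k in zip(queries, keys)], axis=0)


rng = np.random.default_rng(0)
num_samples, T, d = 1000, 12, 16  # illustrative sizes
queries = rng.normal(size=(num_samples, T, d))
keys = rng.normal(size=(num_samples, T, d))
avg_map = mean_attention_map(queries, keys)
print(avg_map.shape)  # (12, 12)
```

Each row of the averaged map still sums to one, so it can be read as the average distribution of attention for the token at that row.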

The results above remarkably line up with our theoretical construction. For example, the attention maps in Fig. 7 always attend to the relevant task tokens and data token when computing the next step of the composition. The task and data tokens are all embedded in orthogonal spaces, similar to our construction, with the exception of 5 tokens which all correspond to the identity function (see Section B.7). In parallel, the linear probe accuracy for a 1-layer Transformer in Fig. 14 shows no increase in accuracy after the attention layer (similar to results in Fig. 7), but a sharp increase in accuracy occurs after the MLP layers, indicating that the function is entirely computed in the MLP layers.

Figure 7: (Left.) The attention layer picks a function to apply given the current input, and the MLP applies the selected function for Transformers trained on compositions of bijections in the step-by-step prompt format. We see sharp increases in accuracy after MLP layers in the last few layers of the Transformer. We compute the linear probe accuracy, averaged over in-order compositions of functions, after the MLP and attention layers at every layer of the model. (Right.) Attention is largest at the relevant data and task token. We plot the causal attention mask of a 1-layer Transformer trained using the step-by-step format on compositions of 5 in-order bijections (setup of Fig. 4). Keeping the prompt fixed to a specific composition of functions, we plot the attention map averaged over 1000 samples. We observe that the current data token attends to the specific task token relevant for computing the next step of the composition.

4.5 Training dynamics

Okawa et al. (2023) show that different capabilities can emerge multiplicatively over the course of training, i.e., a Transformer first learns functions $F_1$ and $F_2$ before it learns compositions like $F_1 \circ F_2$. In Fig. 8, we track the accuracy over the course of training to understand if compositions of fewer functions are learned before compositions of many functions. The setup for this figure is identical to Fig. 3(a) with the accuracy faceted by the number of function compositions. We find that the order in which functions are learned depends entirely on the training data. If the training data consists of base and very few in-order compositions, then a Transformer generalizes to fewer compositions (more identities) first before generalizing to compositions of multiple functions. On the other hand, if the model is trained on 25 random in-order compositions, then it is better at generalizing to more complex compositions of these functions; this trend is lost when we train on 50 random in-order compositions.

Figure 8: A Transformer trained on a random subset of functions generalizes first to a composition of more functions before it generalizes to a composition of few of them. Each line is the average accuracy over all compositions of $k$ functions and each subplot is a Transformer trained on a different subset of functions. The base is trained on the individual functions and these Transformers learn to compose a smaller set of functions (more functions in composition are identity) before learning to compose many of them. The opposite is true when the model is trained on a random subset of 25 compositions of functions.
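The faceting by $k$ described in the caption amounts to grouping compositions by their number of non-identity functions and averaging within each group. A minimal sketch, assuming compositions are encoded as tuples with 0 for identity; the accuracy values below are placeholders chosen only to exercise the function and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_num_functions(results, identity=0):
    """Average accuracy grouped by k, the number of non-identity functions.

    results: iterable of (composition tuple, accuracy) pairs.
    """
    buckets = defaultdict(list)
    for composition, accuracy in results:
        k = sum(f != identity for f in composition)
        buckets[k].append(accuracy)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}


# Placeholder values, not measurements.
example_results = [
    ((0, 0, 0, 0, 0), 1.0),
    ((1, 0, 0, 0, 0), 0.5),
    ((0, 0, 3, 0, 0), 1.0),
    ((1, 2, 0, 0, 4), 0.25),
]
print(accuracy_by_num_functions(example_results))  # {0: 1.0, 1: 0.75, 3: 0.25}
```

Evaluating this grouping at successive checkpoints gives one accuracy curve per value of $k$.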

5 Conclusion

Given several recent works focused on prediction or elicitation of capabilities in pretrained models, we ask whether the very motivation guiding these works is tractable: can we possibly characterize all capabilities of a model, specifically a Transformer, pretrained on a compositional data domain? To address this question, we proposed a synthetic, but well-defined, data domain and formalized the notion of a capability as representing a function defined over the domain. Breaking compositional generalization into two relevant scenarios (in-order vs. out-of-order), we showed that the compositional structure of the data forces a model to learn to compose at relatively minimal data diversity, which tentatively addresses our primary question: an appropriate prompt could make the model compose its capabilities, yielding an “explosion of capabilities”. This can arguably make a tractable analysis of capabilities in a pretrained model relatively difficult.

Acknowledgements

RR thanks Kento Nishi, Gautam Reddy and Eric Bigelow for their discussions at the early stages of this project. RR thanks AWS AI, for their gift to Penn Engineering’s ASSET Center for Trustworthy AI. ESL was partially supported by the National Science Foundation (IIS-2008151).

Author Contributions

ESL and RR conceived the initial project direction and defined the problem setup with inputs from HT and MK. The experiments were led by RR with inputs from ESL, HT and MK. The writing of the introduction and related work was led by ESL with help from HT and RR. RR, ESL and HT extensively collaborated on the methods section. The results and appendix were led by RR. The expository figures were created by HT and RR. HT and RPD acted as advisors in the work.

References

  • Abid et al. (2021) Abid, A., Farooqi, M., and Zou, J. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp.  298–306, 2021.
  • Ahn et al. (2023) Ahn, K., Cheng, X., Daneshmand, H., and Sra, S. Transformers learn to implement preconditioned gradient descent for in-context learning. arXiv preprint arXiv:2306.00297, 2023.
  • Allen-Zhu & Li (2023a) Allen-Zhu, Z. and Li, Y. Physics of language models: Part 3.1, knowledge storage and extraction. arXiv preprint arXiv:2309.14316, 2023a.
  • Allen-Zhu & Li (2023b) Allen-Zhu, Z. and Li, Y. Physics of language models: Part 3.2, knowledge manipulation. arXiv preprint arXiv:2309.14402, 2023b.
  • Allen-Zhu & Li (2023c) Allen-Zhu, Z. and Li, Y. Physics of language models: Part 1, context-free grammar. arXiv preprint arXiv:2305.13673, 2023c.
  • Arora & Goyal (2023) Arora, S. and Goyal, A. A theory for emergence of complex skills in language models. arXiv preprint arXiv:2307.15936, 2023.
  • Austin et al. (2021) Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  • Bhattamishra et al. (2020) Bhattamishra, S., Ahuja, K., and Goyal, N. On the ability and limitations of transformers to recognize formal languages. arXiv preprint arXiv:2009.11264, 2020.
  • Bommasani et al. (2021) Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint. arXiv:2108.07258, 2021.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Bubeck et al. (2023) Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
  • Chan et al. (2023) Chan, A., Salganik, R., Markelius, A., Pang, C., Rajkumar, N., Krasheninnikov, D., Langosco, L., He, Z., Duan, Y., Carroll, M., et al. Harms from increasingly agentic algorithmic systems. arXiv preprint arXiv:2302.10329, 2023.
  • Chan et al. (2022) Chan, S., Santoro, A., Lampinen, A., Wang, J., Singh, A., Richemond, P., McClelland, J., and Hill, F. Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 35:18878–18891, 2022.
  • Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  • Chowdhery et al. (2022) Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • Chughtai et al. (2023) Chughtai, B., Chan, L., and Nanda, N. A toy model of universality: Reverse engineering how networks learn group operations. arXiv preprint arXiv:2302.03025, 2023.
  • Csordás et al. (2021a) Csordás, R., Irie, K., and Schmidhuber, J. The devil is in the detail: Simple tricks improve systematic generalization of transformers. arXiv preprint arXiv:2108.12284, 2021a.
  • Csordás et al. (2021b) Csordás, R., Irie, K., and Schmidhuber, J. The neural data router: Adaptive control flow in transformers improves systematic generalization. arXiv preprint arXiv:2110.07732, 2021b.
  • Csordás et al. (2022) Csordás, R., Irie, K., and Schmidhuber, J. Ctl++: Evaluating generalization on never-seen compositional patterns of known functions, and compatibility of neural representations. arXiv preprint arXiv:2210.06350, 2022.
  • Fodor (1975) Fodor, J. A. The language of thought, volume 5. Harvard university press, 1975.
  • Fodor & Lepore (2002) Fodor, J. A. and Lepore, E. The compositionality papers. Oxford University Press, 2002.
  • Fodor & Pylyshyn (1988) Fodor, J. A. and Pylyshyn, Z. W. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71, 1988.
  • Ganguli et al. (2022) Ganguli, D., Hernandez, D., Lovitt, L., Askell, A., Bai, Y., Chen, A., Conerly, T., Dassarma, N., Drain, D., Elhage, N., et al. Predictability and surprise in large generative models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp.  1747–1764, 2022.
  • Garg et al. (2022) Garg, S., Tsipras, D., Liang, P. S., and Valiant, G. What can transformers learn in-context? a case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022.
  • Garrido-Muñoz et al. (2021) Garrido-Muñoz, I., Montejo-Ráez, A., Martínez-Santiago, F., and Ureña-López, L. A. A survey on bias in deep nlp. Applied Sciences, 11(7):3184, 2021.
  • Gehman et al. (2020) Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020.
  • Geva et al. (2022) Geva, M., Caciularu, A., Dar, G., Roit, P., Sadde, S., Shlain, M., Tamir, B., and Goldberg, Y. Lm-debugger: An interactive tool for inspection and intervention in transformer-based language models. arXiv preprint arXiv:2204.12130, 2022.
  • Hendrycks & Gimpel (2016) Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
  • Henighan et al. (2020) Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.
  • Hernandez et al. (2021) Hernandez, D., Kaplan, J., Henighan, T., and McCandlish, S. Scaling laws for transfer. arXiv preprint arXiv:2102.01293, 2021.
  • Hoffmann et al. (2022) Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  • Honovich et al. (2022) Honovich, O., Shaham, U., Bowman, S. R., and Levy, O. Instruction induction: From few examples to natural language task descriptions. arXiv preprint arXiv:2205.10782, 2022.
  • Hosseini et al. (2022) Hosseini, A., Vani, A., Bahdanau, D., Sordoni, A., and Courville, A. On the compositional generalization gap of in-context learning. arXiv preprint arXiv:2211.08473, 2022.
  • Huang et al. (2019) Huang, P.-S., Zhang, H., Jiang, R., Stanforth, R., Welbl, J., Rae, J., Maini, V., Yogatama, D., and Kohli, P. Reducing sentiment bias in language models via counterfactual evaluation. arXiv preprint arXiv:1911.03064, 2019.
  • Hupkes et al. (2018) Hupkes, D., Singh, A., Korrel, K., Kruszewski, G., and Bruni, E. Learning compositionally through attentive guidance. arXiv preprint arXiv:1805.09657, 2018.
  • Hupkes et al. (2020) Hupkes, D., Dankers, V., Mul, M., and Bruni, E. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757–795, 2020.
  • Jiang et al. (2021) Jiang, L., Hwang, J. D., Bhagavatula, C., Le Bras, R., Liang, J., Dodge, J., Sakaguchi, K., Forbes, M., Borchardt, J., Gabriel, S., et al. Can machines learn morality? the delphi experiment. arXiv e-prints, pp.  arXiv–2110, 2021.
  • Jones (2021) Jones, A. L. Scaling scaling laws with board games. arXiv preprint arXiv:2104.03113, 2021.
  • Kojima et al. (2022) Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
  • Lake & Baroni (2018) Lake, B. and Baroni, M. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International conference on machine learning, pp.  2873–2882. PMLR, 2018.
  • Lee et al. (2023) Lee, T., Yasunaga, M., Meng, C., Mai, Y., Park, J. S., Gupta, A., Zhang, Y., Narayanan, D., Teufel, H. B., Bellagente, M., et al. Holistic evaluation of text-to-image models. arXiv preprint arXiv:2311.04287, 2023.
  • Lepori et al. (2023) Lepori, M. A., Serre, T., and Pavlick, E. Break it down: Evidence for structural compositionality in neural networks. arXiv preprint arXiv:2301.10884, 2023.
  • Lewis et al. (2022) Lewis, M., Yu, Q., Merullo, J., and Pavlick, E. Does clip bind concepts? probing compositionality in large image models. arXiv preprint arXiv:2212.10537, 2022.
  • Li et al. (2023a) Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., and Wattenberg, M. Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task, 2023a. Comment: ICLR 2023 oral (notable-top-5%): https://openreview.net/forum?id=DeG07_TcZvT ; code: https://github.com/likenneth/othello_world.
  • Li et al. (2015) Li, Y., Yosinski, J., Clune, J., Lipson, H., and Hopcroft, J. Convergent learning: Do different neural networks learn the same representations? arXiv preprint arXiv:1511.07543, 2015.
  • Li et al. (2023b) Li, Y., Ildiz, M. E., Papailiopoulos, D., and Oymak, S. Transformers as algorithms: Generalization and implicit model selection in in-context learning. arXiv preprint arXiv:2301.07067, 2023b.
  • Li et al. (2023c) Li, Y., Sreenivasan, K., Giannou, A., Papailiopoulos, D., and Oymak, S. Dissecting chain-of-thought: A study on compositional in-context learning of mlps. arXiv preprint arXiv:2305.18869, 2023c.
  • Liang et al. (2022) Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
  • Lin et al. (2021) Lin, S., Hilton, J., and Evans, O. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
  • Liška et al. (2018) Liška, A., Kruszewski, G., and Baroni, M. Memorize or generalize? searching for a compositional rnn in a haystack. arXiv preprint arXiv:1802.06467, 2018.
  • Liu et al. (2022) Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022.
  • Liu et al. (2023) Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  • Lubana et al. (2023) Lubana, E. S., Bigelow, E. J., Dick, R. P., Krueger, D., and Tanaka, H. Mechanistic mode connectivity. In International Conference on Machine Learning, pp.  22965–23004. PMLR, 2023.
  • McGuffie & Newhouse (2020) McGuffie, K. and Newhouse, A. The radicalization risks of gpt-3 and advanced neural language models. arXiv preprint arXiv:2009.06807, 2020.
  • Merrill & Sabharwal (2023) Merrill, W. and Sabharwal, A. The expressive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023.
  • Myers & Wilf (2006) Myers, A. N. and Wilf, H. S. Some new aspects of the coupon collector’s problem. SIAM review, 48(3):549–565, 2006.
  • Nanda et al. (2023a) Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217, 2023a.
  • Nanda et al. (2023b) Nanda, N., Lee, A., and Wattenberg, M. Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941, 2023b.
  • Nelson et al. (2021) Nelson, E., Neel, N., Catherine, O., Tom, H., Nicholas, J., Ben, M., Amanda, A., Yuntao, B., Anna, C., Tom, C., et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
  • Nye et al. (2021) Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
  • Okawa et al. (2023) Okawa, M., Lubana, E. S., Dick, R. P., and Tanaka, H. Compositional abilities emerge multiplicatively: Exploring diffusion models on a synthetic task. arXiv preprint arXiv:2310.09336, 2023.
  • Olah et al. (2020) Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020.
  • Olsson et al. (2022) Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
  • Ontanón et al. (2021) Ontanón, S., Ainslie, J., Cvicek, V., and Fisher, Z. Making transformers solve compositional tasks. arXiv preprint arXiv:2108.04378, 2021.
  • Parrish et al. (2021) Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P. M., and Bowman, S. R. Bbq: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193, 2021.
  • Press & Wolf (2016) Press, O. and Wolf, L. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.
  • Qin et al. (2023) Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.
  • Radford et al. (2018) Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. Improving language understanding by generative pre-training. 2018.
  • Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Rae et al. (2021) Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
  • Razeghi et al. (2022) Razeghi, Y., Logan IV, R. L., Gardner, M., and Singh, S. Impact of pretraining term frequencies on few-shot reasoning. arXiv preprint arXiv:2202.07206, 2022.
  • Sanh et al. (2021) Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021.
  • Saparov & He (2022) Saparov, A. and He, H. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. arXiv preprint arXiv:2210.01240, 2022.
  • Schulz et al. (2016) Schulz, E., Tenenbaum, J., Duvenaud, D. K., Speekenbrink, M., and Gershman, S. J. Probing the compositionality of intuitive functions. Advances in neural information processing systems, 29, 2016.
  • Shah et al. (2022) Shah, R., Varma, V., Kumar, R., Phuong, M., Krakovna, V., Uesato, J., and Kenton, Z. Goal misgeneralization: Why correct specifications aren’t enough for correct goals. arXiv preprint arXiv:2210.01790, 2022.
  • Sharma & Kaplan (2020) Sharma, U. and Kaplan, J. A neural scaling law from the dimension of the data manifold. arXiv preprint arXiv:2004.10802, 2020.
  • Sheng et al. (2019) Sheng, E., Chang, K.-W., Natarajan, P., and Peng, N. The woman worked as a babysitter: On biases in language generation. arXiv preprint arXiv:1909.01326, 2019.
  • Shevlane et al. (2023) Shevlane, T., Farquhar, S., Garfinkel, B., Phuong, M., Whittlestone, J., Leung, J., Kokotajlo, D., Marchal, N., Anderljung, M., Kolt, N., et al. Model evaluation for extreme risks. arXiv preprint arXiv:2305.15324, 2023.
  • Srivastava et al. (2022) Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
  • Suzgun et al. (2022) Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
  • Tamkin et al. (2021) Tamkin, A., Brundage, M., Clark, J., and Ganguli, D. Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint arXiv:2102.02503, 2021.
  • Tay et al. (2022) Tay, Y., Wei, J., Chung, H. W., Tran, V. Q., So, D. R., Shakeri, S., Garcia, X., Zheng, H. S., Rao, J., Chowdhery, A., et al. Transcending scaling laws with 0.1% extra compute. arXiv preprint arXiv:2210.11399, 2022.
  • Tenney et al. (2019) Tenney, I., Das, D., and Pavlick, E. Bert rediscovers the classical nlp pipeline. arXiv preprint arXiv:1905.05950, 2019.
  • Thoppilan et al. (2022) Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
  • Touvron et al. (2023) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Von Oswald et al. (2023) Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., and Vladymyrov, M. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pp.  35151–35174. PMLR, 2023.
  • Wang et al. (2022) Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593, 2022.
  • Wei et al. (2021) Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  • Wei et al. (2022a) Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a.
  • Wei et al. (2022b) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022b.
  • Weidinger et al. (2021) Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021.
  • Weiss et al. (2021) Weiss, G., Goldberg, Y., and Yahav, E. Thinking like transformers. In International Conference on Machine Learning, pp.  11080–11090. PMLR, 2021.
  • Xu et al. (2021) Xu, A., Pathak, E., Wallace, E., Gururangan, S., Sap, M., and Klein, D. Detoxifying language models risks marginalizing minority voices. arXiv preprint arXiv:2104.06390, 2021.
  • Xu et al. (2020) Xu, J., Ju, D., Li, M., Boureau, Y.-L., Weston, J., and Dinan, E. Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079, 2020.
  • Xu & Tang (2011) Xu, W. and Tang, A. K. A generalized coupon collector problem. Journal of Applied Probability, 48(4):1081–1094, 2011. doi: 10.1239/jap/1324046020.
  • Yu et al. (2023) Yu, D., Kaur, S., Gupta, A., Brown-Cohen, J., Goyal, A., and Arora, S. Skill-mix: A flexible and expandable family of evaluations for ai models. arXiv preprint arXiv:2310.17567, 2023.
  • Yun et al. (2022) Yun, T., Bhalla, U., Pavlick, E., and Sun, C. Do vision-language pretrained models learn composable primitive concepts? arXiv preprint arXiv:2203.17271, 2022.
  • Zhou et al. (2023) Zhou, H., Bradley, A., Littwin, E., Razin, N., Saremi, O., Susskind, J., Bengio, S., and Nakkiran, P. What algorithms can transformers learn? a study in length generalization. arXiv preprint arXiv:2310.16028, 2023.

Appendix A Experimental Details

A.1 Training methodology

Transformer architecture
Figure 9: We use nanoGPT as the Transformer architecture in all our experiments. The core Transformer block is a LayerNorm, a causal attention block, followed by another layer-norm and a 2-layer multi-layer perceptron (MLP). The Transformer block has two residual connections.

We use nanoGPT (https://github.com/karpathy/nanoGPT) with 12 layers, 12 attention heads and an embedding dimension of size 120. Each Transformer block contains a causal attention layer, layer-norms, residual connections and an MLP (see Fig. 9). The MLP contains two fully-connected layers with a GELU nonlinearity (Hendrycks & Gimpel, 2016) between them. The first fully-connected layer has an output size of 4 times the embedding dimension (480) and the second has an output size equal to the embedding dimension (120).
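
The sizes above can be collected in a small configuration object. This is only a sketch: the class name `ModelConfig` and its helper methods are hypothetical, and the parameter count assumes the usual GPT-2-style layout (four square attention projections and two MLP matrices per block, no biases), which the text does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    # Values stated in the text.
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 120
    mlp_ratio: int = 4  # the first MLP layer widens the embedding by this factor

    @property
    def head_dim(self) -> int:
        # Assumption: the embedding is split evenly across heads.
        assert self.n_embd % self.n_head == 0
        return self.n_embd // self.n_head

    @property
    def mlp_hidden(self) -> int:
        return self.mlp_ratio * self.n_embd

    def block_weight_count(self) -> int:
        """Weight-matrix entries in one block, ignoring biases and LayerNorm."""
        attn = 4 * self.n_embd * self.n_embd       # Q, K, V and output projections
        mlp = 2 * self.n_embd * self.mlp_hidden    # up- and down-projection
        return attn + mlp


cfg = ModelConfig()
print(cfg.head_dim, cfg.mlp_hidden, cfg.block_weight_count())
```

With these values each head operates on 10 dimensions, which makes the model very small by the standards of language models.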

The input tokens are converted to one-hot vectors before being passed to the model. We do not use dropout or biases in the LayerNorm layers. We use weight-tying (Press & Wolf, 2016), i.e., the input and the output embedding layers share weights. Finally, we make use of mixed precision (bf16 in torch) to speed up training.

Loss and Optimizer

Models are trained using an autoregressive objective to predict the next token using the cross-entropy loss. Specifically, consider a sequence of $t$ tokens denoted by $x_{1:t}$. Let $p_w(y \mid x_{1:t})$ denote the probability distribution over the next token as predicted by a model with weights $w$. For a sequence $x_{1:T}$ of length $T$, the autoregressive objective is

$$L(w) = -\sum_{t=1}^{T-1} \log p_w\left(y = x_{t+1} \mid x_{1:t}\right).$$
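
As a minimal sketch of this objective, the sum can be written out directly. The callable `next_token_probs` is a hypothetical stand-in for the model $p_w$; a real implementation would compute all positions in one batched forward pass.

```python
import math


def autoregressive_loss(tokens, next_token_probs):
    """Sum of -log p(x_{t+1} | x_{1:t}) for t = 1, ..., T-1.

    tokens: list of T integer token ids.
    next_token_probs: hypothetical callable mapping a prefix (list of ids)
        to a list of probabilities over the vocabulary.
    """
    loss = 0.0
    for t in range(1, len(tokens)):
        probs = next_token_probs(tokens[:t])   # distribution given x_{1:t}
        loss -= math.log(probs[tokens[t]])     # tokens[t] is x_{t+1} in 1-indexed notation
    return loss


VOCAB = 10
uniform = lambda prefix: [1.0 / VOCAB] * VOCAB   # a model that knows nothing
seq = [3, 1, 4, 1, 5, 9]
loss = autoregressive_loss(seq, uniform)
print(loss)
```

For a uniform predictor the loss is $(T-1)\log|V|$, which gives a convenient sanity check.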

Training is performed for 100 epochs with a cosine-annealed schedule with warmup. We use an initial learning rate of 3e-4, annealed eventually to 6e-5. We use AdamW as the optimizer ($\beta_1 = 0.9$ and $\beta_2 = 0.95$) with a weight decay of $10^{-3}$ and a batch size of 512. We also make use of gradient clipping with a magnitude of 1.
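
The schedule and clipping described above could look roughly as follows. This is a sketch under assumptions: the text does not give the warmup length or the total number of steps (both are hypothetical parameters here), the warmup is assumed to be linear, and "magnitude of 1" is interpreted as a bound on the global gradient norm.

```python
import math

MAX_LR = 3e-4   # from the text
MIN_LR = 6e-5   # from the text


def learning_rate(step, total_steps, warmup_steps):
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR.

    total_steps and warmup_steps are hypothetical; the text does not state them.
    """
    if step < warmup_steps:
        return MAX_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # decays from 1 to 0
    return MIN_LR + cosine * (MAX_LR - MIN_LR)


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]


print(learning_rate(0, 1000, 100), learning_rate(1000, 1000, 100))
print(clip_by_global_norm([3.0, 4.0]))
```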

A.2 Data generating process

Data and task tokens.

Both data and task tokens are converted to one-hot vectors before being fed to the Transformer. The set of data tokens is denoted by $X_d$ and the size of the vocabulary, $|X_d|$, is 10 in all our experiments. The data tokens in the input, $x_d \in X_d^6$, form a sequence of 6 tokens that is the input to the function composition. The 6 tokens are sampled uniformly at random from $X_d$ with replacement.

There are two sets of functions considered in this work. The functions in $\mathcal{F}_b$ (which we refer to as bijections) apply a lookup table in an element-wise fashion to each of the 6 tokens in $x_d$. The functions in $\mathcal{F}_p$ permute the 6 tokens in $x_d$. The families of functions $\mathcal{F}_b$ and $\mathcal{F}_p$ are described in Fig. 10. Each function from $\mathcal{F}_p$ and $\mathcal{F}_b$ has its own task token in $X_F$.
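
To make the two families concrete, here is a minimal sketch. The helper names `make_bijection` and `make_permutation` are hypothetical, and drawing each function by shuffling is an assumption; the text does not say how the individual lookup tables and permutations are chosen.

```python
import random

NUM_DATA_TOKENS = 10   # |X_d| from the text
SEQ_LEN = 6            # length of x_d from the text


def make_bijection(rng):
    """A random lookup table over the data vocabulary, applied element-wise."""
    table = list(range(NUM_DATA_TOKENS))
    rng.shuffle(table)
    return lambda x: [table[tok] for tok in x]


def make_permutation(rng):
    """A random reordering of the positions of x_d."""
    order = list(range(SEQ_LEN))
    rng.shuffle(order)
    return lambda x: [x[i] for i in order]


rng = random.Random(0)
x_d = [rng.randrange(NUM_DATA_TOKENS) for _ in range(SEQ_LEN)]  # sampled with replacement
f_b = make_bijection(rng)
f_p = make_permutation(rng)
print(x_d, f_b(x_d), f_p(x_d))
```

A bijection changes token identities but not positions, while a permutation changes positions but not identities, so in this sketch the two kinds of function commute.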

The input starts with a sequence of $L$ task tokens $x_f \in X_F^L$. The number of compositions is generally $L = 5$, but in a few experiments, such as Figs. 6 (Right) and 15, $L = 2$.

Sampling task tokens

The task tokens can be sampled such that they satisfy certain properties. For example, let us consider the composition of two functions, one from the set $\mathcal{F}_1 \subset \mathcal{F}_p$ and another from $\mathcal{F}_2 \subset \mathcal{F}_b$ (which is the setting in Fig. 6 (Right)). We can restrict the training data to compositions from the set $\mathcal{F}_2 \circ \mathcal{F}_1$, which are in-order compositions (see Fig. 2). Alternatively, we can also choose to include out-of-order compositions, which include compositions from $\mathcal{F}_1 \circ \mathcal{F}_1$, $\mathcal{F}_2 \circ \mathcal{F}_2$ and $\mathcal{F}_1 \circ \mathcal{F}_2$. In Fig. 6 (Right), we restrict our training and evaluation to in-order compositions of functions and we observe that training on a subset of the elements from $\mathcal{F}_2 \circ \mathcal{F}_1$ suffices to compositionally generalize to all functions in the set.

Two other commonly used subsets of functions are base and random. Consider $\mathcal{F}_1, \mathcal{F}_2, \ldots, \mathcal{F}_5 \subset \mathcal{F}_b$. The set random considers $k$ functions from the set $\mathcal{F}_5 \circ \mathcal{F}_4 \circ \cdots \circ \mathcal{F}_1$ which are drawn uniformly at random.

base is used to test if compositionality is seen when the Transformer is trained on the individual functions from $\mathcal{F}_i$ for all $i \in [5]$. In the training data, all compositions have at least 4 of the 5 functions set to the identity function $I$, i.e., it considers compositions of the form $I \circ I \circ \mathcal{F}_3 \circ I \circ I$ or $I \circ \mathcal{F}_4 \circ \cdots \circ I$. There are a total of $1 + \sum_{i=1}^{5} |\mathcal{F}_i|$ such functions; the 1 accounts for the case where all 5 functions in the composition are the identity. The model is never trained on the composition of two or more functions, and at least compositions of 3 functions are necessary to generalize to all in-order compositions (Fig. 18).
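
The in-order, base and random sets can be enumerated explicitly, as sketched below. The choice of 4 non-identity functions per level is an assumption, made so that the number of in-order compositions is $5^5 = 3125$, matching the count reported in Appendix B.1; representing a composition as a tuple of indices and sampling random without replacement are likewise illustrative choices.

```python
import itertools
import random

NUM_LEVELS = 5          # L = 5 compositions, from the text
FUNCS_PER_LEVEL = 4     # assumption: 4 non-identity functions in each F_i
IDENTITY = 0            # assumption: index 0 at every level denotes the identity I

# A composition is a tuple of 5 indices, one per level.
choices = range(FUNCS_PER_LEVEL + 1)
in_order = list(itertools.product(choices, repeat=NUM_LEVELS))

# "base": at most one position holds a non-identity function.
base = [c for c in in_order if sum(idx != IDENTITY for idx in c) <= 1]


def sample_random(k, seed=0):
    """'random': k in-order compositions drawn uniformly (here without replacement)."""
    return random.Random(seed).sample(in_order, k)


print(len(in_order), len(base), len(sample_random(50)))
```

Under these assumptions base contains $1 + 5 \cdot 4 = 21$ compositions, in line with the formula $1 + \sum_i |\mathcal{F}_i|$.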

Figure 10: A permutation from $\mathcal{F}_p$ permutes the 6 tokens in the input $x_d$. A bijection from $\mathcal{F}_b$ applies a lookup table to each of the 6 tokens individually.
Generating a sequence of tokens

A sequence starts with a sequence of two task tokens $x_f = [x_{F_1}, x_{F_2}]$ followed by a sequence of data tokens $x_d$. The sequence can be presented in either: (i) the step-by-step format, where the intermediate outputs are also included in the sequence, e.g., $[x_{F_1}, x_{F_2}, x_d, F_1(x_d), F_2(F_1(x_d))]$ (see Fig. 11(a)); or (ii) the direct format, which does not include the intermediate outputs of the composition in the sequence, e.g., $[x_{F_1}, x_{F_2}, x_d, F_2(F_1(x_d))]$ (see Fig. 11(b)).

The step-by-step and direct formats are also discussed in Fig. 3. The training data consists of 100,000 sequences for all experiments in one of the two formats.
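
The two formats differ only in whether intermediate outputs are kept. A minimal sketch, in which the function `build_sequence`, the toy functions and the task-token ids are all hypothetical:

```python
def build_sequence(task_tokens, functions, x_d, step_by_step=True):
    """Concatenate task tokens, the input, and the output(s) of the composition.

    task_tokens: list of task-token ids, one per function.
    functions: list of callables, applied in order (F_1 first).
    x_d: list of data tokens.
    """
    sequence = list(task_tokens) + list(x_d)
    current = list(x_d)
    for i, f in enumerate(functions):
        current = f(current)
        is_last = i == len(functions) - 1
        if step_by_step or is_last:
            sequence += current   # the direct format keeps only the final output
    return sequence


# Hypothetical toy functions on 6 data tokens from a vocabulary of size 10.
F1 = lambda x: [(tok + 1) % 10 for tok in x]   # an element-wise lookup
F2 = lambda x: x[::-1]                         # a permutation of positions
x_d = [0, 1, 2, 3, 4, 5]
TASK_F1, TASK_F2 = 100, 101                    # hypothetical task-token ids

step = build_sequence([TASK_F1, TASK_F2], [F1, F2], x_d, step_by_step=True)
direct = build_sequence([TASK_F1, TASK_F2], [F1, F2], x_d, step_by_step=False)
print(step)
print(direct)
```

Both sequences end in the same six tokens; the step-by-step sequence is longer by one block of six tokens per intermediate output.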

Figure 11: Step-by-step composition vs. direct composition. We test two possible routes for compositions. (a) Step-by-step prompting, which allows for generating intermediate outputs. (b) Direct prompting, where the model must compose the functions without the intermediate outputs.
Evaluating compositions

When evaluating trained models, we evaluate on 1000 different inputs for every composition of functions. Since Fig. 5 requires us to evaluate on a combinatorial set of functions, we sample 1000 functions (or the total number of functions, whichever is lower) for each cell, which is identified by the displacement and the number of compositions; we then compute the accuracy averaged over those functions to populate the cell. The accuracy of a completion is calculated by averaging the accuracy of the last six tokens. We see that qualitative trends do not change when we use different metrics (Fig. 19).
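
The metric can be sketched as follows; the function names are hypothetical, and the sketch assumes that predictions and targets are aligned token lists of equal length.

```python
def completion_accuracy(predicted, target, num_output_tokens=6):
    """Fraction of the last `num_output_tokens` tokens that are predicted correctly."""
    pred_tail = predicted[-num_output_tokens:]
    tgt_tail = target[-num_output_tokens:]
    correct = sum(p == t for p, t in zip(pred_tail, tgt_tail))
    return correct / num_output_tokens


def cell_accuracy(per_function_accuracies):
    """Average over the sampled functions that populate one cell."""
    return sum(per_function_accuracies) / len(per_function_accuracies)


acc = completion_accuracy([9, 9, 1, 2, 3, 4, 5, 6], [0, 0, 1, 2, 3, 4, 5, 7])
print(acc, cell_accuracy([1.0, 0.5, acc]))
```

Only the final six tokens count, so errors in intermediate outputs of a step-by-step completion affect this metric only through their effect on the final output.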

Computing linear probe accuracy

We consider the outputs after every attention block and every MLP block (including the residual stream in both cases). We then pass these outputs through the final embedding layer and a Softmax layer to get predictions over the next token. We use these predictions to compute the accuracy at that layer. The accuracy is averaged over 1000 different input data tokens and for 200 different compositions of functions.
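
A minimal sketch of this read-out, assuming hidden states are plain lists of floats and the tied embedding is a $V \times d$ matrix. The function names are hypothetical, and any normalization applied before the final embedding layer is omitted here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probe_prediction(hidden, embedding):
    """Read out a next-token prediction from an intermediate hidden state.

    hidden: list of d floats (output of an attention or MLP block plus residual).
    embedding: V x d matrix as a list of rows; with weight tying the same
        matrix serves as the output layer.
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in embedding]
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def probe_accuracy(hiddens, targets, embedding):
    hits = sum(probe_prediction(h, embedding) == t for h, t in zip(hiddens, targets))
    return hits / len(targets)


# Toy example: 3 tokens with orthogonal embeddings.
E = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hiddens = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
print(probe_accuracy(hiddens, [0, 1, 0], E))
```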

Appendix B Additional Experiments

B.1 Sweeping hyper-parameters of the Transformer

Figure 12: Transformers require at least 2-3 layers for compositional generalization with the direct prompt format. We vary the number of layers in the Transformer and train on direct composition in a setup identical to Fig. 6 (Right).

We vary the number of layers, the number of attention heads, and the embedding dimension of the nanoGPT model in Fig. 13. We consider a setup identical to Fig. 4; all models are trained on 50 random in-order compositions of 5 bijections. We report accuracy averaged over all 3125 in-order compositions.

We make the following observations. (1) Most surprisingly, the accuracy decreases as the number of layers becomes large for this compositional task; we expect that this is due to issues with optimizing a very deep model. (2) The accuracy does not change with the number of attention heads for a 1-layer Transformer. (3) The accuracy increases as we increase the embedding dimension, and the model underfits the training data when the embedding dimension is too small.

Figure 13: We see compositionality in Transformers even if we change the number of layers and attention heads. Compositionality is seen even in a 1-layer Transformer when trained with the step-by-step prompt format on 50 in-order compositions of bijections. However, the ability to compose degrades as we increase the number of layers in the Transformer.

B.2 LSTMs do not learn to compose

We report results on autoregressively trained LSTMs using the direct prompt format in Table 1 and the step-by-step prompt format in Table 2. LSTMs fail to generalize outside of the training data while Transformers generalize compositionally in both these scenarios. This points to an inductive bias that helps Transformers trained with an autoregressive objective generalize. Specifically, our mechanistic evaluation in Sec. 4.4 shows this is likely attributable to the use of attention.

The LSTMs are trained on the same data using the autoregressive objective defined in Appendix A. We use the AdamW optimizer with a learning rate of 3e-4 ($\beta_1 = 0.9$ and $\beta_2 = 0.95$), a batch size of 512 and a weight decay of 1e-4 for 150 epochs. As is common, we do not use a positional embedding, since the architecture is not permutation invariant.

          Hidden dimension
Layers    256     512
1         22.5    46.0
2         33.4    69.1
Table 1: LSTMs fail to compose in the direct prompt format. We train an LSTM on 250 compositions of two functions (one permutation and one bijection) in the direct prompt format and tabulate the accuracy (%); the setup is identical to Fig. 6 (Right).

The inputs are passed through an input embedding layer before being passed to the LSTM and the outputs of the LSTM are also passed through a linear layer which outputs the logits. In our experiments, we vary the number of stacked LSTMs (or no. of layers) and the dimension of the internal hidden vector.

Despite our attempts to train multiple different LSTMs with the best set of hyper-parameters, we observe that they do not show compositional generalization on any of our synthetic setups. This observation is further evidence for our hypothesis that the attention layers are important for compositionality.

(left) Compositions seen during training
          Hidden layer dimension
Layers    120     256     512     1024
1         16.2    36.2    99.9    99.9
2         60.3    99.3    99.9    99.8
4         18.7    100.0   100.0   9.9

(right) In-order compositions not seen during training
          Hidden layer dimension
Layers    120     256     512     1024
1         9.3     10.3    20.1    22.9
2         12.4    21.3    25.3    28.8
4         6.6     13.9    17.6    10.0
Table 2: LSTMs fail to compose in the step-by-step prompt format. We train autoregressive LSTMs on 50 in-order compositions of 5 bijections from $\mathcal{F}_b$ in the step-by-step format and tabulate the accuracy (%); the setup is identical to Fig. 4. We evaluate the LSTM on the (left) compositions seen during training and (right) in-order compositions not seen during training. LSTMs fail to generalize to functions outside of the training data while Transformers generalize compositionally in the same setting.

B.3 Attention Masks

Detailed setup. We train a 1-layer Transformer on 50 random in-order compositions of 5 bijections in the step-by-step prompt format. We visualize the attention masks for a fixed sequence of task tokens, averaged over 1000 different data tokens, in Fig. 7 (Right). We found the attention masks to be identical across different choices of the task tokens. Each row corresponds to a causal attention mask for a single token and sums to 1. In any given row, the attention is over two elements: the task token and the intermediate output of the composition. The five contiguous blocks along the columns correspond to the five steps of composition. These preliminary results indicate that it is possible to build a complete mechanistic understanding of attention for compositional tasks (see also Appendix C).
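
The row-normalization and averaging described above can be sketched as follows. The function names are hypothetical and the scores are toy values; in the experiment they would come from the query-key products of the trained model.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax with a causal mask: token i attends only to tokens j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


def average_maps(maps):
    """Element-wise mean of several attention maps of the same shape."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) / n for j in range(cols)] for i in range(rows)]


scores_a = [[0.0, 1.0, 2.0], [1.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
scores_b = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 0.0, 0.0]]
avg = average_maps([causal_attention_weights(scores_a), causal_attention_weights(scores_b)])
print(avg)
```

Because every individual row sums to 1, the averaged rows do as well, which is why the averaged map can still be read as a distribution over earlier tokens.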

B.4 Probing the layers in Transformers of different sizes

Figure 14: We use a linear probe to study the accuracy at different layers on Transformers of different sizes. Most architectures see an increase in accuracy in the latter half of the Transformer. The increase in accuracy is more gradual for Transformers with more layers. The accuracy increases sharply after an attention layer across all architectures.

In this section, we consider an experimental setup that is identical to the linear probe experiments in Fig. 7. We compute the probe accuracies for Transformers with different numbers of layers in Fig. 14. Across all models, we observe that accuracy increases in the last few layers. Furthermore, we also observe a sharp increase in accuracy right after the MLPs in the last few layers of the Transformer.

We saw in Fig. 7 (Right) that the attention masks for a 1-layer model seem to select an input and a task token to operate on at every step of the composition. We hence believe that attention plays a major role in compositionality and propose the following hypothesis: the probe accuracy after some MLPs sees a sharp increase because the attention layers play a critical role in selecting the right inputs to pass to the MLP. Specifically, unlike in the 1-layer model, we suspect functions are now distributed across the model layers instead of being localized in the first MLP layer. Consequently, similar to the 1-layer model, attention heads at different layers will infer if the relevant functions implemented in MLP layers in that block are part of the prompt; if so, they transfer the input data through said function.

B.5 Another failure with the direct format with bijections

Figure 15: Transformers fail to generalize to compositions of even 2 bijections when trained with the direct prompt format. The curve depicts the accuracy over all 625 in-order compositions of two bijections (25 choices for each bijection) when trained on different subsets of in-order compositions. The model is trained with direct composition. Even if we train on 500 such compositions, the model fails to generalize to the remaining 125 compositions. This is additional evidence that the model is incapable of composing bijections through direct composition.

In Fig. 6 (Left) we show that Transformers do not learn to compose 5 bijections and only generalize to compositions in the training data. Fig. 15 augments this result and shows that a similar failure occurs even when we consider the composition of just two bijections. Hence, the model may fail to compose some functions in the direct prompt format, and the step-by-step format with an autoregressive objective is far more amenable to compositions.

B.6 Additional experiments with training data from random and base

In this section, we conduct a collection of analyses for a model trained on in-order compositions of 5 bijections in the step-by-step prompt format. We perform the following experiments: (1) compare how base and random generalize to other in-order compositions (Fig. 16); (2) change the number of random functions in the training data (Fig. 17); (3) limit the maximum number of compositions in the training data and evaluate compositional generalization (Fig. 18); (4) look at alternate evaluation metrics (Fig. 19); and (5) test if the compositions are systematic (Hupkes et al., 2020) (Fig. 20).

Refer to caption
Figure 16: How do different training datasets generalize to compositions of many and few functions? This is a fine-grained version of Fig. 3(a). Model trained on 50 random compositions generalizes poorly compositions of small number of functions while a model trained on the base generalizes poorly to composition of 4 or 5 functions.
Refer to caption
Figure 17: Training with different numbers of random  functions. We train on a different number of random functions ranging from 5-70 in steps of 5. These plots are the accuracies averaged over all in-order compositions of 5 bijections over the course of training.
Refer to caption
Figure 18: Limiting maximum number of compositions in the training data. The figure plots the accuracy on all in-order compositions against the number of training iterations. Each sub-plot considers compositions of size exactly 2, 3, 4, 5, respectively in the training data. The model is able to generalize to most in-order compositions only if the training data consists of compositions of size at least 3 (bottom-right).
Refer to caption
Refer to caption
Figure 19: Evaluation metric. We consider 3 different metrics for evaluating the models. The left column considers the average accuracy when the model generates The choice of metric doesn’t change qualitative trends. Each sub-plot considers compositions of only size 2, 3, 4, 5, respectively. In each plot, we vary the number of such functions that are present int he training data. One exception is when we train on compositions of size 2. In this case, the guided generation accuracy is high, but the free generation accuracy is not.
Refer to caption
Figure 20: Systematicity. We consider trained models from Fig. 3(a) and analyze the accuracy of each of the 20 functions (atomic capabilities) when averaged over all instances in which it was used compositionally. We break down the results to see if certain functions are more accurate when used in compositions compared to others and find that models seem to learn all functions equally well.

B.7 Token embeddings

We study the token embeddings of the Transformer models and observe that they are similar for models with different numbers of layers and attention heads (see Fig. 21). We notice a block-diagonal structure that separates task tokens from the data tokens. We also observe another block-diagonal structure within the task tokens, which occurs when we train only on in-order compositions.

Figure 21: Word embedding correlations present a block-diagonal structure that separates data tokens from task tokens. We plot the inner product between all pairs of word embeddings of the tokens. The task tokens are orthogonal to the set of input tokens. Different functions in the same level, i.e., $\{F_i^{(l)}\}_{i=1}^{N}$ for a fixed $l$, form a block-diagonal in this matrix. We observe similar word embeddings in Transformers of different sizes.

Appendix C Analysis of Step-by-step and Direct Prompt Formats

C.1 Transformers for the step-by-step prompt format

We prove that there exist Transformers that can compositionally generalize in the step-by-step prompt format. Such a constructive proof, similar to (Von Oswald et al., 2023; Ahn et al., 2023; Weiss et al., 2021; Li et al., 2023c), can be used to generate plausible mechanistic hypotheses by highlighting the role of the attention and MLP layers. While the universal approximation theorem suggests that any function can be represented by a wide enough multi-layer perceptron (MLP), the construction suggests that Transformers can represent the same function efficiently.

Description of the data.

We will operate with a simplified prompt format where a composition of three functions is to be applied to a single input token. The construction can be generalized to compositions of more functions or to multiple input tokens. The input prompt $[x_{F_1}, x_{F_2}, x_{F_3}, x_d]$ has three task tokens and a single data token, and the desired output for this prompt is $[F_1(x_d),\; F_2 \circ F_1(x_d),\; F_3 \circ F_2 \circ F_1(x_d)]$.

The position encodings $P = \begin{bmatrix} p_1 & p_2 & \cdots & p_6 \end{bmatrix}$ are learnable parameters and have dimension $d_p$, i.e., $P \in \mathbb{R}^{d_p \times 6}$. The number of input tokens is $d_v$ and the number of task tokens is $d_f$. Both input tokens $x_d$ and task tokens $x_{F_1}$ are embedded as one-hot vectors in $\mathbb{R}^{d_x}$, where $d_x = d_v + d_f$. The first $d_v$ dimensions are used to embed the data tokens and the last $d_f$ dimensions embed the task tokens. Henceforth, both $x_d$ and $x_{F_1}$ refer to the corresponding one-hot vectors in $\mathbb{R}^{d_x}$. For convenience, we define $d = d_x + d_p$. Tying this back to Section 3, observe that $|X_d| = d_v$ and $|X_f| = d_f$. We denote the input to the model by $Z$, which includes the token embedding and position encoding. Specifically, we have

$$Z = \begin{bmatrix} x_{F_1} & x_{F_2} & x_{F_3} & x_d & F_1(x_d) & F_2 \circ F_1(x_d) \\ p_1 & p_2 & p_3 & p_4 & p_5 & p_6 \end{bmatrix},$$

i.e., $Z \in \mathbb{R}^{d \times 6}$. We assume that the position encoding is concatenated to the token embedding as opposed to added to it.
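
As an illustration of this input layout, the following is a minimal NumPy sketch that stacks one-hot token embeddings on top of position encodings. The helper names (`one_hot`, `build_input`), the sizes, the token indices, and the particular choice of `P` are assumptions made for the example only; they are not taken from the paper's code.

```python
import numpy as np

def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def build_input(task_ids, data_ids, d_v, d_f, P):
    """Build Z by concatenating token one-hots (top) with position encodings (bottom).

    Data tokens occupy the first d_v coordinates, task tokens the last d_f.
    """
    d_x = d_v + d_f
    tokens = [one_hot(d_v + f, d_x) for f in task_ids]   # task tokens x_F
    tokens += [one_hot(x, d_x) for x in data_ids]        # data tokens
    X = np.stack(tokens, axis=1)                         # shape (d_x, n)
    assert X.shape[1] == P.shape[1], "one position encoding per token"
    return np.concatenate([X, P], axis=0)                # concatenated, not added

# Hypothetical sizes: 4 data tokens, 3 task tokens, 3-dimensional positions.
d_v, d_f, d_p = 4, 3, 3
P = np.tile(np.eye(d_p), 2)          # an arbitrary (d_p, 6) choice for illustration
# Three task tokens followed by x_d, F_1(x_d), F_2(F_1(x_d)) as made-up data indices.
Z = build_input([0, 1, 2], [2, 0, 3], d_v, d_f, P)
print(Z.shape)
```

Each column of `Z` has exactly one non-zero entry in its token part, and the position part is kept in separate coordinates, which is what lets a weight matrix act on positions and tokens independently.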

Matrix notation.

We use $\mathbbm{1}_{x_d}$ to denote a one-hot vector in the space $\mathbb{R}^{d_v}$, i.e., it excludes dimensions for the task token. On the other hand, $x_d$ denotes a one-hot vector in $\mathbb{R}^{d_x}$. We use $I_{n \times n}$ to denote an identity matrix of size $n \times n$, $1_{m \times n}$ and $0_{m \times n}$ to denote matrices of 1s and 0s of size $m \times n$, and $1_n$ and $0_n$ to denote matrices of size $n \times 1$.

Description of the architecture.

Before describing the Transformer architecture, we first define the attention and MLP layers. We use a simplified parameterization of linear attention (Ahn et al., 2023) with weights $Q$ and $K$. The MLP contains two fully connected layers with a ReLU non-linearity parameterized by the weights $W_1$ and $W_2$. The attention and MLP layers are functions of $Z \in \mathbb{R}^{d \times 6}$ and are defined as:

$$\mathrm{Attn}_{Q,K}(Z) = (KZ)(M \odot Z^T Q Z), \quad \text{and}$$
$$\mathrm{MLP}_{W_1,W_2}(Z) = W_2\, \mathrm{ReLU}(W_1 Z),$$

where $Q, K \in \mathbb{R}^{d \times d}$, $W_1 \in \mathbb{R}^{(d_f d_v) \times d}$ and $W_2 \in \mathbb{R}^{d \times (d_f d_v)}$, so that the products $W_1 Z$ and $W_2\,\mathrm{ReLU}(W_1 Z)$ are well defined. The matrix $M \in \mathbb{R}^{6 \times 6}$ enforces causal attention and restricts the attention to inputs from previous time-steps, i.e.,

$$M = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 0 & 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}.$$
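
The two layers and the causal mask can be sketched in a few lines of NumPy. This is only an illustrative reading of the definitions above: the function names, the random weights, and the hidden width are assumptions, and `W1` is taken to have one row per hidden unit so that `W1 @ Z` is defined.

```python
import numpy as np

def causal_mask(n):
    """Upper-triangular mask: M[i, j] = 1 when i <= j, so column j only sees columns i <= j."""
    return np.triu(np.ones((n, n)))

def linear_attention(Z, Q, K):
    """Attn(Z) = (K Z)(M * Z^T Q Z), with no softmax (linear attention)."""
    n = Z.shape[1]
    scores = causal_mask(n) * (Z.T @ Q @ Z)   # elementwise mask of pairwise scores
    return (K @ Z) @ scores

def mlp(Z, W1, W2):
    """MLP(Z) = W2 ReLU(W1 Z), applied to every column independently."""
    return W2 @ np.maximum(W1 @ Z, 0.0)

# Hypothetical sizes and random weights, used only to exercise the functions.
rng = np.random.default_rng(0)
d, n, hidden = 4, 6, 8
Z = rng.normal(size=(d, n))
Q, K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W1, W2 = rng.normal(size=(hidden, d)), rng.normal(size=(d, hidden))

attn_out = linear_attention(Z, Q, K)

# Causality check: changing the last token must not change earlier outputs.
Z_perturbed = Z.copy()
Z_perturbed[:, -1] += 1.0
attn_perturbed = linear_attention(Z_perturbed, Q, K)
print(np.allclose(attn_out[:, :-1], attn_perturbed[:, :-1]))
```

The perturbation check is a quick way to confirm that the mask is oriented correctly for column-wise tokens: output column $j$ depends only on input columns $i \le j$.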

We consider a 1-layer Transformer with an attention layer followed by an MLP layer. We omit layer-norm to simplify the proofs. The function computed by the Transformer is

$$\mathrm{Tr}_{Q,K,W_1,W_2}(Z) = \mathrm{MLP}\left(\mathrm{Attn}(Z) + Z\right) + \mathrm{Attn}(Z) + Z.$$

Henceforth, we omit the subscripts of $\mathrm{Attn}$, $\mathrm{MLP}$ and $\mathrm{Tr}$ for brevity. We include a residual connection after both the attention and MLP layers which mirrors a typical Transformer architecture (Vaswani et al., 2017).

The output of the Transformer is passed through an unembedding matrix $W_e$ followed by a Softmax layer to obtain a probability distribution over the next token denoted by

$$P(Y \mid Z) = \mathrm{Softmax}(W_e\, \mathrm{Tr}(Z)).$$
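
The residual wiring and the unembedding step can be sketched as follows. The layers are passed in as callables so the snippet stays independent of any particular parameterization; the stand-in layers, the sizes, and the random unembedding matrix `We` are assumptions made purely for illustration.

```python
import numpy as np

def transformer_block(Z, attn, mlp):
    """Tr(Z) = MLP(Attn(Z) + Z) + Attn(Z) + Z, without layer-norm."""
    H = attn(Z) + Z          # residual connection around attention
    return mlp(H) + H        # residual connection around the MLP

def next_token_probs(tr_out, We):
    """Column-wise softmax of We @ Tr(Z): one distribution per position."""
    logits = We @ tr_out
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=0, keepdims=True)

# Toy usage with stand-in layers (hypothetical, not the constructed weights).
rng = np.random.default_rng(0)
d, n, vocab = 5, 6, 4
Z = rng.normal(size=(d, n))
out = transformer_block(Z, attn=lambda z: 0.1 * z, mlp=lambda h: np.maximum(h, 0.0))
probs = next_token_probs(out, rng.normal(size=(vocab, d)))
print(probs.shape)
```

Because the softmax is taken over the vocabulary axis, column $t$ of `probs` is the predicted distribution for the token that follows position $t$.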
Theorem C.1.

There exist weights $Q, K, W_1, W_2$ and position encodings $P$ such that an autoregressive Transformer can compositionally generalize to any prompt $[x_{F_1}, x_{F_2}, x_{F_3}, x_d]$. The values of the weights satisfy

$$P^T P = \begin{bmatrix} I_{3\times 3} & I_{3\times 3} \\ I_{3\times 3} & I_{3\times 3} \end{bmatrix}, \qquad Q = \begin{bmatrix} 0_{d_x \times d_x} & 0_{d_x \times d_p} \\ 0_{d_p \times d_x} & I_{d_p \times d_p} \end{bmatrix}, \qquad K = \begin{bmatrix} 0_{d_v \times d_v} & 0_{d_v \times d_f} & 0_{d_v \times d_p} \\ 0_{d_f \times d_v} & I_{d_f \times d_f} & 0_{d_f \times d_p} \\ 0_{d_p \times d_v} & 0_{d_p \times d_f} & 0_{d_p \times d_p} \end{bmatrix},$$

$$W_1 = \underbrace{\begin{bmatrix}
\mathbbm{1}_{x_{d_1}}^T & \mathbbm{1}_{x_{d_1}}^T & \mathbbm{1}_{x_{d_1}}^T & \cdots & \mathbbm{1}_{x_{d_1}}^T \\
\mathbbm{1}_{x_{d_2}}^T & \mathbbm{1}_{x_{d_2}}^T & \mathbbm{1}_{x_{d_2}}^T & \cdots & \mathbbm{1}_{x_{d_2}}^T \\
\vdots & \vdots & \vdots & & \vdots \\
\mathbbm{1}_{x_{d_v}}^T & \mathbbm{1}_{x_{d_v}}^T & \mathbbm{1}_{x_{d_v}}^T & \cdots & \mathbbm{1}_{x_{d_v}}^T \\
0_{d_v}^T & -1_{d_v}^T & -1_{d_v}^T & \cdots & -1_{d_v}^T \\
-1_{d_v}^T & 0_{d_v}^T & -1_{d_v}^T & \cdots & -1_{d_v}^T \\
\vdots & \vdots & \vdots & & \vdots \\
-1_{d_v}^T & -1_{d_v}^T & -1_{d_v}^T & \cdots & 0_{d_v}^T \\
0_{d_p \times d_v} & 0_{d_p \times d_v} & 0_{d_p \times d_v} & \cdots & 0_{d_p \times d_v}
\end{bmatrix}^T}_{d_f \times d_v \text{ columns}}, \quad \text{and} \qquad
W_2 = \begin{bmatrix}
F_{i_1}(x_{d_1})^T - x_{d_1}^T - x_{F_{i_1}}^T \\
F_{i_1}(x_{d_2})^T - x_{d_2}^T - x_{F_{i_1}}^T \\
\vdots \\
F_{i_1}(x_{d_v})^T - x_{d_v}^T - x_{F_{i_1}}^T \\
F_{i_2}(x_{d_1})^T - x_{d_1}^T - x_{F_{i_2}}^T \\
F_{i_2}(x_{d_2})^T - x_{d_2}^T - x_{F_{i_2}}^T \\
\vdots \\
F_{i_2}(x_{d_v})^T - x_{d_v}^T - x_{F_{i_2}}^T \\
\vdots \\
F_{i_T}(x_{d_1})^T - x_{d_1}^T - x_{F_{i_T}}^T \\
F_{i_T}(x_{d_2})^T - x_{d_2}^T - x_{F_{i_T}}^T \\
\vdots \\
F_{i_T}(x_{d_v})^T - x_{d_v}^T - x_{F_{i_T}}^T
\end{bmatrix}^T.$$
Proof.

See Section C.4. ∎

Figure 22: We see a sharp increase in accuracy as we increase the embedding dimension of the Transformer. The number of hidden units in the MLP of the Transformer is 4 times the size of the embedding dimension.

The construction uses the attention layer to aggregate the task token and data token, i.e., attention selects the relevant task token. The query vector of the attention selects the right task using the position encoding. The first layer of the MLP projects the sum of the task and data tokens (which lie in orthogonal subspaces) onto the Cartesian product of the sets of task and data tokens. The second layer computes the function and acts like a lookup table (Geva et al., 2022).

The construction requires that the output of the first fully-connected layer have size at least $d_f d_v$ in order to encode the task and input tokens. In our experiments, we set $d_v = 10$ and $d_f = 21$, and hence the number of hidden units must be at least 210. In practice, we require at least 500 hidden units (see Fig. 22), which is not too far from our estimate. We conjecture that the additional hidden units are helpful for optimization.
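The lookup-table view of the MLP can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's implementation: tokens are one-hot, data tokens occupy the first $d_v$ coordinates and task tokens the next $d_f$, position dimensions are omitted, and the bijections are drawn at random. The names `FUNCS`, `W1`, `W2`, and `apply_task` are hypothetical.

```python
import random

D_V, D_F = 10, 21      # number of data tokens and task tokens, as in the text
D = D_V + D_F          # embedding size in this sketch (position dims omitted)

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_data(v):     # data token v lives in the first D_V coordinates
    return one_hot(v, D)

def embed_task(f):     # task token f lives in the last D_F coordinates
    return one_hot(D_V + f, D)

# Assumption: the task set is D_F random bijections on {0, ..., D_V - 1}.
rng = random.Random(0)
FUNCS = [rng.sample(range(D_V), D_V) for _ in range(D_F)]

W1 = []  # one row of first-layer weights per hidden unit (task f, value v)
W2 = []  # the output vector written by that hidden unit
for f in range(D_F):
    for v in range(D_V):
        row = embed_data(v)              # +1 on the matching data coordinate
        for g in range(D_F):
            if g != f:
                row[D_V + g] = -1.0      # -1 on every other task coordinate
        W1.append(row)
        target = embed_data(FUNCS[f][v])
        # F_f(x_v) - x_v - x_f: the residual stream adds x_v + x_f back.
        W2.append([t - a - b
                   for t, a, b in zip(target, embed_data(v), embed_task(f))])

def mlp_with_residual(h):
    # ReLU leaves exactly one active unit: the one matching both task and value.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in W1]
    y = list(h)                          # residual connection
    for act, out in zip(hidden, W2):
        if act:
            y = [yi + act * oi for yi, oi in zip(y, out)]
    return y

def apply_task(f, v):
    h = [a + b for a, b in zip(embed_data(v), embed_task(f))]
    y = mlp_with_residual(h)
    return max(range(D), key=lambda i: y[i])   # decode the one-hot output
```

In this sketch the hidden layer has exactly $d_f d_v = 210$ units, one per (task, value) pair, which matches the lower bound stated above.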

C.2 Transformers for the direct prompt format

We also prove the existence of a Transformer for a composition of bijections in the direct prompt format. Unlike the step-by-step format, the direct prompt format lacks a “scratchpad” (Nye et al., 2021) for the intermediate outputs of the composition. In our construction, we use $K = 3$ Transformer blocks to compute the composition of $K$ functions; the output of the $k$-th block is the result of the $k^{\text{th}}$ step of the composition.

Description of the data.

We consider the composition of 3 functions with an input prompt denoted by $[x_{F_1}, x_{F_2}, x_{F_3}, x_d]$. Unlike the step-by-step format, the output is just a single token $[F_3 \circ F_2 \circ F_1(x_d)]$. The position encodings are denoted by $P = [p_1, p_2, \dots, p_4]$, where $p_i = \begin{bmatrix} p_{i1}^T & p_{i2}^T & p_{i3}^T \end{bmatrix}^T$ with $p_i \in \mathbb{R}^{d_p}$ and $p_{ij} \in \mathbb{R}^{d_p/3}$. The dimensions $d_x$, $d_v$, $d$ and $d_p$ represent the same quantities as in the step-by-step format. We write $\bar{d}_p$ for $\frac{d_p}{3}$. The input to the model is

\[
Z = \begin{bmatrix}
x_{F_1} & x_{F_2} & x_{F_3} & x_d \\
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34}
\end{bmatrix},
\]

where $Z \in \mathbb{R}^{d \times 4}$.
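To make the direct prompt format concrete, the following sketch generates one prompt and its single-token target. It assumes each bijection is stored as a permutation of `range(d_v)`; the tuple-based token representation and the helper name `make_direct_example` are hypothetical, and position encodings are left out.

```python
import random

def make_direct_example(funcs, d_v, rng):
    """Return (prompt, target) for one direct-format example.

    funcs: list of bijections, each a permutation of range(d_v).
    The prompt is [x_F1, x_F2, x_F3, x_d]; the target is F3(F2(F1(x_d))).
    """
    i1 = rng.randrange(len(funcs))
    i2 = rng.randrange(len(funcs))
    i3 = rng.randrange(len(funcs))
    x_d = rng.randrange(d_v)
    # No intermediate outputs are emitted: only the final composed value.
    target = funcs[i3][funcs[i2][funcs[i1][x_d]]]
    prompt = [("F", i1), ("F", i2), ("F", i3), ("x", x_d)]
    return prompt, target

# Illustrative sizes taken from the experiments described earlier.
rng = random.Random(0)
funcs = [rng.sample(range(10), 10) for _ in range(21)]
prompt, target = make_direct_example(funcs, 10, rng)
```

A step-by-step example would additionally list the intermediate values $F_1(x_d)$ and $F_2 \circ F_1(x_d)$ before the final token; here they never appear in the sequence.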

Description of the architecture.

Each Transformer block is defined similarly to the step-by-step format, i.e.,

\[
\mathrm{Block}_{Q_i, K_i, W_{i1}, W_{i2}}(Z) = \mathrm{MLP}_i(\mathrm{Attn}_i(Z) + Z) + (\mathrm{Attn}_i(Z) + Z),
\]

which we henceforth denote by $\mathrm{Block}_i(Z)$. Unlike the step-by-step format, the model is now composed of 3 blocks corresponding to the 3 steps of the compositional task the model is expected to solve, i.e.,

\[
\mathrm{Tr}(Z) = \mathrm{Block}_3(\mathrm{Block}_2(\mathrm{Block}_1(Z))).
\]

The output of the final block is passed through a Softmax layer to predict a probability distribution over the next token, denoted by $P(Y \mid Z) = \mathrm{Softmax}(W_e \mathrm{Tr}(Z))$.
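The residual structure of a block and the stacking of three blocks can be sketched as follows. This is only a structural illustration: $Z$ is stored as a list of columns, `attn` and `mlp` are arbitrary callables standing in for $\mathrm{Attn}_i$ and $\mathrm{MLP}_i$, the toy `mean_attn` and `relu_mlp` are hypothetical placeholders rather than the constructed weights, and reading the prediction from the last column is an assumption of this sketch.

```python
import math

def add(A, B):
    """Element-wise sum of two matrices stored as lists of columns."""
    return [[a + b for a, b in zip(col_a, col_b)] for col_a, col_b in zip(A, B)]

def block(Z, attn, mlp):
    """Block(Z) = MLP(Attn(Z) + Z) + (Attn(Z) + Z)."""
    H = add(attn(Z), Z)       # attention sub-layer with residual
    return add(mlp(H), H)     # MLP sub-layer with residual

def transformer(Z, blocks):
    """Tr(Z) = Block_3(Block_2(Block_1(Z))) for a list of three (attn, mlp) pairs."""
    for attn, mlp in blocks:
        Z = block(Z, attn, mlp)
    return Z

def softmax(logits):
    m = max(logits)           # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(Z, blocks, W_e):
    """Softmax(W_e Tr(Z)), read from the last column (assumption)."""
    last = transformer(Z, blocks)[-1]
    logits = [sum(w * x for w, x in zip(row, last)) for row in W_e]
    return softmax(logits)

# Toy stand-ins, not the weights from the construction.
def mean_attn(Z):
    mean = [sum(col[i] for col in Z) / len(Z) for i in range(len(Z[0]))]
    return [list(mean) for _ in Z]

def relu_mlp(Z):
    return [[max(0.0, x) for x in col] for col in Z]

Z0 = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]   # 4 columns, 2 rows
W_e = [[1.0, 0.0], [0.0, 1.0]]
probs = next_token_distribution(Z0, [(mean_attn, relu_mlp)] * 3, W_e)
```

Because both sub-layers are residual, a block whose attention and MLP output zero leaves $Z$ unchanged, which is what lets each block add only the update for its own step of the composition.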

Theorem C.2.

There exist weights $P_i, Q_i, K_i, W_{i1}, W_{i2}$ for $i \in [1, 3]$ and position encodings $P$ such that a 3-layer Transformer can compositionally generalize to any prompt of the form $[x_{F_1}, x_{F_2}, x_{F_3}, x_d]$. The values of the weights satisfy

\[
Q_1 = \begin{bmatrix}
0_{d \times d} & 0_{d \times \bar{d}_p} & 0_{d \times \bar{d}_p} & 0_{d \times \bar{d}_p} \\
0_{\bar{d}_p \times d} & I_{\bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} \\
0_{\bar{d}_p \times d} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} \\
0_{\bar{d}_p \times d} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p}
\end{bmatrix},
\qquad
Q_2 = \begin{bmatrix}
0_{d \times d} & 0_{d \times \bar{d}_p} & 0_{d \times \bar{d}_p} & 0_{d \times \bar{d}_p} \\
0_{\bar{d}_p \times d} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} \\
0_{\bar{d}_p \times d} & 0_{\bar{d}_p \times \bar{d}_p} & I_{\bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} \\
0_{\bar{d}_p \times d} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p}
\end{bmatrix},
\]

\[
Q_3 = \begin{bmatrix}
0_{d \times d} & 0_{d \times \bar{d}_p} & 0_{d \times \bar{d}_p} & 0_{d \times \bar{d}_p} \\
0_{\bar{d}_p \times d} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} \\
0_{\bar{d}_p \times d} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} \\
0_{\bar{d}_p \times d} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} & I_{\bar{d}_p}
\end{bmatrix},
\qquad
K_1 = \begin{bmatrix}
0_{d_v \times d_v} & 0_{d_f \times d_v} & 0_{d \times d_p} \\
0_{d_f \times d_v} & I_{d_f} & 0_{d \times d_p} \\
0_{d_v \times d} & 0_{d_f \times d} & 0_{d_p \times d_p}
\end{bmatrix},
\]

\[
K_2 = \frac{K_1}{2}, \qquad K_3 = \frac{K_1}{3},
\]

\[
P_1^T P_1 = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix},
\qquad
P_2^T P_2 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix},
\qquad
P_3^T P_3 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix},
\]
\[
W_{11} = \underbrace{\begin{bmatrix}
\mathbbm{1}_{x_{d_1}}^T & \mathbbm{1}_{x_{d_1}}^T & \mathbbm{1}_{x_{d_1}}^T & \cdots & \mathbbm{1}_{x_{d_1}}^T \\
\mathbbm{1}_{x_{d_2}}^T & \mathbbm{1}_{x_{d_2}}^T & \mathbbm{1}_{x_{d_2}}^T & \cdots & \mathbbm{1}_{x_{d_2}}^T \\
\vdots & \vdots & \vdots & & \vdots \\
\mathbbm{1}_{x_{d_v}}^T & \mathbbm{1}_{x_{d_v}}^T & \mathbbm{1}_{x_{d_v}}^T & \cdots & \mathbbm{1}_{x_{d_v}}^T \\
0_{1 \times d_v} & -1_{1 \times d_v} & -1_{1 \times d_v} & \cdots & -1_{1 \times d_v} \\
-1_{1 \times d_v} & 0_{1 \times d_v} & -1_{1 \times d_v} & \cdots & -1_{1 \times d_v} \\
\vdots & \vdots & \vdots & & \vdots \\
-1_{1 \times d_v} & -1_{1 \times d_v} & -1_{1 \times d_v} & \cdots & 0_{1 \times d_v} \\
0_{d_p \times d_v} & 0_{d_p \times d_v} & 0_{d_p \times d_v} & \cdots & 0_{d_p \times d_v}
\end{bmatrix}^T}_{d_f \times d_v \text{ columns}},
\qquad
W_{12} = \begin{bmatrix}
F_{i_1}(x_{d_1})^T - x_{d_1}^T - x_{F_{i_1}}^T \\
F_{i_1}(x_{d_2})^T - x_{d_2}^T - x_{F_{i_1}}^T \\
\vdots \\
F_{i_1}(x_{d_v})^T - x_{d_v}^T - x_{F_{i_1}}^T \\
F_{i_2}(x_{d_1})^T - x_{d_1}^T - x_{F_{i_2}}^T \\
F_{i_2}(x_{d_2})^T - x_{d_2}^T - x_{F_{i_2}}^T \\
\vdots \\
F_{i_2}(x_{d_v})^T - x_{d_v}^T - x_{F_{i_2}}^T \\
\vdots \\
F_{i_T}(x_{d_1})^T - x_{d_1}^T - x_{F_{i_T}}^T \\
F_{i_T}(x_{d_2})^T - x_{d_2}^T - x_{F_{i_T}}^T \\
\vdots \\
F_{i_T}(x_{d_v})^T - x_{d_v}^T - x_{F_{i_T}}^T
\end{bmatrix}^T.
\]
blackboard_1 start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL start_CELL blackboard_1 start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL start_CELL blackboard_1 start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL start_CELL ⋯ end_CELL start_CELL blackboard_1 start_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL 0 start_POSTSUBSCRIPT 1 × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL - 1 start_POSTSUBSCRIPT 1 × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL - 1 start_POSTSUBSCRIPT 1 × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL ⋯ end_CELL start_CELL - 1 start_POSTSUBSCRIPT 1 × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL - 1 start_POSTSUBSCRIPT 1 × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL 0 start_POSTSUBSCRIPT 1 × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL - 1 start_POSTSUBSCRIPT 1 × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL ⋯ end_CELL start_CELL - 1 start_POSTSUBSCRIPT 1 × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL ⋮ 
end_CELL start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL - 1 start_POSTSUBSCRIPT 1 × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL - 1 start_POSTSUBSCRIPT 1 × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL - 1 start_POSTSUBSCRIPT 1 × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL ⋯ end_CELL start_CELL 0 start_POSTSUBSCRIPT 1 × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL 0 start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL 0 start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL 0 start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL ⋯ end_CELL start_CELL 0 start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT columns end_POSTSUBSCRIPT , italic_W start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT = [ start_ARG start_ROW start_CELL italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT 
end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T 
end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T 
end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT - italic_x start_POSTSUBSCRIPT italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL italic_F start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) - italic_x start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT italic_F start_POSTSUBSCRIPT italic_i start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT . end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT
W21=W22=W23, andW31=W32=W33.formulae-sequencesubscript𝑊21subscript𝑊22subscript𝑊23 andsubscript𝑊31subscript𝑊32subscript𝑊33\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad W_{21}=W_{22}=W_{23},\text{ % and}\qquad W_{31}=W_{32}=W_{33}.italic_W start_POSTSUBSCRIPT 21 end_POSTSUBSCRIPT = italic_W start_POSTSUBSCRIPT 22 end_POSTSUBSCRIPT = italic_W start_POSTSUBSCRIPT 23 end_POSTSUBSCRIPT , and italic_W start_POSTSUBSCRIPT 31 end_POSTSUBSCRIPT = italic_W start_POSTSUBSCRIPT 32 end_POSTSUBSCRIPT = italic_W start_POSTSUBSCRIPT 33 end_POSTSUBSCRIPT .
Proof.

See Section C.5. ∎

C.3 Difference between the direct and step-by-step prompt formats

The ability to run multiple forward passes through the Transformer allows us to tackle a richer class of problems (Merrill & Sabharwal, 2023). This ability differentiates the step-by-step and direct prompt formats. In the step-by-step prompt format, the Transformer makes $L$ different forward passes, while the direct prompt format allows only one forward pass through the model to generate the output. This is also mirrored in our constructions in sections C.1 and C.2: a model for the step-by-step prompt format requires only 1 layer, while one for the direct prompt format uses $L=3$ layers to compensate for the lack of multiple forward passes. We expect that a Transformer for the direct prompt format cannot circumvent these computations and conjecture that our Transformer construction for the direct format (in section C.5) is efficient with respect to the number of layers.

Conjecture C.3.

We conjecture that a Transformer with width $\mathrm{poly}(|\mathcal{F}|)$ needs $\mathcal{O}(L)$ layers in the direct prompt format, compared to the $\mathcal{O}(1)$ layers needed in the step-by-step format, in order to compositionally generalize on our synthetic task.

That is, a model must compute all $L$ intermediate outputs of the composition across different layers of the Transformer. We expand on this further in the next subsection. We also note that, as per the universal approximation theorem, it is certainly possible to construct a 1-layer Transformer that generalizes for the direct prompt format; however, such a model must have width exponential in $|\mathcal{F}|$ in order to store $|\mathcal{F}|^{L}$ different functions in a single layer.
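To make the contrast between the two formats concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the capabilities are bijections on a three-token vocabulary stored as tuples; the names `FUNCTIONS`, `apply_step`, `step_by_step`, `direct`, and `num_compositions` are hypothetical. The step-by-step routine reuses one single-step capability $L$ times and emits every intermediate output, whereas the direct routine must return the final output from a single call.

```python
# Hypothetical toy setup: each capability is a bijection on {0, 1, 2},
# stored as a tuple so that f[x] is the image of token x.
FUNCTIONS = {
    "F1": (1, 2, 0),
    "F2": (0, 2, 1),
    "F3": (2, 1, 0),
}


def apply_step(name, x):
    """Apply one capability to one token (stands in for one forward pass)."""
    return FUNCTIONS[name][x]


def step_by_step(names, x):
    """Emit every intermediate output, using one pass per function."""
    outputs = []
    for name in names:
        x = apply_step(name, x)
        outputs.append(x)
    return outputs


def direct(names, x):
    """Return only the final output; all steps happen inside one call."""
    for name in names:
        x = apply_step(name, x)
    return x


def num_compositions(num_functions, length):
    """Number of function sequences of the given length, i.e. |F|^L."""
    return num_functions ** length


trace = step_by_step(["F1", "F2"], 0)   # intermediate F1(0), then F2(F1(0))
final = direct(["F1", "F2"], 0)         # only F2(F1(0))
table_size = num_compositions(len(FUNCTIONS), 2)
```

In this toy setting, a model that memorized every composition separately would need one entry per sequence counted by `num_compositions`, which grows as $|\mathcal{F}|^{L}$, while the step-by-step routine only ever needs the $|\mathcal{F}|$ single-step capabilities.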

C.3.1 How many training compositions does each prompt format need?

To further understand the difference between the two prompt formats, we will use a (highly simplified) model to reason about the number of function compositions in the training data that is required for perfect compositional generalization on our task. Let us consider a composition of $L$ functions from $\mathcal{F}$. We assume that the compositions in the training data $\mathcal{F}_{\text{train}}^{L} \subset \mathcal{F}^{L}$ are sampled uniformly at random from the set of all compositions.

For this analysis, we assume that the Transformer can perfectly identify which functions to compose, which we ascribe to the attention layers, and will focus entirely on capability acquisition, which we hypothesize is carried out by the MLP layers. We assume that a Transformer for the step-by-step prompt format must learn a function (capability) only once, while a Transformer for the direct prompt format must learn the function $L$ different times, once for each layer of the Transformer. If the function composition $F^{(l)} \in \mathcal{F}_{\text{train}}^{L}$ occurs in the training data, we assume that the Transformer for the step-by-step format has learned all the capabilities $F^{(l)}_{i} \in F^{(l)}$ for $i \in [1, L]$, while a Transformer for the direct prompt format can only learn capability $F^{(l)}_{i}$ at layer $i$. These assumptions are informed by Theorems C.1 and C.2.

Detour into the coupon collector’s problem.

In order to learn all $F = |\mathcal{F}|$ capabilities, the training data must contain each capability at least once. We note that this is the coupon collector’s problem (Myers & Wilf, 2006): the collector seeks all distinct coupons and receives a coupon at every round, drawn uniformly at random. The number of rounds corresponds to the number of function compositions in the training data, and we would like to calculate the expected number of rounds required to learn all capabilities. It is a well-known result that the expected number of rounds to collect all $F$ coupons is $F H_F$, where $H_F$ is the $F$-th harmonic number; asymptotically this is $\mathcal{O}(F \log F)$. Furthermore, the probability that we complete a collection of size $f$ in $n$ rounds is

$$
p(L, f) = \frac{F!}{F^{L}} \genfrac{\{}{\}}{0pt}{}{F-1}{L-1},
$$

where $\genfrac{\{}{\}}{0pt}{}{F-1}{L-1}$ is the Stirling number of the second kind.
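The expectation $F H_F$ quoted above can be checked numerically. The snippet below is a small sketch under the classical assumption of one uniformly drawn coupon per round; the function names and the choice of seed are illustrative and not taken from the paper.

```python
import random
from fractions import Fraction


def expected_rounds(num_coupons):
    """Exact expectation F * H_F for the classical coupon collector."""
    harmonic = sum(Fraction(1, k) for k in range(1, num_coupons + 1))
    return num_coupons * harmonic


def simulate_rounds(num_coupons, rng):
    """Draw uniformly until every coupon has appeared; return the round count."""
    seen = set()
    rounds = 0
    while len(seen) < num_coupons:
        seen.add(rng.randrange(num_coupons))
        rounds += 1
    return rounds


def average_rounds(num_coupons, trials, seed=0):
    """Monte Carlo estimate of the expected number of rounds."""
    rng = random.Random(seed)
    total = sum(simulate_rounds(num_coupons, rng) for _ in range(trials))
    return total / trials
```

For example, `expected_rounds(3)` evaluates $3\,(1 + \tfrac{1}{2} + \tfrac{1}{3}) = \tfrac{11}{2}$ exactly, and `average_rounds(3, trials)` should approach that value as the number of trials grows.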

In the step-by-step prompt format, we observe $L$ capabilities (or coupons) with every composition. All capabilities are learned if we observe each of them in at least one training sample. The expected number of training compositions $N$ required to learn all capabilities is $\mathcal{O}(\frac{F \log F}{L})$ (see Xu & Tang (2011)). On the other hand, the direct prompt format can be treated as $L$ independent coupon collector problems and must observe each capability once for each of the $L$ layers. The expected number of rounds to learn all capabilities is the expected value of the maximum number of rounds over $L$ independent coupon collector problems. If we apply Chebyshev’s inequality, we get

$$
P(N \geq F H_F + c \log F) \leq \frac{\pi^{2}}{6 c^{2} \log^{2} F},
$$

since the variance of $N$ is upper bounded by $\frac{n^{2}\pi^{2}}{6}$. Hence, the maximum value of $L$ different runs is $\mathcal{O}(F \log F)$ as $n \rightarrow \infty$, or in other words, the expected number of rounds to learn all capabilities is $\mathcal{O}(F \log F)$. The expected number of training compositions differs by a factor of $L$ between the two prompt formats, which tallies with the observation that a Transformer is expected to learn the same set of capabilities $L$ different times in the direct format.
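The simplified model above can also be simulated. The sketch below is an illustration under the stated assumptions only: compositions are drawn uniformly from $\mathcal{F}^{L}$, the step-by-step learner keeps one shared pool of observed capabilities, and the direct learner keeps a separate pool for each layer. The function names are hypothetical, and `length` is assumed to be at least 1.

```python
import random


def compositions_needed(num_functions, length, rng):
    """Sample compositions until both simplified learners have seen enough.

    Returns (step_by_step_count, direct_count) for one shared sample stream.
    """
    seen_anywhere = set()                           # step-by-step: shared pool
    seen_at_layer = [set() for _ in range(length)]  # direct: one pool per layer
    step_by_step_count = None
    count = 0
    while True:
        count += 1
        composition = [rng.randrange(num_functions) for _ in range(length)]
        for layer, function in enumerate(composition):
            seen_anywhere.add(function)
            seen_at_layer[layer].add(function)
        if step_by_step_count is None and len(seen_anywhere) == num_functions:
            step_by_step_count = count
        if all(len(pool) == num_functions for pool in seen_at_layer):
            return step_by_step_count, count


def average_counts(num_functions, length, trials, seed=0):
    """Average both counts over several independent sample streams."""
    rng = random.Random(seed)
    totals = [0, 0]
    for _ in range(trials):
        step_count, direct_count = compositions_needed(num_functions, length, rng)
        totals[0] += step_count
        totals[1] += direct_count
    return totals[0] / trials, totals[1] / trials
```

On any single sample stream the step-by-step count can never exceed the direct count, because filling every per-layer pool also fills the shared pool. Calling `average_counts` with a fixed number of functions and increasing `length` gives a rough empirical view of how the gap between the two formats grows.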

In practice, we find that Transformers for the direct format can sometimes fail to compositionally generalize, even with a large number of compositions in the training data (Section 4.3). We hypothesize that this is attributable to the optimization landscape, i.e., gradient descent is unable to find weights that compositionally generalize and instead prefers weights that memorize compositions of functions present in the training data. In the direct prompt format, gradient descent must recover the individual capabilities from a set of compositions of bijections. This is a computationally hard problem, since it is similar to finding the minimal generating set of a group (its time complexity is linear in the size of the group, which is $\mathcal{O}(F^{L})$).

C.4 Proof of Theorem C.1

Step 1: Computing the attention layer.

The attention layer copies the task tokens onto the relevant data token similar to an induction head (Olsson et al., 2022). We first compute the query and value matrices of the attention.

$$
\begin{aligned}
Z^{T} Q Z &= \begin{bmatrix}
x_{F_1}^{T} & p_1^{T} \\
x_{F_2}^{T} & p_2^{T} \\
x_{F_3}^{T} & p_3^{T} \\
x_{d}^{T} & p_4^{T} \\
F_1(x_d)^{T} & p_5^{T} \\
F_2 \circ F_1(x_d)^{T} & p_6^{T}
\end{bmatrix}
\begin{bmatrix}
0_{d \times d} & 0_{d \times d_p} \\
0_{d_p \times d} & I_{d_p \times d_p}
\end{bmatrix}
\begin{bmatrix}
x_{F_1} & x_{F_2} & x_{F_3} & \cdots & F_2 \circ F_1(x_d) \\
p_1 & p_2 & p_3 & \cdots & p_6
\end{bmatrix} \\
&= \begin{bmatrix}
0 & p_1^{T} \\
0 & p_2^{T} \\
0 & p_3^{T} \\
0 & p_4^{T} \\
0 & p_5^{T} \\
0 & p_6^{T}
\end{bmatrix}
\begin{bmatrix}
x_{F_1} & x_{F_2} & x_{F_3} & \cdots & F_2 \circ F_1(x_d) \\
p_1 & p_2 & p_3 & \cdots & p_6
\end{bmatrix}
= P^{T} P
\end{aligned}
$$

Our construction considers a $P$ such that $p_i = p_{i+4}$ for all $i \in [1, 3]$ and $p_i \cdot p_j = 0$ for all $j \in [1, 3]$ and $j \neq i$. The mask $M$ converts $P^{T} P$ into an upper triangular matrix by zeroing out all entries in the lower triangle of the matrix.

$$
M \odot (Z^{T} Q Z) = M \odot (P^{T} P) = M \odot \begin{bmatrix} I_{3 \times 3} & I_{3 \times 3} \\ I_{3 \times 3} & I_{3 \times 3} \end{bmatrix} = \begin{bmatrix} I_{3 \times 3} & I_{3 \times 3} \\ 0_{3 \times 3} & I_{3 \times 3} \end{bmatrix}
$$
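The masked product above can be reproduced on a toy example. The following sketch assumes six positions with one-hot positional encodings that repeat with period three, which is the structure the $3 \times 3$ identity blocks in the display require; this choice of encodings and all names in the code are illustrative assumptions, not the paper's construction verbatim.

```python
def one_hot(index, size):
    """Return a one-hot list of the given size."""
    return [1 if k == index else 0 for k in range(size)]


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


NUM_POSITIONS = 6   # three task tokens followed by three data tokens
PERIOD = 3          # assumption: position i and position i + PERIOD share a code

# Column c of P is the positional encoding of position c (0-indexed here).
encodings = [one_hot(c % PERIOD, PERIOD) for c in range(NUM_POSITIONS)]

# Gram matrix P^T P: entry (r, c) is the inner product of encodings r and c.
gram = [[dot(encodings[r], encodings[c]) for c in range(NUM_POSITIONS)]
        for r in range(NUM_POSITIONS)]

# Mask M keeps the diagonal and upper triangle and zeroes the lower triangle.
masked = [[gram[r][c] if c >= r else 0 for c in range(NUM_POSITIONS)]
          for r in range(NUM_POSITIONS)]


def attended_positions(column):
    """Rows with a nonzero weight in the given column of the masked matrix."""
    return [r for r in range(NUM_POSITIONS) if masked[r][column] != 0]
```

Under these assumptions, column 3 (the data token $x_d$) has nonzero weight on row 0 (the first task token) and on itself, which is the copying behaviour that the attention layer is meant to implement.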

The attention layer computes

$$
\begin{aligned}
\mathrm{Attn}(Z) &= (KZ)\left(M \odot (Z^T Q Z)\right) \\
&= (KZ)\left(M \odot (P^T P)\right) \\
&= \begin{bmatrix} 0_{d_v \times d_v} & 0_{d_f \times d_v} & 0_{d \times d_p} \\ 0_{d_f \times d_v} & I_{d_f \times d_f} & 0_{d \times d_p} \\ 0_{d_v \times d} & 0_{d_f \times d} & 0_{d_p \times d_p} \end{bmatrix} \begin{bmatrix} x_{F_1} & x_{F_2} & x_{F_3} & \cdots & F_2 \circ F_1(x_d) \\ p_1 & p_2 & p_3 & \cdots & p_6 \end{bmatrix} \begin{bmatrix} I_{3\times 3} & I_{3\times 3} \\ 0_{3\times 3} & I_{3\times 3} \end{bmatrix} \\
&= \begin{bmatrix} x_{F_1} & x_{F_2} & x_{F_3} & 0_d & 0_d & 0_d \\ 0_{d_p} & 0_{d_p} & 0_{d_p} & 0_{d_p} & 0_{d_p} & 0_{d_p} \end{bmatrix} \begin{bmatrix} I_{3\times 3} & I_{3\times 3} \\ 0_{3\times 3} & I_{3\times 3} \end{bmatrix} \\
&= \begin{bmatrix} x_{F_1} & x_{F_2} & x_{F_3} & x_{F_1} & x_{F_2} & x_{F_3} \\ 0_{d_p} & 0_{d_p} & 0_{d_p} & 0_{d_p} & 0_{d_p} & 0_{d_p} \end{bmatrix}
\end{aligned}
$$

which, when added to $Z$ through the residual stream at the output of the attention layer, yields

$$
\mathrm{Attn}(Z) + Z = \begin{bmatrix} 2x_{F_1} & 2x_{F_2} & 2x_{F_3} & x_d + x_{F_1} & F_1(x_d) + x_{F_2} & F_2 \circ F_1(x_d) + x_{F_3} \\ p_1 & p_2 & p_3 & p_4 & p_5 & p_6 \end{bmatrix}.
$$
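The attention step and the residual addition can be traced numerically on a toy instance. The sketch below is illustrative only: it assumes each column of $Z$ stacks a one-hot data block, a one-hot task block and a positional code, that $K$ is a diagonal selector of the task block (matching the block structure shown above), and that $Q$ is a diagonal selector of the positional block so that $Z^T Q Z = P^T P$. The token ids in `tasks` and `data` are hypothetical.

```python
def transpose(A):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Multiply two list-of-rows matrices."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def one_hot(i, size):
    return [1 if k == i else 0 for k in range(size)]


d_v, d_f, d_p, n = 3, 3, 3, 6
dim = d_v + d_f + d_p

tasks = [2, 0, 1]  # hypothetical task-token ids for F_1, F_2, F_3
data = [1, 0, 2]   # hypothetical ids for x_d, F_1(x_d), F_2(F_1(x_d))

# Build Z column by column: [data block; task block; positional code].
cols = []
for pos in range(n):
    if pos < 3:
        value = [0] * d_v + one_hot(tasks[pos], d_f)
    else:
        value = one_hot(data[pos - 3], d_v) + [0] * d_f
    cols.append(value + one_hot(pos % 3, d_p))  # assumed codes with period three
Z = transpose(cols)  # shape dim x n

# Assumed selectors: K keeps the task block, Q keeps the positional block.
K = [[1 if r == c and d_v <= r < d_v + d_f else 0 for c in range(dim)]
     for r in range(dim)]
Q = [[1 if r == c and r >= d_v + d_f else 0 for c in range(dim)]
     for r in range(dim)]

scores = matmul(matmul(transpose(Z), Q), Z)  # equals P^T P
masked = [[scores[i][j] if i <= j else 0 for j in range(n)] for i in range(n)]
attn = matmul(matmul(K, Z), masked)

# Residual connection: add Z back to the attention output.
out = [[a + z for a, z in zip(row_a, row_z)] for row_a, row_z in zip(attn, Z)]

print([row[0] for row in out])  # first task position
print([row[3] for row in out])  # first data position
```

In this toy setting the first data position ends up holding both its data one-hot and the one-hot of the first task token, while each task position holds its own task one-hot doubled.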

Step 2: Computing the MLP layer.

After the attention layer, the data and task tokens are aggregated at one location in orthogonal subspaces. The MLP uses the task and data tokens to compute the function. The first fully-connected layer projects the input to a vector in $\mathbb{R}^{d_v d_f}$ that uniquely identifies the task and data tokens; this vector is then used to retrieve the function from $W_2$. The first fully-connected layer computes

$$
\begin{aligned}
(\mathrm{Attn}(Z)+Z)^T W_1^T &= \begin{bmatrix} 2x_{F_1}^T & p_1^T \\ 2x_{F_2}^T & p_2^T \\ 2x_{F_3}^T & p_3^T \\ x_d^T + x_{F_1}^T & p_4^T \\ F_1(x_d)^T + x_{F_2}^T & p_5^T \\ F_2(F_1(x_d))^T + x_{F_3}^T & p_6^T \end{bmatrix}
\begin{bmatrix}
\mathbbm{1}_{x_{d_1}}^T & \mathbbm{1}_{x_{d_1}}^T & \mathbbm{1}_{x_{d_1}}^T & \cdots & \mathbbm{1}_{x_{d_1}}^T \\
\mathbbm{1}_{x_{d_2}}^T & \mathbbm{1}_{x_{d_2}}^T & \mathbbm{1}_{x_{d_2}}^T & \cdots & \mathbbm{1}_{x_{d_2}}^T \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathbbm{1}_{x_{d_v}}^T & \mathbbm{1}_{x_{d_v}}^T & \mathbbm{1}_{x_{d_v}}^T & \cdots & \mathbbm{1}_{x_{d_v}}^T \\
0_{d_v}^T & -1_{d_v}^T & -1_{d_v}^T & \cdots & -1_{d_v}^T \\
-1_{d_v}^T & 0_{d_v}^T & -1_{d_v}^T & \cdots & -1_{d_v}^T \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
-1_{d_v}^T & -1_{d_v}^T & -1_{d_v}^T & \cdots & 0_{d_v}^T \\
0_{d_p \times d_v} & 0_{d_p \times d_v} & 0_{d_p \times d_v} & \cdots & 0_{d_p \times d_v}
\end{bmatrix} \\
&= \begin{bmatrix}
-2_{d_v}^T & \cdots & \cdots & 0_{d_v}^T & \cdots & \cdots & -2_{d_v}^T \\
-2_{d_v}^T & \cdots & 0_{d_v}^T & \cdots & \cdots & \cdots & -2_{d_v}^T \\
-2_{d_v}^T & \cdots & \cdots & \cdots & 0_{d_v}^T & \cdots & -2_{d_v}^T \\
-1_{d_v}^T + \mathbbm{1}_{x_d}^T & \cdots & \cdots & \mathbbm{1}_{x_d}^T & \cdots & \cdots & -1_{d_v}^T + \mathbbm{1}_{x_d}^T \\
-1_{d_v}^T + \mathbbm{1}_{F_1(x_d)}^T & \cdots & \mathbbm{1}_{F_1(x_d)}^T & \cdots & \cdots & \cdots & -1_{d_v}^T + \mathbbm{1}_{F_1(x_d)}^T \\
-1_{d_v}^T + \mathbbm{1}_{F_2 \circ F_1(x_d)}^T & \cdots & \cdots & \cdots & \mathbbm{1}_{F_2 \circ F_1(x_d)}^T & \cdots & -1_{d_v}^T + \mathbbm{1}_{F_2 \circ F_1(x_d)}^T
\end{bmatrix}
\end{aligned}
$$

The above matrix has $d_f d_v$ columns, represented as $d_f$ blocks of size $d_v$. The $0$ blocks in the first, second and third rows occupy $d_v$ columns each. In particular, they occupy the blocks $j_1$, $j_2$ and $j_3$ where $F_i = F_{i_{j_i}}$, i.e., the block number corresponds to the index in the one-hot representation of the task tokens. Let $\mathbbm{1}_{(x,F)}$ denote a one-hot vector in $\mathbb{R}^{d_v \times d_f}$ that uniquely identifies the task and data token. We can succinctly express the output after the non-linearity as follows:

$$
\begin{aligned}
\mathrm{ReLU}(W_1(\mathrm{Attn}(Z)+Z)) &= \mathrm{ReLU}\left(\left((\mathrm{Attn}(Z)+Z)^T W_1^T\right)^T\right) \\
&= \begin{bmatrix}
0_{d_v} & 0_{d_v} & 0_{d_v} & 0_{d_v} & 0_{d_v} & 0_{d_v} \\
0_{d_v} & 0_{d_v} & 0_{d_v} & \cdots & \cdots & \cdots \\
0_{d_v} & 0_{d_v} & 0_{d_v} & \cdots & \mathbbm{1}_{F_1(x_d)} & \cdots \\
0_{d_v} & 0_{d_v} & 0_{d_v} & \mathbbm{1}_{x_d} & \cdots & \cdots \\
0_{d_v} & 0_{d_v} & 0_{d_v} & \cdots & \cdots & \mathbbm{1}_{F_2 \circ F_1(x_d)} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
0_{d_v} & 0_{d_v} & 0_{d_v} & 0_{d_v} & 0_{d_v} & 0_{d_v}
\end{bmatrix}
\end{aligned}
$$
=[0dvdf0dvdf0dvdf𝟙(xd,F1)𝟙(F1(xd),F2)𝟙(F2F1(xd),F3)]absentmatrixsubscript0subscript𝑑𝑣subscript𝑑𝑓subscript0subscript𝑑𝑣subscript𝑑𝑓subscript0subscript𝑑𝑣subscript𝑑𝑓subscript1subscript𝑥𝑑subscript𝐹1subscript1subscript𝐹1subscript𝑥𝑑subscript𝐹2subscript1subscript𝐹2subscript𝐹1subscript𝑥𝑑subscript𝐹3\displaystyle=\begin{bmatrix}0_{d_{v}d_{f}}&0_{d_{v}d_{f}}&0_{d_{v}d_{f}}&% \mathbbm{1}_{(x_{d},F_{1})}&\mathbbm{1}_{(F_{1}(x_{d}),F_{2})}&\mathbbm{1}_{(F% _{2}\circ F_{1}(x_{d}),F_{3})}\end{bmatrix}= [ start_ARG start_ROW start_CELL 0 start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL 0 start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL 0 start_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL blackboard_1 start_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT , italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT end_CELL start_CELL blackboard_1 start_POSTSUBSCRIPT ( italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) , italic_F start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT end_CELL start_CELL blackboard_1 start_POSTSUBSCRIPT ( italic_F start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∘ italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) , italic_F start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ]

Including the final weight matrix $W_{2}$, we get

\[
\begin{aligned}
W_{2}\,\mathrm{ReLU}(W_{1}(\mathrm{Attn}(Z)+Z)) &= W_{2}\begin{bmatrix}
0_{d_{v}d_{f}} & 0_{d_{v}d_{f}} & 0_{d_{v}d_{f}} & \mathbbm{1}_{(x_{d},F_{1})} & \mathbbm{1}_{(F_{1}(x_{d}),F_{2})} & \mathbbm{1}_{(F_{2}\circ F_{1}(x_{d}),F_{3})}
\end{bmatrix} \\
&= \begin{bmatrix}
0_{d}^{T} & 0_{d_{p}}^{T} \\
0_{d}^{T} & 0_{d_{p}}^{T} \\
0_{d}^{T} & 0_{d_{p}}^{T} \\
F_{1}(x_{d})^{T}-x_{d}^{T}-x_{F_{1}}^{T} & 0_{d_{p}}^{T} \\
F_{2}\circ F_{1}(x_{d})^{T}-x_{F_{1}(x_{d})}^{T}-x_{F_{2}}^{T} & 0_{d_{p}}^{T} \\
F_{3}\circ F_{2}\circ F_{1}(x_{d})^{T}-x_{F_{2}\circ F_{1}(x_{d})}^{T}-x_{F_{3}}^{T} & 0_{d_{p}}^{T}
\end{bmatrix}^{T}
\end{aligned}
\]

Hence, the output of the Transformer is

\[
\begin{aligned}
\mathrm{Tr}(Z) &= \mathrm{MLP}(\mathrm{Attn}(Z)+Z)+(\mathrm{Attn}(Z)+Z) \\
&= \begin{bmatrix}
0_{d}^{T} & 0_{d_{p}}^{T} \\
0_{d}^{T} & 0_{d_{p}}^{T} \\
0_{d}^{T} & 0_{d_{p}}^{T} \\
F_{1}(x_{d})^{T}-x_{d}^{T}-x_{F_{1}}^{T} & 0_{d_{p}}^{T} \\
F_{2}\circ F_{1}(x_{d})^{T}-x_{F_{1}(x_{d})}^{T}-x_{F_{2}}^{T} & 0_{d_{p}}^{T} \\
F_{3}\circ F_{2}\circ F_{1}(x_{d})^{T}-x_{F_{2}\circ F_{1}(x_{d})}^{T}-x_{F_{3}}^{T} & 0_{d_{p}}^{T}
\end{bmatrix}^{T}
+ \begin{bmatrix}
2x_{F_{1}}^{T} & p_{1}^{T} \\
2x_{F_{2}}^{T} & p_{2}^{T} \\
2x_{F_{3}}^{T} & p_{3}^{T} \\
x_{d}^{T}+x_{F_{1}}^{T} & p_{4}^{T} \\
x_{F_{1}(x_{d})}^{T}+x_{F_{2}}^{T} & p_{5}^{T} \\
x_{F_{2}\circ F_{1}(x_{d})}^{T}+x_{F_{3}}^{T} & p_{6}^{T}
\end{bmatrix}^{T} \\
&= \begin{bmatrix}
2x_{F_{1}} & 2x_{F_{2}} & 2x_{F_{3}} & F_{1}(x_{d}) & F_{2}\circ F_{1}(x_{d}) & F_{3}\circ F_{2}\circ F_{1}(x_{d}) \\
p_{1} & p_{2} & p_{3} & p_{4} & p_{5} & p_{6}
\end{bmatrix}
\end{aligned}
\tag{0}
\]
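The cancellation in the last step can be checked numerically on a toy instance. The sketch below is illustrative only and rests on assumptions that are not taken from the paper: three data tokens and two tasks, one-hot token vectors in $\mathbb{R}^{d_v+d_f}$, two arbitrary task maps, and the hypothetical helper names `pair_index`, `build_w2`, and `block_output`. It treats $W_2$ as a lookup table whose column for the pair $(x, F)$ is $F(x)-x-x_F$, feeds it the one-hot hidden vector $\mathbbm{1}_{(x,F)}$, and adds the residual $x+x_F$.

```python
# Toy check that the MLP output plus the residual stream equals F(x).
# Assumptions (illustrative, not from the paper): d_v = 3 data tokens,
# d_f = 2 tasks, one-hot token vectors, and the two task maps below.

D_V, D_F = 3, 2
D = D_V + D_F

TASKS = [
    {0: 1, 1: 2, 2: 0},  # task 0: cyclic shift of the data tokens
    {0: 0, 1: 2, 2: 1},  # task 1: swap data tokens 1 and 2
]


def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]


def data_vec(token):
    # x_d: one-hot in the first d_v coordinates
    return one_hot(token, D)


def task_vec(task):
    # x_F: one-hot in the last d_f coordinates
    return one_hot(D_V + task, D)


def pair_index(token, task):
    # position of the pair (x, F) in the d_v * d_f hidden layer
    return token * D_F + task


def build_w2():
    # Column for the pair (x, F) stores F(x) - x - x_F.
    columns = []
    for token in range(D_V):
        for task in range(D_F):
            target = data_vec(TASKS[task][token])
            x, x_f = data_vec(token), task_vec(task)
            columns.append([t - a - b for t, a, b in zip(target, x, x_f)])
    return columns


def block_output(token, task):
    w2 = build_w2()
    hidden = one_hot(pair_index(token, task), D_V * D_F)  # indicator of (x, F)
    mlp = [sum(w2[j][i] * hidden[j] for j in range(D_V * D_F)) for i in range(D)]
    residual = [a + b for a, b in zip(data_vec(token), task_vec(task))]  # x + x_F
    return [m + r for m, r in zip(mlp, residual)]
```

For every pair, `block_output(token, task)` returns the one-hot vector of `TASKS[task][token]`, which mirrors the data-token columns of the final matrix above.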

If we set

\[
W_{e}=\begin{bmatrix}
I_{d\times d} & 0_{d\times d_{p}} \\
0_{d_{p}\times d} & 0_{d_{p}\times d_{p}}
\end{bmatrix},
\]

then $W_{e}\mathrm{Tr}(Z)$ evaluates to

\[
\begin{bmatrix}
2x_{F_{1}} & 2x_{F_{2}} & 2x_{F_{3}} & F_{1}(x_{d}) & F_{2}\circ F_{1}(x_{d}) & F_{3}\circ F_{2}\circ F_{1}(x_{d})
\end{bmatrix}
\]

which will assign high probabilities to the desired token when passed through a Softmax layer. Hence, a Transformer prompted with $[x_{F_{1}},x_{F_{2}},x_{F_{3}},x_{d}]$ will auto-regressively generate $[F_{1}(x_{d}),F_{2}\circ F_{1}(x_{d}),F_{3}\circ F_{2}\circ F_{1}(x_{d})]$ for any combination of data and task tokens. ∎
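The step-by-step generation described above can be mimicked at a purely functional level. The following sketch is not the weight-level construction from the proof: it abstracts the attention layer as "pick the task token for the current step" and the MLP as "look up that task on the latest data token", then decodes greedily through a softmax. The task maps, the names `next_token` and `generate`, and the restriction of the logits to three data tokens are all assumptions made for illustration.

```python
import math

# Hypothetical task maps over three data tokens (illustrative only).
TASKS = {
    "shift": {0: 1, 1: 2, 2: 0},
    "swap": {0: 0, 1: 2, 2: 1},
}
VOCAB_SIZE = 3


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token(sequence, num_tasks):
    step = len(sequence) - num_tasks - 1   # number of outputs generated so far
    task = sequence[step]                  # attention: select the task for this step
    target = TASKS[task][sequence[-1]]     # MLP: apply it to the latest data token
    # One-hot logits over the data tokens, as in the data columns of W_e Tr(Z).
    logits = [1.0 if t == target else 0.0 for t in range(VOCAB_SIZE)]
    probs = softmax(logits)
    return max(range(VOCAB_SIZE), key=probs.__getitem__)  # greedy decoding


def generate(task_names, x):
    # Prompt is [F_1, ..., F_k, x]; one output token is produced per task.
    sequence = list(task_names) + [x]
    for _ in task_names:
        sequence.append(next_token(sequence, len(task_names)))
    return sequence[len(task_names) + 1:]
```

With these toy maps, `generate(["shift", "swap", "shift"], 0)` yields the intermediate outputs of the three-step composition in order, one token per step.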

C.5 Proof of Theorem C.2

The details of the construction are similar to those in Section C.4.

Step 1: Computing the output of the first block.

The first Transformer block computes the first step of the composition. The attention layer, in particular, copies the relevant task token to the data token. The value and query matrices of the attention layer in the first Transformer block are

\[
\begin{aligned}
Z^{T}Q_{1}Z &= \begin{bmatrix}
x_{F_{1}}^{T} & p_{11}^{T} & p_{21}^{T} & p_{31}^{T} \\
x_{F_{2}}^{T} & p_{12}^{T} & p_{22}^{T} & p_{32}^{T} \\
x_{F_{3}}^{T} & p_{13}^{T} & p_{23}^{T} & p_{33}^{T} \\
x_{d}^{T} & p_{14}^{T} & p_{24}^{T} & p_{34}^{T}
\end{bmatrix}
\begin{bmatrix}
0_{d\times d} & 0_{d\times\bar{d}_{p}} & 0_{d\times\bar{d}_{p}} & 0_{d\times\bar{d}_{p}} \\
0_{\bar{d}_{p}\times d} & I_{\bar{d}_{p}\times\bar{d}_{p}} & 0_{\bar{d}_{p}\times\bar{d}_{p}} & 0_{\bar{d}_{p}\times\bar{d}_{p}} \\
0_{\bar{d}_{p}\times d} & 0_{\bar{d}_{p}\times\bar{d}_{p}} & 0_{\bar{d}_{p}\times\bar{d}_{p}} & 0_{\bar{d}_{p}\times\bar{d}_{p}} \\
0_{\bar{d}_{p}\times d} & 0_{\bar{d}_{p}\times\bar{d}_{p}} & 0_{\bar{d}_{p}\times\bar{d}_{p}} & 0_{\bar{d}_{p}\times\bar{d}_{p}}
\end{bmatrix}
\begin{bmatrix}
x_{F_{1}} & x_{F_{2}} & x_{F_{3}} & x_{d} \\
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34}
\end{bmatrix} \\
&= P_{1}^{T}P_{1},
\end{aligned}
\]

and

\[
K_1 Z =
\begin{bmatrix}
0_{d_v \times d_v} & 0_{d_f \times d_v} & 0_{d \times d_p} \\
0_{d_f \times d_v} & I_{d_f} & 0_{d \times d_p} \\
0_{d_v \times d} & 0_{d_f \times d} & 0_{d_p \times d_p}
\end{bmatrix}
\begin{bmatrix}
x_{F_1} & x_{F_2} & x_{F_3} & x_d \\
p_1 & p_2 & p_3 & p_4
\end{bmatrix}
=
\begin{bmatrix}
x_{F_1} & x_{F_2} & x_{F_3} & 0_d \\
0_{d_p} & 0_{d_p} & 0_{d_p} & 0_{d_p}
\end{bmatrix}
\]

Using the above, the output of the first attention layer added to the residual stream is

\begin{align*}
\mathrm{Attn}_1(Z) + Z
&= (K_1 Z)(M \odot Z^T Q_1 Z) + Z \\
&= (K_1 Z)(M \odot P_1^T P_1) + Z \\
&= \begin{bmatrix}
x_{F_1} & x_{F_2} & x_{F_3} & 0 \\
0_{d_p} & 0_{d_p} & 0_{d_p} & 0_{d_p}
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix} + Z \\
&= \begin{bmatrix}
2x_{F_1} & 2x_{F_2} & 2x_{F_3} & x_d + x_{F_1} \\
p_1 & p_2 & p_3 & p_4
\end{bmatrix}
\end{align*}
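
The masked product above can be checked numerically. The following is a minimal sketch in plain Python, not the construction from the paper itself: the position block `P1`, the causal mask `M`, and the toy embeddings in `X` are illustrative assumptions, chosen so that the data token shares its first position encoding with the first task token. Only the token rows of $Z$ are tracked, since $K_1 Z$ is zero in the position rows.

```python
# Toy check of the first attention step; every number here is an assumption.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

# Assumed position block P_1 (one column per token): the data token
# (column 4) shares its encoding with the first task token (column 1).
P1 = [[1, 0, 0, 1],
      [0, 1, 0, 0],
      [0, 0, 1, 0]]

# Assumed causal mask M: entry (i, j) survives when token i is not after token j.
M = [[1 if i <= j else 0 for j in range(4)] for i in range(4)]

scores = matmul(transpose(P1), P1)                      # P_1^T P_1
attn1 = [[M[i][j] * scores[i][j] for j in range(4)]     # M ⊙ P_1^T P_1
         for i in range(4)]

# Token rows of Z with d = 2; columns are x_F1, x_F2, x_F3, x_d.
X = [[1, 2, 3, 10],
     [4, 5, 6, 20]]

# K_1 Z keeps the task-token columns and zeroes the data column.
K1Z = [[row[0], row[1], row[2], 0] for row in X]

mixed = matmul(K1Z, attn1)
out = [[mixed[r][c] + X[r][c] for c in range(4)] for r in range(2)]

print(attn1)  # the 4x4 pattern used in the derivation
print(out)    # task columns doubled; last column is x_d + x_F1
```

With these toy values the last column of `out` equals `x_d + x_F1`, i.e. `(10 + 1, 20 + 4)`, and the three task columns are doubled, matching the final line of the derivation.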

Note that $W_{11}$ and $W_{21}$ are identical to the weights $W_1$ and $W_2$ introduced earlier, and performing a similar calculation yields

\begin{align*}
\mathrm{Block}_1(Z)
&= W_{21}\,\mathrm{ReLU}\bigl(W_{11}(\mathrm{Attn}_1(Z) + Z)\bigr) + (\mathrm{Attn}_1(Z) + Z) \\
&= \begin{bmatrix}
2x_{F_1} & 2x_{F_2} & 2x_{F_3} & F_1(x_d) \\
p_1 & p_2 & p_3 & p_4
\end{bmatrix} = Z_{B_1}.
\end{align*}

We denote the output of the first Transformer block by $Z_{B_1}$.

Step 2: Computing the output of the second block.

The second block uses the output of the first Transformer block to compute the second step of the composition. We start similarly by computing the query and value matrices of the attention layer, i.e.,

\begin{align*}
Z_{B_1}^T Q_2 Z_{B_1}
&= \begin{bmatrix}
2x_{F_1}^T & p_{11}^T & p_{21}^T & p_{31}^T \\
2x_{F_2}^T & p_{12}^T & p_{22}^T & p_{32}^T \\
2x_{F_3}^T & p_{13}^T & p_{23}^T & p_{33}^T \\
F_1(x_d)^T & p_{14}^T & p_{24}^T & p_{34}^T
\end{bmatrix}
\begin{bmatrix}
0_{d \times d} & 0_{d \times \bar{d}_p} & 0_{d \times \bar{d}_p} & 0_{d \times \bar{d}_p} \\
0_{\bar{d}_p \times d} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} \\
0_{\bar{d}_p \times d} & 0_{\bar{d}_p \times \bar{d}_p} & I_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} \\
0_{\bar{d}_p \times d} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p} & 0_{\bar{d}_p \times \bar{d}_p}
\end{bmatrix}
\begin{bmatrix}
2x_{F_1} & 2x_{F_2} & 2x_{F_3} & F_1(x_d) \\
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34}
\end{bmatrix} \\
&= P_2^T P_2
\end{align*}

and

\[
K_2 Z_{B_1} = \frac{1}{2}
\begin{bmatrix}
0_{d_v \times d_v} & 0_{d_f \times d_v} & 0_{d \times d_p} \\
0_{d_f \times d_v} & I_{d_f} & 0_{d \times d_p} \\
0_{d_v \times d} & 0_{d_f \times d} & 0_{d_p \times d_p}
\end{bmatrix}
\begin{bmatrix}
2x_{F_1} & 2x_{F_2} & 2x_{F_3} & F_1(x_d) \\
p_1 & p_2 & p_3 & p_4
\end{bmatrix}
=
\begin{bmatrix}
x_{F_1} & x_{F_2} & x_{F_3} & 0_d \\
0_{d_p} & 0_{d_p} & 0_{d_p} & 0_{d_p}
\end{bmatrix}.
\]

Using the above, we can compute the output of the attention layer in the second Transformer block which evaluates to

\begin{align*}
\mathrm{Attn}_2(Z_{B_1}) + Z_{B_1}
&= (K_2 Z_{B_1})(M \odot Z_{B_1}^T Q_2 Z_{B_1}) + Z_{B_1} \\
&= (K_2 Z_{B_1})(M \odot P_2^T P_2) + Z_{B_1} \\
&= \begin{bmatrix}
x_{F_1} & x_{F_2} & x_{F_3} & 0 \\
0 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix} + Z_{B_1} \\
&= \begin{bmatrix}
3x_{F_1} & 3x_{F_2} & 3x_{F_3} & F_1(x_d) + x_{F_2} \\
p_1 & p_2 & p_3 & p_4
\end{bmatrix}.
\end{align*}
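
The role of the factor $\frac{1}{2}$ in $K_2$ is easy to miss: it undoes the doubling of the task columns left behind by the first block, so that exactly one copy of $x_{F_2}$ reaches the data column. The sketch below replays this step with toy numbers. The embeddings in `x_f` and the stand-in `f1_xd` for $F_1(x_d)$ are assumptions, and the attention pattern is copied from the derivation rather than recomputed from position encodings.

```python
# Toy check of the second attention step; all vectors are illustrative assumptions.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Assumed task embeddings x_F1, x_F2, x_F3 as columns (d = 2).
x_f = [[1, 2, 3],
       [4, 5, 6]]
# Assumed stand-in for the vector F_1(x_d) written by the first block.
f1_xd = [7, 8]

# Token rows of Z_{B_1}: columns 2*x_F1, 2*x_F2, 2*x_F3, F_1(x_d).
ZB1 = [[2 * x_f[r][0], 2 * x_f[r][1], 2 * x_f[r][2], f1_xd[r]]
       for r in range(2)]

# K_2 halves the task columns and zeroes the data column.
K2ZB1 = [[ZB1[r][c] / 2 if c < 3 else 0 for c in range(4)] for r in range(2)]

# M ⊙ P_2^T P_2 as given in the derivation: the data column reads token 2.
attn2 = [[1, 0, 0, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0],
         [0, 0, 0, 1]]

mixed = matmul(K2ZB1, attn2)
out = [[mixed[r][c] + ZB1[r][c] for c in range(4)] for r in range(2)]

print(out)  # task columns are 3*x_F; last column is F_1(x_d) + x_F2
```

Dropping the division by 2 would add $2x_{F_2}$ to the data column instead of $x_{F_2}$, so the input to the second MLP would no longer match the form `value + task` assumed in the first block.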

The attention layer uses the sub-matrix $P_2$ of the position encodings to copy the second task token to the data token. We repeat the MLP calculation from the first block, now with $W_{12}$ and $W_{22}$, which yields

\begin{align*}
\mathrm{Block}_2(\mathrm{Block}_1(Z))
&= W_{22}\,\mathrm{ReLU}\bigl(W_{12}(\mathrm{Attn}_2(Z_{B_1}) + Z_{B_1})\bigr) + (\mathrm{Attn}_2(Z_{B_1}) + Z_{B_1}) \\
&= \begin{bmatrix}
3x_{F_1} & 3x_{F_2} & 3x_{F_3} & F_2 \circ F_1(x_d) \\
p_1 & p_2 & p_3 & p_4
\end{bmatrix} = Z_{B_2}.
\end{align*}
Step 3: Computing the output of the final Transformer block.

Unsurprisingly, the calculations for the last Transformer block are almost identical. The query matrix is $Z_{B_2}^T Q_3 Z_{B_2} = P_3^T P_3$ and the value matrix is

$$
K_3 Z_{B_2} = \frac{1}{3} \begin{bmatrix} x_{F_1} & x_{F_2} & x_{F_3} & 0_d \\ 0_{d_p} & 0_{d_p} & 0_{d_p} & 0_{d_p} \end{bmatrix} \begin{bmatrix} 3x_{F_1} & 3x_{F_2} & 3x_{F_3} & F_2 \circ F_1(x_d) \\ p_1 & p_2 & p_3 & p_4 \end{bmatrix} = \begin{bmatrix} x_{F_1} & x_{F_2} & x_{F_3} & 0_d \\ 0_{d_p} & 0_{d_p} & 0_{d_p} & 0_{d_p} \end{bmatrix}.
$$

The output of the attention layer in the final block is

$$
\begin{aligned}
\mathrm{Attn}_3(Z_{B_2}) + Z_{B_2} &= (K_3 Z_{B_2})(M \odot Z_{B_2}^T Q_3 Z_{B_2}) + Z_{B_2} \\
&= (K_3 Z_{B_2})(M \odot P_3^T P_3) + Z_{B_2} \\
&= \begin{bmatrix} x_{F_1} & x_{F_2} & x_{F_3} & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} + Z_{B_2} \\
&= \begin{bmatrix} 4x_{F_1} & 4x_{F_2} & 4x_{F_3} & F_2 \circ F_1(x_d) + x_{F_3} \\ p_1 & p_2 & p_3 & p_4 \end{bmatrix}.
\end{aligned}
$$
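
The matrix arithmetic in this attention step can be checked numerically. The following is a minimal sketch in plain Python, not the paper's code: the dimensions ($d = 3$, $d_p = 4$), the one-hot task tokens and position encodings, and the placeholder vector `y` standing in for $F_2 \circ F_1(x_d)$ are all assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def from_columns(cols):
    """Build a row-major matrix from a list of column vectors."""
    return [list(row) for row in zip(*cols)]

def column(m, j):
    """Extract column j of a row-major matrix."""
    return [row[j] for row in m]

# Assumed toy values: d = 3 (token part), d_p = 4 (position part).
x_f = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # task tokens x_{F_1}, x_{F_2}, x_{F_3}
y = [7, 8, 9]                             # placeholder for F_2 o F_1(x_d)
p = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # p_1 .. p_4
zeros_p = [0, 0, 0, 0]

# Z_{B_2}: task columns carry 3 * x_{F_i}, the last column carries the data token.
z_b2 = from_columns([[3 * v for v in x_f[i]] + p[i] for i in range(3)] + [y + p[3]])

# Value matrix K_3 Z_{B_2}: un-scaled task tokens, zero data column, zero positions.
values = from_columns([x_f[i] + zeros_p for i in range(3)] + [[0, 0, 0] + zeros_p])

# Masked attention pattern M . (P_3^T P_3) as displayed in the derivation.
pattern = [[1, 0, 0, 0],
           [0, 1, 0, 0],
           [0, 0, 1, 1],
           [0, 0, 0, 1]]

attended = matmul(values, pattern)
out = [[a + z for a, z in zip(row_a, row_z)] for row_a, row_z in zip(attended, z_b2)]

print(column(out, 0))  # first task column: 4 * x_{F_1} stacked on p_1
print(column(out, 3))  # data column: y + x_{F_3} stacked on p_4
```

The only off-diagonal entry of the pattern sits in the third row and fourth column, which is what routes $x_{F_3}$ into the data column while every task column simply gains one more copy of itself.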

Passing the output of the attention layer, $\mathrm{Attn}_3(Z_{B_2}) + Z_{B_2}$, through the last MLP yields the output of the Transformer, which is

$$
\begin{aligned}
\mathrm{Tr}(Z) &= \mathrm{Block}_3(\mathrm{Block}_2(\mathrm{Block}_1(Z))) \\
&= W_{32}\,\mathrm{ReLU}(W_{31}(\mathrm{Attn}_3(Z_{B_2}) + Z_{B_2})) + (\mathrm{Attn}_3(Z_{B_2}) + Z_{B_2}) \\
&= \begin{bmatrix} 4x_{F_1} & 4x_{F_2} & 4x_{F_3} & F_3 \circ F_2 \circ F_1(x_d) \\ p_1 & p_2 & p_3 & p_4 \end{bmatrix}.
\end{aligned}
$$

Hence, the output of the Transformer is a composition of the three functions $F_1$, $F_2$ and $F_3$ applied to token $x_d$. ∎
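
The overall mechanism of the construction (attention selects which function to apply, the MLP executes it) can be illustrated with a toy simulation. This is a hedged sketch under several assumptions and is not the construction's actual weights: data and task tokens are one-hot vectors in disjoint coordinates, the three functions in `FUNCS` are hypothetical, the first block is assumed to behave like the second and third, and the ReLU MLP is replaced by a plain Python stand-in (`mlp_with_residual`) that reads the data value and the task index from the summed column.

```python
N_DATA = 5                 # hypothetical number of data-token values
N_TASK = 3                 # one task token per function F_1, F_2, F_3
DIM = N_DATA + N_TASK      # data and task tokens live in disjoint coordinates

def one_hot(index):
    return [1.0 if j == index else 0.0 for j in range(DIM)]

def task_token(k):
    return one_hot(N_DATA + k)

# Hypothetical functions on data values; any maps on range(N_DATA) would do.
FUNCS = [
    lambda v: (v + 1) % N_DATA,
    lambda v: (2 * v) % N_DATA,
    lambda v: (v + 3) % N_DATA,
]

def attention_with_residual(cols, block):
    """Each task column gains one more copy of its own token; the data
    column (index 3) receives the task token selected by this block."""
    out = [list(c) for c in cols]
    for k in range(N_TASK):
        out[k] = [a + b for a, b in zip(cols[k], task_token(k))]
    out[3] = [a + b for a, b in zip(cols[3], task_token(block))]
    return out

def mlp_with_residual(cols):
    """Stand-in for the MLP plus residual: decode (data value, task index)
    from the data column and overwrite it with the function's output."""
    out = [list(c) for c in cols]
    mixed = cols[3]
    value = max(range(N_DATA), key=lambda j: mixed[j])
    task = max(range(N_TASK), key=lambda k: mixed[N_DATA + k])
    out[3] = one_hot(FUNCS[task](value))
    return out

def transformer(value):
    """Run three blocks on the columns [x_{F_1}, x_{F_2}, x_{F_3}, x_d]."""
    cols = [task_token(0), task_token(1), task_token(2), one_hot(value)]
    for block in range(N_TASK):
        cols = mlp_with_residual(attention_with_residual(cols, block))
    return cols

def decode(col):
    """Read the data value back out of a one-hot data column."""
    return max(range(N_DATA), key=lambda j: col[j])

for v in range(N_DATA):
    print(v, decode(transformer(v)[3]), FUNCS[2](FUNCS[1](FUNCS[0](v))))
```

In this sketch the task columns end up scaled by 4 after three blocks, matching the first three columns of the final matrix, and the decoded data column agrees with the direct composition of the three functions for every input value.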