The Computational Complexity of Formal Reasoning for Encoder-Only Transformers

Marco Sälzer, Eric Alsmann, Martin Lange
Theoretical Computer Science / Formal Methods
University of Kassel, Germany
{marco.saelzer,eric.alsmann,martin.lange}@uni-kassel.de
Abstract

We investigate challenges and possibilities of formal reasoning for encoder-only transformers (EOT), meaning sound and complete methods for verifying or interpreting behaviour. In detail, we condense related formal reasoning tasks in the form of a naturally occurring satisfiability problem (SAT). We find that SAT is undecidable for the classes of EOT commonly considered in the expressiveness community. Furthermore, we identify practical scenarios where SAT is decidable and establish corresponding complexity bounds. Besides trivial cases, we find that quantized EOT, namely those restricted by some fixed-width arithmetic, lead to the decidability of SAT due to their limited attention capabilities. However, the problem remains difficult: we identify scenarios in which SAT for quantized EOT is NEXPTIME-hard and scenarios in which we can show that it is solvable in NEXPTIME. To complement our theoretical results, we put our findings and their implications in the overall perspective of formal reasoning.

1 Introduction

Natural language processing (NLP) models, which process and compute human language, are gateways for modern applications aiming to interact with human users in a natural way. Although NLP is a traditional field of research, the use of deep learning techniques has undoubtedly revolutionised the field in recent years [22]. In this revolution, models such as Recurrent Neural Networks (RNN) or, more specifically, Long Short-term Memory Networks (LSTM) [30] have long been the driving force, but for a few years now NLP has had a new figurehead: transformers [28].

Transformers are a deep learning model using (multiple) self-attention mechanisms to process sequential input data, usually natural language. The efficient trainability of transformers, for example in contrast to LSTM, while achieving top-tier performance led to numerous heavy-impact implementations such as BERT [10], GPT-3 [6] or GPT-4 [21], sparking widespread use of the transformer architecture. However, the foreseeable omnipresence of transformer-based applications leads to serious security concerns.

In general, there are two approaches to establishing trustworthiness of learning-based models: first, certifying specific, application-dependent safety properties, called verification, and second, interpreting the behaviour of such models and giving explanations for it, called interpretation. In both approaches, the holy grail is to develop automatic methods that are sound and complete: an algorithm $A$ that, given some model $T$ and a (verification or interpretation) specification $\varphi$, outputs true if $T$ satisfies $\varphi$ and false otherwise (soundness), and that does so for all combinations of $T$ and $\varphi$ it is designed for (completeness). We refer to such sound and complete methods and tasks for verification and interpretation collectively using the term formal reasoning.

We lay out a framework for the possibilities and challenges of formal reasoning for transformers by establishing basic computability and complexity results in this work. Thereby, we focus on the so-called satisfiability (Sat) problem of sequence-classifying transformers: given a transformer $T$, decide whether there is some input word $w$ such that $T(w) = 1$. Although this may seem like an artificial problem at first glance, it is a natural abstraction of problems that commonly occur in almost all non-trivial formal reasoning tasks. Additionally, since it is detached from the specifics of particular reasoning specifications like safety properties for instance, uncomputability results and complexity-theoretic hardness results immediately transfer to more complex formal reasoning tasks. This also keeps the focus on the transformer architecture under consideration. Here we exclusively consider encoder-only transformers (EOT), mainly due to the fact that the known high expressive power of encoder-decoder transformers [23] makes formal reasoning trivially impossible.

Our work is structured as follows. We define necessary preliminaries in Section 2. In Section 3, we give an overview of our theoretical results and take a comprehensive look at their implications for formal reasoning for transformers. In Section 4 and Section 5 we present our theoretical results: we show that Sat is undecidable for classes of EOT commonly considered in research on transformer expressiveness; we show that a bounded version bSat of the satisfiability problem is decidable for any class of (computable) EOT and give corresponding complexity bounds; and we show that considering quantized EOT, meaning EOT whose parameters and internal computations are limited by some fixed-width arithmetic, leads to decidability of Sat, again giving corresponding complexity bounds. Finally, we discuss limitations, open problems and future research in Section 6.

Related work.

We establish basic computability and complexity results about transformer-related formal reasoning problems, like formal verification or interpretation. This places our work in the intersection between research on verification and interpretation of transformers and transformer expressiveness.

There is a limited amount of work concerned with methods for the verification of safety properties of transformers [15, 25, 4, 11]. However, none of these methods fall into the category of formal reasoning, as they are incomplete. This means that the rigorous computability and complexity bounds established in this work cannot be applied to them without further considerations. The same applies to the interpretability methods considered so far [31]. We remark that many of these approaches are not sound methods either. In contrast, there has been a surge in theoretical investigations of transformer expressiveness. Initial work dealt with encoder-decoder models and showed that such models are Turing-complete [23, 3]. Encoder-only models have so far been analysed in connection with circuit complexity [13, 14, 20, 19], logics [7, 18] and programming languages [29]. A recently published survey [26] provides an overview of these results. This line of work is adjacent to ours, as some of the classes of EOT considered here, mainly those in Section 4, are motivated by these results, and some of the constructions we use in the corresponding proofs are similar.

2 Fundamentals

Mathematical basics.

Let $\Sigma$ be a finite set of symbols, called alphabet. A (finite) word $w$ over $\Sigma$ is a finite sequence $a_1 \dotsb a_k$ where $a_i \in \Sigma$. We define $|w| = k$. As usual, we denote the set of all non-empty words by $\Sigma^{+}$. A language is a set of words. We also extend the notion of an alphabet to vectors $\boldsymbol{x}_i \in \mathbb{R}^{d}$, meaning that a sequence $\boldsymbol{x}_1 \dotsb \boldsymbol{x}_k$ is a word over some subset of $\mathbb{R}^{d}$. Usually, we denote vectors using bold symbols like $\boldsymbol{x}, \boldsymbol{y}$ or $\boldsymbol{z}$.

Encoder-only transformers (EOT).

We consider the encoder-only transformer (EOT) model introduced in [28]. We take a look at EOT from a computability and complexity perspective, which is why we follow more formal definitions as done in [13, 23, 14]. An EOT $T$ with $L$ layers and $h_i$ attention heads in layer $i$ is a tuple $(\mathit{emb}, \{\mathit{att}_{i,j} \mid 1 \leq i \leq L, 1 \leq j \leq h_i\}, \{\mathit{comb}_i \mid 1 \leq i \leq L\}, \mathit{out})$ where

  • $\mathit{emb}\colon \Sigma \times \mathbb{N} \rightarrow \mathbb{R}^{d_0}$ for some $d_0 \in \mathbb{N}$ is the positional embedding,

  • each attention head is a tuple $\mathit{att}_{i,j} = (\mathit{score}_{i,j}, \mathit{pool}_{i,j})$ where $\mathit{score}_{i,j}\colon \mathbb{R}^{d_{i-1}} \times \mathbb{R}^{d_{i-1}} \rightarrow \mathbb{R}$ is a function called scoring and $\mathit{pool}_{i,j}\colon (\mathbb{R}^{d_{i-1}})^{+} \times \mathbb{R}^{+} \rightarrow \mathbb{R}^{d_i}$ is a function called pooling, computing $(\boldsymbol{x}_1, \dotsc, \boldsymbol{x}_n, s_1, \dotsc, s_n) \mapsto \sum_{i'=1}^{n} \mathit{norm}(i', s_1, \dotsc, s_n)(W\boldsymbol{x}_{i'})$ where $W$ is a linear map represented by a matrix and $\mathit{norm}\colon \mathbb{N} \times \mathbb{R}^{+} \rightarrow \mathbb{R}$ is a normalisation,

  • each $\mathit{comb}_i\colon \mathbb{R}^{d_{i,1}} \times \dotsb \times \mathbb{R}^{d_{i,h_i+1}} \rightarrow \mathbb{R}^{d_i}$ is called a combination and $\mathit{out}\colon \mathbb{R}^{d_L} \rightarrow \mathbb{R}$ is called the output.

For given $i \leq L$ we call the tuple $(\mathit{att}_{i,1}, \dotsc, \mathit{att}_{i,h_i}, \mathit{comb}_i)$ the $i$-th layer of $T$. The EOT $T$ computes a function $\Sigma^{+} \rightarrow \mathbb{R}$ as follows. Let $w = a_1 \dotsb a_n \in \Sigma^{+}$ be a word. First, $T$ computes an embedding of $w$ by $\mathit{emb}(w) = \boldsymbol{x}_1^{0} \dotsb \boldsymbol{x}_n^{0}$ where $\boldsymbol{x}_i^{0} = \mathit{emb}(a_i, i)$. Next, each layer $1 \leq i \leq L$ computes a sequence $\boldsymbol{x}_1^{i} \dotsb \boldsymbol{x}_n^{i}$ as follows: for each input $\boldsymbol{x}_m^{i-1}$ and attention head $\mathit{att}_{i,j}$, layer $i$ computes $\boldsymbol{y}_{m,j}^{i} = \mathit{pool}_{i,j}(\boldsymbol{x}_1^{i-1}, \dotsc, \boldsymbol{x}_n^{i-1}, \mathit{score}_{i,j}(\boldsymbol{x}_m^{i-1}, \boldsymbol{x}_1^{i-1}), \dotsc, \mathit{score}_{i,j}(\boldsymbol{x}_m^{i-1}, \boldsymbol{x}_n^{i-1}))$. Then, $\boldsymbol{x}_m^{i}$ is given by $\mathit{comb}_i(\boldsymbol{x}_m^{i-1}, \boldsymbol{y}_{m,1}^{i}, \dotsc, \boldsymbol{y}_{m,h_i}^{i})$. In the end, the output $T(w)$ is computed by $\mathit{out}(\boldsymbol{x}_n^{L})$, that is, the value of the output function for the last symbol of $w$ after being transformed by the embedding and the $L$ layers of $T$. We say that $T$ accepts $w$ if $T(w) = 1$, and we say that $T$ rejects $w$ otherwise. We call $L$ the depth of $T$ and the maximal $h_i$ the (maximum) width of $T$. Furthermore, we call the maximal $d_i$ the (maximum) dimensionality of $T$. Let $\mathcal{T}$ be some class of EOT. The decision problem $\textsc{Sat}[\mathcal{T}]$ for a class $\mathcal{T}$ of EOT is: given $T \in \mathcal{T}$ over alphabet $\Sigma$, decide whether there is $w \in \Sigma^{+}$ such that $T(w) = 1$. We refer to this as the satisfiability problem for $\mathcal{T}$.
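To make the definition concrete, the following is a minimal Python sketch of how an EOT evaluates a word. It is an illustration under stated assumptions, not part of the formal development: the class names `Head`, `Layer` and `EOT`, the uniform normalisation and the toy instance at the end are all hypothetical, and positions are counted from 1 as in the definition of the embedding above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class Head:
    """One attention head: a scoring function plus a pooling given by norm and W."""
    score: Callable[[Vector, Vector], float]
    norm: Callable[[int, Sequence[float]], float]
    W: List[List[float]]

    def pool(self, xs: Sequence[Vector], scores: Sequence[float]) -> Vector:
        # Sum over all positions i' of norm(i', s_1..s_n) * (W x_{i'}).
        out = [0.0] * len(self.W)
        for pos, x in enumerate(xs):
            weight = self.norm(pos, scores)
            for r, row in enumerate(self.W):
                out[r] += weight * sum(a * b for a, b in zip(row, x))
        return out


@dataclass
class Layer:
    heads: List[Head]
    comb: Callable[[Vector, List[Vector]], Vector]


@dataclass
class EOT:
    emb: Callable[[str, int], Vector]
    layers: List[Layer]
    out: Callable[[Vector], float]

    def __call__(self, word: str) -> float:
        if not word:
            raise ValueError("EOT are only defined on non-empty words")
        # Embedding: x_i^0 = emb(a_i, i).
        xs = [self.emb(a, i) for i, a in enumerate(word, start=1)]
        for layer in self.layers:
            # For each position m, pool every head and combine with x_m.
            xs = [
                layer.comb(x_m, [h.pool(xs, [h.score(x_m, x) for x in xs])
                                 for h in layer.heads])
                for x_m in xs
            ]
        # The output is read off the last position.
        return self.out(xs[-1])


# Hypothetical toy instance: accept iff more than half of the symbols are 'a'.
uniform = lambda pos, scores: 1.0 / len(scores)
toy = EOT(
    emb=lambda a, i: [1.0 if a == "a" else 0.0],
    layers=[Layer(
        heads=[Head(score=lambda x, y: 0.0, norm=uniform, W=[[1.0]])],
        comb=lambda x, ys: ys[0],
    )],
    out=lambda x: 1.0 if x[0] > 0.5 else 0.0,
)

print(toy("aab"), toy("abb"))
```

In this toy instance, Sat asks whether some word is mapped to 1, which is witnessed by the word "a" alone; the interesting cases are those where no such witness is apparent from the parameters.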

Fixed-width arithmetics.

In this work, we consider commonly used fixed-width arithmetics (FA) that represent numbers using a fixed number of bits, such as floating-point or fixed-point arithmetic. See [1] (fixed-point) or [8] (floating-point) for rigorous mathematical definitions of such FA. Here, however, we only make use of a high-level view on different FA. Namely, given some FA $F$ we assume that all values are represented in binary using $b \in \mathbb{N}$ bits. Thus, there are $2^{b}$ different rational numbers representable in $F$. Furthermore, we assume that the considered FA can handle overflow situations using either saturation or wrap-around and rounding situations by rounding up or off. We consider EOT in the context of $F$. We say that $T$ works over $F$, assuming that all computations as well as values occurring in a computation $T(w)$ are carried out in the arithmetic defined by $F$.
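The high-level view of an FA can be illustrated with a small Python sketch. It assumes one particular FA, a signed fixed-point format with `total_bits` bits of which `frac_bits` are fractional, with round-half-up rounding; the function names and the default parameters are hypothetical and chosen only for illustration.

```python
import math


def quantize(value: float, total_bits: int = 8, frac_bits: int = 4,
             overflow: str = "saturate") -> float:
    """Map a real number to the assumed signed fixed-point format."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    raw = math.floor(value * scale + 0.5)  # round half up (an assumption)
    if overflow == "saturate":
        raw = max(lo, min(hi, raw))        # clamp to the representable range
    elif overflow == "wrap":
        raw = (raw - lo) % (1 << total_bits) + lo  # wrap around
    else:
        raise ValueError("overflow must be 'saturate' or 'wrap'")
    return raw / scale


def representable(total_bits: int = 8, frac_bits: int = 4) -> set:
    """All numbers of the format; there are exactly 2**total_bits of them."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    return {raw / scale for raw in range(lo, lo + (1 << total_bits))}


print(quantize(100.0), quantize(100.0, overflow="wrap"), len(representable()))
```

The two overflow modes differ visibly once a value leaves the representable range: saturation pins it to the largest representable number, whereas wrap-around maps it back into the range.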

3 Overview: capturing and classifying formal transformer reasoning

We address elementary problems arising in formal reasoning for transformers. In doing so, we pursue the goal of establishing basic computability and complexity results for corresponding problems in order to frame possibilities and challenges.

To achieve widespread implications of our results, we focus our considerations on a fundamental problem arising in formal verification and interpretation tasks: given a transformer $T$, decide whether there is some input $w$ leading to some specific output $T(w)$, as defined formally in terms of the satisfiability problem $\textsc{Sat}[\mathcal{T}]$ for a class $\mathcal{T}$ of specific transformers, see Section 2.

To see that this captures the essence of formal reasoning problems occurring in practice, consider the following formal verification task: given transformer $T$, verify that $T$ only accepts inputs which contain some specific key from a set $K$. Such a property is usually considered a robustness property [25, 16]. We can phrase this as a satisfiability problem by considering the property’s negation, namely to verify that there is some input $w$ such that no key from $K$ occurs and still we have $T(w) = 1$.
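Spelled out, this reformulation rests on the usual duality between universal and existential quantification. The following equivalence is a sketch in which the predicate "contains a key from $K$" is left informal, as in the text:

```latex
\[
\bigl(\forall w \in \Sigma^{+}\colon\; T(w) = 1 \;\Rightarrow\; w \text{ contains some key from } K\bigr)
\;\Longleftrightarrow\;
\neg\,\exists w \in \Sigma^{+}\colon\; \bigl(w \text{ contains no key from } K\bigr) \wedge T(w) = 1 .
\]
```

The right-hand side is the negation of a satisfiability question, so a decision procedure for satisfiability yields a sound and complete check of the robustness property.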

Likewise, consider a formal interpretation task in which we want to find the minimal subset $E' \subseteq E$ of some set of error symbols $E$ such that all $w$ that contain all errors $E'$ are rejected by $T$. This is usually understood as an abductive explanation [17]. Given a candidate subset $E'$, we can certify this by checking that there is some $w$ which contains all errors $E'$, but is accepted by $T$. This, again, is a special case of a satisfiability problem $\textsc{Sat}[\mathcal{T}]$ for some transformer class $\mathcal{T}$.

Furthermore, we want our results to be detached from any intricacies of certain transformer architectures. We therefore focus on encoder-only transformers (EOT), leaving any decoder mechanism unconsidered. The primary reason for this is that encoder-decoder architectures are of such high expressive power [23] that Sat is easily seen to be undecidable in almost all non-trivial cases. The secondary reason is that encoder-decoder architectures subsume encoder-only architectures, so any lower computability or complexity bound established in this work is also a lower bound for encoder-decoder transformers. Additionally, the presentation of EOT in Section 2 paves the way for a parametrized view of EOT architectures, which allows us to study different classes of EOT by fixing or bounding such parameters.

We start by considering the class $\mathcal{T}_{\mathit{udec}}$ of EOT, motivated by architectures commonly considered in the theoretical expressiveness community [23, 14, 13]: $\mathcal{T}_{\mathit{udec}}$ consists of those EOT that use a positional embedding expressive enough to compute a sum, hardmax $\mathit{hmax}$ as normalisation function, and a scalar-product-based scoring enriched with a nonlinear map represented by an FNN.

Theorem 1 (Section 4).

The satisfiability problem $\textsc{Sat}[\mathcal{T}_{\mathit{udec}}]$ is undecidable.

Figure 1: Schematic overview of the computability and complexity results, established in this work. The classes of EOT are described in the pretext of the respective theorem. Note that $\mathcal{T}$ refers to an arbitrary class of (computable) EOT. The small subset in the classes NP and NEXPTIME refers to the complete problems. The NEXPTIME-hardness result of $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}]$ is visualized by putting it exactly on the upper border between NEXPTIME and all decidable problems.

Essentially, this result implies that even for EOT the combination of $\mathit{hmax}$ normalisations and expressive scoring is enough to make satisfiability undecidable. Generally, this makes formal reasoning, like verifying robustness properties or giving formal explanations, impossible for classes of EOT that subsume $\mathcal{T}_{\mathit{udec}}$. Specifically, no such methods exist that are fully automatic, sound and complete. However, Theorem 1 does not preclude the existence of incomplete methods, for instance.

Recently, so-called log-precision transformers have been studied [18]. These transformers are defined as usual, but given a word length $n$ it is assumed that a log-precision transformer $T$ uses at most $\mathcal{O}(\log(n))$ bits in its internal computations. To complement these theoretical considerations, we consider the class $\mathcal{T}^{\textsc{log}}_{\mathit{udec}}$ of EOT from $\mathcal{T}_{\mathit{udec}}$ that work with log-precision. Unfortunately, this restriction is not enough to circumvent general undecidability.

Theorem 2 (Section 4).

The satisfiability problem $\textsc{Sat}[\mathcal{T}^{\textsc{log}}_{\mathit{udec}}]$ is undecidable.

Given such impossibility results, we turn our attention to the search for decidable cases. We make the reasonable assumption that all considered EOT are computable, meaning that their components like scoring, normalisation, pooling, combination and output functions are computable functions.

First, we consider a natural restriction of the satisfiability problem by bounding the length of valid inputs. Then satisfiability becomes decidable, regardless of the respective class of EOT, but it is difficult from a complexity-theoretic perspective. To formalize this, we introduce the bounded satisfiability problem $\textsc{bSat}[\mathcal{T}]$ for a class $\mathcal{T}$: given an EOT $T \in \mathcal{T}$ and a bound $n \in \mathbb{N}$ on its input length, decide whether there is a word $w$ with $|w| \leq n$ such that $T(w) = 1$.

Theorem 3 (Section 5, informal).

The bounded satisfiability problem $\textsc{bSat}[\mathcal{T}]$ is decidable for all classes $\mathcal{T}$ of (computable) EOT. Depending on whether $n$ is given in binary or unary coding, $\textsc{bSat}[\mathcal{T}]$ is NEXPTIME-hard or NP-hard, respectively, whenever $\mathcal{T} \supseteq \mathcal{T}_{\mathit{udec}}$.

Informally, this result implies that bounding the word length is a method to enable formal reasoning. However, it does not change the fact that satisfiability is an essentially hard problem. As hardness is a lower bound, this also translates to subsuming formal reasoning tasks.
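Decidability of the bounded problem can be pictured as exhaustive search over all words up to the given length. The sketch below is a naive illustration and not the procedure used in our proofs; `bsat`, `num_candidates` and the stand-in classifier `toy` are hypothetical names, and the EOT is abstracted as any computable function from words to numbers.

```python
from itertools import product
from typing import Callable, Iterable, Optional


def bsat(T: Callable[[str], float], alphabet: Iterable[str],
         n: int) -> Optional[str]:
    """Return some word w with |w| <= n and T(w) == 1, or None if none exists."""
    symbols = list(alphabet)
    for length in range(1, n + 1):
        for letters in product(symbols, repeat=length):
            w = "".join(letters)
            if T(w) == 1:
                return w
    return None


def num_candidates(alphabet_size: int, n: int) -> int:
    """Number of non-empty words of length at most n."""
    return sum(alphabet_size ** length for length in range(1, n + 1))


# Stand-in for an EOT: accepts words of length >= 3 that end in "ba".
toy = lambda w: 1.0 if len(w) >= 3 and w.endswith("ba") else 0.0

print(bsat(toy, "ab", 2), bsat(toy, "ab", 3), num_candidates(2, 3))
```

The number of candidate words grows exponentially in $n$, and hence doubly exponentially in the number of bits needed to write $n$ in binary, which gives some intuition for why the coding of $n$ matters in Theorem 3.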

Imposing a bound on the input length may not be a viable restriction for various formal reasoning tasks. We therefore study other ways of obtaining decidability. We address the unbounded satisfiability problem for practically motivated classes of EOT. We consider the class $\mathcal{T}^{\textsc{fix}}_{\circ}$ of EOT that use a positional embedding with some periodicity in their positional encoding, commonly seen in practice [28, 12], use softmax or hardmax as normalisation and which work over some fixed-width arithmetic (FA). This last restriction is motivated by recent popular ways to handle ever increasing EOT sizes, for example via quantization or using low-bit arithmetics [5]. From a complexity-theoretic perspective, the use of fixed-width arithmetic has a similar effect to bounding the input length.

Theorem 4 (Section 5).

The satisfiability problem $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}_{\circ}]$ is in NEXPTIME.

Hence, automatic, sound and complete formal reasoning for periodical EOT in a fixed-width arithmetic environment is possible in general, albeit with potentially high complexity. Note that formal reasoning tasks with more complex safety or interpretability specifications than simple satisfiability may lead to even higher complexities.

We then aim to show that this is optimal by providing a matching lower bound. However, we need to relax these restrictions again, namely considering the class $\mathcal{T}^{\textsc{fix}}$ allowing for EOT that use non-periodical embeddings and work over some fixed-width arithmetic that can use saturation to handle overflow situations. We show that this high complexity is unavoidable, making sound and complete automatic formal reasoning for fixed-width arithmetic transformers with general positional embeddings practically intractable.

Theorem 5 (Section 5).

The satisfiability problem $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}]$ is NEXPTIME-hard.

A schematic depiction of the computability and complexity results described in this section is given in Figure 1. Note that this figure is a purely technical presentation of our results, which means that it does not convey the implications for formal reasoning described above.

4 Transformer satisfiability is generally undecidable

We consider a class of EOT $\mathcal{T}_{\mathit{udec}}$, which is as weak as possible regarding the expressiveness of the included EOT. We define $\mathcal{T}_{\mathit{udec}}$ by giving minimum requirements: positional embeddings can be of the form $\mathit{emb}(a_k, 0) = (1, 1, 0, 0, k)$ and $\mathit{emb}(a_k, i) = (0, 1, i, \sum_{j=0}^{i} j, k)$ where we assume some order on the alphabet symbols $a_1, a_2, \dotsc$.
For scoring functions we allow for $N(\langle Q\boldsymbol{x}, K\boldsymbol{y}\rangle)$ where $N$ is a classical Feedforward Neural Network (FNN) with $\mathit{relu}$ activations, $Q$ and $K$ are linear maps and $\langle \dotsb \rangle$ denotes the usual scalar product. For normalisations we allow for hardmax: $\mathit{hmax}(i, x_1, \dotsc, x_n) = \frac{1}{m}$ if $x_i \geq x_j$ for all $j \leq n$ and there are $m$ distinct $x_j$ such that $x_i = x_j$; otherwise $\mathit{hmax}(i, x_1, \dotsc, x_n) = 0$. Combinations as well as output functions can be classical FNN with $\mathit{relu}$ activation. Aside from technical reasons, we motivate the choice of $\mathcal{T}_{\mathit{udec}}$ in Section 3.
To ease our notation, we exploit the fact that using $\mathit{hmax}$ as normalisation implies a clearly defined subset of positions $M$ that are effective in the computation of some attention head $\mathit{att}$ given some position $i$, namely those that are weighted non-zero. In this case, we say that $\mathit{att}$ attends to $M$ given position $i$.
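
The embedding and the hardmax normalisation of $\mathcal{T}_{\mathit{udec}}$ can be made concrete in a few lines. The following Python sketch is an illustration only: the function names are ours, a symbol $a_k$ is passed as its index $k$, positions are 0-based as in the definition of $\mathit{emb}$, the function `hmax` returns the whole weight vector rather than a single weight, and the phrase "there are $m$ distinct $x_j$" is read as "the maximum is attained at $m$ positions", which is an assumption about the intended meaning.

```python
from fractions import Fraction


def emb(k, i):
    """Embedding of symbol a_k at (0-based) position i, following the two cases above."""
    if i == 0:
        return (1, 1, 0, 0, k)
    # sum_{j=0}^{i} j is the i-th triangular number i(i+1)/2
    return (0, 1, i, i * (i + 1) // 2, k)


def hmax(scores):
    """Hardmax weights: 1/m on each of the m maximal positions, 0 elsewhere."""
    top = max(scores)
    m = sum(1 for s in scores if s == top)
    return [Fraction(1, m) if s == top else Fraction(0) for s in scores]


def attended_positions(scores):
    """The set M of (0-based) positions with non-zero hardmax weight."""
    return {j for j, weight in enumerate(hmax(scores)) if weight != 0}
```

With scores `[1, 3, 3]`, for instance, the weight is split evenly between the last two positions, so the head attends to exactly those two.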

We prove that $\textsc{Sat}[\mathcal{T}_{\mathit{udec}}]$ is undecidable by establishing a reduction from the (unbounded) octant tiling-word problem ($\textsc{OTWP}^{*}$). For details on tiling problems, see Appendix A. The $\textsc{OTWP}^{*}$ is defined as follows: given a tiling system $\mathcal{S} = (S, H, V, t_I, t_F)$ where $S$ is some finite set of tiles, $H, V \subseteq S^2$ and $t_I, t_F \in S$, we have to decide whether there is a word (a) $t_{0,0}, t_{1,0}, t_{1,1}, t_{2,0}, t_{2,1}, t_{2,2}, t_{3,0}, \ldots, t_{k,k} \in S^{+}$ such that (b) $t_{0,0} = t_I$, $t_{k,k} = t_F$, (c) for all $i \leq k$ and $0 \leq j < i$ holds $(t_{i,j}, t_{i,j+1}) \in H$ and (d) for all $i \leq k-1$ and $j \leq i$ holds $(t_{i,j}, t_{i+1,j}) \in V$. We call a word $w$ which satisfies (a) an encoded tiling and if (b)-(d) are satisfied as well then we call $w$ a valid encoded tiling. Our proof strategy is easily described: given a tiling system $\mathcal{S}$, we build an EOT $T_{\mathcal{S}} \in \mathcal{T}_{\mathit{udec}}$ which accepts a word $w$ if it fulfils conditions (a) to (d) and otherwise rejects $w$. We defer most technical proofs of the following lemmas and theorems to Appendix B and instead provide intuitions and proof sketches in this section.
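
To make conditions (a) to (d) tangible, the following Python sketch checks them directly on a word. It is a plain reference checker, not the transformer construction: the function names are ours, a word is any sequence of tiles (a string works when tiles are single characters), and $H$ and $V$ are assumed to be given as sets of pairs.

```python
def rows_of(word):
    """Split a word into rows of lengths 1, 2, 3, ...; None if that is impossible."""
    rows, start, size = [], 0, 1
    while start < len(word):
        row = word[start:start + size]
        if len(row) != size:      # leftover symbols do not fill a complete row
            return None
        rows.append(row)
        start += size
        size += 1
    return rows or None           # the empty word is not in S^+


def is_valid_encoded_tiling(word, H, V, t_I, t_F):
    rows = rows_of(word)
    if rows is None:                                    # (a) word encodes an octant
        return False
    if rows[0][0] != t_I or rows[-1][-1] != t_F:        # (b) initial and final tile
        return False
    for row in rows:                                    # (c) horizontal neighbours
        for j in range(len(row) - 1):
            if (row[j], row[j + 1]) not in H:
                return False
    for upper, lower in zip(rows, rows[1:]):            # (d) vertical neighbours
        for j in range(len(upper)):
            if (upper[j], lower[j]) not in V:
                return False
    return True
```

The EOT $T_{\mathcal{S}}$ is meant to accept exactly the words for which such a checker returns `True`; the difficulty addressed in this section is that it has to do so with attention and feed-forward layers only.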

We start with the first observation: the expressiveness of EOT in $\mathcal{T}_{\mathit{udec}}$ is sufficient to decode the octant tiling potentially represented by a given word $w$. In detail, two encoder layers in combination with a positional embedding definable in $\mathcal{T}_{\mathit{udec}}$ are expressive enough to compute, for a given symbol $t$ in $w$, to which position in an octant tiling it corresponds if we interpret $w$ as an encoded tiling.

Lemma 1.

Let $\mathcal{S}$ be a tiling system with tiles $S = \{a_1, \dotsc, a_k\}$. There is an embedding function $\mathit{emb}$ and there are encoder layers $l_1$ and $l_2$ definable in $\mathcal{T}_{\mathit{udec}}$ such that for each word $w = t_{0,0} t_{1,0} t_{1,1} t_{2,0} \dotsb t_{m,n} \in S^{+}$ holds that $l_2(l_1(\mathit{emb}(w))) = \boldsymbol{x}^2_1 \dotsc \boldsymbol{x}^2_{|w|}$ where $\boldsymbol{x}^2_i = (1, i, r(i), c(i), k_i)$ such that $a_{k_i}$ is equal to the symbol at position $i$ in $w$ and $(r(1), c(1)), (r(2), c(2)), \dotsc, (r(|w|), c(|w|))$ is equal to $(0,0), (1,0), \dotsc, (m,n)$.
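
The functions $r$ and $c$ of Lemma 1 have a simple closed form, which the following Python sketch spells out. It shows only the target function that the layers $l_1$ and $l_2$ have to compute, not how attention computes it; the function names are ours, and positions are 1-based as in the lemma.

```python
from math import isqrt


def decode(i):
    """Row and column (both 0-based) of the 1-based position i in an encoded tiling."""
    idx = i - 1                          # 0-based index into the word
    r = (isqrt(8 * idx + 1) - 1) // 2    # largest r with r(r+1)/2 <= idx
    c = idx - r * (r + 1) // 2           # offset inside row r
    return r, c


def is_encoded_tiling_length(n):
    """A word of length n >= 1 ends on the diagonal iff its length is triangular."""
    r, c = decode(n)
    return r == c
```

The row boundaries are the triangular numbers $0, 1, 3, 6, \dotsc$, which is the same quantity $\sum_{j=0}^{i} j$ that appears as the fourth component of the positional embedding.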

Assume that $w \in S^{+}$. Lemma 1 implies that an EOT $T \in \mathcal{T}_{\mathit{udec}}$ is generally able to recognize whether $w$ is an encoded tiling as soon as $T$ is able to check whether $r(|w|)$ and $c(|w|)$ of the last symbol of $w$ processed by $l_2(l_1(\mathit{emb}(\dotsb)))$ are equal. Therefore, property (a) and also (b) can be checked by EOT in $\mathcal{T}_{\mathit{udec}}$ using the residual connection in the combination functions together with the expressive power of FNN. Similarly, property (c) can be ensured if it is possible to build an attention head that is able to attend to position $k+1$ given position $k$. Let $w = t_{0,0} t_{1,0} t_{1,1} t_{2,0} \dotsb t_{m,m}$ with $t_{i,j} \in S$. To verify whether property (d) holds, an EOT must be able to attend to position $k + (i+1)$ given position $k$ corresponding to symbol $t_{i,j}$.
In summary, to check properties (a) to (d) it is left to argue that there are attention heads in $\mathcal{T}_{\mathit{udec}}$ that can attend to positions depending linearly on the values of the currently considered position.

Lemma 2.

Let $f(x_1, \dotsc, x_k) = a_1 x_1 + \dotsb + a_k x_k + b$ with $a_i, b \in \mathbb{R}$ be some linear function. There is an attention head $\mathit{att}_f$ in $\mathcal{T}_{\mathit{udec}}$ such that for all sequences $\boldsymbol{x}_1, \dotsc, \boldsymbol{x}_m$ where all $\boldsymbol{x}_i = (1, i, \boldsymbol{y}_i)$ for some $\boldsymbol{y}_i \in \mathbb{R}^{k-2}$, attention head $\mathit{att}_f$ attends to $\{\boldsymbol{x}_j, \boldsymbol{x}_{j+1}\}$ given position $i$ if $f(\boldsymbol{x}_i) = j + \frac{1}{2}$ with $j \leq m-1$ and otherwise to $\{\boldsymbol{x}_j\}$ where $j$ is the value nearest to $f(\boldsymbol{x}_i)$.
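
One plausible way to realise such a head, sketched below in Python, is to score position $j$ from position $i$ by $-|f(\boldsymbol{x}_i) - j|$ and apply hardmax. This is our own illustration and not necessarily the construction of Appendix B: we assume $Q\boldsymbol{x}_i = (f(\boldsymbol{x}_i), -1)$ and $K\boldsymbol{x}_j = (1, j)$, both of which can be read off linearly from vectors of the form $(1, i, \boldsymbol{y}_i)$, and a network $N(z) = -(\mathit{relu}(z) + \mathit{relu}(-z))$. Exact rationals are used so that ties at $j + \frac{1}{2}$ are detected reliably.

```python
from fractions import Fraction


def relu(z):
    return max(z, 0)


def attended(xs, i, coeffs, b):
    """1-based positions that the sketched head attends to from 1-based position i.

    xs     -- list of vectors (1, position, y...), one per position
    coeffs -- coefficients a_1, ..., a_k of the linear function f
    b      -- constant term of f
    """
    f = sum(a * x for a, x in zip(coeffs, xs[i - 1])) + b
    # <Q x_i, K x_j> = f(x_i) - j, and N(z) = -(relu(z) + relu(-z)) = -|z|
    scores = [-(relu(f - xj[1]) + relu(xj[1] - f)) for xj in xs]
    top = max(scores)
    return {j + 1 for j, s in enumerate(scores) if s == top}
```

Choosing $f(\boldsymbol{x}) = x_2 + 1$ gives a successor head: every position attends to the next one, and the last position attends to itself because it is the nearest existing position.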

In combination, the previous lemmas indicate that EOT from $\mathcal{T}_{\mathit{udec}}$ are able to verify whether a given word is a valid encoded tiling. This expressive power is enough to lead to an undecidable satisfiability problem for EOT from $\mathcal{T}_{\mathit{udec}}$.

Theorem 1.

The decision problem $\textsc{Sat}[\mathcal{T}_{\mathit{udec}}]$ is undecidable.

Proof Sketch.

We establish a reduction from $\textsc{OTWP}^{*}$ to $\textsc{Sat}[\mathcal{T}_{\mathit{udec}}]$ by constructing for each instance $\mathcal{S} = (S, H, V, t_I, t_F)$ of $\textsc{OTWP}^{*}$ an EOT $T_{\mathcal{S}}$ accepting exactly those $w$ corresponding to a valid encoded tiling for $\mathcal{S}$.

$T_{\mathcal{S}}$ uses the positional embedding described in the beginning of Section 4 and has four layers. Layers $l_1$ and $l_2$ are given by Lemma 1 and are used to decode the row and column indexes corresponding to a potential octant tiling for each symbol in a given word $w$. Layer $l_3$ uses the information encoded by the embedding and the decoded row and column indexes to check whether properties (a) to (d) described above hold for $w$. The necessary information is aggregated using three attention heads $\mathit{att}_{\mathit{prev}}$, $\mathit{att}_{\mathit{next}}$ and $\mathit{att}_{\mathit{step}}$, each built according to Lemma 2. Thereby, $\mathit{att}_{\mathit{prev}}$ attends each position to its predecessor, but the first position attends to itself. This allows us to clearly identify the vector corresponding to the first position in $w$ and check whether this is equal to tile $t_I$. Attention head $\mathit{att}_{\mathit{next}}$ attends each position to its successor, but the last position attends to itself.
This allows us to clearly identify the vector corresponding to the last position in $w$, in order to check whether this is equal to $t_F$, and to check the conditions given by $H$. Attention head $\mathit{att}_{\mathit{step}}$ attends each position to the position with the same column index but the successive row index. If there is no such successive row, it attends to the last position. This allows us to check whether the conditions given by $V$ hold. Each of these conditions is checked in the combination function of $l_3$, using specifically built feed-forward neural networks outputting $0$ to some predefined vector dimension if and only if the condition is met. Finally, layer $l_4$ aggregates the information of all positions in the vector corresponding to the last position using attention head $\mathit{att}_{\text{leq}}$, again given by Lemma 2.

The correctness of this reduction follows from the detailed construction of $T_{\mathcal{S}}$, which is technically extensive and given in Appendix B. ∎

Next, we consider the class $\mathcal{T}^{\textsc{log}}_{\mathit{udec}}$ which is defined exactly like $\mathcal{T}_{\mathit{udec}}$, except that for all $T \in \mathcal{T}^{\textsc{log}}_{\mathit{udec}}$ working over alphabet $\Sigma$ and all words $w$ with $|w| = n$ we assume that $T(w)$ is carried out in some fixed-width arithmetic $F$ using $\mathcal{O}(\log(\max(|\Sigma|, n)))$ bits.

Theorem 2.

The decision problem $\textsc{Sat}[\mathcal{T}^{\textsc{log}}_{\mathit{udec}}]$ is undecidable.

Proof sketch.

This proof follows the exact same line as the proof of Theorem 1. Additionally, we need to argue that $T_{\mathcal{S}}$ works as intended, despite the fact that it is limited by some log-precision $F$.

Looking at the proof of Theorem 1, it is evident that the magnitude and precision of all values used and produced in the computation $T_{\mathcal{S}}(w)$ depend polynomially on $n$ and, thus, we can choose the representation size of $F$ to be linear in $\log(n)$, which avoids any overflow or rounding situations and ensures that $T_{\mathcal{S}}$ works as intended. A formal proof is given in Appendix B. ∎
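
As a small sanity check on the magnitude claim, and not a substitute for the proof in Appendix B, one can look at the positional embedding alone. For a word of length $n$ with 0-based positions, the largest value it produces is the triangular number $(n-1)n/2$, which is quadratic in $n$ and therefore needs only about $2\log n$ bits. The Python sketch below counts these bits; the function name is ours, and intermediate values inside the attention layers are deliberately not modelled.

```python
def bits_for_embedding(n):
    """Bits needed for the largest positional-embedding entry of a length-n word."""
    largest = (n - 1) * n // 2        # sum_{j=0}^{n-1} j, the entry at the last position
    return max(largest, 1).bit_length()
```

For every $n \geq 1$ the result stays below twice the bit length of $n$ itself, in line with a representation size that is linear in $\log(n)$.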

5 How to make transformer satisfiability decidable

In this section we investigate classes of EOT leading to decidable Sat problems or decidable restrictions of it. Additionally, we establish corresponding complexity bounds.

In order to establish clearly delineated upper complexity bounds, we need to bound the representation size of an EOT $T$. Instead of tediously analyzing the space needed to represent embedding, scoring, pooling, combination and normalisation functions, we note that it suffices to estimate the size up to polynomials only. The complexity of an EOT $T$ with $L$ layers and $h_i$ attention heads in layer $i$, working on inputs over alphabet $\Sigma$, is $|T| := |\Sigma| + L + H + D$ where $H := \max\{h_i \mid 1 \leq i \leq L\}$ and $D$ is the maximal dimensionality of vectors occurring in a computation of $T$. Note that one can reasonably assume the size of a syntactic representation of $T$ to be polynomial in $|T|$, and that EOT have the polynomial evaluation property: given a word $w \in \Sigma^{+}$, $T(w)$ can be computed in time that is polynomial in $|T| + |w|$. Section 3 discusses why this assumption is reasonable.

Satisfiability restricted to words of bounded length is decidable, but difficult

We start with a natural restriction: bounding the word length. Let $\mathcal{T}$ be a class of EOT. The bounded satisfiability problem, denoted by $\textsc{bSat}[\mathcal{T}]$, is: given $T \in \mathcal{T}$ and some $n \in \mathbb{N}$, decide whether there is a word $w$ with $|w| \leq n$ such that $T(w) = 1$. It is not hard to see that $\textsc{bSat}[\mathcal{T}]$ is decidable. However, its complexity depends on the value of $n$, and we therefore distinguish whether $n$ is represented in binary or unary encoding. We denote the corresponding problems as $\textsc{bSat}_{\mathsf{bin}}[\mathcal{T}]$ and $\textsc{bSat}_{\mathsf{un}}[\mathcal{T}]$.

Theorem 3.

Let $\mathcal{T}$ be a class of EOT. Then

  1. $\textsc{bSat}_{\mathsf{un}}[\mathcal{T}]$ is decidable in NP and if $\mathcal{T}_{\mathit{udec}} \subseteq \mathcal{T}$ then $\textsc{bSat}_{\mathsf{un}}[\mathcal{T}]$ is NP-complete,

  2. $\textsc{bSat}_{\mathsf{bin}}[\mathcal{T}]$ is decidable in NEXPTIME and if $\mathcal{T}_{\mathit{udec}} \subseteq \mathcal{T}$ then $\textsc{bSat}_{\mathsf{bin}}[\mathcal{T}]$ is NEXPTIME-complete.

Proof Sketch.

The decidability result of statement (1) can be shown using a simple guess-and-check argument: given $n \in \mathbb{N}$, guess a word $w \in \Sigma^{+}$ with $|w| \leq n$, compute $T(w)$ and check that the result is $1$. This is possible in time polynomial in $|T| + n$ using the polynomial evaluation property. Moreover, the value of $|T| + n$ is polynomial in the size of the input when $n$ is represented in unary encoding.

The decidability result of statement (2) is shown along the same lines. However, if the value $n$ is encoded in binary then this part of the input is of size $\log n$, and $|T| + n$ becomes exponential in this. Hence, the guess-and-check procedure only proves that $\textsc{bSat}_{\mathsf{bin}}[\mathcal{T}] \in$ NEXPTIME.

For the completeness result in (1) it suffices to argue that the problem is NP-hard. We make use of the fact that EOT in $\mathcal{T}_{\mathit{udec}}$ are expressive enough to accept a given word $w$ if and only if it is a valid encoded tiling, cf. Section 4 for details. It is possible to establish NP-hardness of a corresponding restriction of the octant word-tiling problem, namely the bounded octant word-tiling problem (for input values encoded in unary). See Appendix A for details on tiling problems. It then only remains to observe that the construction in Theorem 1 is in fact a polynomial-time reduction, and that it reduces the bounded octant word-tiling problem to the bounded satisfiability problem. The argument for NEXPTIME-hardness in statement (2) is done along the same lines with, again, the bounded octant word-tiling problem shown to be NEXPTIME-hard when the input parameter $n$ is given in binary encoding. A formal proof for Theorem 3 is given in Appendix C. ∎
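
The guess-and-check argument can be simulated deterministically by trying every candidate word, as in the Python sketch below. This is an illustration under assumptions: `T` is a hypothetical stand-in for an EOT, modelled as any callable that maps a tuple of symbols to 0 or 1, and the function name is ours. The nondeterministic guess becomes an enumeration of up to $|\Sigma| + |\Sigma|^2 + \dotsb + |\Sigma|^n$ words, which is where the exponential cost of a deterministic procedure comes from.

```python
from itertools import product


def bounded_sat(T, alphabet, n):
    """Return some word w with 1 <= |w| <= n and T(w) == 1, or None if there is none."""
    for length in range(1, n + 1):
        for w in product(alphabet, repeat=length):
            if T(w) == 1:          # the deterministic "check" step
                return w
    return None
```

Each call to `T` corresponds to one use of the polynomial evaluation property; only the number of candidates grows exponentially with $n$.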

Satisfiability for fixed-width arithmetic EOT is decidable, but also difficult

We turn our attention to classes of EOT that naturally arise in practical contexts. We consider EOT that work over some fixed-width arithmetic, like fixed- or floating-point numbers, and which have an embedding relying on a periodical encoding of positions.

We start with establishing a scenario where Sat is decidable in NEXPTIME. Regardless of the underlying EOT class $\mathcal{T}$, our proof strategy always relies on a certifier-based understanding of NEXPTIME: given $T \in \mathcal{T}$, we nondeterministically guess a word $w$, followed by a deterministic certification whether $T(w) = 1$ holds. For this to show $\textsc{Sat}[\mathcal{T}] \in \text{NEXPTIME}$, we need to argue that the overall running time of such a procedure is at most exponential, in particular that whenever there is a word $w$ with $T(w) = 1$ then there is also some $w'$ with $T(w') = 1$ and $|w'| \leq 2^{\mathit{poly}(|T|)}$. Again, we rely on the polynomial evaluation property of EOT in $\mathcal{T}$, i.e. the fact that $T(w')$ can be computed in time polynomial in $|T| + |w'|$.

We consider the class of EOT $\mathcal{T}^{\textsc{fix}}_{\circ}$, defined by restricting the positional embedding of an EOT $T$ to be additive-periodical, which means that $\mathit{emb}(a, i) = \mathit{emb}'(a) + \mathit{pos}(i)$ where $\mathit{pos}$ is periodical, i.e. there is $p \geq 1$ such that $\mathit{pos}(i) = \mathit{pos}(i + p)$ for all $i \in \mathbb{N}$. Additionally, all normalisation functions are realised by either the softmax function $\mathit{smax}$ or the hardmax function $\mathit{hmax}$. Moreover, we assume that all computations occurring in $T$ are carried out in some fixed-width arithmetic, encoding values in binary using a fixed number $b \in \mathbb{N}$ of bits. Aside from technical reasons, we motivate the choice of $\mathcal{T}^{\textsc{fix}}_{\circ}$ in Section 3. Given these restrictions, we adjust the definition of the complexity of $T \in \mathcal{T}^{\textsc{fix}}_{\circ}$ as a measure of the size (up to polynomials) to $|T| := |\Sigma| + L + H + D + p + b$.

Lemma 3.

There is a polynomial function $\mathit{poly}\colon \mathbb{N} \to \mathbb{N}$ such that for all $T \in \mathcal{T}^{\textsc{fix}}_{\circ}$ and all words $w$ with $T(w) = 1$ there is a word $w'$ with $T(w') = 1$ and $|w'| \leq 2^{\mathit{poly}(|T|)}$.

Proof Sketch.

The polynomial $\mathit{poly}$ can be chosen uniformly for all $T \in \mathcal{T}^{\textsc{fix}}_{\circ}$ because for all positional embeddings of EOT in $\mathcal{T}^{\textsc{fix}}_{\circ}$ there is an upper bound on the period and on the bit-width in the underlying arithmetic. The small-word property stated by the lemma is then shown by arguing, given polynomial $\mathit{poly}$, EOT $T$ and $|w| > 2^{\mathit{poly}(|T|)}$, that $w$ contains unnecessary subwords $u$ that can be cut out without changing the output of $T$. Here, we exploit the fact that $T$ has some periodicity $p$ and only consider those $u$ whose length is a multiple of $p$. This ensures that the resulting word $w'$, given by $w$ without $u$, is embedded the same way as $w$ by the positional embedding of $T$. The existence of such subwords follows from $T$'s limited distinguishing capabilities, especially in its normalisations, due to the bounded representation size of numerical values possible in the underlying fixed-width arithmetic.

A formal proof relies on basic combinatorial arguments, but is technically extensive, and given in Appendix C. ∎
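The role of the period in the cutting argument can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the embedding below is a hypothetical stand-in that records a symbol together with its position modulo $p$, not the additive-periodical embedding of the paper, and all function names are illustrative.

```python
from typing import Callable, List, Tuple

Embedded = Tuple[str, int]


def make_periodic_embedding(p: int) -> Callable[[str, int], Embedded]:
    """Toy embedding (assumption): a symbol at position i maps to (symbol, i mod p)."""
    def emb(symbol: str, position: int) -> Embedded:
        return (symbol, position % p)
    return emb


def embed_word(word: str, emb: Callable[[str, int], Embedded]) -> List[Embedded]:
    """Embed every symbol of the word together with its 0-based position."""
    return [emb(symbol, i) for i, symbol in enumerate(word)]


def cut_preserves_embedding(word: str, start: int, length: int, p: int) -> bool:
    """Check whether removing word[start:start+length] leaves the embedding
    of every remaining symbol unchanged."""
    emb = make_periodic_embedding(p)
    before = embed_word(word, emb)
    # Embeddings of the symbols that survive the cut, as computed in the original word.
    survivors = before[:start] + before[start + length:]
    shortened = word[:start] + word[start + length:]
    return embed_word(shortened, emb) == survivors


# Cutting a subword whose length is a multiple of the period keeps all embeddings.
print(cut_preserves_embedding("abcabcabba", start=2, length=6, p=3))
# Cutting a subword of any other length shifts the remaining suffix out of phase.
print(cut_preserves_embedding("abcabcabba", start=2, length=4, p=3))
```

In this toy setting, only cuts whose length is a multiple of $p$ leave the embedded suffix untouched, which is why the proof restricts attention to such subwords $u$.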

Based on this preliminary result, we can then immediately derive an upper bound on the complexity of $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}_{\circ}]$.

Theorem 4.

The satisfiability problem $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}_{\circ}]$ for EOT over fixed-width arithmetic using additive-periodical embeddings is in NEXPTIME.

Proof.

Let $T\in\mathcal{T}^{\textsc{fix}}_{\circ}$ be an EOT working over alphabet $\Sigma$. We use a certifier-based understanding of a nondeterministic exponential-time algorithm as follows: we (a) guess an input $w\in\Sigma^{+}$ and (b) compute $T(w)$ to check whether $T(w)=1$. For correctness, we need to argue that the length of $w$ is at most exponential in $|T|$. This argument is given by Lemma 3. Note that, by assumption, $T(w)$ can be computed in time polynomial in $|T|$ and $|w|$, and hence the check $T(w)=1$ can be carried out within the same bound. ∎
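The guess-and-check procedure can be pictured through a deterministic stand-in that enumerates all candidate words up to a length bound. The following Python sketch rests on assumptions: `T` is any callable returning 0 or 1 (a hypothetical placeholder for evaluating an EOT), and `max_len` plays the role of the bound $2^{\mathit{poly}(|T|)}$ from Lemma 3. It illustrates the search space only and is not the nondeterministic algorithm itself.

```python
from itertools import product
from typing import Callable, Optional, Sequence


def find_witness(T: Callable[[str], int],
                 alphabet: Sequence[str],
                 max_len: int) -> Optional[str]:
    """Return some non-empty word w with |w| <= max_len and T(w) == 1, or None.

    A nondeterministic machine would guess w directly; this deterministic
    version tries every word, shortest first.
    """
    for length in range(1, max_len + 1):
        for letters in product(alphabet, repeat=length):
            w = "".join(letters)
            if T(w) == 1:
                return w
    return None


def toy_classifier(w: str) -> int:
    """Hypothetical classifier (not an EOT): accept even-length words ending in 'ab'."""
    return 1 if len(w) % 2 == 0 and w.endswith("ab") else 0


print(find_witness(toy_classifier, "ab", max_len=4))
```

The enumeration visits up to $|\Sigma|^{\mathit{max\_len}}$ words per length, which makes explicit why the small-word bound of Lemma 3 is what keeps the witness, and hence the certificate, of at most exponential size.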

Next, we address the goal of obtaining a matching lower bound, i.e. NEXPTIME-hardness. An obvious way to do so would be to follow Theorem 3.2 and form a reduction from the bounded octant word-tiling problem. Hence, given a tiling system $\mathcal{S}$ and $n\in\mathbb{N}$ encoded binarily, we would have to construct, in time polynomial in $|\mathcal{S}|+\log n$, an EOT $T_{\mathcal{S},n}\in\mathcal{T}^{\textsc{fix}}_{\circ}$ such that $T_{\mathcal{S},n}(w)=1$ for some $w\in\Sigma^{+}$ iff there is a word $w=t_{1,1},t_{2,1},t_{2,2},t_{3,1},\ldots,t_{n,n}$ representing a valid $\mathcal{S}$-tiling. In particular, $T_{\mathcal{S},n}$ would have to be able to recognise the correct word length and reject input that is longer than $|w|=\frac{n(n+1)}{2}$. This poses a problem for EOT with periodical embeddings. To recognise whether a word is too long, an EOT $T$ must ultimately rely on its positional embedding, which seems to make a periodicity of $p\geq\frac{n(n+1)}{2}$ necessary. Since the size of periodical EOT is linear in $p$, we get an exponential blow-up in a potential reduction of $\textsc{OTWP}_{\mathsf{bin}}$ to $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}_{\circ}]$, given that the values of $\frac{n(n+1)}{2}$ and already $n$ are exponential in the size of a binary representation of $n$. This problem vanishes when the requirement that the underlying positional embedding be periodical is lifted: allowing for non-periodical EOT, working over some fixed-width arithmetic, leads to an NEXPTIME-hard satisfiability problem. Let $\mathcal{T}^{\textsc{fix}}$ be defined similarly to $\mathcal{T}^{\textsc{fix}}_{\circ}$, except that we allow non-periodical embeddings. Furthermore, we assume that the considered fixed-width arithmetics can handle overflow situations using saturation.
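The blow-up can be made concrete with a few lines of arithmetic. The Python sketch below compares the length $\frac{n(n+1)}{2}$ of the encoded tiling word, and thus the period a periodical embedding would seemingly need, with the number of bits required to write $n$ in binary. The function names are illustrative and not taken from the paper.

```python
def encoded_word_length(n: int) -> int:
    """Length n(n+1)/2 of the word encoding a tiling, as used in the main text."""
    return n * (n + 1) // 2


def binary_input_size(n: int) -> int:
    """Number of bits needed to write n >= 1 in binary."""
    return max(1, n.bit_length())


for k in (4, 10, 20):
    n = 2 ** k
    # The required period grows like 2^(2k - 1), while the input grows like k + 1.
    print(k, binary_input_size(n), encoded_word_length(n))
```

A reduction that writes down a periodical embedding with period at least `encoded_word_length(n)` would therefore need time exponential in `binary_input_size(n)`.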

Theorem 5.

The satisfiability problem $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}]$ for EOT over fixed-width arithmetic is NEXPTIME-hard.

Proof sketch.

We establish a reduction from $\textsc{OTP}_{\mathsf{bin}}$ to $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}]$ by constructing, for each instance $(\mathcal{S},n)$ of $\textsc{OTP}_{\mathsf{bin}}$, an EOT $T_{\mathcal{S},n}$ working over some fixed-width arithmetic $F$, which accepts exactly those $w$ with $|w|=\frac{n(n+1)}{2}$ corresponding to a valid word-encoded tiling for $\mathcal{S}$. See Appendix A for details on tiling problems.

The construction is similar to the one given for $T_{\mathcal{S}}$ in the proof of Theorem 4, but we need to enable $T_{\mathcal{S},n}$ to reject words that are too long with respect to a polynomial bound dependent on $n$. This means that $T_{\mathcal{S},n}$, based on the positional embedding $\mathit{emb}$ specified in Section 4, must be able to check for all symbols whether their respective position is less than or equal to a predefined bound. This can be achieved with similar tools as used in Lemma 2.

Furthermore, we need to ensure that $T_{\mathcal{S},n}$ works as intended, despite the fact that it is limited by $F$. The arguments follow the same line as the proof of Theorem 2. A formal proof is given in Appendix C. ∎

6 Summary and outlook

We investigated the satisfiability problem of encoder-only transformers (EOT) through the lens of formal reasoning. In particular, we considered the computability and complexity of the satisfiability problem Sat of EOT in the context of different classes of EOT, forming a baseline for understanding possibilities and challenges of formal reasoning for transformers.

We showed that Sat is undecidable for classes of EOT recently considered in research on the expressiveness of different transformer models (Theorem 1 and Theorem 2). This implies that formal reasoning is impossible as soon as we consider classes of EOT that are at least as expressive as the classes considered in these results. We remark that this result also translates to encoder-decoder architectures whose encoder part is as expressive as the EOT considered here.

Additionally, we identified two ways to make formal reasoning for EOT possible: either we bound the length of considered inputs (Theorem 3) or we consider quantized EOT, meaning EOT whose computations and parameters are limited by some fixed-width arithmetic (Theorem 4). These results imply that formal reasoning is possible as long as we consider classes of EOT that are at most as expressive as the classes considered in these results. We remark that this statement makes the reasonable assumption that the driving force for any upper computability or complexity bound is the expressiveness of the EOT, not the intricacies of the considered safety or interpretability assumptions. However, in both cases Sat remains difficult (Theorem 3 and Theorem 5) from a complexity perspective. Again, these results are only valid for classes of EOT that are at least as expressive as the ones considered.

While our results build a first framework for understanding possibilities and challenges of formal reasoning for transformers, there is room for more detailed investigations. Firstly, consider our undecidability and hardness results. These rely on the fact that we consider normalisations realised by the hardmax function. However, it is unclear whether similar results can be achieved if we stick to normalisations realised by the commonly used softmax function. Furthermore, it would be of interest to further investigate the interplay of embedding function and internal structure of the considered EOT. We expect that less expressive embeddings demand a richer structure of the attention mechanisms, but it is unclear how far the embeddings can be restricted while the satisfiability problem remains undecidable. Secondly, consider our decidability and upper complexity bound results. It could be of practical interest to take a more detailed look at specifics of particular fixed-width arithmetics. While this will not change our results, it could give tighter time-complexity estimates, which may be of interest in certain formal reasoning applications.

References

  • [1] M. S. Baranowski, S. He, M. Lechner, T. S. Nguyen, and Z. Rakamaric. An SMT theory of fixed-point arithmetic. In N. Peltier and V. Sofronie-Stokkermans, editors, Automated Reasoning - 10th International Joint Conference, IJCAR 2020, Paris, France, July 1-4, 2020, Proceedings, Part I, volume 12166 of Lecture Notes in Computer Science, pages 13–31. Springer, 2020.
  • [2] R. Berger. The undecidability of the domino problem. Mem. Amer. Math. Soc., 66:72, 1966.
  • [3] S. Bhattamishra, A. Patel, and N. Goyal. On the computational power of transformers and its implications in sequence modeling. In R. Fernández and T. Linzen, editors, Proceedings of the 24th Conference on Computational Natural Language Learning, CoNLL 2020, Online, November 19-20, 2020, pages 455–475. Association for Computational Linguistics, 2020.
  • [4] G. Bonaert, D. I. Dimitrov, M. Baader, and M. Vechev. Fast and precise certification of transformers. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2021, pages 466–481. Association for Computing Machinery, 2021.
  • [5] Y. Bondarenko, M. Nagel, and T. Blankevoort. Understanding and overcoming the challenges of efficient transformer quantization. In M. Moens, X. Huang, L. Specia, and S. W. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 7947–7969. Association for Computational Linguistics, 2021.
  • [6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • [7] D. Chiang, P. Cholak, and A. Pillay. Tighter bounds on the expressivity of transformer encoders. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 5544–5562. PMLR, 2023.
  • [8] G. A. Constantinides, F. Dahlqvist, Z. Rakamaric, and R. Salvia. Rigorous roundoff error analysis of probabilistic floating-point computations. In A. Silva and K. R. M. Leino, editors, Computer Aided Verification - 33rd International Conference, CAV 2021, Virtual Event, July 20-23, 2021, Proceedings, Part II, volume 12760 of Lecture Notes in Computer Science, pages 626–650. Springer, 2021.
  • [9] S. Demri, V. Goranko, and M. Lange. Temporal Logics in Computer Science. Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, 2016.
  • [10] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
  • [11] X. Dong, A. T. Luu, R. Ji, and H. Liu. Towards robustness against natural language word substitutions. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  • [12] P. Dufter, M. Schmitt, and H. Schütze. Position information in transformers: An overview. Comput. Linguistics, 48(3):733–763, 2022.
  • [13] M. Hahn. Theoretical limitations of self-attention in neural sequence models. Trans. Assoc. Comput. Linguistics, 8:156–171, 2020.
  • [14] Y. Hao, D. Angluin, and R. Frank. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. Trans. Assoc. Comput. Linguistics, 10:800–810, 2022.
  • [15] Y. Hsieh, M. Cheng, D. Juan, W. Wei, W. Hsu, and C. Hsieh. On the robustness of self-attentive models. In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 1520–1529. Association for Computational Linguistics, 2019.
  • [16] X. Huang, W. Ruan, W. Huang, G. Jin, Y. Dong, C. Wu, S. Bensalem, R. Mu, Y. Qi, X. Zhao, K. Cai, Y. Zhang, S. Wu, P. Xu, D. Wu, A. Freitas, and M. A. Mustafa. A survey of safety and trustworthiness of large language models through the lens of verification and validation. CoRR, abs/2305.11391, 2023.
  • [17] J. Marques-Silva and A. Ignatiev. Delivering trustworthy AI through formal XAI. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 12342–12350. AAAI Press, 2022.
  • [18] W. Merrill and A. Sabharwal. A logic for expressing log-precision transformers. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • [19] W. Merrill and A. Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023.
  • [20] W. Merrill, A. Sabharwal, and N. A. Smith. Saturated transformers are constant-depth threshold circuits. Trans. Assoc. Comput. Linguistics, 10:843–856, 2022.
  • [21] OpenAI. GPT-4 technical report, 2023.
  • [22] D. W. Otter, J. R. Medina, and J. K. Kalita. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Networks Learn. Syst., 32(2):604–624, 2021.
  • [23] J. Pérez, P. Barceló, and J. Marinkovic. Attention is turing-complete. J. Mach. Learn. Res., 22:75:1–75:35, 2021.
  • [24] M. Sälzer and M. Lange. Fundamental limits in formal verification of message-passing neural networks. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
  • [25] Z. Shi, H. Zhang, K. Chang, M. Huang, and C. Hsieh. Robustness verification for transformers. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
  • [26] L. Strobl, W. Merrill, G. Weiss, D. Chiang, and D. Angluin. What Formal Languages Can Transformers Express? A Survey. Transactions of the Association for Computational Linguistics, 12:543–561, 05 2024.
  • [27] P. van Emde Boas. The convenience of tilings. In A. Sorbi, editor, Complexity, Logic, and Recursion Theory, volume 187 of Lecture notes in pure and applied mathematics, pages 331–363. Marcel Dekker, Inc., 1997.
  • [28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017.
  • [29] G. Weiss, Y. Goldberg, and E. Yahav. Thinking like transformers. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 11080–11090. PMLR, 2021.
  • [30] Y. Yu, X. Si, C. Hu, and J. Zhang. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput., 31(7):1235–1270, 2019.
  • [31] H. Zhao, H. Chen, F. Yang, N. Liu, H. Deng, H. Cai, S. Wang, D. Yin, and M. Du. Explainability for large language models: A survey. ACM Trans. Intell. Syst. Technol., 15(2):20:1–20:38, 2024.

Appendix A Tiling Problems

We make use of particular tiling problems in order to prove lower bounds on the complexity and decidability of $\textsc{Sat}[\mathcal{T}]$ for different classes $\mathcal{T}$.

A tiling system is a tuple $\mathcal{S}=(S,H,V,t_I,t_F)$ where $S$ is a finite set; its elements are called tiles. $H,V\subseteq S\times S$ define a horizontal, resp. vertical matching relation between tiles, and $t_I,t_F$ are two designated initial, resp. final tiles in $S$.

Problems associated with tiling systems are typically of the following form: given a discrete convex plane consisting of cells with horizontal and vertical neighbors, is it possible to cover the plane with tiles from $S$ in a way that horizontally adjacent tiles respect the relation $H$ and vertically adjacent tiles respect the relation $V$, together with some additional constraints about where to put the initial and final tile $t_I,t_F$? Such tiling problems, in particular for rectangular planes, have proved to be extremely useful in computational complexity, cf. [2, 27], since they can be seen as abstract versions of halting problems.

We need a variant in which the plane to be tiled is of triangular shape. The $n$-th triangle is $\mathcal{O}_n=\{(i,j)\in\mathbb{N}\times\mathbb{N}\mid j\leq i\leq n\}$ for $n>0$. An ($\mathcal{S}$)-tiling of $\mathcal{O}_n$ is a function $\tau\colon\mathcal{O}_n\to S$ s.t.

  • $(\tau(i,j),\tau(i,j+1))\in H$ for all $(i,j)\in\mathcal{O}$ with $j<i\leq n$,

  • $(\tau(i,j),\tau(i+1,j))\in V$ for all $(i,j)\in\mathcal{O}$ with $j\leq i<n$.

Such a tiling is successful if additionally $\tau(0,0)=t_I$ and $\tau(i,i)=t_F$ for some $(i,i)\in\mathcal{O}_n$.
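The definition translates directly into a checker. The following Python sketch is a minimal illustration under the assumption that $\mathbb{N}$ includes $0$ (as suggested by the condition $\tau(0,0)=t_I$); tiles are represented as strings and a tiling as a dictionary from cells to tiles, both of which are choices made here for illustration.

```python
from typing import Dict, Set, Tuple

Tile = str
Cell = Tuple[int, int]


def octant(n: int) -> Set[Cell]:
    """All cells (i, j) with 0 <= j <= i <= n (assuming 0 is a natural number)."""
    return {(i, j) for i in range(n + 1) for j in range(i + 1)}


def is_tiling(tau: Dict[Cell, Tile], n: int,
              H: Set[Tuple[Tile, Tile]], V: Set[Tuple[Tile, Tile]]) -> bool:
    """Check the horizontal and vertical matching conditions on the n-th triangle."""
    cells = octant(n)
    if set(tau) != cells:
        return False
    for (i, j) in cells:
        # Horizontal neighbour (i, j + 1) exists whenever j < i.
        if j < i and (tau[(i, j)], tau[(i, j + 1)]) not in H:
            return False
        # Vertical neighbour (i + 1, j) exists whenever i < n.
        if i < n and (tau[(i, j)], tau[(i + 1, j)]) not in V:
            return False
    return True


def is_successful(tau: Dict[Cell, Tile], n: int,
                  H: Set[Tuple[Tile, Tile]], V: Set[Tuple[Tile, Tile]],
                  t_I: Tile, t_F: Tile) -> bool:
    """A tiling is successful if it starts with t_I and t_F occurs on the diagonal."""
    return (is_tiling(tau, n, H, V)
            and tau[(0, 0)] == t_I
            and any(tau[(i, i)] == t_F for i in range(n + 1)))


tau = {(0, 0): "a", (1, 0): "a", (1, 1): "b"}
print(is_successful(tau, 1, H={("a", "b")}, V={("a", "a")}, t_I="a", t_F="b"))
```

The check touches every cell a constant number of times, so it runs in time linear in the number of cells of $\mathcal{O}_n$.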

The unbounded octant tiling problem ($\textsc{OTP}^{*}$) is: given a tiling system $\mathcal{S}$, decide whether a successful $\mathcal{S}$-tiling of $\mathcal{O}_n$ exists for some $n\in\mathbb{N}$. The bounded octant tiling problem (OTP) is: given a tiling system $\mathcal{S}$ and an $n\geq 1$, decide whether a successful $\mathcal{S}$-tiling of $\mathcal{O}_n$ exists. Note that here, $n$ is part of the input, and that it can be represented differently, for example in binary or in unary encoding. We distinguish these two cases by referring to $\textsc{OTP}_{\mathsf{bin}}$ and $\textsc{OTP}_{\mathsf{un}}$.

It is well-known that $\textsc{OTP}^{*}$ is undecidable [27]. It is also not hard to imagine that $\textsc{OTP}_{\mathsf{un}}$ is NP-complete while $\textsc{OTP}_{\mathsf{bin}}$ is NEXPTIME-complete. In fact, this is well-known for the variants in which the underlying plane is not a triangle of height $n$ but a square of height $n$ [27]. The exponential difference incurred by the more compact binary representation of the input parameter $n$ is best seen when regarding the upper complexity bound for these problems: given $n$, a nondeterministic algorithm can simply guess all $n^{2}$ tiles of the underlying square and verify the horizontal and vertical matchings in time $\mathcal{O}(n^{2})$. If $n$ is encoded unarily, i.e. the space needed to write it down is $s:=n$, then the time needed for this is polynomial in the input size $s$; if $n$ is encoded binarily with space $s:=\lceil\log n\rceil$, then the time needed for this is exponential in $s$.

It then remains to argue that the tiling problems based on triangular planes are also NP- resp. NEXPTIME-complete. Clearly, the upper bounds can be established with the same guess-and-check procedure. For the lower bounds it suffices to observe that hardness of the tiling problems for the squares is established by a reduction from the halting problem for Turing machines (TM) such that a square of size $n\times n$ represents a run of the TM of length $n$ as a sequence of rows, and each row represents a configuration of the TM using at most $n$ tape cells. This makes use of the observation that the space consumption of a TM can never exceed the time consumption. Likewise, assuming that a TM always starts a computation with its head on the leftmost end of the tape, one can easily observe that after $i$ time steps, it can change at most the $i$ leftmost tape cells. Hence, a run of a TM can also be represented as a triangle with its first configuration of length 1 in row 1, the second of length 2 in row 2, etc.

At last, we consider two slight modifications of these two problems which are easily seen to preserve undecidability resp. NP- and NEXPTIME-completeness. The unbounded octant tiling-word problem ($\textsc{OTWP}^{*}$) is: given some $\mathcal{S}=(S,H,V,t_I,t_F)$, decide whether there is a word $t_{0,0},t_{1,0},t_{1,1},t_{2,0},t_{2,1},t_{2,2},\ldots,t_{n,n}\in S^{*}$ for some $n\in\mathbb{N}$, s.t. the tiling $\tau$ defined by $\tau(i,j):=t_{i,j}$ comprises a successful tiling of $\mathcal{O}_n$. The two variants of the bounded octant tiling-word problem are both: given some $\mathcal{S}$ as above and $n$, decide whether such a word exists. Note that, again, here $n$ is an input parameter, and so its representation may affect the complexity of the problem, leading to the distinction between $\textsc{OTWP}_{\mathsf{bin}}$ with binary encoding and $\textsc{OTWP}_{\mathsf{un}}$ with unary encoding.
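The correspondence between a tiling word and a tiling is a row-by-row enumeration of the triangle. The Python sketch below spells this out with the 0-based indexing used in this appendix; the function names are illustrative, and representing a word as a list of tile names is an assumption made for the example.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def position(i: int, j: int) -> int:
    """0-based index of tile t_{i,j} in the word t_{0,0}, t_{1,0}, t_{1,1}, ..."""
    return i * (i + 1) // 2 + j


def word_to_tiling(word: List[str], n: int) -> Dict[Cell, str]:
    """Read a word of length (n+1)(n+2)/2 as the tiling tau(i, j) := t_{i,j}."""
    expected = (n + 1) * (n + 2) // 2
    if len(word) != expected:
        raise ValueError(f"expected a word of length {expected}, got {len(word)}")
    return {(i, j): word[position(i, j)]
            for i in range(n + 1) for j in range(i + 1)}


print(word_to_tiling(["a", "a", "b"], 1))
```

Because the map is a bijection between words of the right length and functions $\mathcal{O}_n\to S$, a witness for one formulation can be converted into a witness for the other in linear time.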

Theorem 6.
  • a)

    $\textsc{OTWP}^{*}$ is undecidable ($\Sigma^{1}_{0}$-complete).

  • b)

    $\textsc{OTWP}_{\mathsf{bin}}$ is NEXPTIME-complete.

  • c)

    $\textsc{OTWP}_{\mathsf{un}}$ is NP-complete.

Proof.

(a) It should be clear that a tiling problem and its tiling-word variant (like $\textsc{OTP}^{*}$ and $\textsc{OTWP}^{*}$) are interreducible since they only differ in the formulation of how the witness for a successful tiling should be presented. So they are essentially the same problems. Undecidability of $\textsc{OTP}^{*}$ and, thus, $\textsc{OTWP}^{*}$ is known from [27]; the $\Sigma^{1}_{0}$-upper bound can be obtained through a semi-decision procedure that searches through the infinite space of $\mathcal{O}_n$-tilings for any $n>1$. This justifies the statement in part (a) of Thm. 6.

(b) With the same argument as in (a) it suffices to consider $\textsc{OTP}_{\mathsf{bin}}$ instead of $\textsc{OTWP}_{\mathsf{bin}}$. The upper bound is easy to see: a nondeterministic procedure can easily guess a tiling for $\mathcal{O}_n$ and verify the horizontal and vertical matching conditions, as well as the use of the initial and final tile in appropriate places. This is possible in time $\mathcal{O}(n^{2})$, resp. $\mathcal{O}(2^{2\log n})$, which is therefore exponential in the input size $\lceil\log n\rceil$ for binarily encoded parameters $n$. This shows inclusion in NEXPTIME.

For the lower bound we argue that the halting problem for nondeterministic, exponentially-time bounded TM can be reduced to $\textsc{OTP}_{\mathsf{bin}}$: given a nondeterministic TM $\mathcal{M}$ over input alphabet $\Sigma$ and tape alphabet $\Gamma$ that halts after at most $2^{p(n)}$ steps on input words of length $n$ for some polynomial $p$, and a word $w\in\Sigma^{*}$, we first construct a TM $\mathcal{M}_w$ that is started on the empty tape, begins by writing $w$ onto the tape and then simulates $\mathcal{M}$ on it. This is a standard construction in complexity theory, and it is easy to see that the running time of $\mathcal{M}_w$ is bounded by a function $2^{p'(|w|)}$ for some polynomial $p'$. With the observation made above, a computation of $\mathcal{M}_w$ can be seen as a sequence of configurations $C_1,\ldots,C_{p'(|w|)}$, with $|C_i|=i$. This does not directly define a tiling system; instead, and again by a standard trick, cf. [27] or [9, Chp. 11], one compresses three adjacent tape cells into one tile in order to naturally derive a horizontal matching relation from overlaps between such triples and a vertical matching relation from the TM's transition function. At last, let $n':=p'(|w|)$. It is then a simple exercise to verify that a valid tiling of the triangle $\Delta_{n'}$ corresponds to an accepting run of $\mathcal{M}$ on $w$ and vice versa, which establishes NEXPTIME-hardness.

(c) This is done exactly along the same lines as part (b), but making use of the fact that, when $n$ is given in unary encoding, $p(n)$ is polynomial in the size of the representation of $n$. Hence, the time needed for the guess-and-check procedure in the upper bound is only polynomial, and for the lower bound we need to assume that the running time of the TM is polynomially bounded. Thus, we get NP-completeness instead of NEXPTIME-completeness. ∎

Appendix B Proofs of Section 4

In the following, we give formal proofs for the undecidability results of Section 4. To do so, we make use of classical Feed-Forward Neural Networks.

Feed-Forward Neural Network

A neuron $v$ is a computational unit computing a function $\mathbb{R}^{m}\rightarrow\mathbb{R}$ by $v(x_{1},\dotsc,x_{m})=\sigma(b+\sum_{i=1}^{m}w_{i}x_{i})$ where $\sigma$ is a function called activation and $b,w_{i}$ are parameters called bias resp. weight. A layer $l$ is a tuple of neurons $(v_{1},\dotsc,v_{n})$ where we assume that all neurons have the same input dimensionality $m$. Therefore, $l$ computes a function $\mathbb{R}^{m}\rightarrow\mathbb{R}^{n}$. We call $n$ the size of layer $l$.
A Feed-Forward Neural Network (FNN) $N$ is a tuple $(l_{1},\dotsc,l_{k})$ of layers where we assume that for all $i\leq k-1$ the size of $l_{i}$ equals the input dimensionality of $l_{i+1}$. If $l_{1}$ has input dimensionality $m$ and $l_{k}$ has size $n$, then $N$ computes a function $\mathbb{R}^{m}\rightarrow\mathbb{R}^{n}$ by processing an input layer by layer.

In particular, we use specific FNN with $\mathit{relu}(x)=\max(0,x)$ activations, called gadgets, to derive lower bounds in connection with the expressibility of transformers. We denote the class of all FNN with $\mathit{relu}$ activations by $\mathcal{N}(\mathit{relu})$.
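
As an illustration of the definition, the layer-by-layer evaluation of an FNN with $\mathit{relu}$ activations can be sketched in Python. The data layout (a neuron as a pair of bias and weight list) and the names `eval_layer`, `eval_fnn` and `abs_diff_net` are our own assumptions for this sketch and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# Assumed representation: a neuron is (bias, weights), a layer is a list
# of neurons, and an FNN is a list of layers.
Neuron = Tuple[float, Sequence[float]]
Layer = List[Neuron]


def relu(x: float) -> float:
    return max(0.0, x)


def eval_layer(layer: Layer, xs: Sequence[float]) -> List[float]:
    # Each neuron computes relu(b + sum_i w_i * x_i).
    return [relu(b + sum(w * x for w, x in zip(ws, xs))) for b, ws in layer]


def eval_fnn(layers: List[Layer], xs: Sequence[float]) -> List[float]:
    # Process the input layer by layer.
    out = list(xs)
    for layer in layers:
        out = eval_layer(layer, out)
    return out


# Toy network R^2 -> R computing relu(relu(x1 - x2) + relu(x2 - x1)) = |x1 - x2|.
abs_diff_net: List[Layer] = [
    [(0.0, [1.0, -1.0]), (0.0, [-1.0, 1.0])],  # layer of size 2
    [(0.0, [1.0, 1.0])],                       # layer of size 1
]
```

For example, `eval_fnn(abs_diff_net, [1.0, 4.0])` evaluates the first layer to `[0.0, 3.0]` and the second to `[3.0]`.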

Lemma 4.

Let $k\in\mathbb{R}^{>0}$. There are basic gadgets

  1. $N_{|\cdot|}\in\mathcal{N}(\mathit{relu})$ computing $N_{|\cdot|}(x)=|x|$,

  2. $N_{<}\in\mathcal{N}(\mathit{relu})$ computing a function $\mathbb{R}^{2}\rightarrow\mathbb{R}$ such that $N_{<}(x_{1},x_{2})=0$ if $(x_{1}+1)-x_{2}\leq 0$, $N_{<}(x_{1},x_{2})=(x_{1}+1)-x_{2}$ if $(x_{1}+1)-x_{2}\in(0;1)$ and $N_{<}(x_{1},x_{2})=1$ otherwise,

  3. $N_{=}\in\mathcal{N}(\mathit{relu})$ computing a function $\mathbb{R}^{2}\rightarrow\mathbb{R}$ such that $N_{=}(x_{1},x_{2})=0$ if $x_{1}-x_{2}=0$, $N_{=}(x_{1},x_{2})=|x_{2}-x_{1}|$ if $|x_{2}-x_{1}|\in(0;1)$ and $N_{=}(x_{1},x_{2})=1$ otherwise,

  4. $N_{\rightarrow}\in\mathcal{N}(\mathit{relu})$ computing a function $\mathbb{R}^{2}\rightarrow\mathbb{R}$ such that for all inputs $x_{1},x_{2}$ with $x_{1}\in\{0,1\}$ and $x_{2}\in[0;k]$ it holds that $N_{\rightarrow}(x_{1},x_{2})=0$ if $x_{1}=x_{2}=0$ or $x_{1}=1$, and $N_{\rightarrow}(x_{1},x_{2})=\mathit{relu}(x_{2})$ otherwise.

Proof.

Let $N_{|\cdot|}$ be the minimal FNN computing $\mathit{relu}(\mathit{relu}(-x)+\mathit{relu}(x))$, let $N_{<}$ be the minimal FNN computing $\mathit{relu}(f_{<}(x_{1},x_{2})-f_{<}(x_{1},x_{2}+1))$ where $f_{<}(y_{1},y_{2})=\mathit{relu}(y_{1}-y_{2}+1)$, and let $N_{=}$ be the minimal FNN computing $\mathit{relu}(f_{=}(x_{1},x_{2})-f_{=}(x_{1}+1,x_{2})+f_{=}(x_{2},x_{1})-f_{=}(x_{2}+1,x_{1}))$ where $f_{=}(y_{1},y_{2})=\mathit{relu}(y_{2}-y_{1})$. The claims of the lemma regarding these gadgets are straightforward given their functional form. Let $N_{\rightarrow}$ be the minimal FNN computing $\mathit{relu}(\mathit{relu}(x_{2})-k\cdot\mathit{relu}(x_{1}))$. As stated in the lemma, we assume that $x_{1}\in\{0,1\}$ and $x_{2}\in[0;k]$. Then, $-k\cdot\mathit{relu}(x_{1})$ is $-k$ if $x_{1}=1$ and $0$ if $x_{1}=0$. Thus, $N_{\rightarrow}$ is guaranteed to be $0$ if $x_{1}=1$, and otherwise its value depends on $x_{2}$.
This gives the claim regarding gadget $N_{\rightarrow}$. ∎
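
The functional forms used in this proof can be checked numerically. The following Python sketch transcribes the four expressions directly. The function names (`n_abs`, `n_lt`, `n_eq`, `n_impl`) are our own, and the functions evaluate the closed-form expressions rather than building explicit weight matrices.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def n_abs(x: float) -> float:
    # N_{|.|}: relu(relu(-x) + relu(x)) = |x|
    return relu(relu(-x) + relu(x))


def f_lt(y1: float, y2: float) -> float:
    return relu(y1 - y2 + 1)


def n_lt(x1: float, x2: float) -> float:
    # N_<: clamps (x1 + 1) - x2 to the interval [0, 1]
    return relu(f_lt(x1, x2) - f_lt(x1, x2 + 1))


def f_eq(y1: float, y2: float) -> float:
    return relu(y2 - y1)


def n_eq(x1: float, x2: float) -> float:
    # N_=: clamps |x2 - x1| to the interval [0, 1]
    return relu(f_eq(x1, x2) - f_eq(x1 + 1, x2)
                + f_eq(x2, x1) - f_eq(x2 + 1, x1))


def n_impl(x1: float, x2: float, k: float) -> float:
    # N_->: intended for x1 in {0, 1} and x2 in [0, k];
    # returns 0 if x1 = 1 and relu(x2) if x1 = 0.
    return relu(relu(x2) - k * relu(x1))
```

On integer inputs, `n_lt(x1, x2)` is `0` exactly when `x1 < x2`, and `n_eq(x1, x2)` is `0` exactly when `x1 == x2`, which is how the gadgets are used as tests in the later constructions.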

We will combine gadgets in different ways. Let $N_{1}$ and $N_{2}$ be FNN with the same input dimensionality $m$ and output dimensionality $n_{1}$ respectively $n_{2}$. We extend the computation of $N_{1}$ to functions $\mathbb{R}^{m^{\prime}}\rightarrow\mathbb{R}^{n_{1}}$ with $m<m^{\prime}$ by weighting additional dimensions with $0$ in the input layer. Given a set of input dimensions $x_{1},\dotsc,x_{m^{\prime}}$, we denote the effective dimensions $x_{i_{1}},\dotsc,x_{i_{m}}$ with pairwise different $i_{j}\in\{1,\dotsc,m^{\prime}\}$ by $N^{x_{i_{1}},\dotsc,x_{i_{m}}}_{1}$. Formally, this means that $N^{x_{i_{1}},\dotsc,x_{i_{m}}}_{1}(x_{1},\dotsc,x_{m^{\prime}})=N_{1}(x_{i_{1}},\dotsc,x_{i_{m}})$ for all inputs. We denote the FNN consisting of $N_{1}$ and $N_{2}$ placed next to each other by $N_{1}|\!|N_{2}$. Formally, this is done by combining $N_{1}$ and $N_{2}$ layer by layer using $0$ weights in intersecting connections. Then, $N_{1}|\!|N_{2}$ computes $\mathbb{R}^{m}\rightarrow\mathbb{R}^{n_{1}+n_{2}}$ given by $N_{1}|\!|N_{2}(\boldsymbol{x})=(N_{1}(\boldsymbol{x}),N_{2}(\boldsymbol{x}))$. We generalize this operation to $k$ FNN $N_{1}|\!|\dotsb|\!|N_{k}$ in the obvious sense. Let $N_{3}$ be an FNN with input dimensionality $n_{1}$ and output dimensionality $n_{3}$. We denote the FNN consisting of $N_{1}$ and $N_{3}$ placed sequentially by $N_{3}\circ N_{1}$. Formally, this is done by connecting the output layer of $N_{1}$ with the input layer of $N_{3}$. Then, $N_{3}\circ N_{1}$ computes $\mathbb{R}^{m}\rightarrow\mathbb{R}^{n_{3}}$ given by $N_{3}\circ N_{1}(\boldsymbol{x})=N_{3}(N_{1}(\boldsymbol{x}))$.
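
The three operations just described (restriction to effective dimensions, parallel placement and sequential placement) can be sketched as function combinators in Python. This is a simplification under our own assumptions: networks are modelled as plain Python functions from input lists to output lists, so the zero-weight wiring is abstracted away. The names `restrict`, `parallel` and `compose` are hypothetical.

```python
from typing import Callable, List, Sequence

# Assumed model: a network is a function from an input vector to an output vector.
Net = Callable[[Sequence[float]], List[float]]


def relu(x: float) -> float:
    return max(0.0, x)


def restrict(net: Net, indices: Sequence[int]) -> Net:
    # N^{x_{i_1},...,x_{i_m}}: only the listed (0-based) input dimensions
    # are effective; all other dimensions are ignored.
    return lambda xs: net([xs[i] for i in indices])


def parallel(*nets: Net) -> Net:
    # N_1 || ... || N_k: every net reads the same input, outputs are concatenated.
    def combined(xs: Sequence[float]) -> List[float]:
        out: List[float] = []
        for net in nets:
            out.extend(net(xs))
        return out
    return combined


def compose(outer: Net, inner: Net) -> Net:
    # outer o inner: the output of inner is the input of outer.
    return lambda xs: outer(inner(xs))


# Example: |x1 - x2| built from two one-neuron nets and a summing net.
n1: Net = lambda xs: [relu(xs[0] - xs[1])]
n2: Net = lambda xs: [relu(xs[1] - xs[0])]
n3: Net = lambda ys: [relu(ys[0] + ys[1])]
abs_diff: Net = compose(n3, parallel(n1, n2))
```

Here `abs_diff([1.0, 4.0])` first evaluates `parallel(n1, n2)` to `[0.0, 3.0]` and then applies `n3`, giving `[3.0]`.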

We also consider specific gadgets needed in the context of tiling problems.

Lemma 5.

Let $S\subseteq\mathbb{N}$ be a finite set and $R\subseteq S^{2}$. There is an FNN $N_{R}\in\mathcal{N}(\mathit{relu})$ computing a function $\mathbb{R}^{2}\rightarrow\mathbb{R}$ such that $N_{R}(x_{1},x_{2})\in\{0,1\}$ if $(x_{1},x_{2})\in S^{2}$ and $N_{R}(x_{1},x_{2})=0$ iff $(x_{1},x_{2})\in R$, and there is an FNN $N_{=t}\in\mathcal{N}(\mathit{relu})$ for each $t\in S$ computing a function $\mathbb{R}\rightarrow\mathbb{R}$ such that $N_{=t}(x)\in\{0,1\}$ for each $x\in\mathbb{N}$ and $N_{=t}(x)=0$ iff $x=t$.

Proof.

Let $S\subseteq\mathbb{N}$ be finite, $R\subseteq S^{2}$ and $t\in S$. First, consider $N_{=t}$. Let $N_{t}$ be the minimal FNN computing $\mathit{relu}(0\cdot x+t)$ and $N_{\mathit{id}}$ be the minimal FNN computing $(\mathit{relu}(x),-\mathit{relu}(-x))$. Obviously, $N_{t}$ computes the constant $t$ function and $N_{\mathit{id}}$ computes the identity in the form of two-dimensional vectors. Let $N_{=t}$ be given by the minimal FNN computing $N_{=}\circ(N_{\mathit{id}}|\!|N_{t})$ with the slight alteration that the two output dimensions of $N_{\mathit{id}}$ are connected to the first dimension of $N_{=}$. Then, the claim of the lemma regarding $N_{=t}$ follows from Lemma 4 and the operations on FNN described in Appendix B.

Now, consider $N_{R}$. Given some $s\in S$ let $R[s]=\{r\mid(s,r)\in R\}$. Let $N^{k}_{\land}$ be the minimal FNN computing $\mathit{relu}(x_{1}+\dotsb+x_{k})$. Furthermore, let $N_{\in T}$ for some set $T\subseteq S$ be the minimal FNN such that $N_{\in T}(x)=0$ if $x\in T$ and $N_{\in T}(x)=1$ if $x\in S\setminus T$. A construction for $N_{\in T}$ is given in Theorem 4 in [24]. According to this construction, $N_{\in T}$ consists of three layers and its size is polynomial in $|T|$. In the case that $T=\emptyset$ we assume that $N_{\in\emptyset}$ is the constant $1$ function represented by a suitable FNN.
Then, $N_{R}$ is given by $N^{|S|}_{\land}\circ((N_{\rightarrow}\circ(N_{=s_{1}}|\!|N_{\in R[s_{1}]}))|\!|\dotsb|\!|(N_{\rightarrow}\circ(N_{=s_{|S|}}|\!|N_{\in R[s_{|S|}]})))$ for some arbitrary order on $S$ with the slight alteration that $N_{R}$ has two input dimensions, meaning that each subnet $(N_{=s_{i}}|\!|N_{\in R[s_{i}]})$ is connected to the same two input dimensions. Again, the claim of the lemma regarding $N_{R}$ follows from Lemma 4 and the operations on FNN described in Appendix B. ∎
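
The structure of $N_{R}$ can be illustrated on a small example. The Python sketch below follows the composition above, with two loudly stated assumptions: `n_in` is an idealised stand-in for $N_{\in T}$ (it is a plain set-membership test, not the three-layer construction of [24]), and all gadgets are evaluated as closed-form functions rather than explicit networks. All function names are hypothetical.

```python
from typing import Callable, Iterable, Set, Tuple


def relu(x: float) -> float:
    return max(0.0, x)


def n_eq(x1: float, x2: float) -> float:
    # Gadget N_= from Lemma 4: min(|x2 - x1|, 1) expressed with relu.
    return relu(relu(x2 - x1) - relu(x2 - x1 - 1)
                + relu(x1 - x2) - relu(x1 - x2 - 1))


def n_impl(x1: float, x2: float, k: float = 1.0) -> float:
    # Gadget N_-> from Lemma 4.
    return relu(relu(x2) - k * relu(x1))


def n_in(T: Set[int]) -> Callable[[int], float]:
    # ASSUMPTION: idealised stand-in for N_{in T}; 0 on T, 1 otherwise.
    # The empty set yields the constant 1 function, as in the proof.
    return lambda x: 0.0 if x in T else 1.0


def n_rel(S: Iterable[int], R: Set[Tuple[int, int]]) -> Callable[[int, int], float]:
    elems = sorted(S)  # some arbitrary order on S
    succ = {s: {r for (a, r) in R if a == s} for s in elems}  # R[s]

    def net(x1: int, x2: int) -> float:
        # One branch per s: the branch is 0 unless x1 = s and x2 is not in R[s].
        branches = [n_impl(n_eq(x1, s), n_in(succ[s])(x2)) for s in elems]
        return relu(sum(branches))  # N_and^{|S|}

    return net
```

For `S = {0, 1, 2}` and `R = {(0, 1), (1, 2)}`, the resulting function returns `0.0` on the pairs in `R` and `1.0` on every other pair from `S × S`, since exactly one branch (the one with `s = x1`) can be non-zero.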

With these gadgets in place, we are set to formally prove the results of Section 4.

Proof of Lemma 1.

Let w=t0,0t1,0t1,1t2,0tm,nS+𝑤subscript𝑡00subscript𝑡10subscript𝑡11subscript𝑡20subscript𝑡𝑚𝑛superscript𝑆w=t_{0,0}t_{1,0}t_{1,1}t_{2,0}\dotsb t_{m,n}\in S^{+}italic_w = italic_t start_POSTSUBSCRIPT 0 , 0 end_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 1 , 0 end_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 2 , 0 end_POSTSUBSCRIPT ⋯ italic_t start_POSTSUBSCRIPT italic_m , italic_n end_POSTSUBSCRIPT ∈ italic_S start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT as stated in the lemma and assume some order aisubscript𝑎𝑖a_{i}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT on S𝑆Sitalic_S. Furthermore, let 𝑒𝑚𝑏(ai,1)=(1,1,1,1,i)𝑒𝑚𝑏subscript𝑎𝑖11111𝑖\mathit{emb}(a_{i},1)=(1,1,1,1,i)italic_emb ( italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , 1 ) = ( 1 , 1 , 1 , 1 , italic_i ) and 𝑒𝑚𝑏(ai,j)=(0,1,j,h=0jh,i)𝑒𝑚𝑏subscript𝑎𝑖𝑗01𝑗superscriptsubscript0𝑗𝑖\mathit{emb}(a_{i},j)=(0,1,j,\sum_{h=0}^{j}h,i)italic_emb ( italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_j ) = ( 0 , 1 , italic_j , ∑ start_POSTSUBSCRIPT italic_h = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT italic_h , italic_i ) if j>1𝑗1j>1italic_j > 1. Let 𝑒𝑚𝑏(w)=𝒙10𝒙k0𝑒𝑚𝑏𝑤subscriptsuperscript𝒙01subscriptsuperscript𝒙0𝑘\mathit{emb}(w)=\boldsymbol{x}^{0}_{1}\dotsb\boldsymbol{x}^{0}_{k}italic_emb ( italic_w ) = bold_italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋯ bold_italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. In the following, we build two layers l1subscript𝑙1l_{1}italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and l2subscript𝑙2l_{2}italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT using components allowed in 𝒯𝑢𝑑𝑒𝑐subscript𝒯𝑢𝑑𝑒𝑐\mathcal{T}_{\mathit{udec}}caligraphic_T start_POSTSUBSCRIPT italic_udec end_POSTSUBSCRIPT, satisfying the statement of the lemma. 
Layer l1subscript𝑙1l_{1}italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT consists of a single attention head 𝑎𝑡𝑡1,1=(𝑠𝑐𝑜𝑟𝑒1,1,𝑝𝑜𝑜𝑙1,1)subscript𝑎𝑡𝑡11subscript𝑠𝑐𝑜𝑟𝑒11subscript𝑝𝑜𝑜𝑙11\mathit{att}_{1,1}=(\mathit{score}_{1,1},\mathit{pool}_{1,1})italic_att start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT = ( italic_score start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT , italic_pool start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT ). The scoring function is given by 𝑠𝑐𝑜𝑟𝑒1,1(𝒙i0,𝒙j0)=N1,1(Q1,1𝒙i0,K1,1𝒙j0)subscript𝑠𝑐𝑜𝑟𝑒11subscriptsuperscript𝒙0𝑖subscriptsuperscript𝒙0𝑗subscript𝑁11subscript𝑄11subscriptsuperscript𝒙0𝑖subscript𝐾11subscriptsuperscript𝒙0𝑗\mathit{score}_{1,1}(\boldsymbol{x}^{0}_{i},\boldsymbol{x}^{0}_{j})=N_{1,1}(% \langle Q_{1,1}\boldsymbol{x}^{0}_{i},K_{1,1}\boldsymbol{x}^{0}_{j}\rangle)italic_score start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = italic_N start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT ( ⟨ italic_Q start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT bold_italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_K start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT bold_italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ⟩ ) where Q1,1=[(0,0,1,0,0),(0,1,0,0,0),(0,1,0,0,0)]subscript𝑄11001000100001000Q_{1,1}=[(0,0,-1,0,0),(0,1,0,0,0),(0,1,0,0,0)]italic_Q start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT = [ ( 0 , 0 , - 1 , 0 , 0 ) , ( 0 , 1 , 0 , 0 , 0 ) , ( 0 , 1 , 0 , 0 , 0 ) ] and K1,1=[(0,1,0,0,0),(0,1,0,0,0),(0,0,0,1,0)]subscript𝐾11010000100000010K_{1,1}=[(0,1,0,0,0),(0,1,0,0,0),(0,0,0,1,0)]italic_K start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT = [ ( 0 , 1 , 0 , 0 , 0 ) , ( 0 , 1 , 0 , 0 , 0 ) , ( 0 , 0 , 0 , 1 , 0 ) ] and N(x)=𝑟𝑒𝑙𝑢(x)𝑁𝑥𝑟𝑒𝑙𝑢𝑥N(x)=-\mathit{relu}(x)italic_N ( 
italic_x ) = - italic_relu ( italic_x ). We have that 𝑠𝑐𝑜𝑟𝑒1,1(𝒙i0,𝒙j0)=𝑟𝑒𝑙𝑢((h=0jh)(i1))subscript𝑠𝑐𝑜𝑟𝑒11subscriptsuperscript𝒙0𝑖subscriptsuperscript𝒙0𝑗𝑟𝑒𝑙𝑢superscriptsubscript0𝑗𝑖1\mathit{score}_{1,1}(\boldsymbol{x}^{0}_{i},\boldsymbol{x}^{0}_{j})=-\mathit{% relu}((\sum_{h=0}^{j}h)-(i-1))italic_score start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = - italic_relu ( ( ∑ start_POSTSUBSCRIPT italic_h = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT italic_h ) - ( italic_i - 1 ) ) and it follows that 𝑠𝑐𝑜𝑟𝑒1,1(𝒙i0,𝒙j0)=0subscript𝑠𝑐𝑜𝑟𝑒11subscriptsuperscript𝒙0𝑖subscriptsuperscript𝒙0𝑗0\mathit{score}_{1,1}(\boldsymbol{x}^{0}_{i},\boldsymbol{x}^{0}_{j})=0italic_score start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = 0 if h=0jhi1superscriptsubscript0𝑗𝑖1\sum_{h=0}^{j}h\leq i-1∑ start_POSTSUBSCRIPT italic_h = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT italic_h ≤ italic_i - 1 and otherwise we have that 𝑠𝑐𝑜𝑟𝑒1,1(𝒙i0,𝒙j0)<0subscript𝑠𝑐𝑜𝑟𝑒11subscriptsuperscript𝒙0𝑖subscriptsuperscript𝒙0𝑗0\mathit{score}_{1,1}(\boldsymbol{x}^{0}_{i},\boldsymbol{x}^{0}_{j})<0italic_score start_POSTSUBSCRIPT 1 , 1 end_POSTSUBSCRIPT ( bold_italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) < 0. 
The pooling function is specified by the matrix $W_{1,1}=[(1,0,0,0,0)]$ and uses $\mathit{hmax}$ as normalisation function. The combination function $\mathit{comb}_{1}$ is given by the FNN $N_{1}(x_{1},\dotsc,x_{5},y)=\mathit{relu}(x_{2})|\!|\dotsb|\!|\mathit{relu}(x_{5})|\!|\mathit{relu}(y)$. Given a position $\boldsymbol{x}^{0}_{i}$, the attention head $\mathit{att}_{1,1}$ attends to all positions $\boldsymbol{x}^{0}_{j}$ satisfying $\sum_{h=0}^{j}h\leq i-1$. This is due to the way $\mathit{score}_{1,1}$ is built.
Then, $\mathit{att}_{1,1}$ computes $\frac{1}{l}$ using $\mathit{pool}_{1,1}$ where $l$ is the number of positions $\mathit{att}_{1,1}$ attends to. Here, we exploit the fact that only the first position $\boldsymbol{x}^{0}_{1}$ has a non-zero entry in its first dimension and that for all $i$ the head $\mathit{att}_{1,1}$ attends to $\boldsymbol{x}^{0}_{1}$. Finally, $\mathit{comb}_{1}$ simply stacks the old vector $\boldsymbol{x}^{0}_{i}$ onto the value $\frac{1}{l}$, but leaves out the first dimension of $\boldsymbol{x}^{0}_{i}$. Let $l_{1}(\mathit{emb}(w))=\boldsymbol{x}^{1}_{1}\dotsb\boldsymbol{x}^{1}_{k}$.
Layer $l_{2}$ consists of a single attention head $\mathit{att}_{2,1}=(\mathit{score}_{2,1},\mathit{pool}_{2,1})$. The scoring function $\mathit{score}_{2,1}$ is given by $N_{2,1}(\langle Q_{2,1}\boldsymbol{x}^{1}_{i},K_{2,1}\boldsymbol{x}^{1}_{j}\rangle)$ where $Q_{2,1}=[(0,0,0,0,1)]$, $K_{2,1}=[(0,1,0,0,0)]$ and $N_{2,1}(x)=-\mathit{relu}(\mathit{relu}(x-1)+\mathit{relu}(1-x))$.
We have that $\mathit{score}_{2,1}(\boldsymbol{x}^{1}_{i},\boldsymbol{x}^{1}_{j})=0$ if $\frac{1}{l}\cdot j=1$ where $\frac{1}{l}$ is the fifth dimension of $\boldsymbol{x}^{1}_{i}$ and otherwise $\mathit{score}_{2,1}(\boldsymbol{x}^{1}_{i},\boldsymbol{x}^{1}_{j})<0$. The pooling function $\mathit{pool}_{2,1}$ is specified by $W_{2,1}=[(0,1,0,0,0),(0,0,1,0,0)]$ and uses $\mathit{hmax}$ as normalisation.
The combination $\mathit{comb}_{2}$ is given by the FNN $N_{2}(x_{1},\dotsc,x_{5},y_{1},y_{2})=\mathit{relu}(x_{1})|\!|\mathit{relu}(x_{2})|\!|\mathit{relu}(y_{1})|\!|\mathit{relu}(x_{2}-y_{2}-1)|\!|\mathit{relu}(x_{4})$. Given a position $\boldsymbol{x}^{1}_{i}$, the attention head $\mathit{att}_{2,1}$ attends to the position $j$, where $\frac{1}{l}\cdot j=1$.
Relying on our arguments regarding the computation of $l_{1}$, this is the position $j$ satisfying $\max_{j}(\sum_{h=0}^{j}h\leq i-1)$. However, this $j$ is equal to the row index $r(i)$ of the decomposition of $i$ based on the inversion of Cantor's pairing function. Thus, we have that $r(i)=j$. Furthermore, we have that $c(i)=(i-1)-(\sum_{h=0}^{j}h)$, which is computed by $\mathit{relu}(x_{2}-y_{2}-1)$ in the combination function $\mathit{comb}_{2}$. Overall, we see that $l_{2}(l_{1}(\mathit{emb}(w)))$ gives the desired result. ∎
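The decomposition used in this proof can be sketched directly in a few lines of Python. This is only an illustration of the two formulas $r(i)=\max\{j \mid \sum_{h=0}^{j}h\leq i-1\}$ and $c(i)=(i-1)-\sum_{h=0}^{r(i)}h$, not of the attention layers themselves. The function names `triangular` and `decompose` are hypothetical, and the sketch assumes that $j$ ranges over all natural numbers including $0$.

```python
from typing import Tuple


def triangular(j: int) -> int:
    """Return sum_{h=0}^{j} h, the j-th triangular number."""
    return j * (j + 1) // 2


def decompose(i: int) -> Tuple[int, int]:
    """Split a 1-based position i into (r(i), c(i)).

    Assumption: j ranges over 0, 1, 2, ... (hypothetical reading of the proof).
    """
    if i < 1:
        raise ValueError("positions are 1-based")
    j = 0
    # Advance j while the next triangular number still fits below i - 1.
    while triangular(j + 1) <= i - 1:
        j += 1
    return j, (i - 1) - triangular(j)


# Positions 1..7 laid out row by row; row r holds r + 1 tiles.
for i in range(1, 8):
    print(i, decompose(i))
```

Under this reading, a position with $c(i)=r(i)$ is the last one of its row, which is the case distinction used later for the horizontal matching condition.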

Proof of Lemma 2.

Let $f$ be as stated in the lemma. By definition of $\mathcal{T}_{\mathit{udec}}$, the scoring function of $\mathit{att}_{f}$ is of the form $N(\langle Q\boldsymbol{x}_{i},K\boldsymbol{x}_{j}\rangle)$ and the normalisation is $\mathit{hmax}$. Let $Q=[(a_{1},\dotsc,a_{k}),(b,0,\dotsc,0),(1,0,\dotsc,0)]$, $K=[(1,0,\dotsc,0),(1,0,\dotsc,0),(0,-1,0,\dotsc,0)]$ and $N$ be the minimal FNN computing $N(x)=-\mathit{relu}(N_{|\cdot|}(x))=-|x|$ where $N_{|\cdot|}$ is given by Lemma 4. Overall, this ensures that the scoring is given by $\mathit{score}(\boldsymbol{x}_{i},\boldsymbol{x}_{j})=-|f(\boldsymbol{x}_{i})-j|$.
Then, the statement of the lemma follows from the fact that $\mathit{hmax}$ attends to the maximum, which is $0$ given this scoring, and that $j\in\mathbb{N}$ is unique for each $\boldsymbol{x}_{j}$. ∎
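The scoring construction of this proof can be checked numerically with a small sketch. It assumes, as the shape of $Q$ suggests, that $f$ is affine with $f(\boldsymbol{x})=\langle a,\boldsymbol{x}\rangle+b$, that every vector has first component $1$ and second component equal to its position, and that $N_{|\cdot|}(x)=\mathit{relu}(x)+\mathit{relu}(-x)$. The helper names `make_score` and `attended` are hypothetical.

```python
def relu(v):
    return max(0.0, v)


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def matvec(M, x):
    return [dot(row, x) for row in M]


def make_score(a, b):
    """Build score(x_i, x_j) = N(<Q x_i, K x_j>) with Q, K as in the proof."""
    k = len(a)
    Q = [list(a), [b] + [0] * (k - 1), [1] + [0] * (k - 1)]
    K = [[1] + [0] * (k - 1), [1] + [0] * (k - 1), [0, -1] + [0] * (k - 2)]

    def score(x_i, x_j):
        s = dot(matvec(Q, x_i), matvec(K, x_j))  # equals f(x_i) - j
        return -relu(relu(s) + relu(-s))         # equals -|f(x_i) - j|

    return score


def attended(score, xs, i):
    """1-based positions with maximal score for query position i (hardmax)."""
    scores = [score(xs[i - 1], x_j) for x_j in xs]
    best = max(scores)
    return [j + 1 for j, s in enumerate(scores) if s == best]


# Vectors (1, i, r(i), c(i), k_i) for positions 1..6; the tile codes are arbitrary.
xs = [(1, 1, 0, 0, 1), (1, 2, 1, 0, 1), (1, 3, 1, 1, 1),
      (1, 4, 2, 0, 1), (1, 5, 2, 1, 1), (1, 6, 2, 2, 1)]
nxt = make_score((0, 1, 0, 0, 0), 1)    # next(x) = x_2 + 1
prev = make_score((0, 1, 0, 0, 0), -1)  # prev(x) = x_2 - 1
step = make_score((0, 1, 1, 0, 0), 1)   # step(x) = x_2 + x_3 + 1
print(attended(nxt, xs, 2), attended(prev, xs, 1), attended(step, xs, 2))
```

When $f(\boldsymbol{x}_{i})$ falls outside the sequence, the maximal score is negative and the head selects the closest existing position, which is the first or the last one in the three cases above.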

Lemma 6.

There is an attention head $\mathit{att}_{\leq}$ in $\mathcal{T}_{\mathit{udec}}$ such that for all sequences $\boldsymbol{x}_{1},\dotsc,\boldsymbol{x}_{m}$ where all $\boldsymbol{x}_{i}=(1,i,\boldsymbol{y}_{i})$ the head $\mathit{att}_{\leq}$ attends to $\{\boldsymbol{x}_{1},\dotsc,\boldsymbol{x}_{i}\}$ given $i$.

Proof.

By definition of $\mathcal{T}_{\mathit{udec}}$, the scoring function of $\mathit{att}_{\leq}$ is of the form $N(\langle Q\boldsymbol{x}_{i},K\boldsymbol{x}_{j}\rangle)$ and the normalisation is $\mathit{hmax}$. Let $Q=[(0,1,0,\dotsc,0),(1,0,\dotsc,0)]$ and let $K=[(1,0,\dotsc,0),(0,-1,0,\dotsc,0)]$. Furthermore, let $N(x)=-\mathit{relu}(x)$. We observe that $N$ outputs $0$ if $j\leq i$ and otherwise $N(x)<0$. In combination with $\mathit{hmax}$, this ensures that $\mathit{att}_{\leq}$ behaves as stated by the lemma. ∎

Proof of Theorem 1.

We prove the statement via reduction from $\textsc{OTWP}^{*}$. Let $\mathcal{S}=(S,H,V,t_{I},t_{F})$ be an instance of $\textsc{OTWP}^{*}$ with $|S|=k$. W.l.o.g. we assume that $S\subseteq\mathbb{N}$. Let $T_{\mathcal{S}}\in\mathcal{T}_{\mathit{udec}}$ be built in the following way. $T_{\mathcal{S}}$ uses the embedding $\mathit{emb}$ of transformers in $\mathcal{T}_{\mathit{udec}}$ specified at the beginning of Section 4. Furthermore, it has four layers. Layers $l_{1}$, $l_{2}$ are as in Lemma 1.
Layer $l_{3}$ is given by $l_{3}=(\mathit{att}_{\mathit{prev}},\mathit{att}_{\mathit{next}},\mathit{att}_{\mathit{step}},\mathit{comb}_{3})$ where $\mathit{att}_{\mathit{prev}}$, $\mathit{att}_{\mathit{next}}$ and $\mathit{att}_{\mathit{step}}$ are given by Lemma 2 with $\mathit{prev}(x_{1},\dotsc,x_{5})=x_{2}-1$, $\mathit{next}(x_{1},\dotsc,x_{5})=x_{2}+1$ and $\mathit{step}(x_{1},\dotsc,x_{5})=x_{2}+x_{3}+1$. We assume that all three attention heads use the identity matrix as the linear map in their respective pooling function.
$\mathit{comb}_{3}$ is given by an FNN $N_{3}$ computing a function $\mathbb{R}^{4\cdot 5}\rightarrow\mathbb{R}^{7}$. Let the input dimensions of $N_{3}$ be $x_{1,1},\dotsc,x_{1,5},x_{2,1},\dotsc,x_{4,5}$. Then, $N_{3}$ is equal to

$$\mathit{relu}(x_{1,1})|\!|\mathit{relu}(x_{1,2})|\!|N_{a}|\!|N_{b_{1}}|\!|N_{b_{2}}|\!|N_{c}|\!|N_{d}$$

where $N_{a}=N_{\rightarrow}\circ(N^{x_{1,2},x_{3,2}}_{=}|\!|N_{=}^{x_{1,3},x_{1,4}})$, $N_{b_{1}}=N_{\rightarrow}\circ(N^{x_{1,2},x_{2,2}}_{=}|\!|N_{=t_{I}}^{x_{1,5}})$, $N_{b_{2}}=N_{\rightarrow}\circ(N^{x_{1,2},x_{3,2}}_{=}|\!|N_{=t_{F}}^{x_{1,5}})$,
$N_{c}=N_{\rightarrow}\circ(N^{x_{1,4},x_{1,3}}_{<}|\!|N^{x_{1,5},x_{3,5}}_{H})$ and $N_{d}=N_{\rightarrow}\circ(N_{<}^{x_{1,3},x_{4,3}}|\!|N^{x_{1,5},x_{4,5}}_{V})$ using the gadgets and constructions described in Appendix B.
Layer $l_{4}$ is given by $l_{4}=(\mathit{att}_{\text{leq}},\mathit{comb}_{4})$ where $\mathit{att}_{\text{leq}}$ attends to $\{\boldsymbol{x}_{1},\dotsc,\boldsymbol{x}_{i}\}$ given $i$ and $\mathit{comb}_{4}$ is given by the minimal FNN $N_{4}$ computing $\mathit{relu}(x_{3}+\dotsb+x_{7})$. A formal proof for the existence of $\mathit{att}_{\text{leq}}$ in $\mathcal{T}_{\mathit{udec}}$ is given in Lemma 6. Furthermore, the output function $\mathit{out}$ of $T_{\mathcal{S}}$ is given by the minimal FNN $N_{\mathit{out}}$ computing $N_{\mathit{out}}(x_{1})=\mathit{relu}(1-x_{1})$.
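The aggregation performed by layer $l_{4}$ can be illustrated with a minimal sketch. It rests on several assumptions that the text does not spell out: that $\mathit{hmax}$ pooling averages the attended vectors uniformly (consistent with layer $l_{1}$ computing $\frac{1}{l}$), that the pooling matrix of $\mathit{att}_{\text{leq}}$ is the identity, and that $N_{4}$ reads components three to seven of the pooled vector. The names `prefix_mean` and `layer4_value` are hypothetical.

```python
def relu(v):
    return max(0.0, v)


def prefix_mean(xs, i):
    """Componentwise average of x_1..x_i (assumed behaviour of hmax pooling)."""
    prefix = xs[:i]
    return [sum(col) / len(prefix) for col in zip(*prefix)]


def layer4_value(xs, i):
    """relu(x_3 + ... + x_7) on the pooled vector at position i (assumed wiring)."""
    pooled = prefix_mean(xs, i)
    return relu(sum(pooled[2:7]))


# Seven-dimensional vectors (1, i, e_a, e_b1, e_b2, e_c, e_d) with non-negative
# error components, as produced by N_3.
xs_ok = [[1, 1, 0, 0, 0, 0, 0], [1, 2, 0, 0, 0, 0, 0], [1, 3, 0, 0, 0, 0, 0]]
xs_bad = [[1, 1, 0, 0, 0, 0, 0], [1, 2, 0, 0, 0, 1, 0], [1, 3, 0, 0, 0, 0, 0]]
print(layer4_value(xs_ok, 3), layer4_value(xs_bad, 3))
```

Because all error components are non-negative, the value at the last position is zero exactly when every error component of every position is zero, regardless of whether the pooling sums or averages.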

Let $w=t_{1}\dotsb t_{m}\in S^{*}$ be some word over alphabet $S$. As defined above, we have that $\mathit{emb}(t_{i},i)=(1,i,\sum_{j=0}^{i}j,k_{i})$ where $k_{i}\in\{1,\dotsc,|S|\}$. Consider $\boldsymbol{x}^{2}_{1}\dotsb\boldsymbol{x}^{2}_{m}$, namely the sequence of vectors after propagating $w$ through the embedding $\mathit{emb}$ and layers $l_{1},l_{2}$ of $T_{\mathcal{S}}$.
As stated by Lemma 1, we have that $\boldsymbol{x}^{2}_{i}=(1,i,r(i),c(i),k_{i})$ where $r(i)$ and $c(i)$ are the row and column, respectively, of tile $t_{i}$ if we interpret $w$ as an encoded tiling. Note that all vectors $\boldsymbol{x}^{3}_{i}$ are non-negative due to the way $N_{3}$ is built. In the following, we argue that all $\boldsymbol{x}^{3}_{i}$ are $\boldsymbol{0}$ in every dimension except the first and second if and only if $w$ is a valid encoded tiling. Given this equivalence, the statement of the theorem follows immediately as $l_{4}$ simply sums up all vectors and dimensions (except for the first and second) of $\boldsymbol{x}^{3}_{1},\dotsc,\boldsymbol{x}^{3}_{m}$ in $\boldsymbol{x}^{4}_{m}$ and the output of $N_{4}$ indicates whether there was some non-zero value.
We fix some arbitrary $\boldsymbol{x}^{2}_{i}=(1,i,r(i),c(i),k_{i})$. Then, $\boldsymbol{x}^{3}_{i}=N_{3}(\boldsymbol{x}^{2}_{i},\boldsymbol{x}^{2}_{i_{\mathit{prev}}},\boldsymbol{x}^{2}_{i_{\mathit{next}}},\boldsymbol{x}^{2}_{i_{\mathit{step}}})$ where $i_{\mathit{next}}=i+1$ if $i<m$ and $m$ otherwise, $i_{\mathit{prev}}=i-1$ if $i>1$ and $1$ otherwise, and $i_{\mathit{step}}=i+r(i)+1$ if $i<m-r(i)-1$ and $m$ otherwise.

Consider property $(a)$ and subnetwork $N_{a}$. With the understanding gained in Appendix B, $N^{x_{1,2},x_{3,2}}_{=}$ outputs $0$ iff $x_{1,2}=x_{3,2}$. These dimensions correspond to positions $i$ and $i_{\mathit{next}}$, which are only equal if $i=m$ (Lemma 2). Furthermore, the property of $N_{\rightarrow}$ stated by Lemma 4 holds, since the output of $N_{=}$ is guaranteed to be in $[0;1]$ and the values of $x_{1,2}$ and $x_{3,2}$ are guaranteed to be in $\mathbb{N}$. In summary, this ensures that the third dimension of $\boldsymbol{x}^{3}_{m}$ is $0$ iff $r(m)=c(m)$.
For other positions the third dimension is always $0$ since $N_{\rightarrow}$ outputs $0$ in these cases due to the fact that $N^{x_{1,2},x_{3,2}}_{=}$ equals $1$. Analogously, $N_{b_{1}}$ and $N_{b_{2}}$ ensure that $t_{1}=t_{I}$ and $t_{m}=t_{F}$, and thus property (b), hold iff the fourth and fifth dimensions in all positions are equal to $0$. Consider properties (c) and (d) described above and assume that property (a) holds. These two properties are non-local in the sense that they depend on at least two positions in $\boldsymbol{x}^{2}_{1}\dotsb\boldsymbol{x}^{2}_{m}$. Consider the subnet $N_{c}$.
By construction and the gadgets described in Appendix B, we have that $N_{c}$ outputs $0$ if $c(i)<r(i)$ and $(t_{i},t_{i+1})\in H$ or if $c(i)=r(i)$, which means that tile $t_{i}$ is rightmost in its corresponding row. Otherwise the value computed by $N_{c}$ is greater than $0$. Analogously, subnet $N_{d}$ checks whether vertically stacked tiles match. In summary, this ensures that the sixth and seventh dimension of each $\boldsymbol{x}^{3}_{i}$ is equal to $0$ if and only if properties (c) and (d) hold. ∎

Proof of Theorem 2.

In the same manner as in the proof of Theorem 1, we prove the statement via reduction from $\textsc{OTWP}^{*}$. The reduction is exactly the same, namely given an $\textsc{OTWP}^{*}$ instance $\mathcal{S}=(S,H,V,t_{I},t_{F})$ we build an EOT $T_{\mathcal{S}}$ which recognizes exactly those words $w$ representing a valid encoded tiling of $\mathcal{S}$. For details, see the proof of Theorem 1.

Given the correctness arguments for $T_{\mathcal{S}}$ in Theorem 1, it is left to argue that $T_{\mathcal{S}}$ works as intended, despite the fact that it works over some FA $F$ using at most $\mathcal{O}(\log(\max(|S|,n)))$ bits where $n$ is the length of an input word. We choose $F$ such that overflow situations do not occur in any computation $T_{\mathcal{S}}(w)$ and rounding is handled such that $T_{\mathcal{S}}$ works as intended. Throughout this proof, $\log$ denotes the binary logarithm. Concretely, given a word $w$ with $|w|=n$, assume that $F$ uses $m=\lfloor 4\log(\max(|S|,n))\rfloor+2$ bits and rounds values off to the nearest representable number. We denote the value resulting from rounding $x$ off in arithmetic $F$ by $\lfloor x\rfloor_{F}$. We assume that there is an extra bit that is used as a sign bit and that at least $\lfloor 3\log(n)\rfloor+1$ bits can be used to represent integer parts and at least $\lfloor\log(n)\rfloor+1$ bits can be used to represent fractional parts. Note that this is a reasonable assumption for all common FA, like fixed-point or floating-point arithmetic.
Furthermore, it is clearly the case that $m\in\mathcal{O}(\log(\max(|S|,n)))$. To ease our arguments and notation from here on, we assume w.l.o.g. that we represent $n$ using $\log(n)$ bits instead of $\lfloor\log(n)\rfloor+1$ bits.
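The bit budget assumed above can be made concrete with a small Python sketch. It is an illustration only: it assumes that $\log$ is the binary logarithm, and the helper names `floor_log2_power`, `total_bits`, `integer_bits` and `fraction_bits` are hypothetical and do not appear in the construction. The sketch evaluates $m$ and confirms that the integer and fractional parts fit into $m$ bits, with the sign bit counted separately.

```python
def floor_log2_power(x: int, k: int) -> int:
    """Return floor(k * log2(x)) exactly, using integer arithmetic only."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    # floor(log2(y)) == y.bit_length() - 1 for every positive integer y,
    # and k * log2(x) == log2(x ** k).
    return (x ** k).bit_length() - 1


def total_bits(num_tiles: int, n: int) -> int:
    """m = floor(4 * log2(max(|S|, n))) + 2 (sign bit not included)."""
    return floor_log2_power(max(num_tiles, n), 4) + 2


def integer_bits(n: int) -> int:
    """Lower bound on integer-part bits: floor(3 * log2(n)) + 1."""
    return floor_log2_power(n, 3) + 1


def fraction_bits(n: int) -> int:
    """Lower bound on fractional-part bits: floor(log2(n)) + 1."""
    return floor_log2_power(n, 1) + 1


if __name__ == "__main__":
    # Illustrative inputs: (number of tiles |S|, word length n).
    for num_tiles, n in [(10, 1024), (500, 64), (7, 100)]:
        m = total_bits(num_tiles, n)
        used = integer_bits(n) + fraction_bits(n)
        print(f"|S|={num_tiles}, n={n}: m={m}, integer+fraction bits={used}")
```

Since $\lfloor 3a\rfloor+\lfloor a\rfloor\leq\lfloor 4a\rfloor$ for every real $a\geq 0$, the two parts never exceed $m$ bits, which is what the sketch checks on examples.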

Per definition, $T_{\mathcal{S}}$ uses the embedding function $\mathit{emb}(a_{k},0)=(1,1,0,0,k)$ and $\mathit{emb}(a_{k},i)=(0,1,i,\sum_{j=0}^{i}j,k)$. First, we assume that each $k$, namely the value representing a specific tile from $S$, is a unique, positive value. This is possible as $F$ uses $m>\log(|S|)$ bits. Furthermore, we see that $\mathit{emb}$, especially the sum $\sum_{j=0}^{i}j=\frac{i(i+1)}{2}\leq i^{2}$, works as intended up to $i=n$ due to the fact that $F$ uses more than $2\log(n)$ bits to represent integer parts. Next, consider layers $l_{1}$ and $l_{2}$ of Lemma 1. Layer $l_{1}$ consists of a single attention head $\mathit{att}_{1,1}$.
Here, the only crucial part is the computation of the value $\frac{1}{l}$ in $\mathit{pool}_{1,1}$ for a position $i$. Per definition, $l$ corresponds to the number of positions $j$ such that $\sum_{h=0}^{j}h\leq i-1$. As $i$ is bounded by $n$, this inequality can only be satisfied by positions $j$ for which $j\leq\sqrt{n}$ holds. As $T_{\mathcal{S}}$ uses $\mathit{hmax}$ to count the positions for which this inequality holds, $l$ is bounded by $\sqrt{n}$. Next, we observe that $\lfloor\frac{1}{l}\rfloor_{F}=\frac{\lfloor 2^{\log(n)}\frac{1}{l}\rfloor}{2^{\log(n)}}=\frac{\lfloor\frac{n}{l}\rfloor}{n}$, namely the general understanding of rounding off where we use $\log(n)$ bits to represent fractions.
However, this gives for all $1\leq l_{1}<l_{2}\leq\sqrt{n}$ that $\lfloor\frac{1}{l_{1}}\rfloor_{F}\neq\lfloor\frac{1}{l_{2}}\rfloor_{F}$, as $\lfloor\frac{n}{l_{1}}\rfloor\neq\lfloor\frac{n}{l_{2}}\rfloor$ holds for all $l_{1}<l_{2}\leq\sqrt{n}$: we have $l_{1}l_{2}<n$ and therefore $\frac{n}{l_{1}}-\frac{n}{l_{2}}=\frac{n(l_{2}-l_{1})}{l_{1}l_{2}}>1$. This means that $F$ ensures that $\frac{1}{l}$ is uniquely representable.
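The uniqueness claim can be checked numerically. The following Python sketch is a sanity check under stated assumptions, not part of the proof: it models $\lfloor\cdot\rfloor_{F}$ as rounding down to $\log(n)$ fractional bits, takes $n$ to be a power of two so that $2^{\log(n)}=n$, and uses the hypothetical helper names `round_down` and `reciprocals_distinct`.

```python
from fractions import Fraction
from math import floor, isqrt


def round_down(x: Fraction, frac_bits: int) -> Fraction:
    """Round x down to the nearest multiple of 2**(-frac_bits)."""
    scale = 2 ** frac_bits
    return Fraction(floor(x * scale), scale)


def reciprocals_distinct(n: int) -> bool:
    """Check that the rounded values of 1/l are pairwise distinct for 1 <= l <= sqrt(n)."""
    frac_bits = n.bit_length() - 1  # equals log2(n) when n is a power of two
    values = [round_down(Fraction(1, l), frac_bits) for l in range(1, isqrt(n) + 1)]
    return len(set(values)) == len(values)


if __name__ == "__main__":
    for k in range(1, 13):
        print(f"n=2^{k}: distinct={reciprocals_distinct(2 ** k)}")
    # Beyond sqrt(n) the rounded reciprocals may collide, e.g. n = 16, l = 6 and l = 7.
    print(round_down(Fraction(1, 6), 4) == round_down(Fraction(1, 7), 4))
```

The bound $l\leq\sqrt{n}$ matters here: for $n=16$ the values $l=6$ and $l=7$ both round down to $\frac{2}{16}$, so distinctness is not guaranteed for larger $l$.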

Next, the only crucial part in $l_{2}$ is the computation of the product $\frac{1}{l}\cdot j$, which is used to determine the position $j$ for which $\frac{1}{l}\cdot j=1$ in $\mathit{score}_{2,1}$, which is obviously given by position $l$. This equality is no longer guaranteed to hold if we consider $\lfloor\frac{1}{l}\rfloor_{F}\cdot j$. However, due to the monotonicity of $\lfloor\frac{1}{l}\rfloor_{F}$ for $l\leq\sqrt{n}$ and the fact that the maximum rounding error is given by $\frac{1}{2^{\log(n)}}$, we have that $j=l$ produces the value closest to $1$ in the product $\frac{1}{l}\cdot j$. Taking a look at $\mathit{score}_{2,1}$, this ensures that $l$ is still the position that $\mathit{att}_{2,1}$ attends to. Therefore, the statement of Lemma 1 is still valid for $T_{\mathcal{S}}$ working over $F$.
We observe that all values of some vector $\boldsymbol{x}^{2}_{j}$ after layer $l_{2}$ are positive integers whose magnitude is bounded by $n^{2}$.

Now, consider layers $l_{3}$ and $l_{4}$. From the proof of Theorem 1 we see that the gadgets at most sum up two values or compute a fraction of the form $\frac{i+j}{2}$ and $\frac{i-j}{2}$ (in gadgets $N_{H}$ or $N_{V}$). Both can safely be done with at least $3\log(n)$ bits for integer and $\log(n)$ bits for fractional parts, as all previously computed values, up to layer $l_{2}$, in a computation of $T_{\mathcal{S}}(w)$ are representable using $2\log(n)$ bits. We observe that the values of the third to seventh dimension of some $\boldsymbol{x}^{3}_{j}$ are either $0$ or $1$. This is due to the fact that all values after layer $l_{2}$ are guaranteed to be integers. Next, consider layer $l_{4}$.
The computation done by $\mathit{att}_{\leq}$ is safe (see Lemma 6) and the crucial step here is the computation of $\mathit{comb}_{4}$ given by $\mathit{relu}(x_{3}+\dotsb+x_{7})$. The values $x_{i}$ are all of the form $\frac{i}{j}$ where $i$ is guaranteed to be $0$ or $1$ and $j$ is the normalisation induced by $\mathit{att}_{\leq}$ from the perspective of position $j$. However, this means $j$ is bounded by $n$ and, thus, $\lfloor\frac{i}{j}\rfloor_{F}>0$ if and only if $i=1$ for all $j$ due to the fact that $F$ allows for $\log(n)$ bits to represent fractional parts. Finally, $\mathit{out}$ is trivially computable in $F$, which finishes the proof. ∎
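The last rounding step of the proof above can also be checked on small instances. The Python sketch below rests on the same assumptions as the argument itself: $n$ is a power of two, $F$ rounds down to $\log(n)$ fractional bits, and the helper names `round_down` and `positive_iff_numerator_is_one` are hypothetical. It checks that a rounded value $\lfloor\frac{i}{j}\rfloor_{F}$ with $i\in\{0,1\}$ and $1\leq j\leq n$ is positive exactly when $i=1$.

```python
from fractions import Fraction
from math import floor


def round_down(x: Fraction, frac_bits: int) -> Fraction:
    """Round x down to the nearest multiple of 2**(-frac_bits)."""
    scale = 2 ** frac_bits
    return Fraction(floor(x * scale), scale)


def positive_iff_numerator_is_one(n: int) -> bool:
    """Check: rounded i/j is positive exactly when i == 1, for all 1 <= j <= n."""
    frac_bits = n.bit_length() - 1  # equals log2(n) when n is a power of two
    return all(
        (round_down(Fraction(i, j), frac_bits) > 0) == (i == 1)
        for i in (0, 1)
        for j in range(1, n + 1)
    )


if __name__ == "__main__":
    for k in range(0, 11):
        print(f"n=2^{k}: {positive_iff_numerator_is_one(2 ** k)}")
    # With j = n + 1 the value 1/j would already be rounded down to zero.
    print(round_down(Fraction(1, 1025), 10))
```

The bound $j\leq n$ is tight in this model: with $\log(n)$ fractional bits, $\frac{1}{n+1}$ is rounded down to $0$.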

Appendix C Proofs of Section 5

Proof of Theorem 3.

The decidability and membership results of statements (1) and (2) are sufficiently argued in the proof sketch given in Section 5.

To prove the hardness results of statements (1) and (2), we establish reductions from $\textsc{OTWP}_{\mathsf{un}}$ and $\textsc{OTWP}_{\mathsf{bin}}$, respectively: given some bounded word-tiling instance $(\mathcal{S},n)$ we build an instance $(T_{\mathcal{S}},n)$ of $\textsc{bSat}_{\mathsf{un}}$ or $\textsc{bSat}_{\mathsf{bin}}$, respectively, where $T_{\mathcal{S}}$ is built as described in Theorem 1. The only missing argument is that these reductions are polynomial. In particular, this means that $T_{\mathcal{S}}$ must be built in polynomial time regarding the size of $(\mathcal{S},n)$. Therefore, we recall the proof of Theorem 1.

First, we see that the embedding function $\mathit{emb}$ and the number of layers of $T_{\mathcal{S}}$ are independent of $\mathcal{S}$ and $n$. The first two layers $l_{1}$ and $l_{2}$ of $T_{\mathcal{S}}$ are specified in Lemma 1. Recalling the proof of Lemma 1, we see that $l_{1}$ and $l_{2}$ each consist of a single attention head, whose internal parameters like scoring, pooling or combination are independent of $(\mathcal{S},n)$ as well. Next, consider layer $l_{3}$. This layer consists of three attention heads $\mathit{att}_{\mathit{prev}}$, $\mathit{att}_{\mathit{next}}$ and $\mathit{att}_{\mathit{step}}$, each given by the template described in Lemma 2, which again is independent of $(\mathcal{S},n)$. Additionally, $l_{3}$ contains the combination function $\mathit{comb}_{3}$.
This combination function is represented by an FNN $N_{3}$, using smaller FNN $N_{a}$, $N_{b_{1}}$, $N_{b_{2}}$, $N_{c}$ and $N_{d}$ as building blocks. These are dependent on $\mathcal{S}$, as they are built using gadgets $N_{=t_{I}}$, $N_{=t_{F}}$, $N_{H}$ and $N_{V}$ where $t_{I}$, $t_{F}$, $H$ and $V$ are components of $\mathcal{S}$. However, in the proof of Lemma 5 we see that these gadgets are at most polynomial in their respective parameter. Layer $l_{4}$ and the output function, specified by FNN $N_{\mathit{out}}$, are again independent of $(\mathcal{S},n)$.
In summary, the EOT $T_{\mathcal{S}}$ is polynomial in $(\mathcal{S},n)$, which makes the reductions from $\textsc{OTWP}^{\mathsf{exp}}$ and $\textsc{OTWP}^{\mathsf{poly}}$ polynomial. ∎

Next, we address the proof of Lemma 3. We first need a preliminary, rather technical result. Let $T$ be an EOT and $w\in\Sigma^{+}$ be a word and consider the computation $T(w)$. Let $X_{T(w)}^{0}=\mathit{emb}(w)$ and $X_{T(w)}^{i}$ be the sequence of vectors occurring after the computation of layer $l_{i}$ of $T$. Let $\boldsymbol{x}$ and $\boldsymbol{x}'$ be two vectors matching the dimensionality of $\mathit{score}_{i,j}$ of $T$.
Overloading some notation, let $N_{w}(\boldsymbol{x},\boldsymbol{x}',i,j)=\mathit{norm}_{i,j}(\mathit{score}_{i,j}(\boldsymbol{x},\boldsymbol{x}'),\mathit{score}_{i,j}(\boldsymbol{x},X_{T(w)}^{i-1}))$ where $\mathit{score}_{i,j}(\boldsymbol{x},X_{T(w)}^{i-1})$ is the vector of all scorings of $\boldsymbol{x}$ with sequence $X_{T(w)}^{i-1}$. We remark that $\boldsymbol{x}$ and $\boldsymbol{x}'$ need not occur in $X_{T(w)}^{i-1}$ for this to be well defined.
Again overloading some notation, let $P_{w}(\boldsymbol{x},i,j)=\mathit{pool}_{i,j}(X_{T(w)}^{i-1},\mathit{score}_{i,j}(\boldsymbol{x},X_{T(w)}^{i-1}))$.

Lemma 7.

Let $T$ be an additive-periodical EOT of depth $L$, maximum width $H$ and periodicity $p$ with $\mathit{norm}_{i,j}\in\{\mathit{smax},\mathit{hmax}\}$ for all $i\leq L,j\leq H$, let $w=u_{1}u_{j_{1}}\dotsb u_{j_{h}}u_{2}\in\Sigma^{+}$ where $u_{1},u_{2}\in\Sigma^{+}$, all $u_{j_{i}}\in\Sigma^{p}$ and all $u_{j_{i}}$ also occur in $u_{1}$ or $u_{2}$, and let $\mathcal{X}$ be the set of all vectors occurring in any of the sequences $X^{i}_{T(w)}$.
If there are indexes $h_{1}<h_{2}\leq h$ such that for all $\boldsymbol{x},\boldsymbol{x}'\in\mathcal{X}$, $i\leq L$, $j\leq H$ it holds that $N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}}(\boldsymbol{x},\boldsymbol{x}',i,j)=N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{2}}}}(\boldsymbol{x},\boldsymbol{x}',i,j)$ and $P_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}}(\boldsymbol{x},i,j)=P_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{2}}}}(\boldsymbol{x},i,j)$, then it holds that $N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}',i,j)=N_{u_{1}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}',i,j)$ and $P_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2}}(\boldsymbol{x},i,j)=P_{u_{1}\dotsb u_{2}}(\boldsymbol{x},i,j)$.

Proof.

Let $T$, $w$, $\mathcal{X}$, $h_{1}$ and $h_{2}$ be as stated above. We prove the statement via induction on the layers $l_{i}$. First, consider layer $l_{1}$ and fix some tuple $(\boldsymbol{x},\boldsymbol{x}',1,j)$. We first show that $N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}',1,j)=N_{u_{1}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}',1,j)$.
Assume that 𝑛𝑜𝑟𝑚1,jsubscript𝑛𝑜𝑟𝑚1𝑗\mathit{norm}_{1,j}italic_norm start_POSTSUBSCRIPT 1 , italic_j end_POSTSUBSCRIPT is given by 𝑠𝑚𝑎𝑥𝑠𝑚𝑎𝑥\mathit{smax}italic_smax. Then, 𝑛𝑜𝑟𝑚1,jsubscript𝑛𝑜𝑟𝑚1𝑗\mathit{norm}_{1,j}italic_norm start_POSTSUBSCRIPT 1 , italic_j end_POSTSUBSCRIPT computes e𝑠𝑐𝑜𝑟𝑒1,j(𝒙,𝒙)𝑠𝑐𝑜𝑟𝑒1,j(𝒙,XT(w)0)esisuperscript𝑒subscript𝑠𝑐𝑜𝑟𝑒1𝑗𝒙superscript𝒙subscriptsubscript𝑠𝑐𝑜𝑟𝑒1𝑗𝒙superscriptsubscript𝑋𝑇superscript𝑤0superscript𝑒subscript𝑠superscript𝑖\frac{e^{\mathit{score}_{1,j}(\boldsymbol{x},\boldsymbol{x}^{\prime})}}{\sum_{% \mathit{score}_{1,j}(\boldsymbol{x},X_{T(w^{\prime})}^{0})}e^{s_{i^{\prime}}}}divide start_ARG italic_e start_POSTSUPERSCRIPT italic_score start_POSTSUBSCRIPT 1 , italic_j end_POSTSUBSCRIPT ( bold_italic_x , bold_italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) end_POSTSUPERSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_score start_POSTSUBSCRIPT 1 , italic_j end_POSTSUBSCRIPT ( bold_italic_x , italic_X start_POSTSUBSCRIPT italic_T ( italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_i start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT end_POSTSUPERSCRIPT end_ARG for all words wsuperscript𝑤w^{\prime}italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. 
Obviously, the numerator in Nu1uh1uh2+1u2(𝒙,𝒙,1,j)subscript𝑁subscript𝑢1subscript𝑢subscript1subscript𝑢subscript21subscript𝑢2𝒙superscript𝒙1𝑗N_{u_{1}\dotsb u_{h_{1}}u_{h_{2}+1}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}% ^{\prime},1,j)italic_N start_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋯ italic_u start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + 1 end_POSTSUBSCRIPT ⋯ italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_x , bold_italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , 1 , italic_j ) and Nu1u2(𝒙,𝒙,1,j)subscript𝑁subscript𝑢1subscript𝑢2𝒙superscript𝒙1𝑗N_{u_{1}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}^{\prime},1,j)italic_N start_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋯ italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_x , bold_italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , 1 , italic_j ) is equal. By definition, we have that 𝑠𝑐𝑜𝑟𝑒i,jsubscript𝑠𝑐𝑜𝑟𝑒𝑖𝑗\mathit{score}_{i,j}italic_score start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT is local in the sense that it compares vectors pairwise, producing the different scoring values sisubscript𝑠superscript𝑖s_{i^{\prime}}italic_s start_POSTSUBSCRIPT italic_i start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT independent of the overall word. 
Furthermore, due to the fact that 𝑒𝑚𝑏𝑒𝑚𝑏\mathit{emb}italic_emb is additive-periodical, we have XT(u1uj1ujh1ujh2+1u2)0subscriptsuperscript𝑋0𝑇subscript𝑢1subscript𝑢subscript𝑗1subscript𝑢subscript𝑗subscript1subscript𝑢subscript𝑗subscript21subscript𝑢2X^{0}_{T(u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2})}italic_X start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_T ( italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT italic_j start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ⋯ italic_u start_POSTSUBSCRIPT italic_j start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT italic_j start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ⋯ italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT and XT(u1u2)0subscriptsuperscript𝑋0𝑇subscript𝑢1subscript𝑢2X^{0}_{T(u_{1}\dotsb u_{2})}italic_X start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_T ( italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋯ italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT are equal in the sense that the vectors corresponding to ujh2+1u2subscript𝑢subscript𝑗subscript21subscript𝑢2u_{j_{h_{2}+1}}\dotsb u_{2}italic_u start_POSTSUBSCRIPT italic_j start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ⋯ italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are equal. We refer to this property (*) later on. 
Using these observations and that Nu1uj1ujh1(𝒙,𝒙,1,j)=Nu1uj1ujh2(𝒙,𝒙,1,j)subscript𝑁subscript𝑢1subscript𝑢subscript𝑗1subscript𝑢subscript𝑗subscript1𝒙superscript𝒙1𝑗subscript𝑁subscript𝑢1subscript𝑢subscript𝑗1subscript𝑢subscript𝑗subscript2𝒙superscript𝒙1𝑗N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}}(\boldsymbol{x},\boldsymbol{x}^{\prime},% 1,j)=N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{2}}}}(\boldsymbol{x},\boldsymbol{x}^{% \prime},1,j)italic_N start_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT italic_j start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ⋯ italic_u start_POSTSUBSCRIPT italic_j start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_x , bold_italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , 1 , italic_j ) = italic_N start_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT italic_j start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ⋯ italic_u start_POSTSUBSCRIPT italic_j start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( bold_italic_x , bold_italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , 1 , italic_j ), we have that the denominator is equal as well. Now, assume that 𝑛𝑜𝑟𝑚1,jsubscript𝑛𝑜𝑟𝑚1𝑗\mathit{norm}_{1,j}italic_norm start_POSTSUBSCRIPT 1 , italic_j end_POSTSUBSCRIPT is given by ℎ𝑚𝑎𝑥ℎ𝑚𝑎𝑥\mathit{hmax}italic_hmax. 
Then, for any word $w'$, $\mathit{norm}_{1,j}$ computes $\frac{f(\mathit{score}_{1,j}(\boldsymbol{x}, \boldsymbol{x}'), \mathit{score}_{1,j}(\boldsymbol{x}, X^{0}_{T(w')}))}{\sum_{\mathit{score}_{1,j}(\boldsymbol{x}, X^{0}_{T(w')})} f(s_{i'}, \mathit{score}_{1,j}(\boldsymbol{x}, X^{0}_{T(w')}))}$, where $f(s, S) = 1$ if $s$ is maximal in $S$ and $0$ otherwise.
In contrast to $\mathit{smax}$, the values of $f(\dotsb)$ depend on the overall context, namely the vector of all scorings $\mathit{score}_{1,j}(\boldsymbol{x}, X^{0}_{T(w')})$. Compare $X^{0}_{T(u_1 u_{j_1} \dotsb u_{j_{h_1}} u_{j_{h_2+1}} \dotsb u_2)}$ and $X^{0}_{T(u_1 \dotsb u_2)}$, both given by the additive-periodical embedding $\mathit{emb}$.
By assumption, each block $u_{j_i}$ also occurs in $u_1$ or $u_2$. In particular, this means that every vector that occurs in $\mathit{emb}(u_1 \dotsb u_2)$ also occurs in $\mathit{emb}(u_1 u_{j_1} \dotsb u_{j_{h_1}} u_{j_{h_2+1}} \dotsb u_2)$ and vice versa.
This implies that $f(\mathit{score}_{1,j}(\boldsymbol{x}, \boldsymbol{x}'), \mathit{score}_{1,j}(\boldsymbol{x}, X^{0}_{T(u_1 u_{j_1} \dotsb u_{j_{h_1}} u_{j_{h_2+1}} \dotsb u_2)})) = f(\mathit{score}_{1,j}(\boldsymbol{x}, \boldsymbol{x}'), \mathit{score}_{1,j}(\boldsymbol{x}, X^{0}_{T(u_1 \dotsb u_2)}))$ for any scoring value $\mathit{score}_{1,j}(\boldsymbol{x}, \boldsymbol{x}')$.
In combination with the assumption that $N_{u_1 u_{j_1} \dotsb u_{j_{h_1}}}(\boldsymbol{x}, \boldsymbol{x}', 1, j) = N_{u_1 u_{j_1} \dotsb u_{j_{h_2}}}(\boldsymbol{x}, \boldsymbol{x}', 1, j)$ and the observations above, we also get $N_{u_1 u_{j_1} \dotsb u_{j_{h_1}} u_{j_{h_2+1}} \dotsb u_2}(\boldsymbol{x}, \boldsymbol{x}', 1, j) = N_{u_1 \dotsb u_2}(\boldsymbol{x}, \boldsymbol{x}', 1, j)$ in the $\mathit{hmax}$ case.
Next, consider the pooling functions. By definition, $\mathit{pool}_{1,j}(X^{0}_{T(w')}, \mathit{score}_{1,j}(\boldsymbol{x}, X^{0}_{T(w')}))$ computes $\sum_{X^{0}_{T(w')}} \mathit{norm}_{1,j}(\boldsymbol{x}, \boldsymbol{x}_{i'}, \mathit{score}_{1,j}(\boldsymbol{x}, X^{0}_{T(w')})) (W \boldsymbol{x}_{i'})$ for any word $w'$.
Our previous arguments give that $N_{u_1 u_{j_1} \dotsb u_{j_{h_1}} u_{j_{h_2+1}} \dotsb u_2}(\boldsymbol{x}, \boldsymbol{x}', 1, j) = N_{u_1 \dotsb u_2}(\boldsymbol{x}, \boldsymbol{x}', 1, j)$. In combination with $P_{u_1 u_{j_1} \dotsb u_{j_{h_1}}}(\boldsymbol{x}, i, j) = P_{u_1 u_{j_1} \dotsb u_{j_{h_2}}}(\boldsymbol{x}, i, j)$ and (*), we immediately get that $P_{u_1 u_{j_1} \dotsb u_{j_{h_1}} u_{j_{h_2+1}} \dotsb u_2}(\boldsymbol{x}, i, j) = P_{u_1 \dotsb u_2}(\boldsymbol{x}, i, j)$ holds as well.
Next, consider layer $l_i$. The arguments are exactly the same as in the base case. However, we need to rely on the induction hypothesis. Namely, we assume that all $\mathit{pool}_{i-1,j}$ produce the same output in computation $T(u_1 u_{j_1} \dotsb u_{j_{h_1}} u_{j_{h_2+1}} \dotsb u_2)$ and computation $T(u_1 \dotsb u_2)$.
This implies that all vectors present in $X^{i-1}_{T(u_1 u_{j_1} \dotsb u_{j_{h_1}} u_{j_{h_2+1}} \dotsb u_2)}$ are also present in $X^{i-1}_{T(u_1 \dotsb u_2)}$ and vice versa, and that the vectors corresponding to $u_{j_{h_2+1}} \dotsb u_2$ are equal in both computations. ∎
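
The contrast between the $\mathit{smax}$ and $\mathit{hmax}$ cases can be made concrete with a minimal sketch. It assumes toy scalar scores; the Python functions `f`, `hmax` and `smax` are hypothetical helpers named after the notation of the proof and are not part of any EOT implementation. As we read the argument, the indicator $f$ only depends on which scores are present, whereas the normalised values also depend on how often they occur, which is why the equality of the $N$ values on the two prefixes is assumed in addition.

```python
import math


def f(s, scores):
    """Indicator from the hmax definition: 1 if s is maximal in scores, else 0."""
    return 1 if s == max(scores) else 0


def hmax(s, scores):
    """Hardmax normalisation: f(s, S) divided by the number of maximal entries of S."""
    return f(s, scores) / sum(f(t, scores) for t in scores)


def smax(s, scores):
    """Softmax normalisation: e^s divided by the sum of e^t over all scores t."""
    return math.exp(s) / sum(math.exp(t) for t in scores)


# Two toy score vectors (assumed values) containing the same set of scores,
# but with different multiplicities, as happens when repeated blocks are removed.
short = [1.0, 3.0, 2.0]
long_ = short + [3.0, 1.0]

# The indicator f sees no difference between the two contexts ...
same_indicator = all(f(s, short) == f(s, long_) for s in short)

# ... but the normalised values change with the multiplicities.
hmax_values = (hmax(3.0, short), hmax(3.0, long_))
smax_values = (smax(3.0, short), smax(3.0, long_))
```

In this toy setting `same_indicator` is true, while both normalisations assign a smaller weight to the score 3.0 in the longer context.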

Proof of Lemma 3.

Let $T \in \mathcal{T}^{\textsc{fix}}_{\circ}$ be an additive-periodical EOT working over alphabet $\Sigma$, having periodicity $p$, depth $L$, maximum width $H$, maximum dimensionality $D$ and working over an FA $F$ using $b$ bits for binary encoding. We use $V$ to denote the set of values representable in the fixed arithmetic that $T$ works over. Note that $|V| \leq 2^{b}$. Let $w \in \Sigma^{+}$ be a word such that $T(w) = 1$. We observe that there is $m \in \mathbb{N}$ such that $w = u_1 \dotsb u_m u$ where $u_i \in \Sigma^{p}$ are blocks of symbols of length $p$ and $u \in \Sigma^{\leq p}$. Our goal is to prove that a not necessarily connected subsequence of at most $2^{(|T|)^{6}}$ many $p$-blocks $u_i$ from $u_1 \dotsb u_m$ is sufficient to ensure the same computation of $T$.
In the case that $pm + p \leq 2^{(|T|)^{6}}$ we are done. Therefore, assume that $m > 2^{(|T|)^{6}}$.
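
To illustrate why working with $p$-blocks is natural here, the following sketch uses a hypothetical additive-periodical embedding. It assumes that the embedding of a symbol at position $i$ is the sum of a symbol vector and a positional vector that repeats with period $p$; this form, the function `emb` and the toy vectors are assumptions made for illustration and may differ from the formal definition used in this paper. Under this assumption, removing whole $p$-blocks leaves the embeddings of the remaining suffix unchanged, and a word induces at most $|\Sigma| \cdot p$ distinct vectors.

```python
def emb(word, p, sym_vec, pos_vec):
    """Hypothetical additive-periodical embedding: symbol vector plus a
    positional vector that repeats with period p."""
    return [
        tuple(a + b for a, b in zip(sym_vec[c], pos_vec[i % p]))
        for i, c in enumerate(word)
    ]


p = 2
sym_vec = {"a": (1, 0), "b": (0, 1)}   # toy symbol vectors (assumed)
pos_vec = [(0, 0), (10, 10)]           # toy positional vectors with period p (assumed)

w = "abbaabab"                         # four p-blocks: ab, ba, ab, ab
shorter = w[:2] + w[6:]                # drop the two middle p-blocks

full = emb(w, p, sym_vec, pos_vec)
cut = emb(shorter, p, sym_vec, pos_vec)

# The vectors of the kept suffix block are identical in both embeddings.
suffix_unchanged = cut[2:] == full[6:]

# The number of distinct vectors is bounded by |Sigma| * p.
distinct_vectors = len(set(full))
```

Removing a number of symbols that is not a multiple of $p$ would shift the positional part of the suffix, which is why the argument removes entire blocks only.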

Let $U$ be the set of all unique $u_i$. We observe that $|U| \leq |\Sigma|^{p}$. Next, we fix some not necessarily connected but ordered subsequence $S = u_{j_0} u_{j_1} \dotsb u_{j_n} u_{j_{n+1}}$ of $w$ with $u_{j_0} = u_1$, $j_i \in \{2, \dotsc, m\}$ and $u_{j_{n+1}} = u$ such that each $u' \in U$ occurs exactly once. For the case that $u_1 = u$ we allow this specific block to occur twice in $S$. The assumption $m > 2^{(|T|)^{6}}$ implies that $S \neq w$.
This means that there are pairs $(u_{j_h}, u_{j_{h+1}})$ in $S$ with some non-empty sequence of $p$-blocks $u_{j'_1} \dotsb u_{j'_l}$ in between. W.l.o.g. assume that $u_{j_0}$ and $u_{j_1}$ form such a pair. Our goal is to argue that at most $2^{(|T|)^{5}}$ blocks from $u_{j'_1} \dotsb u_{j'_l}$ are needed to ensure the same computation of $T$. Given that this argument works for all $|\Sigma|^{p}$ adjacent pairs in $S$, we are done.
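
The choice of the subsequence $S$ can be sketched as follows. This is a minimal illustration under simplifying assumptions: the helper names `split_blocks` and `select_indices` are hypothetical, the remainder $u$ is taken to be strictly shorter than $p$, and for every distinct block its first occurrence is kept, although the argument above allows any fixed occurrence.

```python
def split_blocks(w, p):
    """Split w into the full p-blocks u_1, ..., u_m and a remainder u."""
    m = len(w) // p
    blocks = [w[i * p:(i + 1) * p] for i in range(m)]
    return blocks, w[m * p:]


def select_indices(blocks):
    """Indices (0-based) of u_1 and of the first occurrence of every other
    distinct block, in increasing order."""
    seen = {blocks[0]}
    keep = [0]
    for idx in range(1, len(blocks)):
        if blocks[idx] not in seen:
            seen.add(blocks[idx])
            keep.append(idx)
    return keep


w = "abababbaababa"                    # toy word over {a, b} (assumed), p = 2
blocks, rest = split_blocks(w, 2)
keep = select_indices(blocks)

# S consists of the selected blocks followed by the remainder u.
S = [blocks[i] for i in keep] + [rest]

# Sequences of blocks lying strictly between two adjacent selected blocks.
gaps = [blocks[a + 1:b] for a, b in zip(keep, keep[1:])]
```

The entries of `gaps` play the role of the intermediate sequences $u_{j'_1} \dotsb u_{j'_l}$ whose length the remainder of the proof bounds.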

Consider the computation $T(w)$. The additive-periodical embedding $\mathit{emb}$ of $T$ implies that $\mathit{emb}(w)$ includes at most $\Sigma p$ different vectors. Furthermore, from layer to layer equal vectors are mapped equally, which means that each of $X_w^1, \dotsc, X_w^L$ contains at most $\Sigma p$ different vectors as well. This implies that the computation $T(w)$ induces at most $(L\Sigma p)^2 \times L \times H \leq (\Sigma p L^2 H)^2 \leq (\Sigma p L H)^4$ different tuples $(\boldsymbol{x}, \boldsymbol{x}', i, j)$ where $\boldsymbol{x}, \boldsymbol{x}'$ are vectors induced by $T(w)$ and $i \leq L, j \leq H$. Additionally, we have that for each value $N_w(\boldsymbol{x}, \boldsymbol{x}', i, j)$ and $P_w(\boldsymbol{x}, i, j)$, as defined in the beginning of this section, there are at most $|V^D| \leq 2^{bD}$ possibilities. Simple combinatorics, namely the pigeonhole principle, states that in the increasing sequence $u_{j'_1}, u_{j'_2}, \dots$ there must be points $h_1$ and $h_2$ with $h_1 \leq 2^{bD(\Sigma pLH)^4} \leq 2^{(|T|)^5}$ such that for all tuples $(\boldsymbol{x}, \boldsymbol{x}', i, j)$ induced by $T(w)$ we have that $N_{u_{j_0} u_{j'_1} \dotsb u_{j'_{h_1}}}(\boldsymbol{x}, \boldsymbol{x}', i, j) = N_{u_{j_0} u_{j'_1} \dotsb u_{j'_{h_2}}}(\boldsymbol{x}, \boldsymbol{x}', i, j)$ and $P_{u_{j_0} u_{j'_1} \dotsb u_{j'_{h_1}}}(\boldsymbol{x}, i, j) = P_{u_{j_0} u_{j'_1} \dotsb u_{j'_{h_2}}}(\boldsymbol{x}, i, j)$. Now, Lemma 7 states that this implies $N_{u_{j_0} u_{j'_1} \dotsb u_{j'_{h_1}} u_{j'_{h_2+1}} \dotsb u_{j_1} \dotsb u}(\boldsymbol{x}, \boldsymbol{x}', i, j) = N_w(\boldsymbol{x}, \boldsymbol{x}', i, j)$ and $P_{u_{j_0} u_{j'_1} \dotsb u_{j'_{h_1}} u_{j'_{h_2+1}} \dotsb u_{j_1} \dotsb u}(\boldsymbol{x}, i, j) = P_w(\boldsymbol{x}, i, j)$. However, this implies that the subsequence $u_{j'_{h_1+1}} \dotsb u_{j'_{h_2}}$ has no influence on the computation of $T$ on $w$ and, thus, can be left out. As we can argue this for every such cycle occurring in $u_{j'_1} \dotsb u_{j'_l}$, we get the desired bound of $2^{(|T|)^5}$. ∎
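
The cycle-cutting step can be illustrated on a simplified model. The following Python sketch is not the construction from the proof: it assumes that the values $N_w$ and $P_w$ are abstracted into a single summary that is updated block by block by a hypothetical function `step` with finite range, and the names `cut_cycles`, `fold` and `step` are chosen for illustration only. Whenever two prefixes yield the same summary, the blocks in between are dropped, so the number of kept blocks is bounded by the number of distinct summaries.

```python
from functools import reduce


def fold(blocks, step, init):
    """Summary reached after reading all blocks from the initial summary."""
    return reduce(step, blocks, init)


def cut_cycles(blocks, step, init):
    """Drop every run of blocks that lies between two prefixes with equal summaries.

    Assumption (illustration only): `step` is deterministic and summaries are
    hashable, so equal summaries behave identically on every continuation.
    """
    kept = []
    seen = {init: 0}  # summary -> length of the kept prefix that produced it
    state = init
    for block in blocks:
        state = step(state, block)
        if state in seen:
            # The blocks kept since the earlier occurrence, plus the current
            # block, form a cycle with no influence on the summary: drop them.
            del kept[seen[state]:]
            # Forget summaries that were only reached inside the dropped cycle.
            seen = {s: k for s, k in seen.items() if k <= len(kept)}
        else:
            kept.append(block)
            seen[state] = len(kept)
    return kept


if __name__ == "__main__":
    # Hypothetical summary: number of "a" blocks, saturating at 3.
    count_a = lambda s, b: min(s + (1 if b == "a" else 0), 3)
    word = list("abaabaaab")
    short = cut_cycles(word, count_a, 0)
    print(short, fold(short, count_a, 0) == fold(word, count_a, 0))
```

In this toy setting the shortened sequence reaches the same final summary as the original one, which mirrors the role that Lemma 7 plays in the argument above.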

Proof of Theorem 5.

We prove the statement via reduction from $\textsc{OTWP}_{\mathsf{bin}}$. Let $\mathcal{S} = (S, H, V, t_I, t_F)$ and $n \geq 1$ be an instance of $\textsc{OTWP}_{\mathsf{bin}}$. We construct an EOT $T_{\mathcal{S},n} \in \mathcal{T}^{\textsc{fix}}$ working over some FA $F$ with $T_{\mathcal{S},n}(w) = 1$ if and only if $w \in S^{+}$ witnesses the validity of the $\textsc{OTWP}_{\mathsf{bin}}$ instance $(\mathcal{S}, n)$.

Next, let $T_{\mathcal{S},n}$ be built exactly like $T_{\mathcal{S}}$ in the proof of Theorem 4, but with the following structural adjustments. In layer $l_3$ we adjust $\mathit{comb}_3$ to be $\mathit{comb}_3 = N_3 |\!| N_e |\!| N_f$ where $N_3$ is specified as in the proof of Theorem 4, $N_e = N_{\rightarrow} \circ (N^{x_{1,2}, x_{3,2}}_{=} |\!| N^{x_{1,3}}_{=n})$ and $N_f = N^{x_{1,2}}_{\neq \frac{(n+1)((n+1)+1)}{2}+1}$ where $N_{\neq t}$ is analogous to the construction of $N_{=t}$ given in Lemma 5. Furthermore, we adjust $\mathit{comb}_4$ in layer $l_4$ to be represented by the FNN $\mathit{relu}(x_3 + \dotsb + x_8 + x_9)$. We refer to the gadgets described in Lemma 4 and Lemma 5 as well as the proof of Theorem 1 for further details.

Consider the adjustment in $l_3$. FNN $N_e$ in $\mathit{comb}_3$ ensures that $T_{\mathcal{S},n}(w) = 1$ only if the row index corresponding to the last symbol is equal to $n$. Note that $N_3$ checks whether row and column index corresponding to the last symbol are equal. Additionally, $N_f$ checks if there is no id equal to $\frac{(n+1)((n+1)+1)}{2}+1$. This corresponds to the position id of the successor of the vector representing tile $(n,n)$. Furthermore, the adjustment of $\mathit{comb}_4$ considers the output of $N_e$ and $N_f$ in addition to the outputs of $N_3$. In summary, we have that $T_{\mathcal{S},n}$ only outputs $1$ given $w$ if the word length is such that the row index corresponding to the position of the last symbol of $w$ in a respective octant tiling is equal to $n$ (ensured by $N_e$), that $w$ is at most of length $\frac{(n+1)((n+1)+1)}{2}$ (ensured by $N_f$) and if $w$ represents a valid encoded tiling (the remaining parts of $T_{\mathcal{S},n}$).
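
The length conditions summarised above can be checked on plain word lengths. The following Python sketch assumes, as the stated length $\frac{(n+1)((n+1)+1)}{2}$ suggests, that row $r$ of the octant holds the columns $0, \dotsc, r$ and that a word lists the tiles row by row. The function names are hypothetical, and the code models only the index arithmetic, not the EOT or the tiling constraints.

```python
from typing import Tuple


def octant_position(k: int) -> Tuple[int, int]:
    """Map the 1-based word position k to (row, col) in the octant.

    Assumption (illustration only): row r holds columns 0..r and the word
    enumerates the rows in increasing order.
    """
    row = 0
    # Row r ends at word position (r + 1)(r + 2) / 2.
    while (row + 1) * (row + 2) // 2 < k:
        row += 1
    col = k - row * (row + 1) // 2 - 1
    return row, col


def target_length(n: int) -> int:
    """Length of a word whose last symbol sits at tile (n, n)."""
    return (n + 1) * ((n + 1) + 1) // 2


def length_accepted(length: int, n: int) -> bool:
    """Combine the three length-related checks described in the text."""
    row, col = octant_position(length)
    last_on_diagonal = row == col                    # role of N_3
    last_in_row_n = row == n                         # role of N_e
    no_successor_id = length < target_length(n) + 1  # role of N_f
    return last_on_diagonal and last_in_row_n and no_successor_id


if __name__ == "__main__":
    n = 3
    print([k for k in range(1, 50) if length_accepted(k, n)], target_length(n))
```

Under these assumptions the only accepted length is the one at which the last symbol sits at tile $(n,n)$.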

Additionally, we need to argue that $T_{\mathcal{S},n}$ works as intended, despite the fact that it is limited by some FA $F$ using a representation size that is at most logarithmic in $n$. These arguments follow the exact same line as in the proof of Theorem 2, but using an FA $F$ that uses $m = \lfloor 6\log(\max(|S|,n)) \rfloor + 2$ bits and handles overflow using saturation. The reason for the larger representation size is that words $w$ representing a valid encoded tiling ending at position $(n,n)$ are of length $|w| = \frac{(n+1)((n+1)+1)}{2} \leq n^2$. Thus, we use $\lfloor 4\log(n) \rfloor + 1$ integer bits to be able to represent a sum $\sum_{j=0}^{i} j = \frac{i(i+1)}{2} \leq i^2$ for all $i \leq n^2$ and $\lfloor 2\log(n) \rfloor + 1$ fractional bits to uniquely represent the fraction $\frac{1}{l}$ for $l \leq n$. For details, see the proof of Theorem 2. Furthermore, the fact that we use $\lfloor 4\log(n) \rfloor + 1$ bits to encode integers and that $F$ handles overflow using saturation ensures that $N_f$ works as intended: we have that $\frac{(n+1)((n+1)+1)}{2}+1 < n^4$ and, thus, the id $\frac{(n+1)((n+1)+1)}{2}+1$ occurs at most once, independent of the length of $w$, as it is not the point where $F$ enforces saturation on the positional embedding. Thus, $\mathit{att}_{\text{self}}$ works for this position as intended and then $N_f$ checks the property described above correctly.
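
The bit counts above can be tabulated for concrete values. The following Python sketch makes two assumptions that are not stated in the text: logarithms are taken to base 2, and the positional id of position $k$ is $k$ itself, clipped at the largest representable integer. All function names are hypothetical. The sketch illustrates why the id checked by $N_f$ lies below the saturation point and therefore occurs at most once.

```python
def floor_log2_pow(n: int, e: int) -> int:
    """floor(e * log2(n)), computed exactly via the bit length of n**e."""
    return (n ** e).bit_length() - 1


def bit_budget(n: int, num_tiles: int):
    """Integer bits, fractional bits and total width m as given in the text.

    Assumption (illustration only): log denotes the base-2 logarithm and
    num_tiles stands for |S|.
    """
    int_bits = floor_log2_pow(n, 4) + 1
    frac_bits = floor_log2_pow(n, 2) + 1
    total_bits = floor_log2_pow(max(num_tiles, n), 6) + 2
    return int_bits, frac_bits, total_bits


def forbidden_id(n: int) -> int:
    """Position id of the successor of tile (n, n), as checked by N_f."""
    return (n + 1) * ((n + 1) + 1) // 2 + 1


def saturated_ids(length: int, int_bits: int):
    """Position ids 1..length, clipped at the largest representable integer."""
    cap = 2 ** int_bits - 1
    return [min(k, cap) for k in range(1, length + 1)]


if __name__ == "__main__":
    n, num_tiles = 4, 3
    int_bits, frac_bits, total_bits = bit_budget(n, num_tiles)
    ids = saturated_ids(600, int_bits)
    # Ids below the cap stay unique; only the cap value itself repeats.
    print(int_bits, frac_bits, total_bits, ids.count(forbidden_id(n)))
```

In this model only the saturated maximum can repeat in long words, so a check for the id of the successor of tile $(n,n)$ is unaffected by saturation for the sample values tried here ($n \geq 2$).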

The argument that $T_{\mathcal{S},n}$ can be built in polynomial time is a straightforward implication of the arguments for Theorem 3 and the fact that $N_e$ and $N_f$ are small gadgets with maximum parameter quadratic in $n$, which can be represented using a logarithmic number of bits. ∎

NeurIPS Paper Checklist

  1. Claims

    Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    Answer: [Yes]

    Justification: Each claimed computability or complexity result, namely Theorems 1 to 5, is sufficiently argued in the main paper with full, formal proofs in the appendix.

    Guidelines:

    • The answer NA means that the abstract and introduction do not include the claims made in the paper.

    • The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.

    • The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

    • It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

  2. Limitations

    Question: Does the paper discuss the limitations of the work performed by the authors?

    Answer: [Yes]

    Justification: We discuss the limitations of our theoretical results in Section 3 and, in more detail, in Section 6.

    Guidelines:

    • The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.

    • The authors are encouraged to create a separate "Limitations" section in their paper.

    • The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.

    • The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

    • The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.

    • The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.

    • If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.

    • While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

  3. Theory Assumptions and Proofs

    Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

    Answer: [Yes]

    Justification: For each result we gave a full formal proof in the Appendix and a short proof sketch and intuitive explanation in the main paper.

    Guidelines:

    • The answer NA means that the paper does not include theoretical results.

    • All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.

    • All assumptions should be clearly stated or referenced in the statement of any theorems.

    • The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.

    • Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.

    • Theorems and Lemmas that the proof relies upon should be properly referenced.

  4. Experimental Result Reproducibility

    Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

    Answer: [N/A]

    Justification:

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.

    • If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

    • Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.

    • While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example

      (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.

      (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

      (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

      (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

  5. Open access to data and code

    Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

    Answer: [N/A]

    Justification:

    Guidelines:

    • The answer NA means that the paper does not include experiments requiring code.

    • Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

    • The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.

    • The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.

    • At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

    • Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

  6. Experimental Setting/Details

    Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

    Answer: [N/A]

    Justification:

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.

    • The full details can be provided either with the code, in appendix, or as supplemental material.

  7. Experiment Statistical Significance

    Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

    Answer: [N/A]

    Justification:

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

    • The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

    • The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

    • The assumptions made should be given (e.g., Normally distributed errors).

    • It should be clear whether the error bar is the standard deviation or the standard error of the mean.

    • It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.

    • For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).

    • If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

  8. Experiments Compute Resources

    Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

    Answer: [N/A]

    Justification:

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.

    • The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.

    • The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).

  9. Code Of Ethics

    Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

    Answer: [Yes]

    Justification: None of the topics “Potential Harms Caused by Research Process”, “Societal Impact and Potential Harmful Consequences” or “Impact Mitigation Measures” applies to our theoretical results.

    Guidelines:

    • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.

    • If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.

    • The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

  10. Broader Impacts

    Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

    Answer: [N/A]

    Justification:

    Guidelines:

    • The answer NA means that there is no societal impact of the work performed.

    • If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.

    • Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

    • The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.

    • The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.

    • If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

  51. 11.

    Safeguards

  52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

  53. Answer: [N/A]

  54. Justification:

  55. Guidelines:

    • The answer NA means that the paper poses no such risks.

    • Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

    • Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

    • We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

  56. 12.

    Licenses for existing assets

  57. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

  58. Answer: [N/A]

  59. Justification:

  60. Guidelines:

    • The answer NA means that the paper does not use existing assets.

    • The authors should cite the original paper that produced the code package or dataset.

    • The authors should state which version of the asset is used and, if possible, include a URL.

    • The name of the license (e.g., CC-BY 4.0) should be included for each asset.

    • For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

    • If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

    • For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

    • If this information is not available online, the authors are encouraged to reach out to the asset’s creators.

  61. 13.

    New Assets

  62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

  63. Answer: [N/A]

  64. Justification:

  65. Guidelines:

    • The answer NA means that the paper does not release new assets.

    • Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.

    • The paper should discuss whether and how consent was obtained from people whose asset is used.

    • At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

  66. 14.

    Crowdsourcing and Research with Human Subjects

  67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

  68. Answer: [N/A]

  69. Justification:

  70. Guidelines:

    • The answer NA means that the paper involves neither crowdsourcing nor research with human subjects.

    • Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

    • According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

  71. 15.

    Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

  72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

  73. Answer: [N/A]

  74. Justification:

  75. Guidelines:

    • The answer NA means that the paper involves neither crowdsourcing nor research with human subjects.

    • Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

    • We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.

    • For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.