The Computational Complexity of Formal Reasoning for Encoder-Only Transformers

Marco Sälzer, Eric Alsmann, Martin Lange
Theoretical Computer Science / Formal Methods
University of Kassel, Germany
{marco.saelzer,eric.alsmann,martin.lange}@uni-kassel.de

Abstract

We investigate challenges and possibilities of formal reasoning for encoder-only transformers (EOT), meaning sound and complete methods for verifying or interpreting their behaviour. In detail, we condense related formal reasoning tasks into a naturally occurring satisfiability problem (SAT). We find that SAT is undecidable for the classes of EOT commonly considered in the expressiveness community. Furthermore, we identify practical scenarios where SAT is decidable and establish corresponding complexity bounds. Besides trivial cases, we find that quantized EOT, namely those restricted by some fixed-width arithmetic, lead to the decidability of SAT due to their limited attention capabilities. However, the problem remains difficult: we establish scenarios where SAT is NEXPTIME-hard and scenarios where we can show that it is solvable in NEXPTIME for quantized EOT. To complement our theoretical results, we put our findings and their implications in the overall perspective of formal reasoning.

1 Introduction

Natural language processing (NLP) models, which process and compute human language, are gateways for modern applications that aim to interact with human users in a natural way. Although NLP is a traditional field of research, the use of deep learning techniques has undoubtedly revolutionised the field in recent years [22]. In this revolution, models such as Recurrent Neural Networks (RNN) or, more specifically, Long Short-term Memory Networks (LSTM) [30] have long been the driving force, but for a few years now NLP has had a new figurehead: transformers [28].

Transformers are a deep learning model using (multiple) self-attention mechanisms to process sequential input data, usually natural language. The efficient trainability of transformers, for example in contrast to LSTM, while achieving top-tier performance led to numerous heavy-impact implementations such as BERT [10], GPT-3 [6] or GPT-4 [21], sparking widespread use of the transformer architecture. However, the foreseeable omnipresence of transformer-based applications leads to serious security concerns.

In general, there are two approaches to establishing trustworthiness of learning-based models: first, certifying specific, application-dependent safety properties, called verification, and second, interpreting the behaviour of such models and giving explanations for it, called interpretation. In both approaches, the holy grail is to develop automatic methods that are sound and complete: an algorithm $A$ that, given some model $T$ and a (verification or interpretation) specification $\varphi$, outputs true if $T$ satisfies $\varphi$ and false otherwise (soundness), and does so for all combinations of $T$ and $\varphi$ it is designed for (completeness). We refer to such sound and complete methods and tasks for verification and interpretation collectively using the term formal reasoning.

We lay out a framework for the possibilities and challenges of formal reasoning for transformers by establishing basic computability and complexity results in this work. Thereby, we focus on the so-called satisfiability (Sat) problem of sequence-classifying transformers: given a transformer $T$ , decide whether there is some input word $w$ such that $T(w)=1$ . Although this may seem like an artificial problem at first glance, it is a natural abstraction of problems that commonly occur in almost all non-trivial formal reasoning tasks. Additionally, since it is detached from the specifics of particular reasoning specifications like safety properties for instance, uncomputability results and complexity-theoretic hardness results immediately transfer to more complex formal reasoning tasks. This also keeps the focus on the transformer architecture under consideration. Here we exclusively consider encoder-only transformers (EOT), mainly due to the fact that the known high expressive power of encoder-decoder transformers [23] makes formal reasoning trivially impossible.

Our work is structured as follows. We define necessary preliminaries in Section 2. In Section 3, we give an overview of our theoretical results and take a comprehensive look at their implications for formal reasoning for transformers. In Section 4 and Section 5 we present our theoretical results. First, we show that Sat is undecidable for classes of EOT commonly considered in research on transformer expressiveness. Second, we show that a bounded version bSat of the satisfiability problem is decidable for any class of (computable) EOT, and give corresponding complexity bounds. Third, we show that considering quantized EOT, meaning EOT whose parameters and internal computations are limited by some fixed-width arithmetic, leads to decidability of Sat, and again give corresponding complexity bounds. Finally, we discuss limitations, open problems and future research in Section 6.

Related work.

We establish basic computability and complexity results about transformer-related formal reasoning problems, like formal verification or interpretation. This places our work in the intersection between research on verification and interpretation of transformers and transformer expressiveness.

There is a limited amount of work concerned with methods for the verification of safety properties of transformers [15, 25, 4, 11]. However, none of these methods falls into the category of formal reasoning, as they are incomplete. This means that the rigorous computability and complexity bounds established in this work cannot be applied to them without further considerations. The same applies to the interpretability methods considered so far [31]. We remark that many of these approaches are not sound either. In contrast, there is a rise in theoretical investigations of transformer expressiveness. Initial work dealt with encoder-decoder models and showed that such models are Turing-complete [23, 3]. Encoder-only models have so far been analysed in connection with circuit complexity [13, 14, 20, 19], logics [7, 18] and programming languages [29]. A recently published survey [26] provides an overview of these results. This line of work is adjacent to ours, as some of the classes of EOT considered here, mainly those in Section 4, are motivated by these results, and some of the constructions we use in the corresponding proofs are similar.

2 Fundamentals

Mathematical basics.

Let $\Sigma$ be a finite set of symbols, called alphabet. A (finite) word $w$ over $\Sigma$ is a finite sequence $a_{1}\dotsb a_{k}$ where $a_{i}\in\Sigma$ . We define $|w|=k$ . As usual, we denote the set of all non-empty words by $\Sigma^{+}$ . A language is a set of words. We also extend the notion of an alphabet to vectors $\boldsymbol{x}_{i}\in\mathbb{R}^{d}$ , meaning that a sequence $\boldsymbol{x}_{1}\dotsb\boldsymbol{x}_{k}$ is a word over some subset of $\mathbb{R}^{d}$ . Usually, we denote vectors using bold symbols like $\boldsymbol{x},\boldsymbol{y}$ or $\boldsymbol{z}$ .

Encoder-only transformers (EOT).

We consider the encoder-only transformer (EOT) model introduced in [28]. We take a look at EOT from a computability and complexity perspective, which is why we follow more formal definitions, as done in [13, 23, 14]. An EOT $T$ with $L$ layers and $h_{i}$ attention heads in layer $i$ is a tuple $(\mathit{emb},\{\mathit{att}_{i,j}\mid 1\leq i\leq L,1\leq j\leq h_{i}\},\{\mathit{comb}_{i}\mid 1\leq i\leq L\},\mathit{out})$ where

• $\mathit{emb}\colon\Sigma\times\mathbb{N}\rightarrow\mathbb{R}^{d_{0}}$ for some $d_{0}\in\mathbb{N}$ is the positional embedding,
• each attention head is a tuple $\mathit{att}_{i,j}=(\mathit{score}_{i,j},\mathit{pool}_{i,j})$ where $\mathit{score}_{i,j}\colon\mathbb{R}^{d_{i-1}}\times\mathbb{R}^{d_{i-1}}\rightarrow\mathbb{R}$ is a function called scoring and $\mathit{pool}_{i,j}\colon(\mathbb{R}^{d_{i-1}})^{+}\times\mathbb{R}^{+}\rightarrow\mathbb{R}^{d_{i}}$ is a function called pooling, computing $(\boldsymbol{x}_{1},\dotsc,\boldsymbol{x}_{n},s_{1},\dotsc,s_{n})\mapsto\sum_{i^{\prime}=1}^{n}\mathit{norm}(i^{\prime},s_{1},\dotsc,s_{n})(W\boldsymbol{x}_{i^{\prime}})$ where $W$ is a linear map represented by a matrix and $\mathit{norm}\colon\mathbb{N}\times\mathbb{R}^{+}\rightarrow\mathbb{R}$ is a normalisation,
• each $\mathit{comb}_{i}\colon\mathbb{R}^{d_{i,1}}\times\dotsb\times\mathbb{R}^{d_{i,h_{i}+1}}\rightarrow\mathbb{R}^{d_{i}}$ is called a combination and $\mathit{out}\colon\mathbb{R}^{d_{L}}\rightarrow\mathbb{R}$ is called the output.

For given $i\leq L$ we call the tuple $(\mathit{att}_{i,1},\dotsc,\mathit{att}_{i,h_{i}},\mathit{comb}_{i})$ the $i$-th layer of $T$. The EOT $T$ computes a function $\Sigma^{+}\rightarrow\mathbb{R}$ as follows. Let $w=a_{1}\dotsb a_{n}\in\Sigma^{+}$ be a word. First, $T$ computes an embedding of $w$ by $\mathit{emb}(w)=\boldsymbol{x}_{1}^{0}\dotsb\boldsymbol{x}_{n}^{0}$ where $\boldsymbol{x}^{0}_{i}=\mathit{emb}(a_{i},i)$. Next, each layer $1\leq i\leq L$ computes a sequence $\boldsymbol{x}_{1}^{i}\dotsb\boldsymbol{x}_{n}^{i}$ as follows: for each input $\boldsymbol{x}_{m}^{i-1}$ and attention head $\mathit{att}_{i,j}$, layer $i$ computes $\boldsymbol{y}_{m,j}^{i}=\mathit{pool}_{i,j}(\boldsymbol{x}_{1}^{i-1},\dotsc,\boldsymbol{x}_{n}^{i-1},\mathit{score}_{i,j}(\boldsymbol{x}_{m}^{i-1},\boldsymbol{x}_{1}^{i-1}),\dotsc,\mathit{score}_{i,j}(\boldsymbol{x}_{m}^{i-1},\boldsymbol{x}_{n}^{i-1}))$. Then, $\boldsymbol{x}^{i}_{m}$ is given by $\mathit{comb}_{i}(\boldsymbol{x}^{i-1}_{m},\boldsymbol{y}_{m,1}^{i},\dotsc,\boldsymbol{y}_{m,h_{i}}^{i})$. In the end, the output $T(w)$ is computed by $\mathit{out}(\boldsymbol{x}^{L}_{n})$, that is, the value of the output function for the last symbol of $w$ after being transformed by the embedding and the $L$ layers of $T$. We say that $T$ accepts $w$ if $T(w)=1$, and that $T$ rejects $w$ otherwise. We call $L$ the depth of $T$ and the maximal $h_{i}$ the (maximum) width of $T$. Furthermore, we call the maximal $d_{i}$ the (maximum) dimensionality of $T$.

Let $\mathcal{T}$ be some class of EOT. The decision problem $\textsc{Sat}[\mathcal{T}]$ is: given $T\in\mathcal{T}$ over alphabet $\Sigma$, decide whether there is $w\in\Sigma^{+}$ such that $T(w)=1$. We refer to this as the satisfiability problem for $\mathcal{T}$.
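The computation of $T(w)$ described above can be sketched in a few lines of Python. This is only an illustrative reading of the definition: the names `run_eot`, `pool` and `matvec`, the representation of a head as a triple (score, norm, $W$), and the toy instance at the end are assumptions made for this sketch and do not appear in the formal model.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# A head is modelled as (score, norm, W): score maps two vectors to a real,
# norm maps (index, scores) to a weight, W is the matrix used in pooling.
Head = Tuple[Callable[[Vector, Vector], float],
             Callable[[int, Sequence[float]], float],
             List[List[float]]]


def matvec(W: List[List[float]], x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def pool(head: Head, xs: List[Vector], m: int) -> Vector:
    """y_m = sum_i norm(i, s_1..s_n) * (W x_i), where s_i = score(x_m, x_i)."""
    score, norm, W = head
    scores = [score(xs[m], x) for x in xs]
    out = [0.0] * len(W)
    for i, x in enumerate(xs):
        weight = norm(i, scores)
        for d, v in enumerate(matvec(W, x)):
            out[d] += weight * v
    return out


def run_eot(emb, layers, out, word) -> float:
    """layers is a list of (heads, comb); word positions are numbered from 1."""
    xs = [emb(a, i) for i, a in enumerate(word, start=1)]
    for heads, comb in layers:
        # Every position sees the previous layer's sequence, never a partial update.
        xs = [comb(xs[m], [pool(h, xs, m) for h in heads]) for m in range(len(xs))]
    return out(xs[-1])  # output function applied at the last position


# Toy instance (hypothetical): accepts exactly the words over {a, b} without "b".
def toy_emb(a, i):
    return [1.0 if a == "a" else 0.0]

def uniform(i, scores):
    return 1.0 / len(scores)

toy_layers = [([(lambda x, y: 0.0, uniform, [[1.0]])], lambda x, ys: ys[0])]

def toy_out(x):
    return 1.0 if abs(x[0] - 1.0) < 1e-9 else 0.0
```

In the toy instance the single head averages the indicator of the symbol "a" over all positions, so the last position holds the value 1 only if no "b" occurs.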

Fixed-width arithmetics.

In this work, we consider commonly used fixed-width arithmetics (FA) that represent numbers using a fixed number of bits, like floating- or fixed-point arithmetic. See [1] (fixed-point) or [8] (floating-point) for rigorous mathematical definitions of such FA. Here, however, we only make use of a high-level view of different FA. Namely, given some FA $F$, we assume that all values are represented in binary using $b\in\mathbb{N}$ bits. Thus, there are at most $2^{b}$ different rational numbers representable in $F$. Furthermore, we assume that the considered FA handles overflow situations using either saturation or wrap-around, and rounding situations by rounding up or off. We consider EOT in the context of $F$: we say that $T$ works over $F$ if all computations as well as all values occurring in a computation $T(w)$ are carried out in the arithmetic defined by $F$.
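To make the high-level view concrete, the following sketch shows one possible FA: a signed fixed-point format with `bits` total bits, of which `frac_bits` are fractional. The function name `to_fixed`, the default widths and the round-to-nearest rule are choices made for this illustration; the results in this work only rely on the abstract properties stated above.

```python
import math


def to_fixed(x: float, bits: int = 8, frac_bits: int = 4,
             overflow: str = "saturate") -> float:
    """Map x to the nearest value of a signed fixed-point format (illustrative)."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1  # representable integers
    q = math.floor(x * scale + 0.5)                      # round to nearest
    if overflow == "saturate":
        q = max(lo, min(hi, q))                          # clamp to the range
    elif overflow == "wrap":
        q = (q - lo) % (1 << bits) + lo                  # two's-complement wrap
    else:
        raise ValueError(f"unknown overflow mode: {overflow}")
    return q / scale
```

With the default parameters this format has exactly $2^{8}=256$ representable values. A value such as $100$ is mapped to the largest representable number $7.9375$ under saturation and to $4.0$ under wrap-around.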

3 Overview: capturing and classifying formal transformer reasoning

We address elementary problems arising in formal reasoning for transformers. In doing so, we pursue the goal of establishing basic computability and complexity results for corresponding problems in order to frame possibilities and challenges.

To achieve widespread implications of our results, we focus our considerations on a fundamental problem arising in formal verification and interpretation tasks: given a transformer $T$ , decide whether there is some input $w$ leading to some specific output $T(w)$ , as defined formally in terms of the satisfiability problem $\textsc{Sat}[\mathcal{T}]$ for a class $\mathcal{T}$ of specific transformers, see Section 2.

To see that this captures the essence of formal reasoning problems occurring in practice, consider the following formal verification task: given a transformer $T$, verify that $T$ only accepts inputs which contain some specific key from a set $K$. Such a property is usually considered a robustness property [25, 16]. We can phrase this as a satisfiability problem by considering the property's negation, namely by asking whether there is some input $w$ in which no key from $K$ occurs and for which we still have $T(w)=1$.

Likewise, consider a formal interpretation task in which we want to find the minimal subset $E^{\prime}\subseteq E$ of some set of error symbols $E$ such that all $w$ that contain all errors in $E^{\prime}$ are rejected by $T$. This is usually understood as an abductive explanation [17]. Given a candidate subset $E^{\prime}$, we can test it by checking whether there is some $w$ which contains all errors in $E^{\prime}$ but is accepted by $T$; the candidate is a valid explanation exactly if no such $w$ exists. This, again, is a special case of a satisfiability problem $\textsc{Sat}[\mathcal{T}]$ for some transformer class $\mathcal{T}$.

Furthermore, we want our results to be detached from any intricacies of certain transformer architectures. To this end, we focus on encoder-only transformers (EOT), thus leaving any decoder mechanism unconsidered. The primary reason for this is that encoder-decoder architectures are of such high expressive power [23] that Sat is easily seen to be undecidable in almost all non-trivial cases. The secondary reason is that encoder-decoder architectures subsume encoder-only architectures, so any lower computability or complexity bound established in this work is also a lower bound for encoder-decoder transformers. Additionally, the presentation of EOT in Section 2 paves the way for a parametrized view of EOT architectures, which allows us to study different classes of EOT by fixing or bounding such parameters.

We start by considering the class $\mathcal{T}_{\mathit{udec}}$ of EOT, motivated by commonly considered architectures in the theoretical expressiveness community [23, 14, 13]: $\mathcal{T}_{\mathit{udec}}$ consists of those EOT that use a positional embedding, expressive enough to compute a sum, hardmax $\mathit{hmax}$ as normalisation functions and a scalar-product based scoring, enriched with a nonlinear map represented by an FNN.

Theorem 1 (Section 4).

The satisfiability problem $\textsc{Sat}[\mathcal{T}_{\mathit{udec}}]$ is undecidable.

Figure 1: Schematic overview of the computability and complexity results established in this work. The classes of EOT are described in the text preceding the respective theorem. Note that $\mathcal{T}$ refers to an arbitrary class of (computable) EOT. The small subset in the classes NP and NEXPTIME refers to the complete problems. The NEXPTIME-hardness result of $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}]$ is visualized by putting it exactly on the upper border between NEXPTIME and all decidable problems.

Essentially, this result implies that even for encoder-only transformers the combination of $\mathit{hmax}$ normalisation and expressive scoring is enough to make satisfiability undecidable. Generally, this makes formal reasoning, like verifying robustness properties or giving formal explanations, impossible for classes of EOT that subsume $\mathcal{T}_{\mathit{udec}}$. Specifically, no such methods exist that are fully automatic, sound and complete. Note, however, that Theorem 1 does not preclude the existence of incomplete methods.

Recently, so-called log-precision transformers have been studied [18]. These transformers are defined as usual, but given a word length $n$ it is assumed that a log-precision transformer $T$ uses at most $\mathcal{O}(\log(n))$ bits in its internal computations. To complement these theoretical considerations, we consider the class $\mathcal{T}^{\textsc{log}}_{\mathit{udec}}$ of EOT from $\mathcal{T}_{\mathit{udec}}$ that work with log-precision. Unfortunately, this restriction is not enough to circumvent general undecidability.

Theorem 2 (Section 4).

The satisfiability problem $\textsc{Sat}[\mathcal{T}^{\textsc{log}}_{\mathit{udec}}]$ is undecidable.

Given such impossibility results, we turn our attention to the search for decidable cases. We make the reasonable assumption that all considered EOT are computable, meaning that their components like scoring, normalisation, pooling, combination and output functions are computable functions.

First, we consider a natural restriction of the satisfiability problem by bounding the length of valid inputs. Then satisfiability becomes decidable, regardless of the respective class of EOT, but it is difficult from a complexity-theoretic perspective. To formalize this, we introduce the bounded satisfiability problem $\textsc{bSat}[\mathcal{T}]$ for a class $\mathcal{T}$: given an EOT $T\in\mathcal{T}$ and a bound $n\in\mathbb{N}$ on its input length, decide whether there is a word $w$ with $|w|\leq n$ such that $T(w)=1$.

Theorem 3 (Section 5, informal).

The bounded satisfiability problem $\textsc{bSat}[\mathcal{T}]$ is decidable for all classes $\mathcal{T}$ of (computable) EOT. Depending on whether $n$ is given in binary or unary coding, $\textsc{bSat}[\mathcal{T}]$ is NEXPTIME-, resp. NP-hard whenever $\mathcal{T}\supseteq\mathcal{T}_{\mathit{udec}}$ .

Informally, this result implies that bounding the word length is a method to enable formal reasoning. However, it does not change the fact that satisfiability is an essentially hard problem. As hardness is a lower bound, this also translates to subsuming formal reasoning tasks.

Imposing a bound on the input length may not be a viable restriction for various formal reasoning tasks. We therefore study other ways of obtaining decidability. We address the unbounded satisfiability problem for practically motivated classes of EOT. We consider the class $\mathcal{T}^{\textsc{fix}}_{\circ}$ of EOT that use a positional embedding with some periodicity in their positional encoding, commonly seen in practice [28, 12], use softmax or hardmax as normalisation and which work over some fixed-width arithmetic (FA). This last restriction is motivated by recent popular ways to handle ever increasing EOT sizes, for example via quantization or using low-bit arithmetics [5]. From a complexity-theoretic perspective, the use of fixed-width arithmetic has a similar effect to bounding the input length.

Theorem 4 (Section 5).

The satisfiability problem $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}_{\circ}]$ is in NEXPTIME.

So automatic, sound and complete formal reasoning for periodical EOT in a fixed-width arithmetic environment is generally possible with potentially high complexity. Note that formal reasoning tasks with more complex safety or interpretability specifications than simple satisfiability may even lead to higher complexities.

We then aim to show that this is optimal by providing a matching lower bound. However, we need to relax these restrictions again, namely considering the class $\mathcal{T}^{\textsc{fix}}$ allowing for EOT that use non-periodical embeddings and work over some fixed-width arithmetic that can use saturation to handle overflow situations. We show that this high complexity is unavoidable, making sound and complete automatic formal reasoning for fixed-width arithmetic transformers with general positional embeddings practically intractable.

Theorem 5 (Section 5).

The satisfiability problem $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}]$ is NEXPTIME-hard.

A schematic depiction of the computability and complexity results described in this section is given in Figure 1. Note that this figure is a purely technical presentation of our results, which means that it does not convey the implications for formal reasoning described above.

4 Transformer satisfiability is generally undecidable

We consider a class of EOT $\mathcal{T}_{\mathit{udec}}$ which is as weak as possible regarding the expressiveness of the included EOT. We define $\mathcal{T}_{\mathit{udec}}$ by giving minimum requirements. Positional embeddings can be of the form $\mathit{emb}(a_{k},0)=(1,1,0,0,k)$ and $\mathit{emb}(a_{k},i)=(0,1,i,\sum_{j=0}^{i}j,k)$, where we assume some order on the alphabet symbols $a_{1},a_{2},\dotsc$. For scoring functions we allow for $N(\langle Q\boldsymbol{x},K\boldsymbol{y}\rangle)$ where $N$ is a classical Feedforward Neural Network (FNN) with $\mathit{relu}$ activations, $Q$ and $K$ are linear maps and $\langle\dotsb\rangle$ denotes the usual scalar product. For normalisations we allow for hardmax: $\mathit{hmax}(i,x_{1},\dotsc,x_{n})=\frac{1}{m}$ if $x_{i}\geq x_{j}$ for all $j\leq n$ and there are exactly $m$ positions $j$ such that $x_{i}=x_{j}$, and $\mathit{hmax}(i,x_{1},\dotsc,x_{n})=0$ otherwise. Combinations as well as output functions can be classical FNN with $\mathit{relu}$ activation. Aside from technical reasons, we motivate the choice of $\mathcal{T}_{\mathit{udec}}$ in Section 3. To ease our notation, we exploit the fact that using $\mathit{hmax}$ as normalisation implies a clearly defined subset of positions $M$ that are effective in the computation of some attention head $\mathit{att}$ given some position $i$, namely those that are weighted non-zero. In this case, we say that $\mathit{att}$ attends to $M$ given position $i$.
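The two basic ingredients of $\mathcal{T}_{\mathit{udec}}$, the positional embedding and the hardmax normalisation, can be written down directly. The sketch below is a plain transcription of the definitions above; the function names `emb_udec` and `hmax` and the use of 0-based list indices for the scores are conventions chosen here.

```python
from typing import List, Sequence


def emb_udec(k: int, i: int) -> List[int]:
    """Embedding of the k-th alphabet symbol at position i (two cases as in the text)."""
    if i == 0:
        return [1, 1, 0, 0, k]
    return [0, 1, i, i * (i + 1) // 2, k]  # sum_{j=0}^{i} j = i(i+1)/2


def hmax(i: int, scores: Sequence[float]) -> float:
    """Weight of index i: 1/m if scores[i] is maximal and m indices attain the maximum, else 0."""
    top = max(scores)
    if scores[i] < top:
        return 0.0
    return 1.0 / sum(1 for s in scores if s == top)
```

For the scores $(1,3,3,0)$ the two maximal positions each receive weight $\frac{1}{2}$ and all other positions receive weight $0$, so the head attends to exactly those two positions.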

We prove that $\textsc{Sat}[\mathcal{T}_{\mathit{udec}}]$ is undecidable by establishing a reduction from the (unbounded) octant tiling-word problem ($\textsc{OTWP}^{*}$). For details on tiling problems, see Appendix A. The $\textsc{OTWP}^{*}$ is defined as follows: given a tiling system $\mathcal{S}=(S,H,V,t_{I},t_{F})$ where $S$ is some finite set of tiles, $H,V\subseteq S^{2}$ and $t_{I},t_{F}\in S$, we have to decide whether there is a word (a) $t_{0,0},t_{1,0},t_{1,1},t_{2,0},t_{2,1},t_{2,2},t_{3,0},\ldots,t_{k,k}\in S^{+}$ such that (b) $t_{0,0}=t_{I}$ and $t_{k,k}=t_{F}$, (c) for all $i\leq k$ and $0\leq j<i$ we have $(t_{i,j},t_{i,j+1})\in H$, and (d) for all $i\leq k-1$ and $j\leq i$ we have $(t_{i,j},t_{i+1,j})\in V$. We call a word $w$ which satisfies (a) an encoded tiling, and if (b)-(d) are satisfied as well then we call $w$ a valid encoded tiling. Our proof strategy is easily described: given a tiling system $\mathcal{S}$, we build an EOT $T_{\mathcal{S}}\in\mathcal{T}_{\mathit{udec}}$ which accepts a word $w$ if it fulfils conditions (a) to (d) and rejects $w$ otherwise. We defer most technical proofs of the following lemmas and theorems to Appendix B and instead provide intuitions and proof sketches in this section.
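Conditions (a) to (d) can be checked directly on a given word, which is the behaviour that $T_{\mathcal{S}}$ has to reproduce. The following sketch is a reference checker written for illustration only; the function name and the representation of $H$ and $V$ as sets of pairs are assumptions, and the EOT construction itself works quite differently (via attention heads and FNN).

```python
from math import isqrt
from typing import Hashable, Sequence, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def is_valid_encoded_tiling(word: Sequence[Hashable], H: Set[Pair], V: Set[Pair],
                            t_I: Hashable, t_F: Hashable) -> bool:
    n = len(word)
    # (a) the length must be 1 + 2 + ... + (k + 1) for some k >= 0
    k = (isqrt(8 * n + 1) - 1) // 2 - 1
    if n == 0 or (k + 1) * (k + 2) // 2 != n:
        return False

    def t(i: int, j: int) -> Hashable:
        return word[i * (i + 1) // 2 + j]  # row i starts after i(i+1)/2 symbols

    # (b) initial and final tile
    if t(0, 0) != t_I or t(k, k) != t_F:
        return False
    # (c) horizontal neighbours within each row
    if any((t(i, j), t(i, j + 1)) not in H
           for i in range(k + 1) for j in range(i)):
        return False
    # (d) vertical neighbours between consecutive rows
    return all((t(i, j), t(i + 1, j)) in V
               for i in range(k) for j in range(i + 1))
```

For instance, with $H=\{(a,b)\}$, $V=\{(a,a)\}$, $t_{I}=a$ and $t_{F}=b$, the word $aab$ is a valid encoded tiling with $k=1$, whereas $aa$ is not even an encoded tiling because its length is not of the required form.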

We start with the first observation: the expressiveness of EOT in $\mathcal{T}_{\mathit{udec}}$ is sufficient to decode the octant tiling potentially represented by a given word $w$ . In detail, two encoder layers in combination with a positional embedding definable in $\mathcal{T}_{\mathit{udec}}$ are expressive enough to compute for a given symbol $t$ in $w$ to which position in an octant tiling it corresponds, if we interpret $w$ as an encoded tiling.

Lemma 1.

Let $\mathcal{S}$ be a tiling system with tiles $S=\{a_{1},\dotsc,a_{k}\}$. There is an embedding function $\mathit{emb}$ and there are encoder layers $l_{1}$ and $l_{2}$ definable in $\mathcal{T}_{\mathit{udec}}$ such that for each word $w=t_{0,0}t_{1,0}t_{1,1}t_{2,0}\dotsb t_{m,n}\in S^{+}$ holds that $l_{2}(l_{1}(\mathit{emb}(w)))=\boldsymbol{x}^{2}_{1}\dotsc\boldsymbol{x}^{2}_{|w|}$ where $\boldsymbol{x}^{2}_{i}=(1,i,r(i),c(i),k_{i})$ such that $a_{k_{i}}$ is equal to the symbol at position $i$ in $w$ and $(r(1),c(1)),(r(2),c(2)),\dotsc,(r(|w|),c(|w|))$ is equal to $(0,0),(1,0),\dotsc,(m,n)$.
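The functions $r$ and $c$ of Lemma 1 have a simple closed form, sketched below. This only illustrates which values the two layers produce; how an EOT computes them with attention is the content of the proof in Appendix B. The function name `row_col` is an assumption of this sketch, and positions are taken to be 1-based as in the statement of the lemma.

```python
from math import isqrt
from typing import Tuple


def row_col(i: int) -> Tuple[int, int]:
    """Row and column index of the i-th symbol (1-based) of an encoded octant tiling."""
    p = i - 1                        # number of symbols before position i
    r = (isqrt(8 * p + 1) - 1) // 2  # largest r with r(r+1)/2 <= p
    c = p - r * (r + 1) // 2         # offset within row r
    return r, c
```

A word is an encoded tiling, in the sense of condition (a), exactly if `row_col(len(w))` returns two equal values.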

Assume that $w\in S^{+}$. Lemma 1 implies that an EOT $T\in\mathcal{T}_{\mathit{udec}}$ is generally able to recognize whether $w$ is an encoded tiling as soon as $T$ is able to check whether $r(|w|)$ and $c(|w|)$ of the last symbol of $w$ processed by $l_{2}(l_{1}(\mathit{emb}(\dotsb)))$ are equal. Therefore, property (a) and also (b) can be checked by EOT in $\mathcal{T}_{\mathit{udec}}$ using the residual connection in the combination functions together with the expressive power of FNN. Similarly, property (c) can be ensured if it is possible to build an attention head that is able to attend to position $k+1$ given position $k$. Let $w=t_{0,0}t_{1,0}t_{1,1}t_{2,0}\dotsb t_{m,m}$ with $t_{i,j}\in S$. To verify whether property (d) holds, an EOT must be able to attend to position $k+(i+1)$ given position $k$ corresponding to symbol $t_{i,j}$. In summary, to check properties (a) to (d) it is left to argue that there are attention heads in $\mathcal{T}_{\mathit{udec}}$ that can attend to positions depending linearly on the values of the currently considered position.
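As a worked check of the two offsets used above, suppose the symbol $t_{i,j}$ sits at position $p(i,j)=\frac{i(i+1)}{2}+j+c_{0}$, where $c_{0}$ is $0$ or $1$ depending on whether positions are counted from $0$ or from $1$. The function $p$ and the constant $c_{0}$ are notation introduced only for this remark. Then

$$p(i,j+1)-p(i,j)=1,\qquad p(i+1,j)-p(i,j)=\frac{(i+1)(i+2)}{2}-\frac{i(i+1)}{2}=i+1.$$

So the horizontal neighbour of position $k$ is $k+1$ and the vertical neighbour is $k+(i+1)$, independently of the indexing convention. Both targets are linear in values available at position $k$ once the row index $i$ has been decoded.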

Lemma 2.

Let $f(x_{1},\dotsc,x_{k})=a_{1}x_{1}+\dotsb+a_{k}x_{k}+b$ with $a_{i},b\in\mathbb{R}$ be some linear function. There is an attention head $\mathit{att}_{f}$ in $\mathcal{T}_{\mathit{udec}}$ such that for all sequences $\boldsymbol{x}_{1},\dotsc,\boldsymbol{x}_{m}$ where all $\boldsymbol{x}_{i}=(1,i,\boldsymbol{y}_{i})$ for some $\boldsymbol{y}_{i}\in\mathbb{R}^{k-2}$, attention head $\mathit{att}_{f}$ attends to $\{\boldsymbol{x}_{j},\boldsymbol{x}_{j+1}\}$ given position $i$ if $f(\boldsymbol{x}_{i})=j+\frac{1}{2}$ with $j\leq m-1$, and otherwise to $\{\boldsymbol{x}_{j}\}$ where $j$ is the value nearest to $f(\boldsymbol{x}_{i})$.
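One way to obtain the behaviour stated in Lemma 2 is to score position $j$ by $-|f(\boldsymbol{x}_{i})-j|$ and normalise with hardmax. The sketch below follows this idea; it is an illustration under our own assumptions and not necessarily the construction used in Appendix B. It fits the scoring format $N(\langle Q\boldsymbol{x},K\boldsymbol{y}\rangle)$ by taking $Q\boldsymbol{x}=(f(\boldsymbol{x}),-1)$, which is linear in $\boldsymbol{x}$ because the first component of $\boldsymbol{x}$ is $1$, $K\boldsymbol{y}=(1,j)$, and $N(z)=-(\mathit{relu}(z)+\mathit{relu}(-z))$. The names `make_att_f` and `attends_to` are hypothetical.

```python
from typing import Callable, List, Sequence, Set


def relu(z: float) -> float:
    return max(0.0, z)


def make_att_f(coeffs: Sequence[float],
               b: float) -> Callable[[List[Sequence[float]], int], Set[int]]:
    """Build a function returning the set of 1-based positions attended to."""

    def f(x: Sequence[float]) -> float:
        return sum(a * v for a, v in zip(coeffs, x)) + b

    def score(x: Sequence[float], y: Sequence[float]) -> float:
        z = f(x) * y[0] - y[1]        # <Qx, Ky> with Qx = (f(x), -1), Ky = (1, j)
        return -(relu(z) + relu(-z))  # equals -|f(x) - j|, using two relu units

    def attends_to(xs: List[Sequence[float]], i: int) -> Set[int]:
        scores = [score(xs[i - 1], y) for y in xs]
        top = max(scores)             # hardmax keeps exactly the maximal scores
        return {j for j, s in enumerate(scores, start=1) if s == top}

    return attends_to


# Example vectors x_i = (1, i) for a sequence of length 5.
xs = [(1.0, float(i)) for i in range(1, 6)]
att_next = make_att_f((0.0, 1.0), 1.0)  # f(x_i) = i + 1
att_half = make_att_f((0.0, 0.5), 0.0)  # f(x_i) = i / 2
```

Here `att_next` attends each position to its successor, while the last position attends to itself because position $5$ is the one nearest to $f(\boldsymbol{x}_{5})=6$. With `att_half`, position $3$ yields $f(\boldsymbol{x}_{3})=1.5$ and the head attends to both positions $1$ and $2$.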

In combination, the previous lemmas indicate that EOT from $\mathcal{T}_{\mathit{udec}}$ are able to verify whether a given word is a valid encoded tiling. This expressive power is enough to make the satisfiability problem for EOT from $\mathcal{T}_{\mathit{udec}}$ undecidable.

Theorem 1.

The decision problem $\textsc{Sat}[\mathcal{T}_{\mathit{udec}}]$ is undecidable.

Proof Sketch.

We establish a reduction from $\textsc{OTWP}^{*}$ to $\textsc{Sat}[\mathcal{T}_{\mathit{udec}}]$ by constructing for each instance $\mathcal{S}=(S,H,V,t_{I},t_{F})$ of $\textsc{OTWP}^{*}$ an EOT $T_{\mathcal{S}}$ accepting exactly those $w$ corresponding to a valid encoded-tiling for $\mathcal{S}$ .

$T_{\mathcal{S}}$ uses the positional embedding described at the beginning of Section 4 and has four layers. Layers $l_{1}$ and $l_{2}$ are given by Lemma 1 and are used to decode the row and column indexes corresponding to a potential octant tiling for each symbol in a given word $w$. Layer $l_{3}$ uses the information encoded by the embedding and the decoded row and column indexes to check whether properties (a) to (d) described above hold for $w$. The necessary information is aggregated using three attention heads $\mathit{att}_{\mathit{prev}}$, $\mathit{att}_{\mathit{next}}$ and $\mathit{att}_{\mathit{step}}$, each built according to Lemma 2. Here, $\mathit{att}_{\mathit{prev}}$ attends each position to its predecessor, but the first position attends to itself. This allows us to clearly identify the vector corresponding to the first position in $w$ and to check whether it is equal to tile $t_{I}$. Attention head $\mathit{att}_{\mathit{next}}$ attends each position to its successor, but the last position attends to itself. This allows us to clearly identify the vector corresponding to the last position in $w$, in order to check whether it is equal to $t_{F}$, and to check the conditions given by $H$. Attention head $\mathit{att}_{\mathit{step}}$ attends each position to the position with the same column index but the successive row index. If there is no such successive row, it attends to the last position. This allows us to check whether the conditions given by $V$ hold. Each of these conditions is checked in the combination function of $l_{3}$, using specifically built feed-forward neural networks outputting $0$ in some predefined vector dimension if and only if the condition is met. Finally, layer $l_{4}$ aggregates the information of all positions in the vector corresponding to the last position using attention head $\mathit{att}_{\text{leq}}$, again given by Lemma 2.

The correctness of this reduction follows from the detailed construction of $T_{\mathcal{S}}$ , which is technically extensive and given in Appendix B. ∎

Next, we consider the class $\mathcal{T}^{\textsc{log}}_{\mathit{udec}}$ which is defined exactly like $\mathcal{T}_{\mathit{udec}}$ but for all $T\in\mathcal{T}^{\textsc{log}}_{\mathit{udec}}$ working over alphabet $\Sigma$ and all words $w$ with $|w|=n$ we assume that $T(w)$ is carried out in some fixed-width arithmetic $F$ using $\mathcal{O}(\log(\max(|\Sigma|,n)))$ bits.

Theorem 2.

The decision problem $\textsc{Sat}[\mathcal{T}^{\textsc{log}}_{\mathit{udec}}]$ is undecidable.

Proof sketch.

This proof follows the exact same line as the proof of Theorem 1. Additionally, we need to argue that $T_{\mathcal{S}}$ works as intended, despite the fact that it is limited by some log-precision $F$ .

Looking at the proof of Theorem 1, it is evident that the magnitude and precision of all values used and produced in the computation $T_{\mathcal{S}}(w)$ depend polynomially on $n$ and, thus, we can choose the representation size of $F$ to be linear in $\log(n)$, which avoids any overflow or rounding situations and ensures that $T_{\mathcal{S}}$ works as intended. A formal proof is given in Appendix B. ∎

5 How to make transformer satisfiability decidable

In this section we investigate classes of EOT leading to decidable Sat problems or decidable restrictions of it. Additionally, we establish corresponding complexity bounds.

In order to establish clearly delineated upper complexity bounds, we need to bound the representation size of an EOT $T$ . Instead of tediously analyzing the space needed to represent embedding, scoring, pooling, combination and normalisation functions, we note that it suffices to estimate the size up to polynomials only. The complexity of an EOT $T$ with $L$ layers and $h_{i}$ attention heads in layer $i$ , working on inputs over alphabet $\Sigma$ , is $|T|:=|\Sigma|+L+H+D$ where $H:=\max\{h_{i}\mid 1\leq i\leq L\}$ and $D$ is the maximal dimensionality of vectors occurring in a computation of $T$ . Note that one can reasonably assume the size of a syntactic representation of $T$ to be polynomial in $|T|$ , and that EOT have the polynomial evaluation property: given a word $w\in\Sigma^{+}$ , $T(w)$ can be computed in time that is polynomial in $|T|+|w|$ . Section 3 discusses why this assumption is reasonable.

Satisfiability restricted to words of bounded length is decidable, but difficult

We start with a natural restriction: bounding the word length. Let $\mathcal{T}$ be a class of EOT. The bounded satisfiability problem, denoted by $\textsc{bSat}[\mathcal{T}]$ is: given $T\in\mathcal{T}$ and some $n\in\mathbb{N}$ , decide whether there is a word $w$ with $|w|\leq n$ such that $T(w)=1$ . It is not hard to see that $\textsc{bSat}[\mathcal{T}]$ is decidable. However, its complexity depends on the value of $n$ , and we therefore distinguish whether $n$ is represented in binary or unary encoding. We denote the corresponding problems as $\textsc{bSat}_{\mathsf{bin}}[\mathcal{T}]$ and $\textsc{bSat}_{\mathsf{un}}[\mathcal{T}]$ .
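The decidability of $\textsc{bSat}[\mathcal{T}]$ can be illustrated by a deterministic exhaustive search over all words of length at most $n$. In the sketch below, the EOT is abstracted as an arbitrary Python callable `T`, and the function name `bsat` and the toy acceptor are assumptions made for illustration. The search inspects up to $|\Sigma|+|\Sigma|^{2}+\dotsb+|\Sigma|^{n}$ candidates, so it demonstrates decidability only, not the nondeterministic complexity bounds discussed in this section.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple


def bsat(T: Callable[[Tuple[str, ...]], float], alphabet: Sequence[str],
         n: int) -> Optional[Tuple[str, ...]]:
    """Return some word w with |w| <= n and T(w) == 1, or None if there is none."""
    for length in range(1, n + 1):
        for w in product(alphabet, repeat=length):
            if T(w) == 1:
                return w
    return None


# Hypothetical stand-in for an EOT: accepts words of length >= 3 ending in "b".
def toy_T(w: Tuple[str, ...]) -> float:
    return 1.0 if len(w) >= 3 and w[-1] == "b" else 0.0
```

For the toy acceptor, the bound $n=2$ yields no witness, while $n=3$ yields the word $aab$, the first accepted word in the enumeration order.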

Theorem 3.

Let $\mathcal{T}$ be a class of EOT. Then

1.

$\textsc{bSat}_{\mathsf{un}}[\mathcal{T}]$ is decidable in NP and if $\mathcal{T}_{\mathit{udec}}\subseteq\mathcal{T}$ then $\textsc{bSat}_{\mathsf{un}}[\mathcal{T}]$ is NP-complete,
2.

$\textsc{bSat}_{\mathsf{bin}}[\mathcal{T}]$ is decidable in NEXPTIME and if $\mathcal{T}_{\mathit{udec}}\subseteq\mathcal{T}$ then $\textsc{bSat}_{\mathsf{bin}}[\mathcal{T}]$ is NEXPTIME-complete.

Proof Sketch.

The decidability result of statement (1) can be shown using a simple guess-and-check argument: given $n\in\mathbb{N}$ , guess a word $w\in\Sigma^{+}$ with $|w|\leq n$ , compute $T(w)$ and check that the result is $1$ . This is possible in time polynomial in $|T|+n$ using the polynomial evaluation property. Moreover, the value of $|T|+n$ is polynomial in the size needed to represent $n$ in unary encoding.

The decidability result of statement (2) is shown along the same lines. However, if the value $n$ is encoded binarily then this part of the input is of size $\log n$ , and $|T|+n$ becomes exponential in this. Hence, the guess-and-check procedure only proves that $\textsc{bSat}_{\mathsf{bin}}[\mathcal{T}]\in$ NEXPTIME.

For the completeness result in (1) it suffices to argue that the problem is NP-hard. We make use of the fact that EOT in $\mathcal{T}_{\mathit{udec}}$ are expressive enough to accept a given word $w$ if and only if it is a valid encoded tiling, cf. Section 4 for details. It is possible to establish NP-hardness of a corresponding restriction of the octant word-tiling problem, namely the bounded octant word-tiling problem (for unarily encoded input values). See Appendix A for details on tiling problems. It then only remains to observe that the construction in Theorem 1 is in fact a polynomial-time reduction, and that it reduces the bounded octant word-tiling problem to the bounded satisfiability problem. The argument for NEXPTIME-hardness in statement (2) proceeds along the same lines, using that the bounded octant word-tiling problem is NEXPTIME-hard when the input parameter $n$ is given in binary encoding. A formal proof for Theorem 3 is given in Appendix C. ∎
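The guess-and-check argument can be illustrated by a deterministic stand-in that enumerates all candidate words instead of guessing one. The following is only a minimal sketch: `bounded_sat` and `toy_classifier` are hypothetical names, and the toy classifier is an arbitrary placeholder for an EOT $T$ , not a transformer.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple


def bounded_sat(
    classifier: Callable[[Tuple[str, ...]], int],
    alphabet: Sequence[str],
    n: int,
) -> Optional[Tuple[str, ...]]:
    """Return some word w with 1 <= |w| <= n and classifier(w) == 1, else None.

    This replaces the nondeterministic guess by exhaustive enumeration,
    so it inspects up to |alphabet|^1 + ... + |alphabet|^n words.
    """
    for length in range(1, n + 1):
        for word in product(alphabet, repeat=length):
            if classifier(word) == 1:
                return word
    return None


def toy_classifier(word: Tuple[str, ...]) -> int:
    """Hypothetical stand-in for T: accept iff 'a' is directly followed by 'b'."""
    return 1 if any(x == "a" and y == "b" for x, y in zip(word, word[1:])) else 0
```

For instance, `bounded_sat(toy_classifier, "ab", 2)` finds the witness `("a", "b")`, while no witness of length 1 exists. The number of enumerated words grows exponentially in $n$ , which is why the nondeterministic guess matters for the stated bounds.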

Satisfiability for fixed-width arithmetic EOT is decidable, but also difficult

We turn our attention to classes of EOT that naturally arise in practical contexts. We consider EOT that work over some fixed-width arithmetic, like fixed- or floating-point numbers, and which have an embedding relying on a periodical encoding of positions.

We start with establishing a scenario where Sat is decidable in NEXPTIME. Regardless of the underlying EOT class $\mathcal{T}$ , our proof strategy always relies on a certifier-based understanding of NEXPTIME: given $T\in\mathcal{T}$ , we nondeterministically guess a word $w$ , followed by a deterministic certification whether $T(w)=1$ holds. For this to show $\textsc{Sat}[\mathcal{T}]\in\text{NEXPTIME}$ , we need to argue that the overall running time of such a procedure is at most exponential, in particular that whenever there is a word $w$ with $T(w)=1$ then there is also some $w^{\prime}$ with $T(w^{\prime})=1$ and $|w^{\prime}|\leq 2^{\mathit{poly}(|T|)}$ . Again, we rely on the polynomial evaluation property of EOT in $\mathcal{T}$ , i.e. the fact that $T(w^{\prime})$ can be computed in time polynomial in $|T|+|w^{\prime}|$ .

We consider the class of EOT $\mathcal{T}^{\textsc{fix}}_{\circ}$ , defined by placing restrictions on the positional embedding of an EOT $T$ to be additive-periodical which means that $\mathit{emb}(a,i)=\mathit{emb}^{\prime}(a)+\mathit{pos}(i)$ where $\mathit{pos}$ is periodical, i.e. there is $p\geq 1$ such that $\mathit{pos}(i)=\mathit{pos}(i+p)$ for all $i\in\mathbb{N}$ . Additionally, all normalisation functions are realised by either the softmax function $\mathit{smax}$ or the hardmax function $\mathit{hmax}$ . Moreover, we assume that all computations occurring in $T$ are carried out in some fixed-width arithmetic, encoding values in binary using a fixed number $b\in\mathbb{N}$ of bits. Aside from technical reasons, we motivate the choice of $\mathcal{T}^{\textsc{fix}}_{\circ}$ in Section 3. Given these restrictions, we adjust the definition of the complexity of $T\in\mathcal{T}^{\textsc{fix}}_{\circ}$ as a measure of the size (up to polynomials) as $|T|:=|\Sigma|+L+H+D+p+b$ .
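As a small illustration of an additive-periodical embedding, the following sketch builds $\mathit{emb}(a,i)=\mathit{emb}^{\prime}(a)+\mathit{pos}(i)$ from a table of symbol vectors and a table of $p$ position vectors. The function name, the 0-based positions and the concrete vectors are assumptions made for this example only.

```python
from typing import Callable, Dict, List


def make_additive_periodic_embedding(
    symbol_vectors: Dict[str, List[float]],
    position_vectors: List[List[float]],
) -> Callable[[str, int], List[float]]:
    """Build emb(a, i) = emb'(a) + pos(i) where pos has period p = len(position_vectors)."""
    p = len(position_vectors)

    def emb(symbol: str, i: int) -> List[float]:
        pos = position_vectors[i % p]  # periodicity: pos(i) == pos(i + p)
        return [s + q for s, q in zip(symbol_vectors[symbol], pos)]

    return emb


# Hypothetical example data: two symbols, period p = 3, dimension 2.
emb = make_additive_periodic_embedding(
    {"a": [1.0, 0.0], "b": [0.0, 1.0]},
    [[0.0, 0.0], [0.5, 0.5], [1.0, -1.0]],
)
```

With these values, `emb("a", 1)` and `emb("a", 4)` coincide, since positions 1 and 4 differ by the period 3.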

Lemma 3.

There is a polynomial function $\mathit{poly}\colon\mathbb{N}\to\mathbb{N}$ such that for all $T\in\mathcal{T}^{\textsc{fix}}_{\circ}$ and all words $w$ with $T(w)=1$ there is a word $w^{\prime}$ with $T(w^{\prime})=1$ and $|w^{\prime}|\leq 2^{\mathit{poly}(|T|)}$ .

Proof Sketch.

The polynomial $\mathit{poly}$ can be chosen uniformly for all $T\in\mathcal{T}^{\textsc{fix}}_{\circ}$ because for all positional embeddings of EOT in $\mathcal{T}^{\textsc{fix}}_{\circ}$ there is an upper bound on the period and on the bit-width in the underlying arithmetic. The small-word property stated by the lemma is then shown by arguing, given polynomial $\mathit{poly}$ , EOT $T$ and $|w|>2^{\mathit{poly}(|T|)}$ , that $w$ contains unnecessary subwords $u$ that can be cut out without changing the output of $T$ . Here, we exploit the fact that $T$ has some periodicity $p$ and only consider those $u$ whose length is a multiple of $p$ . This ensures that the remaining symbols of the resulting word $w^{\prime}$ , given by $w$ without $u$ , are embedded the same way as in $w$ by the positional embedding of $T$ . The existence of such subwords follows from $T$ ’s limited distinguishing capabilities, especially in its normalisations, due to the bounded representation size of numerical values possible in the underlying fixed-width arithmetic.

A formal proof relies on basic combinatorial arguments, but is technically extensive, and given in Appendix C. ∎
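The role of the period in the cutting argument can be checked on a toy example. The sketch below only illustrates the positional part of the argument: removing a subword whose length is a multiple of $p$ leaves the position class $i \bmod p$ of every remaining symbol unchanged. It does not show that the output of $T$ is preserved, which needs the combinatorial argument of the formal proof. All function names and the 0-based positions are assumptions of this example.

```python
from typing import List, Tuple


def cut(word: str, start: int, length: int) -> str:
    """Remove the subword word[start : start + length]."""
    return word[:start] + word[start + length:]


def embedded(word: str, p: int) -> List[Tuple[str, int]]:
    """Pair each symbol with its position class i mod p (0-based positions)."""
    return [(a, i % p) for i, a in enumerate(word)]


def remaining_embedding_preserved(word: str, start: int, length: int, p: int) -> bool:
    """Check that all symbols outside the cut keep their position class."""
    before = embedded(word, p)
    after = embedded(cut(word, start, length), p)
    return after == before[:start] + before[start + length:]
```

For the word `"abcabcabcab"` and period 3, cutting 6 symbols preserves the embedding of the remaining symbols, whereas cutting 4 symbols shifts the position classes of the suffix.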

Based on this preliminary result, we can then immediately derive an upper bound on the complexity of $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}_{\circ}]$ .

Theorem 4.

The satisfiability problem $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}_{\circ}]$ for EOT over fixed-width arithmetic using additive-periodical embeddings is in NEXPTIME.

Proof.

Let $T\in\mathcal{T}^{\textsc{fix}}_{\circ}$ be an EOT working over alphabet $\Sigma$ . We use a certifier-based understanding of a nondeterministic exponential-time algorithm as follows: We (a) guess an input $w\in\Sigma^{+}$ and (b) compute $T(w)$ to check whether $T(w)=1$ . For correctness, we need to argue that it suffices to consider words $w$ whose length is at most exponential in $|T|$ . This argument is given by Lemma 3. Note that, by the polynomial evaluation property, $T(w)$ can be computed in time polynomial in $|T|+|w|$ , and checking whether the result equals $1$ is then immediate. ∎

Next, we address the goal of obtaining a matching lower bound, i.e. NEXPTIME-hardness. An obvious way to do so would be to follow statement (2) of Theorem 3 and form a reduction from the bounded octant word-tiling problem. Hence, given a tiling system $\mathcal{S}$ and $n\in\mathbb{N}$ encoded binarily, we would have to construct, in time polynomial in $|\mathcal{S}|+\log n$ , an EOT $T_{\mathcal{S},n}\in\mathcal{T}^{\textsc{fix}}_{\circ}$ such that $T_{\mathcal{S},n}(w)=1$ for some $w\in\Sigma^{+}$ iff there is a word $w=t_{1,1},t_{2,1},t_{2,2},t_{3,1},\ldots,t_{n,n}$ representing a valid $\mathcal{S}$ -tiling. In particular, $T_{\mathcal{S},n}$ would have to be able to recognise the correct word length and reject input that is longer than $|w|=\frac{n(n+1)}{2}$ . This poses a problem for EOT with periodical embeddings. To recognise whether a word is too long, an EOT $T$ must ultimately rely on its positional embedding, which seems to make a periodicity of $p\geq\frac{n(n+1)}{2}$ necessary. Since the size of periodical EOT is linear in $p$ , we get an exponential blow-up in a potential reduction of $\textsc{OTWP}_{\mathsf{bin}}$ to $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}_{\circ}]$ , given that the value of $\frac{n(n+1)}{2}$ , and already that of $n$ , is exponential in the size of a binary representation of $n$ . This problem vanishes when the requirement of the underlying positional embedding to be periodical is lifted: allowing for non-periodical EOT, working over some fixed-width arithmetic, leads to an NEXPTIME-hard satisfiability problem. Let $\mathcal{T}^{\textsc{fix}}$ be defined similarly to $\mathcal{T}^{\textsc{fix}}_{\circ}$ , but we allow for non-periodical embeddings. Furthermore, we assume that the considered fixed-width arithmetics can handle overflow situations using saturation.
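The blow-up described above can be made concrete with a few numbers. The sketch below compares the size of the binary representation of $n$ with the required word length $\frac{n(n+1)}{2}$ , which is a lower bound on the period $p$ in the argument above; the helper names are hypothetical.

```python
from typing import List, Tuple


def tiling_word_length(n: int) -> int:
    """Length n(n+1)/2 of a word t_{1,1}, t_{2,1}, t_{2,2}, ..., t_{n,n}."""
    return n * (n + 1) // 2


def binary_size(n: int) -> int:
    """Number of bits needed to write n >= 1 in binary."""
    return n.bit_length()


# For n = 2^b - 1 the input needs b bits, but the word length is about 2^(2b - 1).
rows: List[Tuple[int, int]] = [(b, tiling_word_length(2**b - 1)) for b in (4, 8, 16)]
```

With 16 input bits the required word length already exceeds two billion, so any EOT whose size is linear in $p$ cannot be produced by a polynomial-time reduction along these lines.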

Theorem 5.

The satisfiability problem $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}]$ for EOT over fixed-width arithmetic is NEXPTIME-hard.

Proof sketch.

We establish a reduction from $\textsc{OTP}_{\mathsf{bin}}$ to $\textsc{Sat}[\mathcal{T}^{\textsc{fix}}]$ by constructing, for each instance $(\mathcal{S},n)$ of $\textsc{OTP}_{\mathsf{bin}}$ , an EOT $T_{\mathcal{S},n}$ working over some fixed-width arithmetic $F$ , which accepts exactly those $w$ with $|w|=\frac{n(n+1)}{2}$ corresponding to a valid word-encoded tiling for $\mathcal{S}$ . See Appendix A for details on tiling problems.

The construction is similar to the one given for $T_{\mathcal{S}}$ in the proof of Theorem 1, but we need to enable $T_{\mathcal{S},n}$ to reject words that are too long with respect to a polynomial bound depending on $n$ . This requires that $T_{\mathcal{S},n}$ , based on the positional embedding $\mathit{emb}$ specified in Section 4, is able to check for all symbols whether their respective position is less than or equal to a predefined bound. This can be achieved with similar tools as used in Lemma 2.

Furthermore, we need to ensure that $T_{\mathcal{S},n}$ works as intended, despite the fact that it is limited by $F$ . The arguments follow the same line as the proof of Theorem 2. A formal proof is given in Appendix C. ∎

6 Summary and outlook

We investigated the satisfiability problem of encoder-only transformers (EOT) through the lens of formal reasoning. In particular, we considered the computability and complexity of the satisfiability problem Sat of EOT in the context of different classes of EOT, forming a baseline for understanding possibilities and challenges of formal reasoning for transformers.

We showed that Sat is undecidable for classes of EOT recently considered in research on the expressiveness of different transformer models (Theorem 1 and Theorem 2). This implies that formal reasoning is impossible as soon as we consider classes of EOT that are at least as expressive as the classes considered in these results. We remark that this result also translates to encoder-decoder architectures whose encoder part is as expressive as the EOT considered here.

Additionally, we identified two ways to make formal reasoning for EOT possible: either we bound the length of considered inputs (Theorem 3) or we consider quantized EOT, meaning EOT whose computations and parameters are limited by some fixed-width arithmetic (Theorem 4). These results imply that formal reasoning is possible as long as we consider classes of EOT that are at most as expressive as the classes considered in these results. We remark that this statement makes the reasonable assumption that the driving force for any upper computability or complexity bound is the expressiveness of the EOT, not the intricacies of the considered safety or interpretability assumptions. However, in both cases Sat remains difficult (Theorem 3 and Theorem 5) from a complexity perspective. Again, these results are only valid for classes of EOT that are at least as expressive as the ones considered.

While our results build a first framework for understanding possibilities and challenges of formal reasoning for transformers, there is room for more detailed investigations. Firstly, consider our undecidability and hardness results. These rely on the fact that we consider normalisations realised by the hardmax function. However, it is unclear whether similar results can be achieved if we stick to normalisations realised by the commonly used softmax function. Furthermore, it would be of interest to further investigate the interplay of embedding function and internal structure of the considered EOT. We expect that less expressive embeddings demand a richer structure of the attention mechanisms, but it is unclear where the limits are in the sense that undecidability of the satisfiability problem is still given. Secondly, consider our decidability and upper complexity bound results. It could be of practical interest to take a more detailed look at specifics of particular fixed-width arithmetics. While this will not change our results, it could give tighter time-complexity estimates which may be of interest in certain formal reasoning applications.

References

[1] M. S. Baranowski, S. He, M. Lechner, T. S. Nguyen, and Z. Rakamaric. An SMT theory of fixed-point arithmetic. In N. Peltier and V. Sofronie-Stokkermans, editors, Automated Reasoning - 10th International Joint Conference, IJCAR 2020, Paris, France, July 1-4, 2020, Proceedings, Part I, volume 12166 of Lecture Notes in Computer Science, pages 13–31. Springer, 2020.
[2] R. Berger. The undecidability of the domino problem. Mem. Amer. Math. Soc., 66:72, 1966.
[3] S. Bhattamishra, A. Patel, and N. Goyal. On the computational power of transformers and its implications in sequence modeling. In R. Fernández and T. Linzen, editors, Proceedings of the 24th Conference on Computational Natural Language Learning, CoNLL 2020, Online, November 19-20, 2020, pages 455–475. Association for Computational Linguistics, 2020.
[4] G. Bonaert, D. I. Dimitrov, M. Baader, and M. Vechev. Fast and precise certification of transformers. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2021, pages 466–481. Association for Computing Machinery, 2021.
[5] Y. Bondarenko, M. Nagel, and T. Blankevoort. Understanding and overcoming the challenges of efficient transformer quantization. In M. Moens, X. Huang, L. Specia, and S. W. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 7947–7969. Association for Computational Linguistics, 2021.
[6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[7] D. Chiang, P. Cholak, and A. Pillay. Tighter bounds on the expressivity of transformer encoders. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 5544–5562. PMLR, 2023.
[8] G. A. Constantinides, F. Dahlqvist, Z. Rakamaric, and R. Salvia. Rigorous roundoff error analysis of probabilistic floating-point computations. In A. Silva and K. R. M. Leino, editors, Computer Aided Verification - 33rd International Conference, CAV 2021, Virtual Event, July 20-23, 2021, Proceedings, Part II, volume 12760 of Lecture Notes in Computer Science, pages 626–650. Springer, 2021.
[9] S. Demri, V. Goranko, and M. Lange. Temporal Logics in Computer Science. Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, 2016.
[10] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
[11] X. Dong, A. T. Luu, R. Ji, and H. Liu. Towards robustness against natural language word substitutions. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
[12] P. Dufter, M. Schmitt, and H. Schütze. Position information in transformers: An overview. Comput. Linguistics, 48(3):733–763, 2022.
[13] M. Hahn. Theoretical limitations of self-attention in neural sequence models. Trans. Assoc. Comput. Linguistics, 8:156–171, 2020.
[14] Y. Hao, D. Angluin, and R. Frank. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. Trans. Assoc. Comput. Linguistics, 10:800–810, 2022.
[15] Y. Hsieh, M. Cheng, D. Juan, W. Wei, W. Hsu, and C. Hsieh. On the robustness of self-attentive models. In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 1520–1529. Association for Computational Linguistics, 2019.
[16] X. Huang, W. Ruan, W. Huang, G. Jin, Y. Dong, C. Wu, S. Bensalem, R. Mu, Y. Qi, X. Zhao, K. Cai, Y. Zhang, S. Wu, P. Xu, D. Wu, A. Freitas, and M. A. Mustafa. A survey of safety and trustworthiness of large language models through the lens of verification and validation. CoRR, abs/2305.11391, 2023.
[17] J. Marques-Silva and A. Ignatiev. Delivering trustworthy AI through formal XAI. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 12342–12350. AAAI Press, 2022.
[18] W. Merrill and A. Sabharwal. A logic for expressing log-precision transformers. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
[19] W. Merrill and A. Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023.
[20] W. Merrill, A. Sabharwal, and N. A. Smith. Saturated transformers are constant-depth threshold circuits. Trans. Assoc. Comput. Linguistics, 10:843–856, 2022.
[21] OpenAI. GPT-4 technical report, 2023.
[22] D. W. Otter, J. R. Medina, and J. K. Kalita. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Networks Learn. Syst., 32(2):604–624, 2021.
[23] J. Pérez, P. Barceló, and J. Marinkovic. Attention is Turing-complete. J. Mach. Learn. Res., 22:75:1–75:35, 2021.
[24] M. Sälzer and M. Lange. Fundamental limits in formal verification of message-passing neural networks. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
[25] Z. Shi, H. Zhang, K. Chang, M. Huang, and C. Hsieh. Robustness verification for transformers. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
[26] L. Strobl, W. Merrill, G. Weiss, D. Chiang, and D. Angluin. What Formal Languages Can Transformers Express? A Survey. Transactions of the Association for Computational Linguistics, 12:543–561, 05 2024.
[27] P. van Emde Boas. The convenience of tilings. In A. Sorbi, editor, Complexity, Logic, and Recursion Theory, volume 187 of Lecture notes in pure and applied mathematics, pages 331–363. Marcel Dekker, Inc., 1997.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017.
[29] G. Weiss, Y. Goldberg, and E. Yahav. Thinking like transformers. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 11080–11090. PMLR, 2021.
[30] Y. Yu, X. Si, C. Hu, and J. Zhang. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput., 31(7):1235–1270, 2019.
[31] H. Zhao, H. Chen, F. Yang, N. Liu, H. Deng, H. Cai, S. Wang, D. Yin, and M. Du. Explainability for large language models: A survey. ACM Trans. Intell. Syst. Technol., 15(2):20:1–20:38, 2024.

Appendix A Tiling Problems

We make use of particular tiling problems in order to prove lower bounds on the complexity and decidability of $\textsc{Sat}[\mathcal{T}]$ for different classes $\mathcal{T}$ .

A tiling system is a tuple $\mathcal{S}=(S,H,V,t_{I},t_{F})$ where $S$ is a finite set; its elements are called tiles. $H,V\subseteq S\times S$ define a horizontal, resp. vertical matching relation between tiles, and $t_{I},t_{F}$ are two designated initial, resp. final tiles in $S$ .

Problems associated with tiling systems are typically of the following form: given a discrete convex plane consisting of cells with horizontal and vertical neighbors, is it possible to cover the plane with tiles from $S$ in a way that horizontally adjacent tiles respect the relation $H$ and vertically adjacent tiles respect the relation $V$ , together with some additional constraints about where to put the initial and final tile $t_{I},t_{F}$ ? Such tiling problems, in particular for rectangular planes, have proved to be extremely useful in computational complexity, cf. [2, 27], since they can be seen as abstract versions of halting problems.

We need a variant in which the plane to be tiled is of triangular shape. The $n$ -th triangle is $\mathcal{O}_{n}=\{(i,j)\in\mathbb{N}\times\mathbb{N}\mid j\leq i\leq n\}$ for $n>0$ . An ( $\mathcal{S}$ )-tiling of $\mathcal{O}_{n}$ is a function $\tau:\mathcal{O}_{n}\to S$ s.t.

• $(\tau(i,j),\tau(i,j+1))\in H$ for all $(i,j)\in\mathcal{O}_{n}$ with $j<i\leq n$ ,
• $(\tau(i,j),\tau(i+1,j))\in V$ for all $(i,j)\in\mathcal{O}_{n}$ with $j\leq i<n$ .

Such a tiling is successful if additionally $\tau(0,0)=t_{I}$ and $\tau(i,i)=t_{F}$ for some $(i,i)\in\mathcal{O}_{n}$ .
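The definition of a successful tiling translates directly into a checker. The following sketch assumes 0-based coordinates, i.e. $\mathcal{O}_{n}$ contains all $(i,j)$ with $0\leq j\leq i\leq n$ , in line with the condition $\tau(0,0)=t_{I}$ ; the function name and the representation of $\tau$ as a dictionary are choices made for this example.

```python
from typing import Dict, Set, Tuple

Tile = str
Cell = Tuple[int, int]


def is_successful_tiling(
    tau: Dict[Cell, Tile],
    n: int,
    H: Set[Tuple[Tile, Tile]],
    V: Set[Tuple[Tile, Tile]],
    t_init: Tile,
    t_final: Tile,
) -> bool:
    """Check whether tau is a successful tiling of the triangle O_n."""
    cells = {(i, j) for i in range(n + 1) for j in range(i + 1)}
    if set(tau) != cells:
        return False  # tau must be defined exactly on O_n
    for (i, j) in cells:
        # horizontal condition, required for j < i <= n
        if j < i and (tau[(i, j)], tau[(i, j + 1)]) not in H:
            return False
        # vertical condition, required for j <= i < n
        if i < n and (tau[(i, j)], tau[(i + 1, j)]) not in V:
            return False
    if tau[(0, 0)] != t_init:
        return False
    return any(tau[(i, i)] == t_final for i in range(n + 1))
```

The check visits each of the $\frac{(n+1)(n+2)}{2}$ cells once, so it runs in time quadratic in $n$ , assuming constant-time membership tests for $H$ and $V$ .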

The unbounded octant tiling problem ( $\textsc{OTP}^{*}$ ) is: given a tiling system $\mathcal{S}$ , decide whether a successful $\mathcal{S}$ -tiling of $\mathcal{O}_{n}$ exists for some $n\in\mathbb{N}$ . The bounded octant tiling problem (OTP) is: given a tiling system $\mathcal{S}$ and an $n\geq 1$ , decide whether a successful $\mathcal{S}$ -tiling of $\mathcal{O}_{n}$ exists. Note that here, $n$ is part of the input, and that it can be represented differently, for example in binary or in unary encoding. We distinguish these two cases by referring to $\textsc{OTP}_{\mathsf{bin}}$ and $\textsc{OTP}_{\mathsf{un}}$ .

It is well-known that $\textsc{OTP}^{*}$ is undecidable [27]. It is also not hard to imagine that $\textsc{OTP}_{\mathsf{un}}$ is NP-complete while $\textsc{OTP}_{\mathsf{bin}}$ is NEXPTIME-complete. In fact, this is well-known for the variants in which the underlying plane is not a triangle of height $n$ but a square of height $n$ [27]. The exponential difference incurred by the more compact binary representation of the input parameter $n$ is best seen when regarding the upper complexity bound for these problems: given $n$ , a nondeterministic algorithm can simply guess all the $n^{2}$ many tiles of the underlying square and verify the horizontal and vertical matchings in time $\mathcal{O}(n^{2})$ . If $n$ is encoded unarily, i.e. the space needed to write it down is $s:=n$ , then the time needed for this is polynomial in the input size $s$ ; if $n$ is encoded binarily with space $s:=\lceil\log n\rceil$ then the time needed for this is exponential in $s$ .

It then remains to argue that the tiling problems based on triangular planes are also NP- resp. NEXPTIME-complete. Clearly, the upper bounds can be established with the same guess-and-check procedure. For the lower bounds it suffices to observe that hardness of the tiling problems for the squares is established by a reduction from the halting problem for Turing machines (TM) such that a square of size $n\times n$ represents a run of the TM of length $n$ as a sequence of rows, and each row represents a configuration of the TM using at most $n$ tape cells. This makes use of the observation that the space consumption of a TM can never exceed the time consumption. Likewise, assuming that a TM always starts a computation with its head on the very left end of a tape, one can easily observe that after $i$ time steps, it can change at most the $i$ leftmost tape cells. Hence, a run of a TM can also be represented as a triangle with its first configuration of length 1 in row 1, the second of length 2 in row 2 etc.

At last, we consider two slight modifications of these two problems which are easily seen to preserve undecidability resp. NP- and NEXPTIME-completeness. The unbounded octant tiling-word problem ( $\textsc{OTWP}^{*}$ ) is: given some $\mathcal{S}=(S,H,V,t_{I},t_{F})$ , decide whether there is a word $t_{0,0},t_{1,0},t_{1,1},t_{2,0},t_{2,1},t_{2,2},\ldots,t_{n,n}\in S^{*}$ for some $n\in\mathbb{N}$ , s.t. the tiling $\tau$ defined by $\tau(i,j):=t_{i,j}$ comprises a successful tiling of $\mathcal{O}_{n}$ . The two variants of the bounded octant tiling-word problem are both: given some $\mathcal{S}$ as above and $n$ , decide whether such a word exists. Note that, again, here $n$ is an input parameter, and so its representation may affect the complexity of the problem, leading to the distinction between $\textsc{OTWP}_{\mathsf{bin}}$ with binary encoding and $\textsc{OTWP}_{\mathsf{un}}$ with unary encoding.
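The correspondence between tilings and tiling words is a plain row-by-row serialisation. The sketch below uses the 0-based indexing $t_{0,0},t_{1,0},t_{1,1},\ldots,t_{n,n}$ of this appendix, so a word for $\mathcal{O}_{n}$ has length $\frac{(n+1)(n+2)}{2}$ ; the function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def tiling_to_word(tau: Dict[Cell, str], n: int) -> List[str]:
    """Serialise a tiling of O_n row by row: t_{0,0}, t_{1,0}, t_{1,1}, ..., t_{n,n}."""
    return [tau[(i, j)] for i in range(n + 1) for j in range(i + 1)]


def word_to_tiling(word: List[str]) -> Optional[Tuple[int, Dict[Cell, str]]]:
    """Invert tiling_to_word; return None if len(word) is not (n+1)(n+2)/2 for any n."""
    n = 0
    while (n + 1) * (n + 2) // 2 < len(word):
        n += 1
    if (n + 1) * (n + 2) // 2 != len(word):
        return None
    symbols = iter(word)
    tau = {(i, j): next(symbols) for i in range(n + 1) for j in range(i + 1)}
    return n, tau
```

Since both directions are computable in time linear in the word length, a witness for one formulation can be converted into a witness for the other without affecting the complexity bounds.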

Theorem 6.

a) $\textsc{OTWP}^{*}$ is undecidable ( $\Sigma^{0}_{1}$ -complete).
b) $\textsc{OTWP}_{\mathsf{bin}}$ is NEXPTIME-complete.
c) $\textsc{OTWP}_{\mathsf{un}}$ is NP-complete.

Proof.

(a) It should be clear that a tiling problem and its tiling-word variant (like $\textsc{OTP}^{*}$ and $\textsc{OTWP}^{*}$ ) are interreducible since they only differ in the formulation of how the witness for a successful tiling should be presented. So they are essentially the same problems. Undecidability of $\textsc{OTP}^{*}$ and, thus, $\textsc{OTWP}^{*}$ is known from [27]; the $\Sigma^{0}_{1}$ -upper bound can be obtained through a semi-decision procedure that searches through the infinite space of $\mathcal{O}_{n}$ -tilings for all $n\geq 1$ . This justifies the statement in part (a) of Thm. 6.

(b) With the same argument as in (a) it suffices to consider $\textsc{OTP}_{\mathsf{bin}}$ instead of $\textsc{OTWP}_{\mathsf{bin}}$ . The upper bound is easy to see: a nondeterministic procedure can easily guess a tiling for $\mathcal{O}_{n}$ and verify the horizontal and vertical matching conditions, as well as the use of the initial and final tile in appropriate places. This is possible in time $\mathcal{O}(n^{2})$ , resp. $\mathcal{O}(2^{2\log n})$ which is therefore exponential in the input size $\lceil\log n\rceil$ for binarily encoded parameters $n$ . This shows inclusion in NEXPTIME.

For the lower bound we argue that the halting problem for nondeterministic, exponential-time bounded TM can be reduced to $\textsc{OTP}_{\mathsf{bin}}$ : given a nondeterministic TM $\mathcal{M}$ over input alphabet $\Sigma$ and tape alphabet $\Gamma$ that halts after at most $2^{p(n)}$ steps on input words of length $n$ for some polynomial $p$ , and a word $w\in\Sigma^{*}$ , we first construct a TM $\mathcal{M}_{w}$ that is started on the empty tape and begins by writing $w$ onto the tape and then simulates $\mathcal{M}$ on it. This is a standard construction in complexity theory, and it is easy to see that the running time of $\mathcal{M}_{w}$ is bounded by a function $2^{p^{\prime}(|w|)}$ for some polynomial $p^{\prime}$ . With the observation made above, a computation of $\mathcal{M}_{w}$ can be seen as a sequence of configurations $C_{1},\ldots,C_{2^{p^{\prime}(|w|)}}$ , with $|C_{i}|=i$ . This does not directly define a tiling system. Instead, and again by a standard trick, cf. [27] or [9, Chp. 11], one compresses three adjacent tape cells into one tile in order to naturally derive a horizontal matching relation from overlaps between such triples and a vertical matching relation from the TM’s transition function. At last, let $n^{\prime}:=2^{p^{\prime}(|w|)}$ , and note that the binary representation of $n^{\prime}$ has size polynomial in $|w|$ . It is then a simple exercise to verify that a successful tiling of the triangle $\mathcal{O}_{n^{\prime}}$ corresponds to an accepting run of $\mathcal{M}$ on $w$ and vice-versa, which establishes NEXPTIME-hardness.

(c) This is done exactly along the same lines as part (b), but instead making use of the fact that, when $n$ is given in unary encoding, $p(n)$ is polynomial in the size of the representation of $n$ , and hence, the time needed for the guess-and-check procedure in the upper bound is only polynomial, and for the lower bound we need to assume that the running time of the TM is polynomially bounded. Thus, we get NP-completeness instead of NEXPTIME-completeness. ∎

Appendix B Proofs of Section 4

In the following, we give formal proofs for the undecidability results of Section 4. To do so, we make use of classical Feed-Forward Neural Networks.

Feed-Forward Neural Network

A neuron $v$ is a computational unit computing a function $\mathbb{R}^{m}\rightarrow\mathbb{R}$ by $v(x_{1},\dotsc,x_{m})=\sigma(b+\sum_{i=1}^{m}w_{i}x_{i})$ where $\sigma$ is a function called activation and $b,w_{i}$ are parameters called bias resp. weight. A layer $l$ is a tuple of neurons $(v_{1},\dotsc,v_{n})$ where we assume that all neurons have the same input dimensionality $m$ . Therefore, $l$ computes a function $\mathbb{R}^{m}\rightarrow\mathbb{R}^{n}$ . We call $n$ the size of layer $l$ . Let $l_{1}$ be a layer with input dimensionality $m$ and $l_{k}$ a layer of size $n$ . A Feed-Forward Neural Network (FNN) $N$ is a tuple $(l_{1},\dotsc,l_{k})$ of layers where we assume that for all $i\leq k-1$ the size of $l_{i}$ equals the input dimensionality of $l_{i+1}$ . Therefore, $N$ computes a function $\mathbb{R}^{m}\rightarrow\mathbb{R}^{n}$ by processing an input layer by layer.
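The layer-by-layer evaluation can be sketched as follows. The representation of a neuron as a pair of bias and weight list, the names `eval_layer` and `eval_fnn`, and the default choice of $\max(0,x)$ as activation are assumptions made purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Neuron = Tuple[float, List[float]]  # (bias b, weights w_1, ..., w_m)
Layer = List[Neuron]


def relu(x: float) -> float:
    return max(0.0, x)


def eval_layer(
    layer: Layer, x: Sequence[float], sigma: Callable[[float], float] = relu
) -> List[float]:
    """Apply every neuron v(x) = sigma(b + sum_i w_i * x_i) of the layer to x."""
    for _, weights in layer:
        assert len(weights) == len(x), "input dimensionality mismatch"
    return [sigma(b + sum(w_i * x_i for w_i, x_i in zip(w, x))) for b, w in layer]


def eval_fnn(
    layers: List[Layer], x: Sequence[float], sigma: Callable[[float], float] = relu
) -> List[float]:
    """Process the input layer by layer."""
    out = list(x)
    for layer in layers:
        out = eval_layer(layer, out, sigma)
    return out


# Hypothetical two-layer example computing |x| as relu(relu(x) + relu(-x)).
abs_net: List[Layer] = [
    [(0.0, [1.0]), (0.0, [-1.0])],
    [(0.0, [1.0, 1.0])],
]
```

Here `eval_fnn(abs_net, [-3.0])` yields `[3.0]`: the first layer produces the two values 0 and 3, which the second layer sums.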

In particular, we use specific FNN with $\mathit{relu}(x)=\max(0,x)$ activations, called gadgets, to derive lower bounds in connection with the expressibility of transformers. We denote the class of all FNN with $\mathit{relu}$ activations by $\mathcal{N}(\mathit{relu})$ .

Lemma 4.

Let $k\in\mathbb{R}^{>0}$ . There are basic gadgets

1. $N_{|\cdot|}\in\mathcal{N}(\mathit{relu})$ computing $N_{|\cdot|}(x)=|x|$ ,
2. $N_{<}\in\mathcal{N}(\mathit{relu})$ computing a function $\mathbb{R}^{2}\rightarrow\mathbb{R}$ such that $N_{<}(x_{1},x_{2})=0$ if $(x_{1}+1)-x_{2}\leq 0$ , $N_{<}(x_{1},x_{2})=(x_{1}+1)-x_{2}$ if $(x_{1}+1)-x_{2}\in(0;1)$ and $N_{<}(x_{1},x_{2})=1$ otherwise,
3. $N_{=}\in\mathcal{N}(\mathit{relu})$ computing a function $\mathbb{R}^{2}\rightarrow\mathbb{R}$ such that $N_{=}(x_{1},x_{2})=0$ if $x_{1}-x_{2}=0$ , $N_{=}(x_{1},x_{2})=|x_{2}-x_{1}|$ if $|x_{2}-x_{1}|\in(0;1)$ and $N_{=}(x_{1},x_{2})=1$ otherwise,
4. $N_{\rightarrow}\in\mathcal{N}(\mathit{relu})$ computing a function $\mathbb{R}^{2}\rightarrow\mathbb{R}$ such that for all inputs $x_{1},x_{2}$ with $x_{1}\in\{0,1\}$ and $x_{2}\in[0;k]$ it holds that $N_{\rightarrow}(x_{1},x_{2})=0$ if $x_{1}=x_{2}=0$ or $x_{1}=1$ and $N_{\rightarrow}(x_{1},x_{2})=\mathit{relu}(x_{2})$ otherwise.

Proof.

Let $N_{|\cdot|}$ be the minimal FNN computing $\mathit{relu}(\mathit{relu}(-x)+\mathit{relu}(x))$ , let $N_{<}$ be the minimal FNN computing $\mathit{relu}(f_{<}(x_{1},x_{2})-f_{<}(x_{1},x_{2}+1))$ where $f_{<}(y_{1},y_{2})=\mathit{relu}(y_{1}-y_{2}+1)$ and let $N_{=}$ be the minimal FNN computing $\mathit{relu}(f_{=}(x_{1},x_{2})-f_{=}(x_{1}+1,x_{2})+f_{=}(x_{2},x_{1})-f_{=}(x_{2}+1,x_{1}))$ where $f_{=}(y_{1},y_{2})=\mathit{relu}(y_{2}-y_{1})$ . The claims of the lemma regarding these gadgets are straightforward given their functional form. Let $N_{\rightarrow}$ be the minimal FNN computing $\mathit{relu}(\mathit{relu}(x_{2})-k\cdot\mathit{relu}(x_{1}))$ . As stated in the lemma, we assume that $x_{1}\in\{0,1\}$ and $x_{2}\in[0;k]$ . Then, $-k\cdot\mathit{relu}(x_{1})$ is $-k$ if $x_{1}=1$ and $0$ if $x_{1}=0$ . Thus, $N_{\rightarrow}$ is guaranteed to be $0$ if $x_{1}=1$ and otherwise it depends on $x_{2}$ . This gives the claim regarding gadget $N_{\rightarrow}$ . ∎
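The functional forms used in this proof can be evaluated directly, which makes the case distinctions of Lemma 4 easy to check on sample inputs. The following sketch implements the four expressions as plain Python functions rather than as explicit weight matrices; the function names are hypothetical.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def n_abs(x: float) -> float:
    """relu(relu(-x) + relu(x)), i.e. |x|."""
    return relu(relu(-x) + relu(x))


def n_less(x1: float, x2: float) -> float:
    """relu(f(x1, x2) - f(x1, x2 + 1)) with f(y1, y2) = relu(y1 - y2 + 1)."""
    def f(y1: float, y2: float) -> float:
        return relu(y1 - y2 + 1)
    return relu(f(x1, x2) - f(x1, x2 + 1))


def n_eq(x1: float, x2: float) -> float:
    """relu(f(x1,x2) - f(x1+1,x2) + f(x2,x1) - f(x2+1,x1)) with f(y1,y2) = relu(y2 - y1)."""
    def f(y1: float, y2: float) -> float:
        return relu(y2 - y1)
    return relu(f(x1, x2) - f(x1 + 1, x2) + f(x2, x1) - f(x2 + 1, x1))


def n_implies(x1: float, x2: float, k: float) -> float:
    """relu(relu(x2) - k * relu(x1)), intended for x1 in {0, 1} and x2 in [0, k]."""
    return relu(relu(x2) - k * relu(x1))
```

On integer inputs, `n_less` returns 0 exactly when $x_{1}<x_{2}$ and `n_eq` returns 0 exactly when $x_{1}=x_{2}$ , so in both gadgets the value 0 signals that the tested condition holds.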

We will combine gadgets in different ways. Let $N_{1}$ and $N_{2}$ be FNN with the same input dimensionality $m$ and output dimensionality $n_{1}$ respectively $n_{2}$. We extend the computation of $N_{1}$ to functions $\mathbb{R}^{m^{\prime}}\rightarrow\mathbb{R}^{n_{1}}$ with $m<m^{\prime}$ by weighting additional dimensions with $0$ in the input layer. Given a set of input dimensions $x_{1},\dotsc,x_{m^{\prime}}$, we denote the effective dimensions $x_{i_{1}},\dotsc,x_{i_{m}}$ with pairwise different $i_{j}\in\{1,\dotsc,m^{\prime}\}$ by $N^{x_{i_{1}},\dotsc,x_{i_{m}}}_{1}$. Formally, this means that $N^{x_{i_{1}},\dotsc,x_{i_{m}}}_{1}(x_{1},\dotsc,x_{m^{\prime}})=N_{1}(x_{i_{1}},\dotsc,x_{i_{m}})$ for all inputs. We denote the FNN consisting of $N_{1}$ and $N_{2}$ placed next to each other by $N_{1}|\!|N_{2}$. Formally, this is done by combining $N_{1}$ and $N_{2}$ layer by layer using $0$ weights in intersecting connections. Then, $N_{1}|\!|N_{2}$ computes $\mathbb{R}^{m}\rightarrow\mathbb{R}^{n_{1}+n_{2}}$ given by $N_{1}|\!|N_{2}(\boldsymbol{x})=(N_{1}(\boldsymbol{x}),N_{2}(\boldsymbol{x}))$. We generalize this operation to $k$ FNN $N_{1}|\!|\dotsb|\!|N_{k}$ in the obvious sense. Let $N_{3}$ be an FNN with input dimensionality $n_{1}$ and output dimensionality $n_{3}$. We denote the FNN consisting of $N_{1}$ and $N_{3}$ placed sequentially by $N_{3}\circ N_{1}$. Formally, this is done by connecting the output layer of $N_{1}$ with the input layer of $N_{3}$. Then, $N_{3}\circ N_{1}$ computes $\mathbb{R}^{m}\rightarrow\mathbb{R}^{n_{3}}$ given by $N_{3}\circ N_{1}(\boldsymbol{x})=N_{3}(N_{1}(\boldsymbol{x}))$.
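On the level of computed functions, the three operations can be sketched as follows. This is an illustration under the assumption that an FNN is modelled as a plain Python function from a vector to a vector; the names `select`, `parallel` and `compose` are ours, and the sketch deliberately ignores the weight-level construction.

```python
from typing import Callable, List, Sequence

Net = Callable[[Sequence[float]], List[float]]


def relu(x: float) -> float:
    return max(0.0, x)


def select(net: Net, indices: Sequence[int]) -> Net:
    # N^{x_{i_1},...,x_{i_m}}: run `net` on the chosen (0-based) input dimensions
    return lambda x: net([x[i] for i in indices])


def parallel(*nets: Net) -> Net:
    # N_1 || ... || N_k: all nets read the same input, outputs are concatenated
    def combined(x: Sequence[float]) -> List[float]:
        out: List[float] = []
        for net in nets:
            out.extend(net(x))
        return out
    return combined


def compose(outer: Net, inner: Net) -> Net:
    # N_3 o N_1: feed the output of `inner` into `outer`
    return lambda x: outer(inner(x))


# Example: relu(relu(x_1 + x_3) + relu(x_2 - x_3)) on inputs from R^3
n_sum: Net = lambda v: [relu(v[0] + v[1])]
n_diff: Net = lambda v: [relu(v[0] - v[1])]
both = parallel(select(n_sum, [0, 2]), select(n_diff, [1, 2]))
net = compose(n_sum, both)
```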

We also consider specific gadgets needed in the context of tiling problems.

Lemma 5.

Let $S\subseteq\mathbb{N}$ be a finite set and $R\subseteq S^{2}$. There is an FNN $N_{R}\in\mathcal{N}(\mathit{relu})$ computing $\mathbb{R}^{2}\rightarrow\mathbb{R}$ such that $N_{R}(x_{1},x_{2})\in\{0,1\}$ if $(x_{1},x_{2})\in S^{2}$ and $N_{R}(x_{1},x_{2})=0$ iff $(x_{1},x_{2})\in R$, and for each $t\in S$ there is an FNN $N_{=t}\in\mathcal{N}(\mathit{relu})$ computing $\mathbb{R}\rightarrow\mathbb{R}$ such that $N_{=t}(x)\in\{0,1\}$ for each $x\in\mathbb{N}$ and $N_{=t}(x)=0$ iff $x=t$.

Proof.

Let $S\subseteq\mathbb{N}$ be finite, $R\subseteq S^{2}$ and $t\in S$ . First, consider $N_{=t}$ . Let $N_{t}$ be the minimal FNN computing $\mathit{relu}(0\cdot x+t)$ and $N_{\mathit{id}}$ be the minimal FNN computing $(\mathit{relu}(x),-\mathit{relu}(-x))$ . Obviously, $N_{t}$ computes the constant $t$ function and $N_{\mathit{id}}$ computes the identity in the form of two dimensional vectors. Let $N_{=t}$ be given by the minimal FNN computing $N_{=}\circ(N_{\mathit{id}}|\!|N_{t})$ with the slight alteration that the two output dimensions of $N_{\mathit{id}}$ are connected to the first dimension of $N_{=}$ . Then, the claim of the lemma regarding $N_{=t}$ follows from Lemma 4 and the operations on FNN described in Appendix B.

Now, consider $N_{R}$. Given some $s\in S$ let $R[s]=\{r\mid(s,r)\in R\}$. Let $N^{k}_{\land}$ be the minimal FNN computing $\mathit{relu}(x_{1}+\dotsb+x_{k})$. Furthermore, let $N_{\in T}$ for some set $T\subseteq S$ be the minimal FNN such that $N_{\in T}(x)=0$ if $x\in T$ and $N_{\in T}(x)=1$ if $x\in S\setminus T$. A construction for $N_{\in T}$ is given in Theorem 4 in [24]. According to this construction, $N_{\in T}$ consists of three layers and its size is polynomial in $|T|$. In the case that $T=\emptyset$ we assume that $N_{\in\emptyset}$ is the constant $1$ function represented by a suitable FNN. Then, $N_{R}$ is given by $N^{|S|}_{\land}\circ((N_{\rightarrow}\circ(N_{=s_{1}}|\!|N_{\in R[s_{1}]}))|\!|\dotsb|\!|(N_{\rightarrow}\circ(N_{=s_{|S|}}|\!|N_{\in R[s_{|S|}]})))$ for some arbitrary order on $S$ with the slight alteration that $N_{R}$ has two input dimensions, meaning that each subnet $(N_{=s_{i}}|\!|N_{\in R[s_{i}]})$ is connected to the same two input dimensions. Again, the claim of the lemma regarding $N_{R}$ follows from Lemma 4 and the operations on FNN described in Appendix B. ∎
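The shape of $N_{R}$ can be illustrated on the level of computed functions. The sketch below is hedged in two ways: the helper names are ours, and `n_in` is a simple stand-in for $N_{\in T}$ that is only correct on integer inputs; it is not the three-layer construction of Theorem 4 in [24].

```python
def relu(x):
    return max(0.0, x)


def n_eq(x1, x2):
    # N_= from Lemma 4, written in terms of d = x2 - x1
    d = x2 - x1
    return relu(relu(d) - relu(d - 1) + relu(-d) - relu(-d - 1))


def n_impl(x1, x2, k=1.0):
    # N_-> from Lemma 4; here x2 always lies in [0, 1], so k = 1 suffices
    return relu(relu(x2) - k * relu(x1))


def n_eq_t(t):
    # N_{=t}: 0 iff x == t (for natural numbers x)
    return lambda x: n_eq(x, t)


def n_in(T):
    # Stand-in for N_{in T} (assumption: integer inputs only).
    # relu(1 - |x - t|) is 1 iff x == t, so `hits` is 1 iff x is in T.
    def net(x):
        hits = sum(relu(1 - (relu(x - t) + relu(t - x))) for t in T)
        return relu(1 - hits)
    return net


def n_rel(S, R):
    # N_R: sum over s in S of N_->(N_{=s}(x1), N_{in R[s]}(x2))
    order = sorted(S)
    successors = {s: {r for (a, r) in R if a == s} for s in order}

    def net(x1, x2):
        return relu(sum(n_impl(n_eq_t(s)(x1), n_in(successors[s])(x2))
                        for s in order))
    return net
```

For every $x_{1}\in S$ exactly one summand has its first argument equal to $0$, so the sum is $0$ or $1$, and it is $0$ exactly if $x_{2}\in R[x_{1}]$.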

With these gadgets in place, we can formally prove the results of Section 4.

Proof of Lemma 1.

Let $w=t_{0,0}t_{1,0}t_{1,1}t_{2,0}\dotsb t_{m,n}\in S^{+}$ be as stated in the lemma and assume some order $a_{i}$ on $S$. Furthermore, let $\mathit{emb}(a_{i},1)=(1,1,1,1,i)$ and $\mathit{emb}(a_{i},j)=(0,1,j,\sum_{h=0}^{j}h,i)$ if $j>1$. Let $\mathit{emb}(w)=\boldsymbol{x}^{0}_{1}\dotsb\boldsymbol{x}^{0}_{k}$. In the following, we build two layers $l_{1}$ and $l_{2}$ using components allowed in $\mathcal{T}_{\mathit{udec}}$, satisfying the statement of the lemma. Layer $l_{1}$ consists of a single attention head $\mathit{att}_{1,1}=(\mathit{score}_{1,1},\mathit{pool}_{1,1})$. The scoring function is given by $\mathit{score}_{1,1}(\boldsymbol{x}^{0}_{i},\boldsymbol{x}^{0}_{j})=N_{1,1}(\langle Q_{1,1}\boldsymbol{x}^{0}_{i},K_{1,1}\boldsymbol{x}^{0}_{j}\rangle)$ where $Q_{1,1}=[(0,0,-1,0,0),(0,1,0,0,0),(0,1,0,0,0)]$ and $K_{1,1}=[(0,1,0,0,0),(0,1,0,0,0),(0,0,0,1,0)]$ and $N_{1,1}(x)=-\mathit{relu}(x)$. We have that $\mathit{score}_{1,1}(\boldsymbol{x}^{0}_{i},\boldsymbol{x}^{0}_{j})=-\mathit{relu}((\sum_{h=0}^{j}h)-(i-1))$ and it follows that $\mathit{score}_{1,1}(\boldsymbol{x}^{0}_{i},\boldsymbol{x}^{0}_{j})=0$ if $\sum_{h=0}^{j}h\leq i-1$ and otherwise we have that $\mathit{score}_{1,1}(\boldsymbol{x}^{0}_{i},\boldsymbol{x}^{0}_{j})<0$. The pooling function is specified by the matrix $W_{1,1}=[(1,0,0,0,0)]$ and uses $\mathit{hmax}$ as normalisation function. The combination function $\mathit{comb}_{1}$ is given by the FNN $N_{1}(x_{1},\dotsc,x_{5},y)=\mathit{relu}(x_{2})|\!|\dotsb|\!|\mathit{relu}(x_{5})|\!|\mathit{relu}(y)$. Given a position $\boldsymbol{x}^{0}_{i}$, the attention head $\mathit{att}_{1,1}$ attends to all positions $\boldsymbol{x}^{0}_{j}$ satisfying $\sum_{h=0}^{j}h\leq i-1$. This is due to the way $\mathit{score}_{1,1}$ is built. Then, $\mathit{att}_{1,1}$ computes $\frac{1}{l}$ using $\mathit{pool}_{1,1}$ where $l$ is the number of positions $\mathit{att}_{1,1}$ attends to.
Here, we exploit the fact that only the first position $\boldsymbol{x}^{0}_{1}$ has a non-zero entry in its first dimension and that for all $i$ head $\mathit{att}_{1,1}$ attends to $\boldsymbol{x}^{0}_{1}$. Finally, $\mathit{comb}_{1}$ simply stacks the old vector $\boldsymbol{x}^{0}_{i}$ onto the value $\frac{1}{l}$, but leaves out the first dimension of $\boldsymbol{x}^{0}_{i}$. Let $l_{1}(\mathit{emb}(w))=\boldsymbol{x}^{1}_{1}\dotsb\boldsymbol{x}^{1}_{k}$. Layer $l_{2}$ consists of a single attention head $\mathit{att}_{2,1}=(\mathit{score}_{2,1},\mathit{pool}_{2,1})$. The scoring function $\mathit{score}_{2,1}$ is given by $N_{2,1}(\langle Q_{2,1}\boldsymbol{x}^{1}_{i},K_{2,1}\boldsymbol{x}^{1}_{j}\rangle)$ where $Q_{2,1}=[(0,0,0,0,1)]$, $K_{2,1}=[(0,1,0,0,0)]$ and $N_{2,1}(x)=-\mathit{relu}(\mathit{relu}(x-1)+\mathit{relu}(1-x))$. We have that $\mathit{score}_{2,1}(\boldsymbol{x}^{1}_{i},\boldsymbol{x}^{1}_{j})=0$ if $\frac{1}{l}\cdot j=1$ where $\frac{1}{l}$ is the fifth dimension of $\boldsymbol{x}^{1}_{i}$ and otherwise $\mathit{score}_{2,1}(\boldsymbol{x}^{1}_{i},\boldsymbol{x}^{1}_{j})<0$. The pooling function $\mathit{pool}_{2,1}$ is specified by $W_{2,1}=[(0,1,0,0,0),(0,0,1,0,0)]$ and uses $\mathit{hmax}$ as normalisation. The combination $\mathit{comb}_{2}$ is given by the FNN $N_{2}(x_{1},\dotsc,x_{5},y_{1},y_{2})=\mathit{relu}(x_{1})|\!|\mathit{relu}(x_{2})|\!|\mathit{relu}(y_{1})|\!|\mathit{relu}(x_{2}-y_{2}-1)|\!|\mathit{relu}(x_{4})$. Given a position $\boldsymbol{x}^{1}_{i}$, the attention head $\mathit{att}_{2,1}$ attends to the position $j$ where $\frac{1}{l}\cdot j=1$. Relying on our arguments regarding the computation of $l_{1}$, this is the maximal position $j$ satisfying $\sum_{h=0}^{j}h\leq i-1$. However, this $j$ is equal to the row index $r(i)$ of the decomposition of $i$ based on the inversion of Cantor’s pairing function. Thus, we have that $r(i)=j$.
Furthermore, we have that $c(i)=(i-1)-(\sum_{h=0}^{j}h)$ , which is computed by $\mathit{relu}(x_{2}-y_{2}-1)$ in the combination function $\mathit{comb}_{2}$ . Overall, we see that $l_{2}(l_{1}(\mathit{emb}(w)))$ gives the desired result. ∎
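The two layers can be simulated directly to see how row and column indices arise. The following Python sketch is an illustration only: it omits the tile component of the embedding, uses exact fractions for $\frac{1}{l}$, and the names `rows_and_columns` and `reference` are ours. The comparison starts at position $2$; for position $1$ no $j$ satisfies $\sum_{h=0}^{j}h\leq i-1$, so the sketch leaves this boundary case unchecked.

```python
from fractions import Fraction


def relu(x):
    return max(0, x)


def tri(j):
    # sum_{h=0}^{j} h
    return j * (j + 1) // 2


def hmax_positions(scores):
    # 0-based indices of all maximal scores (the positions hmax attends to)
    best = max(scores)
    return [idx for idx, s in enumerate(scores) if s == best]


def rows_and_columns(n):
    positions = range(1, n + 1)
    # embedding without the tile id: (first-position flag, 1, j, tri(j))
    emb = [(1 if j == 1 else 0, 1, j, tri(j)) for j in positions]
    result = []
    for i in positions:
        # layer 1: score -relu(tri(j) - (i - 1)), pool the first dimension
        attended = hmax_positions([-relu(tri(j) - (i - 1)) for j in positions])
        inv_l = Fraction(sum(emb[a][0] for a in attended), len(attended))
        # layer 2: score -|j / l - 1|, pool j and tri(j)
        (a,) = hmax_positions([-abs(inv_l * j - 1) for j in positions])
        _, _, j, tri_j = emb[a]
        result.append((j, relu(i - tri_j - 1)))  # (r(i), c(i))
    return result


def reference(n):
    # (row, column) pairs in the order (0,0), (1,0), (1,1), (2,0), ...
    out, r = [], 0
    while len(out) < n:
        out.extend((r, c) for c in range(r + 1))
        r += 1
    return out[:n]
```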

Proof of Lemma 2.

Let $f$ be as stated in the lemma. By definition of $\mathcal{T}_{\mathit{udec}}$ , the scoring function of $\mathit{att}_{f}$ is of the form $N(\langle Q\boldsymbol{x}_{i},K\boldsymbol{x}_{j}\rangle)$ and the normalisation is $\mathit{hmax}$ . Let $Q=[(a_{1},\dotsc,a_{k}),(b,0,\dotsc,0),(1,0,\dotsc,0)]$ , $K=[(1,0,\dotsc,0),(1,0,\dotsc,0),(0,-1,0,\dotsc,0)]$ and $N$ be the minimal FNN computing $N(x)=-\mathit{relu}(N_{|\cdot|}(x))=-|x|$ where $N_{|\cdot|}$ is given by Lemma 4. Overall, this ensures that the scoring is given by $\mathit{score}(\boldsymbol{x}_{i},\boldsymbol{x}_{j})=-|f(\boldsymbol{x}_{i})-j|$ . Then, the statement of the lemma follows from the fact that $\mathit{hmax}$ attends to the maximum, which is $0$ given this scoring, and that $j\in\mathbb{N}$ is unique for each $\boldsymbol{x}_{j}$ . ∎
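The effect of this scoring can be sketched in a few lines of Python. The sketch assumes vectors of the form $(1,i,\dotsc)$ and models $f$ as an arbitrary Python function; `hmax_attend` and `attend_to` are hypothetical helper names.

```python
def hmax_attend(scores):
    # 1-based positions with maximal score
    best = max(scores)
    return [j for j, s in enumerate(scores, start=1) if s == best]


def attend_to(f, xs):
    # For each x_i, the positions attended to under score -|f(x_i) - j|
    m = len(xs)
    return [hmax_attend([-abs(f(x) - j) for j in range(1, m + 1)]) for x in xs]


# vectors (1, i, payload) for i = 1..5
xs = [(1, i, i * i) for i in range(1, 6)]
next_pos = attend_to(lambda x: x[1] + 1, xs)
prev_pos = attend_to(lambda x: x[1] - 1, xs)
```

If $f(\boldsymbol{x}_{i})$ falls outside $\{1,\dotsc,m\}$, the head attends to the closest existing position, which matches the boundary behaviour of $i_{\mathit{prev}}$ and $i_{\mathit{next}}$ used in the proof of Theorem 1.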

Lemma 6.

There is an attention head $\mathit{att}_{\leq}$ in $\mathcal{T}_{\mathit{udec}}$ such that for all sequences $\boldsymbol{x}_{1},\dotsc,\boldsymbol{x}_{m}$ where all $\boldsymbol{x}_{i}=(1,i,\boldsymbol{y}_{i})$ the head $\mathit{att}_{\leq}$ attends to $\{\boldsymbol{x}_{1},\dotsc,\boldsymbol{x}_{i}\}$ given $i$.

Proof.

By definition of $\mathcal{T}_{\mathit{udec}}$, the scoring function of $\mathit{att}_{\leq}$ is of the form $N(\langle Q\boldsymbol{x}_{i},K\boldsymbol{x}_{j}\rangle)$ and the normalisation is $\mathit{hmax}$. Let $Q=[(0,-1,0,\dotsc,0),(1,0,\dotsc,0)]$ and let $K$ be equal to $[(1,0,\dotsc,0),(0,1,0,\dotsc,0)]$, so that $\langle Q\boldsymbol{x}_{i},K\boldsymbol{x}_{j}\rangle=j-i$. Furthermore, let $N(x)=-\mathit{relu}(x)$. We observe that $N$ outputs $0$ if $j\leq i$ and a value less than $0$ otherwise. In combination with $\mathit{hmax}$, this ensures that $\mathit{att}_{\leq}$ behaves as stated by the lemma. ∎
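The behaviour claimed for $\mathit{att}_{\leq}$ can be illustrated as follows. The sketch assumes that the score of the pair $(i,j)$ is $-\mathit{relu}(j-i)$, which is the value the observation in the proof requires; the name `attend_leq` is ours.

```python
def relu(x):
    return max(0, x)


def attend_leq(m):
    # For each position i in 1..m, the positions hmax attends to
    # when the score of (i, j) is -relu(j - i).
    result = []
    for i in range(1, m + 1):
        scores = [-relu(j - i) for j in range(1, m + 1)]
        best = max(scores)
        result.append([j for j, s in enumerate(scores, start=1) if s == best])
    return result
```

Since $\mathit{hmax}$ weights all attended positions uniformly, position $i$ receives the average of $\boldsymbol{x}_{1},\dotsc,\boldsymbol{x}_{i}$ after pooling.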

Proof of Theorem 1.

We prove the statement via reduction from $\textsc{OTWP}^{*}$. Let $\mathcal{S}=(S,H,V,t_{I},t_{F})$ be an instance of $\textsc{OTWP}^{*}$ with $|S|=k$. W.l.o.g. we assume that $S\subseteq\mathbb{N}$. Let $T_{\mathcal{S}}\in\mathcal{T}_{\mathit{udec}}$ be built in the following way. $T_{\mathcal{S}}$ uses the embedding $\mathit{emb}$ of transformers in $\mathcal{T}_{\mathit{udec}}$ specified in the beginning of Section 4. Furthermore, it has four layers. Layers $l_{1}$, $l_{2}$ are as in Lemma 1. Layer $l_{3}$ is given by $l_{3}=(\mathit{att}_{\mathit{prev}},\mathit{att}_{\mathit{next}},\mathit{att}_{\mathit{step}},\mathit{comb}_{3})$ where $\mathit{att}_{\mathit{prev}}$, $\mathit{att}_{\mathit{next}}$ and $\mathit{att}_{\mathit{step}}$ are given by Lemma 2 with $\mathit{prev}(x_{1},\dotsc,x_{5})=x_{2}-1$, $\mathit{next}(x_{1},\dotsc,x_{5})=x_{2}+1$ and $\mathit{step}(x_{1},\dotsc,x_{5})=x_{2}+x_{3}+1$. We assume that all three attention heads use the identity matrix as linear maps in their respective pooling function. $\mathit{comb}_{3}$ is given by an FNN $N_{3}$ computing $\mathbb{R}^{4\cdot 5}\rightarrow\mathbb{R}^{7}$. Let the input dimensions of $N_{3}$ be $x_{1,1},\dotsc,x_{1,5},x_{2,1},\dotsc,x_{4,5}$. Then, $N_{3}$ is equal to

$$\mathit{relu}(x_{1,1})|\!|\mathit{relu}(x_{1,2})|\!|N_{a}|\!|N_{b_{1}}|\!|N_{b_{2}}|\!|N_{c}|\!|N_{d}$$

where $N_{a}=N_{\rightarrow}\circ(N^{x_{1,2},x_{3,2}}_{=}|\!|N_{=}^{x_{1,3},x_{1,4}})$, $N_{b_{1}}=N_{\rightarrow}\circ(N^{x_{1,2},x_{2,2}}_{=}|\!|N_{=t_{I}}^{x_{1,5}})$, $N_{b_{2}}=N_{\rightarrow}\circ(N^{x_{1,2},x_{3,2}}_{=}|\!|N_{=t_{F}}^{x_{1,5}})$, $N_{c}=N_{\rightarrow}\circ(N^{x_{1,4},x_{1,3}}_{<}|\!|N^{x_{1,5},x_{3,5}}_{H})$ and $N_{d}=N_{\rightarrow}\circ(N_{<}^{x_{1,3},x_{4,3}}|\!|N^{x_{1,5},x_{4,5}}_{V})$ using the gadgets and constructions described in Appendix B. Layer $l_{4}$ is given by $l_{4}=(\mathit{att}_{\leq},\mathit{comb}_{4})$ where $\mathit{att}_{\leq}$ attends to $\{\boldsymbol{x}_{1},\dotsc,\boldsymbol{x}_{i}\}$ given $i$ and $\mathit{comb}_{4}$ is given by the minimal FNN $N_{4}$ computing $\mathit{relu}(x_{3}+\dotsb+x_{7})$. A formal proof for the existence of $\mathit{att}_{\leq}$ in $\mathcal{T}_{\mathit{udec}}$ is given in Lemma 6. Furthermore, the output function $\mathit{out}$ of $T_{\mathcal{S}}$ is given by the minimal FNN $N_{\mathit{out}}$ computing $N_{\mathit{out}}(x_{1})=\mathit{relu}(1-x_{1})$.

Let $w=t_{1}\dotsb t_{m}\in S^{*}$ be some word over alphabet $S$. As defined above, we have that $\mathit{emb}(t_{i},i)=(1,i,\sum_{j=0}^{i}j,k_{i})$ where $k_{i}\in\{1,\dotsc,|S|\}$. Consider $\boldsymbol{x}^{2}_{1}\dotsb\boldsymbol{x}^{2}_{m}$, namely the sequence of vectors after propagating $w$ through the embedding $\mathit{emb}$ and layers $l_{1},l_{2}$ of $T_{\mathcal{S}}$. As stated by Lemma 1, we have that $\boldsymbol{x}^{2}_{i}=(1,i,r(i),c(i),k_{i})$ where $r(i)$ and $c(i)$ are the row respectively column of tile $t_{i}$ if we interpret $w$ as an encoded tiling. Note that all vectors $\boldsymbol{x}^{3}_{i}$ are non-negative due to the way $N_{3}$ is built. In the following, we argue that the third to seventh dimensions of all $\boldsymbol{x}^{3}_{i}$ are $0$ if and only if $w$ is a valid encoded tiling. Given this equivalence, the statement of the theorem follows immediately as $l_{4}$ simply sums up all vectors and dimensions (except for the first and second) of $\boldsymbol{x}^{3}_{1},\dotsc,\boldsymbol{x}^{3}_{m}$ in $\boldsymbol{x}^{4}_{m}$ and the output of $N_{4}$ indicates whether there was some non-zero value. We fix some arbitrary $\boldsymbol{x}^{2}_{i}=(1,i,r(i),c(i),k_{i})$. Then, $\boldsymbol{x}^{3}_{i}=N_{3}(\boldsymbol{x}^{2}_{i},\boldsymbol{x}^{2}_{i_{\mathit{prev}}},\boldsymbol{x}^{2}_{i_{\mathit{next}}},\boldsymbol{x}^{2}_{i_{\mathit{step}}})$ where $i_{\mathit{next}}=i+1$ if $i<m$ and $m$ otherwise, $i_{\mathit{prev}}=i-1$ if $i>1$ and $1$ otherwise and $i_{\mathit{step}}=i+r(i)+1$ if $i<m-r(i)-1$ and $m$ otherwise.

Consider property (a) and subnetwork $N_{a}$. By Lemma 4, $N^{x_{1,2},x_{3,2}}_{=}$ outputs $0$ iff $x_{1,2}=x_{3,2}$. These dimensions correspond to positions $i$ and $i_{\mathit{next}}$, which are only equal if $i=m$ (Lemma 2). Furthermore, the property of $N_{\rightarrow}$ stated by Lemma 4 applies, as the output of $N_{=}$ is guaranteed to be in $[0;1]$ and the values of $x_{1,2}$ and $x_{3,2}$ are guaranteed to be in $\mathbb{N}$. In summary, this ensures that the third dimension of $\boldsymbol{x}^{3}_{m}$ is $0$ iff $r(m)=c(m)$. For other positions the third dimension is always $0$ since $N_{\rightarrow}$ outputs $0$ in these cases due to the fact that $N^{x_{1,2},x_{3,2}}_{=}$ equals $1$. Analogously, $N_{b_{1}}$ and $N_{b_{2}}$ ensure that the fourth and fifth dimensions are equal to $0$ in all positions iff $t_{1}=t_{I}$ and $t_{m}=t_{F}$, that is, iff property (b) holds. Consider properties (c) and (d) described above and assume that property (a) holds. These two properties are non-local in the sense that they depend on at least two positions in $\boldsymbol{x}^{2}_{1}\dotsb\boldsymbol{x}^{2}_{m}$. Consider the subnet $N_{c}$. By construction and the gadgets described in Appendix B, we have that $N_{c}$ outputs $0$ if $c(i)<r(i)$ and $(t_{i},t_{i+1})\in H$ or if $c(i)=r(i)$, which means that tile $t_{i}$ is rightmost in its corresponding row. Otherwise the value computed by $N_{c}$ is greater than $0$. Analogously, subnet $N_{d}$ checks whether vertically stacked tiles match. In summary, this ensures that the sixth and seventh dimensions of each $\boldsymbol{x}^{3}_{i}$ are equal to $0$ if and only if properties (c) and (d) hold. ∎
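For comparison, the conditions that $T_{\mathcal{S}}$ tests can be written as a direct reference checker. This is a sketch under our reading of properties (a) to (d) as they are used in this proof, not a restatement of the formal definition in Section 4; the function names are ours, and rows and columns are computed directly instead of via layers $l_{1}$ and $l_{2}$.

```python
def coordinates(m):
    # (row, column) of positions 1..m in the order (0,0), (1,0), (1,1), (2,0), ...
    out, r = [], 0
    while len(out) < m:
        out.extend((r, c) for c in range(r + 1))
        r += 1
    return out[:m]


def is_valid_encoded_tiling(w, H, V, t_init, t_final):
    m = len(w)
    if m == 0:
        return False
    rc = coordinates(m)
    # (a) the last tile closes its row: r(m) = c(m)
    if rc[-1][0] != rc[-1][1]:
        return False
    # (b) first and last tile are the initial and final tile
    if w[0] != t_init or w[-1] != t_final:
        return False
    for i in range(1, m + 1):
        r, c = rc[i - 1]
        # (c) horizontal neighbours within a row must be related by H
        if c < r and (w[i - 1], w[i]) not in H:
            return False
        # (d) the tile at i and the one at i + r(i) + 1 (one row further,
        #     same column) must be related by V
        step = i + r + 1
        if step <= m and (w[i - 1], w[step - 1]) not in V:
            return False
    return True
```

The index $i+r(i)+1$ in check (d) is exactly the target of $\mathit{att}_{\mathit{step}}$ in layer $l_{3}$.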

Proof of Theorem 2.

In the same manner as in the proof of Theorem 1, we prove the statement via reduction from $\textsc{OTWP}^{*}$. The reduction is exactly the same, namely given an $\textsc{OTWP}^{*}$ instance $\mathcal{S}=(S,H,V,t_{I},t_{F})$ we build an EOT $T_{\mathcal{S}}$ which recognizes exactly those words $w$ representing a valid encoded tiling of $\mathcal{S}$. For details, see the proof of Theorem 1.

Given the correctness arguments for $T_{\mathcal{S}}$ in Theorem 1, it remains to argue that $T_{\mathcal{S}}$ works as intended, despite the fact that it works over some FA $F$ using at most $\mathcal{O}(\log(\max(|S|,n)))$ bits where $n$ is the length of an input word. We choose $F$ such that overflow situations do not occur in any computation $T_{\mathcal{S}}(w)$ and rounding is handled such that $T_{\mathcal{S}}$ works as intended. Namely, given a word $w$ with $|w|=n$, assume that $F$ uses $m=\lfloor 4\log(\max(|S|,n))\rfloor+2$ bits and rounds values off to the nearest representable number. We denote the value resulting from rounding $x$ off in arithmetic $F$ by $\lfloor x\rfloor_{F}$. We assume that there is an extra bit that is used as a sign bit and that at least $\lfloor 3\log(n)\rfloor+1$ bits can be used to represent integer parts and at least $\lfloor\log(n)\rfloor+1$ bits can be used to represent fractional parts. Note that this is a reasonable assumption for all common FA, like fixed-point or floating-point arithmetic. Furthermore, it is clearly the case that $m\in\mathcal{O}(\log(\max(|S|,n)))$. To ease our arguments and notation from here on, we assume w.l.o.g. that we represent $n$ using $\log(n)$ instead of $\lfloor\log(n)\rfloor+1$ bits.

Per definition, $T_{\mathcal{S}}$ uses the embedding function $\mathit{emb}(a_{k},0)=(1,1,0,0,k)$ and $\mathit{emb}(a_{k},i)=(0,1,i,\sum_{j=0}^{i}j,k)$. First, we assume that each $k$, namely the value representing a specific tile from $S$, is a unique, positive value. This is possible as $F$ uses $m>\log(|S|)$ bits. Furthermore, we see that $\mathit{emb}$, especially the sum $\sum_{j=0}^{i}j=\frac{i(i+1)}{2}\leq i^{2}$, works as intended up to $i=n$ due to the fact that $F$ uses more than $2\log(n)$ bits to represent integer parts. Next, consider layers $l_{1}$ and $l_{2}$ of Lemma 1. Layer $l_{1}$ consists of a single attention head $\mathit{att}_{1,1}$. Here, the only crucial part is the computation of the value $\frac{1}{l}$ in $\mathit{pool}_{1,1}$ for a position $i$. Per definition, $l$ corresponds to the number of positions $j$ such that $\sum_{h=0}^{j}h\leq i-1$. As $i$ is bounded by $n$, this inequality can only be satisfied by positions $j$ for which $j\leq\sqrt{n}$ holds. As $T_{\mathcal{S}}$ uses $\mathit{hmax}$ to count the positions for which this inequality holds, $l$ is bounded by $\sqrt{n}$. Next, we observe that $\lfloor\frac{1}{l}\rfloor_{F}=\frac{\lfloor 2^{\log(n)}\frac{1}{l}\rfloor}{2^{\log(n)}}=\frac{\lfloor\frac{n}{l}\rfloor}{n}$, namely the general understanding of rounding off where we use $\log(n)$ bits to represent fractions. This gives $\lfloor\frac{1}{l_{1}}\rfloor_{F}\neq\lfloor\frac{1}{l_{2}}\rfloor_{F}$ for all $1\leq l_{1}<l_{2}\leq\sqrt{n}$, as $\lfloor\frac{n}{l_{1}}\rfloor\neq\lfloor\frac{n}{l_{2}}\rfloor$ holds for all $l_{1}<l_{2}\leq\sqrt{n}$: in this case $l_{1}l_{2}<n$ and hence $\frac{n}{l_{1}}-\frac{n}{l_{2}}=\frac{n(l_{2}-l_{1})}{l_{1}l_{2}}>1$. This means that it is ensured by $F$ that $\frac{1}{l}$ is uniquely representable.
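The uniqueness of the rounded values $\frac{\lfloor n/l\rfloor}{n}$ for $l\leq\sqrt{n}$ is easy to test empirically. The sketch below models rounding exactly as in the formula above, with denominator $n$ and exact rational arithmetic; the function names are ours, and the check is an illustration rather than a proof.

```python
from fractions import Fraction
from math import isqrt


def round_down_inverse(l: int, n: int) -> Fraction:
    # floor(n / l) / n, the rounded representation of 1 / l
    return Fraction(n // l, n)


def inverses_are_distinct(n: int) -> bool:
    # True iff all rounded inverses for l = 1..floor(sqrt(n)) differ pairwise
    values = [round_down_inverse(l, n) for l in range(1, isqrt(n) + 1)]
    return len(set(values)) == len(values)
```

The bound $l\leq\sqrt{n}$ matters: for $n=16$ the values for $l=6$ and $l=7$ coincide, since $\lfloor 16/6\rfloor=\lfloor 16/7\rfloor=2$.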

Next, the only crucial part in $l_{2}$ is the computation of the product $\frac{1}{l}\cdot j$, which is used to determine the position $j$ for which $\frac{1}{l}\cdot j=1$ in $\mathit{score}_{2,1}$, which is obviously given by position $l$. This equality is no longer guaranteed to hold if we consider $\lfloor\frac{1}{l}\rfloor_{F}\cdot j$. However, due to the monotonicity of $\lfloor\frac{1}{l}\rfloor_{F}$ for $l\leq\sqrt{n}$ and the fact that the maximum rounding error is given by $\frac{1}{2^{\log(n)}}$, we have that $j=l$ produces the value closest to $1$ in the product $\frac{1}{l}\cdot j$. Taking a look at $\mathit{score}_{2,1}$, this ensures that $l$ is still the position that $\mathit{att}_{2,1}$ attends to. Therefore, the statement of Lemma 1 is still valid for $T_{\mathcal{S}}$ working over $F$. We observe that all values of some vector $\boldsymbol{x}^{2}_{j}$ after layer $l_{2}$ are positive integers whose magnitude is bounded by $n^{2}$.

Now, consider layer $l_{3}$ and $l_{4}$ . From the proof of Theorem 1 we see that the gadgets at most sum up two values or compute a fraction of the form $\frac{i+j}{2}$ and $\frac{i-j}{2}$ (in gadgets $N_{H}$ or $N_{V}$ ). Both can safely be done with at least $3\log(n)$ bits for integer and $\log(n)$ for fractional parts, as all previously computed values, up to layer $l_{2}$ , in a computation of $T_{\mathcal{S}}(w)$ are representable using $2\log(n)$ bits. We observe that the values of the third to seventh dimension of some $\boldsymbol{x}^{3}_{j}$ are either $0$ or $1$ . This is due to the fact that all values after layer $l_{2}$ are guaranteed to be integers. Next, consider layer $l_{4}$ . The computation done by $\mathit{att}_{\leq}$ is safe (see Lemma 6) and the crucial step here is the computation of $\mathit{comb}_{4}$ given by $\mathit{relu}(x_{3}+\dotsb+x_{7})$ . The values $x_{i}$ are all of the form $\frac{i}{j}$ where $i$ is guaranteed to be $0$ or $1$ and $j$ is the normalisation induced by $\mathit{att}_{\leq}$ from perspective of position $j$ . However, this means $j$ is bounded by $n$ and, thus, $\lfloor\frac{i}{j}\rfloor_{F}>0$ if and only if $i=1$ for all $j$ due to the fact that $F$ allows for $\log(n)$ bits to represent fractional parts. Finally, $\mathit{out}$ is trivially computable in $F$ , which finishes the proof. ∎

Appendix C Proofs of Section 5

Proof of Theorem 3.

The decidability and membership results of statements (1) and (2) are sufficiently argued in the proof sketch given in Section 5.

To prove the hardness results of statements (1) and (2), we establish a reduction from $\textsc{OTWP}_{\mathsf{un}}$ respectively $\textsc{OTWP}_{\mathsf{bin}}$: given some bounded word-tiling instance $(\mathcal{S},n)$ we build an instance $(T_{\mathcal{S}},n)$ of $\textsc{bSat}_{\mathsf{un}}$ respectively $\textsc{bSat}_{\mathsf{bin}}$ where $T_{\mathcal{S}}$ is built as described in Theorem 1. The only missing argument is that these reductions are polynomial. In particular, this means that $T_{\mathcal{S}}$ must be built in polynomial time regarding the size of $(\mathcal{S},n)$. Therefore, we recall the proof of Theorem 1.

First, we see that the embedding function $\mathit{emb}$ and the number of layers of $T_{\mathcal{S}}$ are independent of $\mathcal{S}$ and $n$. The first two layers $l_{1}$ and $l_{2}$ of $T_{\mathcal{S}}$ are specified in Lemma 1. Recalling the proof of Lemma 1, we see that $l_{1}$ and $l_{2}$ each consist of a single attention head, whose internal parameters like scoring, pooling or combination are independent of $(\mathcal{S},n)$ as well. Next, consider layer $l_{3}$. This layer consists of three attention heads $\mathit{att}_{\mathit{prev}}$, $\mathit{att}_{\mathit{next}}$ and $\mathit{att}_{\mathit{step}}$ each given by the template described in Lemma 2, which again is independent of $(\mathcal{S},n)$. Additionally, $l_{3}$ contains the combination function $\mathit{comb}_{3}$. This combination function is represented by an FNN $N_{3}$, using smaller FNN $N_{a}$, $N_{b_{1}}$, $N_{b_{2}}$, $N_{c}$ and $N_{d}$ as building blocks. These are dependent on $\mathcal{S}$, as they are built using gadgets $N_{=t_{I}}$, $N_{=t_{F}}$, $N_{H}$ and $N_{V}$ where $t_{I}$, $t_{F}$, $H$ and $V$ are components of $\mathcal{S}$. However, in the proof of Lemma 5 we see that these gadgets are at most polynomial in their respective parameter. Layer $l_{4}$ and the output function, specified by FNN $N_{\mathit{out}}$, are again independent of $(\mathcal{S},n)$. In summary, the EOT $T_{\mathcal{S}}$ is polynomial in $(\mathcal{S},n)$, which makes the reductions from $\textsc{OTWP}^{\mathsf{exp}}$ and $\textsc{OTWP}^{\mathsf{poly}}$ polynomial. ∎

Next, we address the proof of Lemma 3. We need a preliminary, rather technical result first. Let $T$ be an EOT and $w\in\Sigma^{+}$ be a word and consider the computation $T(w)$. Let $X_{T(w)}^{0}=\mathit{emb}(w)$ and $X_{T(w)}^{i}$ be the sequence of vectors occurring after the computation of layer $l_{i}$ of $T$. Let $\boldsymbol{x}$ and $\boldsymbol{x}^{\prime}$ be two vectors matching the dimensionality of $\mathit{score}_{i,j}$ of $T$. Overloading some notation, let $N_{w}(\boldsymbol{x},\boldsymbol{x}^{\prime},i,j)=\mathit{norm}_{i,j}(\mathit{score}_{i,j}(\boldsymbol{x},\boldsymbol{x}^{\prime}),\mathit{score}_{i,j}(\boldsymbol{x},X_{T(w)}^{i-1}))$ where $\mathit{score}_{i,j}(\boldsymbol{x},X_{T(w)}^{i-1})$ is the vector of all scorings of $\boldsymbol{x}$ with sequence $X_{T(w)}^{i-1}$. We remark that $\boldsymbol{x}$ and $\boldsymbol{x}^{\prime}$ need not occur in $X_{T(w)}^{i-1}$ for this to be well defined. Again overloading some notation, let $P_{w}(\boldsymbol{x},i,j)=\mathit{pool}_{i,j}(X_{T(w)}^{i-1},\mathit{score}_{i,j}(\boldsymbol{x},X_{T(w)}^{i-1}))$.

Lemma 7.

Let $T$ be an additive-periodical EOT of depth $L$, maximum width $H$ and periodicity $p$ with $\mathit{norm}_{i,j}\in\{\mathit{smax},\mathit{hmax}\}$ for all $i\leq L,j\leq H$, let $w=u_{1}u_{j_{1}}\dotsb u_{j_{h}}u_{2}\in\Sigma^{+}$ where $u_{1},u_{2}\in\Sigma^{+}$, all $u_{j_{i}}\in\Sigma^{p}$ and all $u_{j_{i}}$ also occur in $u_{1}$ or $u_{2}$ and let $\mathcal{X}$ be the set of all vectors occurring in any of the sequences $X^{i}_{T(w)}$. If there are indexes $h_{1}<h_{2}\leq h$ such that for all $\boldsymbol{x},\boldsymbol{x}^{\prime}\in\mathcal{X},i\leq L,j\leq H$ it holds that $N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}}(\boldsymbol{x},\boldsymbol{x}^{\prime},i,j)=N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{2}}}}(\boldsymbol{x},\boldsymbol{x}^{\prime},i,j)$ and $P_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}}(\boldsymbol{x},i,j)=P_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{2}}}}(\boldsymbol{x},i,j)$ then it holds that $N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}^{\prime},i,j)=N_{u_{1}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}^{\prime},i,j)$ and $P_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2}}(\boldsymbol{x},i,j)=P_{u_{1}\dotsb u_{2}}(\boldsymbol{x},i,j)$.

Proof.

Let $T$, $w$, $\mathcal{X}$, $h_{1}$ and $h_{2}$ be as stated above. We prove the statement via induction on the layers $l_{i}$. First, consider layer $l_{1}$ and fix some tuple $(\boldsymbol{x},\boldsymbol{x}^{\prime},1,j)$. We first show that $N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}^{\prime},1,j)=N_{u_{1}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}^{\prime},1,j)$. Assume that $\mathit{norm}_{1,j}$ is given by $\mathit{smax}$. Then, $\mathit{norm}_{1,j}$ computes $\frac{e^{\mathit{score}_{1,j}(\boldsymbol{x},\boldsymbol{x}^{\prime})}}{\sum_{\mathit{score}_{1,j}(\boldsymbol{x},X_{T(w^{\prime})}^{0})}e^{s_{i^{\prime}}}}$ for all words $w^{\prime}$. Obviously, the numerators in $N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}^{\prime},1,j)$ and $N_{u_{1}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}^{\prime},1,j)$ are equal. By definition, we have that $\mathit{score}_{1,j}$ is local in the sense that it compares vectors pairwise, producing the different scoring values $s_{i^{\prime}}$ independent of the overall word. Furthermore, due to the fact that $\mathit{emb}$ is additive-periodical, we have that $X^{0}_{T(u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2})}$ and $X^{0}_{T(u_{1}\dotsb u_{2})}$ are equal in the sense that the vectors corresponding to $u_{j_{h_{2}+1}}\dotsb u_{2}$ are equal. We refer to this property as (*) later on. Using these observations and that $N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}}(\boldsymbol{x},\boldsymbol{x}^{\prime},1,j)=N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{2}}}}(\boldsymbol{x},\boldsymbol{x}^{\prime},1,j)$, we have that the denominators are equal as well. Now, assume that $\mathit{norm}_{1,j}$ is given by $\mathit{hmax}$.
Then, $\mathit{norm}_{1,j}$ computes $\frac{f(\mathit{score}_{1,j}(\boldsymbol{x},\boldsymbol{x}^{\prime}),\mathit{score}_{1,j}(\boldsymbol{x},X^{0}_{T(w^{\prime})}))}{\sum_{\mathit{score}_{1,j}(\boldsymbol{x},X^{0}_{T(w^{\prime})})}f(s_{i^{\prime}},\mathit{score}_{1,j}(\boldsymbol{x},X^{0}_{T(w^{\prime})}))}$ for any word $w^{\prime}$, where $f(s,S)=1$ if $s$ is maximal in $S$ and $0$ otherwise. In contrast to $\mathit{smax}$, the values of $f(\dotsb)$ depend on the overall context, namely the vector of all scorings $\mathit{score}_{1,j}(\boldsymbol{x},X^{0}_{T(w^{\prime})})$. Compare $X^{0}_{T(u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2})}$ and $X^{0}_{T(u_{1}\dotsb u_{2})}$, both given by the additive-periodical embedding $\mathit{emb}$. By assumption, we have that each $u_{j_{i}}$ block also occurs in $u_{1}$ or $u_{2}$. In particular, this means every vector that occurs in $\mathit{emb}(u_{1}\dotsb u_{2})$ does also occur in $\mathit{emb}(u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2})$ and vice versa. This implies that $f(\mathit{score}_{1,j}(\boldsymbol{x},\boldsymbol{x}^{\prime}),\mathit{score}_{1,j}(\boldsymbol{x},X^{0}_{T(u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2})}))=f(\mathit{score}_{1,j}(\boldsymbol{x},\boldsymbol{x}^{\prime}),\mathit{score}_{1,j}(\boldsymbol{x},X^{0}_{T(u_{1}\dotsb u_{2})}))$ for any scoring value $\mathit{score}_{1,j}(\boldsymbol{x},\boldsymbol{x}^{\prime})$. In combination with the assumption that $N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}}(\boldsymbol{x},\boldsymbol{x}^{\prime},1,j)=N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{2}}}}(\boldsymbol{x},\boldsymbol{x}^{\prime},1,j)$ and the observations above, we also get $N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}^{\prime},1,j)=N_{u_{1}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}^{\prime},1,j)$ in the $\mathit{hmax}$ case.
Next, consider the pooling functions. By definition, we have that $\mathit{pool}_{1,j}(X^{0}_{T(w^{\prime})},\mathit{score}_{1,j}(\boldsymbol{x},X^{0}_{T(w^{\prime})}))$ computes $\sum_{X^{0}_{T(w^{\prime})}}\mathit{norm}_{1,j}(\boldsymbol{x},\boldsymbol{x}_{i^{\prime}},\mathit{score}_{i,j}(\boldsymbol{x},X^{0}_{T(w^{\prime})}))(W\boldsymbol{x}_{i^{\prime}})$ for any word $w^{\prime}$. Our previous arguments give that $N_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}^{\prime},1,j)=N_{u_{1}\dotsb u_{2}}(\boldsymbol{x},\boldsymbol{x}^{\prime},1,j)$. In combination with $P_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}}(\boldsymbol{x},i,j)=P_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{2}}}}(\boldsymbol{x},i,j)$ and (*), we immediately get that $P_{u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2}}(\boldsymbol{x},i,j)=P_{u_{1}\dotsb u_{2}}(\boldsymbol{x},i,j)$ holds as well. Next, consider layer $l_{i}$. The arguments are exactly the same as in the base case. However, we need to rely on the induction hypothesis. Namely, we assume that all $\mathit{pool}_{{i-1},j}$ produce the same output in computation $T(u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2})$ and computation $T(u_{1}\dotsb u_{2})$. This implies that all vectors present in $X^{i-1}_{T(u_{1}u_{j_{1}}\dotsb u_{j_{h_{1}}}u_{j_{h_{2}+1}}\dotsb u_{2})}$ are also present in $X^{i-1}_{T(u_{1}\dotsb u_{2})}$ and vice versa and that the vectors corresponding to $u_{j_{h_{2}+1}}\dotsb u_{2}$ are equal in both computations. ∎
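The two normalisations used in this proof can be written down concretely. The sketch below follows the formulas above for a single scoring value `s` and the vector `scores` of all scorings; the function names mirror $\mathit{smax}$ and $\mathit{hmax}$ but the code itself is only an illustration.

```python
from math import exp


def smax(s: float, scores: list) -> float:
    # e^s divided by the sum of e^t over all scorings t
    return exp(s) / sum(exp(t) for t in scores)


def hmax(s: float, scores: list) -> float:
    # f(s, S) = 1 if s is maximal in S and 0 otherwise, normalised over S
    best = max(scores)
    indicator = lambda t: 1.0 if t == best else 0.0
    return indicator(s) / sum(indicator(t) for t in scores)
```

Adding a position whose score already occurs does not change which scores are maximal, but it can change the denominator: `hmax(0, [0, -1, 0, -3])` is `0.5`, while `hmax(0, [0, -1, 0, -3, 0])` is one third. This is why the proof tracks the normalisation values $N_{w}$ and not only the set of vectors present.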

Proof of Lemma 3.

Let $T\in\mathcal{T}^{\textsc{fix}}_{\circ}$ be an additive-periodical EOT working over alphabet $\Sigma$ , having periodicity $p$ , depth $L$ , maximum width $H$ , maximum dimensionality $D$ and working over an FA $F$ using $b$ bits for binary encoding. We use $V$ to denote the set of values representable in the fixed arithmetic that $T$ works over. Note that $|V|\leq 2^{b}$ . Let $w\in\Sigma^{+}$ be a word such that $T(w)=1$ . We observe that there is $m\in\mathbb{N}$ such that $w=u_{1}\dotsb u_{m}u$ where $u_{i}\in\Sigma^{p}$ are blocks of symbols of length $p$ and $u\in\Sigma^{\leq p}$ . Our goal is to prove that a not necessarily connected subsequence of at most $2^{(|T|)^{6}}$ many $p$ -blocks $u_{i}$ from $u_{1}\dotsb u_{m}$ is sufficient to ensure the same computation of $T$ . In the case that $pm+p\leq 2^{(|T|)^{6}}$ we are done. Therefore, assume that $m>2^{(|T|)^{6}}$ .

Let $U$ be the set of all unique $u_{i}$. We observe that $|U|\leq|\Sigma|^{p}$. Next, we fix some not necessarily connected but ordered subsequence $S=u_{j_{0}}u_{j_{1}}\dotsb u_{j_{n}}u_{j_{n+1}}$ with $u_{j_{0}}=u_{1}$, $j_{i}\in\{2,\dotsc,m\}$ and $u_{j_{n+1}}=u$ of $w$ such that each $u^{\prime}\in U$ occurs exactly once. For the case that $u_{1}=u$ we allow this specific block to occur twice in $S$. The assumption $m>2^{(|T|)^{6}}$ implies that $S\neq w$. This means that there are pairs $(u_{j_{h}},u_{j_{h+1}})$ in $S$ with some non-empty sequence of $p$-blocks $u_{j^{\prime}_{1}}\dotsb u_{j^{\prime}_{l}}$ in between. W.l.o.g. assume $u_{j_{0}}$ and $u_{j_{1}}$ is such a pair. Our goal is to argue that there are at most $2^{(|T|)^{5}}$ blocks from $u_{j^{\prime}_{1}}\dotsb u_{j^{\prime}_{l}}$ needed to ensure the same computation of $T$. Given that this argument works for all $|\Sigma|^{p}$ adjacent pairs in $S$, we are done.

Consider the computation $T(w)$. The additive-periodical embedding $\mathit{emb}$ of $T$ implies that $\mathit{emb}(w)$ includes at most $|\Sigma|p$ different vectors. Furthermore, from layer to layer equal vectors are mapped equally, which means that each $X_{w}^{1},\dotsc,X_{w}^{L}$ contains at most $|\Sigma|p$ different vectors as well. This implies that the computation $T(w)$ induces at most $(L|\Sigma|p)^{2}\times L\times H\leq(|\Sigma|pL^{2}H)^{2}\leq(|\Sigma|pLH)^{4}$ different tuples $(\boldsymbol{x},\boldsymbol{x}^{\prime},i,j)$ where $\boldsymbol{x},\boldsymbol{x}^{\prime}$ are vectors induced by $T(w)$ and $i\leq L,j\leq H$. Additionally, we have that for each value $N_{w}(\boldsymbol{x},\boldsymbol{x}^{\prime},i,j)$ and $P_{w}(\boldsymbol{x},i,j)$, as defined in the beginning of this section, there are at most $|V^{D}|\leq 2^{bD}$ possibilities. Simple combinatorics, namely the pigeonhole principle, states that in the increasing sequence $u_{j^{\prime}_{1}},u_{j^{\prime}_{2}},\dots$ there must be points $h_{1}$ and $h_{2}$ with $h_{1}\leq 2^{bD(|\Sigma|pLH)^{4}}\leq 2^{(|T|)^{5}}$ such that for all tuples $(\boldsymbol{x},\boldsymbol{x}^{\prime},i,j)$ induced by $T(w)$ we have that $N_{u_{j_{0}}u_{j^{\prime}_{1}}\dotsb u_{j^{\prime}_{h_{1}}}}(\boldsymbol{x},\boldsymbol{x}^{\prime},i,j)=N_{u_{j_{0}}u_{j^{\prime}_{1}}\dotsb u_{j^{\prime}_{h_{2}}}}(\boldsymbol{x},\boldsymbol{x}^{\prime},i,j)$ and $P_{u_{j_{0}}u_{j^{\prime}_{1}}\dotsb u_{j^{\prime}_{h_{1}}}}(\boldsymbol{x},i,j)=P_{u_{j_{0}}u_{j^{\prime}_{1}}\dotsb u_{j^{\prime}_{h_{2}}}}(\boldsymbol{x},i,j)$.
Now, Lemma 7 states that this implies $N_{u_{j_{0}}u_{j^{\prime}_{1}}\dotsb u_{j^{\prime}_{h_{1}}}u_{j^{\prime}_{h_{2}+1}}\dotsb u_{j_{1}}\dotsb u}(\boldsymbol{x},\boldsymbol{x}^{\prime},i,j)=N_{w}(\boldsymbol{x},\boldsymbol{x}^{\prime},i,j)$ and $P_{u_{j_{0}}u_{j^{\prime}_{1}}\dotsb u_{j^{\prime}_{h_{1}}}u_{j^{\prime}_{h_{2}+1}}\dotsb u_{j_{1}}\dotsb u}(\boldsymbol{x},i,j)=P_{w}(\boldsymbol{x},i,j)$. However, this implies that the subsequence $u_{j^{\prime}_{h_{1}+1}}\dotsb u_{j^{\prime}_{h_{2}}}$ has no influence on the computation of $T$ on $w$ and, thus, can be left out. As we can argue this for every such cycle occurring in $u_{j^{\prime}_{1}}\dotsb u_{j^{\prime}_{l}}$, we get the desired bound of $2^{(|T|)^{5}}$. ∎
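
As an informal illustration of the cycle-cutting step used above, consider the following minimal sketch. It is not part of the formal proof and rests on simplifying assumptions: the tuple of all values $N$ and $P$ of a prefix is replaced by an abstract, hashable fingerprint that is updated block by block through a hypothetical function `step`, so the extension property provided by Lemma 7 holds trivially. The names `cut_cycles` and `saturating_count` are ours and serve only this example.

```python
from functools import reduce

def cut_cycles(blocks, step, start):
    """Drop every segment of blocks lying between two prefixes
    that share the same fingerprint (a 'cycle')."""
    kept = []
    seen = {start: 0}  # fingerprint -> length of the kept prefix producing it
    state = start
    for block in blocks:
        state = step(state, block)
        if state in seen:
            # Fingerprint repeats: everything kept after that prefix is a cycle.
            cut = seen[state]
            del kept[cut:]
            seen = {s: k for s, k in seen.items() if k <= cut}
        else:
            kept.append(block)
            seen[state] = len(kept)
    return kept

def saturating_count(cap, symbols):
    """Toy fingerprint: per-symbol counters that saturate at cap,
    standing in for values of a fixed-width arithmetic."""
    def step(state, block):
        return tuple(min(c + (i == block), cap) for i, c in enumerate(state))
    return step

step = saturating_count(cap=2, symbols=2)
word = [0, 1, 0, 0, 0, 1, 1, 1, 0]
short = cut_cycles(word, step, (0, 0))
print(short)  # [0, 1, 0, 1]
# The shortened word reaches the same final fingerprint as the original one.
assert reduce(step, short, (0, 0)) == reduce(step, word, (0, 0))
```

All prefixes of the returned sequence have pairwise different fingerprints, so its length is bounded by the number of possible fingerprints. This mirrors how the bound $2^{(|T|)^{5}}$ arises from the number of possible value combinations.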

Proof of Theorem 5.

We prove the statement via reduction from $\textsc{OTWP}_{\mathsf{bin}}$. Let $\mathcal{S}=(S,H,V,t_{I},t_{F})$ and $n\geq 1$ be an instance of $\textsc{OTWP}_{\mathsf{bin}}$. We construct an EOT $T_{\mathcal{S},n}\in\mathcal{T}^{\textsc{fix}}$ working over some FA $F$ with $T_{\mathcal{S},n}(w)=1$ if and only if $w\in S^{+}$ witnesses the validity of the $\textsc{OTWP}_{\mathsf{bin}}$ instance $(\mathcal{S},n)$.

Next, let $T_{\mathcal{S},n}$ be built exactly like $T_{\mathcal{S}}$ in the proof of Theorem 4, but with the following structural adjustments. In layer $l_{3}$, we adjust $\mathit{comb}_{3}$ to be $\mathit{comb}_{3}=N_{3}|\!|N_{e}|\!|N_{f}$ where $N_{3}$ is specified as in the proof of Theorem 4, $N_{e}=N_{\rightarrow}\circ(N^{x_{1,2},x_{3,2}}_{=}|\!|N^{x_{1,3}}_{=n})$ and $N_{f}=N^{x_{1,2}}_{\neq\frac{(n+1)((n+1)+1)}{2}+1}$ where $N_{\neq t}$ is analogous to the construction of $N_{=t}$ given in Lemma 5. Furthermore, we adjust $\mathit{comb}_{4}$ in layer $l_{4}$ to be represented by the FNN $\mathit{relu}(x_{3}+\dotsb+x_{8}+x_{9})$. We refer to the gadgets described in Lemma 4 and Lemma 5 as well as the proof of Theorem 1 for further details.

Consider the adjustment in $l_{3}$. The FNN $N_{e}$ in $\mathit{comb}_{3}$ ensures that $T_{\mathcal{S},n}(w)=1$ only if the row index corresponding to the last symbol is equal to $n$. Note that $N_{3}$ checks whether row and column index corresponding to the last symbol are equal. Additionally, $N_{f}$ checks that there is no id equal to $\frac{(n+1)((n+1)+1)}{2}+1$. This corresponds to the position id of the successor of the vector representing tile $(n,n)$. Furthermore, the adjustment of $\mathit{comb}_{4}$ considers the output of $N_{e}$ and $N_{f}$ in addition to the outputs of $N_{3}$. In summary, we have that $T_{\mathcal{S},n}$ only outputs $1$ given $w$ if the word length is such that the row index corresponding to the position of the last symbol of $w$ in a respective octant tiling is equal to $n$ (ensured by $N_{e}$), if $w$ is at most of length $\frac{(n+1)((n+1)+1)}{2}$ (ensured by $N_{f}$) and if $w$ represents a valid encoded tiling (the remaining parts of $T_{\mathcal{S},n}$).
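
The three length-related conditions can be made concrete with a small sketch in exact integer arithmetic. It assumes, purely for illustration, that position ids are 1-based and that the octant is enumerated row by row, with row $i$ holding the $i+1$ cells $(i,0),\dotsc,(i,i)$; the function names are hypothetical and the sketch does not model the FNN gadgets themselves.

```python
def word_length(n):
    """Number of cells of the octant up to and including tile (n, n)."""
    return (n + 1) * (n + 2) // 2

def octant_cell(pos):
    """Map a 1-based position id to (row, column), assuming row-wise order."""
    row = 0
    while pos > row + 1:   # row `row` holds row + 1 cells
        pos -= row + 1
        row += 1
    return row, pos - 1

def accepts_length(length, n):
    """Length conditions on w, with the gadget playing each role in comments."""
    row, col = octant_cell(length)
    ends_on_diagonal = row == col                 # role of N_3
    ends_in_row_n = row == n                      # role of N_e
    no_successor = length < word_length(n) + 1    # role of N_f
    return ends_on_diagonal and ends_in_row_n and no_successor

print(octant_cell(word_length(3)))                          # (3, 3)
print([k for k in range(1, 50) if accepts_length(k, 3)])    # [10]
```

In exact arithmetic the third condition already follows from the first two; it is kept separate here only to mirror the three conditions listed in the text.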

Additionally, we need to argue that $T_{\mathcal{S},n}$ works as intended, despite the fact that it is limited by some FA $F$ using a representation size that is at most logarithmic in $n$. These arguments follow the exact same line as in the proof of Theorem 2, but using an FA $F$ that uses $m=\lfloor 6\log(\max(|S|,n))\rfloor+2$ bits and handles overflow using saturation. The reason for the larger representation size is that words $w$ representing a valid encoded tiling ending at position $(n,n)$ are of length $|w|=\frac{(n+1)((n+1)+1)}{2}\leq n^{2}$ (for $n\geq 4$). Thus, we use $\lfloor 4\log(n)\rfloor+1$ integer bits to be able to represent a sum $\sum_{j=0}^{i}j=\frac{i(i+1)}{2}\leq i^{2}$ for all $i\leq n^{2}$ and $\lfloor 2\log(n)\rfloor+1$ fractional bits to uniquely represent the fraction $\frac{1}{l}$ for $l\leq n$. For details, see the proof of Theorem 2. Furthermore, the fact that we use $\lfloor 4\log(n)\rfloor+1$ bits to encode integers and that $F$ handles overflow using saturation ensures that $N_{f}$ works as intended: we have that $\frac{(n+1)((n+1)+1)}{2}+1<n^{4}$ (for $n\geq 2$) and, thus, the id $\frac{(n+1)((n+1)+1)}{2}+1$ occurs at most once, independent of the length of $w$, as it is not the point where $F$ enforces saturation on the positional embedding. Thus, $\mathit{att}_{\text{self}}$ works for this position as intended and then $N_{f}$ checks the property described above correctly.
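
The bit budget can be sanity-checked numerically. The sketch below is only an illustration under assumptions that are ours: $\log$ is read as $\log_{2}$, fractions are quantized by truncation, and position ids are stored in the integer part. The helper names are hypothetical and do not refer to the construction of Theorem 2.

```python
from fractions import Fraction

def bit_budget(n):
    """Integer and fractional bit counts, computed exactly on integers."""
    int_bits = (n ** 4).bit_length()    # equals floor(4 * log2(n)) + 1
    frac_bits = (n ** 2).bit_length()   # equals floor(2 * log2(n)) + 1
    return int_bits, frac_bits

def quantize(x, frac_bits):
    """Truncate a non-negative Fraction to frac_bits fractional bits.
    Returns the scaled integer floor(x * 2**frac_bits)."""
    return (x.numerator << frac_bits) // x.denominator

def saturating_add(a, b, int_bits):
    """Addition that clamps at the largest representable integer."""
    return min(a + b, (1 << int_bits) - 1)

def reciprocals_are_distinct(n):
    """Do 1/1, ..., 1/n receive pairwise different fixed-point codes?"""
    _, frac_bits = bit_budget(n)
    codes = {quantize(Fraction(1, l), frac_bits) for l in range(1, n + 1)}
    return len(codes) == n

def successor_id_below_saturation(n):
    """Is the id checked by N_f strictly below the saturation value?"""
    int_bits, _ = bit_budget(n)
    successor_id = (n + 1) * (n + 2) // 2 + 1
    return successor_id < (1 << int_bits) - 1

print(bit_budget(3))  # (7, 4)
print(all(reciprocals_are_distinct(n) for n in range(2, 100)))
print(all(successor_id_below_saturation(n) for n in range(2, 100)))
```

Under these assumptions, both checks succeed for every $n\geq 2$ that we tried, which is consistent with the claim that the id of the successor of tile $(n,n)$ is not affected by saturation.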

The argument that $T_{\mathcal{S},n}$ can be built in polynomial time follows straightforwardly from the arguments for Theorem 3 and the fact that $N_{e}$ and $N_{f}$ are small gadgets with maximum parameter quadratic in $n$, which can be represented using a logarithmic number of bits. ∎

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
Answer: [Yes]
Justification: Each claimed computability or complexity result, namely Theorems 1 to 5, is sufficiently argued in the main paper with full, formal proofs in the appendix.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We discuss the limitations of our theoretical results in Section 3 and, in detail, in Section 6.
Guidelines:
- The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: For each result we gave a full formal proof in the Appendix and a short proof sketch and intuitive explanation in the main paper.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [N/A]
Justification:
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example
  (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [N/A]
Justification:
Guidelines:
- The answer NA means that paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [N/A]
Justification:
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [N/A]
Justification:
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
- If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [N/A]
Justification:
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: None of the topics “Potential Harms Caused by Research Process”, “Societal Impact and Potential Harmful Consequences” or “Impact Mitigation Measures” applies to our theoretical results.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [N/A]
Justification:
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [N/A]
Justification:
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [N/A]
Justification:
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset’s creators.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [N/A]
Justification:
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [N/A]
Justification:
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [N/A]
Justification:
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.