The Computational Complexity of Formal Reasoning for Encoder-Only Transformers
Abstract
We investigate challenges and possibilities of formal reasoning for encoder-only transformers (EOT), meaning sound and complete methods for verifying or interpreting behaviour. In detail, we condense related formal reasoning tasks in the form of a naturally occurring satisfiability problem (SAT). We find that SAT is undecidable if we consider EOT, commonly considered in the expressiveness community. Furthermore, we identify practical scenarios where SAT is decidable and establish corresponding complexity bounds. Besides trivial cases, we find that quantized EOT, namely those restricted by some fixed-width arithmetic, lead to the decidability of SAT due to their limited attention capabilities. However, the problem remains difficult, as we establish those scenarios where SAT is NEXPTIME-hard and those where we can show that it is solvable in NEXPTIME for quantized EOT. To complement our theoretical results, we put our findings and their implications in the overall perspective of formal reasoning.
1 Introduction
Natural language processing (NLP) models, processing and computing human language, are gateways for modern applications aiming to interact with human users in a natural way. Although NLP is a traditional field of research, the use of deep learning techniques has undoubtedly revolutionised the field in recent years [22]. In this revolution, models such as Recurrent Neural Networks (RNN) or more specific Long Short-term Memory Networks (LSTM) [30] have long been the driving force, but for a few years now NLP has a new figurehead: transformers [28].
Transformers are a deep learning model using (multiple) self-attention mechanisms to process sequential input data, usually natural language. The efficient trainability of transformers, for example in contrast to LSTM, while achieving top-tier performance led to numerous heavy-impact implementations such as BERT [10], GPT-3 [6] or GPT-4 [21], sparking widespread use of the transformer architecture. However, the foreseeable omnipresence of transformer-based applications leads to serious security concerns.
In general, there are two approaches to establishing trustworthiness of learning-based models: first, certifying specific, application-dependent safety properties, called verification, and second, interpreting the behaviour of such models and giving explanations for it, called interpretation. In both approaches, the holy grail is to develop automatic methods that are sound and complete: algorithm that given some model and (verification or interpretation) specification outputs true if satisfies and false otherwise (soundness), and does so for all combinations of and (completeness) it is designed for. We refer to such sound and complete methods and tasks for verification and interpretation collectively using the term formal reasoning.
We lay out a framework for the possibilities and challenges of formal reasoning for transformers by establishing basic computability and complexity results in this work. Thereby, we focus on the so-called satisfiability (Sat) problem of sequence-classifying transformers: given a transformer , decide whether there is some input word such that . Although this may seem like an artificial problem at first glance, it is a natural abstraction of problems that commonly occur in almost all non-trivial formal reasoning tasks. Additionally, since it is detached from the specifics of particular reasoning specifications like safety properties for instance, uncomputability results and complexity-theoretic hardness results immediately transfer to more complex formal reasoning tasks. This also keeps the focus on the transformer architecture under consideration. Here we exclusively consider encoder-only transformers (EOT), mainly due to the fact that the known high expressive power of encoder-decoder transformers [23] makes formal reasoning trivially impossible.
Our work is structured as follows. We define necessary preliminaries in Section 2. In Section 3, we give an overview on our theoretical results and take a comprehensive look at their implications for formal reasoning for transformers. In Section 4 and Section 5 we present our theoretical results: we show that Sat is undecidable for classes of EOT commonly considered in research on transformer expressiveness, we show that a bounded version bSat of the satisfiability problem is decidable, for any class of (computable) EOT, and give corresponding complexity bounds and we show that considering quantized EOT, meaning EOT whose parameters and internal computations are limited by some fixed-width arithmetic, leads to decidability of Sat and give corresponding complexity bounds. Finally, we discuss limitations, open problems and future research in Section 6.
Related work.
We establish basic computability and complexity results about transformer-related formal reasoning problems, like formal verification or interpretation. This places our work in the intersection between research on verification and interpretation of transformers and transformer expressiveness.
There is a limited amount of work concerned with methods for the verification of safety properties of transformers [15, 25, 4, 11]. However, all those methods do not fall in the category of formal reasoning, as they are non-complete. This means, the rigorous computability and complexity bound established in this work cannot be applied without further considerations. The same applies for so far considered interpretability methods [31]. We remark that a lot of these approaches are not sound methods either. In contrast, there is an uprise in theoretical investigations of transformer expressiveness. Initial work dealt with encoder-decoder models and showed that such models are Turing-complete [23, 3]. Encoder-only models have so far been analysed in connection with circuit complexity [13, 14, 20, 19], logics [7, 18] and programming languages [29]. A recently published survey [26] provides an overview of these results. This work is adjacent as some of the here considered classes of EOT, mainly those considered in Section 4, are motivated by these results and some of the constructions we use in corresponding proofs are similar.
2 Fundamentals
Mathematical basics.
Let be a finite set of symbols, called alphabet. A (finite) word over is a finite sequence where . We define . As usual, we denote the set of all non-empty words by . A language is a set of words. We also extend the notion of an alphabet to vectors , meaning that a sequence is a word over some subset of . Usually, we denote vectors using bold symbols like or .
Encoder-only transformers (EOT).
We consider the encoder-only transformer (EOT) model introduced in [28]. We take a look at EOT from a computability and complexity perspective, which is why we follow more formal definitions as done in [13, 23, 14]. An EOT with layers and attention heads in layer is a tuple where
-
•
for some is the positional embedding,
-
•
each attention head is a tuple where is a function called scoring and is a function called pooling, computing where is a linear map represented by a matrix and is a normalisation,
-
•
each is called a combination and is called the output.
For given we call the tuple the -th layer of . The EOT computes a function as follows. Let be a word. First, computes an embedding of by where . Next, each layer computes a sequence as follows: for each input and attention head , layer computes . Then, is given by . In the end, the output is computed by , thus the value of the output function for the last symbol of after being transformed by the embedding and layers of . We say that accepts if , and we say that rejects otherwise. We call the depth of and the maximal the (maximum) width of . Furthermore, we call the maximal the (maximum) dimensionality of . Let be some class of EOT. The decision problem for a class of EOT is: given over alphabet , decide whether there is such that . We refer to this as the satisfiability problem for .
Fixed-width arithmetics.
We consider commonly used fixed-width arithmetics (FA) that represent numbers using a fixed amount of bits, like floating- or fixed-point arithmetic in this work. See [1] (fixed-point) or [8] (floating-point) for rigorous mathematical definitions of such FA. In this work, however, we only make use of a high-level view on different FA. Namely, given some FA we assume that all values are represented in binary using bits for representing its numbers. Thus, there are different rational number representable in . Furthermore, we assume that the considered FA can handle overflow situations using either saturation or wrap-around and rounding situations by rounding up or off. We consider EOT in the context of . We say that works over , assuming that all computations as well as values occurring in a computation are carried out in the arithmetic defined by .
3 Overview: capturing and classifying formal transformer reasoning
We address elementary problems arising in formal reasoning for transformers. In doing so, we pursue the goal of establishing basic computability and complexity results for corresponding problems in order to frame possibilities and challenges.
To achieve widespread implications of our results, we focus our considerations on a fundamental problem arising in formal verification and interpretation tasks: given a transformer , decide whether there is some input leading to some specific output , as defined formally in terms of the satisfiability problem for a class of specific transformers, see Section 2.
To see that this captures the essence of formal reasoning problems occurring in practice, consider the following formal verification task: given transformer , verify that only accepts inputs which contain some specific key from a set . Such a property is usually considered a robustness property [25, 16]. We can phrase this as a satisfiability problem by considering the property’s negation, namely to verify that there is some input such that no key from occurs and still we have .
Likewise, consider a formal interpretation task in which we want to find the minimal subset of some set of error symbols such that all that contain all errors are rejected by . This is usually understood as an abductive explanation [17]. Given a candidate subset , we can certify this by checking that there is some which contains all errors , but is accepted by . This, again, is a special case of a satisfiability problem for some transformer class .
Furthermore, we want our results to be detached from any intricacies of certain transformer architectures: first, we focus on encoder-only transformers (EOT), so leaving any decoder mechanism unconsidered. The primary reason for this is that encoder-decoder architectures are of such high expressive power [23] that Sat is easily seen to be undecidable in almost all non-trivial cases. The secondary reason for this is that encoder-decoder architectures subsume encoder-only architectures. So any lower computability or complexity bound, established in this work, is also a lower bound for encoder-decoder transformers. Additionally, the presentation of EOT in Section 2 paves the way for a parametrized view onto EOT architectures which allows us to study different classes of EOT by fixing or bounding such parameters.
We start by considering the class of EOT, motivated by commonly considered architectures in the theoretical expressiveness community [23, 14, 13]: consists of those EOT that use a positional embedding, expressive enough to compute a sum, hardmax as normalisation functions and a scalar-product based scoring, enriched with a nonlinear map represented by an FNN.
Theorem 1 (Section 4).
The satisfiability problem is undecidable.
![Refer to caption](extracted/5627503/assets/overview.png)
Essentially, this result implies that even for encoder-only EOT the combination of normalizations and expressive scoring is enough to make satisfiability undecidable. Generally, this makes formal reasoning, like verifying robustness properties or giving formal explanations, impossible for classes of EOT that subsume . Specifically, no such methods exist that are fully automatic, sound and complete. Theorem 1 does not preclude the existence of incomplete methods for instance.
Recently, so-called log-precision transformers have been studied [18]. These transformers are defined as usual, but given a word length it is assumed that a log-precision transformer uses at most bits in its internal computations. To complement these theoretical considerations, we consider the class of EOT from that work with log-precision. Unfortunately, this restriction is not enough to circumvent general undecidability.
Theorem 2 (Section 4).
The satisfiability problem is undecidable.
Given such impossibility results, we turn our attention to the search for decidable cases. We make the reasonable assumption that all considered EOT are computable, meaning that their components like scoring, normalisation, pooling, combination and output functions are computable functions.
First, we consider a natural restriction of the satisfiability problem by bounding the length of valid inputs. Then satisfiability becomes decidable, regardless of the respective class of EOT, but it is difficult from a complexity-theoretic perspective. To formalize this, we introduce the bounded satisfiability problem for a class : given an EOT and a bound on its input length, decide whether there is word with s.t. .
Theorem 3 (Section 5, informal).
The bounded satisfiability problem is decidable for all classes of (computable) EOT. Depending on whether is given in binary or unary coding, is NEXPTIME-, resp. NP-hard whenever .
Informally, this result implies that bounding the word length is a method to enable formal reasoning. However, it does not change the fact that satisfiability is an essentially hard problem. As hardness is a lower bound, this also translates to subsuming formal reasoning tasks.
Imposing a bound on the input length may not be a viable restriction for various formal reasoning tasks. We therefore study other ways of obtaining decidability. We address the unbounded satisfiability problem for practically motivated classes of EOT. We consider the class of EOT that use a positional embedding with some periodicity in their positional encoding, commonly seen in practice [28, 12], use softmax or hardmax as normalisation and which work over some fixed-width arithmetic (FA). This last restriction is motivated by recent popular ways to handle ever increasing EOT sizes, for example via quantization or using low-bit arithmetics [5]. From a complexity-theoretic perspective, the use of fixed-width arithmetic has a similar effect to bounding the input length.
Theorem 4 (Section 5).
The satisfiability problem is in NEXPTIME.
So automatic, sound and complete formal reasoning for periodical EOT in a fixed-width arithmetic environment is generally possible with potentially high complexity. Note that formal reasoning tasks with more complex safety or interpretability specifications than simple satisfiability may even lead to higher complexities.
We then aim to show that this is optimal by providing a matching lower bound. However, we need to relax these restrictions again, namely considering the class allowing for EOT that use non-periodical embeddings and work over some fixed-width arithmetic that can use saturation to handle overflow situations. We show that this high complexity is unavoidable, making sound and complete automatic formal reasoning for fixed-width arithmetic transformers with general positional embeddings practically intractable.
Theorem 5 (Section 5).
The satisfiability problem is NEXPTIME-hard.
A schematic depiction of the computability and complexity results described in this section is given in Figure 1. Note that this figure is a purely technical presentation of our results, which means that it does not convey the implications for formal reasoning described above.
4 Transformer satisfiability is generally undecidable
We consider a class of EOT , which is as weak as possible regarding the expressiveness of included EOT. We define by giving minimum requirements: positional-embeddings can be of the form and where we assume some order on the alphabet symbols . For scoring functions we allow for where is a classical Feedforward Neural Network (FNN) with activations, and are linear maps and denotes the usual scalar product, for normalisations we allow for hardmax if for all and there are distinct such that otherwise . Combinations as well as output functions can be classical FNN with activation. Aside from technical reasons, we motivate the choice of in Section 3. To ease our notation, we exploit the fact that using as normalisation implies a clearly defined subset of positions that are effective in the computation of some attention head given some position , namely those that are weighted non-zero. In this case, we say that attends to given position .
We prove that is undecidable by establishing a reduction from the (unbounded) octant tiling-word problem (). For details on tiling problems, see Appendix A. The is defined as follows: given a tiling system where is some finite set of tiles, and we have to decide whether there is a word (a) such that (b) , , (c) for all and holds and (d) for all and holds . We call a word which satisfies (a) an encoded tiling and if (b)-(d) are satisfied as well then we call a valid encoded tiling. Our proof strategy is easily described: given a tiling system , we build an EOT which accepts a word if it fulfils conditions (a) to (d) and otherwise rejects . We derive most technical proofs of the following lemmas and theorems to Appendix B and instead provide intuitions and proof sketches in this section.
We start with the first observation: the expressiveness of EOT in is sufficient to decode the octant tiling potentially represented by a given word . In detail, two encoder layers in combination with a positional embedding definable in are expressive enough to compute for a given symbol in to which position in an octant tiling it corresponds, if we interpret as an encoded tiling.
Lemma 1.
Let be a tiling system with tiles . There is an embedding function and there are encoder layers and definable in such that for each word holds that where such that is equal to the symbol at position in and is equal to .
Assume that . Lemma 1 implies that a EOT is generally able to recognize whether is an encoded tiling as soon as is able to check whether and of the last symbol of processed by are equal. Therefore, property (a) and also (b) can be checked by EOT in using the residual connection in the combination functions together with the expressive power of FNN. Similarly, property (c) can be ensured if it is possible to build an attention head that is able to attend to position given position . Let with . To verify whether property (d) holds, an EOT must be able to attend to position given position corresponding to symbol . In summary, to check properties (a) – (d) it is left to argue that there are attention heads in that can attend to positions depending linearly on the values of the currently considered position.
Lemma 2.
Let with be some linear function. There is attention head in such that for all sequences where all for some attention head attends to given position if with and otherwise to where is the value nearest to .
In combination, the previous lemmas indicate that EOT from are able to verify whether a given word is a valid encoded tiling. This expressive power is enough, to lead to an undecidable satisfiability problem for EOT from .
Theorem 1.
The decision problem is undecidable.
Proof Sketch.
We establish a reduction from to by constructing for each instance of an EOT accepting exactly those corresponding to a valid encoded-tiling for .
uses the positional embedding described in the beginning of Section 4 and has four layers. Layers and are given by Lemma 1 and are used to decode the row and column indexes corresponding to a potential octant tiling for each symbol in a given word . Layer uses the informations encoded by the embedding and the decoded row and column indexes to check whether properties (a) to (d) described above hold for . The necessary informations are aggregated using three attention heads , and , each built according to Lemma 2.Thereby, attends each position to its predecessor, but the first position attends to itself. This allows to clearly identify the vector corresponding to the first position in and check whether this is equal to tile . Attention head attends each position to its successor, but the last position attends to itself. This allows to clearly identify the vector corresponding to the last position in , in order to check whether this is equal to , and to check conditions given by . Attention head attends each position to the position with the same column index but the successive row index. If there is no such successive row it attends to the last position. This allows to check whether conditions given by holds. Each of these conditions is checked in the combination function of , using specifically built feed-forward neural networks outputting to some predefined vector dimension if and only if the condition is met. Finally, layer aggregates the information of all positions in the vector corresponding to the last position using attention head , again given by Lemma 2.
The correctness of this reduction follows from the detailed construction of , which is technically extensive and given in Appendix B. ∎
Next, we consider the class which is defined exactly like but for all working over alphabet and all words with we assume that is carried out in some fixed-width arithmetic using bits.
Theorem 2.
The decision problem is undecidable.
Proof sketch.
This proof follows the exact same line as the proof of Theorem 1. Additionally, we need to argue that works as intended, despite the fact that it is limited by some log-precision .
Looking at the proof of Theorem 1, it is imminent that the magnitude and precision of all values used and produced in the computation depend polynomially on and, thus, we can choose the representation of to be linear in , which avoids any overflow or rounding situations and ensures that works as intended. A formal proof is given in Appendix B. ∎
5 How to make transformer satisfiability decidable
In this section we investigate classes of EOT leading to decidable Sat problems or decidable restrictions of it. Additionally, we establish corresponding complexity bounds.
In order to establish clearly delineated upper complexity bounds, we need to bound the representation size of an EOT . Instead of tediously analyzing the space needed to represent embedding, scoring, pooling, combination and normalisation functions, we note that it suffices to estimate the size up to polynomials only. The complexity of an EOT with layers and attention heads in layer , working on inputs over alphabet , is where and is the maximal dimensionality of vectors occurring in a computation of . Note that one can reasonably assume the size of a syntactic representation of to be polynomial in , and that EOT have the polynomial evaluation property: given a word , can be computed in time that is polynomial in . Section 3 discusses why this assumption is reasonable.
Satisfiability restricted to words of bounded length is decidable, but difficult
We start with a natural restriction: bounding the word length. Let be a class of EOT. The bounded satisfiability problem, denoted by is: given and some , decide whether there is a word with such that . It is not hard to see that is decidable. However, its complexity depends on the value of , and we therefore distinguish whether is represented in binary or unary encoding. We denote the corresponding problems as and .
Theorem 3.
Let be a class of EOT. Then
-
1.
is decidable in NP and if then is NP-complete,
-
2.
is decidable in NEXPTIME and if then is NEXPTIME-complete.
Proof Sketch.
The decidability result of statement (1) can be shown using a simple guess-and-check argument: given , guess a word with , compute and check that the result is . This is possible in time polynomial in using the polynomial evaluation property. Moreover, the value of is polynomial in the size needed to represent in unary encoding.
The decidability result of statement (2) is shown along the same lines. However, if the value is encoded binarily then this part of the input is of size , and becomes exponential in this. Hence, the guess-and-check procedure only proves that NEXPTIME.
For the completeness result in (1) it suffices to argue that the problem is NP-hard. We make use of the fact that EOT in are expressive enough to accept a given word if and only if it is a valid encoded tiling, cf. Section 4 for details. It is possible to establish NP-hardness of a corresponding restriction of the octant word-tiling problem, namely the bounded octant word-tiling problem (for unarily encoded input values). See Appendix A for details on tiling problems. It then only remains to observe that the construction in Theorem 1 is in fact a polynomial-time reduction, and that it reduces the bounded octant word-tiling problem to the bounded satisfiability problem. The argument for NEXPTIME-hardness in statement (2) is done along the same lines with, again, the bounded octant-word tiling problem shown to be NEXPTIME-hard when the input parameter is given in binary coding. A formal proof for Theorem 3 is given in Appendix C. ∎
Satisfiability for fixed-width arithmetic EOT is decidable, but also difficult
We turn our attention to classes of EOT that naturally arise in practical contexts. We consider EOT that work over some fixed-width arithmetic, like fixed- or floating-point numbers, and which have an embedding relying on a periodical encoding of positions.
We start with establishing a scenario where Sat is decidable in NEXPTIME. Regardless of the underlying EOT class , our proof strategy always relies on a certifier-based understanding of NEXPTIME: given , we nondeterministically guess a word , followed by a deterministic certification whether holds. For this to show , we need to argue that the overall running time of such a procedure is at most exponential, in particular that whenever there is a word with then there is also some with and . Again, we rely on the polynomial evaluation property of EOT in , i.e. the fact that can be computed in time polynomial in .
We consider the class of EOT , defined by placing restrictions on the positional embedding of an EOT to be additive-periodical which means that where is periodical, i.e. there is such that for all . Additionally, all normalisation functions are realised by either the softmax function or the hardmax function . Moreover, we assume that all computations occurring in are carried out in some fixed-width arithmetic, encoding values in binary using a fixed number of bits. Aside from technical reasons, we motivate the choice of in Section 3. Given these restrictions, we adjust the definition of the complexity of as a measure of the size (up to polynomials) as .
Lemma 3.
There is a polynomial function such that for all and all words with there is word with and .
Proof Sketch.
The polynomial can be chosen uniformly for all because for all positional embeddings of EOT in there is an upper bound on the period and on the bit-width in the underlying arithmetic. The small-word property stated by the lemma is then shown by arguing, given polynomial , EOT and , that contains unnecessary subwords that can be cut out without changing the output in . Here, we exploit the fact has some periodicity and only consider those whose length is a multitude of . This ensures that the resulting word , given by without , is embedded the same way as by the positional embedding of . The existence of such subwords follows from ’s limited distinguishing capabilities, especially in its normalisations, due to the bounded representation size of numerical values possible in the underlying fixed-width arithmetic.
A formal proof relies on basic combinatorial arguments, but is technically extensive, and given in Appendix C. ∎
Based on this preliminary result, we can then immediately derive an upper bound on the complexity of .
Theorem 4.
The satisfiability problem for EOT over fixed-width arithmetic using additive-periodical embeddings is in NEXPTIME.
Proof.
Let working over alphabet . We use a certifier-based understanding of a nondeterministic exponential-time algorithm as follows: We (a) guess an input and (b) compute to check whether . For correctness, we need to argue that the length of is at most exponential in . This argument is given by Lemma 3. Note that via assumption we have that can be computed in polynomial time regarding and and obviously as well. ∎
Next, we address the goal of obtaining a matching lower bound, i.e. NEXPTIME-hardness. An obvious way to do so would be to follow Theorem 3.2 and form a reduction from the bounded octant word-tiling problem. Hence, given a tiling system and encoded binarily, we would have to construct – in time polynomial in – an EOT such that for some iff there is a word representing a valid -tiling. In particular, would have to be able to recognise the correct word length and reject input that is longer than . This poses a problem for EOT with periodical embeddings. To recognize whether a word is too long, an EOT must ultimately rely on its positional embedding, which seems to make a periodicity of necessary. Since the size of periodical EOT is linear in , we get an exponential blow-up in a potential reduction of to , given that the values of and already are exponential in the size of a binary representation of . This problem vanishes when the requirement of the underyling positional embedding to be periodical is lifted: allowing for non-periodical EOT, working over some fixed-width arithmetic, leads to an NEXPTIME-hard satisfiability problem. Let be defined similar to , but we allow for non-periodical embeddings. Furthermore, we assume that the considered fixed-width arithmetics can handle overflow situations using saturation.
Theorem 5.
The satisfiability problem for EOT over fixed-width arithmetic is NEXPTIME-hard.
Proof sketch.
We establish a reduction from to by constructing, for each instance of , an EOT working over some fixed-width arithmetic , which accepts exactly those with corresponding to a valid word-encoded tiling for . See Appendix A for details on tiling problems.
The construction is similar to the one given for in the proof of Theorem 4, but we need to enable to reject words that are too long corresponding a polynomial bound dependent on . This implies that , based on the positional embedding specified in Section 4, is able to check for all symbols if their respective position is less than or equal to a predefined bound. This can be achieved with similar tools as used in Lemma 2.
6 Summary and outlook
We investigated the satisfiability problem of encoder-only transformer (EOT) through the lens of formal reasoning. In particular, we considered the computability and complexity of the satisfiability problem Sat of EOT in context of different classes of EOT, forming a baseline for understanding possibilities and challenges of formal reasoning of transformers.
We showed that Sat is undecidable for classes of EOT recently considered in research on the expressiveness of different transformer models (Theorem 1 and Theorem 2). This implies that formal reasoning is impossible as soon as we consider classes of EOT that are at least as expressive as the classes considered in these results. We remark that this result also translates to encoder-decoder architectures, whose encoder part is as expressive as the here considered EOT.
Additionally, we identified two ways to make formal reasoning for EOT possible: either we bound the length of considered inputs (Theorem 3) or we consider quantized EOT, meaning EOT whose computations and parameters are limited by some fixed-width arithmetic (Theorem 4). These imply that formal reasoning is possible as long as we consider classes of EOT that are at most as expressive as the classes considered in these results. We remark that this statement makes the reasonable assumption that the driving force for any upper computability or complexity bound is the expressiveness of the EOT, not the intricacies of considered safety or interpretability assumption. However, in both cases Sat remains difficult (Theorem 3 and Theorem 5) from a complexity perspective. Again, these results are only valid for classes of EOT that are at least as expressive as the ones considered.
While our results build a first framework for understanding possibilities and challenges of formal reasoning of transformers, there is room for more detailed investigations. Firstly, consider our undecidability and hardness results. These rely on the fact that we consider normalisations realised by the hardmax function. However, it is unclear whether similar results can be achieved if we stick to normalisations realised by the commonly used softmax function. Furthermore, it would be of interest to further investigate the interplay of embedding function and internal structure of the considered EOT. We expect that less expressive embeddings demand a richer structure of the attention mechanisms, but it is unclear where the limits are in the sense that undecidability of the satisfiability problem is still given. Secondly, consider our decidability and upper complexity bound results. It could be of practical interest, to take a more detailed look at specifics of particular fixed-width arithmetics. While this will not change our results, it could give tighter time-complexity estimates which may be of interest in certain formal reasoning applications.
References
- [1] M. S. Baranowski, S. He, M. Lechner, T. S. Nguyen, and Z. Rakamaric. An SMT theory of fixed-point arithmetic. In N. Peltier and V. Sofronie-Stokkermans, editors, Automated Reasoning - 10th International Joint Conference, IJCAR 2020, Paris, France, July 1-4, 2020, Proceedings, Part I, volume 12166 of Lecture Notes in Computer Science, pages 13–31. Springer, 2020.
- [2] R. Berger. The undecidability of the domino problem. Mem. Amer. Math. Soc., 66:72, 1966.
- [3] S. Bhattamishra, A. Patel, and N. Goyal. On the computational power of transformers and its implications in sequence modeling. In R. Fernández and T. Linzen, editors, Proceedings of the 24th Conference on Computational Natural Language Learning, CoNLL 2020, Online, November 19-20, 2020, pages 455–475. Association for Computational Linguistics, 2020.
- [4] G. Bonaert, D. I. Dimitrov, M. Baader, and M. Vechev. Fast and precise certification of transformers. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2021, pages 466–481. Association for Computing Machinery, 2021.
- [5] Y. Bondarenko, M. Nagel, and T. Blankevoort. Understanding and overcoming the challenges of efficient transformer quantization. In M. Moens, X. Huang, L. Specia, and S. W. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 7947–7969. Association for Computational Linguistics, 2021.
- [6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- [7] D. Chiang, P. Cholak, and A. Pillay. Tighter bounds on the expressivity of transformer encoders. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 5544–5562. PMLR, 2023.
- [8] G. A. Constantinides, F. Dahlqvist, Z. Rakamaric, and R. Salvia. Rigorous roundoff error analysis of probabilistic floating-point computations. In A. Silva and K. R. M. Leino, editors, Computer Aided Verification - 33rd International Conference, CAV 2021, Virtual Event, July 20-23, 2021, Proceedings, Part II, volume 12760 of Lecture Notes in Computer Science, pages 626–650. Springer, 2021.
- [9] S. Demri, V. Goranko, and M. Lange. Temporal Logics in Computer Science. Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, 2016.
- [10] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
- [11] X. Dong, A. T. Luu, R. Ji, and H. Liu. Towards robustness against natural language word substitutions. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- [12] P. Dufter, M. Schmitt, and H. Schütze. Position information in transformers: An overview. Comput. Linguistics, 48(3):733–763, 2022.
- [13] M. Hahn. Theoretical limitations of self-attention in neural sequence models. Trans. Assoc. Comput. Linguistics, 8:156–171, 2020.
- [14] Y. Hao, D. Angluin, and R. Frank. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. Trans. Assoc. Comput. Linguistics, 10:800–810, 2022.
- [15] Y. Hsieh, M. Cheng, D. Juan, W. Wei, W. Hsu, and C. Hsieh. On the robustness of self-attentive models. In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 1520–1529. Association for Computational Linguistics, 2019.
- [16] X. Huang, W. Ruan, W. Huang, G. **, Y. Dong, C. Wu, S. Bensalem, R. Mu, Y. Qi, X. Zhao, K. Cai, Y. Zhang, S. Wu, P. Xu, D. Wu, A. Freitas, and M. A. Mustafa. A survey of safety and trustworthiness of large language models through the lens of verification and validation. CoRR, abs/2305.11391, 2023.
- [17] J. Marques-Silva and A. Ignatiev. Delivering trustworthy AI through formal XAI. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 12342–12350. AAAI Press, 2022.
- [18] W. Merrill and A. Sabharwal. A logic for expressing log-precision transformers. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
- [19] W. Merrill and A. Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023.
- [20] W. Merrill, A. Sabharwal, and N. A. Smith. Saturated transformers are constant-depth threshold circuits. Trans. Assoc. Comput. Linguistics, 10:843–856, 2022.
- [21] OpenAI. Gpt-4 technical report, 2023.
- [22] D. W. Otter, J. R. Medina, and J. K. Kalita. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Networks Learn. Syst., 32(2):604–624, 2021.
- [23] J. Pérez, P. Barceló, and J. Marinkovic. Attention is turing-complete. J. Mach. Learn. Res., 22:75:1–75:35, 2021.
- [24] M. Sälzer and M. Lange. Fundamental limits in formal verification of message-passing neural networks. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
- [25] Z. Shi, H. Zhang, K. Chang, M. Huang, and C. Hsieh. Robustness verification for transformers. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
- [26] L. Strobl, W. Merrill, G. Weiss, D. Chiang, and D. Angluin. What Formal Languages Can Transformers Express? A Survey. Transactions of the Association for Computational Linguistics, 12:543–561, 05 2024.
- [27] P. van Emde Boas. The convenience of tilings. In A. Sorbi, editor, Complexity, Logic, and Recursion Theory, volume 187 of Lecture notes in pure and applied mathematics, pages 331–363. Marcel Dekker, Inc., 1997.
- [28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017.
- [29] G. Weiss, Y. Goldberg, and E. Yahav. Thinking like transformers. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 11080–11090. PMLR, 2021.
- [30] Y. Yu, X. Si, C. Hu, and J. Zhang. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput., 31(7):1235–1270, 2019.
- [31] H. Zhao, H. Chen, F. Yang, N. Liu, H. Deng, H. Cai, S. Wang, D. Yin, and M. Du. Explainability for large language models: A survey. ACM Trans. Intell. Syst. Technol., 15(2):20:1–20:38, 2024.
Appendix A Tiling Problems
We make use of particular tiling problems in order to prove lower bounds on the complexity and decidability of for different classes .
A tiling system is an where is a finite set; its elements are called tiles. define a horizontal, resp. vertical matching relation between tiles, and are two designated initial, resp. final tiles in .
Problems associated with tiling systems are typically of the following form: given a discrete convex plain consisting of cells with horizontal and vertical neighbors, is it possible to cover the plane with tiles from in a way that horizontally adjacent tiles respect the relation and vertically adjacent tiles respect the relation , together with some additional constraints about where to put the initial and final tile . Such tiling problems, in particular for rectangular planes, have proved to be extremely useful in computational complexity, cf. [2, 27], since they can be seen as abstract versions of halting problems.
We need a variant in which the plane to be tiled is of triangular shape. The -th triangle is for . An ()-tiling of is a function s.t.
-
•
for all with ,
-
•
for all with .
Such a tiling a successful, if additionally and for some .
The unbounded octant tiling problem () is: given a tiling system , decide whether a successful -tiling of exists for some . The bounded octant tiling problem (OTP) is: given a tiling system and an , decide whether a successful -tiling of exists. Note that here, is part of the input, and that it can be represented differently, for example in binary or in unary encoding. We distinguish these two cases by referring to and .
It is well-known that is undecidable [27]. It is also not hard to imagine that is NP-complete while is NEXPTIME-complete. In fact, this is well-known for the variants in which the underlying plane is not a triangle of height but a square of height [27]. The exponential difference incurred by the more compact binary representation of the input parameter is best seen when regarding the upper complexity bound for these problems: given , a nondeterministic algorithm can simply guess all the many tiles of the underlying square and verify the horizontal and vertical matchings in time . If is encoded unarily, i.e. the space needed to write it down is , then the time needed for this is polynomial in the input size ; if is encoded binarily with space then the time needed for this is exponential in .
It then remains to argue that the tiling problems based on triangular planes are also NP- resp. NEXPTIME-complete. Clearly, the upper bounds can be established with the same guess-and-check procedure. For the lower bounds it suffices to observe that hardness of the tiling problems for the squares is established by a reduction from the halting problem for Turing machines (TM) such that a square of size represents a run of the TM of length as a sequence of rows, and each row represents a configuration of the TM using at most tape cells. This makes use of the observation that the space consumption of a TM can never exceed the time consumption. Likewise, assuming that a TM always starts a computation with its head on the very left end of a tape, one can easily observe that after time steps, it can change at most the leftmost tape cells. Hence, a run of a TM can therefore also be represented as a triangle with its first configuration of length 1 in row 1, the second of length 2 in row 2 etc.
At last, we consider two slight modifications of these two problems which are easily seen to preserve undecidability resp. NP- and NEXPTIME-completeness. The unbounded octant tiling-word problem () is: given some , decide whether there is a word for some , s.t. the tiling defined by comprises a successful tiling of . The two variants of the bounded octant tiling-word problem are both: given some as above and , decide whether such a word exists. Note that, again, here is an input parameter, and so its representation may affect the complexity of the problem, leading to the distinction between with binary encoding and with unary encoding.
Theorem 6.
-
a)
is undecidable (-complete).
-
b)
is NEXPTIME-complete.
-
c)
is NP-complete.
Proof.
(a) It should be clear that a tiling problem and its tiling-word variant (like and ) are interreducible since they only differ in the formulation of how the witness for a successful tiling should be presented. So they are essentially the same problems. Undecidability of and, thus, is known from [27], the -upper bound can be obtained through a semi-decision procedure that searches through the infinite space of -tiling for any . This justifies the statement in part (a) of Thm. 6.
(b) With the same argument as in (a) t suffices to consider instead of . The upper bound is easy to see: a nondeterministic procedure can easily guess a tiling for and verify the horizontal and vertical matching conditions, as well as the use of the initial and final tile in appropriate places. This is possible in time , resp. which is therefore exponential in the input size for binarily encoded parameters . This shows inclusion in NEXPTIME.
For the lower bound we argue that the halting problem for nondeterministic, exponentially-time bounded TM can be reduced to : given a nondeterministic TM over input alphabet and tape alphabet that halts after at most time steps on input words of length for some polynomial , and a word , we first construct a TM that is started in on the empty tape and begins by writing onto the tape and then simulates on it. This is a standard construction in complexity theory, and it is easy to see that the running time of is bounded by a function for some polynomial . With the observation made above, a computation of can be seen as a sequence of configurations , with . This does not directly define a tiling system, instead and again by a standard trick, cf. [27] or [9, Chp. 11], one compresses three adjacent tape cells into one tile in order to naturally derive a horizontal matching relation from overlaps between such triples and a vertical matching relation from the TM’s transition function. At last, let . It is then a simple exercise to verify that a valid tiling of the triangle corresponds to an accepting run of on and vice-versa, which establishes NEXPTIME-hardness.
(c) This is down exactly along the same lines as part (b), but instead making use of the fact that, when is given in unary encoding, is polynomial in the size of the representation of , and hence, the time needed for the guess-and-check procedure in the upper bound is only polynomial, and for the lower bound we need to assume that the running time of the TM is polynomially bounded. Thus, we get NP-completeness instead of NEXPTIME-completeness. ∎
Appendix B Proofs of Section 4
In the following, we give formal proof for the undecidability results of Section 4. To do so, we make use of classical Feed-Forward Neural Networks.
Feed-Forward Neural Network
A neuron is a computational unit computing a function by where is a function called activation and are parameters called bias resp. weight. A layer is a tuple of nodes where we assume that all nodes have the same input dimensionality . Therefore, computes a function . We call the size of layer . Let be a layer with input dimensionality and a layer of size . A Feed-Forward Neural Network (FNN) is a tuple of layers where we assume that for all holds that the size of equals the input dimensionality of . Therefore, computes a function by processing an input layer by layer.
In particular, we use specific FNN with activations, called gadgets, to derive lower bounds in connection with the expressibility of transformers. We denote the class of all FNN with activations by .
Lemma 4.
Let . There are basic gadgets
-
1.
computing ,
-
2.
computing a function such that if , if and otherwise,
-
3.
computing a function such that if , if and otherwise,
-
4.
computing a function such for all inputs with and holds if or and otherwise.
Proof.
Let be the minimal FNN computing , let be the minimal FNN computing where and let be the minimal FNN computing where . The claims of the lemma regarding these gadgets are straightforward given their functional form. Let be the minimal FNN computing . As stated in the lemma, we assume that and . Then, is if and if . Thus, is guaranteed to be if and otherwise it depends on . This gives the claim regarding gadget . ∎
We will combine gadgets in different ways. Let and be FNN with the same input dimensionality and output dimensionality respectively . We extend the computation of to functions with by weighting additional dimensions with in the input layer. Given a set of input dimensions , we denote the effective dimensions with pairwise different by . Formally, this means that for all inputs. We denote the FNN consisting of and placed next to each other by . Formally, this is done by combining and layer by layer using weights in intersecting connections. Then, computes given by . We generalize this operation to FNN in the obvious sense. Let be an FNN with input dimensionality and output dimensionality . We denote the FNN consisting of and placed sequentially by . Formally, this is done by connecting the output layer of with the input layer of . Then, computes given by .
We also consider specific gadgets needed in the context of tiling problems.
Lemma 5.
Let be a finite set and . There is FNN computing such that if and iff and there is for each computing such that for each and iff .
Proof.
Let be finite, and . First, consider . Let be the minimal FNN computing and be the minimal FNN computing . Obviously, computes the constant function and computes the identity in the form of two dimensional vectors. Let be given by the minimal FNN computing with the slight alteration that the two output dimensions of are connected to the first dimension of . Then, the claim of the lemma regarding follows from Lemma 4 and the operations on FNN described in Appendix B.
Now, consider . Given some let . Let be the minimal FNN computing . Furthermore, let for some set be the minimal FNN such that if and if . A construction for is given in Theorem 4 in [24]. According to this construction, consists of three layers and is polynomial in . In the case that we assume that is the constant function represented by a suitable FNN. Then, is given by for some arbitrary order on with the slight alteration that has two input dimensions, meaning that each subnet is connected to the same two input dimensions. Again, the claim of the lemma regarding follows from Lemma 4 and the operations on FNN described in Appendix B. ∎
Given these understandings of gadgets, we are set to formally prove the results of Section 4.
Proof of Lemma 1.
Let as stated in the lemma and assume some order on . Furthermore, let and if . Let . In the following, we build two layers and using components allowed in , satisfying the statement of the lemma. Layer consists of a single attention head . The scoring function is given by where and and . We have that and it follows that if and otherwise we have that . The pooling function is specified by the matrix and uses as normalisation function. The combination function is given by the FNN . Given a position , the attention head attends to all positions satisfying . This is due to the way is build. Then, computes using where is the number of positions attends to. Here, we exploit the fact that only the first position has a non-zero entry in the its first dimension and that for all head attends to . Finally, simply stacks the old vector onto the value , but leaves out the first dimension of . Let . Layer consists of a single attention head . The scoring function is given by where , and . We have that if where is the fifth dimension of and otherwise . The pooling function is specified by and uses as normalisation. The combination is given by the FNN . Given a position , the attention head attends to the position , where . Relying on our arguments regarding the computation of , this is the position satisfying . However, this is equal to the row index of the decomposition of based on the inversion of Cantor’s pairing function. Thus, we have that . Furthermore, we have that , which is computed by in the combination function . Overall, we see that gives the desired result. ∎
Proof of Lemma 2.
Let be as stated in the lemma. By definition of , the scoring function of is of the form and the normalisation is . Let , and be the minimal FNN computing where is given by Lemma 4. Overall, this ensures that the scoring is given by . Then, the statement of the lemma follows from the fact that attends to the maximum, which is given this scoring, and that is unique for each . ∎
Lemma 6.
There is attention head in such that for all sequences where all the head attends to given .
Proof.
By definition of , the scoring function of is of the form and the normalisation is . Let and let be equal to . Furthermore, let . We observe that outputs if and otherwise . In combination with , this ensures that behaves as stated by the lemma. ∎
Proof of Theorem 1.
We prove the statement via reduction from . Let be an instance of with . W.l.o.g we assume that . Let be built the following way. uses the embedding of transformer in specified in the beginning of Section 4. Furthermore, it has four layers. Layers , are as in Lemma 1. Layer is given by where , and are of Lemma 2 whereby , and . We assume that all three attention heads use the identity matrix as linear maps in their respective pooling function. is given by an FNN computing . Let the input dimensions of be . Then, is equal to
where , , , and using the gadgets and constructions described in Appendix B. Layer is given by where attends to given and is given by the minimal FNN computing . A formal proof for the existence of in is given in Lemma 6. Furthermore, the output function of is given by the minimal FNN computing .
Let be some word over alphabet . As defined above, we have that where . Consider , namely the sequence of vectors after propagating through the embedding and layers of . As stated by Lemma 1, we have that where and are the row respectively column of tile if we interpret as an encoded tiling. Note that all vectors are non-negative due to the way is built. In the following, we argue that all if and only if is a valid encoded tiling. Given this equivalence, the statement of the lemma follows immediately as simply sums up all vectors and dimensions (except for the first and second) of in and the output of indicates whether there was some non-zero value. We fix some arbitrary . Then, where if and otherwise, if and otherwise and if and otherwise.
Consider property and subnetwork . With the understanding gained in Appendix B, outputs iff . These dimensions correspond to positions and , which are only equal if (Lemma 2). Furthermore, the property of stated by Lemma 4 is given as the output of is guaranteed to be in and the values of and are guaranteed to be in . In summary, this ensures that the third dimension of is iff . For other positions the third dimension is always since outputs in these cases due to the fact that equals . Analogously, and ensure that and and, thus, property (b) iff the fourth and fifth dimensions in all positions are equal to . Consider properties (c) and (d) described above and assume that property (a) holds. These two properties are non-local in the sense that they depend on at least two positions in . Consider the subnet . By construction and the gadgets described in Appendix B, we have that outputs if and or if , which means that tile is rightmost in its corresponding row. Otherwise the value computed by is greater than . Analogously, subnet checks whether vertically stacked tiles do match. In summary, this ensures that the sixth and seventh dimension of each is equal to if and only if properties (c) and (d) hold. ∎
Proof of Theorem 2.
In the same manner as in the proof of Theorem 1, we prove the statement via reduction from . The reduction is exactly the same, namely given an instance we build EOT which recognizes exactly those words representing a valid encoded tiling of . For details, see the proof of Theorem 1.
Given the correctness arguments for in Theorem 1, it is left to argue that works as intended, despite the fact that it works over some FA using at most bits where is the length of an input word. We choose such that overflow situations do not occur in any computation and rounding is handled such that works as intended. Throughout this proof, we use Namely, given a word with assume that uses bits and rounds values off to the nearest representable number. We denote the value resulting from rounding off in arithmetic by . We assume that there is an extra bit that is used as a sign bit and that at least bits can be used to represent integer and at least bits can be used to represent fractional parts. Note that this is a reasonable assumption for all common FA, like fixed-point or floating-point arithmetic. Furthermore, it is clearly the case that . To ease our arguments and notation from here on, we assume w.l.o.g. that we represent using instead of .
Per definition, uses the embedding function and . First, we assume that each , namely the value representing a specific tile from , is a unique, positive value. This is possible as uses bits. Furthermore, we see that , especially the sum , works as intended up to due to the fact that uses more than bits to represent integer parts. Next, consider layer and of Lemma 1. Layer consists of a single attention head . Here, the only crucial parts are the computation of value in for a position . Per definition, corresponds to the number of positions such that . As is bounded by , this inequality can only be satisfied by positions for which holds. As uses to count the positions for which this inequality holds, is bounded by . Next, we observe that , namely the general understanding of rounding off where we use bits to represent fractions. However, this gives that for all that as holds for all . This means, that it is ensured by that is uniquely representable.
Next, the only crucial part in is the computation of the product , which is used to determine the position for which in , which is obviously given by position . This equality is no longer guaranteed to exist if we consider . However, due to the monotonicity of for and that the maximum round of error is given by , we have that the produces the value closest to in the product . Taking a look at , this ensures that is still the position that attends to. Therefore, the statement of Lemma 1 is still valid for working over . We observe that all values of some vector after layer are positive integers whose magnitude is bounded by .
Now, consider layer and . From the proof of Theorem 1 we see that the gadgets at most sum up two values or compute a fraction of the form and (in gadgets or ). Both can safely be done with at least bits for integer and for fractional parts, as all previously computed values, up to layer , in a computation of are representable using bits. We observe that the values of the third to seventh dimension of some are either or . This is due to the fact that all values after layer are guaranteed to be integers. Next, consider layer . The computation done by is safe (see Lemma 6) and the crucial step here is the computation of given by . The values are all of the form where is guaranteed to be or and is the normalisation induced by from perspective of position . However, this means is bounded by and, thus, if and only if for all due to the fact that allows for bits to represent fractional parts. Finally, is trivially computable in , which finishes the proof. ∎
Appendix C Proofs of Section 5
Proof of Theorem 3.
The decidability and membership results of statements (1) and (2 )are sufficiently argued in the proof sketch given in Section 5.
To prove the hardness results of statements (1) and (2), we establish a reduction from respectively : given some bounded word-tiling instance we build an instance of respectively where is build as described in Theorem 1. The only missing argument is that these reductions are polynomial. In particular, this means that must be built in polynomial time regarding the size of . Therefore, we recall the proof of Theorem 1.
First, we see that the embedding function and the amount of layers of is independent of and . The first two layers and of are specified in Lemma 1. Recalling the proof of Lemma 1, we see that and each consist of a single attention head, whose internal parameters like scoring, pooling or combination are independent of as well. Next, consider layer . This layer consists of three attention heads , and each given by the template described in Lemma 2, which again is independent of . Additionally, contains the combination function . This combination function is represented by a FNN , using smaller FNN , , , and as building blocks. These are dependent on , as they are built using gadgets , , and where , , and are components of . However, in the proof of Lemma 5 we see that these gadgets are at most polynomial in their respective parameter. Layer and the output function, specified by FNN , are again independent of . In summary, the EOT is polynomial in , which makes the reductions from und polynomial. ∎
Next, we address the proof of Lemma 3. We need some preliminary, rather technical result first. Let be an EOT and be a word and consider the computation . Let and be the sequence of vectors occurring after the computation of layer of . Let and be two vectors matching the dimensionality of of . Overloading some notation, let where is the vector of all scorings of with sequence . We remark that it is not necessary that or must occur in for this to be well defined. Again overloading some notation, let .
Lemma 7.
Let be a additive-periodical EOT of depth , maximum width and periodicity with for all , let where , all and all also occur in or and let be the set of all vectors occurring in any of the sequences . If there are indexes such that for all holds that and then it holds that and .
Proof.
Let , , , and be as stated above. We prove the statement via induction on the layers . First, consider layer and fix some tuple . We first show that . Assume that is given by . Then, computes for all words . Obviously, the numerator in and is equal. By definition, we have that is local in the sense that it compares vectors pairwise, producing the different scoring values independent of the overall word. Furthermore, due to the fact that is additive-periodical, we have and are equal in the sense that the vectors corresponding to are equal. We refer to this property (*) later on. Using these observations and that , we have that the denominator is equal as well. Now, assume that is given by . Then, computes where if is maximal in and otherwise for any word . In contrast to , we have that the values of are dependent of the overall context, namely the vector of all scorings . Compare and , both given by the additive-periodical embedding . Via assumption, we have that each block also occurs in or . In particular, this means every vector that occurs in does also occur in and vice-versa. This implies that for any scoring value . In combination with the assumption that and the observations above, we also get in the case. Next, consider the pooling functions. By definition, we have that computes for any word . Our previous arguments give that . In combination with and (*), we immediately get that holds as well. Next, consider layer . The arguments are exactly the same as in the base case. However, we need to rely on the induction hypothesis. Namely, we assume that all produce the same output in computation and computation . This implies that all vectors present in are also present in and vice-versa and that the vectors corresponding to are equal in both computations. ∎
Proof of Lemma 3.
Let be an additive-periodical EOT working over alphabet , having periodicity , depth , maximum width , maximum dimensionality and working over an FA using bits for binary encoding. We use to denote the set of values representable in the fixed arithmetic that works over. Note that . Let be a word such that . We observe that there is such that where are blocks of symbols of length and . Our goal is to prove that a not necessarily connected subsequence of at most many -blocks from is sufficient to ensure the same computation of . In the case that we are done. Therefore, assume that .
Let be the set of all unique . We observe that . Next, we fix some not necessarily connected but ordered subsequence with , and of such that each occurs exactly once. For the case that we allow this specific block to occur twice in . The assumption implies that . This means that there are pairs in with some non-empty sequence of -blocks in between. W.lo.g. assume and is such a pair. Our goal is to argue that there are at most blocks from needed to ensure the same computation of . Given that this argument works for all adjacent pairs in , we are done.
Consider the computation . The additive-periodical embedding of implies that includes at most different vectors. Furthermore, from layer to layer equal vectors are mapped equally, which means that each contains at most different vectors as well. This implies that the computation induces at most different tuples where are vectors induced by and . Additionally, we have that for each value and , as defined in the beginning of this section, there are at most possibilities. Simple combinatorics, namely the pigeon hole principle, states that in the increasing sequence there must be points and with such that for all tuples induced by we have that and . Now, Lemma 7 states that this implies and . However, this implies that the subsequence has no influence in the computation of on and, thus, can be left out. As we can argue this for every such cycle occurring in , we get the desired bound of . ∎
Proof of Theorem 5.
We prove the statement via reduction from . Let and be an instance of . We construct an EOT working over some FA with if and only if witnesses the validity of the instance .
Next, let be built exactly like in the proof of Theorem 4, but with the following structural adjustments. In layer we adjust to be where is specified as in the proof of Theorem 4, and where is analogous to the construction of given in Lemma 5. Furthermore, we adjust in layer to be represented by the FNN . We refer to the gadgets described in Lemma 4 and Lemma 5 as well as the proof of Theorem 1 for further details.
Consider the adjustment in . FNN in ensures that only if the row index corresponding to the last symbol is equal to . Note that checks whether row and column index corresponding to the last symbol are equal. Additionally, checks if there is no id equal to . This corresponds to the position id of the successor of the vector representing tile . Furthermore, the adjustment of considers the output of and in addition to the outputs of . In summary, we have that only outputs given if the word length is such that the row index corresponding to the position of the last symbol of in a respective octant tiling is equal to (ensured by ), that is at most of length (ensured by ) and if represents a valid encoded tiling (the remaining parts of .
Additionally, we need to argue that works as intended, despite the fact that it is limited by some FA using a representation size that is at most logarithmic in . These arguments follow the exact same line as in the proof of Theorem 2, but using FA that uses bits and handles overflow using saturation. The reason for the larger representation size is that words representing a valid encoded tiling ending at position are of length . Thus, we use integer bits to be able to represent a sum for all and fractional bits to uniquely represent fraction for . For detail see the proof of Theorem 2. Furthermore, the fact that we use bits to encode integers and that handles overflow using saturation ensures that works as intended: we have that and, thus, we have that the id occurs at most once, independent of the length of as it is not the point where enforces saturation on the positional embedding. Thus, works for this position as intended and then checks the property described above correctly.
The argument that can be built in polynomial time is a straightforward implication from the arguments for Theorem 3 and the fact that and are a small gadgets with maximum parameter quadratic in , which can be represented using a logarithmic amount of bits. ∎
NeurIPS Paper Checklist
-
1.
Claims
-
Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
-
Answer: [Yes]
-
Guidelines:
-
•
The answer NA means that the abstract and introduction do not include the claims made in the paper.
-
•
The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
-
•
The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
-
•
It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
-
•
-
2.
Limitations
-
Question: Does the paper discuss the limitations of the work performed by the authors?
-
Answer: [Yes]
-
Guidelines:
-
•
The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
-
•
The authors are encouraged to create a separate "Limitations" section in their paper.
-
•
The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
-
•
The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
-
•
The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
-
•
The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
-
•
If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
-
•
While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in develo** norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
-
•
-
3.
Theory Assumptions and Proofs
-
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
-
Answer: [Yes]
-
Justification: For each result we gave a full formal proof in the Appendix and a short proof sketch and intuitive explanation in the main paper.
-
Guidelines:
-
•
The answer NA means that the paper does not include theoretical results.
-
•
All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
-
•
All assumptions should be clearly stated or referenced in the statement of any theorems.
-
•
The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
-
•
Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
-
•
Theorems and Lemmas that the proof relies upon should be properly referenced.
-
•
-
4.
Experimental Result Reproducibility
-
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
-
Answer: [N/A]
-
Justification:
-
Guidelines:
-
•
The answer NA means that the paper does not include experiments.
-
•
If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
-
•
If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
-
•
Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
-
•
While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example
-
(a)
If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
-
(b)
If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
-
(c)
If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
-
(d)
We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
-
(a)
-
•
-
5.
Open access to data and code
-
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
-
Answer: [N/A]
-
Justification:
-
Guidelines:
-
•
The answer NA means that paper does not include experiments requiring code.
-
•
Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
-
•
While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
-
•
The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
-
•
The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
-
•
The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
-
•
At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
-
•
Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
-
•
-
6.
Experimental Setting/Details
-
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
-
Answer: [N/A]
-
Justification:
-
Guidelines:
-
•
The answer NA means that the paper does not include experiments.
-
•
The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
-
•
The full details can be provided either with the code, in appendix, or as supplemental material.
-
•
-
7.
Experiment Statistical Significance
-
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
-
Answer: [N/A]
-
Justification:
-
Guidelines:
-
•
The answer NA means that the paper does not include experiments.
-
•
The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
-
•
The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
-
•
The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
-
•
The assumptions made should be given (e.g., Normally distributed errors).
-
•
It should be clear whether the error bar is the standard deviation or the standard error of the mean.
-
•
It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
-
•
For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
-
•
If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
-
•
-
8.
Experiments Compute Resources
-
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
-
Answer: [N/A]
-
Justification:
-
Guidelines:
-
•
The answer NA means that the paper does not include experiments.
-
•
The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
-
•
The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
-
•
The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).
-
•
-
9.
Code Of Ethics
-
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
-
Answer: [Yes]
-
Justification: None of the topics “Potential Harms Caused by Research Process”, “Societal Impact and Potential Harmful Consequences” or “Impact Mitigation Measures” does apply to our theoretical results.
-
Guidelines:
-
•
The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
-
•
If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
-
•
The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
-
•
-
10.
Broader Impacts
-
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
-
Answer: [N/A]
-
Justification:
-
Guidelines:
-
•
The answer NA means that there is no societal impact of the work performed.
-
•
If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
-
•
Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
-
•
The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
-
•
The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
-
•
If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
-
•
-
11.
Safeguards
-
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
-
Answer: [N/A]
-
Justification:
-
Guidelines:
-
•
The answer NA means that the paper poses no such risks.
-
•
Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
-
•
Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
-
•
We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
-
•
-
12.
Licenses for existing assets
-
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
-
Answer: [N/A]
-
Justification:
-
Guidelines:
-
•
The answer NA means that the paper does not use existing assets.
-
•
The authors should cite the original paper that produced the code package or dataset.
-
•
The authors should state which version of the asset is used and, if possible, include a URL.
-
•
The name of the license (e.g., CC-BY 4.0) should be included for each asset.
-
•
For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
-
•
If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
-
•
For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
-
•
If this information is not available online, the authors are encouraged to reach out to the asset’s creators.
-
•
-
13.
New Assets
-
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
-
Answer: [N/A]
-
Justification:
-
Guidelines:
-
•
The answer NA means that the paper does not release new assets.
-
•
Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
-
•
The paper should discuss whether and how consent was obtained from people whose asset is used.
-
•
At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
-
•
-
14.
Crowdsourcing and Research with Human Subjects
-
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
-
Answer: [N/A]
-
Justification:
-
Guidelines:
-
•
The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
-
•
Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
-
•
According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
-
•
-
15.
Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
-
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
-
Answer: [N/A]
-
Justification:
-
Guidelines:
-
•
The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
-
•
Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
-
•
We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
-
•
For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.
-
•