
Exponentially Faster Language Modelling

Authors: Peter Belcak, Roger Wattenhofer

Abstract: Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights.

Submitted 21 November, 2023; v1 submitted 15 November, 2023; originally announced November 2023.
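
The "12 out of 4095 neurons" figure in the abstract above is consistent with a balanced binary tree of depth 11: such a tree has 2^12 - 1 = 4095 nodes, and a single root-to-leaf path visits 12 of them (about 0.3%). The following is a minimal sketch of that kind of conditional execution, not the authors' implementation. The heap-order node layout, the sign-based branching rule, the GELU activation, the random weights, and the name `fff_forward` are all assumptions made for illustration.

```python
import math
import random


def gelu(z):
    """Exact GELU via the error function (activation choice is an assumption)."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def fff_forward(x, w_in, w_out, depth):
    """Evaluate one neuron per tree level along a single root-to-leaf path.

    Nodes are stored in heap order (children of node i are 2i+1 and 2i+2).
    Each node owns one input weight vector and one output weight vector.
    """
    out = [0.0] * len(w_out[0])
    node, visited = 0, []
    for _ in range(depth + 1):
        visited.append(node)
        # Pre-activation of this node's single neuron.
        score = sum(w * xi for w, xi in zip(w_in[node], x))
        act = gelu(score)
        # Accumulate this neuron's contribution to the layer output.
        for j, w in enumerate(w_out[node]):
            out[j] += act * w
        # Assumed branching rule: positive score goes right, otherwise left.
        node = 2 * node + (2 if score > 0 else 1)
    return out, visited


random.seed(0)
depth, dim = 11, 8                     # dim is a hypothetical hidden width
n_nodes = 2 ** (depth + 1) - 1         # 4095 neurons in the tree
w_in = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_nodes)]
w_out = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_nodes)]
x = [random.gauss(0, 1) for _ in range(dim)]

y, visited = fff_forward(x, w_in, w_out, depth)
print(len(visited), n_nodes)
```

The cost per input grows with the tree depth rather than with the total neuron count, which is the logarithmic relationship the listed FFF papers describe.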

arXiv:2308.14711 [pdf, other]

Fast Feedforward Networks

Authors: Peter Belcak, Roger Wattenhofer

Abstract: We break the linear link between the layer size and its inference cost by introducing the fast feedforward (FFF) architecture, a log-time alternative to feedforward networks. We demonstrate that FFFs are up to 220x faster than feedforward networks, up to 6x faster than mixture-of-experts networks, and exhibit better training properties than mixtures of experts thanks to noiseless conditional execution. Pushing FFFs to the limit, we show that they can use as little as 1% of layer neurons for inference in vision transformers while preserving 94.2% of predictive performance.

Submitted 18 September, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

Comments: 12 pages, 6 figures, 4 tables

arXiv:2306.01009 [pdf, other]

Examining the Emergence of Deductive Reasoning in Generative Language Models

Authors: Peter Belcak, Luca A. Lanzendörfer, Roger Wattenhofer

Abstract: We conduct a preliminary inquiry into the ability of generative transformer models to deductively reason from premises provided. We observe notable differences in the performance of models coming from different training setups and find that the deductive reasoning ability increases with scale. Further, we discover that the performance generally does not decrease with the length of the deductive chain needed to reach the conclusion, with the exception of OpenAI GPT-3 and GPT-3.5 models. Our study considers a wide variety of transformer-decoder models, ranging from 117 million to 175 billion parameters in size.

Submitted 31 May, 2023; originally announced June 2023.

Comments: Accepted to the 1st Natural Language Reasoning and Structured Explanations Workshop (NLRSE@ACL'23). 8 pages, 4 figures, 3 tables

arXiv:2210.16606 [pdf, other]

Neural Combinatorial Logic Circuit Synthesis from Input-Output Examples

Authors: Peter Belcak, Roger Wattenhofer

Abstract: We propose a novel, fully explainable neural approach to synthesis of combinatorial logic circuits from input-output examples. The carrying advantage of our method is that it readily extends to inductive scenarios, where the set of examples is incomplete but still indicative of the desired behaviour. Our method can be employed for a virtually arbitrary choice of atoms - from logic gates to FPGA blocks - as long as they can be formulated in a differentiable fashion, and consistently yields good results for synthesis of practical circuits of increasing size. In particular, we succeed in learning a number of arithmetic, bitwise, and signal-routing operations, and even generalise towards the correct behaviour in inductive scenarios. Our method, attacking a discrete logical synthesis problem with an explainable neural approach, hints at a wider promise for synthesis and reasoning-related tasks.

Submitted 29 October, 2022; originally announced October 2022.

Comments: Accepted to the 2nd Workshop on Math-AI (MATH-AI@NeurIPS'22). 10 pages, 1 figure

arXiv:2209.11628 [pdf, other]

A Neural Model for Regular Grammar Induction

Authors: Peter Belcák, David Hofer, Roger Wattenhofer

Abstract: Grammatical inference is a classical problem in computational learning theory and a topic of wider influence in natural language processing. We treat grammars as a model of computation and propose a novel neural approach to induction of regular grammars from positive and negative examples. Our model is fully explainable, its intermediate results are directly interpretable as partial parses, and it can be used to learn arbitrary regular grammars when provided with sufficient data. We find that our method consistently attains high recall and precision scores across a range of tests of varying complexity.

Submitted 1 October, 2022; v1 submitted 23 September, 2022; originally announced September 2022.

Comments: Accepted to the 21st IEEE International Conference on Machine Learning and Applications (ICMLA) 2022, 6 pages, 4 figures

arXiv:2209.10280 [pdf, other]

Periodic Extrapolative Generalisation in Neural Networks

Authors: Peter Belcák, Roger Wattenhofer

Abstract: The learning of the simplest possible computational pattern -- periodicity -- is an open problem in the research of strong generalisation in neural networks. We formalise the problem of extrapolative generalisation for periodic signals and systematically investigate the generalisation abilities of classical, population-based, and recently proposed periodic architectures on a set of benchmarking tasks. We find that periodic and "snake" activation functions consistently fail at periodic extrapolation, regardless of the trainability of their periodicity parameters. Further, our results show that traditional sequential models still outperform the novel architectures designed specifically for extrapolation, and that these are in turn trumped by population-based training. We make our benchmarking and evaluation toolkit, PerKit, available and easily accessible to facilitate future work in the area.

Submitted 21 September, 2022; originally announced September 2022.

Comments: Accepted to IEEE Symposium on Deep Learning (IEEE DL) 2022, 8 pages, 7 figures

arXiv:2209.09543 [pdf, other]

FACT: Learning Governing Abstractions Behind Integer Sequences

Authors: Peter Belcák, Ard Kastrati, Flavio Schenker, Roger Wattenhofer

Abstract: Integer sequences are of central importance to the modeling of concepts admitting complete finitary descriptions. We introduce a novel view on the learning of such concepts and lay down a set of benchmarking tasks aimed at conceptual understanding by machine learning models. These tasks indirectly assess model ability to abstract, and challenge them to reason both interpolatively and extrapolatively from the knowledge gained by observing representative examples. To further aid research in knowledge representation and reasoning, we present FACT, the Finitary Abstraction Comprehension Toolkit. The toolkit surrounds a large dataset of integer sequences comprising both organic and synthetic entries, a library for data pre-processing and generation, a set of model performance evaluation tools, and a collection of baseline model implementations, enabling the making of the future advancements with ease.

Submitted 20 September, 2022; originally announced September 2022.

Comments: Accepted to the 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks. 37 pages

arXiv:2208.10290 [pdf, other]

Deterministic Graph-Walking Program Mining

Authors: Peter Belcak, Roger Wattenhofer

Abstract: Owing to their versatility, graph structures admit representations of intricate relationships between the separate entities comprising the data. We formalise the notion of connection between two vertex sets in terms of edge and vertex features by introducing graph-walking programs. We give two algorithms for mining of deterministic graph-walking programs that yield programs in the order of increasing length. These programs characterise linear long-distance relationships between the given two vertex sets in the context of the whole graph.

Submitted 22 August, 2022; originally announced August 2022.

Comments: Paper accepted for an oral presentation at Advanced Data Mining and Applications (ADMA) 2022. 15 pages, 3 figures

MSC Class: 68T10; 68T09 ACM Class: I.3; I.5

arXiv:2010.07874

The LL(finite) strategy for optimal LL(k) parsing

Authors: Peter Belcak

Abstract: The LL(finite) parsing strategy for parsing of LL(k) grammars where k needs not to be known is presented. The strategy parses input in linear time, uses arbitrary but always minimal lookahead necessary to disambiguate between alternatives of nonterminals, and it is optimal in the number of lookahead terminal scans performed. Modifications to the algorithm are shown that allow for resolution of grammar ambiguities by precedence -- effectively interpreting the input as a parsing expression grammar -- as well as for the use of predicates, and a proof of concept, the open-source parser generator Astir, employs the LL(finite) strategy in the output it generates.

Submitted 20 January, 2021; v1 submitted 15 October, 2020; originally announced October 2020.

Comments: An error was found in one of the algorithms for weak LL(k) grammars

arXiv:2008.07871 [pdf, other]

Fast Agent-Based Simulation Framework with Applications to Reinforcement Learning and the Study of Trading Latency Effects

Authors: Peter Belcak, Jan-Peter Calliess, Stefan Zohren

Abstract: We introduce a new software toolbox for agent-based simulation. Facilitating rapid prototyping by offering a user-friendly Python API, its core rests on an efficient C++ implementation to support simulation of large-scale multi-agent systems. Our software environment benefits from a versatile message-driven architecture. Originally developed to support research on financial markets, it offers the flexibility to simulate a wide-range of different (easily customisable) market rules and to study the effect of auxiliary factors, such as delays, on the market dynamics. As a simple illustration, we employ our toolbox to investigate the role of the order processing delay in normal trading and for the scenario of a significant price change. Owing to its general architecture, our toolbox can also be employed as a generic multi-agent system simulator. We provide an example of such a non-financial application by simulating a mechanism for the coordination of no-regret learning agents in a multi-agent network routing scenario previously proposed in the literature.

Submitted 21 September, 2022; v1 submitted 18 August, 2020; originally announced August 2020.

Comments: Presented at the International Workshop on Multi-Agent Systems and Agent-Based Simulation (MABS@AAMAS) 2021, 12 pages, 8 figures

Showing 1–10 of 10 results for author: Belcak, P