-
Determination of the stably free cancellation property for orders
Authors:
Werner Bley,
Tommy Hofmann,
Henri Johnston
Abstract:
Let $K$ be a number field, let $A$ be a finite-dimensional semisimple $K$-algebra, and let $Λ$ be an $\mathcal{O}_{K}$-order in $A$. We give practical algorithms that determine whether or not $Λ$ has the stably free cancellation property (SFC). As an application, we determine all finite groups $G$ of order at most $383$ such that the integral group ring $\mathbb{Z}[G]$ has SFC.
Submitted 2 July, 2024;
originally announced July 2024.
-
Landscaping Linear Mode Connectivity
Authors:
Sidak Pal Singh,
Linara Adilova,
Michael Kamp,
Asja Fischer,
Bernhard Schölkopf,
Thomas Hofmann
Abstract:
The presence of linear paths in parameter space between two different network solutions in certain cases, i.e., linear mode connectivity (LMC), has garnered interest from both theoretical and practical fronts. There has been significant research that either practically designs algorithms catered for connecting networks by adjusting for the permutation symmetries as well as some others that more theoretically construct paths through which networks can be connected. Yet, the core reasons for the occurrence of LMC, when in fact it does occur, in the highly non-convex loss landscapes of neural networks are far from clear. In this work, we take a step towards understanding it by providing a model of how the loss landscape needs to behave topographically for LMC (or the lack thereof) to manifest. Concretely, we present a `mountainside and ridge' perspective that helps to neatly tie together different geometric features that can be spotted in the loss landscape along the training runs. We also complement this perspective by providing a theoretical analysis of the barrier height, for which we provide empirical support, and which additionally extends as a faithful predictor of layer-wise LMC. We close with a toy example that provides further intuition on how barriers arise in the first place, all in all, showcasing the larger aim of the work -- to provide a working model of the landscape and its topography for the occurrence of LMC.
Submitted 23 June, 2024;
originally announced June 2024.
-
Non-Weyl Behavior Induced by Superradiance: A Microwave Graph Study
Authors:
Junjie Lu,
Tobias Hofmann,
Hans-Jürgen Stöckmann,
Ulrich Kuhl
Abstract:
We study experimentally the manifestation of non-Weyl graph behavior in open systems using microwave networks. This requires varying the coupling to the network, which was out of reach until now. The coupling to the environment is changed by indirectly varying the boundary condition at the coupling vertex from Dirichlet to Neumann using a dangling bond of variable length attached to the coupling vertex. A transformation of equal-length spectra to equal-reflection-phase spectra of the dangling bond allows us to create spectra with different fixed coupling strengths. This makes it possible to follow the resonances in the complex plane as a function of the coupling. Going from the closed (Dirichlet) to the fully open (Neumann) graph, we see resonances escaping via a superradiant transition, leading to non-Weyl behavior if the coupling to the outside is balanced. The open tetrahedral graph displays rich parametric dynamics of the resonances in the complex plane, presenting loops, regions of connected resonances, and resonances approaching infinite imaginary parts.
Submitted 14 June, 2024;
originally announced June 2024.
-
Explicit Word Density Estimation for Language Modelling
Authors:
Jovan Andonov,
Octavian Ganea,
Paulina Grnarova,
Gary Bécigneul,
Thomas Hofmann
Abstract:
Language modelling has been a central part of Natural Language Processing for a very long time, and in the past few years LSTM-based language models have been the go-to method for commercial language modelling. Recently, it has been shown that, when looking at language modelling from a matrix factorization point of view, the final Softmax layer limits the expressiveness of the model by putting an upper bound on the rank of the resulting matrix. Additionally, a new family of neural networks called NeuralODEs has been introduced as a continuous alternative to Residual Networks. Moreover, it has been shown that there is a connection between these models and Normalizing Flows. In this work we propose a new family of language models based on NeuralODEs and the continuous analogue of Normalizing Flows and manage to improve on some of the baselines.
Submitted 10 June, 2024;
originally announced June 2024.
-
Causal Estimation of Memorisation Profiles
Authors:
Pietro Lesci,
Clara Meister,
Thomas Hofmann,
Andreas Vlachos,
Tiago Pimentel
Abstract:
Understanding memorisation in language models has practical and societal implications, e.g., studying models' training dynamics or preventing copyright infringements. Prior work defines memorisation as the causal effect of training with an instance on the model's ability to predict that instance. This definition relies on a counterfactual: the ability to observe what would have happened had the model not seen that instance. Existing methods struggle to provide computationally efficient and accurate estimates of this counterfactual. Further, they often estimate memorisation for a model architecture rather than for a specific model instance. This paper fills an important gap in the literature, proposing a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics. Using this method, we characterise a model's memorisation profile--its memorisation trends across training--by only observing its behaviour on a small set of instances throughout training. In experiments with the Pythia model suite, we find that memorisation (i) is stronger and more persistent in larger models, (ii) is determined by data order and learning rate, and (iii) has stable trends across model sizes, thus making memorisation in larger models predictable from smaller ones.
Submitted 6 June, 2024;
originally announced June 2024.
-
Theory of Eigenstate Thermalisation
Authors:
Tobias Helbig,
Tobias Hofmann,
Ronny Thomale,
Martin Greiter
Abstract:
If we prepare an isolated, interacting quantum system in an eigenstate and perturb a local observable at an initial time, its expectation value will relax towards a thermal expectation value, even though the time evolution of the system is deterministic. The eigenstate thermalization hypothesis (ETH) of Deutsch and Srednicki suggests that this is possible because each eigenstate of the full quantum system acts as a thermal bath to its subsystems, such that the reduced density matrices of the subsystems resemble thermal density matrices. Here, we use the observation that the eigenvalue distribution of interacting quantum systems is a Gaussian under very general circumstances, and Dyson Brownian motion random matrix theory, to derive the ETH and thereby elevate it from hypothesis to theory. Our analysis provides a derivation of statistical mechanics which neither requires the concepts of ergodicity or typicality, nor that of entropy. Thermodynamic equilibrium follows solely from the applicability of quantum mechanics to large systems and the absence of integrability.
Submitted 3 June, 2024;
originally announced June 2024.
-
Understanding and Minimising Outlier Features in Neural Network Training
Authors:
Bobby He,
Lorenzo Noci,
Daniele Paliotta,
Imanol Schlag,
Thomas Hofmann
Abstract:
Outlier Features (OF) are neurons whose activation magnitudes significantly exceed the average over a neural network's (NN) width. They are well known to emerge during standard transformer training and have the undesirable effect of hindering quantisation in afflicted models. Despite their practical importance, little is known about why OFs emerge during training or how one can minimise them.
Our work focuses on the above questions, first identifying several quantitative metrics, such as the kurtosis over neuron activation norms, to measure OFs. With these metrics, we study how architectural and optimisation choices influence OFs, and provide practical insights to minimise OFs during training. As highlights, we emphasise the importance of controlling signal propagation throughout training, and propose the Outlier Protected transformer block, which removes standard Pre-Norm layers to mitigate OFs, without loss of convergence speed or training stability. Overall, our findings shed new light on our understanding of, our ability to prevent, and the complexity of this important facet in NN training dynamics.
Submitted 29 May, 2024;
originally announced May 2024.
-
Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control
Authors:
Maria Mihaela Trusca,
Wolf Nuyts,
Jonathan Thomm,
Robert Honig,
Thomas Hofmann,
Tinne Tuytelaars,
Marie-Francine Moens
Abstract:
Current diffusion models create photorealistic images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image. This is evidenced by our novel image-graph alignment model called EPViT (Edge Prediction Vision Transformer) for the evaluation of image-text alignment. To alleviate the above problem, we propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence. Additionally, the syntax structure of the prompt helps to disentangle the multimodal CLIP embeddings that are commonly used in T2I generation. The resulting DisCLIP embeddings and FCA are easily integrated into state-of-the-art diffusion models without additional training of these models. We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets. Code and data will be made available upon acceptance.
Submitted 21 April, 2024;
originally announced April 2024.
-
MAM-STM: A software for autonomous control of single moieties towards specific surface positions
Authors:
Bernhard Ramsauer,
Johannes J. Cartus,
Oliver T. Hofmann
Abstract:
In this publication we introduce MAM-STM, software to autonomously manipulate arbitrary moieties towards specific positions on a metal surface utilizing the tip of a scanning tunneling microscope (STM). Finding the optimal manipulation parameters for a specific moiety is challenging and time consuming, even for human experts. MAM-STM combines autonomous data acquisition with a sophisticated Q-learning implementation to determine the optimal bias voltage, the z-approach distance, and the tip position relative to the moiety. This then makes it possible to arrange single molecules and atoms at will. In this work, we provide a tutorial based on a simulated response to offer a comprehensive explanation of how to use and customize MAM-STM. Additionally, we assess the performance of the machine learning algorithm by benchmarking it within a simulated stochastic environment.
Submitted 15 April, 2024;
originally announced April 2024.
-
Language Imbalance Can Boost Cross-lingual Generalisation
Authors:
Anton Schäfer,
Shauli Ravfogel,
Thomas Hofmann,
Tiago Pimentel,
Imanol Schlag
Abstract:
Multilinguality is crucial for extending recent advancements in language modelling to diverse linguistic communities. To maintain high performance while representing multiple languages, multilingual models ideally align representations, allowing what is learned in one language to generalise to others. Prior research has emphasised the importance of parallel data and shared vocabulary elements as key factors for such alignment. In this study, we investigate an unintuitive novel driver of cross-lingual generalisation: language imbalance. In controlled experiments on perfectly equivalent cloned languages, we observe that the existence of a predominant language during training boosts the performance of less frequent languages and leads to stronger alignment of model representations across languages. Furthermore, we find that this trend is amplified with scale: with large enough models or long enough training, we observe that bilingual training data with a 90/10 language split yields better performance on both languages than a balanced 50/50 split. Building on these insights, we design training schemes that can improve performance in all cloned languages, even without altering the training data. As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet whether language imbalance causes cross-lingual generalisation there is not conclusive.
Submitted 13 May, 2024; v1 submitted 11 April, 2024;
originally announced April 2024.
-
Number Theory in OSCAR
Authors:
Claus Fieker,
Tommy Hofmann
Abstract:
We give a brief introduction to computational algebraic number theory in OSCAR. Our main focus is on number fields, rings of integers and their invariants. After recalling some classical results and their constructive counterparts, we showcase the functionality in two examples related to the investigation of the Cohen-Lenstra heuristic for quadratic fields and the Galois module structure of rings of integers.
Submitted 10 April, 2024;
originally announced April 2024.
-
On the Effect of (Near) Duplicate Subwords in Language Modelling
Authors:
Anton Schäfer,
Thomas Hofmann,
Imanol Schlag,
Tiago Pimentel
Abstract:
Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords which are assigned arbitrary indices before being served to the LM. While typically lossless, however, this process may lead to less sample efficient LM training: as it removes character-level information, it could make it harder for LMs to generalise across similar subwords, such as now and Now. We refer to such subwords as near duplicates. In this paper, we study the impact of near duplicate subwords on LM training efficiency. First, we design an experiment that gives us an upper bound to how much we should expect a model to improve if we could perfectly generalise across near duplicates. We do this by duplicating each subword in our LM's vocabulary, creating perfectly equivalent classes of subwords. Experimentally, we find that LMs need roughly 17% more data when trained in a fully duplicated setting. Second, we investigate the impact of naturally occurring near duplicates on LMs. Here, we see that merging them considerably hurts LM performance. Therefore, although subword duplication negatively impacts LM training efficiency, naturally occurring near duplicates may not be as similar as anticipated, limiting the potential for performance improvements.
Submitted 2 May, 2024; v1 submitted 9 April, 2024;
originally announced April 2024.
-
Learning Generalized Policies for Fully Observable Non-Deterministic Planning Domains
Authors:
Till Hofmann,
Hector Geffner
Abstract:
General policies represent reactive strategies for solving large families of planning problems like the infinite collection of solvable instances from a given domain. Methods for learning such policies from a collection of small training instances have been developed successfully for classical domains. In this work, we extend the formulations and the resulting combinatorial methods for learning general policies over fully observable, non-deterministic (FOND) domains. We also evaluate the resulting approach experimentally over a number of benchmark domains in FOND planning, present the general policies that result in some of these domains, and prove their correctness. The method for learning general policies for FOND planning can actually be seen as an alternative FOND planning method that searches for solutions, not in the given state space but in an abstract space defined by features that must be learned as well.
Submitted 13 May, 2024; v1 submitted 3 April, 2024;
originally announced April 2024.
-
Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy
Authors:
Sidak Pal Singh,
Bobby He,
Thomas Hofmann,
Bernhard Schölkopf
Abstract:
We propose a fresh take on understanding the mechanisms of neural networks by analyzing the rich directional structure of optimization trajectories, represented by their pointwise parameters. Towards this end, we introduce some natural notions of the complexity of optimization trajectories, both qualitative and quantitative, which hallmark the directional nature of optimization in neural networks: when is there redundancy, and when exploration. We use them to reveal the inherent nuance and interplay involved between various optimization choices, such as momentum and weight decay. Further, the trajectory perspective helps us see the effect of scale on regularizing the directional nature of trajectories, and as a by-product, we also observe an intriguing heterogeneity of Q,K,V dynamics in the middle attention layers in LLMs and which is homogenized by scale. Importantly, we put the significant directional redundancy observed to the test by demonstrating that training only scalar batchnorm parameters some while into training matches the performance of training the entire network, which thus exhibits the potential of hybrid optimization schemes that are geared towards efficiency.
Submitted 24 June, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning
Authors:
Lorenzo Noci,
Alexandru Meterez,
Thomas Hofmann,
Antonio Orvieto
Abstract:
Recently, there has been growing evidence that if the width and depth of a neural network are scaled toward the so-called rich feature learning limit ($μ$P and its depth extension), then some hyperparameters - such as the learning rate - exhibit transfer from small to very large models, thus reducing the cost of hyperparameter tuning. From an optimization perspective, this phenomenon is puzzling, as it implies that the loss landscape is remarkably consistent across very different model sizes. In this work, we find empirical evidence that learning rate transfer can be attributed to the fact that under $μ$P and its depth extension, the largest eigenvalue of the training loss Hessian (i.e. the sharpness) is largely independent of the width and depth of the network for a sustained period of training time. On the other hand, we show that under the neural tangent kernel (NTK) regime, the sharpness exhibits very different dynamics at different scales, thus preventing learning rate transfer. But what causes these differences in the sharpness dynamics? Through a connection between the spectra of the Hessian and the NTK matrix, we argue that the cause lies in the presence (for $μ$P) or progressive absence (for the NTK regime) of feature learning, which results in a different evolution of the NTK, and thus of the sharpness. We corroborate our claims with a substantial suite of experiments, covering a wide range of datasets and architectures: from ResNets and Vision Transformers trained on benchmark vision datasets to Transformers-based language models trained on WikiText
Submitted 27 February, 2024;
originally announced February 2024.
-
A Language Model's Guide Through Latent Space
Authors:
Dimitri von Rütte,
Sotiris Anagnostidis,
Gregor Bachmann,
Thomas Hofmann
Abstract:
Concept guidance has emerged as a cheap and simple way to control the behavior of language models by probing their hidden representations for concept vectors and using them to perturb activations at inference time. While the focus of previous work has largely been on truthfulness, in this paper we extend this framework to a richer set of concepts such as appropriateness, humor, creativity and quality, and explore to what degree current detection and guidance strategies work in these challenging settings. To facilitate evaluation, we develop a novel metric for concept guidance that takes into account both the success of concept elicitation as well as the potential degradation in fluency of the guided model. Our extensive experiments reveal that while some concepts such as truthfulness more easily allow for guidance with current techniques, novel concepts such as appropriateness or humor either remain difficult to elicit, need extensive tuning to work, or even experience confusion. Moreover, we find that probes with optimal detection accuracies do not necessarily make for the optimal guides, contradicting previous observations for truthfulness. Our work warrants a deeper investigation into the interplay between detectability, guidability, and the nature of the concept, and we hope that our rich experimental test-bed for guidance research inspires stronger follow-up approaches.
Submitted 22 February, 2024;
originally announced February 2024.
-
Towards Meta-Pruning via Optimal Transport
Authors:
Alexander Theus,
Olin Geimer,
Friedrich Wicke,
Thomas Hofmann,
Sotiris Anagnostidis,
Sidak Pal Singh
Abstract:
Structural pruning of neural networks conventionally relies on identifying and discarding less important neurons, a practice often resulting in significant accuracy loss that necessitates subsequent fine-tuning efforts. This paper introduces a novel approach named Intra-Fusion, challenging this prevailing pruning paradigm. Unlike existing methods that focus on designing meaningful neuron importance metrics, Intra-Fusion redefines the overlying pruning procedure. Through utilizing the concepts of model fusion and Optimal Transport, we leverage an agnostically given importance metric to arrive at a more effective sparse model representation. Notably, our approach achieves substantial accuracy recovery without the need for resource-intensive fine-tuning, making it an efficient and promising tool for neural network compression.
Additionally, we explore how fusion can be added to the pruning process to significantly decrease the training time while maintaining competitive performance. We benchmark our results for various networks on commonly used datasets such as CIFAR-10, CIFAR-100, and ImageNet. More broadly, we hope that the proposed Intra-Fusion approach invigorates exploration into a fresh alternative to the predominant compression approaches. Our code is available here: https://github.com/alexandertheus/Intra-Fusion.
Submitted 13 February, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
How Good is a Single Basin?
Authors:
Kai Lion,
Lorenzo Noci,
Thomas Hofmann,
Gregor Bachmann
Abstract:
The multi-modal nature of neural loss landscapes is often considered to be the main driver behind the empirical success of deep ensembles. In this work, we probe this belief by constructing various "connected" ensembles which are restricted to lie in the same basin. Through our experiments, we demonstrate that increased connectivity indeed negatively impacts performance. However, when incorporating the knowledge from other basins implicitly through distillation, we show that the gap in performance can be mitigated by re-discovering (multi-basin) deep ensembles within a single basin. Thus, we conjecture that while the extra-basin knowledge is at least partially present in any given basin, it cannot be easily harnessed without learning it from other basins.
Submitted 5 February, 2024;
originally announced February 2024.
-
Decidable Reasoning About Time in Finite-Domain Situation Calculus Theories
Authors:
Till Hofmann,
Stefan Schupp,
Gerhard Lakemeyer
Abstract:
Representing time is crucial for cyber-physical systems and has been studied extensively in the Situation Calculus. The most commonly used approach represents time by adding a real-valued fluent $\mathit{time}(a)$ that attaches a time point to each action and consequently to each situation. We show that in this approach, checking whether there is a reachable situation that satisfies a given formul…
▽ More
Representing time is crucial for cyber-physical systems and has been studied extensively in the Situation Calculus. The most commonly used approach represents time by adding a real-valued fluent $\mathit{time}(a)$ that attaches a time point to each action and consequently to each situation. We show that in this approach, checking whether there is a reachable situation that satisfies a given formula is undecidable, even if the domain of discourse is restricted to a finite set of objects. We present an alternative approach based on well-established results from timed automata theory by introducing clocks as real-valued fluents with restricted successor state axioms and comparison operators. %that only allow comparisons against fixed rationals. With this restriction, we can show that the reachability problem for finite-domain basic action theories is decidable. Finally, we apply our results on Golog program realization by presenting a decidable procedure for determining an action sequence that is a successful execution of a given program.
Submitted 5 February, 2024;
originally announced February 2024.
-
Probabilistic Abduction for Visual Abstract Reasoning via Learning Rules in Vector-symbolic Architectures
Authors:
Michael Hersche,
Francesco di Stefano,
Thomas Hofmann,
Abu Sebastian,
Abbas Rahimi
Abstract:
Abstract reasoning is a cornerstone of human intelligence, and replicating it with artificial intelligence (AI) presents an ongoing challenge. This study focuses on efficiently solving Raven's progressive matrices (RPM), a visual test for assessing abstract reasoning abilities, by using distributed computation and operators provided by vector-symbolic architectures (VSA). Instead of hard-coding the rule formulations associated with RPMs, our approach can learn the VSA rule formulations (hence the name Learn-VRF) with just one pass through the training data. Yet, our approach, with compact parameters, remains transparent and interpretable. Learn-VRF yields accurate predictions on I-RAVEN's in-distribution data, and exhibits strong out-of-distribution capabilities concerning unseen attribute-rule pairs, significantly outperforming pure connectionist baselines including large language models. Our code is available at https://github.com/IBM/learn-vector-symbolic-architectures-rule-formulations.
Submitted 29 January, 2024;
originally announced January 2024.
-
Wafer-scale fabrication of mesoporous silicon functionalized with electrically conductive polymers
Authors:
Manfred May,
Mathis Boderius,
Natalia Gostkowska-Lekner,
Mark Busch,
Klaus Habicht,
Tommy Hofmann,
Patrick Huber
Abstract:
The fabrication of hybrid materials consisting of nanoporous hosts with conductive polymers is a challenging task, since the extreme spatial confinement often conflicts with the stringent physico-chemical requirements for polymerization of organic constituents. Here, several low-threshold and scalable synthesis routes for such hybrids are presented. First, the electrochemical synthesis of composites based on mesoporous silicon (pore size of 7 nm) and the polymers PANI, PPy and PEDOT is discussed and validated by scanning electron microscopy (SEM) and energy-dispersive X-ray spectroscopy (EDX). Polymer filling degrees of 74% are achieved. Second, the production of PEDOT/pSi hybrids, based on the solid-state polymerization (SSP) of DBEDOT to PEDOT is shown. The resulting amorphous structure of the nanopore-embedded PEDOT is investigated via in-situ synchrotron-based X-ray scattering. In addition, a twofold increase in the electrical conductivity of the hybrid compared to the porous silicon host is shown, making this system particularly promising for thermoelectric applications.
Submitted 17 January, 2024;
originally announced January 2024.
-
Towards Bridging the Gap between High-Level Reasoning and Execution on Robots
Authors:
Till Hofmann
Abstract:
When reasoning about actions, e.g., by means of task planning or agent programming with Golog, the robot's actions are typically modeled on an abstract level, where complex actions such as picking up an object are treated as atomic primitives with deterministic effects and preconditions that only depend on the current state. However, when executing such an action on a robot it can no longer be seen as a primitive. Instead, action execution is a complex task involving multiple steps with additional temporal preconditions and timing constraints. Furthermore, the action may be noisy, e.g., producing erroneous sensing results and not always having the desired effects. While these aspects are typically ignored in reasoning tasks, they need to be dealt with during execution. In this thesis, we propose several approaches towards closing this gap.
Submitted 30 December, 2023;
originally announced January 2024.
-
Disentangling Linear Mode-Connectivity
Authors:
Gul Sena Altintas,
Gregor Bachmann,
Lorenzo Noci,
Thomas Hofmann
Abstract:
Linear mode-connectivity (LMC) (or lack thereof) is one of the intriguing characteristics of neural network loss landscapes. While empirically well established, it unfortunately still lacks a proper theoretical understanding. Even worse, although empirical data points abound, a systematic study of when networks exhibit LMC is largely missing in the literature. In this work we aim to close this gap. We explore how LMC is affected by three factors: (1) architecture (sparsity, weight-sharing), (2) training strategy (optimization setup) as well as (3) the underlying dataset. We place particular emphasis on minimal but non-trivial settings, removing as much unnecessary complexity as possible. We believe that our insights can guide future theoretical works on uncovering the inner workings of LMC.
Submitted 15 December, 2023;
originally announced December 2023.
-
LIME: Localized Image Editing via Attention Regularization in Diffusion Models
Authors:
Enis Simsar,
Alessio Tonioni,
Yongqin Xian,
Thomas Hofmann,
Federico Tombari
Abstract:
Diffusion models (DMs) have gained prominence due to their ability to generate high-quality, varied images, with recent advancements in text-to-image generation. The research focus is now shifting towards the controllability of DMs. A significant challenge within this domain is localized editing, where specific areas of an image are modified without affecting the rest of the content. This paper introduces LIME for localized image editing in diffusion models, a method that does not require user-specified regions of interest (RoI) or additional text input. Our method employs features from pre-trained methods and a simple clustering technique to obtain precise semantic segmentation maps. Then, by leveraging cross-attention maps, it refines these segments for localized edits. Finally, we propose a novel cross-attention regularization technique that penalizes unrelated cross-attention scores in the RoI during the denoising steps, ensuring localized edits. Our approach, without re-training and fine-tuning, consistently improves the performance of existing methods in various editing benchmarks.
Submitted 14 December, 2023;
originally announced December 2023.
-
Recurrent Distance Filtering for Graph Representation Learning
Authors:
Yuhui Ding,
Antonio Orvieto,
Bobby He,
Thomas Hofmann
Abstract:
Graph neural networks based on iterative one-hop message passing have been shown to struggle in harnessing the information from distant nodes effectively. Conversely, graph transformers allow each node to attend to all other nodes directly, but lack graph inductive bias and have to rely on ad-hoc positional encoding. In this paper, we propose a new architecture to reconcile these challenges. Our approach stems from the recent breakthroughs in long-range modeling provided by deep state-space models: for a given target node, our model aggregates other nodes by their shortest distances to the target and uses a linear RNN to encode the sequence of hop representations. The linear RNN is parameterized in a particular diagonal form for stable long-range signal propagation and is theoretically expressive enough to encode the neighborhood hierarchy. With no need for positional encoding, we empirically show that the performance of our model is comparable to or better than that of state-of-the-art graph transformers on various benchmarks, with a significantly reduced computational cost. Our code is open-source at https://github.com/skeletondyh/GRED.
Submitted 5 June, 2024; v1 submitted 3 December, 2023;
originally announced December 2023.
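The aggregation step described in the abstract above (group nodes by shortest distance to a target, then encode the hop sequence with a diagonal linear RNN) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `hop_sets` and `encode_target`, the sum aggregation, the real-valued `decay` vector, and the far-to-near processing order are all assumptions made for illustration, and the learned layers of the actual model are omitted.

```python
from collections import deque


def hop_sets(adj, target):
    """Group reachable nodes by shortest-path distance from `target` (BFS)."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    hops = [[] for _ in range(max(dist.values()) + 1)]
    for node, d in dist.items():
        hops[d].append(node)
    return hops


def encode_target(adj, feats, target, decay):
    """Run a diagonal linear recurrence over the per-hop aggregated features.

    `decay` holds one coefficient per feature dimension (the diagonal of the
    recurrence matrix). Hops are processed from farthest to nearest, which is
    an assumption of this sketch.
    """
    dim = len(decay)
    state = [0.0] * dim
    for nodes in reversed(hop_sets(adj, target)):
        # Sum the features of all nodes at this distance (assumed aggregator).
        agg = [sum(feats[n][i] for n in nodes) for i in range(dim)]
        # Diagonal linear update: each dimension evolves independently.
        state = [decay[i] * state[i] + agg[i] for i in range(dim)]
    return state


# Hypothetical path graph 0 - 1 - 2 with one-dimensional node features.
adj = {0: [1], 1: [0, 2], 2: [1]}
feats = {0: [1.0], 1: [2.0], 2: [4.0]}
encoding = encode_target(adj, feats, target=0, decay=[0.5])
```

Because the recurrence is diagonal, each feature dimension decays independently, so nodes farther from the target contribute with geometrically smaller weight when the coefficients are below one in magnitude.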
-
Harnessing Synthetic Datasets: The Role of Shape Bias in Deep Neural Network Generalization
Authors:
Elior Benarous,
Sotiris Anagnostidis,
Luca Biggio,
Thomas Hofmann
Abstract:
Recent advancements in deep learning have been primarily driven by the use of large models trained on increasingly vast datasets. While neural scaling laws have emerged to predict network performance given a specific level of computational resources, the growing demand for expansive datasets raises concerns. To address this, a new research direction has emerged, focusing on the creation of synthetic data as a substitute. In this study, we investigate how neural networks exhibit shape bias during training on synthetic datasets, serving as an indicator of the synthetic data quality. Specifically, our findings indicate three key points: (1) Shape bias varies across network architectures and types of supervision, casting doubt on its reliability as a predictor for generalization and its ability to explain differences in model recognition compared to human capabilities. (2) Relying solely on shape bias to estimate generalization is unreliable, as it is entangled with diversity and naturalism. (3) We propose a novel interpretation of shape bias as a tool for estimating the diversity of samples within a dataset. Our research aims to clarify the implications of using synthetic data and its associated shape bias in deep learning, addressing concerns regarding generalization and dataset quality.
Submitted 10 November, 2023;
originally announced November 2023.
-
Navigating Scaling Laws: Compute Optimality in Adaptive Model Training
Authors:
Sotiris Anagnostidis,
Gregor Bachmann,
Imanol Schlag,
Thomas Hofmann
Abstract:
In recent years, the state-of-the-art in deep learning has been dominated by very large models that have been pre-trained on vast amounts of data. The paradigm is very simple: investing more computational resources (optimally) leads to better performance, and even predictably so; neural scaling laws have been derived that accurately forecast the performance of a network for a desired level of compute. This leads to the notion of a `compute-optimal' model, i.e. a model that allocates a given level of compute during training optimally to maximize performance. In this work, we extend the concept of optimality by allowing for an `adaptive' model, i.e. a model that can change its shape during training. By doing so, we can design adaptive models that optimally traverse between the underlying scaling laws and outpace their `static' counterparts, leading to a significant reduction in the required compute to reach a given target performance. We show that our approach generalizes across modalities and different shape parameters.
Submitted 23 May, 2024; v1 submitted 6 November, 2023;
originally announced November 2023.
-
Simplifying Transformer Blocks
Authors:
Bobby He,
Thomas Hofmann
Abstract:
A simple design recipe for deep Transformers is to compose identical building blocks. But standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks with skip connections & normalisation layers in precise arrangements. This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed, or render models untrainable.
In this work, we ask to what extent the standard transformer block can be simplified. Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks and normalisation layers. In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update training speed and performance of standard transformers, while enjoying 15% faster training throughput, and using 15% fewer parameters.
Submitted 31 May, 2024; v1 submitted 3 November, 2023;
originally announced November 2023.
-
Transformer Fusion with Optimal Transport
Authors:
Moritz Imfeld,
Jacopo Graldi,
Marco Giordano,
Thomas Hofmann,
Sotiris Anagnostidis,
Sidak Pal Singh
Abstract:
Fusion is a technique for merging multiple independently-trained neural networks in order to combine their capabilities. Past attempts have been restricted to the case of fully-connected, convolutional, and residual networks. This paper presents a systematic approach for fusing two or more transformer-based networks exploiting Optimal Transport to (soft-)align the various architectural components. We flesh out an abstraction for layer alignment, that can generalize to arbitrary architectures - in principle - and we apply this to the key ingredients of Transformers such as multi-head self-attention, layer-normalization, and residual connections, and we discuss how to handle them via various ablation studies. Furthermore, our method allows the fusion of models of different sizes (heterogeneous fusion), providing a new and efficient way to compress Transformers. The proposed approach is evaluated on both image classification tasks via Vision Transformer and natural language modeling tasks using BERT. Our approach consistently outperforms vanilla fusion, and, after a surprisingly short finetuning, also outperforms the individual converged parent models. In our analysis, we uncover intriguing insights about the significant role of soft alignment in the case of Transformers. Our results showcase the potential of fusing multiple Transformers, thus compounding their expertise, in the budding paradigm of model fusion and recombination. Code is available at https://github.com/graldij/transformer-fusion.
Submitted 22 April, 2024; v1 submitted 9 October, 2023;
originally announced October 2023.
-
Kinetic trapping of charge-transfer molecules at metal interfaces
Authors:
Anna Werkovits,
Simon B Hollweger,
Max Niederreiter,
Thomas Risse,
Johannes J. Cartus,
Martin Sterrer,
Sebastian Matera,
Oliver T. Hofmann
Abstract:
Despite the common expectation that conjugated organic molecules on metals tend to adsorb in a flat-lying wetting layer, several recent studies have found strong indications for coverage-dependent transitions to upright-standing phases, which exhibit notably different physical properties. In this work, we argue that from an energetic perspective, thermodynamically stable upright-standing phases may be more common than hitherto thought. However, for kinetic reasons this phase may often not be observed experimentally. Indeed, using first-principles kinetic Monte Carlo simulations, we find that the structure with lower molecular density is (almost) always formed first, reminiscent of Ostwald's rule of stages. The phase transitions to the thermodynamically stable upright-standing phase are likely to be kinetically hindered under conditions typically used in surface science (gas phase adsorption at low flux). This provides a possible explanation why they are commonly not observed. Investigating both the role of the growth conditions and the energetics of the interface, we find that the time for the phase transition is determined mostly by the deposition rate and, thus, mostly independent of the nature of the molecule.
Submitted 3 October, 2023; v1 submitted 2 October, 2023;
originally announced October 2023.
-
Towards guarantees for parameter isolation in continual learning
Authors:
Giulia Lanzillotta,
Sidak Pal Singh,
Benjamin F. Grewe,
Thomas Hofmann
Abstract:
Deep learning has proved to be a successful paradigm for solving many challenges in machine learning. However, deep neural networks fail when trained sequentially on multiple tasks, a shortcoming known as catastrophic forgetting in the continual learning literature. Despite a recent flourish of learning algorithms successfully addressing this problem, we find that provable guarantees against catastrophic forgetting are lacking. In this work, we study the relationship between learning and forgetting by looking at the geometry of neural networks' loss landscape. We offer a unifying perspective on a family of continual learning algorithms, namely methods based on parameter isolation, and we establish guarantees on catastrophic forgetting for some of them.
Submitted 2 October, 2023;
originally announced October 2023.
-
The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute
Authors:
Aleksandar Stanić,
Dylan Ashley,
Oleg Serikov,
Louis Kirsch,
Francesco Faccio,
Jürgen Schmidhuber,
Thomas Hofmann,
Imanol Schlag
Abstract:
The Languini Kitchen serves as both a research collective and codebase designed to empower researchers with limited computational resources to contribute meaningfully to the field of language modelling. We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. The number of tokens on which a model is trained is defined by the model's throughput and the chosen compute class. Notably, this approach avoids constraints on critical hyperparameters which affect total parameters or floating-point operations. For evaluation, we pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length. On it, we compare methods based on their empirical scaling trends which are estimated through experiments at various levels of compute. This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput. While the GPT baseline achieves better perplexity throughout all our levels of compute, our LSTM baseline exhibits a predictable and more favourable scaling law. This is due to the improved throughput and the need for fewer training tokens to achieve the same decrease in test perplexity. Extrapolating the scaling laws of both models results in an intersection at roughly 50,000 accelerator hours. We hope this work can serve as the foundation for meaningful and reproducible language modelling research.
Submitted 20 September, 2023;
originally announced September 2023.
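The abstract above states that the number of training tokens follows from a model's throughput and the chosen compute class. A minimal sketch of that arithmetic is given below; the helper name `token_budget` and all throughput figures are hypothetical and are not taken from the Languini codebase or its reported measurements.

```python
def token_budget(throughput_tokens_per_sec: float, accelerator_hours: float) -> int:
    """Tokens a model can train on within a fixed budget of accelerator hours.

    Hypothetical helper: throughput is assumed to be measured in tokens per
    second on the reference accelerator.
    """
    seconds = accelerator_hours * 3600
    return int(throughput_tokens_per_sec * seconds)


# Illustrative numbers only: a baseline model and one with ten-fold throughput,
# both compared in the same 6-hour compute class.
baseline_tokens = token_budget(10_000, 6)
fast_tokens = token_budget(100_000, 6)
```

Under this accounting, a model with ten times the throughput sees ten times as many tokens in the same compute class, which is why throughput enters the comparison directly instead of parameter or FLOP counts.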
-
Adsorption configurations of Co-phthalocyanine on In2O3(111)
Authors:
Margareta Wagner,
Fabio Calcinelli,
Andreas Jeindl,
Michael Schmid,
Oliver T. Hofmann,
Ulrike Diebold
Abstract:
Indium oxide offers optical transparency paired with electric conductivity, a combination required in many optoelectronic applications. The most-stable In2O3(111) surface has a large unit cell (1.43 nm lattice constant). It contains a mixture of both bulk-like and undercoordinated O and In atoms and provides an ideal playground to explore the interaction of surfaces with organic molecules of a size similar to that of the unit cell. Non-contact atomic force microscopy (nc-AFM), scanning tunneling microscopy (STM), and density functional theory (DFT) were used to study the adsorption of Co-phthalocyanine (CoPc) on In2O3(111). Isolated CoPc molecules adsorb at two adsorption sites in a 7:3 ratio. The Co atom sits either on top of a surface oxygen ('F configuration') or indium atom ('S configuration'). This subtle change in adsorption site induces bending of the molecules, which is reflected in their electronic structure. According to DFT, the lowest unoccupied molecular orbital of the undistorted gas-phase CoPc remains mostly unaffected in the F configuration but is filled by one electron in the S configuration. At coverages up to one CoPc molecule per substrate unit cell, a mixture of domains with molecules in the F and S configurations is found. Molecules at F sites first condense into an F-(2x2) structure and finally rearrange into an F-(1x1) symmetry with partially overlapping molecules, while S-sited molecules only assume an S-(1x1) superstructure.
Submitted 22 August, 2023;
originally announced August 2023.
-
The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
Authors:
Lorenzo Noci,
Chuning Li,
Mufan Bill Li,
Bobby He,
Thomas Hofmann,
Chris Maddison,
Daniel M. Roy
Abstract:
In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
Submitted 9 December, 2023; v1 submitted 30 June, 2023;
originally announced June 2023.
-
Realizing efficient topological temporal pumping in electrical circuits
Authors:
Alexander Stegmaier,
Hauke Brand,
Stefan Imhof,
Alexander Fritzsche,
Tobias Helbig,
Tobias Hofmann,
Igor Boettcher,
Martin Greiter,
Ching Hua Lee,
Gaurav Bahl,
Alexander Szameit,
Tobias Kießling,
Ronny Thomale,
Lavi K. Upreti
Abstract:
Quantized adiabatic transport can occur when a system is slowly modulated over time. In most realizations, however, the efficiency of such transport is reduced by unwanted dissipation, back-scattering, and non-adiabatic effects. In this work, we realize a topological adiabatic pump in an electrical circuit network that supports remarkably stable and long-lasting pumping of a voltage signal. We further characterize the topology of our system by deducing the Chern number from the measured edge band structure. To achieve this, the experimental setup makes use of active circuit elements that act as time-variable voltage-controlled inductors.
Submitted 27 June, 2023;
originally announced June 2023.
-
Scaling MLPs: A Tale of Inductive Bias
Authors:
Gregor Bachmann,
Sotiris Anagnostidis,
Thomas Hofmann
Abstract:
In this work we revisit the most fundamental building block in deep learning, the multi-layer perceptron (MLP), and study the limits of its performance on vision tasks. Empirical insights into MLPs are important for multiple reasons. (1) Given the recent narrative "less inductive bias is better", popularized due to transformers eclipsing convolutional models, it is natural to explore the limits of this hypothesis. To that end, MLPs offer an ideal test bed, as they lack any vision-specific inductive bias. (2) MLPs have almost exclusively been the main protagonist in the deep learning theory literature due to their mathematical simplicity, serving as a proxy to explain empirical phenomena observed for more complex architectures. Surprisingly, experimental datapoints for MLPs are very difficult to find in the literature, especially when coupled with large pre-training protocols. This discrepancy between practice and theory is worrying: Do MLPs reflect the empirical advances exhibited by practical models? Or do theorists need to rethink the role of MLPs as a proxy? We provide insights into both these aspects. We show that the performance of MLPs drastically improves with scale (95% on CIFAR10, 82% on CIFAR100, 58% on ImageNet ReaL), highlighting that lack of inductive bias can indeed be compensated. We observe that MLPs mimic the behaviour of their modern counterparts faithfully, with some components in the learning setting however exhibiting stronger or unexpected behaviours. Due to their inherent computational efficiency, large pre-training experiments become more accessible for academic researchers. All of our experiments were run on a single GPU.
Submitted 3 October, 2023; v1 submitted 23 June, 2023;
originally announced June 2023.
-
Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes
Authors:
Alexandros Delitzas,
Maria Parelli,
Nikolas Hars,
Georgios Vlassis,
Sotirios Anagnostidis,
Gregor Bachmann,
Thomas Hofmann
Abstract:
Training models to apply common-sense linguistic knowledge and visual concepts from 2D images to 3D scene understanding is a promising direction that researchers have only recently started to explore. However, it still remains understudied whether 2D distilled knowledge can provide useful representations for downstream 3D vision-language tasks such as 3D question answering. In this paper, we propose a novel 3D pre-training Vision-Language method, namely Multi-CLIP, that enables a model to learn language-grounded and transferable 3D scene point cloud representations. We leverage the representational power of the CLIP model by maximizing the agreement between the encoded 3D scene features and the corresponding 2D multi-view image and text embeddings in the CLIP space via a contrastive objective. To validate our approach, we consider the challenging downstream tasks of 3D Visual Question Answering (3D-VQA) and 3D Situated Question Answering (3D-SQA). To this end, we develop novel multi-modal transformer-based architectures and we demonstrate how our pre-training method can benefit their performance. Quantitative and qualitative experimental results show that Multi-CLIP outperforms state-of-the-art works across the downstream tasks of 3D-VQA and 3D-SQA and leads to a well-structured 3D scene feature space.
Submitted 4 June, 2023;
originally announced June 2023.
-
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Authors:
Sotiris Anagnostidis,
Dario Pavllo,
Luca Biggio,
Lorenzo Noci,
Aurelien Lucchi,
Thomas Hofmann
Abstract:
Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most LLMs still adopt attention layers between all pairs of tokens in the sequence, thus incurring a quadratic cost. In this study, we present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness, resulting in reduced memory and computational requirements during inference. Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context at any point across the generation process. By doing so, our approach not only addresses performance concerns but also enhances interpretability, providing valuable insight into the model's decision-making process. Our technique can be applied to existing pre-trained models through a straightforward fine-tuning process, and the pruning strength can be specified by a sparsity parameter. Notably, our empirical findings demonstrate that we can effectively prune up to 80\% of the context without significant performance degradation on downstream tasks, offering a valuable tool for mitigating inference costs. Our reference implementation achieves up to $2\times$ increase in inference throughput and even greater memory savings.
Submitted 31 May, 2024; v1 submitted 25 May, 2023;
originally announced May 2023.
-
The Hessian perspective into the Nature of Convolutional Neural Networks
Authors:
Sidak Pal Singh,
Thomas Hofmann,
Bernhard Schölkopf
Abstract:
While Convolutional Neural Networks (CNNs) have long been investigated and applied, as well as theorized, we aim to provide a slightly different perspective into their nature -- through the perspective of their Hessian maps. The reason is that the loss Hessian captures the pairwise interaction of parameters and therefore forms a natural ground to probe how the architectural aspects of CNN get manifested in its structure and properties. We develop a framework relying on Toeplitz representation of CNNs, and then utilize it to reveal the Hessian structure and, in particular, its rank. We prove tight upper bounds (with linear activations), which closely follow the empirical trend of the Hessian rank and hold in practice in more general settings. Overall, our work generalizes and establishes the key insight that, even in CNNs, the Hessian rank grows as the square root of the number of parameters.
Submitted 15 May, 2023;
originally announced May 2023.
-
The impact of static distortion waves on superlubricity
Authors:
Lukas Hörmann,
Johannes J. Cartus,
Oliver T. Hofmann
Abstract:
Friction is a major source of energy loss in mechanical devices. This energy loss may be minimized by creating interfaces with extremely reduced friction, i.e. superlubricity. Conventional wisdom holds that incommensurate interface structures facilitate superlubricity. Accurately describing friction necessitates precise modeling of the interface structure. This, in turn, requires the use of accurate first-principles electronic structure methods, especially when studying organic/metal interfaces, which are highly relevant due to their tunability and propensity to form incommensurate structures. However, the system size required to calculate incommensurate structures renders such calculations intractable. As a result, studies of incommensurate interfaces have been limited to very simple model systems or strongly simplified methodology. We overcome this limitation by developing a machine-learned interatomic potential that is able to determine energies and forces for structures containing thousands to tens of thousands of atoms with an accuracy comparable to conventional first principles methods but at a fraction of the cost. Using this approach, we quantify the breakdown of superlubricity in incommensurate structures due to the formation of static distortion waves. Moreover, we extract design principles to engineer incommensurate interface systems where the formation of static distortion waves is suppressed, which facilitates low friction coefficients.
Submitted 22 December, 2023; v1 submitted 24 April, 2023;
originally announced April 2023.
-
CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes
Authors:
Maria Parelli,
Alexandros Delitzas,
Nikolas Hars,
Georgios Vlassis,
Sotirios Anagnostidis,
Gregor Bachmann,
Thomas Hofmann
Abstract:
Training models to apply linguistic knowledge and visual concepts from 2D images to 3D world understanding is a promising direction that researchers have only recently started to explore. In this work, we design a novel 3D pre-training Vision-Language method that helps a model learn semantically meaningful and transferable 3D scene point cloud representations. We inject the representational power of the popular CLIP model into our 3D encoder by aligning the encoded 3D scene features with the corresponding 2D image and text embeddings produced by CLIP. To assess our model's 3D world reasoning capability, we evaluate it on the downstream task of 3D Visual Question Answering. Experimental quantitative and qualitative results show that our pre-training method outperforms state-of-the-art works in this task and leads to an interpretable representation of 3D scene features.
Submitted 12 April, 2023;
originally announced April 2023.
-
Achieving a Better Stability-Plasticity Trade-off via Auxiliary Networks in Continual Learning
Authors:
Sanghwan Kim,
Lorenzo Noci,
Antonio Orvieto,
Thomas Hofmann
Abstract:
In contrast to the natural capabilities of humans to learn new tasks in a sequential fashion, neural networks are known to suffer from catastrophic forgetting, where the model's performances on old tasks drop dramatically after being optimized for a new task. Since then, the continual learning (CL) community has proposed several solutions aiming to equip the neural network with the ability to learn the current task (plasticity) while still achieving high accuracy on the previous tasks (stability). Despite remarkable improvements, the plasticity-stability trade-off is still far from being solved and its underlying mechanism is poorly understood. In this work, we propose Auxiliary Network Continual Learning (ANCL), a novel method that applies an additional auxiliary network which promotes plasticity to the continually learned model which mainly focuses on stability. More concretely, the proposed framework materializes in a regularizer that naturally interpolates between plasticity and stability, surpassing strong baselines on task incremental and class incremental scenarios. Through extensive analyses on ANCL solutions, we identify some essential principles beneath the stability-plasticity trade-off.
Submitted 31 March, 2023; v1 submitted 16 March, 2023;
originally announced March 2023.
-
Random Teachers are Good Teachers
Authors:
Felix Sarnthein,
Gregor Bachmann,
Sotiris Anagnostidis,
Thomas Hofmann
Abstract:
In this work, we investigate the implicit regularization induced by teacher-student learning dynamics in self-distillation. To isolate its effect, we describe a simple experiment where we consider teachers at random initialization instead of trained teachers. Surprisingly, when distilling a student into such a random teacher, we observe that the resulting model and its representations already possess very interesting characteristics; (1) we observe a strong improvement of the distilled student over its teacher in terms of probing accuracy. (2) The learned representations are data-dependent and transferable between different tasks but deteriorate strongly if trained on random inputs. (3) The student checkpoint contains sparse subnetworks, so-called lottery tickets, and lies on the border of linear basins in the supervised loss landscape. These observations have interesting consequences for several important areas in machine learning: (1) Self-distillation can work solely based on the implicit regularization present in the gradient dynamics without relying on any dark knowledge, (2) self-supervised learning can learn features even in the absence of data augmentation and (3) training dynamics during the early phase of supervised training do not necessarily require label information. Finally, we shed light on an intriguing local property of the loss landscape: the process of feature learning is strongly amplified if the student is initialized closely to the teacher. These results raise interesting questions about the nature of the landscape that have remained unexplored so far. Code is available at https://github.com/safelix/dinopl.
Submitted 19 June, 2023; v1 submitted 23 February, 2023;
originally announced February 2023.
-
Properties of uniformly $3$-connected graphs
Authors:
Frank Göring,
Tobias Hofmann
Abstract:
A graph on at least ${k+1}$ vertices is uniformly $k$-connected if each pair of its vertices is connected by $k$ and not more than $k$ independent paths. We reinvestigate a recent constructive characterization of uniformly $3$-connected graphs and obtain a more detailed result that relates the number of vertices to the operations involved in constructing a respective uniformly $3$-connected graph. Furthermore, we investigate how crossing numbers and treewidths behave under the mentioned constructions. We demonstrate how these results can be utilized to study the structure and properties of uniformly $3$-connected graphs with minimum number of vertices of minimum degree.
Submitted 5 June, 2024; v1 submitted 30 November, 2022;
originally announced November 2022.
-
Cosmology from Galaxy Redshift Surveys with PointNet
Authors:
Sotiris Anagnostidis,
Arne Thomsen,
Tomasz Kacprzak,
Tilman Tröster,
Luca Biggio,
Alexandre Refregier,
Thomas Hofmann
Abstract:
In recent years, deep learning approaches have achieved state-of-the-art results in the analysis of point cloud data. In cosmology, galaxy redshift surveys resemble such a permutation invariant collection of positions in space. These surveys have so far mostly been analysed with two-point statistics, such as power spectra and correlation functions. The usage of these summary statistics is best justified on large scales, where the density field is linear and Gaussian. However, in light of the increased precision expected from upcoming surveys, the analysis of -- intrinsically non-Gaussian -- small angular separations represents an appealing avenue to better constrain cosmological parameters. In this work, we aim to improve upon two-point statistics by employing a \textit{PointNet}-like neural network to regress the values of the cosmological parameters directly from point cloud data. Our implementation of PointNets can analyse inputs of $\mathcal{O}(10^4) - \mathcal{O}(10^5)$ galaxies at a time, which improves upon earlier work for this application by roughly two orders of magnitude. Additionally, we demonstrate the ability to analyse galaxy redshift survey data on the lightcone, as opposed to previously static simulation boxes at a given fixed redshift.
Submitted 22 November, 2022;
originally announced November 2022.
-
The Curious Case of Benign Memorization
Authors:
Sotiris Anagnostidis,
Gregor Bachmann,
Lorenzo Noci,
Thomas Hofmann
Abstract:
Despite the empirical advances of deep learning across a variety of learning tasks, our theoretical understanding of its success is still very restricted. One of the key challenges is the overparametrized nature of modern models, enabling complete overfitting of the data even if the labels are randomized, i.e. networks can completely \textit{memorize} all given patterns. While such a memorization capacity seems worrisome, in this work we show that under training protocols that include \textit{data augmentation}, neural networks learn to memorize entirely random labels in a benign way, i.e. they learn embeddings that lead to highly non-trivial performance under nearest neighbour probing. We demonstrate that deep models have the surprising ability to separate noise from signal by distributing the task of memorization and feature learning to different layers. As a result, only the very last layers are used for memorization, while preceding layers encode performant features which remain largely unaffected by the label noise. We explore the intricate role of the augmentations used for training and identify a memorization-generalization trade-off in terms of their diversity, marking a clear distinction to all previous works. Finally, we give a first explanation for the emergence of benign memorization by showing that \textit{malign} memorization under data augmentation is infeasible due to the insufficient capacity of the model for the increased sample size. As a consequence, the network is forced to leverage the correlated nature of the augmentations and as a result learns meaningful features. To complete the picture, a better theory of feature learning in deep neural networks is required to fully understand the origins of this phenomenon.
Submitted 23 February, 2023; v1 submitted 25 October, 2022;
originally announced October 2022.
-
Decoding a Neural Retriever's Latent Space for Query Suggestion
Authors:
Leonard Adolphs,
Michelle Chen Huebscher,
Christian Buck,
Sertan Girgin,
Olivier Bachem,
Massimiliano Ciaramita,
Thomas Hofmann
Abstract:
Neural retrieval models have superseded classic bag-of-words methods such as BM25 as the retrieval framework of choice. However, neural systems lack the interpretability of bag-of-words models; it is not trivial to connect a query change to a change in the latent space that ultimately determines the retrieval results. To shed light on this embedding space, we learn a "query decoder" that, given a latent representation of a neural search engine, generates the corresponding query. We show that it is possible to decode a meaningful query from its latent representation and, when moving in the right direction in latent space, to decode a query that retrieves the relevant paragraph. In particular, the query decoder can be useful to understand "what should have been asked" to retrieve a particular paragraph from the collection. We employ the query decoder to generate a large synthetic dataset of query reformulations for MSMarco, leading to improved retrieval performance. On this data, we train a pseudo-relevance feedback (PRF) T5 model for the application of query suggestion that outperforms both query reformulation and PRF information retrieval baselines.
Submitted 21 October, 2022;
originally announced October 2022.
-
Mastering Spatial Graph Prediction of Road Networks
Authors:
Sotiris Anagnostidis,
Aurelien Lucchi,
Thomas Hofmann
Abstract:
Accurately predicting road networks from satellite images requires a global understanding of the network topology. We propose to capture such high-level information by introducing a graph-based framework that simulates the addition of sequences of graph edges using a reinforcement learning (RL) approach. In particular, given a partially generated graph associated with a satellite image, an RL agent nominates modifications that maximize a cumulative reward. As opposed to standard supervised techniques that tend to be more restricted to commonly used surrogate losses, these rewards can be based on various complex, potentially non-continuous, metrics of interest. This yields more power and flexibility to encode problem-dependent knowledge. Empirical results on several benchmark datasets demonstrate enhanced performance and increased high-level reasoning about the graph topology when using a tree-based search. We further highlight the superiority of our approach under substantial occlusions by introducing a new synthetic benchmark dataset for this task.
Submitted 3 October, 2022;
originally announced October 2022.
-
Using Abstraction for Interpretable Robot Programs in Stochastic Domains
Authors:
Till Hofmann,
Vaishak Belle
Abstract:
A robot's actions are inherently stochastic, as its sensors are noisy and its actions do not always have the intended effects. For this reason, the agent language Golog has been extended to models with degrees of belief and stochastic actions. While this allows more precise robot models, the resulting programs are much harder to comprehend, because they need to deal with the noise, e.g., by looping until some desired state has been reached with certainty, and because the resulting action traces consist of a large number of actions cluttered with sensor noise. To alleviate these issues, we propose to use abstraction. We define a high-level and nonstochastic model of the robot and then map the high-level model into the lower-level stochastic model. The resulting programs are much easier to understand, often do not require belief operators or loops, and produce much shorter action traces.
Submitted 26 July, 2022;
originally announced July 2022.
-
OpenFilter: A Framework to Democratize Research Access to Social Media AR Filters
Authors:
Piera Riccio,
Bill Psomas,
Francesco Galati,
Francisco Escolano,
Thomas Hofmann,
Nuria Oliver
Abstract:
Augmented Reality or AR filters on selfies have become very popular on social media platforms for a variety of applications, including marketing, entertainment and aesthetics. Given the wide adoption of AR face filters and the importance of faces in our social structures and relations, there is increased interest by the scientific community to analyze the impact of such filters from a psychological, artistic and sociological perspective. However, there are few quantitative analyses in this area mainly due to a lack of publicly available datasets of facial images with applied AR filters. The proprietary, closed nature of most social media platforms does not allow users, scientists and practitioners to access the code and the details of the available AR face filters. Scraping faces from these platforms to collect data is ethically unacceptable and should, therefore, be avoided in research. In this paper, we present OpenFilter, a flexible framework to apply AR filters available in social media platforms on existing large collections of human faces. Moreover, we share FairBeauty and B-LFW, two beautified versions of the publicly available FairFace and LFW datasets and we outline insights derived from the analysis of these beautified datasets.
Submitted 27 September, 2022; v1 submitted 19 July, 2022;
originally announced July 2022.