Recurrent Complex-Weighted Autoencoders for Unsupervised Object Discovery

Authors: Anand Gopalakrishnan, Aleksandar Stanić, Jürgen Schmidhuber, Michael Curtis Mozer

Abstract: Current state-of-the-art synchrony-based models encode object bindings with complex-valued activations and compute with real-valued weights in feedforward architectures. We argue for the computational advantages of a recurrent architecture with complex-valued weights. We propose a fully convolutional autoencoder, SynCx, that performs iterative constraint satisfaction: at each iteration, a hidden layer bottleneck encodes statistically regular configurations of features in particular phase relationships; over iterations, local constraints propagate and the model converges to a globally consistent configuration of phase assignments. Binding is achieved simply by the matrix-vector product operation between complex-valued weights and activations, without the need for additional mechanisms that have been incorporated into current synchrony-based models. SynCx outperforms or is strongly competitive with current models for unsupervised object discovery. SynCx also avoids certain systematic grouping errors of current models, such as the inability to separate similarly colored objects without additional supervision.
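
The abstract above says binding comes from the matrix-vector product between complex-valued weights and activations. The following is a minimal sketch of that single operation, not the SynCx architecture: the weights and activations are hypothetical values chosen by hand, and the helper name `complex_matvec` is an assumption for illustration. It shows how one output unit responds strongly when its inputs arrive in a matching phase relationship, while another cancels out.

```python
import cmath
import math


def complex_matvec(weights, activations):
    """Multiply a complex-valued weight matrix by a complex-valued activation vector."""
    return [sum(w * a for w, a in zip(row, activations)) for row in weights]


# Two input features with equal magnitude but a quarter-cycle phase difference
# (hypothetical values, written as magnitude and phase).
activations = [cmath.rect(1.0, 0.0), cmath.rect(1.0, math.pi / 2)]

# Hypothetical complex weights: each weight scales the magnitude of its input
# and rotates its phase.
weights = [
    # Rotates input 2 back by a quarter cycle, so both terms align and add up.
    [cmath.rect(0.5, 0.0), cmath.rect(0.5, -math.pi / 2)],
    # Rotates input 2 forward by a quarter cycle, so the terms oppose and cancel.
    [cmath.rect(0.5, 0.0), cmath.rect(0.5, math.pi / 2)],
]

outputs = complex_matvec(weights, activations)
magnitudes = [abs(z) for z in outputs]
phases = [cmath.phase(z) for z in outputs]

print("magnitudes:", magnitudes)
print("phase of the first output:", phases[0])
```

With real-valued weights the phase rotation in each row would not be available, which is one way to read the abstract's argument for complex-valued weights.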

Submitted 28 May, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

Comments: minor typo fixed

arXiv:2401.01974

Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

Authors: Aleksandar Stanić, Sergi Caelles, Michael Tschannen

Abstract: Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools. Recently, these models achieved great performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models heavily rely on human engineering of in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes the LLMs-as-controllers setup more robust, and removes the need for human engineering of in-context examples.
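
The abstract above mentions using a small number of labeled examples to generate in-context examples automatically. One plausible way to do that is sketched below. The filtering rule (keep a generated program only if its output matches the label) is an assumption, not something the abstract states. The names `generate_in_context_examples`, `toy_propose`, and `toy_run` are hypothetical, and the two toy functions stand in for an LLM call and for program execution with visual tools.

```python
from typing import Callable, List, Tuple


def generate_in_context_examples(
    labeled: List[Tuple[str, str]],
    propose_program: Callable[[str], str],
    run_program: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Keep (question, program) pairs whose program output matches the label."""
    kept = []
    for question, answer in labeled:
        program = propose_program(question)  # hypothetical zero-shot LLM call
        try:
            prediction = run_program(program)
        except Exception:
            continue  # discard programs that fail to execute
        if prediction == answer:
            kept.append((question, program))
    return kept


def toy_propose(question: str) -> str:
    # Stand-in for an LLM: returns a fixed "program" string per question.
    programs = {
        "How many cats?": "count('cat')",
        "Is there a dog?": "exists('dgo')",  # deliberately faulty program
    }
    return programs[question]


def toy_run(program: str) -> str:
    # Stand-in for executing a program; unknown programs raise KeyError.
    results = {"count('cat')": "2"}
    return results[program]


labeled = [("How many cats?", "2"), ("Is there a dog?", "yes")]
examples = generate_in_context_examples(labeled, toy_propose, toy_run)
print(examples)
```

The surviving pairs could then be placed in the prompt in place of hand-written examples.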

Submitted 14 May, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

arXiv:2309.11197

The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute

Authors: Aleksandar Stanić, Dylan Ashley, Oleg Serikov, Louis Kirsch, Francesco Faccio, Jürgen Schmidhuber, Thomas Hofmann, Imanol Schlag

Abstract: The Languini Kitchen serves as both a research collective and codebase designed to empower researchers with limited computational resources to contribute meaningfully to the field of language modelling. We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. The number of tokens on which a model is trained is defined by the model's throughput and the chosen compute class. Notably, this approach avoids constraints on critical hyperparameters which affect total parameters or floating-point operations. For evaluation, we pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length. On it, we compare methods based on their empirical scaling trends which are estimated through experiments at various levels of compute. This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput. While the GPT baseline achieves better perplexity throughout all our levels of compute, our LSTM baseline exhibits a predictable and more favourable scaling law. This is due to the improved throughput and the need for fewer training tokens to achieve the same decrease in test perplexity. Extrapolating the scaling laws of both models results in an intersection at roughly 50,000 accelerator hours. We hope this work can serve as the foundation for meaningful and reproducible language modelling research.
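
The protocol described above fixes the compute budget in accelerator hours and lets the model's throughput determine how many tokens it trains on. A minimal arithmetic sketch follows, assuming throughput is measured in tokens per second. The function name `token_budget` and the throughput figures are hypothetical and are not taken from the paper.

```python
def token_budget(throughput_tokens_per_second: float, accelerator_hours: float) -> int:
    """Number of training tokens a model can process within a compute class."""
    seconds = accelerator_hours * 3600
    return round(throughput_tokens_per_second * seconds)


# Hypothetical throughputs for two models compared in the same 6-hour compute class.
slow_model_tokens = token_budget(10_000, 6)
fast_model_tokens = token_budget(100_000, 6)

print(slow_model_tokens)
print(fast_model_tokens)
```

Under this reading, a model with ten times the throughput sees ten times as many tokens for the same budget, which is why throughput enters the comparison directly.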

Submitted 20 September, 2023; originally announced September 2023.

arXiv:2305.17066

Mindstorms in Natural Language-Based Societies of Mind

Authors: Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, Louis Kirsch, Bing Li, Guohao Li, Shuming Liu, Jinjie Mai, Piotr Piękos, Aditya Ramesh, Imanol Schlag, Weimin Shi, Aleksandar Stanić, Wenyi Wang, Yuhui Wang, Mengmeng Xu, Deng-Ping Fan, Bernard Ghanem, et al. (1 additional author not shown)

Abstract: Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overcome the limitations of single LLMs, improving multimodal zero-shot reasoning. In these natural language-based societies of mind (NLSOMs), new agents, all communicating through the same universal symbolic language, are easily added in a modular fashion. To demonstrate the power of NLSOMs, we assemble and experiment with several of them (having up to 129 members), leveraging mindstorms in them to solve some practical AI tasks: visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving. We view this as a starting point towards much larger NLSOMs with billions of agents, some of which may be humans. And with this emergence of great societies of heterogeneous minds, many new research questions have suddenly become paramount to the future of artificial intelligence. What should be the social structure of an NLSOM? What would be the (dis)advantages of having a monarchical rather than a democratic structure? How can principles of NN economies be used to maximize the total reward of a reinforcement learning NLSOM? In this work, we identify, discuss, and try to answer some of these questions.

Submitted 26 May, 2023; originally announced May 2023.

Comments: 9 pages in main text + 7 pages of references + 38 pages of appendices, 14 figures in main text + 13 in appendices, 7 tables in appendices

MSC Class: 68T07 ACM Class: I.2.6; I.2.11

arXiv:2305.15001

Contrastive Training of Complex-Valued Autoencoders for Object Discovery

Authors: Aleksandar Stanić, Anand Gopalakrishnan, Kazuki Irie, Jürgen Schmidhuber

Abstract: Current state-of-the-art object-centric models use slots and attention-based routing for binding. However, this class of models has several conceptual limitations: the number of slots is hardwired; all slots have equal capacity; training has high computational cost; there are no object-level relational factors within slots. Synchrony-based models in principle can address these limitations by using complex-valued activations which store binding information in their phase components. However, working examples of such synchrony-based models have been developed only very recently, and are still limited to toy grayscale datasets and simultaneous storage of fewer than three objects in practice. Here we introduce architectural modifications and a novel contrastive learning method that greatly improve the state-of-the-art synchrony-based model. For the first time, we obtain a class of synchrony-based models capable of discovering objects in an unsupervised manner in multi-object color datasets and simultaneously representing more than three objects.

Submitted 9 November, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: accepted to NeurIPS 2023

arXiv:2208.03374

Learning to Generalize with Object-centric Agents in the Open World Survival Game Crafter

Authors: Aleksandar Stanić, Yujin Tang, David Ha, Jürgen Schmidhuber

Abstract: Reinforcement learning agents must generalize beyond their training experience. Prior work has focused mostly on identical training and evaluation environments. Starting from the recently introduced Crafter benchmark, a 2D open world survival game, we introduce a new set of environments suitable for evaluating some agent's ability to generalize on previously unseen (numbers of) objects and to adapt quickly (meta-learning). In Crafter, the agents are evaluated by the number of unlocked achievements (such as collecting resources) when trained for 1M steps. We show that current agents struggle to generalize, and introduce novel object-centric agents that improve over strong baselines. We also provide critical insights of general interest for future work on Crafter through several experiments. We show that careful hyper-parameter tuning improves the PPO baseline agent by a large margin and that even feedforward agents can unlock almost all achievements by relying on the inventory display. We achieve new state-of-the-art performance on the original Crafter environment. Additionally, when trained beyond 1M steps, our tuned agents can unlock almost all achievements. We show that the recurrent PPO agents improve over feedforward ones, even with the inventory information removed. We introduce CrafterOOD, a set of 15 new environments that evaluate OOD generalization. On CrafterOOD, we show that the current agents fail to generalize, whereas our novel object-centric agents achieve state-of-the-art OOD generalization while also being interpretable. Our code is public.

Submitted 5 August, 2022; originally announced August 2022.

ACM Class: I.2.6

arXiv:2103.08877

Spatial Dependency Networks: Neural Layers for Improved Generative Image Modeling

Authors: Đorđe Miladinović, Aleksandar Stanić, Stefan Bauer, Jürgen Schmidhuber, Joachim M. Buhmann

Abstract: How to improve generative modeling by better exploiting spatial regularities and coherence in images? We introduce a novel neural network for building image generators (decoders) and apply it to variational autoencoders (VAEs). In our spatial dependency networks (SDNs), feature maps at each level of a deep neural net are computed in a spatially coherent way, using a sequential gating-based mechanism that distributes contextual information across 2-D space. We show that augmenting the decoder of a hierarchical VAE by spatial dependency layers considerably improves density estimation over baseline convolutional architectures and the state-of-the-art among the models within the same class. Furthermore, we demonstrate that SDN can be applied to large images by synthesizing samples of high quality and coherence. In a vanilla VAE setting, we find that a powerful SDN decoder also improves learning disentangled representations, indicating that neural architectures play an important role in this task. Our results suggest favoring spatial dependency over convolutional layers in various VAE settings. The accompanying source code is given at https://github.com/djordjemila/sdn.

Submitted 16 March, 2021; originally announced March 2021.

Journal ref: International Conference on Learning Representations (2021)

arXiv:2012.09581

doi 10.1016/j.cma.2021.114090

Crack propagation simulation without crack tracking algorithm: embedded discontinuity formulation with incompatible modes

Authors: A. Stanic, B. Brank, A. Ibrahimbegovic, H. G. Matthies

Abstract: We show that for the simulation of crack propagation in quasi-brittle, two-dimensional solids, very good results can be obtained with an embedded strong discontinuity quadrilateral finite element that has incompatible modes. Even more importantly, we demonstrate that these results can be obtained without using a crack tracking algorithm. Therefore, the simulation of crack patterns with several cracks, including branching, becomes possible. The avoidance of a tracking algorithm is mainly enabled by the application of a novel, local (Gauss-point based) criterion for crack nucleation, which determines the time of embedding the localisation line as well as its position and orientation. We treat the crack evolution in terms of a thermodynamical framework, with softening variables describing internal dissipative mechanisms of material degradation. As presented by numerical examples, many elements in the mesh may develop a crack, but only some of them actually open and/or slide, dissipate fracture energy, and eventually form the crack pattern. The novel approach has been implemented for statics and dynamics, and the results of computed difficult examples (including Kalthoff's test) illustrate its very satisfying performance. It effectively overcomes unfavourable restrictions of the standard embedded strong discontinuity formulations, namely the simulation of the propagation of a single crack only. Moreover, it is computationally fast and straightforward to implement. Our numerical solutions match the results of experimental tests and previously reported numerical results in terms of crack pattern, dissipated fracture energy, and load-displacement curve.

Submitted 6 August, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

Comments: 53 pages, 43 figures, research paper

arXiv:2010.03635

Hierarchical Relational Inference

Authors: Aleksandar Stanić, Sjoerd van Steenkiste, Jürgen Schmidhuber

Abstract: Common-sense physical reasoning in the real world requires learning about the interactions of objects and their dynamics. The notion of an abstract object, however, encompasses a wide variety of physical objects that differ greatly in terms of the complex behaviors they support. To address this, we propose a novel approach to physical reasoning that models objects as hierarchies of parts that may locally behave separately, but also act more globally as a single whole. Unlike prior approaches, our method learns in an unsupervised fashion directly from raw visual images to discover objects, parts, and their relations. It explicitly distinguishes multiple levels of abstraction and improves over a strong baseline at modeling synthetic and real-world videos.

Submitted 14 December, 2020; v1 submitted 7 October, 2020; originally announced October 2020.

Comments: Accepted to AAAI 2021

ACM Class: I.2.6

arXiv:1910.05231

R-SQAIR: Relational Sequential Attend, Infer, Repeat

Authors: Aleksandar Stanić, Jürgen Schmidhuber

Abstract: Traditional sequential multi-object attention models rely on a recurrent mechanism to infer object relations. We propose a relational extension (R-SQAIR) of one such attention model (SQAIR) by endowing it with a module with strong relational inductive bias that computes in parallel pairwise interactions between inferred objects. Two recently proposed relational modules are studied on tasks of unsupervised learning from videos. We demonstrate gains over sequential relational mechanisms, also in terms of combinatorial generalization.
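
The abstract above describes a module that computes pairwise interactions between inferred objects. The sketch below illustrates that general idea only and is not the R-SQAIR module. The function `pairwise_relational_effects` and the hand-written `toy_relation` are hypothetical, and a real module would use a learned network in place of `toy_relation`. Each (receiver, sender) pair is evaluated independently of the others, which is what would allow the pairs to be computed in parallel, although this plain loop runs them one after another.

```python
from typing import Callable, List

Vector = List[float]


def pairwise_relational_effects(
    objects: List[Vector],
    relation: Callable[[Vector, Vector], Vector],
) -> List[Vector]:
    """For each object, sum the effects it receives from every other object."""
    effects = []
    for i, receiver in enumerate(objects):
        total = [0.0] * len(receiver)
        for j, sender in enumerate(objects):
            if i == j:
                continue  # an object does not interact with itself
            effect = relation(receiver, sender)
            total = [t + e for t, e in zip(total, effect)]
        effects.append(total)
    return effects


def toy_relation(receiver: Vector, sender: Vector) -> Vector:
    # Hypothetical stand-in for a learned network: displacement from receiver to sender.
    return [s - r for r, s in zip(receiver, sender)]


# Three hypothetical object states, here just 2-D positions.
objects = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
effects = pairwise_relational_effects(objects, toy_relation)
print(effects)
```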

Submitted 11 October, 2019; originally announced October 2019.

Comments: 4 page workshop paper accepted at the NeurIPS 2019 Workshop on Perception as Generative Reasoning: Structure, Causality, Probability

ACM Class: I.2.6

arXiv:1605.08283

Discrete Deep Feature Extraction: A Theory and New Architectures

Authors: Thomas Wiatowski, Michael Tschannen, Aleksandar Stanić, Philipp Grohs, Helmut Bölcskei

Abstract: First steps towards a mathematical theory of deep convolutional neural networks for feature extraction were made (for the continuous-time case) in Mallat, 2012, and Wiatowski and Bölcskei, 2015. This paper considers the discrete case, introduces new convolutional neural network architectures, and proposes a mathematical framework for their analysis. Specifically, we establish deformation and translation sensitivity results of local and global nature, and we investigate how certain structural properties of the input signal are reflected in the corresponding feature vectors. Our theory applies to general filters and general Lipschitz-continuous non-linearities and pooling operators. Experiments on handwritten digit classification and facial landmark detection (including feature importance evaluation) complement the theoretical findings.

Submitted 26 May, 2016; originally announced May 2016.

Comments: Proc. of International Conference on Machine Learning (ICML), New York, USA, June 2016, to appear

Journal ref: Proc. of International Conference on Machine Learning (ICML), New York, USA, pp. 2149-2158, June 2016

Showing 1–11 of 11 results for author: Stanić, A