-
The Overcooked Generalisation Challenge
Authors:
Constantin Ruhdorfer,
Matteo Bortoletto,
Anna Penzkofer,
Andreas Bulling
Abstract:
We introduce the Overcooked Generalisation Challenge (OGC) - the first benchmark to study agents' zero-shot cooperation abilities when faced with novel partners and levels in the Overcooked-AI environment. This perspective starkly contrasts a large body of previous work that has trained and evaluated cooperating agents only on the same level, failing to capture generalisation abilities required for real-world human-AI cooperation. Our challenge interfaces with state-of-the-art dual curriculum design (DCD) methods to generate auto-curricula for training general agents in Overcooked. It is the first cooperative multi-agent environment specially designed for DCD methods and, consequently, the first benchmarked with state-of-the-art methods. It is fully GPU-accelerated, built on the DCD benchmark suite minimax, and freely available under an open-source license: https://git.hcics.simtech.uni-stuttgart.de/public-projects/OGC. We show that current DCD algorithms struggle to produce useful policies in this novel challenge, even if combined with recent network architectures that were designed for scalability and generalisability. The OGC pushes the boundaries of real-world human-AI cooperation by enabling the research community to study the impact of generalisation on cooperating agents.
Submitted 25 June, 2024;
originally announced June 2024.
-
Benchmarking Mental State Representations in Language Models
Authors:
Matteo Bortoletto,
Constantin Ruhdorfer,
Lei Shi,
Andreas Bulling
Abstract:
While numerous works have assessed the generative performance of language models (LMs) on tasks requiring Theory of Mind reasoning, research into the models' internal representation of mental states remains limited. Recent work has used probing to demonstrate that LMs can represent beliefs of themselves and others. However, these claims are accompanied by limited evaluation, making it difficult to assess how mental state representations are affected by model design and training choices. We report an extensive benchmark with various LM types with different model sizes, fine-tuning approaches, and prompt designs to study the robustness of mental state representations and memorisation issues within the probes. Our results show that the quality of models' internal representations of the beliefs of others increases with model size and, more crucially, with fine-tuning. We are the first to study how prompt variations impact probing performance on theory of mind tasks. We demonstrate that models' representations are sensitive to prompt variations, even when such variations should be beneficial. Finally, we complement previous activation editing experiments on Theory of Mind tasks and show that it is possible to improve models' reasoning performance by steering their activations without the need to train any probe.
Submitted 1 July, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
-
Limits of Theory of Mind Modelling in Dialogue-Based Collaborative Plan Acquisition
Authors:
Matteo Bortoletto,
Constantin Ruhdorfer,
Adnen Abdessaied,
Lei Shi,
Andreas Bulling
Abstract:
Recent work on dialogue-based collaborative plan acquisition (CPA) has suggested that Theory of Mind (ToM) modelling can improve missing knowledge prediction in settings with asymmetric skill-sets and knowledge. Although ToM was claimed to be important for effective collaboration, its real impact on this novel task remains under-explored. By representing plans as graphs and by exploiting task-specific constraints we show that, as performance on CPA nearly doubles when predicting one's own missing knowledge, the improvements due to ToM modelling diminish. This phenomenon persists even when evaluating existing baseline methods. To better understand the relevance of ToM for CPA, we report a principled performance comparison of models with and without ToM features. Results across different models and ablations consistently suggest that learned ToM features are indeed more likely to reflect latent patterns in the data with no perceivable link to ToM. This finding calls for a deeper understanding of the role of ToM in CPA and beyond, as well as new methods for modelling and evaluating mental states in computational collaborative agents.
Submitted 28 May, 2024; v1 submitted 21 May, 2024;
originally announced May 2024.
-
TMS-EEG Reliability: Bridging the Gap to Clinical Use
Authors:
Giacomo Bertazzoli,
Carlo Miniussi,
Petro Julkunen,
Marta Bortoletto
Abstract:
Concurrent transcranial magnetic stimulation (TMS) and electroencephalography (EEG), or TMS-EEG, holds the potential to broaden the clinical applications of TMS beyond its traditional role in evaluating the cortico-spinal tract and motor cortices. TMS-evoked potentials (TEPs) have emerged as valuable tools in clinical research, enabling the assessment of cortical excitability and effective connectivity between cortical regions in various psychiatric and neurological disorders and are increasingly recognized as a promising candidate biomarker for aiding in diagnosis and prognosis. Despite the well-established diagnostic utility of TMS, the clinical implementation of TMS-EEG has yet to meet the necessary standards. One critical aspect that often remains unaddressed is the reliability of TEP measurements. In this context, we outline the crucial reliability assessments required to determine the clinical applicability of TEPs. Firstly, we conduct a comprehensive review of the existing literature on reliability, encompassing both theoretical and statistical considerations. Subsequently, we present the current state of knowledge on TEP reliability. We emphasize the specific elements of reliability that must be incorporated to facilitate a unified, evidence-derived assessment of TMS-EEG as a clinical tool.
Submitted 7 February, 2024;
originally announced February 2024.
-
Neural Reasoning About Agents' Goals, Preferences, and Actions
Authors:
Matteo Bortoletto,
Lei Shi,
Andreas Bulling
Abstract:
We propose the Intuitive Reasoning Network (IRENE) - a novel neural model for intuitive psychological reasoning about agents' goals, preferences, and actions that can generalise previous experiences to new situations. IRENE combines a graph neural network for learning agent and world state representations with a transformer to encode the task context. When evaluated on the challenging Baby Intuitions Benchmark, IRENE achieves new state-of-the-art performance on three out of its five tasks - with up to 48.9% improvement. In contrast to existing methods, IRENE is able to bind preferences to specific agents, to better distinguish between rational and irrational agents, and to better understand the role of blocking obstacles. We also investigate, for the first time, the influence of the training tasks on test performance. Our analyses demonstrate the effectiveness of IRENE in combining prior knowledge gained during training for unseen evaluation tasks.
Submitted 12 December, 2023;
originally announced December 2023.
-
The Past, Present, and Future of the Brain Imaging Data Structure (BIDS)
Authors:
Russell A. Poldrack,
Christopher J. Markiewicz,
Stefan Appelhoff,
Yoni K. Ashar,
Tibor Auer,
Sylvain Baillet,
Shashank Bansal,
Leandro Beltrachini,
Christian G. Benar,
Giacomo Bertazzoli,
Suyash Bhogawar,
Ross W. Blair,
Marta Bortoletto,
Mathieu Boudreau,
Teon L. Brooks,
Vince D. Calhoun,
Filippo Maria Castelli,
Patricia Clement,
Alexander L Cohen,
Julien Cohen-Adad,
Sasha D'Ambrosio,
Gilles de Hollander,
María de la iglesia-Vayá,
Alejandro de la Vega,
Arnaud Delorme,
et al. (89 additional authors not shown)
Abstract:
The Brain Imaging Data Structure (BIDS) is a community-driven standard for the organization of data and metadata from a growing range of neuroscience modalities. This paper is meant as a history of how the standard has developed and grown over time. We outline the principles behind the project, the mechanisms by which it has been extended, and some of the challenges being addressed as it evolves. We also discuss the lessons learned through the project, with the aim of enabling researchers in other domains to learn from the success of BIDS.
Submitted 8 January, 2024; v1 submitted 11 September, 2023;
originally announced September 2023.
-
Exploring Natural Language Processing Methods for Interactive Behaviour Modelling
Authors:
Guanhua Zhang,
Matteo Bortoletto,
Zhiming Hu,
Lei Shi,
Mihai Bâce,
Andreas Bulling
Abstract:
Analysing and modelling interactive behaviour is an important topic in human-computer interaction (HCI) and a key requirement for the development of intelligent interactive systems. Interactive behaviour has a sequential (actions happen one after another) and hierarchical (a sequence of actions forms an activity driven by interaction goals) structure, which may be similar to the structure of natural language. Designed based on such a structure, natural language processing (NLP) methods have achieved groundbreaking success in various downstream tasks. However, few works linked interactive behaviour with natural language. In this paper, we explore the similarity between interactive behaviour and natural language by applying an NLP method, byte pair encoding (BPE), to encode mouse and keyboard behaviour. We then analyse the vocabulary, i.e., the set of action sequences, learnt by BPE, as well as use the vocabulary to encode the input behaviour for interactive task recognition. An existing dataset collected in constrained lab settings and our novel out-of-the-lab dataset were used for evaluation. Results show that this natural language-inspired approach not only learns action sequences that reflect specific interaction goals, but also achieves higher F1 scores on task recognition than other methods. Our work reveals the similarity between interactive behaviour and natural language, and presents the potential of applying the new pack of methods that leverage insights from NLP to model interactive behaviour in HCI.
Submitted 11 May, 2023; v1 submitted 28 March, 2023;
originally announced March 2023.