Search | arXiv e-print repository

Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models

Authors: Sherzod Hakimov, Yerkezhan Abdullayeva, Kushal Koshti, Antonia Schmidt, Yan Weiser, Anne Beyer, David Schlangen

Abstract: While the situation has improved for text-only models, it again seems to be the case currently that multimodal (text and image) models develop faster than ways to evaluate them. In this paper, we bring a recently developed evaluation paradigm from text models to multimodal models, namely evaluation through the goal-oriented game (self) play, complementing reference-based and preference-based evaluation. Specifically, we define games that challenge a model's capability to represent a situation from visual information and align such representations through dialogue. We find that the largest closed models perform rather well on the games that we define, while even the best open-weight models struggle with them. On further analysis, we find that the exceptional deep captioning capabilities of the largest models drive some of the performance. There is still room to grow for both kinds of models, ensuring the continued relevance of the benchmark.

Submitted 20 June, 2024; originally announced June 2024.

Comments: under review

arXiv:2406.05561 [pdf, other]

Learning Human Detected Differences in Directed Acyclic Graphs

Authors: Kathrin Guckes, Alena Beyer, Prof. Margit Pohl, Prof. Tatiana von Landesberger

Abstract: Prior research has shown that human perception of similarity differs from mathematical measures in visual comparison tasks, including those involving directed acyclic graphs. This divergence can lead to missed differences and skepticism about algorithmic results. To address this, we aim to learn the structural differences humans detect in graphs visually. We want to visualize these human-detected differences alongside actual changes, enhancing credibility and aiding users in spotting overlooked differences. Our approach aligns with recent research in machine learning capturing human behavior. We provide a data augmentation algorithm, a dataset, and a machine learning model to support this task. This work fills a gap in learning differences in directed acyclic graphs and contributes to better comparative visualizations.

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2405.20859 [pdf, other]

clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents

Authors: Anne Beyer, Kranti Chalamalasetti, Sherzod Hakimov, Brielen Madureira, Philipp Sadler, David Schlangen

Abstract: It has been established in recent work that Large Language Models (LLMs) can be prompted to "self-play" conversational games that probe certain capabilities (general instruction following, strategic goal orientation, language understanding abilities), where the resulting interactive game play can be automatically scored. In this paper, we take one of the proposed frameworks for setting up such game-play environments, and further test its usefulness as an evaluation instrument, along a number of dimensions: We show that it can easily keep up with new developments while avoiding data contamination, we show that the tests implemented within it are not yet saturated (human performance is substantially higher than that of even the best models), and we show that it lends itself to investigating additional questions, such as the impact of the prompting language on performance. We believe that the approach forms a good basis for making decisions on model choice for building applied interactive systems, and perhaps ultimately setting up a closed-loop development environment of system and simulated evaluator.

Submitted 31 May, 2024; originally announced May 2024.

Comments: under review

arXiv:2308.06095 [pdf, other]

Neural Conversation Models and How to Rein Them in: A Survey of Failures and Fixes

Authors: Fabian Galetzka, Anne Beyer, David Schlangen

Abstract: Recent conditional language models are able to continue any kind of text source in an often seemingly fluent way. This fact encouraged research in the area of open-domain conversational systems that are based on powerful language models and aim to imitate an interlocutor by generating appropriate contributions to a written dialogue. From a linguistic perspective, however, the complexity of contributing to a conversation is high. In this survey, we interpret Grice's maxims of cooperative conversation from the perspective of this specific research area and systematize the literature under the aspect of what makes a contribution appropriate: A neural conversation model has to be fluent, informative, consistent, coherent, and follow social norms. In order to ensure these qualities, recent approaches try to tame the underlying language models at various intervention points, such as data, training regime or decoding. Sorted by these categories and intervention points, we discuss promising attempts and suggest novel ways for future research.

Submitted 11 August, 2023; originally announced August 2023.

Comments: Represents the state of the field in 2022; partially based on the first author's 2022 PhD thesis

arXiv:2203.12111 [pdf]

Muscle Vision: Real Time Keypoint Based Pose Classification of Physical Exercises

Authors: Alex Moran, Bart Gebka, Joshua Goldshteyn, Autumn Beyer, Nathan Johnson, Alexander Neuwirth

Abstract: Recent advances in machine learning technology have enabled highly portable and performant models for many common tasks, especially in image recognition. One emerging field, 3D human pose recognition extrapolated from video, has now advanced to the point of enabling real-time software applications with robust enough output to support downstream machine learning tasks. In this work we propose a new machine learning pipeline and web interface that performs human pose recognition on a live video feed to detect when common exercises are performed and classify them accordingly. We present a model interface capable of webcam input with live display of classification results. Our main contributions include a keypoint and time series based lightweight approach for classifying a selected set of fitness exercises and a web-based software application for obtaining and visualizing the results in real time.

Submitted 22 March, 2022; originally announced March 2022.

Comments: Published in MICS 2022

arXiv:2105.03495 [pdf, other]

Is Incoherence Surprising? Targeted Evaluation of Coherence Prediction from Language Models

Authors: Anne Beyer, Sharid Loáiciga, David Schlangen

Abstract: Coherent discourse is distinguished from a mere collection of utterances by the satisfaction of a diverse set of constraints, for example choice of expression, logical relation between denoted events, and implicit compatibility with world-knowledge. Do neural language models encode such constraints? We design an extendable set of test suites addressing different aspects of discourse and dialogue coherence. Unlike most previous coherence evaluation studies, we address specific linguistic devices beyond sentence order perturbations, allowing for a more fine-grained analysis of what constitutes coherence and what neural models trained on a language modelling objective do encode. Extending the targeted evaluation paradigm for neural language models (Marvin and Linzen, 2018) to phenomena beyond syntax, we show that this paradigm is equally suited to evaluate linguistic qualities that contribute to the notion of coherence.

Submitted 7 May, 2021; originally announced May 2021.

Comments: Accepted as long paper at NAACL 2021

arXiv:2006.16896 [pdf]

doi 10.25972/OPUS-20232

White Paper on Crowdsourced Network and QoE Measurements -- Definitions, Use Cases and Challenges

Authors: Tobias Hoßfeld, Stefan Wunderer, André Beyer, Andrew Hall, Anika Schwind, Christian Gassner, Fabrice Guillemin, Florian Wamser, Krzysztof Wascinski, Matthias Hirth, Michael Seufert, Pedro Casas, Phuoc Tran-Gia, Werner Robitza, Wojciech Wascinski, Zied Ben Houidi

Abstract: This white paper is the outcome of the Würzburg seminar on "Crowdsourced Network and QoE Measurements" which took place from 25-26 September 2019 in Würzburg, Germany. International experts were invited from industry and academia. They are well known in their communities, having different backgrounds in crowdsourcing, mobile networks, network measurements, network performance, Quality of Service (QoS), and Quality of Experience (QoE). The discussions in the seminar focused on how crowdsourcing will support vendors, operators, and regulators to determine the Quality of Experience in new 5G networks that enable various new applications and network architectures. As a result of the discussions, the need for a white paper manifested, with the goal of providing a scientific discussion of the terms "crowdsourced network measurements" and "crowdsourced QoE measurements", describing relevant use cases for such crowdsourced data, and its underlying challenges. During the seminar, those main topics were identified, intensively discussed in break-out groups, and brought back into the plenum several times. The outcome of the seminar is this white paper at hand which is - to our knowledge - the first one covering the topic of crowdsourced network and QoE measurements.

Submitted 25 May, 2020; originally announced June 2020.

arXiv:1908.06660 [pdf, other]

doi 10.3389/frai.2020.00024

Learning to play the Chess Variant Crazyhouse above World Champion Level with Deep Neural Networks and Human Data

Authors: Johannes Czech, Moritz Willig, Alena Beyer, Kristian Kersting, Johannes Fürnkranz

Abstract: Deep neural networks have been successfully applied in learning the board games Go, chess and shogi without prior knowledge by making use of reinforcement learning. Although starting from zero knowledge has been shown to yield impressive results, it is associated with high computational costs, especially for complex games. With this paper, we present CrazyAra, which is a neural network based engine solely trained in a supervised manner for the chess variant crazyhouse. Crazyhouse is a game with a higher branching factor than chess and there is only limited data of lower quality available compared to AlphaGo. Therefore, we focus on improving efficiency in multiple aspects while relying on low computational resources. These improvements include modifications in the neural network design and training configuration, the introduction of a data normalization step and a more sample efficient Monte-Carlo tree search which has a lower chance to blunder. After training on 569,537 human games for 1.5 days we achieve a move prediction accuracy of 60.4%. During development, versions of CrazyAra played professional human players. Most notably, CrazyAra achieved a four to one win over 2017 crazyhouse world champion Justin Tan (aka LM Jann Lee) who is more than 400 Elo higher rated compared to the average player in our training set. Furthermore, we test the playing strength of CrazyAra on CPU against all participants of the second Crazyhouse Computer Championships 2017, winning against twelve of the thirteen participants. Finally, for CrazyAraFish we continue training our model on generated engine games. In ten long-time control matches playing Stockfish 10, CrazyAraFish wins three games and draws one out of ten matches.

Submitted 22 August, 2019; v1 submitted 19 August, 2019; originally announced August 2019.

Comments: 35 pages, 19 figures, 14 tables

Journal ref: Frontiers in Artificial Intelligence, Machine Learning and Artificial Intelligence, Volume 3 (2020)

arXiv:1805.03522 [pdf]

doi 10.1371/journal.pcbi.1006191

Ten simple rules for measuring the impact of workshops

Authors: Shoaib Sufi, Beth Duckles, Iveta Simera, Terhi Nurmikko-Fuller, Louisa Bellis, Wadud Miah, Adriana Wilde, Aleksandra Nenadic, Raniere Silva, Jennifer A. de Beyer, Caroline Struthers, Iain Emsley, Olivier Philippe, Melissa Balzano, Sara Coelho, Heather Ford, Catherine Jones, Vanessa Higgins

Abstract: Workshops are used to explore a specific topic, transfer knowledge, solve identified problems or create something new. In funded research projects and other research endeavours, workshops are the mechanism to gather the wider project, community or interested people together around a particular topic. However, natural questions arise: how do we measure the impact of these workshops? Do we know whether they are meeting the goals and objectives we set for them? What indicators should we use? In response to these questions, this paper will outline rules that will improve the measurement of the impact of workshops.

Submitted 9 May, 2018; originally announced May 2018.

arXiv:1404.6583 [pdf, other]

ILATO Project: Fusion of Optical Surface Models and Volumetric CT Data

Authors: Andreas Beyer, Hubert Mara, Susanne Krömker

Abstract: Project ILATO focuses on Improving Limited Angle computed Tomography by Optical data integration in order to enhance image quality and shorten acquisition times in X-ray based industrial quality inspection. Limited angle computed tomography is indicated whenever specimen dimensions exceed cone beam limits or the object is impenetrable from certain angles. Thus, acquiring only a subset of a full circle CT scan poses problems for reconstruction algorithms due to incomplete data, which introduces blurred edges and other artifacts. To support the volumetric data reconstruction algorithm, a surface mesh of the object obtained via a structured light optical scan acts as a mask defining boundaries of the reconstructed image. The registration of optically acquired surfaces with data acquired from computed tomography is our current challenge. This article presents our setup and the methods applied, and discusses the problems arising from registration of data sets created with considerably different imaging techniques.

Submitted 25 April, 2014; originally announced April 2014.

Comments: Part of the OAGM 2014 proceedings (arXiv:1404.3538)

Report number: OAGM/2014/17

Showing 1–10 of 10 results for author: Beyer, A