Search | arXiv e-print repository

AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding

Authors: Alessandro Suglia, Claudio Greco, Katie Baker, Jose L. Part, Ioannis Papaioannou, Arash Eshghi, Ioannis Konstas, Oliver Lemon

Abstract: AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate AlanaVLM's capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%. Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the next generation of Embodied AI.

Submitted 21 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

Comments: Code available https://github.com/alanaai/EVUD

arXiv:2311.01146

Building for Speech: Designing the Next Generation of Social Robots for Audio Interaction

Authors: Angus Addlesee, Ioannis Papaioannou, Oliver Lemon

Abstract: There have been incredible advancements in robotics and spoken dialogue systems (SDSs) over the past few years, yet we still don't find social robots in public spaces like train stations, shopping malls, or hospital waiting rooms. In this paper, we argue that early-stage collaboration between robot designers and SDS researchers is crucial to create social robots that can legitimately be used in real-world environments. We draw from our experiences running experiments with social robots, and the surrounding literature, to highlight recurring issues. Robots need more speakers, more microphones, quieter motors, and quieter fans to enable human-robot spoken interaction in the wild and improve accessibility. More robust robot joints are also needed to limit potential harm to older adults and other more vulnerable groups.

Submitted 17 January, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

Comments: In WTF Workshop Proceedings (arXiv:2401.04108) held in conjunction with the ACM conference on Conversational User Interfaces (CUI), 19 - 21/07 2023, in Eindhoven, The Netherlands

Report number: WTFCUI/2023/01

arXiv:2307.16689

No that's not what I meant: Handling Third Position Repair in Conversational Question Answering

Authors: Vevake Balaraman, Arash Eshghi, Ioannis Konstas, Ioannis Papaioannou

Abstract: The ability to handle miscommunication is crucial to robust and faithful conversational AI. People usually deal with miscommunication immediately as they detect it, using highly systematic interactional mechanisms called repair. One important type of repair is Third Position Repair (TPR) whereby a speaker is initially misunderstood but then corrects the misunderstanding as it becomes apparent after the addressee's erroneous response. Here, we collect and publicly release Repair-QA, the first large dataset of TPRs in a conversational question answering (QA) setting. The data is comprised of the TPR turns, corresponding dialogue contexts, and candidate repairs of the original turn for execution of TPRs. We demonstrate the usefulness of the data by training and evaluating strong baseline models for executing TPRs. For stand-alone TPR execution, we perform both automatic and human evaluations on a fine-tuned T5 model, as well as OpenAI's GPT-3 LLMs. Additionally, we extrinsically evaluate the LLMs' TPR processing capabilities in the downstream conversational QA task. The results indicate poor out-of-the-box performance on TPRs by the GPT-3 models, which then significantly improves when exposed to Repair-QA.

Submitted 31 July, 2023; originally announced July 2023.

Comments: Accepted at SIGDIAL'23

arXiv:2305.16519

The Dangers of trusting Stochastic Parrots: Faithfulness and Trust in Open-domain Conversational Question Answering

Authors: Sabrina Chiesurin, Dimitris Dimakopoulos, Marco Antonio Sobrevilla Cabezudo, Arash Eshghi, Ioannis Papaioannou, Verena Rieser, Ioannis Konstas

Abstract: Large language models are known to produce output which sounds fluent and convincing, but is also often wrong, e.g. "unfaithful" with respect to a rationale as retrieved from a knowledge base. In this paper, we show that task-based systems which exhibit certain advanced linguistic dialog behaviors, such as lexical alignment (repeating what the user said), are in fact preferred and trusted more, whereas other phenomena, such as pronouns and ellipsis, are dis-preferred. We use open-domain question answering systems as our test-bed for task-based dialog generation and compare several open- and closed-book models. Our results highlight the danger of systems that appear to be trustworthy by parroting user input while providing an unfaithful response.

Submitted 25 May, 2023; originally announced May 2023.

Comments: 5 pages, ACL Findings 2023

arXiv:2206.02523

Sparse Bayesian Learning for Complex-Valued Rational Approximations

Authors: Felix Schneider, Iason Papaioannou, Gerhard Müller

Abstract: Surrogate models are used to alleviate the computational burden in engineering tasks, which require the repeated evaluation of computationally demanding models of physical systems, such as the efficient propagation of uncertainties. For models that show a strongly non-linear dependence on their input parameters, standard surrogate techniques, such as polynomial chaos expansion, are not sufficient to obtain an accurate representation of the original model response. Through applying a rational approximation instead, the approximation error can be efficiently reduced for models whose non-linearity is accurately described through a rational function. Specifically, our aim is to approximate complex-valued models. A common approach to obtain the coefficients in the surrogate is to minimize the sample-based error between model and surrogate in the least-square sense. In order to obtain an accurate representation of the original model and to avoid overfitting, the sample set has to be two to three times the number of polynomial terms in the expansion. For models that require a high polynomial degree or are high-dimensional in terms of their input parameters, this number often exceeds the affordable computational cost. To overcome this issue, we apply a sparse Bayesian learning approach to the rational approximation. Through a specific prior distribution structure, sparsity is induced in the coefficients of the surrogate model. The denominator polynomial coefficients as well as the hyperparameters of the problem are determined through a type-II maximum likelihood approach. We apply a quasi-Newton gradient-descent algorithm in order to find the optimal denominator coefficients and derive the required gradients through application of $\mathbb{CR}$-calculus.

Submitted 27 September, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

Comments: 27 pages, 13 figures

arXiv:2106.05824

doi 10.1016/j.strusafe.2021.102179

Rare event estimation using stochastic spectral embedding

Authors: P. -R. Wagner, S. Marelli, I. Papaioannou, D. Straub, B. Sudret

Abstract: Estimating the probability of rare failure events is an essential step in the reliability assessment of engineering systems. Computing this failure probability for complex non-linear systems is challenging, and has recently spurred the development of active-learning reliability methods. These methods approximate the limit-state function (LSF) using surrogate models trained with a sequentially enriched set of model evaluations. A recently proposed method called stochastic spectral embedding (SSE) aims to improve the local approximation accuracy of global, spectral surrogate modelling techniques by sequentially embedding local residual expansions in subdomains of the input space. In this work we apply SSE to the LSF, giving rise to a stochastic spectral embedding-based reliability (SSER) method. The resulting partition of the input space decomposes the failure probability into a set of easy-to-compute conditional failure probabilities. We propose a set of modifications that tailor the algorithm to efficiently solve rare event estimation problems. These modifications include specialized refinement domain selection, partitioning and enrichment strategies. We showcase the algorithm performance on four benchmark problems of various dimensionality and complexity in the LSF.

Submitted 9 February, 2022; v1 submitted 9 June, 2021; originally announced June 2021.

Report number: RSUQ-2021-003B

Journal ref: Structural Safety, Vol. 96, 102179 (2022)

arXiv:2103.09550

doi 10.1016/j.cma.2021.114049

Uncertainty quantification of microstructure variability and mechanical behaviour of additively manufactured lattice structures

Authors: Nina Korshunova, Iason Papaioannou, Stefan Kollmannsberger, Daniel Straub, Ernst Rank

Abstract: Process-induced defects are the leading cause of discrepancies between as-designed and as-manufactured additive manufacturing (AM) product behavior. Especially for metal lattices, the variations in the printed geometry cannot be neglected. Therefore, the evaluation of the influence of microstructural variability on their mechanical behavior is crucial for the quality assessment of the produced structures. Commonly, the as-manufactured geometry can be obtained by computed tomography (CT). However, incorporating all process-induced defects into the numerical analysis is often computationally demanding. Thus, this task is commonly limited to a predefined set of considered variations, such as strut size or strut diameter. In this work, a CT-based binary random field is proposed to generate statistically equivalent geometries of periodic metal lattices. The proposed random field model, in combination with the Finite Cell Method (FCM), an immersed boundary method, allows efficient evaluation of the influence of the underlying microstructure on the variability of the mechanical behavior of AM products. Numerical analysis of two lattices manufactured at different scales shows an excellent agreement with experimental data. Furthermore, it provides a unique insight into the effects of the process on the occurring geometrical variations and final mechanical behavior.

Submitted 17 March, 2021; originally announced March 2021.

arXiv:1912.11029

doi 10.1016/j.jcp.2020.109498

Sparse Polynomial Chaos expansions using Variational Relevance Vector Machines

Authors: Panagiotis Tsilifis, Iason Papaioannou, Daniel Straub, Fabio Nobile

Abstract: The challenges for non-intrusive methods for Polynomial Chaos modeling lie in the computational efficiency and accuracy under a limited number of model simulations. These challenges can be addressed by enforcing sparsity in the series representation through retaining only the most important basis terms. In this work, we present a novel sparse Bayesian learning technique for obtaining sparse Polynomial Chaos expansions which is based on a Relevance Vector Machine model and is trained using Variational Inference. The methodology shows great potential in high-dimensional data-driven settings using relatively few data points and achieves user-controlled sparse levels that are comparable to other methods such as compressive sensing. The proposed approach is illustrated on two numerical examples, a synthetic response function that is explored for validation purposes and a low-carbon steel plate with random Young's modulus and random loading, which is modeled by stochastic finite element with 38 input random variables.

Submitted 23 December, 2019; originally announced December 2019.

Comments: Submitted to Journal of Computational Physics

arXiv:1909.06749

MuMMER: Socially Intelligent Human-Robot Interaction in Public Spaces

Authors: Mary Ellen Foster, Bart Craenen, Amol Deshmukh, Oliver Lemon, Emanuele Bastianelli, Christian Dondrup, Ioannis Papaioannou, Andrea Vanzo, Jean-Marc Odobez, Olivier Canévet, Yuanzhouhan Cao, Weipeng He, Angel Martínez-González, Petr Motlicek, Rémy Siegfried, Rachid Alami, Kathleen Belhassein, Guilhem Buisan, Aurélie Clodic, Amandine Mayima, Yoan Sallami, Guillaume Sarthou, Phani-Teja Singamaneni, Jules Waldhart, Alexandre Mazel, et al. (5 additional authors not shown)

Abstract: In the EU-funded MuMMER project, we have developed a social robot designed to interact naturally and flexibly with users in public spaces such as a shopping mall. We present the latest version of the robot system developed during the project. This system encompasses audio-visual sensing, social signal processing, conversational interaction, perspective taking, geometric reasoning, and motion planning. It successfully combines all these components in an overarching framework using the Robot Operating System (ROS) and has been deployed to a shopping mall in Finland interacting with customers. In this paper, we describe the system components, their interplay, and the resulting robot behaviours and scenarios provided at the shopping mall.

Submitted 15 September, 2019; originally announced September 2019.

Report number: AI-HRI/2019/14

arXiv:1909.06174

Petri Net Machines for Human-Agent Interaction

Authors: Christian Dondrup, Ioannis Papaioannou, Oliver Lemon

Abstract: Smart speakers and robots are becoming ever more prevalent in our daily lives. These agents are able to execute a wide range of tasks and actions and, therefore, need systems to control their execution. The current state of the art, such as (deep) reinforcement learning, however, requires vast amounts of data for training, which is often hard to come by when interacting with humans. To overcome this issue, most systems still rely on Finite State Machines. We introduce Petri Net Machines which present a formal definition for state machines based on Petri Nets that are able to execute concurrent actions reliably, execute and interleave several plans at the same time, and provide an easy-to-use modelling language. We show their workings based on the example of Human-Robot Interaction in a shopping mall.

Submitted 13 September, 2019; originally announced September 2019.

Report number: AI-HRI/2019/02

arXiv:1712.07558

An Ensemble Model with Ranking for Social Dialogue

Authors: Ioannis Papaioannou, Amanda Cercas Curry, Jose L. Part, Igor Shalyminov, Xinnuo Xu, Yanchao Yu, Ondřej Dušek, Verena Rieser, Oliver Lemon

Abstract: Open-domain social dialogue is one of the long-standing goals of Artificial Intelligence. This year, the Amazon Alexa Prize challenge was announced for the first time, where real customers get to rate systems developed by leading universities worldwide. The aim of the challenge is to converse "coherently and engagingly with humans on popular topics for 20 minutes". We describe our Alexa Prize system (called 'Alana') consisting of an ensemble of bots, combining rule-based and machine learning systems, and using a contextual ranking mechanism to choose a system response. The ranker was trained on real user feedback received during the competition, where we address the problem of how to train on the noisy and sparse feedback obtained during the competition.

Submitted 20 December, 2017; originally announced December 2017.

Comments: NIPS 2017 Workshop on Conversational AI

arXiv:1706.02757

Sympathy Begins with a Smile, Intelligence Begins with a Word: Use of Multimodal Features in Spoken Human-Robot Interaction

Authors: Jekaterina Novikova, Christian Dondrup, Ioannis Papaioannou, Oliver Lemon

Abstract: Recognition of social signals, from human facial expressions or prosody of speech, is a popular research topic in human-robot interaction studies. There is also a long line of research in the spoken dialogue community that investigates user satisfaction in relation to dialogue characteristics. However, very little research relates a combination of multimodal social signals and language features detected during spoken face-to-face human-robot interaction to the resulting user perception of a robot. In this paper we show how different emotional facial expressions of human users, in combination with prosodic characteristics of human speech and features of human-robot dialogue, correlate with users' impressions of the robot after a conversation. We find that happiness in the user's recognised facial expression strongly correlates with likeability of a robot, while dialogue-related features (such as number of human turns or number of sentences per robot utterance) correlate with perceiving a robot as intelligent. In addition, we show that facial expression, emotional features, and prosody are better predictors of human ratings related to perceived robot likeability and anthropomorphism, while linguistic and non-linguistic features more often predict perceived robot intelligence and interpretability. As such, these characteristics may in future be used as an online reward signal for in-situ Reinforcement Learning based adaptive human-robot dialogue systems.

Submitted 8 June, 2017; originally announced June 2017.

Comments: Robo-NLP workshop at ACL 2017. 9 pages, 5 figures, 6 tables

Showing 1–12 of 12 results for author: Papaioannou, I