Showing 1–47 of 47 results for author: Thomason, J

  1. arXiv:2406.13636  [pdf, other]

    cs.RO cs.LG

    Contrast Sets for Evaluating Language-Guided Robot Policies

    Authors: Abrar Anwar, Rohan Gupta, Jesse Thomason

    Abstract: Robot evaluations in language-guided, real world settings are time-consuming and often sample only a small space of potential instructions across complex scenes. In this work, we introduce contrast sets for robotics as an approach to make small, but specific, perturbations to otherwise independent, identically distributed (i.i.d.) test instances. We investigate the relationship between experimente…

    Submitted 19 June, 2024; originally announced June 2024.

  2. arXiv:2406.13131  [pdf, other]

    cs.CL

    When Parts are Greater Than Sums: Individual LLM Components Can Outperform Full Models

    Authors: Ting-Yun Chang, Jesse Thomason, Robin Jia

    Abstract: This paper studies in-context learning (ICL) by decomposing the output of large language models into the individual contributions of attention heads and MLPs (components). We observe curious components: good-performing ones that individually do well on a classification task, even when the model performs poorly; bad-performing ones that do much worse than chance; and label-biased components that al…

    Submitted 24 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

    Comments: fix typos and citations; appendix

  3. arXiv:2406.02791  [pdf, other]

    cs.AI cs.CL cs.RO

    Language Models can Infer Action Semantics for Classical Planners from Environment Feedback

    Authors: Wang Zhu, Ishika Singh, Robin Jia, Jesse Thomason

    Abstract: Classical planning approaches guarantee finding a set of actions that can achieve a given goal state when possible, but require an expert to specify logical action semantics that govern the dynamics of the environment. Researchers have shown that Large Language Models (LLMs) can be used to directly infer planning steps based on commonsense knowledge and minimal domain information alone, but such p…

    Submitted 4 June, 2024; originally announced June 2024.

  4. arXiv:2403.17246  [pdf, other]

    cs.AI cs.CL cs.MA cs.RO

    TwoStep: Multi-agent Task Planning using Classical Planners and Large Language Models

    Authors: Ishika Singh, David Traum, Jesse Thomason

    Abstract: Classical planning formulations like the Planning Domain Definition Language (PDDL) admit action sequences guaranteed to achieve a goal state given an initial state if any are possible. However, reasoning problems defined in PDDL do not capture temporal aspects of action taking, for example that two agents in the domain can execute an action simultaneously if postconditions of each do not interfer…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: 12 pages

  5. arXiv:2403.10940  [pdf, other]

    cs.RO cs.LG

    ViSaRL: Visual Reinforcement Learning Guided by Human Saliency

    Authors: Anthony Liang, Jesse Thomason, Erdem Bıyık

    Abstract: Training robots to perform complex control tasks from high-dimensional pixel input using reinforcement learning (RL) is sample-inefficient, because image observations are comprised primarily of task-irrelevant information. By contrast, humans are able to visually attend to task-relevant objects and areas. Based on this insight, we introduce Visual Saliency-Guided Reinforcement Learning (ViSaRL). U…

    Submitted 16 March, 2024; originally announced March 2024.

  6. arXiv:2402.15610  [pdf, other]

    cs.CL

    Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning

    Authors: Tejas Srinivasan, Jack Hessel, Tanmay Gupta, Bill Yuchen Lin, Yejin Choi, Jesse Thomason, Khyathi Raghavi Chandu

    Abstract: Selective prediction minimizes incorrect predictions from vision-language models (VLMs) by allowing them to abstain from answering when uncertain. However, when deploying a vision-language system with low tolerance for inaccurate predictions, selective prediction may be over-cautious and abstain too frequently, even on many correct predictions. We introduce ReCoVERR, an inference-time algorithm to…

    Submitted 12 June, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

    Comments: Accepted to ACL Findings 2024

  7. arXiv:2402.13584  [pdf, other]

    cs.CL

    WinoViz: Probing Visual Properties of Objects Under Different States

    Authors: Woojeong Jin, Tejas Srinivasan, Jesse Thomason, Xiang Ren

    Abstract: Humans perceive and comprehend different visual properties of an object based on specific contexts. For instance, we know that a banana turns brown "when it becomes rotten," whereas it appears green "when it is unripe." Previous studies on probing visual commonsense knowledge have primarily focused on examining language models' understanding of typical properties (e.g., colors and shapes) of o…

    Submitted 21 February, 2024; originally announced February 2024.

    Comments: Preprint

  8. arXiv:2402.08191  [pdf, other]

    cs.RO cs.AI cs.LG

    THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

    Authors: Wilbert Pumacay, Ishika Singh, Jiafei Duan, Ranjay Krishna, Jesse Thomason, Dieter Fox

    Abstract: To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of studies evaluate robot performance in environments closely resembling or even identical to the training setup. We present THE COLOSSEUM, a novel simulation benchmark, with 20 diverse manipulation tasks, that enabl…

    Submitted 27 May, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

    Comments: RSS 2024. 33 pages

  9. arXiv:2311.17280  [pdf, other]

    cs.CL cs.CV

    Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions?

    Authors: Wang Zhu, Ishika Singh, Yuan Huang, Robin Jia, Jesse Thomason

    Abstract: Data augmentation via back-translation is common when pretraining Vision-and-Language Navigation (VLN) models, even though the generated instructions are noisy. But: does that noise matter? We find that nonsensical or irrelevant language instructions during pretraining can have little effect on downstream performance for both HAMT and VLN-BERT on R2R, and is still better than only using clean, hum…

    Submitted 23 December, 2023; v1 submitted 28 November, 2023; originally announced November 2023.

    Comments: Accepted by O-DRUM @ CVPR 2023

  10. arXiv:2311.09612  [pdf, other]

    cs.CV cs.CL

    Efficient End-to-End Visual Document Understanding with Rationale Distillation

    Authors: Wang Zhu, Alekh Agarwal, Mandar Joshi, Robin Jia, Jesse Thomason, Kristina Toutanova

    Abstract: Understanding visually situated language requires interpreting complex layouts of textual and visual elements. Pre-processing tools, such as optical character recognition (OCR), can map document image inputs to textual tokens, then large language models (LLMs) can reason over text. However, such methods have high computational and engineering complexity. Can small pretrained image-to-text models a…

    Submitted 1 April, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

    Comments: Accepted by NAACL 2024

  11. arXiv:2311.09060  [pdf, other]

    cs.CL

    Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks

    Authors: Ting-Yun Chang, Jesse Thomason, Robin Jia

    Abstract: The concept of localization in LLMs is often mentioned in prior work; however, methods for localization have never been systematically and directly evaluated. We propose two complementary benchmarks that evaluate the ability of localization methods to pinpoint LLM components responsible for memorized data. In our INJ benchmark, we actively inject a piece of new information into a small subset of L…

    Submitted 2 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: accepted by NAACL 2024

  12. arXiv:2311.06694  [pdf, other]

    cs.CL cs.AI cs.CV cs.RO

    Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding

    Authors: Chancharik Mitra, Abrar Anwar, Rodolfo Corona, Dan Klein, Trevor Darrell, Jesse Thomason

    Abstract: When connecting objects and their language referents in an embodied 3D environment, it is important to note that: (1) an object can be better characterized by leveraging comparative information between itself and other objects, and (2) an object's appearance can vary with camera position. As such, we present the Multi-view Approach to Grounding in Context (MAGiC), which selects an object referent…

    Submitted 6 April, 2024; v1 submitted 11 November, 2023; originally announced November 2023.

    Journal ref: North American Chapter of the Association for Computational Linguistics (NAACL), 2024

  13. The Sem-Lex Benchmark: Modeling ASL Signs and Their Phonemes

    Authors: Lee Kezar, Elana Pontecorvo, Adele Daniels, Connor Baer, Ruth Ferster, Lauren Berger, Jesse Thomason, Zed Sevcikova Sehyr, Naomi Caselli

    Abstract: Sign language recognition and translation technologies have the potential to increase access and inclusion of deaf signing communities, but research progress is bottlenecked by a lack of representative data. We introduce a new resource for American Sign Language (ASL) modeling, the Sem-Lex Benchmark. The Benchmark is the current largest of its kind, consisting of over 84k videos of isolated sign p…

    Submitted 29 September, 2023; originally announced October 2023.

    Comments: In Proceedings of the ACM Conference on Accessibility (ASSETS) 2023

  14. arXiv:2310.00195  [pdf, other]

    cs.CL cs.CV

    Exploring Strategies for Modeling Sign Language Phonology

    Authors: Lee Kezar, Riley Carlin, Tejas Srinivasan, Zed Sehyr, Naomi Caselli, Jesse Thomason

    Abstract: Like speech, signs are composed of discrete, recombinable features called phonemes. Prior work shows that models which can recognize phonemes are better at sign recognition, motivating deeper exploration into strategies for modeling sign language phonemes. In this work, we learn graph convolution networks to recognize the sixteen phoneme "types" found in ASL-LEX 2.0. Specifically, we explore how l…

    Submitted 29 September, 2023; originally announced October 2023.

    Comments: In Proceedings of the European Symposium for Artificial Neural Networks (ESANN) 2023

  15. arXiv:2305.14901  [pdf, other]

    cs.CL cs.LG

    Chain-of-Questions Training with Latent Answers for Robust Multistep Question Answering

    Authors: Wang Zhu, Jesse Thomason, Robin Jia

    Abstract: We train a language model (LM) to robustly answer multistep questions by generating and answering sub-questions. We propose Chain-of-Questions, a framework that trains a model to generate sub-questions and sub-answers one at a time by leveraging human annotated question decomposition meaning representation (QDMR). The key technical challenge is that QDMR only contains sub-questions but not answers…

    Submitted 23 December, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted by EMNLP 2023

  16. arXiv:2304.02168  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    I2I: Initializing Adapters with Improvised Knowledge

    Authors: Tejas Srinivasan, Furong Jia, Mohammad Rostami, Jesse Thomason

    Abstract: Adapters present a promising solution to the catastrophic forgetting problem in continual learning. However, training independent Adapter modules for every new task misses an opportunity for cross-task knowledge transfer. We propose Improvise to Initialize (I2I), a continual learning algorithm that initializes Adapters for incoming tasks by distilling knowledge from previously-learned tasks' Adapt…

    Submitted 10 July, 2023; v1 submitted 4 April, 2023; originally announced April 2023.

    Comments: Accepted at 2nd Conference on Lifelong Learning Agents (CoLLAs), 2023

  17. arXiv:2303.14423  [pdf, other]

    cs.LG

    Task-Attentive Transformer Architecture for Continual Learning of Vision-and-Language Tasks Using Knowledge Distillation

    Authors: Yuliang Cai, Jesse Thomason, Mohammad Rostami

    Abstract: The size and the computational load of fine-tuning large-scale pre-trained neural networks are becoming two major obstacles in adopting machine learning in many applications. Continual learning (CL) can serve as a remedy through enabling knowledge-transfer across sequentially arriving tasks which relaxes the need to fine-tune all network weights from scratch. However, existing CL algorithms primari…

    Submitted 25 March, 2023; originally announced March 2023.

  18. Multimodal Speech Recognition for Language-Guided Embodied Agents

    Authors: Allen Chang, Xiaoyuan Zhu, Aarav Monga, Seoho Ahn, Tejas Srinivasan, Jesse Thomason

    Abstract: Benchmarks for language-guided embodied agents typically assume text-based instructions, but deployed agents will encounter spoken instructions. While Automatic Speech Recognition (ASR) models can bridge the input gap, erroneous ASR transcripts can hurt the agents' ability to complete tasks. In this work, we propose training a multimodal ASR model to reduce errors in transcribing spoken instructio…

    Submitted 9 October, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

    Comments: 5 pages, 5 figures

    Journal ref: Proceedings of Interspeech 2023, 1608-1612

  19. arXiv:2302.05759  [pdf, other]

    cs.CL cs.CV

    Improving Sign Recognition with Phonology

    Authors: Lee Kezar, Jesse Thomason, Zed Sevcikova Sehyr

    Abstract: We use insights from research on American Sign Language (ASL) phonology to train models for isolated sign language recognition (ISLR), a step towards automatic sign language understanding. Our key insight is to explicitly recognize the role of phonology in sign production to achieve more accurate ISLR than existing work which does not consider sign language phonology. We train ISLR models that tak…

    Submitted 11 February, 2023; originally announced February 2023.

    ACM Class: I.2.7; I.2.10; I.5.4

  20. arXiv:2301.12614  [pdf, other]

    cs.RO cs.AI cs.CV

    RREx-BoT: Remote Referring Expressions with a Bag of Tricks

    Authors: Gunnar A. Sigurdsson, Jesse Thomason, Gaurav S. Sukhatme, Robinson Piramuthu

    Abstract: Household robots operate in the same space for years. Such robots incrementally build dynamic maps that can be used for tasks requiring remote object localization. However, benchmarks in robot learning often test generalization through inference on tasks in unobserved environments. In an observed environment, locating an object is reduced to choosing from among all object proposals in the environm…

    Submitted 29 January, 2023; originally announced January 2023.

  21. arXiv:2211.16649  [pdf, other]

    cs.CV cs.AI cs.CL cs.RO

    CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation

    Authors: Vishnu Sashank Dorbala, Gunnar Sigurdsson, Robinson Piramuthu, Jesse Thomason, Gaurav S. Sukhatme

    Abstract: Household environments are visually diverse. Embodied agents performing Vision-and-Language Navigation (VLN) in the wild must be able to handle this diversity, while also following arbitrary language instructions. Recently, Vision-Language models like CLIP have shown great performance on the task of zero-shot object recognition. In this work, we ask if these models are also capable of zero-shot la…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: 8 pages, Accepted at LangRob Workshop at Conference on Robot Learning (CoRL), 2022

  22. arXiv:2210.15037  [pdf, other]

    cs.CL cs.CV

    Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems

    Authors: Wang Zhu, Jesse Thomason, Robin Jia

    Abstract: For vision-and-language reasoning tasks, both fully connectionist, end-to-end methods and hybrid, neuro-symbolic methods have achieved high in-distribution performance. In which out-of-distribution settings does each paradigm excel? We investigate this question on both single-image and multi-image visual question-answering through four types of generalization tests: a novel segment-combine test fo…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: Accepted by the Findings of EMNLP 2022

  23. arXiv:2210.06849  [pdf, other]

    cs.CV

    Retrospectives on the Embodied AI Workshop

    Authors: Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi , et al. (14 additional authors not shown)

    Abstract: We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of…

    Submitted 4 December, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

  24. arXiv:2210.03087  [pdf, other]

    cs.CV cs.CL cs.RO

    Iterative Vision-and-Language Navigation

    Authors: Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, Jesse Thomason

    Abstract: We present Iterative Vision-and-Language Navigation (IVLN), a paradigm for evaluating language-guided agents navigating in a persistent environment over time. Existing Vision-and-Language Navigation (VLN) benchmarks erase the agent's memory at the beginning of every episode, testing the ability to perform cold-start navigation with no prior information. However, deployed robots occupy the same env…

    Submitted 24 December, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: Accepted by CVPR 2023

  25. arXiv:2209.11302  [pdf, other]

    cs.RO cs.AI cs.CL cs.LG

    ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

    Authors: Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, Animesh Garg

    Abstract: Task planning can require defining myriad domain knowledge about the world in which a robot needs to act. To ameliorate that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even generate action sequences directly, given an instruction in natural language with no additional domain information. However, such methods either require enumeratin…

    Submitted 22 September, 2022; originally announced September 2022.

  26. arXiv:2208.09021  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    VAuLT: Augmenting the Vision-and-Language Transformer for Sentiment Classification on Social Media

    Authors: Georgios Chochlakis, Tejas Srinivasan, Jesse Thomason, Shrikanth Narayanan

    Abstract: We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT is an extension of the popular Vision-and-Language Transformer (ViLT), and improves performance on vision-and-language (VL) tasks that involve more complex text inputs than image captions while having minimal impact on training and inference efficiency. ViLT, importantly, enables efficient training and inference in VL tasks, a…

    Submitted 25 January, 2023; v1 submitted 18 August, 2022; originally announced August 2022.

    Comments: 5 pages, 1 figure

  27. arXiv:2207.14525  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Curriculum Learning for Data-Efficient Vision-Language Alignment

    Authors: Tejas Srinivasan, Xiang Ren, Jesse Thomason

    Abstract: Aligning image and text encoders from scratch using contrastive learning requires large amounts of paired image-text data. We alleviate this need by aligning individually pre-trained language and vision representation models using a much smaller amount of paired data, augmented with a curriculum learning algorithm to learn fine-grained vision-language alignments. TOnICS (Training with Ontology-Inf…

    Submitted 29 July, 2022; originally announced July 2022.

  28. Geolocated Social Media Posts are Happier: Understanding the Characteristics of Check-in Posts on Twitter

    Authors: Julie Jiang, Jesse Thomason, Francesco Barbieri, Emilio Ferrara

    Abstract: The increasing prevalence of location-sharing features on social media has enabled researchers to ground computational social science research using geolocated data, affording opportunities to study human mobility, the impact of real-world events, and more. This paper analyzes what crucially separates posts with geotags from those without. We find that users who share location are not representati…

    Submitted 13 February, 2023; v1 submitted 22 July, 2022; originally announced July 2022.

    Comments: 11 pages, 10 figures, 2 tables

    Journal ref: 15th ACM Web Science Conference 2023 (WebSci '23)

  29. arXiv:2207.00627  [pdf, other]

    cs.FL cs.CL

    Interactive Learning from Natural Language and Demonstrations using Signal Temporal Logic

    Authors: Sara Mohammadinejad, Jesse Thomason, Jyotirmoy V. Deshmukh

    Abstract: Natural language is an intuitive way for humans to communicate tasks to a robot. While natural language (NL) is ambiguous, real world tasks and their safety requirements need to be communicated unambiguously. Signal Temporal Logic (STL) is a formal logic that can serve as a versatile, expressive, and unambiguous formal language to describe robotic tasks. On one hand, existing work in using STL for…

    Submitted 1 July, 2022; originally announced July 2022.

  30. arXiv:2206.09059  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks

    Authors: Tejas Srinivasan, Ting-Yun Chang, Leticia Leonor Pinto Alva, Georgios Chochlakis, Mohammad Rostami, Jesse Thomason

    Abstract: Current state-of-the-art vision-and-language models are evaluated on tasks either individually or in a multi-task setting, overlooking the challenges of continually learning (CL) tasks as they arrive. Existing CL benchmarks have facilitated research on task adaptation and mitigating "catastrophic forgetting", but are limited to vision-only and language-only tasks. We present CLiMB, a benchmark to…

    Submitted 24 November, 2022; v1 submitted 17 June, 2022; originally announced June 2022.

    Comments: Accepted to NeurIPS 2022 Datasets and Benchmarks track

  31. arXiv:2203.12667  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

    Authors: Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, Xin Eric Wang

    Abstract: A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic towards this goal, and receives increasing attention from natural language processing, computer vision, robotics, and machine learning c…

    Submitted 3 June, 2022; v1 submitted 22 March, 2022; originally announced March 2022.

    Comments: 19 pages. Accepted to ACL 2022

    Journal ref: ACL 2022, Long, pages 7606-7623, Dublin, Ireland. Association for Computational Linguistics

  32. arXiv:2111.05527  [pdf, other]

    cs.AI

    LUMINOUS: Indoor Scene Generation for Embodied AI Challenges

    Authors: Yizhou Zhao, Kaixiang Lin, Zhiwei Jia, Qiaozi Gao, Govind Thattai, Jesse Thomason, Gaurav S. Sukhatme

    Abstract: Learning-based methods for training embodied agents typically require a large number of high-quality scenes that contain realistic layouts and support meaningful interactions. However, current simulators for Embodied AI (EAI) challenges only provide simulated indoor scenes with a limited number of layouts. This paper presents Luminous, the first research framework that employs state-of-the-art ind…

    Submitted 9 November, 2021; originally announced November 2021.

    Comments: 2021 paper, Amazon

  33. arXiv:2110.00534  [pdf, other]

    cs.CV cs.AI cs.CL cs.RO

    TEACh: Task-driven Embodied Agents that Chat

    Authors: Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, Dilek Hakkani-Tur

    Abstract: Robots operating in human spaces must be able to engage in natural language interaction with people, both understanding and executing instructions, and using conversation to resolve ambiguity and recover from mistakes. To study this, we introduce TEACh, a dataset of over 3,000 human-human, interactive dialogues to complete household tasks in simulation. A Commander with access to oracle informati…

    Submitted 28 December, 2021; v1 submitted 1 October, 2021; originally announced October 2021.

    Comments: Accepted at AAAI 2022; 7 pages main, 28 pages total, 29 figures; Version 3 uses a new test set for EDH instances that restrict evaluation to state changes only on task-relevant objects

  34. arXiv:2108.04927  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion

    Authors: Alessandro Suglia, Qiaozi Gao, Jesse Thomason, Govind Thattai, Gaurav Sukhatme

    Abstract: Language-guided robots performing home and office tasks must navigate in and interact with the world. Grounding language instructions against visual observations and actions to take in an environment is an open challenge. We present Embodied BERT (EmBERT), a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task…

    Submitted 4 November, 2021; v1 submitted 10 August, 2021; originally announced August 2021.

    Comments: Accepted at Novel Ideas in Learning-to-Learn through Interaction (NILLI) workshop @ EMNLP 2021

  35. arXiv:2107.12514  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG cs.RO

    Language Grounding with 3D Objects

    Authors: Jesse Thomason, Mohit Shridhar, Yonatan Bisk, Chris Paxton, Luke Zettlemoyer

    Abstract: Seemingly simple natural language requests to a robot are generally underspecified, for example "Can you bring me the wireless mouse?" Flat images of candidate mice may not provide the discriminative information needed for "wireless." The world, and objects in it, are not flat images but complex 3D shapes. If a human requests an object based on any of its basic properties, such as color, shape, or…

    Submitted 15 September, 2021; v1 submitted 26 July, 2021; originally announced July 2021.

    Comments: Conference on Robot Learning (CoRL) 2021

  36. arXiv:2010.12639  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV

    The RobotSlang Benchmark: Dialog-guided Robot Localization and Navigation

    Authors: Shurjo Banerjee, Jesse Thomason, Jason J. Corso

    Abstract: Autonomous robot systems for applications from search and rescue to assistive guidance should be able to engage in natural language dialog with people. To study such cooperative communication, we introduce Robot Simultaneous Localization and Mapping with Natural Language (RobotSlang), a benchmark of 169 natural language dialogs between a human Driver controlling a robot and a human Commander provi…

    Submitted 23 October, 2020; originally announced October 2020.

    Comments: Conference on Robot Learning 2020

  37. arXiv:2005.00728  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG cs.RO

    RMM: A Recursive Mental Model for Dialog Navigation

    Authors: Homero Roman Roman, Yonatan Bisk, Jesse Thomason, Asli Celikyilmaz, Jianfeng Gao

    Abstract: Language-guided robots must be able to both ask humans questions and understand answers. Much existing work focuses only on the latter. In this paper, we go beyond instruction following and introduce a two-agent task where one agent navigates and asks questions that a second, guiding agent answers. Inspired by theory of mind, we propose the Recursive Mental Model (RMM). The navigating agent models…

    Submitted 5 October, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: Findings of Empirical Methods in Natural Language Processing (EMNLP Findings), 2020

  38. arXiv:2004.10151  [pdf, other]

    cs.CL cs.AI cs.LG

    Experience Grounds Language

    Authors: Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, Joseph Turian

    Abstract: Language understanding research is held back by a failure to relate language to the physical world it describes and to the social interactions it facilitates. Despite the incredible effectiveness of language processing models to tackle tasks after being trained on text alone, successful linguistic communication relies on a shared experience of the world. It is this shared experience that makes utt…

    Submitted 1 November, 2020; v1 submitted 21 April, 2020; originally announced April 2020.

    Comments: Empirical Methods in Natural Language Processing (EMNLP), 2020

  39. arXiv:1912.01734  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.RO

    ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

    Authors: Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, Dieter Fox

    Abstract: We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demons…

    Submitted 30 March, 2020; v1 submitted 3 December, 2019; originally announced December 2019.

    Comments: Computer Vision and Pattern Recognition (CVPR) 2020 ; https://askforalfred.com/

  40. arXiv:1907.04957  [pdf, other]

    cs.CL cs.AI cs.CV cs.RO

    Vision-and-Dialog Navigation

    Authors: Jesse Thomason, Michael Murray, Maya Cakmak, Luke Zettlemoyer

    Abstract: Robots navigating in human environments should use language to ask for assistance and be able to understand human responses. To study this challenge, we introduce Cooperative Vision-and-Dialog Navigation, a dataset of over 2k embodied, human-human dialogs situated in simulated, photorealistic home environments. The Navigator asks questions to their partner, the Oracle, who has privileged access to…

    Submitted 12 October, 2019; v1 submitted 10 July, 2019; originally announced July 2019.

    Comments: Conference on Robot Learning (CoRL) 2019

  41. arXiv:1907.03390  [pdf, other]

    cs.RO cs.HC

    Augmenting Knowledge through Statistical, Goal-oriented Human-Robot Dialog

    Authors: Saeid Amiri, Sujay Bajracharya, Cihangir Goktolga, Jesse Thomason, Shiqi Zhang

    Abstract: Some robots can interact with humans using natural language, and identify service requests through human-robot dialog. However, few robots are able to improve their language capabilities from this experience. In this paper, we develop a dialog agent for robots that is able to interpret user commands using a semantic parser, while asking clarification questions using a probabilistic dialog manager.…

    Submitted 12 November, 2019; v1 submitted 7 July, 2019; originally announced July 2019.

    Comments: In proceedings of International Conference on Intelligent Robots and Systems (IROS) 2019

  42. arXiv:1904.01650  [pdf, other]

    cs.RO cs.AI cs.CL

    Improving Robot Success Detection using Static Object Data

    Authors: Rosario Scalise, Jesse Thomason, Yonatan Bisk, Siddhartha Srinivasa

    Abstract: We use static object data to improve success detection for stacking objects on and nesting objects in one another. Such actions are necessary for certain robotics tasks, e.g., clearing a dining table or packing a warehouse bin. However, using an RGB-D camera to detect success can be insufficient: same-colored objects can be difficult to differentiate, and reflective silverware cause noisy depth ca…

    Submitted 31 July, 2019; v1 submitted 2 April, 2019; originally announced April 2019.

    Comments: IROS 2019 + Appendix

  43. Interpreting Black Box Models via Hypothesis Testing

    Authors: Collin Burns, Jesse Thomason, Wesley Tansey

    Abstract: In science and medicine, model interpretations may be reported as discoveries of natural phenomena or used to guide patient treatments. In such high-stakes tasks, false discoveries may lead investigators astray. These applications would therefore benefit from control over the finite-sample error rate of interpretations. We reframe black box model interpretability as a multiple hypothesis testing p…

    Submitted 17 August, 2020; v1 submitted 29 March, 2019; originally announced April 2019.

    Comments: FODS 2020

  44. arXiv:1903.08309  [pdf, other]

    cs.AI cs.CL cs.LG cs.RO

    Prospection: Interpretable Plans From Language By Predicting the Future

    Authors: Chris Paxton, Yonatan Bisk, Jesse Thomason, Arunkumar Byravan, Dieter Fox

    Abstract: High-level human instructions often correspond to behaviors with multiple implicit steps. In order for robots to be useful in the real world, they must be able to reason over both motions and intermediate goals implied by human instructions. In this work, we propose a framework for learning representations that convert from a natural-language command to a sequence of intermediate goals for exec…

    Submitted 19 March, 2019; originally announced March 2019.

    Comments: Accepted to ICRA 2019; extended version with appendix containing additional results

  45. Improving Grounded Natural Language Understanding through Human-Robot Dialog

    Authors: Jesse Thomason, Aishwarya Padmakumar, Jivko Sinapov, Nick Walker, Yuqian Jiang, Harel Yedidsion, Justin Hart, Peter Stone, Raymond J. Mooney

    Abstract: Natural language understanding for robotics can require substantial domain- and platform-specific engineering. For example, for mobile robots to pick-and-place objects in an environment to satisfy human commands, we can specify the language humans use to issue such commands, and connect concept words like red can to physical object properties. One way to alleviate this engineering for a new domain…

    Submitted 28 February, 2019; originally announced March 2019.

  46. arXiv:1811.00613  [pdf, other]

    cs.CL

    Shifting the Baseline: Single Modality Performance on Visual Navigation & QA

    Authors: Jesse Thomason, Daniel Gordon, Yonatan Bisk

    Abstract: We demonstrate the surprising strength of unimodal baselines in multimodal domains, and make concrete recommendations for best practices in future research. Where existing work often compares against random or majority class baselines, we argue that unimodal approaches better capture and reflect dataset biases and therefore provide an important comparison when assessing the performance of multimod…

    Submitted 11 March, 2019; v1 submitted 1 November, 2018; originally announced November 2018.

    Comments: Published at The Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) 2019

  47. arXiv:1810.02919  [pdf, other]

    cs.RO

    Interaction and Autonomy in RoboCup@Home and Building-Wide Intelligence

    Authors: Justin Hart, Harel Yedidsion, Yuqian Jiang, Nick Walker, Rishi Shah, Jesse Thomason, Aishwarya Padmakumar, Rolando Fernandez, Jivko Sinapov, Raymond Mooney, Peter Stone

    Abstract: Efforts are underway at UT Austin to build autonomous robot systems that address the challenges of long-term deployments in office environments and of the more prescribed domestic service tasks of the RoboCup@Home competition. We discuss the contrasts and synergies of these efforts, highlighting how our work to build a RoboCup@Home Domestic Standard Platform League entry led us to identify an inte…

    Submitted 5 October, 2018; originally announced October 2018.

    Comments: Presented at AI-HRI AAAI-FSS, 2018 (arXiv:1809.06606)

    Report number: AI-HRI/2018/10