Search | arXiv e-print repository

A Surprising Failure? Multimodal LLMs and the NLVR Challenge

Authors: Anne Wu, Kianté Brantley, Yoav Artzi

Abstract: This study evaluates three state-of-the-art MLLMs -- GPT-4V, Gemini Pro, and the open-source model IDEFICS -- on the compositional natural language vision reasoning task NLVR. Given a human-written sentence paired with a synthetic image, this task requires the model to determine the truth value of the sentence with respect to the image. Despite the strong performance demonstrated by these models, we observe they perform poorly on NLVR, which was constructed to require compositional and spatial reasoning, and to be robust for semantic and systematic biases.

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2310.03720 [pdf, other]

SteP: Stacked LLM Policies for Web Actions

Authors: Paloma Sodhi, S. R. K. Branavan, Yoav Artzi, Ryan McDonald

Abstract: Performing tasks on the web presents fundamental challenges to large language models (LLMs), including combinatorially large open-world tasks and variations across web interfaces. Simply specifying a large prompt to handle all possible behaviors and states is extremely complex, and results in behavior leaks between unrelated behaviors. Decomposition to distinct policies can address this challenge, but requires carefully handing off control between policies. We propose Stacked LLM Policies for Web Actions (SteP), an approach to dynamically compose policies to solve a diverse set of web tasks. SteP defines a Markov Decision Process where the state is a stack of policies representing the control state, i.e., the chain of policy calls. Unlike traditional methods that are restricted to static hierarchies, SteP enables dynamic control that adapts to the complexity of the task. We evaluate SteP against multiple baselines and web environments including WebArena, MiniWoB++, and a CRM simulator. On WebArena, SteP improves (14.9% to 35.8%) over SOTA that use GPT-4 policies, while on MiniWoB++, SteP is competitive with prior works while using significantly less data. Our code and data are available at https://asappresearch.github.io/webagents-step.

Submitted 22 April, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

Comments: 30 pages, 15 figures
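
The stack-of-policies control state described in the SteP abstract can be illustrated with a toy sketch. This is not the authors' implementation: the policy names, the `Decision` tuple format, and the rule that each decision consumes one observation are all assumptions made for illustration, and plain Python functions stand in for LLM policies.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical decision format (an assumption, not from the paper):
#   ("act", action)   -> emit a low-level action
#   ("call", policy)  -> push another policy onto the stack
#   ("return", "")    -> pop the current policy, handing control to its caller
Decision = Tuple[str, str]
Policy = Callable[[str], Decision]


def run_policy_stack(
    policies: Dict[str, Policy], root: str, observations: List[str]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Run policies with a stack as the control state; return (actions, final stack)."""
    stack = [root]  # the chain of policy calls
    actions: List[Tuple[str, str]] = []
    for obs in observations:
        if not stack:  # the root policy has returned, so the task is over
            break
        current = stack[-1]  # only the top-of-stack policy decides
        kind, arg = policies[current](obs)
        if kind == "call":
            stack.append(arg)
        elif kind == "return":
            stack.pop()
        else:
            actions.append((current, arg))
    return actions, stack


# Toy stand-ins for LLM policies (hypothetical names and rules).
def task_policy(obs: str) -> Decision:
    return ("call", "search") if obs == "start" else ("return", "")


def search_policy(obs: str) -> Decision:
    return ("act", "type query") if obs == "form" else ("return", "")


POLICIES: Dict[str, Policy] = {"task": task_policy, "search": search_policy}
```

Calling `run_policy_stack(POLICIES, "task", ["start", "form", "results", "done"])` pushes `search` on the first observation, lets it act on the second, and pops both policies as they return, so the hierarchy is built at run time rather than fixed in advance.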

arXiv:2309.02691 [pdf, other]

A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models

Authors: Noriyuki Kojima, Hadar Averbuch-Elor, Yoav Artzi

Abstract: Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even if it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, and propose three benchmarks to study the relation between the two. Our results show that contemporary models demonstrate inconsistency between their ability to ground phrases and solve tasks. We show how this can be addressed through brute-force training on ground phrasing annotations, and analyze the dynamics it creates. Code is available at https://github.com/lil-lab/phrase_grounding.

Submitted 30 May, 2024; v1 submitted 5 September, 2023; originally announced September 2023.

Comments: This was published in TMLR in 2024, on January 24th

arXiv:2307.10323 [pdf, other]

IncDSI: Incrementally Updatable Document Retrieval

Authors: Varsha Kishore, Chao Wan, Justin Lovelace, Yoav Artzi, Kilian Q. Weinberger

Abstract: Differentiable Search Index is a recently proposed paradigm for document retrieval, that encodes information about a corpus of documents within the parameters of a neural network and directly maps queries to corresponding documents. These models have achieved state-of-the-art performances for document retrieval across many benchmarks. These kinds of models have a significant limitation: it is not easy to add new documents after a model is trained. We propose IncDSI, a method to add documents in real time (about 20-50ms per document), without retraining the model on the entire dataset (or even parts thereof). Instead we formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Although orders of magnitude faster, our approach is competitive with re-training the model on the whole dataset and enables the development of document retrieval systems that can be updated with new information in real-time. Our code for IncDSI is available at https://github.com/varshakishore/IncDSI.

Submitted 19 July, 2023; originally announced July 2023.

arXiv:2305.12473 [pdf, other]

Continually Improving Extractive QA via Human Feedback

Authors: Ge Gao, Hung-Ting Chen, Yoav Artzi, Eunsol Choi

Abstract: We study continually improving an extractive question answering (QA) system via human user feedback. We design and deploy an iterative approach, where information-seeking users ask questions, receive model-predicted answers, and provide feedback. We conduct experiments involving thousands of user interactions under diverse setups to broaden the understanding of learning from feedback over time. Our experiments show effective improvement from user feedback of extractive QA models over time across different data regimes, including significant potential for domain adaptation.

Submitted 3 November, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

Comments: EMNLP 2023

arXiv:2305.06539 [pdf, other]

Semantic uncertainty guides the extension of conventions to new referents

Authors: Ron Eliav, Anya Ji, Yoav Artzi, Robert D. Hawkins

Abstract: A long tradition of studies in psycholinguistics has examined the formation and generalization of ad hoc conventions in reference games, showing how newly acquired conventions for a given target transfer to new referential contexts. However, another axis of generalization remains understudied: how do conventions formed for one target transfer to completely distinct targets, when specific lexical choices are unlikely to repeat? This paper presents two dyadic studies (N = 240) that address this axis of generalization, focusing on the role of nameability -- the a priori likelihood that two individuals will share the same label. We leverage the recently-released KiloGram dataset, a collection of abstract tangram images that is orders of magnitude larger than previously available, exhibiting high diversity of properties like nameability. Our first study asks how nameability shapes convention formation, while the second asks how new conventions generalize to entirely new targets of reference. Our results raise new questions about how ad hoc conventions extend beyond target-specific re-use of specific lexical choices.

Submitted 10 May, 2023; originally announced May 2023.

Comments: Proceedings of the 45th Annual Conference of the Cognitive Science Society

arXiv:2303.08127 [pdf, other]

CB2: Collaborative Natural Language Interaction Research Platform

Authors: Jacob Sharf, Mustafa Omer Gul, Yoav Artzi

Abstract: CB2 is a multi-agent platform to study collaborative natural language interaction in a grounded task-oriented scenario. It includes a 3D game environment, a backend server designed to serve trained models to human agents, and various tools and processes to enable scalable studies. We deploy CB2 at https://cb2.ai as a system demonstration with a learned instruction following model.

Submitted 29 May, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

Comments: ACL 2023 Demo paper

arXiv:2212.09710 [pdf, other]

Continual Learning for Instruction Following from Realtime Feedback

Authors: Alane Suhr, Yoav Artzi

Abstract: We propose and deploy an approach to continually train an instruction-following agent from feedback provided by users during collaborative interactions. During interaction, human users instruct an agent using natural language, and provide realtime binary feedback as they observe the agent following their instructions. We design a contextual bandit learning approach, converting user feedback to immediate reward. We evaluate through thousands of human-agent interactions, demonstrating 15.4% absolute improvement in instruction execution accuracy over time. We also show our approach is robust to several design variations, and that the feedback signal is roughly equivalent to the learning signal of supervised demonstration data.

Submitted 5 December, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

Comments: NeurIPS 2023 Spotlight paper
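
The idea of converting binary user feedback to an immediate reward for contextual bandit learning can be sketched as follows. This is a generic policy-gradient bandit update on a linear softmax policy, chosen here for illustration; the abstract does not specify the model, the reward values, or the update rule, so the +1/-1 mapping, the learning rate, and all function names are assumptions.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def action_probs(weights: List[List[float]], context: List[float]) -> List[float]:
    """Linear softmax policy: one weight row per action."""
    scores = [sum(w * x for w, x in zip(row, context)) for row in weights]
    return softmax(scores)


def feedback_to_reward(positive: bool) -> float:
    """Assumed mapping: positive feedback -> +1, negative feedback -> -1."""
    return 1.0 if positive else -1.0


def bandit_update(
    weights: List[List[float]],
    context: List[float],
    action: int,
    positive: bool,
    lr: float = 0.5,
) -> List[List[float]]:
    """One gradient step on reward * log pi(action | context)."""
    probs = action_probs(weights, context)
    reward = feedback_to_reward(positive)
    updated = []
    for a, row in enumerate(weights):
        # d log pi(action) / d score_a = 1[a == action] - pi(a)
        coeff = (1.0 if a == action else 0.0) - probs[a]
        updated.append([w + lr * reward * coeff * x for w, x in zip(row, context)])
    return updated
```

Starting from all-zero weights, a single positive signal on an action raises that action's probability in the same context, and a negative signal lowers it; no demonstration of the correct action is needed, only the feedback on the action actually taken.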

arXiv:2211.16492 [pdf, other]

Abstract Visual Reasoning with Tangram Shapes

Authors: Anya Ji, Noriyuki Kojima, Noah Rush, Alane Suhr, Wai Keen Vong, Robert D. Hawkins, Yoav Artzi

Abstract: We introduce KiloGram, a resource for studying abstract visual reasoning in humans and machines. Drawing on the history of tangram puzzles as stimuli in cognitive science, we build a richly annotated dataset that, with >1k distinct stimuli, is orders of magnitude larger and more diverse than prior resources. It is both visually and linguistically richer, moving beyond whole shape descriptions to include segmentation maps and part labels. We use this resource to evaluate the abstract visual reasoning capacities of recent multi-modal models. We observe that pre-trained weights demonstrate limited abstract reasoning, which dramatically improves with fine-tuning. We also observe that explicitly describing parts aids abstract reasoning for both humans and models, especially when jointly encoding the linguistic and visual inputs. KiloGram is available at https://lil.nlp.cornell.edu/kilogram.

Submitted 29 November, 2022; originally announced November 2022.

Comments: EMNLP 2022 long paper

arXiv:2211.01994 [pdf, other]

lilGym: Natural Language Visual Reasoning with Reinforcement Learning

Authors: Anne Wu, Kianté Brantley, Noriyuki Kojima, Yoav Artzi

Abstract: We present lilGym, a new benchmark for language-conditioned reinforcement learning in visual environments. lilGym is based on 2,661 highly-compositional human-written natural language statements grounded in an interactive visual environment. We introduce a new approach for exact reward computation in every possible world state by annotating all statements with executable Python programs. Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty. We experiment with lilGym with different models and learning regimes. Our results and analysis show that while existing methods are able to achieve non-trivial performance, lilGym forms a challenging open problem. lilGym is available at https://lil.nlp.cornell.edu/lilgym/.

Submitted 29 May, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

Comments: ACL 2023 Long Paper

arXiv:2205.01086 [pdf, other]

Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages

Authors: Felix Wu, Kwangyoun Kim, Shinji Watanabe, Kyu Han, Ryan McDonald, Kilian Q. Weinberger, Yoav Artzi

Abstract: We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task -- transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. Finally, on ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows comparable performance to highly optimized recent methods.

Submitted 2 May, 2022; originally announced May 2022.

Comments: Code available at https://github.com/asappresearch/wav2seq

arXiv:2203.10079 [pdf, other]

Simulating Bandit Learning from User Feedback for Extractive Question Answering

Authors: Ge Gao, Eunsol Choi, Yoav Artzi

Abstract: We study learning from user feedback for extractive question answering by simulating feedback using supervised data. We cast the problem as contextual bandit learning, and analyze the characteristics of several learning scenarios with focus on reducing data annotation. We show that systems initially trained on a small number of examples can dramatically improve given feedback from users on model-predicted answers, and that one can use existing datasets to deploy systems in new domains without any annotation, but instead improving the system on-the-fly via user feedback.

Submitted 18 March, 2022; originally announced March 2022.

Comments: ACL 2022

arXiv:2111.10367 [pdf, other]

SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech

Authors: Suwon Shon, Ankita Pasad, Felix Wu, Pablo Brusco, Yoav Artzi, Karen Livescu, Kyu J. Han

Abstract: Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in higher-level spoken language understanding tasks, including using end-to-end models, but there are fewer annotated datasets for such tasks. At the same time, recent work shows the possibility of pre-training generic representations and then fine-tuning for several tasks using relatively little labeled data. We propose to create a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE) consisting of limited-size labeled training sets and corresponding evaluation sets. This resource would allow the research community to track progress, evaluate pre-trained representations for higher-level tasks, and study open questions such as the utility of pipeline versus end-to-end approaches. We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets. We focus on naturally produced (not read or synthesized) speech, and freely available datasets. We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.

Submitted 29 July, 2022; v1 submitted 19 November, 2021; originally announced November 2021.

Comments: Updated preprint for SLUE Benchmark v0.2; Toolkit link https://github.com/asappresearch/slue-toolkit

arXiv:2109.13449 [pdf, other]

When in Doubt: Improving Classification Performance with Alternating Normalization

Authors: Menglin Jia, Austin Reiter, Ser-Nam Lim, Yoav Artzi, Claire Cardie

Abstract: We introduce Classification with Alternating Normalization (CAN), a non-parametric post-processing step for classification. CAN improves classification accuracy for challenging examples by re-adjusting their predicted class probability distribution using the predicted class distributions of high-confidence validation examples. CAN is easily applicable to any probabilistic classifier, with minimal computation overhead. We analyze the properties of CAN using simulated experiments, and empirically demonstrate its effectiveness across a diverse set of classification tasks.

Submitted 27 September, 2021; originally announced September 2021.

Comments: Findings of EMNLP 2021

arXiv:2109.06870 [pdf, other]

Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

Authors: Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi

Abstract: This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.

Submitted 14 September, 2021; originally announced September 2021.

Comments: Code available at https://github.com/asappresearch/sew

arXiv:2109.04452 [pdf, other]

Analysis of Language Change in Collaborative Instruction Following

Authors: Anna Effenberger, Eva Yan, Rhia Singh, Alane Suhr, Yoav Artzi

Abstract: We analyze language change over time in a collaborative, goal-oriented instructional task, where utility-maximizing participants form conventions and increase their expertise. Prior work studied such scenarios mostly in the context of reference games, and consistently found that language complexity is reduced along multiple dimensions, such as utterance length, as conventions are formed. In contrast, we find that, given the ability to increase instruction utility, instructors increase language complexity along these previously studied dimensions to better collaborate with increasingly skilled instruction followers.

Submitted 9 September, 2021; originally announced September 2021.

Comments: Findings of EMNLP 2021 Short Paper

arXiv:2108.07253 [pdf, other]

Who's Waldo? Linking People Across Text and Images

Authors: Claire Yuqing Cui, Apoorv Khandelwal, Yoav Artzi, Noah Snavely, Hadar Averbuch-Elor

Abstract: We present a task and benchmark dataset for person-centric visual grounding, the problem of linking between people named in a caption and people pictured in an image. In contrast to prior work in visual grounding, which is predominantly object-based, our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues (such as rich interactions between multiple people), rather than learning associations between names and appearances. To facilitate this task, we introduce a new dataset, Who's Waldo, mined automatically from image-caption data on Wikimedia Commons. We propose a Transformer-based method that outperforms several strong baselines on this task, and are releasing our data to the research community to spur work on contextual models that consider both vision and language.

Submitted 17 August, 2021; v1 submitted 16 August, 2021; originally announced August 2021.

Comments: Published in ICCV 2021 (Oral). Project webpage: https://whoswaldo.github.io

arXiv:2108.04812 [pdf, other]

Continual Learning for Grounded Instruction Generation by Observing Human Following Behavior

Authors: Noriyuki Kojima, Alane Suhr, Yoav Artzi

Abstract: We study continual learning for natural language instruction generation, by observing human users' instruction execution. We focus on a collaborative scenario, where the system both acts and delegates tasks to human users using natural language. We compare user execution of generated instructions to the original system intent as an indication to the system's success communicating its intent. We show how to use this signal to improve the system's ability to generate instructions via contextual bandit learning. In interaction with real users, our system demonstrates dramatic improvements in its ability to generate language over time.

Submitted 10 August, 2021; originally announced August 2021.

Comments: To appear in TACL 2021. The arXiv version is a pre-MIT Press publication version

arXiv:2107.05612 [pdf, other]

A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution

Authors: Valts Blukis, Chris Paxton, Dieter Fox, Animesh Garg, Yoav Artzi

Abstract: Natural language provides an accessible and expressive interface to specify long-term tasks for robotic agents. However, non-experts are likely to specify such tasks with high-level instructions, which abstract over specific robot actions through several layers of abstraction. We propose that key to bridging this gap between language and robot actions over long execution horizons are persistent representations. We propose a persistent spatial semantic representation method, and show how it enables building an agent that performs hierarchical reasoning to effectively execute long-term tasks. We evaluate our approach on the ALFRED benchmark and achieve state-of-the-art results, despite completely avoiding the commonly used step-by-step instructions.

Submitted 28 November, 2021; v1 submitted 12 July, 2021; originally announced July 2021.

Comments: Presented at CoRL 2021

arXiv:2106.04163 [pdf]

doi 10.1016/j.jmr.2021.107102

Superconducting microresonators for electron spin resonance, the good, the bad, and the future

Authors: Yaron Artzi, Yakir Yishay, Marco Fanciulli, Moamen Jbara, Aharon Blank

Abstract: The field of electron spin resonance is in constant need to improve its capabilities. Among other things, this means having better resonators which would provide improved spin sensitivity, as well as enable larger microwave magnetic field power conversion factors. Surface micro resonators, made of small metallic patches on a dielectric substrate, provide very good absolute spin sensitivity and high conversion factors due to their very small mode volume. However, such resonators suffer from having a relatively low quality factor, which offsets some of their significant potential advantages. The use of superconducting patches to replace the metallic layer seems like a reasonable and straightforward solution to the quality factor issue, at least for measurements carried out at cryogenic temperatures. Nevertheless, superconducting materials are not easily incorporated into setups requiring high magnetic fields, due to electric current vortices generated in the latter's surface. This makes the transition from normal conducting materials to superconductors highly nontrivial. Here we present the design, fabrication, and testing results of surface micro resonators made of yttrium barium copper oxide (YBCO) superconducting material. We show that with a unique experimental setup, these resonators can be made to operate well even at high fields of about 1.2 T. Furthermore, we analyze the effect of current vortices on the ESR signal and the spins' coherence times. Finally, we provide a head to head comparison of YBCO vs copper resonators of the same dimensions, which clearly shows their pros and cons and directs us to future potential developments and improvements in this field.

Submitted 27 August, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

arXiv:2011.07384 [pdf, other]

Few-shot Object Grounding and Mapping for Natural Language Robot Instruction Following

Authors: Valts Blukis, Ross A. Knepper, Yoav Artzi

Abstract: We study the problem of learning a robot policy to follow natural language instructions that can be easily extended to reason about new objects. We introduce a few-shot language-conditioned object grounding method trained from augmented reality data that uses exemplars to identify objects and align them to their mentions in instructions. We present a learned map representation that encodes object locations and their instructed use, and construct it from our few-shot grounding output. We integrate this mapping approach into an instruction-following policy, thereby allowing it to reason about previously unseen objects at test-time by simply adding exemplars. We evaluate on the task of learning to map raw observations and instructions to continuous control of a physical quadcopter. Our approach significantly outperforms the prior state of the art in the presence of new objects, even when the prior approach observes all objects during training.

Submitted 14 November, 2020; originally announced November 2020.

Comments: 4th Conference on Robot Learning (CoRL 2020), Cambridge MA, USA

arXiv:2006.05987 [pdf, other]

Revisiting Few-sample BERT Fine-tuning

Authors: Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, Yoav Artzi

Abstract: This paper is a study of fine-tuning of BERT contextual representations, with focus on commonly observed instabilities in few-sample scenarios. We identify several factors that cause this instability: the common use of a non-standard optimization method with biased gradient estimation; the limited applicability of significant parts of the BERT network for down-stream tasks; and the prevalent practice of using a pre-determined, and small number of training iterations. We empirically test the impact of these factors, and identify alternative practices that resolve the commonly observed instability of the process. In light of these observations, we re-visit recently proposed methods to improve few-sample fine-tuning with BERT and re-evaluate their effectiveness. Generally, we observe the impact of these methods diminishes significantly with our modified process.

Submitted 11 March, 2021; v1 submitted 10 June, 2020; originally announced June 2020.

Comments: Code available at https://github.com/asappresearch/revisit-bert-finetuning

arXiv:2005.01678 [pdf, other]

What is Learned in Visually Grounded Neural Syntax Acquisition

Authors: Noriyuki Kojima, Hadar Averbuch-Elor, Alexander M. Rush, Yoav Artzi

Abstract: Visual features are a promising signal for learning bootstrap textual models. However, blackbox learning models make it difficult to isolate the specific contribution of visual components. In this analysis, we consider the case study of the Visually Grounded Neural Syntax Learner (Shi et al., 2019), a recent approach for learning syntax from a visual training signal. By constructing simplified versions of the model, we isolate the core factors that yield the model's strong performance. Contrary to what the model might be capable of learning, we find significantly less expressive versions produce similar predictions and perform just as well, or even better. We also find that a simple lexical signal of noun concreteness plays the main role in the model's predictions as opposed to more complex syntactic reasoning.

Submitted 18 May, 2020; v1 submitted 4 May, 2020; originally announced May 2020.

Comments: In ACL 2020

arXiv:2004.02709 [pdf, other]

Evaluating Models' Local Decision Boundaries via Contrast Sets

Authors: Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, et al. (1 additional author not shown)

Abstract: Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets, by up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.

Submitted 1 October, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

arXiv:2001.03671 [pdf, other]

Retouchdown: Adding Touchdown to StreetLearn as a Shareable Resource for Language Grounding Tasks in Street View

Authors: Harsh Mehta, Yoav Artzi, Jason Baldridge, Eugene Ie, Piotr Mirowski

Abstract: The Touchdown dataset (Chen et al., 2019) provides instructions by human annotators for navigation through New York City streets and for resolving spatial descriptions at a given location. To enable the wider research community to work effectively with the Touchdown tasks, we are publicly releasing the 29k raw Street View panoramas needed for Touchdown. We follow the process used for the StreetLearn data release (Mirowski et al., 2019) to check panoramas for personally identifiable information and blur them as necessary. These have been added to the StreetLearn dataset and can be obtained via the same process as used previously for StreetLearn. We also provide a reference implementation for both of the Touchdown tasks: vision and language navigation (VLN) and spatial description resolution (SDR). We compare our model results to those given in Chen et al. (2019) and show that the panoramas we have added to StreetLearn fully support both Touchdown tasks and can be used effectively for further research and comparison.

Submitted 10 January, 2020; originally announced January 2020.

arXiv:1911.03598 [pdf, other]

Interactive Classification by Asking Informative Questions

Authors: Lili Yu, Howard Chen, Sida Wang, Tao Lei, Yoav Artzi

Abstract: We study the potential for interaction in natural language classification. We add a limited form of interaction for intent classification, where users provide an initial query using natural language, and the system asks for additional information using binary or multi-choice questions. At each turn, our system decides between asking the most informative question or making the final classification prediction. The simplicity of the model allows for bootstrapping of the system without interaction data, instead relying on simple crowdsourcing tasks. We evaluate our approach on two domains, showing the benefit of interaction and the advantage of learning to balance between asking additional questions and making the final prediction.

Submitted 3 May, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

Comments: Accepted at ACL 2020

arXiv:1910.09664 [pdf, other]

Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight

Authors: Valts Blukis, Yannick Terme, Eyvind Niklasson, Ross A. Knepper, Yoav Artzi

Abstract: We propose a joint simulation and real-world learning framework for mapping navigation instructions and raw first-person observations to continuous control. Our model estimates the need for environment exploration, predicts the likelihood of visiting environment positions during execution, and controls the agent to both explore and visit high-likelihood positions. We introduce Supervised Reinforcement Asynchronous Learning (SuReAL). Learning uses both simulation and real environments without requiring autonomous flight in the physical environment during training, and combines supervised learning for predicting positions to visit and reinforcement learning for continuous control. We evaluate our approach on a natural language instruction-following task with a physical quadcopter, and demonstrate effective execution and exploration behavior.

Submitted 21 October, 2019; originally announced October 2019.

Comments: Conference on Robot Learning (CoRL) 2019

arXiv:1910.03655 [pdf, other]

Executing Instructions in Situated Collaborative Interactions

Authors: Alane Suhr, Claudia Yan, Charlotte Schluger, Stanley Yu, Hadi Khader, Marwa Mouallem, Iris Zhang, Yoav Artzi

Abstract: We study a collaborative scenario where a user not only instructs a system to complete tasks, but also acts alongside it. This allows the user to adapt to the system abilities by changing their language or deciding to simply accomplish some tasks themselves, and requires the system to effectively recover from errors as the user strategically assigns it new goals. We build a game environment to study this scenario, and learn to map user instructions to system actions. We introduce a learning approach focused on recovery from cascading errors between instructions, and modeling methods to explicitly reason about instructions with multiple goals. We evaluate with a new evaluation protocol using recorded interactions and online games with human users, and observe how users adapt to the system abilities.

Submitted 22 November, 2022; v1 submitted 8 October, 2019; originally announced October 2019.

Comments: EMNLP 2019 long paper

arXiv:1909.10411 [pdf, other]

NLVR2 Visual Bias Analysis

Authors: Alane Suhr, Yoav Artzi

Abstract: NLVR2 (Suhr et al., 2019) was designed to be robust for language bias through a data collection process that resulted in each natural language sentence appearing with both true and false labels. The process did not provide a similar measure of control for visual bias. This technical report analyzes the potential for visual bias in NLVR2. We show that some amount of visual bias likely exists. Finally, we identify a subset of the test data that allows testing model performance in a way that is robust to such potential biases. We show that the performance of existing models (Li et al., 2019; Tan and Bansal 2019) is relatively robust to this potential bias. We propose to add the evaluation on this subset of the data to the NLVR2 evaluation protocol, and update the official release to include it. A notebook including an implementation of the code used to replicate this analysis is available at http://nlvr.ai/NLVR2BiasAnalysis.html.

Submitted 23 September, 2019; originally announced September 2019.

Comments: Corresponding notebook available at http://lil.nlp.cornell.edu/nlvr/NLVR2BiasAnalysis.html

arXiv:1904.09675 [pdf, other]

BERTScore: Evaluating Text Generation with BERT

Authors: Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi

Abstract: We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.

Submitted 24 February, 2020; v1 submitted 21 April, 2019; originally announced April 2019.

Comments: Code available at https://github.com/Tiiiger/bert_score; To appear in ICLR2020

arXiv:1811.12354 [pdf, other]

Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments

Authors: Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, Yoav Artzi

Abstract: We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent must first follow navigation instructions in a real-life visual urban environment, and then identify a location described in natural language to find a hidden object at the goal position. The data contains 9,326 examples of English instructions and spatial descriptions paired with demonstrations. Empirical analysis shows the data presents an open challenge to existing methods, and qualitative linguistic analysis shows that the data displays richer use of spatial reasoning compared to related resources.

Submitted 16 May, 2020; v1 submitted 29 November, 2018; originally announced November 2018.

Comments: arXiv admin note: text overlap with arXiv:1809.00786

Journal ref: Published in CVPR 2019

arXiv:1811.08824 [pdf, other]

Early Fusion for Goal Directed Robotic Vision

Authors: Aaron Walsman, Yonatan Bisk, Saadia Gabriel, Dipendra Misra, Yoav Artzi, Yejin Choi, Dieter Fox

Abstract: Building perceptual systems for robotics which perform well under tight computational budgets requires novel architectures which rethink the traditional computer vision pipeline. Modern vision architectures require the agent to build a summary representation of the entire scene, even if most of the input is irrelevant to the agent's current goal. In this work, we flip this paradigm, by introducing EarlyFusion vision models that condition on a goal to build custom representations for downstream tasks. We show that these goal specific representations can be learned more quickly, are substantially more parameter efficient, and more robust than existing attention mechanisms in our domain. We demonstrate the effectiveness of these methods on a simulated robotic item retrieval problem that is trained in a fully end-to-end manner via imitation learning.

Submitted 7 August, 2019; v1 submitted 21 November, 2018; originally announced November 2018.

arXiv:1811.04179 [pdf, other]

Mapping Navigation Instructions to Continuous Control Actions with Position-Visitation Prediction

Authors: Valts Blukis, Dipendra Misra, Ross A. Knepper, Yoav Artzi

Abstract: We propose an approach for mapping natural language instructions and raw observations to continuous control of a quadcopter drone. Our model predicts interpretable position-visitation distributions indicating where the agent should go during execution and where it should stop, and uses the predicted distributions to select the actions to execute. This two-step model decomposition allows for simple and efficient training using a combination of supervised learning and imitation learning. We evaluate our approach with a realistic drone simulator, and demonstrate absolute task-completion accuracy improvements of 16.85% over two state-of-the-art instruction-following methods.

Submitted 10 December, 2018; v1 submitted 9 November, 2018; originally announced November 2018.

Comments: Appeared in Conference on Robot Learning 2018

Journal ref: In Conference on Robot Learning (pp. 505-518) (2018)

arXiv:1811.00491 [pdf, other]

A Corpus for Reasoning About Natural Language Grounded in Photographs

Authors: Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, Yoav Artzi

Abstract: We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the data requires compositional joint reasoning, including about quantities, comparisons, and relations. Evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge.

Submitted 21 July, 2019; v1 submitted 1 November, 2018; originally announced November 2018.

Comments: ACL 2019 Long Paper

arXiv:1809.00786 [pdf, other]

Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction

Authors: Dipendra Misra, Andrew Bennett, Valts Blukis, Eyvind Niklasson, Max Shatkhin, Yoav Artzi

Abstract: We propose to decompose instruction execution to goal prediction and action generation. We design a model that maps raw visual observations to goals using LINGUNET, a language-conditioned image generation network, and then generates the actions required to complete them. Our model is trained from demonstration only without external resources. To evaluate our approach, we introduce two benchmarks for instruction following: LANI, a navigation task; and CHAI, where an agent executes household instructions. Our evaluation demonstrates the advantages of our model decomposition, and illustrates the challenges posed by our new benchmarks.

Submitted 18 March, 2019; v1 submitted 3 September, 2018; originally announced September 2018.

Comments: Accepted at EMNLP 2018

arXiv:1806.00047 [pdf, other]

Following High-level Navigation Instructions on a Simulated Quadcopter with Imitation Learning

Authors: Valts Blukis, Nataly Brukhim, Andrew Bennett, Ross A. Knepper, Yoav Artzi

Abstract: We introduce a method for following high-level navigation instructions by mapping directly from images, instructions and pose estimates to continuous low-level velocity commands for real-time control. The Grounded Semantic Mapping Network (GSMN) is a fully-differentiable neural network architecture that builds an explicit semantic map in the world reference frame by incorporating a pinhole camera projection model within the network. The information stored in the map is learned from experience, while the local-to-world transformation is computed explicitly. We train the model using DAggerFM, a modified variant of DAgger that trades tabular convergence guarantees for improved training speed and memory use. We test GSMN in virtual environments on a realistic quadcopter simulator and show that incorporating explicit mapping and grounding modules allows GSMN to outperform strong neural baselines and almost reach an expert policy performance. Finally, we analyze the learned map representations and show that using an explicit map leads to an interpretable instruction-following model.

Submitted 31 May, 2018; originally announced June 2018.

Comments: To appear in Robotics: Science and Systems (RSS), 2018

arXiv:1805.10209 [pdf, other]

Situated Mapping of Sequential Instructions to Actions with Single-step Reward Observation

Authors: Alane Suhr, Yoav Artzi

Abstract: We propose a learning approach for mapping context-dependent sequential instructions to actions. We address the problem of discourse and state dependencies with an attention-based model that considers both the history of the interaction and the state of the world. To train from start and goal states without access to demonstrations, we propose SESTRA, a learning algorithm that takes advantage of single-step reward observations and immediate expected reward maximization. We evaluate on the SCONE domains, and show absolute accuracy improvements of 9.8%-25.3% across the domains over approaches that use high-level logical representations.

Submitted 8 June, 2018; v1 submitted 25 May, 2018; originally announced May 2018.

Comments: ACL 2018 Long Paper

arXiv:1804.11283 [pdf, other]

Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies

Authors: Max Grusky, Mor Naaman, Yoav Artzi

Abstract: We present NEWSROOM, a summarization dataset of 1.3 million articles and summaries written by authors and editors in newsrooms of 38 major news publications. Extracted from search and social media metadata between 1998 and 2017, these high-quality summaries demonstrate high diversity of summarization styles. In particular, the summaries combine abstractive and extractive strategies, borrowing words and phrases from articles at varying rates. We analyze the extraction strategies used in NEWSROOM summaries against other datasets to quantify the diversity and difficulty of our new data, and train existing methods on the data to evaluate its utility and challenges.

Submitted 17 May, 2020; v1 submitted 30 April, 2018; originally announced April 2018.

Comments: Proceedings of NAACL-HLT 2018 (Long Paper)

arXiv:1804.06868 [pdf, other]

Learning to Map Context-Dependent Sentences to Executable Formal Queries

Authors: Alane Suhr, Srinivasan Iyer, Yoav Artzi

Abstract: We propose a context-dependent model to map utterances within an interaction to executable formal queries. To incorporate interaction history, the model maintains an interaction-level encoder that updates after each turn, and can copy sub-sequences of previously predicted queries during generation. Our approach combines implicit and explicit modeling of references between utterances. We evaluate our model on the ATIS flight planning interactions, and demonstrate the benefits of modeling context and explicit references.

Submitted 25 April, 2018; v1 submitted 18 April, 2018; originally announced April 2018.

Comments: NAACL-HLT 2018 Long Paper

arXiv:1801.07357 [pdf, other]

CHALET: Cornell House Agent Learning Environment

Authors: Claudia Yan, Dipendra Misra, Andrew Bennett, Aaron Walsman, Yonatan Bisk, Yoav Artzi

Abstract: We present CHALET, a 3D house simulator with support for navigation and manipulation. CHALET includes 58 rooms and 10 house configurations, and allows users to easily create new house and room layouts. CHALET supports a range of common household activities, including moving objects, toggling appliances, and placing objects inside closeable containers. The environment and actions available are designed to create a challenging domain to train and evaluate autonomous agents, including for tasks that combine language, vision, and planning in a dynamic environment.

Submitted 16 September, 2019; v1 submitted 22 January, 2018; originally announced January 2018.

arXiv:1710.00453 [pdf, other]

Visual Reasoning with Natural Language

Authors: Stephanie Zhou, Alane Suhr, Yoav Artzi

Abstract: Natural language provides a widely accessible and expressive interface for robotic agents. To understand language in complex environments, agents must reason about the full range of language inputs and their correspondence to the world. Such reasoning over language and vision is an open problem that is receiving increasing attention. While existing data sets focus on visual diversity, they do not display the full range of natural language expressions, such as counting, set reasoning, and comparisons. We propose a simple task for natural language visual reasoning, where images are paired with descriptive statements. The task is to predict if a statement is true for the given scene. This abstract describes our existing synthetic images corpus and our current work on collecting real vision data.

Submitted 1 October, 2017; originally announced October 2017.

Comments: AAAI NCHRC 2017

arXiv:1709.02755 [pdf, other]

Simple Recurrent Units for Highly Parallelizable Recurrence

Authors: Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, Yoav Artzi

Abstract: Common recurrent neural architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence, enable highly parallelized implementation, and comes with careful initialization to facilitate training of deep models. We demonstrate the effectiveness of SRU on multiple NLP tasks. SRU achieves 5-9x speed-up over cuDNN-optimized LSTM on classification and question answering datasets, and delivers stronger results than LSTM and convolutional models. We also obtain an average of 0.7 BLEU improvement over the Transformer model on translation by incorporating SRU into the architecture.

Submitted 7 September, 2018; v1 submitted 8 September, 2017; originally announced September 2017.

Comments: EMNLP

arXiv:1704.08795 [pdf, other]

Mapping Instructions and Visual Observations to Actions with Reinforcement Learning

Authors: Dipendra Misra, John Langford, Yoav Artzi

Abstract: We propose to directly map raw visual observations and text input to actions for instruction execution. While existing approaches assume access to structured environment representations or use a pipeline of separately trained models, we learn a single model to jointly reason about linguistic and visual input. We use reinforcement learning in a contextual bandit setting to train a neural network agent. To guide the agent's exploration, we use reward shaping with different forms of supervision. Our approach does not require intermediate representations, planning procedures, or training different models. We evaluate in a simulated environment, and show significant improvements over supervised learning and common reinforcement learning variants.

Submitted 22 July, 2017; v1 submitted 27 April, 2017; originally announced April 2017.

Comments: In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017

arXiv:1311.3011 [pdf]

Cornell SPF: Cornell Semantic Parsing Framework

Authors: Yoav Artzi

Abstract: The Cornell Semantic Parsing Framework (SPF) is a learning and inference framework for mapping natural language to formal representation of its meaning.

Submitted 8 October, 2016; v1 submitted 12 November, 2013; originally announced November 2013.

Showing 1–44 of 44 results for author: Artzi, Y