Showing 1–45 of 45 results for author: Baldridge, J

  1. arXiv:2406.19967  [pdf, other]

    cs.CL cs.AI

    Into the Unknown: Generating Geospatial Descriptions for New Environments

    Authors: Tzuf Paz-Argaman, John Palowitch, Sayali Kulkarni, Reut Tsarfaty, Jason Baldridge

    Abstract: Similar to vision-and-language navigation (VLN) tasks that focus on bridging the gap between vision and language for embodied navigation, the new Rendezvous (RVS) task requires reasoning over allocentric spatial relationships (independent of the observer's viewpoint) using non-sequential navigation instructions and maps. However, performance substantially drops in new environments with no training…

    Submitted 28 June, 2024; originally announced June 2024.

    Journal ref: ACL 2024 Findings

  2. arXiv:2405.16759  [pdf, other]

    cs.CV cs.LG

    Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models

    Authors: Cristina N. Vasconcelos, Abdullah Rashwan, Austin Waters, Trevor Walker, Keyang Xu, Jimmy Yan, Rui Qian, Shixin Luo, Zarana Parekh, Andrew Bunner, Hongliang Fei, Roopal Garg, Mandy Guo, Ivana Kajic, Yeqing Li, Henna Nandwani, Jordi Pont-Tuset, Yasumasa Onoe, Sarah Rosston, Su Wang, Wenlei Zhou, Kevin Swersky, David J. Fleet, Jason M. Baldridge, Oliver Wang

    Abstract: We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for stable training of large-scale, high-resolution models, without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely, those responsible for text-to-image alignm…

    Submitted 26 May, 2024; originally announced May 2024.

  3. arXiv:2405.02793  [pdf, other]

    cs.CV cs.CL

    ImageInWords: Unlocking Hyper-Detailed Image Descriptions

    Authors: Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bunner, Ranjay Krishna, Jason Baldridge, Radu Soricut

    Abstract: Despite the longstanding adage "an image is worth a thousand words," creating accurate and hyper-detailed image descriptions for training Vision-Language models remains challenging. Current datasets typically have web-scraped descriptions that are short, low-granularity, and often contain details unrelated to the visual content. As a result, models trained on such data generate descriptions replet…

    Submitted 4 May, 2024; originally announced May 2024.

    Comments: Webpage (https://google.github.io/imageinwords), GitHub (https://github.com/google/imageinwords), HuggingFace (https://huggingface.co/datasets/google/imageinwords)

  4. arXiv:2404.19753  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    DOCCI: Descriptions of Connected and Contrasting Images

    Authors: Yasumasa Onoe, Sunayana Rane, Zachary Berger, Yonatan Bitton, Jaemin Cho, Roopal Garg, Alexander Ku, Zarana Parekh, Jordi Pont-Tuset, Garrett Tanzer, Su Wang, Jason Baldridge

    Abstract: Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that w…

    Submitted 30 April, 2024; originally announced April 2024.

  5. arXiv:2402.16364  [pdf, other]

    cs.CL cs.LG cs.MM

    Where Do We Go from Here? Multi-scale Allocentric Relational Inference from Natural Spatial Descriptions

    Authors: Tzuf Paz-Argaman, Sayali Kulkarni, John Palowitch, Jason Baldridge, Reut Tsarfaty

    Abstract: When communicating routes in natural language, the concept of acquired spatial knowledge is crucial for geographic information retrieval (GIR) and in spatial cognitive research. However, NLP navigation studies often overlook the impact of such acquired knowledge on textual descriptions. Current navigation studies concentrate on egocentric local descriptions (e.g., 'it will be on your right')…

    Submitted 26 February, 2024; originally announced February 2024.

  6. arXiv:2312.11805  [pdf, other]

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  7. arXiv:2310.18235  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation

    Authors: Jaemin Cho, Yushi Hu, Roopal Garg, Peter Anderson, Ranjay Krishna, Jason Baldridge, Mohit Bansal, Jordi Pont-Tuset, Su Wang

    Abstract: Evaluating text-to-image models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering), which uses pre-trained foundational models to automatically generate a set of questions and answers from the prompt, and output images are scored based on whether these answers extracted with a visual question answering model…

    Submitted 13 March, 2024; v1 submitted 27 October, 2023; originally announced October 2023.

    Comments: ICLR 2024; Project website: https://google.github.io/dsg

  8. arXiv:2305.18213  [pdf]

    cs.LG cs.AI

    Gaussian Process Probes (GPP) for Uncertainty-Aware Probing

    Authors: Zi Wang, Alexander Ku, Jason Baldridge, Thomas L. Griffiths, Been Kim

    Abstract: Understanding which concepts models can and cannot represent has been fundamental to many tasks: from effective and responsible use of models to detecting out of distribution data. We introduce Gaussian process probes (GPP), a unified and simple framework for probing and measuring uncertainty about concepts represented by models. As a Bayesian extension of linear probing methods, GPP asks what kin…

    Submitted 6 November, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

    Journal ref: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  9. arXiv:2303.13455  [pdf, other]

    cs.CV cs.CL

    CoBIT: A Contrastive Bi-directional Image-Text Generation Model

    Authors: Haoxuan You, Mandy Guo, Zhecan Wang, Kai-Wei Chang, Jason Baldridge, Jiahui Yu

    Abstract: The field of vision and language has witnessed a proliferation of pre-trained foundation models. Most existing methods are independently pre-trained with a contrastive objective like CLIP, an image-to-text generative objective like PaLI, or a text-to-image generative objective like Parti. However, the three objectives can be pre-trained on the same data, image-text pairs, and intuitively they complement…

    Submitted 23 March, 2023; originally announced March 2023.

    Comments: 14 pages, 5 figures

  10. arXiv:2212.06909  [pdf, other]

    cs.CV cs.AI

    Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting

    Authors: Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah Laszlo, David J. Fleet, Radu Soricut, Jason Baldridge, Mohammad Norouzi, Peter Anderson, William Chan

    Abstract: Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to input text prompts, while consistent with input images. We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplish…

    Submitted 12 April, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

    Comments: CVPR 2023 Camera Ready

  11. arXiv:2210.05815  [pdf, other]

    cs.CV cs.CL

    Underspecification in Scene Description-to-Depiction Tasks

    Authors: Ben Hutchinson, Jason Baldridge, Vinodkumar Prabhakaran

    Abstract: Questions regarding implicitness, ambiguity and underspecification are crucial for understanding the task validity and ethical concerns of multimodal image+text systems, yet have received little attention to date. This position paper maps out a conceptual framework to address this gap, focusing on systems which generate images depicting scenes from scene descriptions. In doing so, we account for h…

    Submitted 11 October, 2022; originally announced October 2022.

  12. arXiv:2210.03112  [pdf, other]

    cs.LG cs.CL cs.CV cs.RO

    A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning

    Authors: Aishwarya Kamath, Peter Anderson, Su Wang, Jing Yu Koh, Alexander Ku, Austin Waters, Yinfei Yang, Jason Baldridge, Zarana Parekh

    Abstract: Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial langua…

    Submitted 17 April, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: CVPR 2023

  13. arXiv:2206.10789  [pdf, other]

    cs.CV cs.LG

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Authors: Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu

    Abstract: We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in a…

    Submitted 21 June, 2022; originally announced June 2022.

    Comments: Preprint

  14. arXiv:2204.02960  [pdf, other]

    cs.CV cs.AI cs.LG

    Simple and Effective Synthesis of Indoor 3D Scenes

    Authors: Jing Yu Koh, Harsh Agrawal, Dhruv Batra, Richard Tucker, Austin Waters, Honglak Lee, Yinfei Yang, Jason Baldridge, Peter Anderson

    Abstract: We study the problem of synthesizing immersive 3D indoor scenes from one or more images. Our aim is to generate high-resolution images and videos from novel viewpoints, including viewpoints that extrapolate far beyond the input images while maintaining 3D consistency. Existing approaches are highly complex, with many separately trained stages and components. We propose a simple alternative: an ima…

    Submitted 1 December, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: AAAI 2023

  15. arXiv:2111.12872  [pdf, other]

    cs.CV cs.CL

    Less is More: Generating Grounded Navigation Instructions from Landmarks

    Authors: Su Wang, Ceslee Montgomery, Jordi Orbay, Vighnesh Birodkar, Aleksandra Faust, Izzeddin Gur, Natasha Jaques, Austin Waters, Jason Baldridge, Peter Anderson

    Abstract: We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our MARKY-MT5 system addresses this by focusing on visual landmarks; it comprises a first stage landmark detector and a second stage generator -- a multimodal, multiling…

    Submitted 4 April, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

    Comments: CVPR 2022 Camera-ready

  16. arXiv:2110.04627  [pdf, other]

    cs.CV cs.LG

    Vector-quantized Image Modeling with Improved VQGAN

    Authors: Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu

    Abstract: Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregres…

    Submitted 4 June, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

    Comments: Accepted in ICLR 2022

  17. arXiv:2109.05125  [pdf, other]

    cs.IR cs.AI cs.CL cs.LG

    MURAL: Multimodal, Multitask Retrieval Across Languages

    Authors: Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, Jason Baldridge

    Abstract: Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al…

    Submitted 10 September, 2021; originally announced September 2021.

  18. arXiv:2105.08756  [pdf, other]

    cs.CV cs.LG

    Pathdreamer: A World Model for Indoor Navigation

    Authors: Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, Peter Anderson

    Abstract: People navigating in unfamiliar buildings take advantage of myriad visual, spatial and semantic cues to efficiently achieve their navigation goals. Towards equipping computational agents with similar capabilities, we introduce Pathdreamer, a visual world model for agents navigating in novel indoor environments. Given one or more previous visual observations, Pathdreamer generates plausible high-re…

    Submitted 16 August, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

    Comments: In ICCV 2021

  19. arXiv:2104.01894  [pdf, ps, other]

    cs.CL cs.CV cs.IR cs.LG

    Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval

    Authors: Ramon Sanabria, Austin Waters, Jason Baldridge

    Abstract: Speech-based image retrieval has been studied as a proxy for joint representation learning, usually without emphasis on retrieval itself. As such, it is unclear how well speech-based retrieval can work in practice -- both in an absolute sense and versus alternative strategies that combine automatic speech recognition (ASR) with strong text encoders. In this work, we extensively study and expand ch…

    Submitted 15 June, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: Accepted to INTERSPEECH 2021

  20. arXiv:2103.12703  [pdf, other]

    cs.CV cs.AI cs.CL

    PanGEA: The Panoramic Graph Environment Annotation Toolkit

    Authors: Alexander Ku, Peter Anderson, Jordi Pont-Tuset, Jason Baldridge

    Abstract: PanGEA, the Panoramic Graph Environment Annotation toolkit, is a lightweight toolkit for collecting speech and text annotations in photo-realistic 3D environments. PanGEA immerses annotators in a web-based simulation and allows them to move around easily as they speak and/or listen. It includes database and cloud storage integration, plus utilities for automatically aligning recorded speech with m…

    Submitted 23 March, 2021; originally announced March 2021.

  21. arXiv:2101.10504  [pdf, other]

    cs.AI cs.CL cs.CV

    On the Evaluation of Vision-and-Language Navigation Instructions

    Authors: Ming Zhao, Peter Anderson, Vihan Jain, Su Wang, Alexander Ku, Jason Baldridge, Eugene Ie

    Abstract: Vision-and-Language Navigation wayfinding agents can be enhanced by exploiting automatically generated navigation instructions. However, existing instruction generators have not been comprehensively evaluated, and the automatic evaluation metrics used to develop them have not been validated. Using human wayfinders, we show that these generators perform on par with or only slightly better than a te…

    Submitted 25 January, 2021; originally announced January 2021.

    Comments: Accepted to EACL 2021

  22. arXiv:2101.04702  [pdf, other]

    cs.CV

    Cross-Modal Contrastive Learning for Text-to-Image Generation

    Authors: Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang

    Abstract: The output of text-to-image synthesis systems should be coherent, clear, photo-realistic scenes with high semantic fidelity to their conditioned text descriptions. Our Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses this challenge by maximizing the mutual information between image and text. It does this via multiple contrastive losses which capture inter-modality and int…

    Submitted 14 April, 2022; v1 submitted 12 January, 2021; originally announced January 2021.

    Comments: CVPR 2021

  23. arXiv:2011.03775  [pdf, other]

    cs.CV cs.AI

    Text-to-Image Generation Grounded by Fine-Grained User Attention

    Authors: Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang

    Abstract: Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces that provide a sparse, fine-grained visual grounding for phrases. We propose TReCS, a sequential model that exploits this grounding to generate images. TReCS uses descriptions to retrieve segmentation masks and predict object labels aligned with mouse traces. These alignments are used t…

    Submitted 30 March, 2021; v1 submitted 7 November, 2020; originally announced November 2020.

    Comments: To appear in WACV 2021

  24. arXiv:2010.07954  [pdf, other]

    cs.CV cs.AI cs.CL

    Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding

    Authors: Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, Jason Baldridge

    Abstract: We introduce Room-Across-Room (RxR), a new Vision-and-Language Navigation (VLN) dataset. RxR is multilingual (English, Hindi, and Telugu) and larger (more paths and instructions) than other VLN datasets. It emphasizes the role of language in VLN by addressing known biases in paths and eliciting more references to visible entities. Furthermore, each word in an instruction is time-aligned to the vir…

    Submitted 15 October, 2020; originally announced October 2020.

    Comments: EMNLP 2020

  25. arXiv:2008.09236  [pdf, other]

    cs.CL

    Spatial Language Representation with Multi-Level Geocoding

    Authors: Sayali Kulkarni, Shailee Jain, Mohammad Javad Hosseini, Jason Baldridge, Eugene Ie, Li Zhang

    Abstract: We present a multi-level geocoding model (MLG) that learns to associate texts to geographic locations. The Earth's surface is represented using space-filling curves that decompose the sphere into a hierarchy of similarly sized, non-overlapping cells. MLG balances generalization and accuracy by combining losses across multiple levels and predicting cells at each level simultaneously. Without using…

    Submitted 20 August, 2020; originally announced August 2020.

  26. arXiv:2005.08469  [pdf, other]

    cs.CL

    Text Classification with Few Examples using Controlled Generalization

    Authors: Abhijit Mahabal, Jason Baldridge, Burcu Karagol Ayan, Vincent Perot, Dan Roth

    Abstract: Training data for text classification is often limited in practice, especially for applications with many output classes or involving many related classification problems. This means classifiers must generalize from limited evidence, but the manner and extent of generalization is task dependent. Current practice primarily relies on pre-trained word embeddings to map words unseen in training to sim…

    Submitted 18 May, 2020; originally announced May 2020.

    Journal ref: Proceedings of NAACL-HLT 2019

  27. arXiv:2005.03776  [pdf, other]

    cs.CL cs.LG

    Mapping Natural Language Instructions to Mobile UI Action Sequences

    Authors: Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, Jason Baldridge

    Abstract: We present a new problem: grounding natural language instructions to mobile user interface actions, and create three new datasets for it. For full task evaluation, we create PIXELHELP, a corpus that pairs English instructions with actions performed by people on a mobile UI emulator. To scale training, we decouple the language and action data by (a) annotating action phrase spans in HowTo instructi…

    Submitted 4 June, 2020; v1 submitted 7 May, 2020; originally announced May 2020.

    Comments: Annual Conference of the Association for Computational Linguistics (ACL 2020)

  28. arXiv:2004.15020  [pdf, other]

    cs.CL

    Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO

    Authors: Zarana Parekh, Jason Baldridge, Daniel Cer, Austin Waters, Yinfei Yang

    Abstract: By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress on representation learning. Unfortunately, datasets have limited cross-modal associations: images are not paired with other images, captions are only paired with other captions of the same image, there are no negative associations and there are missing positive cross-modal associ…

    Submitted 24 March, 2021; v1 submitted 30 April, 2020; originally announced April 2020.

    Comments: To be presented at EACL2021

  29. arXiv:2001.03671  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Retouchdown: Adding Touchdown to StreetLearn as a Shareable Resource for Language Grounding Tasks in Street View

    Authors: Harsh Mehta, Yoav Artzi, Jason Baldridge, Eugene Ie, Piotr Mirowski

    Abstract: The Touchdown dataset (Chen et al., 2019) provides instructions by human annotators for navigation through New York City streets and for resolving spatial descriptions at a given location. To enable the wider research community to work effectively with the Touchdown tasks, we are publicly releasing the 29k raw Street View panoramas needed for Touchdown. We follow the process used for the StreetLea…

    Submitted 10 January, 2020; originally announced January 2020.

  30. arXiv:1912.05877  [pdf, other]

    cs.CL cs.AI

    Extending Machine Language Models toward Human-Level Language Understanding

    Authors: James L. McClelland, Felix Hill, Maja Rudolph, Jason Baldridge, Hinrich Schütze

    Abstract: Language is crucial for human intelligence, but what exactly is its role? We take language to be a part of a system for understanding and communicating about situations. The human ability to understand and communicate about situations emerges gradually from experience and depends on domain-general principles of biological neural networks: connection-based learning, distributed representation, and…

    Submitted 4 July, 2020; v1 submitted 12 December, 2019; originally announced December 2019.

  31. arXiv:1909.10506  [pdf, other]

    cs.CL cs.IR cs.LG

    Learning Dense Representations for Entity Retrieval

    Authors: Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, Diego Garcia-Olano

    Abstract: We show that it is feasible to perform entity linking by training a dual encoder (two-tower) model that encodes mentions and entities in the same dense vector space, where candidate entities are retrieved by approximate nearest neighbor search. Unlike prior work, this setup does not rely on an alias table followed by a re-ranker, and is thus the first fully learned entity retrieval model. We show…

    Submitted 23 September, 2019; originally announced September 2019.

    Comments: CoNLL 2019

  32. arXiv:1909.08782  [pdf, other]

    cs.CV cs.CL cs.SD eess.AS

    Large-scale representation learning from visually grounded untranscribed speech

    Authors: Gabriel Ilharco, Yuan Zhang, Jason Baldridge

    Abstract: Systems that can associate images with their spoken audio captions are an important step towards visually grounded language learning. We describe a scalable method to automatically generate diverse audio for image captioning datasets. This supports pretraining deep networks for encoding both audio and images, which we do via a dual encoder that learns to align latent representations from both moda…

    Submitted 18 September, 2019; originally announced September 2019.

    Journal ref: The SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2019

  33. arXiv:1908.11828  [pdf, ps, other]

    cs.CL

    PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification

    Authors: Yinfei Yang, Yuan Zhang, Chris Tar, Jason Baldridge

    Abstract: Most existing work on adversarial data generation focuses on English. For example, PAWS (Paraphrase Adversaries from Word Scrambling) consists of challenging English paraphrase identification pairs from Wikipedia and Quora. We remedy this gap with PAWS-X, a new dataset of 23,659 human translated PAWS evaluation pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japane…

    Submitted 30 August, 2019; originally announced August 2019.

    Comments: Accepted by EMNLP2019

  34. arXiv:1908.03409  [pdf, other]

    cs.CV cs.CL cs.LG cs.RO

    Transferable Representation Learning in Vision-and-Language Navigation

    Authors: Haoshuo Huang, Vihan Jain, Harsh Mehta, Alexander Ku, Gabriel Magalhaes, Jason Baldridge, Eugene Ie

    Abstract: Vision-and-Language Navigation (VLN) tasks such as Room-to-Room (R2R) require machine agents to interpret natural language instructions and learn to act in visually realistic environments to achieve navigation goals. The overall task requires competence in several perception problems: successful agents combine spatio-temporal, vision and language understanding to produce appropriate action sequenc…

    Submitted 12 August, 2019; v1 submitted 9 August, 2019; originally announced August 2019.

    Comments: To appear in ICCV 2019

  35. arXiv:1907.05446  [pdf, other]

    cs.RO cs.AI cs.CL

    General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping

    Authors: Gabriel Ilharco, Vihan Jain, Alexander Ku, Eugene Ie, Jason Baldridge

    Abstract: In instruction conditioned navigation, agents interpret natural language and their surroundings to navigate through an environment. Datasets for studying this task typically contain pairs of these instructions and reference trajectories. Yet, most evaluation metrics used thus far fail to properly account for the latter, relying instead on insufficient similarity comparisons. We address fundamental…

    Submitted 28 November, 2019; v1 submitted 11 July, 2019; originally announced July 2019.

    Journal ref: Thirty-third Conference on Neural Information Processing Systems (NeurIPS 2019)

  36. arXiv:1905.13358  [pdf, other]

    cs.CL cs.CV

    Multi-modal Discriminative Model for Vision-and-Language Navigation

    Authors: Haoshuo Huang, Vihan Jain, Harsh Mehta, Jason Baldridge, Eugene Ie

    Abstract: Vision-and-Language Navigation (VLN) is a natural language grounding task where agents have to interpret natural language instructions in the context of visual scenes in a dynamic environment to achieve prescribed navigation goals. Successful agents must have the ability to parse natural language of varying linguistic styles, ground them in potentially unfamiliar scenes, plan and react with ambigu…

    Submitted 30 May, 2019; originally announced May 2019.

    Comments: Accepted at SpLU-RoboNLP 2019 (workshop at NAACL)

  37. arXiv:1905.12255  [pdf, other]

    cs.AI cs.CL

    Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation

    Authors: Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, Jason Baldridge

    Abstract: Advances in learning and representations have reinvigorated work that connects language to other modalities. A particularly exciting direction is Vision-and-Language Navigation (VLN), in which agents interpret natural language instructions and visual scenes to move through environments and reach goals. Despite recent progress, current research leaves unclear how much of a role language understandin…

    Submitted 21 June, 2019; v1 submitted 29 May, 2019; originally announced May 2019.

    Comments: Accepted at ACL 2019 as long paper

  38. arXiv:1904.01130  [pdf, other]

    cs.CL

    PAWS: Paraphrase Adversaries from Word Scrambling

    Authors: Yuan Zhang, Jason Baldridge, Luheng He

    Abstract: Existing paraphrase identification datasets lack sentence pairs that have high lexical overlap without being paraphrases. Models trained on such data fail to distinguish pairs like "flights from New York to Florida" and "flights from Florida to New York". This paper introduces PAWS (Paraphrase Adversaries from Word Scrambling), a new dataset with 108,463 well-formed paraphrase and non-paraphrase pairs…

    Submitted 1 April, 2019; originally announced April 2019.

    Comments: NAACL 2019

  39. arXiv:1811.00119  [pdf, other]

    cs.CL

    A task in a suit and a tie: paraphrase generation with semantic augmentation

    Authors: Su Wang, Rahul Gupta, Nancy Chang, Jason Baldridge

    Abstract: Paraphrasing is rooted in semantics. We show the effectiveness of transformers (Vaswani et al. 2017) for paraphrase generation and further improvements by incorporating PropBank labels via a multi-encoder. Evaluating on MSCOCO and WikiAnswers, we find that transformers are fast and effective, and that semantic augmentation for both transformers and LSTMs leads to sizable 2-3 point gains in BLEU, M…

    Submitted 14 November, 2018; v1 submitted 31 October, 2018; originally announced November 2018.

    Journal ref: Association for the Advancement of Artificial Intelligence (AAAI) 2019

  40. arXiv:1810.05201  [pdf, other]

    cs.CL

    Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns

    Authors: Kellie Webster, Marta Recasens, Vera Axelrod, Jason Baldridge

    Abstract: Coreference resolution is an important task for natural language understanding, and the resolution of ambiguous pronouns a longstanding challenge. Nonetheless, existing corpora do not capture ambiguous pronouns in sufficient volume or diversity to accurately indicate the practical utility of models. Furthermore, we find gender bias in existing corpora and systems favoring masculine entities. To ad…

    Submitted 11 October, 2018; originally announced October 2018.

  41. arXiv:1810.04142  [pdf, other]

    cs.CL

    A Fast, Compact, Accurate Model for Language Identification of Codemixed Text

    Authors: Yuan Zhang, Jason Riesa, Daniel Gillick, Anton Bakalov, Jason Baldridge, David Weiss

    Abstract: We address fine-grained multilingual language identification: providing a language code for every token in a sentence, including codemixed text containing multiple languages. Such text is prevalent online, in documents, social media, and message boards. We show that a feed-forward network with a simple globally constrained decoder can accurately and rapidly label both codemixed and monolingual tex…

    Submitted 9 October, 2018; originally announced October 2018.

    Comments: EMNLP 2018

  42. arXiv:1808.09468  [pdf, ps, other]

    cs.CL

    Learning To Split and Rephrase From Wikipedia Edit History

    Authors: Jan A. Botha, Manaal Faruqui, John Alex, Jason Baldridge, Dipanjan Das

    Abstract: Split and rephrase is the task of breaking down a sentence into shorter ones that together convey the same meaning. We extract a rich new dataset for this task by mining Wikipedia's edit history: WikiSplit contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger vocabulary than the WebSplit corpus introduced by Narayan…

    Submitted 28 August, 2018; originally announced August 2018.

    Journal ref: Proc. of EMNLP 2018

  43. arXiv:1611.08765  [pdf, other]

    cs.CL

    Fill it up: Exploiting partial dependency annotations in a minimum spanning tree parser

    Authors: Liang Sun, Jason Mielens, Jason Baldridge

    Abstract: Unsupervised models of dependency parsing typically require large amounts of clean, unlabeled data plus gold-standard part-of-speech tags. Adding indirect supervision (e.g. language universals and rules) can help, but we show that obtaining small amounts of direct supervision - here, partial dependency annotations - provides a strong balance between zero and full supervision. We adapt the unsuperv…

    Submitted 26 November, 2016; originally announced November 2016.

  44. arXiv:1306.2091  [pdf, other]

    cs.CL

    A framework for (under)specifying dependency syntax without overloading annotators

    Authors: Nathan Schneider, Brendan O'Connor, Naomi Saphra, David Bamman, Manaal Faruqui, Noah A. Smith, Chris Dyer, Jason Baldridge

    Abstract: We introduce a framework for lightweight dependency syntax annotation. Our formalism builds upon the typical representation for unlabeled dependencies, permitting a simple notation and annotation workflow. Moreover, the formalism encourages annotators to underspecify parts of the syntax if doing so would streamline the annotation process. We demonstrate the efficacy of this annotation on three lan…

    Submitted 14 June, 2013; v1 submitted 9 June, 2013; originally announced June 2013.

    Comments: This is an expanded version of a paper appearing in Proceedings of the 7th Linguistic Annotation Workshop & Interoperability with Discourse, Sofia, Bulgaria, August 8-9, 2013

  45. arXiv:1211.2290  [pdf, other]

    cs.CL cs.AI

    Dating Texts without Explicit Temporal Cues

    Authors: Abhimanu Kumar, Jason Baldridge, Matthew Lease, Joydeep Ghosh

    Abstract: This paper tackles temporal resolution of documents, such as determining when a document is about or when it was written, based only on its text. We apply techniques from information retrieval that predict dates via language models over a discretized timeline. Unlike most previous works, we rely solely on temporal cues implicit in the text. We consider both document-likelihood and divergence…

    Submitted 10 November, 2012; originally announced November 2012.