Skip to main content

Showing 1–50 of 74 results for author: Bisk, Y

Searching in archive cs. Search in all archives.
.
  1. arXiv:2407.00369  [pdf, other

    cs.CL

    How to Train Your Fact Verifier: Knowledge Transfer with Multimodal Open Models

    Authors: Jaeyoung Lee, Ximing Lu, Jack Hessel, Faeze Brahman, Youngjae Yu, Yonatan Bisk, Ye** Choi, Saadia Gabriel

    Abstract: Given the growing influx of misinformation across news and social media, there is a critical need for systems that can provide effective real-time verification of news claims. Large language or multimodal model based verification has been proposed to scale up online policing mechanisms for mitigating spread of false and harmful content. While these can potentially reduce burden on human fact-check… ▽ More

    Submitted 29 June, 2024; originally announced July 2024.

  2. arXiv:2406.19228  [pdf, other

    cs.CL cs.AI cs.LG

    Tools Fail: Detecting Silent Errors in Faulty Tools

    Authors: Jimin Sun, So Yeon Min, Yingshan Chang, Yonatan Bisk

    Abstract: Tools have become a mainstay of LLMs, allowing them to retrieve knowledge not in their weights, to perform tasks on the web, and even to control robots. However, most ontologies and surveys of tool-use have assumed the core challenge for LLMs is choosing the tool. Instead, we introduce a framework for tools more broadly which guides us to explore a model's ability to detect "silent" tool errors, a… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 18 pages, 12 figures

  3. arXiv:2406.05191  [pdf, other

    cs.CV

    DiffusionPID: Interpreting Diffusion via Partial Information Decomposition

    Authors: Shaurya Dewan, Rushikesh Zawar, Prakanshul Saxena, Yingshan Chang, Andrew Luo, Yonatan Bisk

    Abstract: Text-to-image diffusion models have made significant progress in generating naturalistic images from textual inputs, and demonstrate the capacity to learn and represent complex visual-semantic relationships. While these diffusion models have achieved remarkable success, the underlying mechanisms driving their performance are not yet fully accounted for, with many unanswered questions surrounding w… ▽ More

    Submitted 12 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

  4. arXiv:2405.20131  [pdf, other

    cs.CL cs.AI

    Language Models Need Inductive Biases to Count Inductively

    Authors: Yingshan Chang, Yonatan Bisk

    Abstract: Counting is a fundamental example of generalization, whether viewed through the mathematical lens of Peano's axioms defining the natural numbers or the cognitive science literature for children learning to count. The argument holds for both cases that learning to count means learning to count infinitely. While few papers have tried to distill transformer "reasoning" to the simplest case of countin… ▽ More

    Submitted 30 May, 2024; originally announced May 2024.

  5. arXiv:2404.11483  [pdf, other

    cs.AI cs.LG

    AgentKit: Flow Engineering with Graphs, not Coding

    Authors: Yue Wu, Yewen Fan, So Yeon Min, Shrimai Prabhumoye, Stephen McAleer, Yonatan Bisk, Ruslan Salakhutdinov, Yuanzhi Li, Tom Mitchell

    Abstract: We propose an intuitive LLM prompting framework (AgentKit) for multifunctional agents. AgentKit offers a unified framework for explicitly constructing a complex "thought process" from simple natural language prompts. The basic building block in AgentKit is a node, containing a natural language prompt for a specific subtask. The user then puts together chains of nodes, like stacking LEGO pieces. Th… ▽ More

    Submitted 17 April, 2024; originally announced April 2024.

  6. arXiv:2404.01258  [pdf, other

    cs.CV cs.AI

    Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

    Authors: Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, Yiming Yang

    Abstract: Preference modeling techniques, such as direct preference optimization (DPO), has shown effective in enhancing the generalization abilities of large language model (LLM). However, in tasks involving video instruction-following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous studies have explored using large… ▽ More

    Submitted 2 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

  7. arXiv:2404.01158  [pdf, other

    cs.CL cs.RO

    Dialogue with Robots: Proposals for Broadening Participation and Research in the SLIVAR Community

    Authors: Casey Kennington, Malihe Alikhani, Heather Pon-Barry, Katherine Atwell, Yonatan Bisk, Daniel Fried, Felix Gervits, Zhao Han, Mert Inan, Michael Johnston, Raj Korpan, Diane Litman, Matthew Marge, Cynthia Matuszek, Ross Mead, Shiwali Mohan, Raymond Mooney, Natalie Parde, Jivko Sinapov, Angela Stewart, Matthew Stone, Stefanie Tellex, Tom Williams

    Abstract: The ability to interact with machines using natural human language is becoming not just commonplace, but expected. The next step is not just text interfaces, but speech interfaces and not just with computers, but with all machines including robots. In this paper, we chronicle the recent history of this growing field of spoken dialogue with robots and offer the community three proposals, the first… ▽ More

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: NSF Report on the "Dialogue with Robots" Workshop held in Pittsburg, PA, April 2023

  8. arXiv:2403.16394  [pdf, other

    cs.CL cs.AI

    Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation

    Authors: Yingshan Chang, Yasi Zhang, Zhiyuan Fang, Yingnian Wu, Yonatan Bisk, Feng Gao

    Abstract: The literature on text-to-image generation is plagued by issues of faithfully composing entities with relations. But there lacks a formal understanding of how entity-relation compositions can be effectively learned. Moreover, the underlying phenomenon space that meaningfully reflects the problem structure is not well-defined, leading to an arms race for larger quantities of data in the hope that g… ▽ More

    Submitted 24 March, 2024; originally announced March 2024.

  9. arXiv:2403.12943  [pdf, other

    cs.RO cs.AI

    Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

    Authors: Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi

    Abstract: While large-scale robotic systems typically rely on textual instructions for tasks, this work explores a different approach: can robots infer the task directly from observing humans? This shift necessitates the robot's ability to decode human intent and translate it into executable actions within its physical constraints and environment. We introduce Vid2Robot, a novel end-to-end video-based learn… ▽ More

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: Robot learning: Imitation Learning, Robot Perception, Sensing & Vision, Gras** & Manipulation

  10. arXiv:2403.10534  [pdf, other

    cs.CV cs.AI

    VISREAS: Complex Visual Reasoning with Unanswerable Questions

    Authors: Syeda Nahida Akter, Sangwu Lee, Yingshan Chang, Yonatan Bisk, Eric Nyberg

    Abstract: Verifying a question's validity before answering is crucial in real-world applications, where users may provide imperfect instructions. In this scenario, an ideal model should address the discrepancies in the query and convey them to the users rather than generating the best possible answer. Addressing this requirement, we introduce a new compositional visual question-answering dataset, VISREAS, t… ▽ More

    Submitted 22 February, 2024; originally announced March 2024.

    Comments: 18 pages, 14 figures, 5 tables

  11. arXiv:2403.08715  [pdf, other

    cs.CL

    SOTOPIA-$π$: Interactive Learning of Socially Intelligent Language Agents

    Authors: Ruiyi Wang, Haofei Yu, Wenxin Zhang, Zhengyang Qi, Maarten Sap, Graham Neubig, Yonatan Bisk, Hao Zhu

    Abstract: Humans learn social skills through both imitation and social interaction. This social learning process is largely understudied by existing research on building language agents. Motivated by this gap, we propose an interactive learning method, SOTOPIA-$π$, improving the social intelligence of language agents. This method leverages behavior cloning and self-reinforcement training on filtered social… ▽ More

    Submitted 25 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

  12. arXiv:2312.08782  [pdf, other

    cs.RO cs.AI cs.CV cs.LG

    Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis

    Authors: Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Shibo Zhao, Yu Quan Chong, Chen Wang, Katia Sycara, Matthew Johnson-Roberson, Dhruv Batra, Xiaolong Wang, Sebastian Scherer, Zsolt Kira, Fei Xia, Yonatan Bisk

    Abstract: Building general-purpose robots that can operate seamlessly, in any environment, with any object, and utilizing various skills to complete diverse tasks has been a long-standing goal in Artificial Intelligence. Unfortunately, however, most existing robotic systems have been constrained - having been designed for specific tasks, trained on specific datasets, and deployed within specific environment… ▽ More

    Submitted 15 December, 2023; v1 submitted 14 December, 2023; originally announced December 2023.

  13. arXiv:2310.11667  [pdf, other

    cs.AI cs.CL cs.LG

    SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents

    Authors: Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, Maarten Sap

    Abstract: Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and evaluate their social intelligence. In our environment, agents role-play and interact under a wide va… ▽ More

    Submitted 22 March, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

    Comments: Preprint, 43 pages. The first two authors contribute equally

  14. arXiv:2310.08864  [pdf, other

    cs.RO

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Authors: Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, A**kya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie , et al. (267 additional authors not shown)

    Abstract: Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning method… ▽ More

    Submitted 1 June, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

    Comments: Project website: https://robotics-transformer-x.github.io

  15. arXiv:2309.10103  [pdf, other

    cs.RO cs.AI

    Reasoning about the Unseen for Efficient Outdoor Object Navigation

    Authors: Quanting Xie, Tianyi Zhang, Kedi Xu, Matthew Johnson-Roberson, Yonatan Bisk

    Abstract: Robots should exist anywhere humans do: indoors, outdoors, and even unmapped environments. In contrast, the focus of recent advancements in Object Goal Navigation(OGN) has targeted navigating in indoor environments by leveraging spatial and semantic cues that do not generalize outdoors. While these contributions provide valuable insights into indoor scenarios, the broader spectrum of real-world ro… ▽ More

    Submitted 18 September, 2023; originally announced September 2023.

    Comments: 6 pages, 7 figures

  16. arXiv:2309.08508  [pdf, other

    cs.RO

    MOSAIC: Learning Unified Multi-Sensory Object Property Representations for Robot Learning via Interactive Perception

    Authors: Gyan Tatiya, Jonathan Francis, Ho-Hsiang Wu, Yonatan Bisk, Jivko Sinapov

    Abstract: A holistic understanding of object properties across diverse sensory modalities (e.g., visual, audio, and haptic) is essential for tasks ranging from object categorization to complex manipulation. Drawing inspiration from cognitive science studies that emphasize the significance of multi-sensory integration in human perception, we introduce MOSAIC (Multimodal Object property learning with Self-Att… ▽ More

    Submitted 22 February, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted to the 2024 IEEE International Conference on Robotics and Automation (ICRA), May 13 to 17, 2024; Yokohama, Japan

  17. arXiv:2307.13854  [pdf, other

    cs.AI cs.CL cs.LG

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Authors: Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig

    Abstract: With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, w… ▽ More

    Submitted 16 April, 2024; v1 submitted 25 July, 2023; originally announced July 2023.

    Comments: Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev/

  18. arXiv:2307.13850  [pdf, other

    cs.LG cs.AI cs.CV cs.RO

    MAEA: Multimodal Attribution for Embodied AI

    Authors: Vidhi Jain, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Yonatan Bisk

    Abstract: Understanding multimodal perception for embodied AI is an open question because such inputs may contain highly complementary as well as redundant information for the task. A relevant direction for multimodal policies is understanding the global trends of each modality at the fusion layer. To this end, we disentangle the attributions for visual, language, and previous action inputs across different… ▽ More

    Submitted 25 July, 2023; originally announced July 2023.

  19. arXiv:2306.17842  [pdf, other

    cs.CV cs.CL cs.MM

    SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

    Authors: Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yan** Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang

    Abstract: In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details n… ▽ More

    Submitted 28 October, 2023; v1 submitted 30 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023 spotlight

  20. arXiv:2306.11565  [pdf, other

    cs.RO cs.AI cs.CV

    HomeRobot: Open-Vocabulary Mobile Manipulation

    Authors: Sriram Yenamandra, Arun Ramachandran, Karmesh Yadav, Austin Wang, Mukul Khanna, Theophile Gervet, Tsung-Yen Yang, Vidhi Jain, Alexander William Clegg, John Turner, Zsolt Kira, Manolis Savva, Angel Chang, Devendra Singh Chaplot, Dhruv Batra, Roozbeh Mottaghi, Yonatan Bisk, Chris Paxton

    Abstract: HomeRobot (noun): An affordable compliant robot that navigates homes and manipulates a wide range of objects in order to complete everyday tasks. Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment, and placing it in a commanded location. This is a foundational challenge for robots to be useful assistants in human environments, because it invol… ▽ More

    Submitted 10 January, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

    Comments: 37 pages, 22 figures, 8 tables

  21. arXiv:2305.15486  [pdf, other

    cs.AI cs.LG

    SPRING: Studying the Paper and Reasoning to Play Games

    Authors: Yue Wu, Shrimai Prabhumoye, So Yeon Min, Yonatan Bisk, Ruslan Salakhutdinov, Amos Azaria, Tom Mitchell, Yuanzhi Li

    Abstract: Open-world survival games pose significant challenges for AI algorithms due to their multi-tasking, deep exploration, and goal prioritization requirements. Despite reinforcement learning (RL) being popular for solving games, its high sample complexity limits its effectiveness in complex open-world games like Crafter or Minecraft. We propose a novel approach, SPRING, to read the game's original aca… ▽ More

    Submitted 11 December, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

  22. arXiv:2305.02412  [pdf, other

    cs.CL cs.AI cs.LG

    Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents

    Authors: Yue Wu, So Yeon Min, Yonatan Bisk, Ruslan Salakhutdinov, Amos Azaria, Yuanzhi Li, Tom Mitchell, Shrimai Prabhumoye

    Abstract: Pre-trained large language models (LLMs) capture procedural knowledge about the world. Recent work has leveraged LLM's ability to generate abstract plans to simplify challenging control tasks, either by action scoring, or action modeling (fine-tuning). However, the transformer architecture inherits several constraints that make it difficult for the LLM to directly serve as the agent: e.g. limited… ▽ More

    Submitted 7 May, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

  23. arXiv:2304.11235  [pdf, other

    cs.RO cs.AI

    Spatial-Language Attention Policies for Efficient Robot Learning

    Authors: Priyam Parashar, Vidhi Jain, Xiaohan Zhang, Jay Vakil, Sam Powers, Yonatan Bisk, Chris Paxton

    Abstract: Despite great strides in language-guided manipulation, existing work has been constrained to table-top settings. Table-tops allow for perfect and consistent camera angles, properties are that do not hold in mobile manipulation. Task plans that involve moving around the environment must be robust to egocentric views and changes in the plane and angle of grasp. A further challenge is ensuring this i… ▽ More

    Submitted 7 November, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

  24. arXiv:2303.01502  [pdf, other

    cs.CL cs.AI

    Computational Language Acquisition with Theory of Mind

    Authors: Andy Liu, Hao Zhu, Emmy Liu, Yonatan Bisk, Graham Neubig

    Abstract: Unlike current state-of-the-art language models, young children actively acquire language through interactions with their surrounding environment and caretakers. One mechanism that has been argued to be critical to language learning is the ability to infer the mental states of other agents in social environments, coined Theory of Mind (ToM) by Premack & Woodruff (1978). Drawing inspiration from th… ▽ More

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: 9 pages, 3 figures. To be published in the 11th International Conference on Learning Representations, ICLR 2023, Conference Track Proceedings

  25. arXiv:2302.06117  [pdf, other

    cs.LG

    The Framework Tax: Disparities Between Inference Efficiency in NLP Research and Deployment

    Authors: Jared Fernandez, Jacob Kahn, Clara Na, Yonatan Bisk, Emma Strubell

    Abstract: Increased focus on the computational efficiency of NLP systems has motivated the design of efficient model architectures and improvements to underlying hardware accelerators. However, the resulting increases in computational throughput and reductions in floating point operations have not directly translated to improvements in wall-clock inference latency. We demonstrate that these discrepancies ca… ▽ More

    Submitted 22 December, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

    Comments: EMNLP 2023

  26. arXiv:2212.05923  [pdf, other

    cs.RO cs.LG

    Self-Supervised Object Goal Navigation with In-Situ Finetuning

    Authors: So Yeon Min, Yao-Hung Hubert Tsai, Wei Ding, Ali Farhadi, Ruslan Salakhutdinov, Yonatan Bisk, Jian Zhang

    Abstract: A household robot should be able to navigate to target objects without requiring users to first annotate everything in their home. Most current approaches to object navigation do not test on real robots and rely solely on reconstructed scans of houses and their expensively labeled semantic 3D meshes. In this work, our goal is to build an agent that builds self-supervised models of the world via ex… ▽ More

    Submitted 1 April, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

  27. arXiv:2211.05392  [pdf, other

    cs.CL

    EvEntS ReaLM: Event Reasoning of Entity States via Language Models

    Authors: Evangelia Spiliopoulou, Artidoro Pagnoni, Yonatan Bisk, Eduard Hovy

    Abstract: This paper investigates models of event implications. Specifically, how well models predict entity state-changes, by targeting their understanding of physical attributes. Nominally, Large Language models (LLM) have been exposed to procedural knowledge about how objects interact, yet our benchmarking shows they fail to reason about the world. Conversely, we also demonstrate that existing approaches… ▽ More

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: EMNLP 2022

  28. arXiv:2210.06849  [pdf, other

    cs.CV

    Retrospectives on the Embodied AI Workshop

    Authors: Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi , et al. (14 additional authors not shown)

    Abstract: We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of… ▽ More

    Submitted 4 December, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

  29. arXiv:2210.04443  [pdf, other

    cs.LG cs.AI cs.CL

    Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue

    Authors: So Yeon Min, Hao Zhu, Ruslan Salakhutdinov, Yonatan Bisk

    Abstract: Embodied dialogue instruction following requires an agent to complete a complex sequence of tasks from a natural language exchange. The recent introduction of benchmarks (Padmakumar et al., 2022) raises the question of how best to train and evaluate models for this multi-turn, multi-agent, long-horizon task. This paper contributes to that conversation, by arguing that imitation learning (IL) and r… ▽ More

    Submitted 11 October, 2022; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: To Appear in the Proceedings of EMNLP 2022

  30. arXiv:2207.02442  [pdf, other

    cs.RO cs.AI cs.LG

    Transformers are Adaptable Task Planners

    Authors: Vidhi Jain, Yixin Lin, Eric Undersander, Yonatan Bisk, Akshara Rai

    Abstract: Every home is different, and every person likes things done in their particular way. Therefore, home robots of the future need to both reason about the sequential nature of day-to-day tasks and generalize to user's preferences. To this end, we propose a Transformer Task Planner(TTP) that learns high-level actions from demonstrations by leveraging object attribute-based representations. TTP can be… ▽ More

    Submitted 6 July, 2022; originally announced July 2022.

    Comments: https://anonymous.4open.science/r/temporal_task_planner-Paper148/

  31. arXiv:2205.11686  [pdf, other

    cs.CL cs.CV

    On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization

    Authors: Shruti Palaskar, Akshita Bhagia, Yonatan Bisk, Florian Metze, Alan W Black, Ana Marasović

    Abstract: Combining the visual modality with pretrained language models has been surprisingly effective for simple descriptive tasks such as image captioning. More general text generation however remains elusive. We take a step back and ask: How do these models work for more complex generative tasks, i.e. conditioning on both text and images? Are multimodal models simply visually adapted language models, or… ▽ More

    Submitted 22 October, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

    Comments: v2: EMNLP Findings 2022 accepted paper camera-ready version. 9 pages main, 2 pages appendix

  32. arXiv:2205.09256  [pdf, other

    cs.CV cs.MM

    Training Vision-Language Transformers from Captions

    Authors: Liangke Gui, Yingshan Chang, Qiuyuan Huang, Subhojit Som, Alex Hauptmann, Jianfeng Gao, Yonatan Bisk

    Abstract: Vision-Language Transformers can be learned without low-level human labels (e.g. class labels, bounding boxes, etc). Existing work, whether explicitly utilizing bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model Visi… ▽ More

    Submitted 14 June, 2023; v1 submitted 18 May, 2022; originally announced May 2022.

  33. arXiv:2203.03022  [pdf, ps, other

    cs.SD cs.AI cs.LG eess.AS stat.ML

    HEAR: Holistic Evaluation of Audio Representations

    Authors: Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu **, Yonatan Bisk

    Abstract: What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR evaluates audio representations using a benchmark suite across a variety of domains, in… ▽ More

    Submitted 29 May, 2022; v1 submitted 6 March, 2022; originally announced March 2022.

    Comments: to appear in Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track

  34. arXiv:2112.09544  [pdf

    cs.CY

    It's Time to Do Something: Mitigating the Negative Impacts of Computing Through a Change to the Peer Review Process

    Authors: Brent Hecht, Lauren Wilcox, Jeffrey P. Bigham, Johannes Schöning, Ehsan Hoque, Jason Ernst, Yonatan Bisk, Luigi De Russis, Lana Yarosh, Bushra Anjum, Danish Contractor, Cathy Wu

    Abstract: The computing research community needs to work much harder to address the downsides of our innovations. Between the erosion of privacy, threats to democracy, and automation's effect on employment (among many other issues), we can no longer simply assume that our research will have a net positive impact on the world. While bending the arc of computing innovation towards societal benefit may at firs… ▽ More

    Submitted 17 December, 2021; originally announced December 2021.

    Comments: First published on the ACM Future of Computing Academy blog on March 29, 2018. This is the archival version

  35. arXiv:2112.08614  [pdf, other

    cs.CL

    KAT: A Knowledge Augmented Transformer for Vision-and-Language

    Authors: Liangke Gui, Borui Wang, Qiuyuan Huang, Alex Hauptmann, Yonatan Bisk, Jianfeng Gao

    Abstract: The primary focus of recent work with largescale transformers has been on optimizing the amount of information packed into the model's parameters. In this work, we ask a different question: Can multimodal transformers leverage explicit knowledge in their reasoning? Existing, primarily unimodal, methods have explored approaches under the paradigm of knowledge retrieval followed by answer prediction… ▽ More

    Submitted 5 May, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

    Comments: Accepted by NAACL 2022

  36. arXiv:2110.08258  [pdf, other

    cs.LG cs.AI cs.HC cs.RO

    A Framework for Learning to Request Rich and Contextually Useful Information from Humans

    Authors: Khanh Nguyen, Yonatan Bisk, Hal Daumé III

    Abstract: When deployed, AI agents will encounter problems that are beyond their autonomous problem-solving capabilities. Leveraging human assistance can help agents overcome their inherent limitations and robustly cope with unfamiliar situations. We present a general interactive framework that enables an agent to request and interpret rich, contextually useful information from an assistant that has knowled… ▽ More

    Submitted 22 June, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: Accepted to ICML 2022

  37. arXiv:2110.07342  [pdf, other

    cs.CL cs.LG

    FILM: Following Instructions in Language with Modular Methods

    Authors: So Yeon Min, Devendra Singh Chaplot, Pradeep Ravikumar, Yonatan Bisk, Ruslan Salakhutdinov

    Abstract: Recent methods for embodied instruction following are typically trained end-to-end using imitation learning. This often requires the use of expert trajectories and low-level language instructions. Such approaches assume that neural states will integrate multimodal semantics to perform state tracking, building spatial memory, exploration, and long-term planning. In contrast, we propose a modular me… ▽ More

    Submitted 16 March, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

    Comments: Published as a conference paper at International Conference on Learning Representations (ICLR) 2022

  38. arXiv:2109.09790  [pdf, other

    cs.CL

    Dependency Induction Through the Lens of Visual Perception

    Authors: Ruisi Su, Shruti Rijhwani, Hao Zhu, Junxian He, Xinyu Wang, Yonatan Bisk, Graham Neubig

    Abstract: Most previous work on grammar induction focuses on learning phrasal or dependency structure purely from text. However, because the signal provided by text alone is limited, recently introduced visually grounded syntax models make use of multimodal information leading to improved performance in constituency grammar induction. However, as compared to dependency grammars, constituency grammars do not… ▽ More

    Submitted 20 September, 2021; originally announced September 2021.

    Comments: Accepted to CoNLL 2021

  39. arXiv:2109.00590  [pdf, other

    cs.CL cs.AI cs.CV cs.LG

    WebQA: Multihop and Multimodal QA

    Authors: Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, Yonatan Bisk

    Abstract: Scaling Visual Question Answering (VQA) to the open-domain and multi-hop nature of web searches, requires fundamental advances in visual representation learning, knowledge aggregation, and language generation. In this work, we introduce WebQA, a challenging new benchmark that proves difficult for large-scale state-of-the-art models which lack language groundable visual representations for novel ob… ▽ More

    Submitted 27 March, 2022; v1 submitted 1 September, 2021; originally announced September 2021.

    Comments: CVPR Camera ready

  40. arXiv:2108.09980  [pdf, other

    cs.CV cs.AI cs.LG

    TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment

    Authors: Jianwei Yang, Yonatan Bisk, Jianfeng Gao

    Abstract: Contrastive learning has been widely used to train transformer-based vision-language models for video-text alignment and multi-modal representation learning. This paper presents a new algorithm called Token-Aware Cascade contrastive learning (TACo) that improves contrastive learning using two novel techniques. The first is the token-aware contrastive loss which is computed by taking into account t… ▽ More

    Submitted 23 August, 2021; originally announced August 2021.

    Comments: Accepted by ICCV 2021

  41. arXiv:2107.12514  [pdf, other

    cs.CL cs.AI cs.CV cs.LG cs.RO

    Language Grounding with 3D Objects

    Authors: Jesse Thomason, Mohit Shridhar, Yonatan Bisk, Chris Paxton, Luke Zettlemoyer

    Abstract: Seemingly simple natural language requests to a robot are generally underspecified, for example "Can you bring me the wireless mouse?" Flat images of candidate mice may not provide the discriminative information needed for "wireless." The world, and objects in it, are not flat images but complex 3D shapes. If a human requests an object based on any of its basic properties, such as color, shape, or… ▽ More

    Submitted 15 September, 2021; v1 submitted 26 July, 2021; originally announced July 2021.

    Comments: Conference on Robot Learning (CoRL) 2021

  42. arXiv:2107.05697  [pdf, other

    cs.CL

    Few-shot Language Coordination by Modeling Theory of Mind

    Authors: Hao Zhu, Graham Neubig, Yonatan Bisk

    Abstract: $\textit{No man is an island.}$ Humans communicate with a large community by coordinating with different interlocutors within short conversations. This ability has been understudied by the research on building neural communicative agents. We study the task of few-shot $\textit{language coordination}… ▽ More

    Submitted 12 July, 2021; originally announced July 2021.

    Comments: Thirty-eighth International Conference on Machine Learning (ICML 2021)

  43. arXiv:2106.02192  [pdf, other

    cs.CL

    Grounding 'Grounding' in NLP

    Authors: Khyathi Raghavi Chandu, Yonatan Bisk, Alan W Black

    Abstract: The NLP community has seen substantial recent interest in grounding to facilitate interaction between language technologies and the world. However, as a community, we use the term broadly to reference any linking of text to data or non-textual modality. In contrast, Cognitive Science more formally defines "grounding" as the process of establishing what mutual information is required for successful… ▽ More

    Submitted 3 June, 2021; originally announced June 2021.

    Comments: 24 pages

  44. arXiv:2104.08666  [pdf, other

    cs.CL

    Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models

    Authors: Tejas Srinivasan, Yonatan Bisk

    Abstract: Numerous works have analyzed biases in vision and pre-trained language models individually - however, less attention has been paid to how these biases interact in multimodal settings. This work extends text-based bias analysis methods to investigate multimodal language models, and analyzes intra- and inter-modality associations and biases learned by these models. Specifically, we demonstrate that… ▽ More

    Submitted 19 May, 2022; v1 submitted 17 April, 2021; originally announced April 2021.

    Comments: Accepted to 4th Workshop on Gender Bias in Natural Language Processing, NAACL 2022

  45. arXiv:2102.00424  [pdf, other

    cs.CL cs.CV cs.LG

    An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games

    Authors: Alessandro Suglia, Yonatan Bisk, Ioannis Konstas, Antonio Vergari, Emanuele Bastianelli, Andrea Vanzo, Oliver Lemon

    Abstract: Guessing games are a prototypical instance of the "learning by interacting" paradigm. This work investigates how well an artificial agent can benefit from playing guessing games when later asked to perform on novel NLP downstream tasks such as Visual Question Answering (VQA). We propose two ways to exploit playing guessing games: 1) a supervised learning scenario in which the agent learns to mimic… ▽ More

    Submitted 31 January, 2021; originally announced February 2021.

    Comments: Accepted paper for the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021)

  46. arXiv:2011.03863  [pdf, other

    cs.CL cs.AI

    Knowledge-driven Data Construction for Zero-shot Evaluation in Commonsense Question Answering

    Authors: Kaixin Ma, Filip Ilievski, Jonathan Francis, Yonatan Bisk, Eric Nyberg, Alessandro Oltramari

    Abstract: Recent developments in pre-trained neural language modeling have led to leaps in accuracy on commonsense question-answering benchmarks. However, there is increasing concern that models overfit to specific tasks, without learning to utilize external knowledge or perform general semantic reasoning. In contrast, zero-shot evaluations have shown promise as a more robust measure of a model's general re… ▽ More

    Submitted 14 December, 2020; v1 submitted 7 November, 2020; originally announced November 2020.

    Comments: AAAI 2021

  47. arXiv:2011.02917  [pdf, other

    cs.CL cs.CV cs.LG

    Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games

    Authors: Alessandro Suglia, Antonio Vergari, Ioannis Konstas, Yonatan Bisk, Emanuele Bastianelli, Andrea Vanzo, Oliver Lemon

    Abstract: In visual guessing games, a Guesser has to identify a target object in a scene by asking questions to an Oracle. An effective strategy for the players is to learn conceptual representations of objects that are both discriminative and expressive enough to ask questions and guess correctly. However, as shown by Suglia et al. (2020), existing models fail to learn truly multi-modal representations, re… ▽ More

    Submitted 5 November, 2020; originally announced November 2020.

    Comments: Accepted to the International Conference on Computational Linguistics (COLING) 2020

  48. arXiv:2010.03768  [pdf, other

    cs.CL cs.AI cs.CV cs.LG cs.RO

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Authors: Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew Hausknecht

    Abstract: Given a simple request like Put a washed apple in the kitchen fridge, humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does… ▽ More

    Submitted 14 March, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

    Comments: ICLR 2021; Data, code, and videos are available at alfworld.github.io

  49. arXiv:2007.15135  [pdf, other

    cs.CL cs.LG

    The Return of Lexical Dependencies: Neural Lexicalized PCFGs

    Authors: Hao Zhu, Yonatan Bisk, Graham Neubig

    Abstract: In this paper we demonstrate that $\textit{context free grammar (CFG) based methods for grammar induction benefit from modeling lexical dependencies}$. This contrasts to the most popular current methods for grammar induction, which focus on discovering $\textit{either}$ constituents $\textit{or}$ dependencies. Previous approaches to marry these two disparate syntactic formalisms (e.g. lexicalized… ▽ More

    Submitted 29 July, 2020; originally announced July 2020.

    Comments: Accepted at TACL 2020

  50. arXiv:2005.00728  [pdf, other

    cs.CL cs.AI cs.CV cs.LG cs.RO

    RMM: A Recursive Mental Model for Dialog Navigation

    Authors: Homero Roman Roman, Yonatan Bisk, Jesse Thomason, Asli Celikyilmaz, Jianfeng Gao

    Abstract: Language-guided robots must be able to both ask humans questions and understand answers. Much existing work focuses only on the latter. In this paper, we go beyond instruction following and introduce a two-agent task where one agent navigates and asks questions that a second, guiding agent answers. Inspired by theory of mind, we propose the Recursive Mental Model (RMM). The navigating agent models… ▽ More

    Submitted 5 October, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

    Comments: Findings of Empirical Methods in Natural Language Processing (EMNLP Findings), 2020