Showing 1–50 of 106 results for author: Krishna, R

Searching in archive cs.
  1. arXiv:2406.18915

    cs.RO cs.CV

    Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

    Authors: Jiafei Duan, Wentao Yuan, Wilbert Pumacay, Yi Ru Wang, Kiana Ehsani, Dieter Fox, Ranjay Krishna

    Abstract: Large-scale endeavors like RT-1 and widespread community efforts such as Open-X-Embodiment have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot demonstration data. Although vision-language models have been shown to automatically generate demonstration data, their utility has been limited t…

    Submitted 27 June, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

    Comments: Project page: https://robot-ma.github.io/

  2. arXiv:2406.16008

    cs.CL cs.AI cs.LG

    Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization

    Authors: Cheng-Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long T. Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, Tomas Pfister

    Abstract: Large language models (LLMs), even when specifically trained to process long input contexts, struggle to capture relevant information located in the middle of their input. This phenomenon has been known as the lost-in-the-middle problem. In this work, we make three contributions. First, we set out to understand the factors that cause this phenomenon. In doing so, we establish a connection between…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: ACL Findings 2024

  3. arXiv:2406.11775

    cs.CV cs.AI

    Task Me Anything

    Authors: Jieyu Zhang, Weikai Huang, Zixian Ma, Oscar Michel, Dong He, Tanmay Gupta, Wei-Chiu Ma, Ali Farhadi, Aniruddha Kembhavi, Ranjay Krishna

    Abstract: Benchmarks for large multimodal language models (MLMs) now serve to simultaneously assess the general capabilities of models instead of evaluating for a specific capability. As a result, when a developer wants to identify which models to use for their application, they are overwhelmed by the number of benchmarks and remain uncertain about which benchmark's results are most reflective of their spec…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: website: https://www.task-me-anything.org

  4. arXiv:2406.10721

    cs.RO cs.AI cs.CV

    RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics

    Authors: Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, Dieter Fox

    Abstract: From rearranging objects on a table to putting groceries into shelves, robots must plan precise action points to perform tasks accurately and reliably. In spite of the recent adoption of vision language models (VLMs) to control robot behavior, VLMs struggle to precisely articulate robot actions using language. We introduce an automatic synthetic data generation pipeline that instruction-tunes VLMs…

    Submitted 15 June, 2024; originally announced June 2024.

  5. arXiv:2406.09403

    cs.CV cs.CL

    Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

    Authors: Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, Ranjay Krishna

    Abstract: Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In t…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 26 pages

  6. arXiv:2406.05184

    cs.CV

    The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better

    Authors: Scott Geng, Cheng-Yu Hsieh, Vivek Ramanujan, Matthew Wallingford, Chun-Liang Li, Pang Wei Koh, Ranjay Krishna

    Abstract: Generative text-to-image models enable us to synthesize unlimited amounts of images in a controllable manner, spurring many recent efforts to train vision models with synthetic data. However, every synthetic image ultimately originates from the upstream data used to train the generator. What additional value does the intermediate generator provide over directly training on relevant parts of the up…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: Correspondence to [email protected]. RK and PWK equally advised the project

  7. arXiv:2405.18400

    cs.CL cs.LG

    Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass

    Authors: Ethan Shen, Alan Fan, Sarah M. Pratt, Jae Sung Park, Matthew Wallingford, Sham M. Kakade, Ari Holtzman, Ranjay Krishna, Ali Farhadi, Aditya Kusupati

    Abstract: Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to provide a draft. Consequently, providing $k$ drafts to the user requires running an expensive language model $k$ times. To…

    Submitted 24 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: 22 pages, 15 figures

  8. arXiv:2405.16915

    cs.CV cs.LG

    Multilingual Diversity Improves Vision-Language Representations

    Authors: Thao Nguyen, Matthew Wallingford, Sebastin Santy, Wei-Chiu Ma, Sewoong Oh, Ludwig Schmidt, Pang Wei Koh, Ranjay Krishna

    Abstract: Massive web-crawled image-text datasets lay the foundation for recent progress in multimodal learning. These datasets are designed with the goal of training a model to do well on standard computer vision benchmarks, many of which, however, have been shown to be English-centric (e.g., ImageNet). Consequently, existing data curation techniques gravitate towards using predominantly English image-text…

    Submitted 27 May, 2024; originally announced May 2024.

  9. arXiv:2405.02793

    cs.CV cs.CL

    ImageInWords: Unlocking Hyper-Detailed Image Descriptions

    Authors: Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bunner, Ranjay Krishna, Jason Baldridge, Radu Soricut

    Abstract: Despite the longstanding adage "an image is worth a thousand words," creating accurate and hyper-detailed image descriptions for training Vision-Language models remains challenging. Current datasets typically have web-scraped descriptions that are short, low-granularity, and often contain details unrelated to the visual content. As a result, models trained on such data generate descriptions replet…

    Submitted 4 May, 2024; originally announced May 2024.

    Comments: Webpage (https://google.github.io/imageinwords), GitHub (https://github.com/google/imageinwords), HuggingFace (https://huggingface.co/datasets/google/imageinwords)

  10. arXiv:2404.15721

    cs.CV cs.AI

    SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision

    Authors: Ankit Vani, Bac Nguyen, Samuel Lavoie, Ranjay Krishna, Aaron Courville

    Abstract: Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception allows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often f…

    Submitted 24 April, 2024; originally announced April 2024.

  11. arXiv:2404.12390

    cs.CV cs.AI cs.CL

    BLINK: Multimodal Large Language Models Can See but Not Perceive

    Authors: Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna

    Abstract: We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challeng…

    Submitted 4 May, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

    Comments: Multimodal Benchmark, Project Url: https://zeyofu.github.io/blink/

  12. arXiv:2404.06089

    cs.HC cs.RO

    EVE: Enabling Anyone to Train Robots using Augmented Reality

    Authors: Jun Wang, Chun-Cheng Chang, Jiafei Duan, Dieter Fox, Ranjay Krishna

    Abstract: The increasing affordability of robot hardware is accelerating the integration of robots into everyday activities. However, training robots to automate tasks typically requires physical robots and expensive demonstration data from trained human annotators. Consequently, only those with access to physical robots produce demonstrations to train robots. To mitigate this issue, we introduce EVE, an iO…

    Submitted 18 May, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: 11 pages

  13. arXiv:2404.02145

    cs.CV

    Iterated Learning Improves Compositionality in Large Vision-Language Models

    Authors: Chenhao Zheng, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna

    Abstract: A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, recent investigations find that most, if not all, of our state-of-the-art vision-language models struggle at compositionality. They are unable to distinguish between images of "a girl in white facing a man…

    Submitted 16 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  14. arXiv:2403.14617

    cs.CV cs.AI cs.LG

    Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion

    Authors: Xiang Fan, Anand Bhattad, Ranjay Krishna

    Abstract: We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantic, spatial, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through…

    Submitted 22 March, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

    Comments: Project page at https://videoshop-editing.github.io/

  15. arXiv:2403.11085

    cs.CV cs.CL

    m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks

    Authors: Zixian Ma, Weikai Huang, Jieyu Zhang, Tanmay Gupta, Ranjay Krishna

    Abstract: Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented…

    Submitted 21 March, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

  16. arXiv:2403.02626

    cs.CV cs.LG

    Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use

    Authors: Imad Eddine Toubal, Aditya Avinash, Neil Gordon Alldrin, Jan Dlabal, Wenlei Zhou, Enming Luo, Otilia Stretcu, Hao Xiong, Chun-Ta Lu, Howard Zhou, Ranjay Krishna, Ariel Fuxman, Tom Duerig

    Abstract: From content moderation to wildlife conservation, the number of applications that require models to recognize nuanced or subjective visual concepts is growing. Traditionally, developing classifiers for such concepts requires substantial manual effort measured in hours, days, or even months to identify and annotate data needed for training. Even with recently proposed Agile Modeling techniques, whi…

    Submitted 19 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  17. arXiv:2402.14590

    cs.IR cs.CL cs.LG

    Scaling Up LLM Reviews for Google Ads Content Moderation

    Authors: Wei Qiao, Tushar Dogra, Otilia Stretcu, Yu-Han Lyu, Tiantian Fang, Dongjin Kwon, Chun-Ta Lu, Enming Luo, Yuan Wang, Chih-Chun Chia, Ariel Fuxman, Fangzhou Wang, Ranjay Krishna, Mehmet Tek

    Abstract: Large language models (LLMs) are powerful tools for content moderation, but their inference costs and latency make them prohibitive for casual use on large datasets, such as the Google Ads repository. This study proposes a method for scaling up LLM reviews for content moderation in Google Ads. First, we use heuristics to select candidates via filtering and duplicate removal, and create clusters of…

    Submitted 7 February, 2024; originally announced February 2024.

  18. arXiv:2402.11359

    cs.AI cs.CL

    Offline Training of Language Model Agents with Functions as Learnable Weights

    Authors: Shaokun Zhang, Jieyu Zhang, Jiale Liu, Linxin Song, Chi Wang, Ranjay Krishna, Qingyun Wu

    Abstract: Researchers and practitioners have recently reframed powerful Large Language Models (LLMs) as agents, enabling them to automate complex tasks largely via the use of specialized functions. To facilitate the development of LLM agents, we present a novel paradigm of training LLM agents without modifying the LLM weights, which is particularly useful when the LLMs are difficult or inaccessible for modi…

    Submitted 7 June, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

    Comments: 22 pages, 10 figures

  19. arXiv:2402.08191

    cs.RO cs.AI cs.LG

    THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

    Authors: Wilbert Pumacay, Ishika Singh, Jiafei Duan, Ranjay Krishna, Jesse Thomason, Dieter Fox

    Abstract: To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of studies evaluate robot performance in environments closely resembling or even identical to the training setup. We present THE COLOSSEUM, a novel simulation benchmark, with 20 diverse manipulation tasks, that enabl…

    Submitted 27 May, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

    Comments: RSS 2024. 33 pages

  20. arXiv:2312.11681

    cs.HC cs.AI cs.CL

    Designing LLM Chains by Adapting Techniques from Crowdsourcing Workflows

    Authors: Madeleine Grunde-McLaughlin, Michelle S. Lam, Ranjay Krishna, Daniel S. Weld, Jeffrey Heer

    Abstract: LLM chains enable complex tasks by decomposing work into a sequence of subtasks. Similarly, the more established techniques of crowdsourcing workflows decompose complex tasks into smaller tasks for human crowdworkers. Chains address LLM errors analogously to the way crowdsourcing workflows address human error. To characterize opportunities for LLM chaining, we survey 107 papers across the crowdsou…

    Submitted 6 May, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  21. arXiv:2312.09067

    cs.CV cs.AI cs.CL cs.RO

    Holodeck: Language Guided Generation of 3D Embodied AI Environments

    Authors: Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, Christopher Clark

    Abstract: 3D simulated environments play a critical role in Embodied AI, but their creation requires expertise and extensive manual effort, restricting their diversity and scope. To mitigate this limitation, we present Holodeck, a system that generates 3D environments to match a user-supplied prompt fully automatically. Holodeck can generate diverse scenes, e.g., arcades, spas, and museums, adjust the designs…

    Submitted 22 April, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

    Comments: Published in CVPR 2024, 21 pages, 27 figures, 2 tables

  22. arXiv:2312.04746

    cs.CV cs.AI cs.CL

    Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos

    Authors: Mehmet Saygin Seyfioglu, Wisdom O. Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, Linda Shapiro

    Abstract: Diagnosis in histopathology requires a global analysis of whole slide images (WSIs), requiring pathologists to compound evidence from different WSI patches. The gigapixel scale of WSIs poses a challenge for histopathology multi-modal models. Training multi-modal models for histopathology requires instruction tuning datasets, which currently contain information for individual image patches, without a…

    Submitted 9 April, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

  23. arXiv:2312.03052

    cs.CV cs.CL

    Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

    Authors: Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, Ariel Fuxman

    Abstract: Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by decomposing such tasks using a large language model (LLM) into an executable program that invokes specialized vision models. However, generated programs are error-pr…

    Submitted 5 April, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: CVPR 2024 Oral

  24. arXiv:2312.02976

    cs.RO cs.AI cs.CV

    Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World

    Authors: Kiana Ehsani, Tanmay Gupta, Rose Hendrix, Jordi Salvador, Luca Weihs, Kuo-Hao Zeng, Kunal Pratap Singh, Yejin Kim, Winson Han, Alvaro Herrasti, Ranjay Krishna, Dustin Schwenk, Eli VanderBilt, Aniruddha Kembhavi

    Abstract: Reinforcement learning (RL) with dense rewards and imitation learning (IL) with human-generated trajectories are the most widely used approaches for training modern embodied agents. RL requires extensive reward shaping and auxiliary losses and is often too slow and ineffective for long-horizon tasks. While IL with human supervision is effective, collecting human trajectories at scale is extremely…

    Submitted 5 December, 2023; originally announced December 2023.

    Comments: First six authors contributed equally. Project page: https://spoc-robot.github.io/

  25. arXiv:2312.00833

    cs.CV

    Lasagna: Layered Score Distillation for Disentangled Object Relighting

    Authors: Dina Bashkirova, Arijit Ray, Rupayan Mallick, Sarah Adel Bargal, Jianming Zhang, Ranjay Krishna, Kate Saenko

    Abstract: Professional artists, photographers, and other visual content creators use object relighting to establish their photo's desired effect. Unfortunately, manual tools that allow relighting have a steep learning curve and are difficult to master. Although generative editing methods now enable some forms of image editing, relighting is still beyond today's capabilities; existing methods struggle to kee…

    Submitted 30 November, 2023; originally announced December 2023.

  26. arXiv:2311.17946

    cs.CV cs.AI cs.CL

    DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback

    Authors: Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan, Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna, Cyrus Rashtchian

    Abstract: Despite their widespread success, Text-to-Image models (T2I) still struggle to produce images that are both aesthetically pleasing and faithful to the user's input text. We introduce DreamSync, a model-agnostic training algorithm by design that improves T2I models to be faithful to the text input. DreamSync builds off a recent insight from TIFA's evaluation framework -- that large vision-language…

    Submitted 28 November, 2023; originally announced November 2023.

  27. arXiv:2311.04193

    cs.CV cs.AI

    Selective Visual Representations Improve Convergence and Generalization for Embodied AI

    Authors: Ainaz Eftekhar, Kuo-Hao Zeng, Jiafei Duan, Ali Farhadi, Ani Kembhavi, Ranjay Krishna

    Abstract: Embodied AI models often employ off-the-shelf vision backbones like CLIP to encode their visual observations. Although such general-purpose representations encode rich syntactic and semantic information about the scene, much of this information is often irrelevant to the specific task at hand. This introduces noise within the learning process and distracts the agent's focus from task-relevant visu…

    Submitted 9 March, 2024; v1 submitted 7 November, 2023; originally announced November 2023.

    Comments: See project website: https://embodied-codebook.github.io

  28. arXiv:2311.00687

    cs.AI cs.CL cs.HC cs.LG

    Improving Interpersonal Communication by Simulating Audiences with Language Models

    Authors: Ryan Liu, Howard Yen, Raja Marjieh, Thomas L. Griffiths, Ranjay Krishna

    Abstract: How do we communicate with others to achieve our goals? We use our prior experience or advice from others, or construct a candidate utterance by predicting how it will be received. However, our experiences are limited and biased, and reasoning about potential outcomes can be difficult and cognitively challenging. In this paper, we explore how we can leverage Large Language Model (LLM) simulations…

    Submitted 3 November, 2023; v1 submitted 1 November, 2023; originally announced November 2023.

    Comments: 16 pages (main paper), 7 tables and figures (main)

  29. arXiv:2310.18235

    cs.CV cs.AI cs.CL cs.LG

    Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation

    Authors: Jaemin Cho, Yushi Hu, Roopal Garg, Peter Anderson, Ranjay Krishna, Jason Baldridge, Mohit Bansal, Jordi Pont-Tuset, Su Wang

    Abstract: Evaluating text-to-image models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering), which uses pre-trained foundational models to automatically generate a set of questions and answers from the prompt, and output images are scored based on whether these answers extracted with a visual question answering model…

    Submitted 13 March, 2024; v1 submitted 27 October, 2023; originally announced October 2023.

    Comments: ICLR 2024; Project website: https://google.github.io/dsg

  30. arXiv:2310.14356

    cs.CV cs.CL cs.CY cs.HC

    Computer Vision Datasets and Models Exhibit Cultural and Linguistic Diversity in Perception

    Authors: Andre Ye, Sebastin Santy, Jena D. Hwang, Amy X. Zhang, Ranjay Krishna

    Abstract: Computer vision often treats human perception as homogeneous: an implicit assumption that visual stimuli are perceived similarly by everyone. This assumption is reflected in the way researchers collect datasets and train vision models. By contrast, literature in cross-cultural psychology and linguistics has provided evidence that people from different cultural backgrounds observe vastly different…

    Submitted 9 March, 2024; v1 submitted 22 October, 2023; originally announced October 2023.

  31. arXiv:2310.03046

    cs.SE cs.AI

    EcoAssistant: Using LLM Assistant More Affordably and Accurately

    Authors: Jieyu Zhang, Ranjay Krishna, Ahmed H. Awadallah, Chi Wang

    Abstract: Today, users ask large language models (LLMs) as assistants to answer queries that require external knowledge; they ask about the weather in a specific city, about stock prices, and even about where specific locations are within their neighborhood. These queries require the LLM to produce code that invokes external APIs to answer the user's question, yet LLMs rarely produce correct code on the fir…

    Submitted 3 October, 2023; originally announced October 2023.

  32. Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code

    Authors: Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, Reyhaneh Jabbarvand

    Abstract: Code translation aims to convert source code from one programming language (PL) to another. Given the promising abilities of large language models (LLMs) in code synthesis, researchers are exploring their potential to automate code translation. The prerequisite for advancing the state of LLM-based code translation is to understand their promises and limitations over existing techniques. To that en…

    Submitted 16 January, 2024; v1 submitted 6 August, 2023; originally announced August 2023.

    Comments: Published in ICSE 2024

  33. arXiv:2308.00675

    cs.CL cs.AI cs.CV cs.LG

    Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models

    Authors: Cheng-Yu Hsieh, Si-An Chen, Chun-Liang Li, Yasuhisa Fujii, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, Tomas Pfister

    Abstract: Today, large language models (LLMs) are taught to use new tools by providing a few demonstrations of the tool's usage. Unfortunately, demonstrations are hard to acquire, and can result in undesirable biased usage if the wrong demonstration is chosen. Even in the rare scenario that demonstrations are readily available, there is no principled selection protocol to determine how many and which ones t…

    Submitted 1 August, 2023; originally announced August 2023.

  34. arXiv:2307.11073

    cs.CV cs.AI cs.GR

    OBJECT 3DIT: Language-guided 3D-aware Image Editing

    Authors: Oscar Michel, Anand Bhattad, Eli VanderBilt, Ranjay Krishna, Aniruddha Kembhavi, Tanmay Gupta

    Abstract: Existing image editing tools, while powerful, typically disregard the underlying 3D geometry from which the image is projected. As a result, edits made using these tools may become detached from the geometry and lighting conditions that are at the foundation of the image formation process. In this work, we formulate the new task of language-guided 3D-aware editing, where objects in an image should…

    Submitted 20 July, 2023; originally announced July 2023.

  35. arXiv:2306.15895

    cs.CL cs.AI cs.LG

    Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

    Authors: Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, Chao Zhang

    Abstract: Large language models (LLMs) have been recently leveraged as training data generators for various natural language processing (NLP) tasks. While previous research has explored different approaches to training models using generated data, they generally rely on simple class-conditional prompts, which may limit the diversity of the generated data and inherit systematic biases of LLM. Thus, we invest…

    Submitted 17 October, 2023; v1 submitted 27 June, 2023; originally announced June 2023.

    Comments: Accepted to NeurIPS 2023 (Datasets and Benchmarks Track)

    Journal ref: NeurIPS 2023

  36. arXiv:2306.15128

    cs.CV cs.AI cs.LG

    MIMIC: Masked Image Modeling with Image Correspondences

    Authors: Kalyani Marathe, Mahtab Bigverdi, Nishat Khan, Tuhin Kundu, Patrick Howe, Sharan Ranjit S, Anand Bhattad, Aniruddha Kembhavi, Linda G. Shapiro, Ranjay Krishna

    Abstract: Dense pixel-specific representation learning at scale has been bottlenecked due to the unavailability of large-scale multi-view datasets. Current methods for building effective pretraining datasets heavily rely on annotated 3D meshes, point clouds, and camera parameters from simulated environments, preventing them from building datasets from real-world data sources where such metadata is lacking.…

    Submitted 15 May, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

  37. arXiv:2306.14610

    cs.CV cs.CL cs.LG

    SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality

    Authors: Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, Ranjay Krishna

    Abstract: In the last year alone, a surge of new benchmarks to measure compositional understanding of vision-language models has permeated the machine learning ecosystem. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors. Surprisingly, we find significant biases in all these benchmarks rendering them hackable. This hackabi…

    Submitted 26 June, 2023; originally announced June 2023.

  38. arXiv:2306.13818

    cs.RO cs.CV

    AR2-D2: Training a Robot Without a Robot

    Authors: Jiafei Duan, Yi Ru Wang, Mohit Shridhar, Dieter Fox, Ranjay Krishna

    Abstract: Diligently gathered human demonstrations serve as the unsung heroes empowering the progression of robot learning. Today, demonstrations are collected by training people to use specialized controllers, which (tele-)operate robots to manipulate a small number of objects. By contrast, we introduce AR2-D2: a system for collecting demonstrations which (1) does not require people with specialized traini…

    Submitted 23 June, 2023; originally announced June 2023.

    Comments: Project website: www.ar2d2.site

  39. arXiv:2306.11207

    cs.CV cs.CL cs.LG

    Quilt-1M: One Million Image-Text Pairs for Histopathology

    Authors: Wisdom Oluchi Ikezogwo, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo, Dylan Stefan Chan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay Krishna, Linda Shapiro

    Abstract: Recent accelerations in multi-modal applications have been made possible with the plethora of image and text data available online. However, the scarcity of analogous data in the medical field, specifically in histopathology, has slowed comparable progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering $1,087$ hours of va…

    Submitted 27 October, 2023; v1 submitted 19 June, 2023; originally announced June 2023.

  40. arXiv:2305.03689

    cs.CV

    COLA: A Benchmark for Compositional Text-to-image Retrieval

    Authors: Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan A. Plummer, Ranjay Krishna, Kate Saenko

    Abstract: Compositional reasoning is a hallmark of human visual intelligence. Yet, despite the size of large vision-language models, they struggle to represent simple compositions by combining objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. To solve Cola, a model must retrieve i…

    Submitted 2 November, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: Accepted to NeurIPS 2023. Webpage: https://cs-people.bu.edu/array/research/cola/

  41. arXiv:2305.02301  [pdf, other]

    cs.CL cs.AI cs.LG

    Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

    Authors: Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister

    Abstract: Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs.…

    Submitted 5 July, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

    Comments: Accepted to Findings of ACL 2023

  42. arXiv:2304.14108  [pdf, other]

    cs.CV cs.CL cs.LG

    DataComp: In search of the next generation of multimodal datasets

    Authors: Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, et al. (9 additional authors not shown)

    Abstract: Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Commo…

    Submitted 20 October, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

    Comments: NeurIPS 2023 Datasets and Benchmarks Track

  43. arXiv:2303.11897  [pdf, other]

    cs.CV

    TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering

    Authors: Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, Noah A Smith

    Abstract: Despite thousands of researchers, engineers, and artists actively working on improving text-to-image generation models, systems often fail to produce images that accurately align with the text inputs. We introduce TIFA (Text-to-Image Faithfulness evaluation with question Answering), an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual ques…

    Submitted 17 August, 2023; v1 submitted 21 March, 2023; originally announced March 2023.

    Comments: Accepted to ICCV 2023

  44. arXiv:2303.04068  [pdf, other]

    cs.DB cs.CV cs.SD eess.AS

    VOCALExplore: Pay-as-You-Go Video Data Exploration and Model Building [Technical Report]

    Authors: Maureen Daum, Enhao Zhang, Dong He, Stephen Mussmann, Brandon Haynes, Ranjay Krishna, Magdalena Balazinska

    Abstract: We introduce VOCALExplore, a system designed to support users in building domain-specific models over video datasets. VOCALExplore supports interactive labeling sessions and trains models using user-supplied labels. VOCALExplore maximizes model quality by automatically deciding how to select samples based on observed skew in the collected labels. It also selects the optimal video representations t…

    Submitted 29 September, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

  45. arXiv:2302.12948  [pdf, other]

    cs.LG cs.AI cs.CV

    Agile Modeling: From Concept to Classifier in Minutes

    Authors: Otilia Stretcu, Edward Vendrow, Kenji Hata, Krishnamurthy Viswanathan, Vittorio Ferrari, Sasan Tavakkol, Wenlei Zhou, Aditya Avinash, Enming Luo, Neil Gordon Alldrin, MohammadHossein Bateni, Gabriel Berger, Andrew Bunner, Chun-Ta Lu, Javier A Rey, Giulia DeSalvo, Ranjay Krishna, Ariel Fuxman

    Abstract: The application of computer vision to nuanced subjective use cases is growing. While crowdsourcing has served the vision community well for most objective tasks (such as labeling a "zebra"), it now falters on tasks where there is substantial subjectivity in the concept (such as identifying "gourmet tuna"). However, empowering any user to develop a classifier for their concept is technically diffic…

    Submitted 12 May, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

  46. arXiv:2301.00929  [pdf, other]

    cs.DB

    EQUI-VOCAL: Synthesizing Queries for Compositional Video Events from Limited User Interactions [Technical Report]

    Authors: Enhao Zhang, Maureen Daum, Dong He, Brandon Haynes, Ranjay Krishna, Magdalena Balazinska

    Abstract: We introduce EQUI-VOCAL: a new system that automatically synthesizes queries over videos from limited user interactions. The user only provides a handful of positive and negative examples of what they are looking for. EQUI-VOCAL utilizes these initial examples and additional ones collected through active learning to efficiently synthesize complex user queries. Our approach enables users to find ev…

    Submitted 8 August, 2023; v1 submitted 2 January, 2023; originally announced January 2023.

    Comments: This is an extended technical report for the following paper: "Enhao Zhang, Maureen Daum, Dong He, Brandon Haynes, Ranjay Krishna, and Magdalena Balazinska. EQUI-VOCAL: Synthesizing Queries for Compositional Video Events from Limited User Interactions. PVLDB, 16(11): 2714-2727, 2023. doi:10.14778/3611479.3611482"

  47. arXiv:2212.07796  [pdf, other]

    cs.CL cs.CV

    CREPE: Can Vision-Language Foundation Models Reason Compositionally?

    Authors: Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, Ranjay Krishna

    Abstract: A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that: across 7 architectures trained with 4 algorithms on massive datasets, they struggle at compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchm…

    Submitted 16 May, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

    Comments: Updated figures and numbers

  48. arXiv:2212.06823  [pdf, other]

    cs.HC cs.AI

    Explanations Can Reduce Overreliance on AI Systems During Decision-Making

    Authors: Helena Vasconcelos, Matthew Jörke, Madeleine Grunde-McLaughlin, Tobias Gerstenberg, Michael Bernstein, Ranjay Krishna

    Abstract: Prior work has identified a resilient phenomenon that threatens the performance of human-AI decision-making teams: overreliance, when people agree with an AI, even when it is incorrect. Surprisingly, overreliance does not reduce when the AI produces explanations for its predictions, compared to only providing predictions. Some have argued that overreliance results from cognitive biases or uncalibr…

    Submitted 26 January, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

    Comments: CSCW 2023

  49. arXiv:2210.04365  [pdf, other]

    cs.MA cs.AI cs.LG

    ELIGN: Expectation Alignment as a Multi-Agent Intrinsic Reward

    Authors: Zixian Ma, Rose Wang, Li Fei-Fei, Michael Bernstein, Ranjay Krishna

    Abstract: Modern multi-agent reinforcement learning frameworks rely on centralized training and reward shaping to perform well. However, centralized training and dense rewards are not readily available in the real world. Current multi-agent algorithms struggle to learn in the alternative setup of decentralized training or sparse rewards. To address these issues, we propose a self-supervised intrinsic reward…

    Submitted 9 November, 2022; v1 submitted 9 October, 2022; originally announced October 2022.

    Comments: This paper will be published in NeurIPS 2022

  50. arXiv:2207.11784  [pdf, other]

    cs.SE

    CARGO: AI-Guided Dependency Analysis for Migrating Monolithic Applications to Microservices Architecture

    Authors: Vikram Nitin, Shubhi Asthana, Baishakhi Ray, Rahul Krishna

    Abstract: Microservices Architecture (MSA) has become a de-facto standard for designing cloud-native enterprise applications due to its efficient infrastructure setup, service availability, elastic scalability, dependability, and better security. Existing (monolithic) systems must be decomposed into microservices to harness these characteristics. Since manual decomposition of large scale applications can be…

    Submitted 6 October, 2022; v1 submitted 24 July, 2022; originally announced July 2022.

    Comments: ACM Distinguished Paper ASE '22, October 10-14, 2022, Ann Arbor, MI, USA

    ACM Class: D.2.11