Showing 1–50 of 284 results for author: Darrell, T

Searching in archive cs.
  1. arXiv:2406.20081  [pdf, other]

    cs.CV cs.LG

    Segment Anything without Supervision

    Authors: XuDong Wang, Jingfeng Yang, Trevor Darrell

    Abstract: The Segmentation Anything Model (SAM) requires labor-intensive data labeling. We present Unsupervised SAM (UnSAM) for promptable and automatic whole-image segmentation that does not require human annotations. UnSAM utilizes a divide-and-conquer strategy to "discover" the hierarchical structure of visual scenes. We first leverage top-down clustering methods to partition an unlabeled image into inst… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: Code: https://github.com/frank-xwang/UnSAM

  2. arXiv:2406.15334  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

    Authors: Brandon Huang, Chancharik Mitra, Assaf Arbelle, Leonid Karlinsky, Trevor Darrell, Roei Herzig

    Abstract: The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, wh… ▽ More

    Submitted 21 June, 2024; originally announced June 2024.

  3. arXiv:2406.12172  [pdf, other]

    cs.AI

    Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems

    Authors: Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky

    Abstract: Recently, Large Language Models (LLMs) attained impressive performance in math and reasoning benchmarks. However, they still often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, containing 11 unique search problem types, each equipped with automated pipelines to generate an arbitrary number of in… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  4. arXiv:2406.11815  [pdf, other]

    cs.RO cs.CV cs.LG

    LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning

    Authors: Dantong Niu, Yuvan Sharma, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi, Trevor Darrell, Roei Herzig

    Abstract: In recent years, instruction-tuned Large Multimodal Models (LMMs) have been successful at several tasks, including image captioning and visual question answering; yet leveraging these models remains an open question for robotics. Prior LMMs for robotics applications have been extensively trained on language and action data, but their ability to generalize in different settings has often been less… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  5. arXiv:2405.08597  [pdf, other]

    cs.LG

    Risks and Opportunities of Open-Source Generative AI

    Authors: Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Aaron Purewal, Csaba Botos, Fabro Steibel, Fazel Keshtkar, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Imperial, Juan Arturo Nolazco, Lori Landay, Matthew Jackson, Phillip H. S. Torr, Trevor Darrell, Yong Lee, Jakob Foerster

    Abstract: Applications of Generative AI (Gen AI) are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about the potential risks of the technology, and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This reg… ▽ More

    Submitted 29 May, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

    Comments: Extension of arXiv:2404.17047

  6. arXiv:2405.03689  [pdf, other]

    cs.CV cs.CL

    Pose Priors from Language Models

    Authors: Sanjay Subramanian, Evonne Ng, Lea Müller, Dan Klein, Shiry Ginosar, Trevor Darrell

    Abstract: We present a zero-shot pose optimization method that enforces accurate physical contact constraints when estimating the 3D pose of humans. Our central insight is that since language is often used to describe physical interaction, large pretrained text-based models can act as priors on pose estimation. We can thus leverage this insight to improve pose estimation by converting natural language des… ▽ More

    Submitted 6 May, 2024; originally announced May 2024.

  7. arXiv:2404.17047  [pdf, other]

    cs.LG

    Near to Mid-term Risks and Opportunities of Open-Source Generative AI

    Authors: Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder de Witt, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Botos Csaba, Fabro Steibel, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Marvin Imperial, Juan A. Nolazco-Flores, Lori Landay, Matthew Jackson, Paul Röttger, Philip H. S. Torr, Trevor Darrell, Yong Suk Lee, Jakob Foerster

    Abstract: In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about potential risks and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This regulation i… ▽ More

    Submitted 24 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Accepted to ICML'24 as a position paper

  8. arXiv:2404.09991  [pdf, other]

    cs.RO cs.CV

    EgoPet: Egomotion and Interaction Data from an Animal's Perspective

    Authors: Amir Bar, Arya Bakhtiar, Danny Tran, Antonio Loquercio, Jathushan Rajasegaran, Yann LeCun, Amir Globerson, Trevor Darrell

    Abstract: Animals perceive the world to plan their actions and interact with other agents to accomplish complex tasks, demonstrating capabilities that are still unmatched by AI systems. To advance our understanding and reduce the gap between the capabilities of animals and AI systems, we introduce a dataset of pet egomotion imagery with diverse examples of simultaneous egomotion and multi-agent interaction.… ▽ More

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: https://www.amirbar.net/egopet

  9. arXiv:2404.05729  [pdf, other]

    cs.CV

    Finding Visual Task Vectors

    Authors: Alberto Hojel, Yutong Bai, Trevor Darrell, Amir Globerson, Amir Bar

    Abstract: Visual Prompting is a technique for teaching models to perform a visual task via in-context examples, without any additional training. In this work, we analyze the activations of MAE-VQGAN, a recent Visual Prompting model, and find task vectors, activations that encode task-specific information. Equipped with this insight, we demonstrate that it is possible to identify the task vectors and use the… ▽ More

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: https://github.com/alhojel/visual_task_vectors

  10. arXiv:2404.02904  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    ALOHa: A New Measure for Hallucination in Captioning Models

    Authors: Suzanne Petryk, David M. Chan, Anish Kachinthaya, Haodi Zou, John Canny, Joseph E. Gonzalez, Trevor Darrell

    Abstract: Despite recent advances in multimodal pre-training for visual description, state-of-the-art models still produce captions containing errors, such as hallucinating objects not present in a scene. The existing prominent metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms. In this work, we propose a modernized open-vocabulary metric, ALOHa, which leverage… ▽ More

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: To appear at NAACL 2024

  11. arXiv:2404.01476  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    TraveLER: A Multi-LMM Agent Framework for Video Question-Answering

    Authors: Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, Roei Herzig

    Abstract: Recently, Large Multimodal Models (LMMs) have made significant progress in video question-answering using a frame-wise approach by leveraging large-scale, image-based pretraining in a zero-shot manner. While image-based methods for videos have shown impressive performance, a current limitation is that they often overlook how key timestamps are selected and cannot adjust when incorrect timestamps a… ▽ More

    Submitted 1 April, 2024; originally announced April 2024.

  12. arXiv:2403.13043  [pdf, other]

    cs.CV

    When Do We Not Need Larger Vision Models?

    Authors: Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, Trevor Darrell

    Abstract: Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations. In this work, we discuss the point beyond which larger vision models are not necessary. First, we demonstrate the power of Scaling on Scales (S$^2$), whereby a pre-trained and frozen smaller vision model (e.g., ViT-B or ViT-L), run over multiple image scales, can outperform larger mo… ▽ More

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: Code: https://github.com/bfshi/scaling_on_scales

  13. arXiv:2403.01915  [pdf, other]

    cs.CV cs.AI

    xT: Nested Tokenization for Larger Context in Large Images

    Authors: Ritwik Gupta, Shufan Li, Tyler Zhu, Jitendra Malik, Trevor Darrell, Karttikeya Mangalam

    Abstract: Modern computer vision pipelines handle large images in one of two sub-optimal ways: down-sampling or cropping. These two methods incur significant losses in the amount of information and context present in an image. There are many downstream applications in which global context matters as much as high frequency details, such as in real-world satellite imagery; in such cases researchers have to ma… ▽ More

    Submitted 4 March, 2024; originally announced March 2024.

  14. arXiv:2402.19469  [pdf, other]

    cs.RO cs.CV cs.LG

    Humanoid Locomotion as Next Token Prediction

    Authors: Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik

    Abstract: We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This gen… ▽ More

    Submitted 29 February, 2024; originally announced February 2024.

  15. arXiv:2402.13144  [pdf, other]

    cs.LG cs.CV

    Neural Network Parameter Diffusion

    Authors: Kai Wang, Zhaopan Xu, Yukun Zhou, Zelin Zang, Trevor Darrell, Zhuang Liu, Yang You

    Abstract: Diffusion models have achieved remarkable success in image and video generation. In this work, we demonstrate that diffusion models can also \textit{generate high-performing neural network parameters}. Our approach is simple, utilizing an autoencoder and a standard latent diffusion model. The autoencoder extracts latent representations of a subset of the trained network parameters. A diffusion mod… ▽ More

    Submitted 28 May, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: We introduce a novel approach for parameter generation, named neural network parameter diffusion (\textbf{p-diff}), which employs a standard latent diffusion model to synthesize a new set of parameters

  16. arXiv:2402.03290  [pdf, other]

    cs.CV cs.AI cs.LG

    InstanceDiffusion: Instance-level Control for Image Generation

    Authors: Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, Ishan Misra

    Abstract: Text-to-image diffusion models produce high quality images but do not offer control over individual instances in the image. We introduce InstanceDiffusion that adds precise instance-level control to text-to-image diffusion models. InstanceDiffusion supports free-form language conditions per instance and allows flexible ways to specify instance locations such as simple single points, scribbles, bou… ▽ More

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: Preprint; Project page: https://people.eecs.berkeley.edu/~xdwang/projects/InstDiff/

  17. arXiv:2401.14391  [pdf, other]

    cs.CV

    Rethinking Patch Dependence for Masked Autoencoders

    Authors: Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, Xudong Wang, Adam Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg

    Abstract: In this work, we re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE). We decompose this decoding mechanism for masked patch reconstruction in MAE into self-attention and cross-attention. Our investigations suggest that self-attention between mask patches is not essential for learning good representations. To this end, we propose a novel pretraining framework:… ▽ More

    Submitted 25 January, 2024; originally announced January 2024.

  18. arXiv:2401.01885  [pdf, other]

    cs.CV

    From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

    Authors: Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, Alexander Richard

    Abstract: We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key behind our method is in combining the benefits of sample diversity from vector quantization with the high-frequency… ▽ More

    Submitted 3 January, 2024; originally announced January 2024.

  19. arXiv:2312.17243  [pdf, other]

    cs.CV

    Unsupervised Universal Image Segmentation

    Authors: Dantong Niu, Xudong Wang, Xinyang Han, Long Lian, Roei Herzig, Trevor Darrell

    Abstract: Several unsupervised image segmentation approaches have been proposed which eliminate the need for dense manually-annotated segmentation masks; current models separately handle either semantic segmentation (e.g., STEGO) or class-agnostic instance segmentation (e.g., CutLER), but not both (i.e., panoptic segmentation). We propose an Unsupervised Universal Segmentation model (U2Seg) adept at perform… ▽ More

    Submitted 28 December, 2023; originally announced December 2023.

  20. arXiv:2312.08366  [pdf, other]

    cs.CV

    See, Say, and Segment: Teaching LMMs to Overcome False Premises

    Authors: Tsung-Han Wu, Giscard Biamby, David Chan, Lisa Dunlap, Ritwik Gupta, Xudong Wang, Joseph E. Gonzalez, Trevor Darrell

    Abstract: Current open-source Large Multimodal Models (LMMs) excel at tasks such as open-vocabulary language grounding and segmentation but can suffer under false premises when queries imply the existence of something that is not actually present in the image. We observe that existing methods that fine-tune an LMM to segment images significantly degrade their ability to reliably determine ("see") if an obje… ▽ More

    Submitted 13 December, 2023; originally announced December 2023.

    Comments: Project Page: https://see-say-segment.github.io

  21. arXiv:2312.02974  [pdf, other]

    cs.CV cs.CL cs.CY cs.LG

    Describing Differences in Image Sets with Natural Language

    Authors: Lisa Dunlap, Yuhui Zhang, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez, Serena Yeung-Levy

    Abstract: How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets, yet manually sifting through thousands of images is impractical. To aid in this discovery process, we explore the task of automatically describing the differences between two $\textbf{sets}$ of images, which we term Set Difference Captioning. This task takes in im… ▽ More

    Submitted 26 April, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: CVPR 2024 Oral

  22. arXiv:2312.02249  [pdf, other]

    cs.CV cs.CL

    Recursive Visual Programming

    Authors: Jiaxin Ge, Sanjay Subramanian, Baifeng Shi, Roei Herzig, Trevor Darrell

    Abstract: Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities, especially in few-shot and zero-shot scenarios. However, existing VP methods generate all code in a single function, resulting in code that is suboptimal in terms o… ▽ More

    Submitted 4 December, 2023; originally announced December 2023.

  23. arXiv:2312.02150  [pdf, other]

    cs.CV

    Readout Guidance: Learning Control from Diffusion Features

    Authors: Grace Luo, Trevor Darrell, Oliver Wang, Dan B Goldman, Aleksander Holynski

    Abstract: We present Readout Guidance, a method for controlling text-to-image diffusion models with learned signals. Readout Guidance uses readout heads, lightweight networks trained to extract signals from the features of a pre-trained, frozen diffusion model at every timestep. These readouts can encode single-image properties, such as pose, depth, and edges; or higher-order properties that relate multiple… ▽ More

    Submitted 2 April, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: CVPR 2024

  24. arXiv:2312.01771  [pdf, other]

    cs.CV

    IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks

    Authors: Jiarui Xu, Yossi Gandelsman, Amir Bar, Jianwei Yang, Jianfeng Gao, Trevor Darrell, Xiaolong Wang

    Abstract: In-context learning allows adapting a model to new tasks given a task description at test time. In this paper, we present IMProv - a generative model that is able to in-context learn visual tasks from multimodal prompts. Given a textual description of a visual task (e.g. "Left: input image, Right: foreground segmentation"), a few input-output visual examples, or both, the model in-context learns t… ▽ More

    Submitted 4 December, 2023; originally announced December 2023.

    Comments: Project page: https://jerryxu.net/IMProv

  25. arXiv:2312.00785  [pdf, other]

    cs.CV

    Sequential Modeling Enables Scalable Learning for Large Vision Models

    Authors: Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros

    Abstract: We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once… ▽ More

    Submitted 1 December, 2023; originally announced December 2023.

    Comments: Website: https://yutongbai.com/lvm.html

  26. arXiv:2311.18823  [pdf, other]

    cs.LG cs.CV

    Initializing Models with Larger Ones

    Authors: Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu

    Abstract: Weight initialization plays an important role in neural network training. Widely used initialization methods are proposed and evaluated for networks that are trained from scratch. However, the growing number of pretrained models now offers new opportunities for tackling this classical problem of weight initialization. In this work, we introduce weight selection, a method for initializing smaller m… ▽ More

    Submitted 30 November, 2023; originally announced November 2023.

  27. arXiv:2311.17942  [pdf, other]

    cs.CV

    Object-based (yet Class-agnostic) Video Domain Adaptation

    Authors: Dantong Niu, Amir Bar, Roei Herzig, Trevor Darrell, Anna Rohrbach

    Abstract: Existing video-based action recognition systems typically require dense annotation and struggle in environments when there is significant distribution shift relative to the training data. Current methods for video domain adaptation typically fine-tune the model using fully annotated data on a subset of target domain data or align the representation of the two domains using bootstrapping or adversa… ▽ More

    Submitted 28 November, 2023; originally announced November 2023.

  28. arXiv:2311.17076  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Compositional Chain-of-Thought Prompting for Large Multimodal Models

    Authors: Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig

    Abstract: The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. O… ▽ More

    Submitted 31 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

  29. arXiv:2311.16090  [pdf, other]

    cs.CV

    Self-correcting LLM-controlled Diffusion Models

    Authors: Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell

    Abstract: Text-to-image generation has witnessed significant progress with the advent of diffusion models. Despite the ability to generate photorealistic images, current text-to-image diffusion models still often struggle to accurately interpret and follow complex input text prompts. In contrast to existing models that aim to generate images only with their best effort, we introduce Self-correcting LLM-cont… ▽ More

    Submitted 27 November, 2023; originally announced November 2023.

    Comments: 16 pages, 10 figures

  30. arXiv:2311.12391  [pdf, other]

    cs.CV

    From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation

    Authors: Jiaxin Ge, Sanjay Subramanian, Trevor Darrell, Boyi Li

    Abstract: Addressing the challenge of adapting pre-trained vision-language models for generating insightful explanations for visual reasoning tasks with limited annotations, we present ReVisE: a $\textbf{Re}$cursive $\textbf{Vis}$ual $\textbf{E}$xplanation algorithm. Our method iteratively computes visual features (conditioned on the text input), an answer, and an explanation, to improve the explanation qua… ▽ More

    Submitted 21 November, 2023; originally announced November 2023.

    Comments: EMNLP 2023 Main

  31. arXiv:2311.06694  [pdf, other]

    cs.CL cs.AI cs.CV cs.RO

    Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding

    Authors: Chancharik Mitra, Abrar Anwar, Rodolfo Corona, Dan Klein, Trevor Darrell, Jesse Thomason

    Abstract: When connecting objects and their language referents in an embodied 3D environment, it is important to note that: (1) an object can be better characterized by leveraging comparative information between itself and other objects, and (2) an object's appearance can vary with camera position. As such, we present the Multi-view Approach to Grounding in Context (MAGiC), which selects an object referent… ▽ More

    Submitted 6 April, 2024; v1 submitted 11 November, 2023; originally announced November 2023.

    Journal ref: North American Chapter of the Association for Computational Linguistics (NAACL), 2024

  32. arXiv:2311.05589  [pdf, other]

    cs.LG math.OC stat.ML

    A Coefficient Makes SVRG Effective

    Authors: Yida Yin, Zhiqiu Xu, Zhiyuan Li, Trevor Darrell, Zhuang Liu

    Abstract: Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang (2013), is a theoretically compelling optimization method. However, as Defazio & Bottou (2019) highlights, its effectiveness in deep learning is yet to be proven. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks. Our analysis finds that, for deeper networks, the strength of the var… ▽ More

    Submitted 9 November, 2023; originally announced November 2023.

    Comments: Preprint

  33. arXiv:2311.01011  [pdf, other]

    cs.LG cs.CR

    Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

    Authors: Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell

    Abstract: While Large Language Models (LLMs) are increasingly being used in real-world applications, they remain vulnerable to prompt injection attacks: malicious third party prompts that subvert the intent of the system designer. To help researchers study this problem, we present a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection, all created by p… ▽ More

    Submitted 2 November, 2023; originally announced November 2023.

  34. arXiv:2310.17688  [pdf, other]

    cs.CY cs.AI cs.CL cs.LG

    Managing extreme AI risks amid rapid progress

    Authors: Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann

    Abstract: Artificial Intelligence (AI) is progressing rapidly, and companies are shifting their focus to developing generalist AI systems that can autonomously act and pursue goals. Increases in capabilities and autonomy may soon massively amplify AI's impact, with risks that include large-scale social harms, malicious uses, and an irreversible loss of human control over autonomous AI systems. Although rese… ▽ More

    Submitted 22 May, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

    Comments: Published in Science: https://www.science.org/doi/10.1126/science.adn0117

  35. arXiv:2310.15166  [pdf, other]

    cs.CV cs.CL

    Large Language Models are Visual Reasoning Coordinators

    Authors: Liangyu Chen, Bo Li, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt Keutzer, Trevor Darrell, Ziwei Liu

    Abstract: Visual reasoning requires multimodal perception and commonsense cognition of the world. Recently, multiple vision-language models (VLMs) have been proposed with excellent commonsense reasoning ability in various domains. However, how to harness the collective power of these complementary VLMs is rarely explored. Existing methods like ensemble still struggle to aggregate these models with the desir… ▽ More

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: Accepted at NeurIPS 2023

  36. arXiv:2310.12971  [pdf, other]

    cs.CV cs.AI cs.CL

    CLAIR: Evaluating Image Captions with Large Language Models

    Authors: David Chan, Suzanne Petryk, Joseph E. Gonzalez, Trevor Darrell, John Canny

    Abstract: The evaluation of machine-generated image captions poses an interesting yet persistent challenge. Effective evaluation measures must consider numerous dimensions of similarity, including semantic relevance, visual structure, object interactions, caption diversity, and specificity. Existing highly-engineered measures attempt to capture specific aspects, but fall short in providing a holistic score… ▽ More

    Submitted 19 October, 2023; originally announced October 2023.

    Comments: To Appear at EMNLP 2023

  37. arXiv:2310.08864  [pdf, other]

    cs.RO

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Authors: Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, et al. (267 additional authors not shown)

    Abstract: Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning method… ▽ More

    Submitted 1 June, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

    Comments: Project website: https://robotics-transformer-x.github.io

  38. arXiv:2309.17444  [pdf, other]

    cs.CV cs.AI cs.CL

    LLM-grounded Video Diffusion Models

    Authors: Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li

    Abstract: Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion. To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language… ▽ More

    Submitted 4 May, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: ICLR 2024. Project Page: https://llm-grounded-video-diffusion.github.io/

  39. arXiv:2309.14525  [pdf, other]

    cs.CV cs.CL

    Aligning Large Multimodal Models with Factually Augmented RLHF

    Authors: Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, Trevor Darrell

    Abstract: Large Multimodal Models (LMM) are built across modalities and the misalignment between two modalities can result in "hallucination", generating textual outputs that are not grounded by the multimodal information in context. To address the multimodal misalignment issue, we adapt the Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment, wher… ▽ More

    Submitted 25 September, 2023; originally announced September 2023.

    Comments: Preprint

  40. arXiv:2308.14710  [pdf, other]

    cs.CV cs.AI cs.LG

    VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

    Authors: Xudong Wang, Ishan Misra, Ziyun Zeng, Rohit Girdhar, Trevor Darrell

    Abstract: Existing approaches to unsupervised video instance segmentation typically rely on motion estimates and experience difficulties tracking small or divergent motions. We present VideoCutLER, a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos. Our key insight is that using high-quality pseudo ma… ▽ More

    Submitted 28 August, 2023; originally announced August 2023.

    Comments: Preprint. Code: https://github.com/facebookresearch/CutLER

  41. arXiv:2308.10897  [pdf, other]

    cs.CV

    Can Language Models Learn to Listen?

    Authors: Evonne Ng, Sanjay Subramanian, Dan Klein, Angjoo Kanazawa, Trevor Darrell, Shiry Ginosar

    Abstract: We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words. Given an input transcription of the speaker's words with their timestamps, our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE. Since gesture is a language component, we propose t… ▽ More

    Submitted 21 August, 2023; originally announced August 2023.

    Comments: ICCV 2023; Project page: https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/

  42. arXiv:2308.00566  [pdf, other]

    cs.CV cs.AI cs.LG

    Stochastic positional embeddings improve masked image modeling

    Authors: Amir Bar, Florian Bordes, Assaf Shocher, Mahmoud Assran, Pascal Vincent, Nicolas Ballas, Trevor Darrell, Amir Globerson, Yann LeCun

    Abstract: Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images. Despite its recent success, learning good representations through MIM remains challenging because it requires predicting the right semantic content in accurate locations. For example, given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determi…

    Submitted 27 February, 2024; v1 submitted 31 July, 2023; originally announced August 2023.

    Comments: Code and models available in https://github.com/amirbar/StoP

  43. arXiv:2307.00764  [pdf, other]

    cs.CV cs.AI cs.LG

    Hierarchical Open-vocabulary Universal Image Segmentation

    Authors: Xudong Wang, Shufan Li, Konstantinos Kallidromitis, Yusuke Kato, Kazuki Kozuka, Trevor Darrell

    Abstract: Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions. However, complex visual scenes can be naturally decomposed into simpler parts and abstracted at multiple levels of granularity, introducing inherent segmentation ambiguity. Unlike existing methods that typically sidestep this ambiguity and treat it as an external factor, ou…

    Submitted 21 December, 2023; v1 submitted 3 July, 2023; originally announced July 2023.

    Comments: Project web-page: http://people.eecs.berkeley.edu/~xdwang/projects/HIPIE/; NeurIPS 2023 Camera-ready

  44. arXiv:2306.11180  [pdf, other]

    cs.CV cs.AI

    Hyperbolic Active Learning for Semantic Segmentation under Domain Shift

    Authors: Luca Franco, Paolo Mandica, Konstantinos Kallidromitis, Devin Guillory, Yu-Teng Li, Trevor Darrell, Fabio Galasso

    Abstract: We introduce a hyperbolic neural network approach to pixel-level active learning for semantic segmentation. Analysis of the data statistics leads to a novel interpretation of the hyperbolic radius as an indicator of data scarcity. In HALO (Hyperbolic Active Learning Optimization), for the first time, we propose the use of epistemic uncertainty as a data acquisition strategy, following the intuitio…

    Submitted 4 June, 2024; v1 submitted 19 June, 2023; originally announced June 2023.

    Comments: ICML 2024. Project repository: https://github.com/paolomandica/HALO

  45. arXiv:2306.10007  [pdf, other]

    cs.RO cs.CV cs.LG

    Robot Learning with Sensorimotor Pre-training

    Authors: Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, Jitendra Malik

    Abstract: We present a self-supervised sensorimotor pre-training approach for robotics. Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens. Given a sequence of camera images, proprioceptive robot states, and actions, we encode the sequence into tokens, mask out a subset, and train a model to predict the missing content from the rest. We hypothesize that if a robot can…

    Submitted 14 December, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

    Comments: CoRL 2023; Project page: https://robotic-pretrained-transformer.github.io

  46. arXiv:2306.09322  [pdf, other]

    cs.CV

    Neural Relighting with Subsurface Scattering by Learning the Radiance Transfer Gradient

    Authors: Shizhan Zhu, Shunsuke Saito, Aljaz Bozic, Carlos Aliaga, Trevor Darrell, Christoph Lassner

    Abstract: Reconstructing and relighting objects and scenes under varying lighting conditions is challenging: existing neural rendering methods often cannot handle the complex interactions between materials and light. Incorporating pre-computed radiance transfer techniques enables global illumination, but still struggles with materials with subsurface scattering effects. We propose a novel framework for lear…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: https://youtu.be/NYKB_Jm8c-Q

  47. arXiv:2306.05392  [pdf, other]

    cs.CL

    Modular Visual Question Answering via Code Generation

    Authors: Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein

    Abstract: We present a framework that formulates visual question answering as modular code generation. In contrast to prior work on modular approaches to VQA, our approach requires no additional training and relies on pre-trained language models (LMs), visual models pre-trained on image-caption pairs, and fifty VQA examples used for in-context learning. The generated Python programs invoke and compose the o…

    Submitted 8 June, 2023; originally announced June 2023.

    Comments: ACL 2023

  48. arXiv:2305.16289  [pdf, other]

    cs.CV cs.AI

    Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation

    Authors: Lisa Dunlap, Alyssa Umino, Han Zhang, Jiezhi Yang, Joseph E. Gonzalez, Trevor Darrell

    Abstract: Many fine-grained classification tasks, like rare animal identification, have limited training data and consequently classifiers trained on these datasets often fail to generalize to variations in the domain like changes in weather or location. As such, we explore how natural language descriptions of the domains seen in training data can be used with large vision models trained on diverse pretrain…

    Submitted 29 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: Update: replaced Planes dataset with Waterbirds & updated results after bug fix

  49. arXiv:2305.15542  [pdf, other]

    cs.CV cs.CL cs.LG

    TOAST: Transfer Learning via Attention Steering

    Authors: Baifeng Shi, Siyu Gai, Trevor Darrell, Xin Wang

    Abstract: Transfer learning involves adapting a pre-trained model to novel downstream tasks. However, we observe that current transfer learning methods often fail to focus on task-relevant features. In this work, we explore refocusing model attention for transfer learning. We introduce Top-Down Attention Steering (TOAST), a novel transfer learning algorithm that keeps the pre-trained backbone frozen, select…

    Submitted 11 July, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Code is available at https://github.com/bfshi/TOAST

  50. arXiv:2305.14705  [pdf, other]

    cs.CL

    Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models

    Authors: Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou

    Abstract: Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we…

    Submitted 5 July, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Preprint