Showing 1–50 of 176 results for author: Fei-Fei, L

Searching in archive cs.

  1. arXiv:2407.00316  [pdf, other]

    cs.CV

    OccFusion: Rendering Occluded Humans with Generative Diffusion Priors

    Authors: Adam Sun, Tiange Xiang, Scott Delp, Li Fei-Fei, Ehsan Adeli

    Abstract: Most existing human rendering methods require every part of the human to be fully visible throughout the input video. However, this assumption does not hold in real-life settings where obstructions are common, resulting in only partial visibility of the human. Considering this, we present OccFusion, an approach that utilizes efficient 3D Gaussian splatting supervised by pretrained 2D diffusion mod…

    Submitted 29 June, 2024; originally announced July 2024.

  2. arXiv:2406.01662  [pdf, other]

    cs.CV cs.AI

    Few-Shot Classification of Interactive Activities of Daily Living (InteractADL)

    Authors: Zane Durante, Robathan Harries, Edward Vendrow, Zelun Luo, Yuta Kyuragi, Kazuki Kozuka, Li Fei-Fei, Ehsan Adeli

    Abstract: Understanding Activities of Daily Living (ADLs) is a crucial step for different applications including assistive robots, smart homes, and healthcare. However, to date, few benchmarks and methods have focused on complex ADLs, especially those involving multi-person interactions in home environments. In this paper, we propose a new dataset and benchmark, InteractADL, for understanding complex ADLs t…

    Submitted 3 June, 2024; originally announced June 2024.

  3. arXiv:2405.10315  [pdf, other]

    cs.RO cs.AI cs.LG

    TRANSIC: Sim-to-Real Policy Transfer by Learning from Online Correction

    Authors: Yunfan Jiang, Chen Wang, Ruohan Zhang, Jiajun Wu, Li Fei-Fei

    Abstract: Learning in simulation and transferring the learned policy to the real world has the potential to enable generalist robots. The key challenge of this approach is to address simulation-to-reality (sim-to-real) gaps. Previous methods often require domain-specific knowledge a priori. We argue that a straightforward way to obtain such knowledge is by asking humans to observe and assist robot policy ex…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: Project website: https://transic-robot.github.io/

  4. arXiv:2405.09546  [pdf, other]

    cs.CV

    BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

    Authors: Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy, Hong-Xing Yu, Josiah Wong, Sanjana Srivastava, Sharon Lee, Shengxin Zha, Laurent Itti, Yunzhu Li, Roberto Martín-Martín, Miao Liu, Pengchuan Zhang, Ruohan Zhang, Li Fei-Fei, Jiajun Wu

    Abstract: The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative, particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and renderin…

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: CVPR 2024 (Highlight). Project website: https://behavior-vision-suite.github.io/

  5. arXiv:2403.09227  [pdf, other]

    cs.RO cs.AI

    BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

    Authors: Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín-Martín, Chen Wang, Gabrael Levine, Wensi Ai, Benjamin Martinez, Hang Yin, Michael Lingelbach, Minjune Hwang, Ayano Hiranaka, Sujay Garlanka, Arman Aydin, Sharon Lee, Jiankai Sun, Mona Anvari, Manasi Sharma, Dhruva Bansal, Samuel Hunter, Kyu-Young Kim, Alan Lou, Caleb R Matthews, et al. (10 additional authors not shown)

    Abstract: We present BEHAVIOR-1K, a comprehensive simulation benchmark for human-centered robotics. BEHAVIOR-1K includes two components, guided and motivated by the results of an extensive survey on "what do you want robots to do for you?". The first is the definition of 1,000 everyday activities, grounded in 50 scenes (houses, gardens, restaurants, offices, etc.) with more than 9,000 objects annotated with…

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: A preliminary version was published at 6th Conference on Robot Learning (CoRL 2022)

  6. arXiv:2403.07788  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation

    Authors: Chen Wang, Haochen Shi, Weizhuo Wang, Ruohan Zhang, Li Fei-Fei, C. Karen Liu

    Abstract: Imitation learning from human hand motion data presents a promising avenue for imbuing robots with human-like dexterity in real-world manipulation tasks. Despite this potential, substantial challenges persist, particularly with the portability of existing hand motion capture (mocap) systems and the difficulty of translating mocap data into effective control policies. To tackle these issues, we int…

    Submitted 12 March, 2024; originally announced March 2024.

  7. arXiv:2403.00833  [pdf, other]

    cs.AI

    Position Paper: Agent AI Towards a Holistic Intelligence

    Authors: Qiuyuan Huang, Naoki Wake, Bidipta Sarkar, Zane Durante, Ran Gong, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Noboru Kuno, Ade Famoti, Ashley Llorens, John Langford, Hoi Vo, Li Fei-Fei, Katsu Ikeuchi, Jianfeng Gao

    Abstract: Recent advancements in large foundation models have remarkably enhanced our understanding of sensory information in open-world environments. In leveraging the power of foundation models, it is crucial for AI research to pivot away from excessive reductionism and toward an emphasis on systems that function as cohesive wholes. Specifically, we emphasize developing Agent AI -- an embodied system that…

    Submitted 28 February, 2024; originally announced March 2024.

    Comments: 22 pages, 4 figures. arXiv admin note: substantial text overlap with arXiv:2401.03568

  8. arXiv:2402.05929  [pdf, other]

    cs.AI cs.LG cs.RO

    An Interactive Agent Foundation Model

    Authors: Zane Durante, Bidipta Sarkar, Ran Gong, Rohan Taori, Yusuke Noda, Paul Tang, Ehsan Adeli, Shrinidhi Kowshika Lakshmikanth, Kevin Schulman, Arnold Milstein, Demetri Terzopoulos, Ade Famoti, Noboru Kuno, Ashley Llorens, Hoi Vo, Katsu Ikeuchi, Li Fei-Fei, Jianfeng Gao, Naoki Wake, Qiuyuan Huang

    Abstract: The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training paradi…

    Submitted 17 June, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

  9. arXiv:2401.03568  [pdf, other]

    cs.AI cs.HC cs.LG

    Agent AI: Surveying the Horizons of Multimodal Interaction

    Authors: Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, Katsushi Ikeuchi, Hoi Vo, Li Fei-Fei, Jianfeng Gao

    Abstract: Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the a…

    Submitted 25 January, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

  10. arXiv:2401.00431  [pdf, other]

    cs.CV

    Wild2Avatar: Rendering Humans Behind Occlusions

    Authors: Tiange Xiang, Adam Sun, Scott Delp, Kazuki Kozuka, Li Fei-Fei, Ehsan Adeli

    Abstract: Rendering the visual appearance of moving humans from occluded monocular videos is a challenging task. Most existing research renders 3D humans under ideal conditions, requiring a clear and unobstructed scene. Those methods cannot be used to render humans in real-world scenes where obstacles may block the camera's view and lead to partial occlusions. In this work, we present Wild2Avatar, a neural…

    Submitted 31 December, 2023; originally announced January 2024.

  11. arXiv:2312.12791  [pdf, other]

    cs.RO cs.AI cs.LG

    Model-Based Control with Sparse Neural Dynamics

    Authors: Ziang Liu, Genggeng Zhou, Jeff He, Tobia Marcucci, Li Fei-Fei, Jiajun Wu, Yunzhu Li

    Abstract: Learning predictive models from observations using deep neural networks (DNNs) is a promising new approach to many real-world planning and control problems. However, common DNNs are too unstructured for effective planning, and current control methods typically rely on extensive sampling or local gradient descent. In this paper, we propose a new framework for integrated model learning and predictiv…

    Submitted 20 December, 2023; originally announced December 2023.

    Comments: Accepted at NeurIPS 2023. For tutorial code and additional visualizations, see https://robopil.github.io/Sparse-Dynamics/

  12. arXiv:2312.06662  [pdf, other]

    cs.CV cs.AI cs.LG

    Photorealistic Video Generation with Diffusion Models

    Authors: Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama

    Abstract: We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for…

    Submitted 11 December, 2023; originally announced December 2023.

    Comments: Project website https://walt-video-diffusion.github.io/

  13. arXiv:2312.04474  [pdf, other]

    cs.CL cs.AI cs.LG cs.RO

    Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

    Authors: Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, Brian Ichter

    Abstract: Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter - we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for semantic ones (and in particular, those that are a mix of both). For example, consider prompting an…

    Submitted 7 December, 2023; v1 submitted 7 December, 2023; originally announced December 2023.

  14. arXiv:2311.04287  [pdf, other]

    cs.CV cs.LG

    Holistic Evaluation of Text-To-Image Models

    Authors: Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Benita Teufel, Marco Bellagente, Minguk Kang, Taesung Park, Jure Leskovec, Jun-Yan Zhu, Li Fei-Fei, Jiajun Wu, Stefano Ermon, Percy Liang

    Abstract: The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we…

    Submitted 7 November, 2023; originally announced November 2023.

    Comments: NeurIPS 2023. First three authors contributed equally

  15. arXiv:2311.01454  [pdf, other]

    cs.RO cs.AI

    NOIR: Neural Signal Operated Intelligent Robots for Everyday Activities

    Authors: Ruohan Zhang, Sharon Lee, Minjune Hwang, Ayano Hiranaka, Chen Wang, Wensi Ai, Jin Jie Ryan Tan, Shreya Gupta, Yilun Hao, Gabrael Levine, Ruohan Gao, Anthony Norcia, Li Fei-Fei, Jiajun Wu

    Abstract: We present Neural Signal Operated Intelligent Robots (NOIR), a general-purpose, intelligent brain-robot interface system that enables humans to command robots to perform everyday activities through brain signals. Through this interface, humans communicate their intended objects of interest and actions to the robots using electroencephalography (EEG). Our novel system demonstrates success in an exp…

    Submitted 2 November, 2023; originally announced November 2023.

  16. arXiv:2310.17994  [pdf, other]

    cs.CV cs.GR

    ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

    Authors: Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, Jiajun Wu

    Abstract: We introduce a 3D-aware diffusion model, ZeroNVS, for single-image novel view synthesis for in-the-wild scenes. While existing methods are designed for single objects with masked backgrounds, we propose new techniques to address challenges introduced by in-the-wild multi-object scenes with complex backgrounds. Specifically, we train a generative prior on a mixture of data sources that capture obje…

    Submitted 23 April, 2024; v1 submitted 27 October, 2023; originally announced October 2023.

    Comments: Accepted to CVPR 2024. 12 pages

  17. arXiv:2310.08864  [pdf, other]

    cs.RO

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Authors: Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, et al. (267 additional authors not shown)

    Abstract: Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning method…

    Submitted 1 June, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

    Comments: Project website: https://robotics-transformer-x.github.io

  18. arXiv:2310.01824  [pdf, other]

    cs.AI cs.LG cs.RO

    Mini-BEHAVIOR: A Procedurally Generated Benchmark for Long-horizon Decision-Making in Embodied AI

    Authors: Emily Jin, Jiaheng Hu, Zhuoyi Huang, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Roberto Martín-Martín

    Abstract: We present Mini-BEHAVIOR, a novel benchmark for embodied AI that challenges agents to use reasoning and decision-making skills to solve complex activities that resemble everyday human challenges. The Mini-BEHAVIOR environment is a fast, realistic Gridworld environment that offers the benefits of rapid prototyping and ease of use while preserving a symbolic level of physical realism and complexity…

    Submitted 27 December, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

  19. arXiv:2309.16118  [pdf, other]

    cs.RO cs.CV cs.LG

    D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation

    Authors: Yixuan Wang, Zhuoran Li, Mingtong Zhang, Katherine Driggs-Campbell, Jiajun Wu, Li Fei-Fei, Yunzhu Li

    Abstract: Scene representation has been a crucial design choice in robotic manipulation systems. An ideal representation should be 3D, dynamic, and semantic to meet the demands of diverse manipulation tasks. However, previous works often lack all three properties simultaneously. In this work, we introduce D$^3$Fields - dynamic 3D descriptor fields. These fields capture the dynamics of the underlying 3D envi…

    Submitted 8 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Project Page: https://robopil.github.io/d3fields/

  20. arXiv:2309.09971  [pdf, other]

    cs.AI cs.HC cs.MA

    MindAgent: Emergent Gaming Interaction

    Authors: Ran Gong, Qiuyuan Huang, Xiaojian Ma, Hoi Vo, Zane Durante, Yusuke Noda, Zilong Zheng, Song-Chun Zhu, Demetri Terzopoulos, Li Fei-Fei, Jianfeng Gao

    Abstract: Large Language Models (LLMs) have the capacity of performing complex scheduling in a multi-agent system and can coordinate these agents into completing sophisticated tasks that require extensive collaboration. However, despite the introduction of numerous gaming frameworks, the community has insufficient benchmarks towards building general multi-agents collaboration infrastructure that encompass b…

    Submitted 19 September, 2023; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: The first three authors contributed equally. 28 pages

  21. arXiv:2309.00987  [pdf, other]

    cs.RO cs.AI cs.LG

    Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation

    Authors: Yuanpei Chen, Chen Wang, Li Fei-Fei, C. Karen Liu

    Abstract: Many real-world manipulation tasks consist of a series of subtasks that are significantly different from one another. Such long-horizon, complex tasks highlight the potential of dexterous hands, which possess adaptability and versatility, capable of seamlessly transitioning between different modes of functionality without the need for re-grasping or external tools. However, the challenges arise du…

    Submitted 16 October, 2023; v1 submitted 2 September, 2023; originally announced September 2023.

    Comments: 7th Conference on Robot Learning (CoRL 2023)

  22. arXiv:2308.04622  [pdf, other]

    cs.CV

    Rendering Humans from Object-Occluded Monocular Videos

    Authors: Tiange Xiang, Adam Sun, Jiajun Wu, Ehsan Adeli, Li Fei-Fei

    Abstract: 3D understanding and rendering of moving humans from monocular videos is a challenging task. Despite recent progress, the task remains difficult in real-world scenarios, where obstacles may block the camera view and cause partial occlusions in the captured videos. Existing methods cannot handle such defects due to two reasons. First, the standard rendering strategy relies on point-point mapping, w…

    Submitted 8 August, 2023; originally announced August 2023.

    Comments: ICCV 2023, project page: https://cs.stanford.edu/~xtiange/projects/occnerf/

  23. arXiv:2307.15801  [pdf, other]

    cs.RO cs.AI

    Primitive Skill-based Robot Learning from Human Evaluative Feedback

    Authors: Ayano Hiranaka, Minjune Hwang, Sharon Lee, Chen Wang, Li Fei-Fei, Jiajun Wu, Ruohan Zhang

    Abstract: Reinforcement learning (RL) algorithms face significant challenges when dealing with long-horizon robot manipulation tasks in real-world environments due to sample inefficiency and safety issues. To overcome these challenges, we propose a novel framework, SEED, which leverages two approaches: reinforcement learning from human feedback (RLHF) and primitive skill-based reinforcement learning. Both a…

    Submitted 2 August, 2023; v1 submitted 28 July, 2023; originally announced July 2023.

  24. arXiv:2307.05973  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Authors: Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, Li Fei-Fei

    Abstract: Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dens…

    Submitted 2 November, 2023; v1 submitted 12 July, 2023; originally announced July 2023.

  25. arXiv:2306.16700  [pdf, other]

    cs.RO cs.CV cs.LG

    Dynamic-Resolution Model Learning for Object Pile Manipulation

    Authors: Yixuan Wang, Yunzhu Li, Katherine Driggs-Campbell, Li Fei-Fei, Jiajun Wu

    Abstract: Dynamics models learned from visual observations have shown to be effective in various robotic manipulation tasks. One of the key questions for learning such dynamics models is what scene representation to use. Prior works typically assume representation at a fixed dimension or resolution, which may be inefficient for simple tasks and ineffective for more complicated tasks. In this work, we invest…

    Submitted 29 June, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

    Comments: Accepted to Robotics: Science and Systems (RSS) 2023. The first two authors contributed equally. Project Page: https://robopil.github.io/dyn-res-pile-manip

  26. arXiv:2306.15742  [pdf, other]

    cs.CV

    Differentially Private Video Activity Recognition

    Authors: Zelun Luo, Yuliang Zou, Yijin Yang, Zane Durante, De-An Huang, Zhiding Yu, Chaowei Xiao, Li Fei-Fei, Animashree Anandkumar

    Abstract: In recent years, differential privacy has seen significant advancements in image classification; however, its application to video activity recognition remains under-explored. This paper addresses the challenges of applying differential privacy to video activity recognition, which primarily stem from: (1) a discrepancy between the desired privacy level for entire videos and the nature of input dat…

    Submitted 27 June, 2023; originally announced June 2023.

  27. arXiv:2306.13760  [pdf, other]

    cs.AI

    Task-Driven Graph Attention for Hierarchical Relational Object Navigation

    Authors: Michael Lingelbach, Chengshu Li, Minjune Hwang, Andrey Kurenkov, Alan Lou, Roberto Martín-Martín, Ruohan Zhang, Li Fei-Fei, Jiajun Wu

    Abstract: Embodied AI agents in large scenes often need to navigate to find objects. In this work, we study a naturally emerging variant of the object navigation task, hierarchical relational object navigation (HRON), where the goal is to find objects specified by logical predicates organized in a hierarchical structure - objects related to furniture and then to rooms - such as finding an apple on top of a…

    Submitted 23 June, 2023; originally announced June 2023.

  28. arXiv:2306.01623  [pdf, other]

    cs.CV cs.AI cs.LG

    HomE: Homography-Equivariant Video Representation Learning

    Authors: Anirudh Sriram, Adrien Gaidon, Jiajun Wu, Juan Carlos Niebles, Li Fei-Fei, Ehsan Adeli

    Abstract: Recent advances in self-supervised representation learning have enabled more efficient and robust model performance without relying on extensive labeled data. However, most works are still focused on images, with few working on videos and even fewer on multi-view videos, where more powerful inductive biases can be leveraged for self-supervision. In this work, we propose a novel method for represen…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: 10 pages, 4 figures, 4 tables

  29. arXiv:2306.00956  [pdf, other]

    cs.CV cs.AI cs.GR cs.HC cs.RO

    The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects

    Authors: Ruohan Gao, Yiming Dou, Hao Li, Tanmay Agarwal, Jeannette Bohg, Yunzhu Li, Li Fei-Fei, Jiajun Wu

    Abstract: We introduce the ObjectFolder Benchmark, a benchmark suite of 10 tasks for multisensory object-centric learning, centered around object recognition, reconstruction, and manipulation with sight, sound, and touch. We also introduce the ObjectFolder Real dataset, including the multisensory measurements for 100 real-world household objects, building upon a newly designed pipeline for collecting the 3D…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: In CVPR 2023. Project page: https://objectfolder.stanford.edu/. ObjectFolder Real demo: https://www.objectfolder.org/swan_vis/. Gao, Dou, and Li contributed equally to this work

  30. arXiv:2306.00923  [pdf, other]

    cs.RO cs.CV cs.HC

    Sonicverse: A Multisensory Simulation Platform for Embodied Household Agents that See and Hear

    Authors: Ruohan Gao, Hao Li, Gokul Dharan, Zhuzhu Wang, Chengshu Li, Fei Xia, Silvio Savarese, Li Fei-Fei, Jiajun Wu

    Abstract: Developing embodied agents in simulation has been a key research topic in recent years. Exciting new tasks, algorithms, and benchmarks have been developed in various simulators. However, most of them assume deaf agents in silent environments, while we humans perceive the world with multiple senses. We introduce Sonicverse, a multisensory simulation platform with integrated audio-visual simulation…

    Submitted 16 September, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: In ICRA 2023. Project page: https://ai.stanford.edu/~rhgao/sonicverse/. Code: https://github.com/StanfordVL/sonicverse. Gao and Li contributed equally to this work and are in alphabetical order

  31. arXiv:2305.17537  [pdf, other]

    cs.LG cs.AI

    Modeling Dynamic Environments with Scene Graph Memory

    Authors: Andrey Kurenkov, Michael Lingelbach, Tanmay Agarwal, Emily Jin, Chengshu Li, Ruohan Zhang, Li Fei-Fei, Jiajun Wu, Silvio Savarese, Roberto Martín-Martín

    Abstract: Embodied AI agents that search for objects in large environments such as households often need to make efficient decisions by predicting object locations based on partial information. We pose this as a new type of link prediction problem: link prediction on partially observable dynamic graphs. Our graph is a representation of a scene in which rooms and objects are nodes, and their relationships ar…

    Submitted 12 June, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

  32. arXiv:2305.14344  [pdf, other]

    cs.CV cs.LG

    Siamese Masked Autoencoders

    Authors: Agrim Gupta, Jiajun Wu, Jia Deng, Li Fei-Fei

    Abstract: Establishing correspondence between images or scenes is a significant challenge in computer vision, especially given occlusions, viewpoint changes, and varying object appearances. In this paper, we present Siamese Masked Autoencoders (SiamMAE), a simple extension of Masked Autoencoders (MAE) for learning visual correspondence from videos. SiamMAE operates on pairs of randomly sampled video frames…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: Project page https://siam-mae-video.github.io/

  33. arXiv:2305.13567  [pdf, other]

    cs.RO

    M-EMBER: Tackling Long-Horizon Mobile Manipulation via Factorized Domain Transfer

    Authors: Bohan Wu, Roberto Martin-Martin, Li Fei-Fei

    Abstract: In this paper, we propose a method to create visuomotor mobile manipulation solutions for long-horizon activities. We propose to leverage the recent advances in simulation to train visual solutions for mobile manipulation. While previous works have shown success applying this procedure to autonomous visual navigation and stationary manipulation, applying it to long-horizon visuomotor mobile manipu…

    Submitted 22 May, 2023; originally announced May 2023.

  34. arXiv:2302.12422  [pdf, other]

    cs.RO

    MimicPlay: Long-Horizon Imitation Learning by Watching Human Play

    Authors: Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, Anima Anandkumar

    Abstract: Imitation learning from human demonstrations is a promising paradigm for teaching robots manipulation skills in the real world. However, learning complex long-horizon tasks often requires an unattainable amount of demonstrations. To reduce the high data requirement, we resort to human play data - video sequences of people freely interacting with the environment using their hands. Even with differe…

    Submitted 13 October, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

    Comments: 7th Conference on Robot Learning (CoRL 2023 oral presentation)

  35. arXiv:2212.03858  [pdf, other]

    cs.RO cs.CV

    See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation

    Authors: Hao Li, Yizhi Zhang, Junzhe Zhu, Shaoxiong Wang, Michelle A Lee, Huazhe Xu, Edward Adelson, Li Fei-Fei, Ruohan Gao, Jiajun Wu

    Abstract: Humans use all of their senses to accomplish different tasks in everyday activities. In contrast, existing work on robotic manipulation mostly relies on one, or occasionally two modalities, such as vision and touch. In this work, we systematically study how visual, auditory, and tactile perception can jointly help robots to solve complex manipulation tasks. We build a robot system that can see wit…

    Submitted 8 December, 2022; v1 submitted 7 December, 2022; originally announced December 2022.

    Comments: In CoRL 2022. Li and Zhang equal contribution; Gao and Wu equal advising. Project page: https://ai.stanford.edu/~rhgao/see_hear_feel/

  36. arXiv:2211.06134  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Active Task Randomization: Learning Robust Skills via Unsupervised Generation of Diverse and Feasible Tasks

    Authors: Kuan Fang, Toki Migimatsu, Ajay Mandlekar, Li Fei-Fei, Jeannette Bohg

    Abstract: Solving real-world manipulation tasks requires robots to have a repertoire of skills applicable to a wide range of circumstances. When using learning-based methods to acquire such skills, the key challenge is to obtain training data that covers diverse and feasible variations of the task, which often requires non-trivial manual labor and domain knowledge. In this work, we introduce Active Task Ran…

    Submitted 18 April, 2023; v1 submitted 11 November, 2022; originally announced November 2022.

    Comments: 9 pages, 5 figures

  37. arXiv:2210.06849  [pdf, other]

    cs.CV

    Retrospectives on the Embodied AI Workshop

    Authors: Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, et al. (14 additional authors not shown)

    Abstract: We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of…

    Submitted 4 December, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

  38. arXiv:2210.04365  [pdf, other]

    cs.MA cs.AI cs.LG

    ELIGN: Expectation Alignment as a Multi-Agent Intrinsic Reward

    Authors: Zixian Ma, Rose Wang, Li Fei-Fei, Michael Bernstein, Ranjay Krishna

    Abstract: Modern multi-agent reinforcement learning frameworks rely on centralized training and reward shaping to perform well. However, centralized training and dense rewards are not readily available in the real world. Current multi-agent algorithms struggle to learn in the alternative setup of decentralized training or sparse rewards. To address these issues, we propose a self-supervised intrinsic reward…

    Submitted 9 November, 2022; v1 submitted 9 October, 2022; originally announced October 2022.

    Comments: This paper will be published in NeurIPS 2022

  39. arXiv:2210.03094  [pdf, other]

    cs.RO cs.AI cs.LG

    VIMA: General Robot Manipulation with Multimodal Prompts

    Authors: Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, Linxi Fan

    Abstract: Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different…

    Submitted 28 May, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: ICML 2023 Camera-ready version. Project website: https://vimalabs.github.io/

  40. arXiv:2207.00106  [pdf, other]

    cs.CV cs.LG eess.IV

    GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for Few-Shot Gait Impairment Severity Estimation

    Authors: Mark Endo, Kathleen L. Poston, Edith V. Sullivan, Li Fei-Fei, Kilian M. Pohl, Ehsan Adeli

    Abstract: Parkinson's disease (PD) is a neurological disorder that has a variety of observable motor-related symptoms such as slow movement, tremor, muscular rigidity, and impaired posture. PD is typically diagnosed by evaluating the severity of motor impairments according to scoring systems such as the Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS). Automated severity predic…

    Submitted 30 June, 2022; originally announced July 2022.

    Comments: Accepted as a conference paper at MICCAI (Medical Image Computing and Computer Assisted Intervention) 2022

  41. arXiv:2206.11894  [pdf, other]

    cs.CV cs.LG cs.RO

    MaskViT: Masked Visual Pre-Training for Video Prediction

    Authors: Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, Li Fei-Fei

    Abstract: The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memo…

    Submitted 6 August, 2022; v1 submitted 23 June, 2022; originally announced June 2022.

    Comments: Project page: https://maskedvit.github.io/

  42. arXiv:2206.06489  [pdf, other]

    cs.AI cs.CV cs.RO

    BEHAVIOR in Habitat 2.0: Simulator-Independent Logical Task Description for Benchmarking Embodied AI Agents

    Authors: Ziang Liu, Roberto Martín-Martín, Fei Xia, Jiajun Wu, Li Fei-Fei

    Abstract: Robots excel in performing repetitive and precision-sensitive tasks in controlled environments such as warehouses and factories, but have not been yet extended to embodied AI agents providing assistance in household tasks. Inspired by the catalyzing effect that benchmarks have played in the AI fields such as computer vision and natural language processing, the community is looking for new benchmar…

    Submitted 13 June, 2022; originally announced June 2022.

  43. arXiv:2206.03891  [pdf, other]

    cs.CV cs.AI cs.CR cs.LG eess.IV

    PrivHAR: Recognizing Human Actions From Privacy-preserving Lens

    Authors: Carlos Hinojosa, Miguel Marquez, Henry Arguello, Ehsan Adeli, Li Fei-Fei, Juan Carlos Niebles

    Abstract: The accelerated use of digital cameras prompts an increasing concern about privacy and security, particularly in applications such as action recognition. In this paper, we propose an optimizing framework to provide robust visual privacy protection along the human action recognition pipeline. Our framework parameterizes the camera lens to successfully degrade the quality of the videos to inhibit pr…

    Submitted 29 January, 2023; v1 submitted 8 June, 2022; originally announced June 2022.

    Comments: Oral paper presented at European Conference on Computer Vision (ECCV) 2022, in Tel Aviv, Israel

    Journal ref: Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part IV

  44. arXiv:2206.01720  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Revisiting the "Video" in Video-Language Understanding

    Authors: Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, Juan Carlos Niebles

    Abstract: What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models co…

    Submitted 3 June, 2022; originally announced June 2022.

    Comments: CVPR 2022 (Oral)

  45. arXiv:2205.07993  [pdf, other]

    cs.RO

    Generalizable Task Planning through Representation Pretraining

    Authors: Chen Wang, Danfei Xu, Li Fei-Fei

    Abstract: The ability to plan for multi-step manipulation tasks in unseen situations is crucial for future home robots. But collecting sufficient experience data for end-to-end learning is often infeasible in the real world, as deploying robots in many environments can be prohibitively expensive. On the other hand, large-scale scene understanding datasets contain diverse and rich semantic and geometric info…

    Submitted 16 May, 2022; originally announced May 2022.

  46. arXiv:2204.02389  [pdf, other]

    cs.CV cs.LG cs.RO cs.SD eess.AS

    ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer

    Authors: Ruohan Gao, Zilin Si, Yen-Yu Chang, Samuel Clarke, Jeannette Bohg, Li Fei-Fei, Wenzhen Yuan, Jiajun Wu

    Abstract: Objects play a crucial role in our everyday activities. Though multisensory object-centric learning has shown great potential lately, the modeling of objects in prior work is rather unrealistic. ObjectFolder 1.0 is a recent dataset that introduces 100 virtualized objects with visual, acoustic, and tactile sensory data. However, the dataset is small in scale and the multisensory data is of limited…

    Submitted 5 April, 2022; originally announced April 2022.

    Comments: In CVPR 2022. Gao, Si, and Chang contributed equally to this work. Project page: https://ai.stanford.edu/~rhgao/objectfolder2.0/

  47. arXiv:2203.11931  [pdf, other]

    cs.LG cs.NE cs.RO

    MetaMorph: Learning Universal Controllers with Transformers

    Authors: Agrim Gupta, Linxi Fan, Surya Ganguli, Li Fei-Fei

    Abstract: Multiple domains like vision, natural language, and audio are witnessing tremendous progress by leveraging Transformers for large scale pre-training followed by task specific fine tuning. In contrast, in robotics we primarily train a single robot for a single task. However, modular robot systems now allow for the flexible combination of general-purpose building blocks into task optimized morpholog…

    Submitted 22 March, 2022; originally announced March 2022.

    Comments: ICLR 2022

  48. arXiv:2112.05251  [pdf, other]

    cs.RO cs.AI cs.LG

    Error-Aware Imitation Learning from Teleoperation Data for Mobile Manipulation

    Authors: Josiah Wong, Albert Tung, Andrey Kurenkov, Ajay Mandlekar, Li Fei-Fei, Silvio Savarese, Roberto Martín-Martín

    Abstract: In mobile manipulation (MM), robots can both navigate within and interact with their environment and are thus able to complete many more tasks than robots only capable of navigation or manipulation. In this work, we explore how to apply imitation learning (IL) to learn continuous visuo-motor policies for MM tasks. Much prior work has shown that IL can train visuo-motor policies for either manipula…

    Submitted 9 December, 2021; originally announced December 2021.

    Comments: CoRL 2021

  49. Visual Intelligence through Human Interaction

    Authors: Ranjay Krishna, Mitchell Gordon, Li Fei-Fei, Michael Bernstein

    Abstract: Over the last decade, Computer Vision, the branch of Artificial Intelligence aimed at understanding the visual world, has evolved from simply recognizing objects in images to describing pictures, answering questions about images, aiding robots maneuver around physical spaces and even generating novel visual content. As these tasks and applications have modernized, so too has the reliance on more d…

    Submitted 12 November, 2021; originally announced November 2021.

    Comments: This is a preprint of the following chapter: Ranjay Krishna, Mitchell Gordon, Li Fei-Fei, Michael Bernstein, Visual Intelligence through Human Interaction, published in Artificial Intelligence for Human Computer Interaction: A Modern Approach, edited by Yang Li and Otmar Hilliges, 2021, Springer reproduced with permission of Springer Nature. arXiv admin note: substantial text overlap with arXiv:1602.04506, arXiv:1904.01121

  50. arXiv:2109.10312  [pdf, other]

    cs.RO cs.AI cs.LG

    Example-Driven Model-Based Reinforcement Learning for Solving Long-Horizon Visuomotor Tasks

    Authors: Bohan Wu, Suraj Nair, Li Fei-Fei, Chelsea Finn

    Abstract: In this paper, we study the problem of learning a repertoire of low-level skills from raw images that can be sequenced to complete long-horizon visuomotor tasks. Reinforcement learning (RL) is a promising approach for acquiring short-horizon skills autonomously. However, the focus of RL algorithms has largely been on the success of those individual skills, more so than learning and grounding a lar…

    Submitted 19 September, 2022; v1 submitted 21 September, 2021; originally announced September 2021.

    Comments: Equal advising and contribution for last two authors