Showing 1–17 of 17 results for author: Dwibedi, D

  1. arXiv:2405.02292  [pdf, other]

    cs.RO cs.LG

    ALOHA 2: An Enhanced Low-Cost Hardware for Bimanual Teleoperation

    Authors: ALOHA 2 Team, Jorge Aldaco, Travis Armstrong, Robert Baruch, Jeff Bingham, Sanky Chan, Kenneth Draper, Debidatta Dwibedi, Chelsea Finn, Pete Florence, Spencer Goodrich, Wayne Gramlich, Torr Hage, Alexander Herzog, Jonathan Hoech, Thinh Nguyen, Ian Storz, Baruch Tabanpour, Leila Takayama, Jonathan Tompson, Ayzaan Wahid, Ted Wahrburg, Sichun Xu, Sergey Yaroshenko, Kevin Zakka, et al. (1 additional author not shown)

    Abstract: Diverse demonstration datasets have powered significant advances in robot learning, but the dexterity and scale of such data can be limited by the hardware cost, the hardware robustness, and the ease of teleoperation. We introduce ALOHA 2, an enhanced version of ALOHA that has greater performance, ergonomics, and robustness compared to the original design. To accelerate research in large-scale bim…

    Submitted 7 February, 2024; originally announced May 2024.

    Comments: Project website: aloha-2.github.io

  2. arXiv:2403.12943  [pdf, other]

    cs.RO cs.AI

    Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

    Authors: Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi

    Abstract: While large-scale robotic systems typically rely on textual instructions for tasks, this work explores a different approach: can robots infer the task directly from observing humans? This shift necessitates the robot's ability to decode human intent and translate it into executable actions within its physical constraints and environment. We introduce Vid2Robot, a novel end-to-end video-based learn…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: Robot learning: Imitation Learning, Robot Perception, Sensing & Vision, Grasping & Manipulation

  3. arXiv:2403.12026  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    FlexCap: Generating Rich, Localized, and Flexible Captions in Images

    Authors: Debidatta Dwibedi, Vidhi Jain, Jonathan Tompson, Andrew Zisserman, Yusuf Aytar

    Abstract: We introduce a versatile $\textit{flexible-captioning}$ vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, and this allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To ach…

    Submitted 18 March, 2024; originally announced March 2024.

  4. arXiv:2403.01823  [pdf, other]

    cs.RO cs.AI

    RT-H: Action Hierarchies Using Language

    Authors: Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, Dorsa Sadigh

    Abstract: Language provides a way to break down complex concepts into digestible pieces. Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., "pick coke can" and "pick an apple") in…

    Submitted 31 May, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  5. arXiv:2401.12963  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents

    Authors: Michael Ahn, Debidatta Dwibedi, Chelsea Finn, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Karol Hausman, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Sean Kirmani, Isabel Leal, Edward Lee, Sergey Levine, Yao Lu, Sharath Maddineni, Kanishka Rao, Dorsa Sadigh, Pannag Sanketi, Pierre Sermanet, Quan Vuong, Stefan Welker, Fei Xia, Ted Xiao, et al. (3 additional authors not shown)

    Abstract: Foundation models that incorporate language, vision, and more recently actions have revolutionized the ability to harness internet scale data to reason about useful tasks. However, one of the key challenges of training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the d…

    Submitted 1 July, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

    Comments: 26 pages, 9 figures, ICRA 2024 VLMNM Workshop

  6. arXiv:2311.00899  [pdf, other]

    cs.RO

    RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

    Authors: Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, Pete Florence, Wei Han, Robert Baruch, Yao Lu, Suvir Mirchandani, Peng Xu, Pannag Sanketi, Karol Hausman, Izhak Shafran, Brian Ichter, Yuan Cao

    Abstract: We present a scalable, bottom-up and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons and that has 2.2x higher throughput compared to traditional narrow top-down step-by-step collection. We collect realistic data by performing any user requests within the entirety of 3 office buildings and using multiple robot and human embodiment…

    Submitted 1 November, 2023; originally announced November 2023.

  7. arXiv:2302.05444  [pdf, other]

    cs.LG

    Q-Match: Self-Supervised Learning by Matching Distributions Induced by a Queue

    Authors: Thomas Mulc, Debidatta Dwibedi

    Abstract: In semi-supervised learning, student-teacher distribution matching has been successful in improving performance of models using unlabeled data in conjunction with few labeled samples. In this paper, we aim to replicate that success in the self-supervised setup where we do not have access to any labeled data during pre-training. We introduce our algorithm, Q-Match, and show it is possible to induce…

    Submitted 22 February, 2023; v1 submitted 10 February, 2023; originally announced February 2023.

  8. arXiv:2205.06333  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations

    Authors: Negin Heravi, Ayzaan Wahid, Corey Lynch, Pete Florence, Travis Armstrong, Jonathan Tompson, Pierre Sermanet, Jeannette Bohg, Debidatta Dwibedi

    Abstract: Perceptual understanding of the scene and the relationship between its different components is important for successful completion of robotic tasks. Representation learning has been shown to be a powerful technique for this, but most of the current methodologies learn task specific representations that do not necessarily transfer well to other tasks. Furthermore, representations learned by supervi…

    Submitted 12 March, 2023; v1 submitted 12 May, 2022; originally announced May 2022.

  9. arXiv:2106.03911  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    XIRL: Cross-embodiment Inverse Reinforcement Learning

    Authors: Kevin Zakka, Andy Zeng, Pete Florence, Jonathan Tompson, Jeannette Bohg, Debidatta Dwibedi

    Abstract: We investigate the visual cross-embodiment imitation setting, in which agents learn policies from videos of other agents (such as humans) demonstrating the same task, but with stark differences in their embodiments -- shape, actions, end-effector dynamics, etc. In this work, we demonstrate that it is possible to automatically discover and learn vision-based reward functions from cross-embodiment d…

    Submitted 13 December, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

    Comments: Oral Accept, CoRL '21

  10. arXiv:2104.14548  [pdf, other]

    cs.CV

    With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations

    Authors: Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

    Abstract: Self-supervised learning algorithms based on instance discrimination train encoders to be invariant to pre-defined transformations of the same instance. While most methods treat different views of the same image as positives for a contrastive loss, we are interested in using positives from other instances in the dataset. Our method, Nearest-Neighbor Contrastive Learning of visual Representations (…

    Submitted 7 October, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

    Comments: Accepted at ICCV 2021

  11. arXiv:2006.15418  [pdf, other]

    cs.CV

    Counting Out Time: Class Agnostic Video Repetition Counting in the Wild

    Authors: Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

    Abstract: We present an approach for estimating the period with which an action is repeated in a video. The crux of the approach lies in constraining the period prediction module to use temporal self-similarity as an intermediate representation bottleneck that allows generalization to unseen repetitions in videos in the wild. We train this model, called Repnet, with a synthetic dataset that is generated fro…

    Submitted 27 June, 2020; originally announced June 2020.

    Comments: Accepted at CVPR 2020. Project webpage: https://sites.google.com/view/repnet

  12. arXiv:2001.02593  [pdf, other]

    cs.CV

    An Analysis of Object Representations in Deep Visual Trackers

    Authors: Ross Goroshin, Jonathan Tompson, Debidatta Dwibedi

    Abstract: Fully convolutional deep correlation networks are integral components of state-of-the-art approaches to single object visual tracking. It is commonly assumed that these networks perform tracking by detection by matching features of the object instance with features of the entire frame. Strong architectural priors and conditioning on the object representation are thought to encourage this tracking s…

    Submitted 8 January, 2020; originally announced January 2020.

  13. arXiv:1904.07846  [pdf, other]

    cs.CV cs.LG

    Temporal Cycle-Consistency Learning

    Authors: Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

    Abstract: We introduce a self-supervised representation learning method based on the task of temporal alignment between videos. The method trains a network using temporal cycle consistency (TCC), a differentiable cycle-consistency loss that can be used to find correspondences across time in multiple videos. The resulting per-frame embeddings can be used to align videos by simply matching frames using the ne…

    Submitted 16 April, 2019; originally announced April 2019.

    Comments: Accepted at CVPR 2019. Project webpage: https://sites.google.com/view/temporal-cycle-consistency

  14. arXiv:1809.02925  [pdf, other]

    cs.LG stat.ML

    Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning

    Authors: Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, Jonathan Tompson

    Abstract: We identify two issues with the family of algorithms based on the Adversarial Imitation Learning framework. The first problem is implicit bias present in the reward functions used in these algorithms. While these biases might work well for some environments, they can also lead to sub-optimal behavior in others. Secondly, even though these algorithms can learn from few expert demonstrations, they r…

    Submitted 15 October, 2018; v1 submitted 9 September, 2018; originally announced September 2018.

  15. arXiv:1808.00928  [pdf, other]

    cs.CV cs.LG cs.RO

    Learning Actionable Representations from Visual Observations

    Authors: Debidatta Dwibedi, Jonathan Tompson, Corey Lynch, Pierre Sermanet

    Abstract: In this work we explore a new approach for robots to teach themselves about the world simply by observing it. In particular we investigate the effectiveness of learning task-agnostic representations for continuous control tasks. We extend Time-Contrastive Networks (TCN) that learn from visual observations by embedding multiple frames jointly in the embedding space as opposed to a single frame. We…

    Submitted 2 February, 2019; v1 submitted 2 August, 2018; originally announced August 2018.

    Comments: This work is accepted in IROS 2018. Project website: https://sites.google.com/view/actionablerepresentations

  16. arXiv:1708.01642  [pdf, other]

    cs.CV

    Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection

    Authors: Debidatta Dwibedi, Ishan Misra, Martial Hebert

    Abstract: A major impediment in rapidly deploying object detection models for instance detection is the lack of large annotated datasets. For example, finding a large labeled dataset containing instances in a particular kitchen is unlikely. Each new environment with new instances requires expensive data collection and annotation. In this paper, we propose a simple approach to generate large annotated instan…

    Submitted 4 August, 2017; originally announced August 2017.

    Comments: To appear in ICCV 2017

  17. arXiv:1611.10010  [pdf, other]

    cs.CV

    Deep Cuboid Detection: Beyond 2D Bounding Boxes

    Authors: Debidatta Dwibedi, Tomasz Malisiewicz, Vijay Badrinarayanan, Andrew Rabinovich

    Abstract: We present a Deep Cuboid Detector which takes a consumer-quality RGB image of a cluttered scene and localizes all 3D cuboids (box-like objects). Contrary to classical approaches which fit a 3D model from low-level cues like corners, edges, and vanishing points, we propose an end-to-end deep learning system to detect cuboids across many semantic categories (e.g., ovens, shipping boxes, and furnitur…

    Submitted 30 November, 2016; originally announced November 2016.