
Showing 1–50 of 232 results for author: Malik, J

  1. arXiv:2404.16823  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Learning Visuotactile Skills with Two Multifingered Hands

    Authors: Toru Lin, Yu Zhang, Qiyang Li, Haozhi Qi, Brent Yi, Sergey Levine, Jitendra Malik

    Abstract: Aiming to replicate human-like dexterity, perceptual experiences, and motion patterns, we explore learning from human demonstrations using a bimanual system with multifingered hands and visuotactile data. Two significant challenges exist: the lack of an affordable and accessible teleoperation system suitable for a dual-arm setup with multifingered hands, and the scarcity of multifingered hand hard…

    Submitted 22 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Code and Project Website: https://toruowo.github.io/hato/

  2. arXiv:2404.06507  [pdf, other]

    cs.CV

    Reconstructing Hand-Held Objects in 3D

    Authors: Jane Wu, Georgios Pavlakos, Georgia Gkioxari, Jitendra Malik

    Abstract: Objects manipulated by the hand (i.e., manipulanda) are particularly challenging to reconstruct from in-the-wild RGB images or videos. Not only does the hand occlude much of the object, but also the object is often only visible in a small number of image pixels. At the same time, two strong anchors emerge in this setting: (1) estimated 3D hands help disambiguate the location and scale of the objec…

    Submitted 9 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: Project page: https://janehwu.github.io/mcc-ho

  3. arXiv:2403.17575  [pdf]

    physics.med-ph

    MR sequence design using digital twins of non-idealized hardware

    Authors: Daniel J West, Felix Glang, Jonathan Endres, David Leitão, Moritz Zaiss, Joseph V Hajnal, Shaihan J Malik

    Abstract: MRI systems are traditionally engineered to produce close to idealized performance, enabling a simplified pulse sequence design philosophy. An example of this is control of eddy currents produced by gradient fields; usually these are compensated by pre-emphasizing demanded waveforms. This process typically happens invisibly to the pulse designer, allowing them to assume the achieved gradient wavef…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: 33 pages, 14 figures (including supporting information)

  4. arXiv:2403.12945  [pdf, other]

    cs.RO

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Authors: Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Youngwoon Lee, Marius Memmel, Sungjae Park, et al. (74 additional authors not shown)

    Abstract: The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a resu…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: Project website: https://droid-dataset.github.io/

  5. arXiv:2403.07008  [pdf, other]

    cs.LG cs.AI cs.CL stat.ME

    AutoEval Done Right: Using Synthetic Data for Model Evaluation

    Authors: Pierre Boyeau, Anastasios N. Angelopoulos, Nir Yosef, Jitendra Malik, Michael I. Jordan

    Abstract: The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These…

    Submitted 28 May, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

    Comments: New experiments, fix fig 1

  6. arXiv:2403.02338  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Twisting Lids Off with Two Hands

    Authors: Toru Lin, Zhao-Heng Yin, Haozhi Qi, Pieter Abbeel, Jitendra Malik

    Abstract: Manipulating objects with two multi-fingered hands has been a long-standing challenge in robotics, attributed to the contact-rich nature of many manipulation tasks and the complexity inherent in coordinating a high-dimensional bimanual system. In this work, we consider the problem of twisting lids of various bottle-like objects with two hands, and demonstrate that policies trained in simulation us…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: Project page can be found at https://toruowo.github.io/bimanual-twist

  7. arXiv:2403.01915  [pdf, other]

    cs.CV cs.AI

    xT: Nested Tokenization for Larger Context in Large Images

    Authors: Ritwik Gupta, Shufan Li, Tyler Zhu, Jitendra Malik, Trevor Darrell, Karttikeya Mangalam

    Abstract: Modern computer vision pipelines handle large images in one of two sub-optimal ways: down-sampling or cropping. These two methods incur significant losses in the amount of information and context present in an image. There are many downstream applications in which global context matters as much as high frequency details, such as in real-world satellite imagery; in such cases researchers have to ma…

    Submitted 4 March, 2024; originally announced March 2024.

  8. arXiv:2402.19469  [pdf, other]

    cs.RO cs.CV cs.LG

    Humanoid Locomotion as Next Token Prediction

    Authors: Ilija Radosavovic, Bike Zhang, Baifeng Shi, Jathushan Rajasegaran, Sarthak Kamat, Trevor Darrell, Koushil Sreenath, Jitendra Malik

    Abstract: We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This gen…

    Submitted 29 February, 2024; originally announced February 2024.

  9. arXiv:2401.10889  [pdf, other]

    cs.CV cs.AI

    Synthesizing Moving People with 3D Control

    Authors: Boyi Li, Jathushan Rajasegaran, Yossi Gandelsman, Alexei A. Efros, Jitendra Malik

    Abstract: In this paper, we present a diffusion model-based framework for animating people from a single image for a given target 3D motion sequence. Our approach has two core components: a) learning priors about invisible parts of the human body and clothing, and b) rendering novel body poses with proper clothing and texture. For the first part, we learn an in-filling diffusion model to hallucinate unseen…

    Submitted 19 January, 2024; originally announced January 2024.

  10. arXiv:2401.04105  [pdf, other]

    cs.CV cs.AI

    Dr$^2$Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning

    Authors: Chen Zhao, Shuming Liu, Karttikeya Mangalam, Guocheng Qian, Fatimah Zohra, Abdulmohsen Alghannam, Jitendra Malik, Bernard Ghanem

    Abstract: Large pretrained models are increasingly crucial in modern computer vision tasks. These models are typically used in downstream tasks by end-to-end finetuning, which is highly memory-intensive for tasks with high-resolution data, e.g., video understanding, small object detection, and point cloud analysis. In this paper, we propose Dynamic Reversible Dual-Residual Networks, or Dr$^2$Net, a novel fa…

    Submitted 30 March, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

    Journal ref: the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

  11. arXiv:2312.13469  [pdf, other]

    cs.RO cs.CV cs.LG

    Neural feels with neural fields: Visuo-tactile perception for in-hand manipulation

    Authors: Sudharshan Suresh, Haozhi Qi, Tingfan Wu, Taosha Fan, Luis Pineda, Mike Lambeta, Jitendra Malik, Mrinal Kalakrishnan, Roberto Calandra, Michael Kaess, Joseph Ortiz, Mustafa Mukadam

    Abstract: To achieve human-level dexterity, robots must infer spatial awareness from multimodal sensing to reason over contact interactions. During in-hand manipulation of novel objects, such spatial awareness involves estimating the object's pose and shape. The status quo for in-hand perception primarily employs vision, and restricts to tracking a priori known objects. Moreover, visual occlusion of objects…

    Submitted 20 December, 2023; originally announced December 2023.

    Comments: 43 pages, 20 figures, 1 table; https://suddhu.github.io/neural-feels/

  12. arXiv:2312.06653  [pdf, other]

    cs.CV

    Adaptive Human Trajectory Prediction via Latent Corridors

    Authors: Neerja Thakkar, Karttikeya Mangalam, Andrea Bajcsy, Jitendra Malik

    Abstract: Human trajectory prediction is typically posed as a zero-shot generalization problem: a predictor is learnt on a dataset of human motion in training scenes, and then deployed on unseen test scenes. While this paradigm has yielded tremendous progress, it fundamentally assumes that trends in human behavior within the deployment scene are constant over time. As such, current prediction models are una…

    Submitted 11 December, 2023; originally announced December 2023.

    Comments: Project website can be found at https://neerja.me/atp_latent_corridors/

  13. arXiv:2312.05251  [pdf, other]

    cs.CV

    Reconstructing Hands in 3D with Transformers

    Authors: Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, Jitendra Malik

    Abstract: We present an approach that can reconstruct hands in 3D from monocular input. Our approach for Hand Mesh Recovery, HaMeR, follows a fully transformer-based architecture and can analyze hands with significantly increased accuracy and robustness compared to previous work. The key to HaMeR's success lies in scaling up both the data used for training and the capacity of the deep network for hand recon…

    Submitted 8 December, 2023; originally announced December 2023.

  14. arXiv:2312.00785  [pdf, other]

    cs.CV

    Sequential Modeling Enables Scalable Learning for Large Vision Models

    Authors: Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros

    Abstract: We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once…

    Submitted 1 December, 2023; originally announced December 2023.

    Comments: Website: https://yutongbai.com/lvm.html

  15. arXiv:2311.18259  [pdf, other]

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from…

    Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

  16. arXiv:2311.06430  [pdf, other]

    cs.RO

    GOAT: GO to Any Thing

    Authors: Matthew Chang, Theophile Gervet, Mukul Khanna, Sriram Yenamandra, Dhruv Shah, So Yeon Min, Kavit Shah, Chris Paxton, Saurabh Gupta, Dhruv Batra, Roozbeh Mottaghi, Jitendra Malik, Devendra Singh Chaplot

    Abstract: In deployment scenarios such as homes and warehouses, mobile robots are expected to autonomously navigate for extended periods, seamlessly executing tasks articulated in terms that are intuitively understandable by human operators. We present GO To Any Thing (GOAT), a universal navigation system capable of tackling these requirements with three key features: a) Multimodal: it can tackle goals spec…

    Submitted 10 November, 2023; originally announced November 2023.

  17. arXiv:2311.01457  [pdf, other]

    cs.RO cs.AI

    Conformal Policy Learning for Sensorimotor Control Under Distribution Shifts

    Authors: Huang Huang, Satvik Sharma, Antonio Loquercio, Anastasios Angelopoulos, Ken Goldberg, Jitendra Malik

    Abstract: This paper focuses on the problem of detecting and reacting to changes in the distribution of a sensorimotor controller's observables. The key idea is the design of switching policies that can take conformal quantiles as input, which we define as conformal policy learning, that allows robots to detect distribution shifts with formal statistical guarantees. We show how to design such policies by us…

    Submitted 2 November, 2023; originally announced November 2023.

    Comments: Conformal Policy Learning

  18. arXiv:2310.13724  [pdf, other]

    cs.HC cs.AI cs.CV cs.GR cs.MA cs.RO

    Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots

    Authors: Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, Vladimír Vondruš, Theophile Gervet, Vincent-Pierre Berges, John M. Turner, Oleksandr Maksymets, Zsolt Kira, Mrinal Kalakrishnan, Jitendra Malik, Devendra Singh Chaplot, Unnat Jain, Dhruv Batra, Akshara Rai, Roozbeh Mottaghi

    Abstract: We present Habitat 3.0: a simulation platform for studying collaborative human-robot tasks in home environments. Habitat 3.0 offers contributions across three dimensions: (1) Accurate humanoid simulation: addressing challenges in modeling complex deformable bodies and diversity in appearance and motion, all while ensuring high simulation speed. (2) Human-in-the-loop infrastructure: enabling real h…

    Submitted 19 October, 2023; originally announced October 2023.

    Comments: Project page: http://aihabitat.org/habitat3

  19. arXiv:2310.11811  [pdf, other]

    cs.CV

    ShapeGraFormer: GraFormer-Based Network for Hand-Object Reconstruction from a Single Depth Map

    Authors: Ahmed Tawfik Aboukhadra, Jameel Malik, Nadia Robertini, Ahmed Elhayek, Didier Stricker

    Abstract: 3D reconstruction of hand-object manipulations is important for emulating human actions. Most methods dealing with challenging object manipulation scenarios, focus on hands reconstruction in isolation, ignoring physical and kinematic constraints due to object contact. Some approaches produce more realistic results by jointly reconstructing 3D hand-object interactions. However, they focus on coarse…

    Submitted 18 October, 2023; originally announced October 2023.

  20. arXiv:2310.10645  [pdf, other]

    cs.RO cs.AI cs.CL cs.HC

    Interactive Task Planning with Language Models

    Authors: Boyi Li, Philipp Wu, Pieter Abbeel, Jitendra Malik

    Abstract: An interactive robot framework accomplishes long-horizon task planning and can easily generalize to new goals or distinct tasks, even during execution. However, most traditional methods require predefined module design, which makes it hard to generalize to different goals. Recent large language model based approaches can allow for more open-ended planning but often require heavy prompt engineering…

    Submitted 16 October, 2023; originally announced October 2023.

  21. arXiv:2310.08864  [pdf, other]

    cs.RO

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Authors: Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, et al. (267 additional authors not shown)

    Abstract: Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning method…

    Submitted 1 June, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

    Comments: Project website: https://robotics-transformer-x.github.io

  22. arXiv:2310.07932  [pdf, other]

    cs.RO cs.AI cs.CV

    What Matters to You? Towards Visual Representation Alignment for Robot Learning

    Authors: Ran Tian, Chenfeng Xu, Masayoshi Tomizuka, Jitendra Malik, Andrea Bajcsy

    Abstract: When operating in service of people, robots need to optimize rewards aligned with end-user preferences. Since robots will rely on raw perceptual inputs like RGB images, their rewards will inevitably use visual representations. Recently there has been excitement in using representations from pre-trained visual models, but key to making these work in robotics is fine-tuning, which is typically done…

    Submitted 15 January, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

  23. arXiv:2310.05921  [pdf, other]

    stat.ML cs.LG cs.RO stat.ME

    Conformal Decision Theory: Safe Autonomous Decisions from Imperfect Predictions

    Authors: Jordan Lekeufack, Anastasios N. Angelopoulos, Andrea Bajcsy, Michael I. Jordan, Jitendra Malik

    Abstract: We introduce Conformal Decision Theory, a framework for producing safe autonomous decisions despite imperfect machine learning predictions. Examples of such decisions are ubiquitous, from robot planning algorithms that rely on pedestrian predictions, to calibrating autonomous manufacturing to exhibit high throughput and low error, to the choice of trusting a nominal policy versus switching to a sa…

    Submitted 2 May, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: 8 pages, 5 figures

  24. arXiv:2309.09979  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    General In-Hand Object Rotation with Vision and Touch

    Authors: Haozhi Qi, Brent Yi, Sudharshan Suresh, Mike Lambeta, Yi Ma, Roberto Calandra, Jitendra Malik

    Abstract: We introduce RotateIt, a system that enables fingertip-based object rotation along multiple axes by leveraging multimodal sensory inputs. Our system is trained in simulation, where it has access to ground-truth object shapes and physical properties. Then we distill it to operate on realistic yet noisy simulated visuotactile and proprioceptive sensory inputs. These multimodal inputs are fused via a…

    Submitted 28 September, 2023; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: CoRL 2023; Website: https://haozhi.io/rotateit/

  25. arXiv:2308.16185  [pdf, other]

    cs.RO cs.AI

    Learning Vision-based Pursuit-Evasion Robot Policies

    Authors: Andrea Bajcsy, Antonio Loquercio, Ashish Kumar, Jitendra Malik

    Abstract: Learning strategic robot behavior -- like that required in pursuit-evasion interactions -- under real-world constraints is extremely challenging. It requires exploiting the dynamics of the interaction, and planning through both physical state and latent intent uncertainty. In this paper, we transform this intractable problem into a supervised learning problem, where a fully-observable robot policy…

    Submitted 30 August, 2023; originally announced August 2023.

    Comments: Includes Supplementary. Project webpage at https://abajcsy.github.io/vision-based-pursuit/

  26. arXiv:2308.09126  [pdf, other]

    cs.CV cs.AI cs.CL

    EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

    Authors: Karttikeya Mangalam, Raiymbek Akshulakov, Jitendra Malik

    Abstract: We introduce EgoSchema, a very long-form video question-answering dataset, and benchmark to evaluate long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5000 human curated multiple choice question answer pairs, spanning over 250 hours of real video data, covering a very broad range of natural human activity and behavior. For e…

    Submitted 17 August, 2023; originally announced August 2023.

    Comments: https://egoschema.github.io/

  27. arXiv:2306.10208  [pdf, other]

    cs.CV

    Learning Space-Time Semantic Correspondences

    Authors: Du Tran, Jitendra Malik

    Abstract: We propose a new task of space-time semantic correspondence prediction in videos. Given a source video, a target video, and a set of space-time key-points in the source video, the task requires predicting a set of keypoints in the target video that are the semantic correspondences of the provided source keypoints. We believe that this task is important for fine-grain video understanding, potential…

    Submitted 16 June, 2023; originally announced June 2023.

  28. arXiv:2306.10007  [pdf, other]

    cs.RO cs.CV cs.LG

    Robot Learning with Sensorimotor Pre-training

    Authors: Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, Jitendra Malik

    Abstract: We present a self-supervised sensorimotor pre-training approach for robotics. Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens. Given a sequence of camera images, proprioceptive robot states, and actions, we encode the sequence into tokens, mask out a subset, and train a model to predict the missing content from the rest. We hypothesize that if a robot can…

    Submitted 14 December, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

    Comments: CoRL 2023; Project page: https://robotic-pretrained-transformer.github.io

  29. arXiv:2306.00989  [pdf, other]

    cs.CV cs.LG

    Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

    Authors: Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer

    Abstract: Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraini…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: ICML 2023 Oral version. Code+Models: https://github.com/facebookresearch/hiera

  30. arXiv:2305.20091  [pdf, other]

    cs.CV

    Humans in 4D: Reconstructing and Tracking Humans with Transformers

    Authors: Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, Jitendra Malik

    Abstract: We present an approach to reconstruct humans and track them over time. At the core of our approach, we propose a fully "transformerized" version of a network for human mesh recovery. This network, HMR 2.0, advances the state of the art and shows the capability to analyze unusual poses that have in the past been difficult to reconstruct from single images. To analyze video, we use 3D reconstruction…

    Submitted 31 August, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

    Comments: In ICCV 2023. Project Webpage: https://shubham-goel.github.io/4dhumans/

  31. arXiv:2305.01648  [pdf, other]

    cs.RO

    More Than an Arm: Using a Manipulator as a Tail for Enhanced Stability in Legged Locomotion

    Authors: Huang Huang, Antonio Loquercio, Ashish Kumar, Neerja Thakkar, Ken Goldberg, Jitendra Malik

    Abstract: Is a manipulator on a legged robot a liability or an asset for locomotion? Prior works mainly designed specific controllers to account for the added payload and inertia from a manipulator. In contrast, biological systems typically benefit from additional limbs, which can simplify postural control. For instance, cats use their tails to enhance the stability of their bodies and prevent falls under d…

    Submitted 2 May, 2023; originally announced May 2023.

  32. arXiv:2304.01199  [pdf, other]

    cs.CV

    On the Benefits of 3D Pose and Tracking for Human Action Recognition

    Authors: Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, Christoph Feichtenhofer, Jitendra Malik

    Abstract: In this work we study the benefits of using tracking and 3D poses for action recognition. To achieve this, we take the Lagrangian view on analysing actions over a trajectory of human motion rather than at a fixed point in space. Taking this stand allows us to use the tracklets of people to predict their actions. In this spirit, first we show the benefits of using 3D pose to infer actions, and stud…

    Submitted 7 August, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

    Comments: CVPR2023 (project page: https://brjathu.github.io/LART)

  33. arXiv:2304.01192  [pdf, other]

    cs.CV cs.RO

    Navigating to Objects Specified by Images

    Authors: Jacob Krantz, Theophile Gervet, Karmesh Yadav, Austin Wang, Chris Paxton, Roozbeh Mottaghi, Dhruv Batra, Jitendra Malik, Stefan Lee, Devendra Singh Chaplot

    Abstract: Images are a convenient way to specify which particular object instance an embodied agent should navigate to. Solving this task requires semantic visual reasoning and exploration of unknown environments. We present a system that can perform this task in both simulation and the real world. Our modular method solves sub-tasks of exploration, goal instance re-identification, goal localization, and lo…

    Submitted 3 April, 2023; originally announced April 2023.

  34. arXiv:2303.18240  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

    Authors: Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Yecheng Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Pieter Abbeel, Jitendra Malik, Dhruv Batra, Yixin Lin, Oleksandr Maksymets, Aravind Rajeswaran, Franziska Meier

    Abstract: We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual 'foundation models' for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of…

    Submitted 1 February, 2024; v1 submitted 31 March, 2023; originally announced March 2023.

    Comments: Project website: https://eai-vc.github.io

  35. arXiv:2303.03381  [pdf, other]

    cs.RO cs.LG

    Real-World Humanoid Locomotion with Reinforcement Learning

    Authors: Ilija Radosavovic, Tete Xiao, Bike Zhang, Trevor Darrell, Jitendra Malik, Koushil Sreenath

    Abstract: Humanoid robots that can autonomously operate in diverse environments have the potential to help address labour shortages in factories, assist elderly at homes, and colonize new planets. While classical controllers for humanoid robots have shown impressive results in a number of settings, they are challenging to generalize and adapt to new environments. Here, we present a fully learning-based appr…

    Submitted 14 December, 2023; v1 submitted 6 March, 2023; originally announced March 2023.

    Comments: Project page: https://learning-humanoid-locomotion.github.io

  36. arXiv:2302.12827  [pdf, other]

    cs.CV

    Decoupling Human and Camera Motion from Videos in the Wild

    Authors: Vickie Ye, Georgios Pavlakos, Jitendra Malik, Angjoo Kanazawa

    Abstract: We propose a method to reconstruct global human trajectories from videos in the wild. Our optimization method decouples the camera and human motion, which allows us to place people in the same world coordinate frame. Most existing methods do not model the camera motion; methods that rely on the background pixels to infer 3D human motion usually require a full scene reconstruction, which is often n…

    Submitted 20 March, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

    Comments: Project site: https://vye16.github.io/slahmr. CVPR 2023

  37. arXiv:2302.07863  [pdf, other]

    cs.CL

    Speculative Decoding with Big Little Decoder

    Authors: Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer

    Abstract: The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks,…

    Submitted 12 October, 2023; v1 submitted 15 February, 2023; originally announced February 2023.

    Comments: NeurIPS 2023

  38. arXiv:2302.04869  [pdf, other]

    cs.CV cs.AI

    Reversible Vision Transformers

    Authors: Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong, Christoph Feichtenhofer, Jitendra Malik

    Abstract: We present Reversible Vision Transformers, a memory efficient architecture design for visual recognition. By decoupling the GPU memory requirement from the depth of the model, Reversible Vision Transformers enable scaling up architectures with efficient memory usage. We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants and benchmark exte…

    Submitted 9 February, 2023; originally announced February 2023.

    Comments: Oral at CVPR 2022, updated version

  39. arXiv:2301.08247  [pdf, other]

    cs.CV

    Multiview Compressive Coding for 3D Reconstruction

    Authors: Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, Georgia Gkioxari

    Abstract: A central goal of visual recognition is to understand objects and scenes from a single image. 2D recognition has witnessed tremendous progress thanks to large-scale learning and general-purpose representations. Comparatively, 3D poses new challenges stemming from occlusions not depicted in the image. Prior works try to overcome these by inferring from multiple views or rely on scarce CAD models an…

    Submitted 19 January, 2023; originally announced January 2023.

    Comments: Project page: https://mcc3d.github.io/

  40. arXiv:2301.02232  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG

    CA$^2$T-Net: Category-Agnostic 3D Articulation Transfer from Single Image

    Authors: Jasmine Collins, Anqi Liang, Jitendra Malik, Hao Zhang, Frédéric Devernay

    Abstract: We present a neural network approach to transfer the motion from a single image of an articulated object to a rest-state (i.e., unarticulated) 3D model. Our network learns to predict the object's pose, part segmentation, and corresponding motion parameters to reproduce the articulation shown in the input image. The network is composed of three distinct branches that take a shared joint image-shape…

    Submitted 22 March, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: 8 pages

  41. arXiv:2212.10564  [pdf, other]

    cs.CL cs.AI cs.LG

    Re-evaluating the Need for Multimodal Signals in Unsupervised Grammar Induction

    Authors: Boyi Li, Rodolfo Corona, Karttikeya Mangalam, Catherine Chen, Daniel Flaherty, Serge Belongie, Kilian Q. Weinberger, Jitendra Malik, Trevor Darrell, Dan Klein

    Abstract: Are multimodal inputs necessary for grammar induction? Recent work has shown that multimodal training inputs can improve grammar induction. However, these improvements are based on comparisons to weak text-only baselines that were trained on relatively little textual data. To determine whether multimodal inputs are needed in regimes with large amounts of textual training data, we design a stronger…

    Submitted 12 April, 2024; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: NAACL Findings 2024

  42. arXiv:2212.08071  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    MAViL: Masked Audio-Video Learners

    Authors: Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, Christoph Feichtenhofer

    Abstract: We present Masked Audio-Video Learners (MAViL) to train audio-visual representations. Our approach learns with three complementary forms of self-supervision: (1) reconstruction of masked audio and video input data, (2) intra- and inter-modal contrastive learning with masking, and (3) self-training by reconstructing joint audio-video contextualized features learned from the first two objectives. Pr…

    Submitted 17 July, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

    Comments: Technical report

  43. arXiv:2212.00922  [pdf, other]

    cs.RO cs.CV cs.LG

    Navigating to Objects in the Real World

    Authors: Theophile Gervet, Soumith Chintala, Dhruv Batra, Jitendra Malik, Devendra Singh Chaplot

    Abstract: Semantic navigation is necessary to deploy mobile robots in uncontrolled environments like our homes, schools, and hospitals. Many learning-based approaches have been proposed in response to the lack of semantic understanding of the classical pipeline for spatial navigation, which builds a geometric map using depth sensors and plans to reach point goals. Broadly, end-to-end learning approaches rea…

    Submitted 1 December, 2022; originally announced December 2022.

    Comments: 39 pages, 19 figures and tables, submitted to Science Robotics

  44. arXiv:2211.15876  [pdf, other]

    cs.CV

    Instance-Specific Image Goal Navigation: Training Embodied Agents to Find Object Instances

    Authors: Jacob Krantz, Stefan Lee, Jitendra Malik, Dhruv Batra, Devendra Singh Chaplot

    Abstract: We consider the problem of embodied visual navigation given an image-goal (ImageNav) where an agent is initialized in an unfamiliar environment and tasked with navigating to a location 'described' by an image. Unlike related navigation tasks, ImageNav does not have a standardized task definition which makes comparison across methods difficult. Further, existing formulations have two problematic pr…

    Submitted 28 November, 2022; originally announced November 2022.

  45. arXiv:2211.13225  [pdf, other]

    cs.CV cs.LG cs.RO

    Learning to Imitate Object Interactions from Internet Videos

    Authors: Austin Patel, Andrew Wang, Ilija Radosavovic, Jitendra Malik

    Abstract: We study the problem of imitating object interactions from Internet videos. This requires understanding the hand-object interactions in 4D, spatially in 3D and over time, which is challenging due to mutual hand-object occlusions. In this paper we make two main contributions: (1) a novel reconstruction technique RHOV (Reconstructing Hands and Objects from Videos), which reconstructs 4D trajectories…

    Submitted 23 November, 2022; originally announced November 2022.

    Comments: Project page: https://austinapatel.github.io/imitate-video

  46. arXiv:2211.07638  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG eess.SY

    Legged Locomotion in Challenging Terrains using Egocentric Vision

    Authors: Ananye Agarwal, Ashish Kumar, Jitendra Malik, Deepak Pathak

    Abstract: Animals are capable of precise and agile locomotion using vision. Replicating this ability has been a long-standing goal in robotics. The traditional approach has been to decompose this problem into elevation mapping and foothold planning phases. The elevation mapping, however, is susceptible to failure and large noise artifacts, requires specialized hardware, and is biologically implausible. In t…

    Submitted 14 November, 2022; originally announced November 2022.

    Comments: Oral presentation at CoRL 2022. Website at https://vision-locomotion.github.io

  47. arXiv:2211.03785  [pdf, other]

    cs.AI cs.RO

    Learning Visual Locomotion with Cross-Modal Supervision

    Authors: Antonio Loquercio, Ashish Kumar, Jitendra Malik

    Abstract: In this work, we show how to learn a visual walking policy that only uses a monocular RGB camera and proprioception. Since simulating RGB is hard, we necessarily have to learn vision in the real world. We start with a blind walking policy trained in simulation. This policy can traverse some terrains in the real world but often struggles since it lacks knowledge of the upcoming geometry. This can b…

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: Learning to walk from pixels in the real world by using proprioception as supervision. Project page for videos and code: https://antonilo.github.io/vision_locomotion/

  48. arXiv:2210.13853  [pdf, other]

    cs.CV

    THOR-Net: End-to-end Graformer-based Realistic Two Hands and Object Reconstruction with Self-supervision

    Authors: Ahmed Tawfik Aboukhadra, Jameel Malik, Ahmed Elhayek, Nadia Robertini, Didier Stricker

    Abstract: Realistic reconstruction of two hands interacting with objects is a new and challenging problem that is essential for building personalized Virtual and Augmented Reality environments. Graph Convolutional networks (GCNs) allow for the preservation of the topologies of hands poses and shapes by modeling them as a graph. In this work, we propose the THOR-Net which combines the power of GCNs, Transfor…

    Submitted 25 October, 2022; originally announced October 2022.

    Comments: To be published in WACV2023

  49. arXiv:2210.04887  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    In-Hand Object Rotation via Rapid Motor Adaptation

    Authors: Haozhi Qi, Ashish Kumar, Roberto Calandra, Yi Ma, Jitendra Malik

    Abstract: Generalized in-hand manipulation has long been an unsolved challenge of robotics. As a small step towards this grand goal, we demonstrate how to design and learn a simple adaptive controller to achieve in-hand object rotation using only fingertips. The controller is trained entirely in simulation on only cylindrical objects, which then - without any fine-tuning - can be directly deployed to a real…

    Submitted 10 October, 2022; originally announced October 2022.

    Comments: CoRL 2022. Code and Website: https://haozhi.io/hora

  50. arXiv:2210.03109  [pdf, other]

    cs.RO cs.CV cs.LG

    Real-World Robot Learning with Masked Visual Pre-training

    Authors: Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, Trevor Darrell

    Abstract: In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module. Unlike prior work, we show that the pre-trained representations are effective across a range of real-world robotic ta…

    Submitted 6 October, 2022; originally announced October 2022.

    Comments: CoRL 2022; Project page: https://tetexiao.com/projects/real-mvp