Showing 1–50 of 131 results for author: Ramanan, D

  1. arXiv:2406.13896  [pdf, other]

    cs.CV

    Simultaneous Map and Object Reconstruction

    Authors: Nathaniel Chodosh, Anish Madan, Deva Ramanan, Simon Lucey

    Abstract: In this paper, we present a method for dynamic surface reconstruction of large-scale urban scenes from LiDAR. Depth-based reconstructions tend to focus on small-scale objects or large-scale SLAM reconstructions that treat moving objects as outliers. We take a holistic perspective and optimize a compositional model of a dynamic scene that decomposes the world into rigidly moving objects and the bac…

    Submitted 19 June, 2024; originally announced June 2024.

  2. arXiv:2406.13743  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation

    Authors: Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, Deva Ramanan

    Abstract: While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-vis…

    Submitted 21 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

    Comments: We open-source our dataset, model, and code at: https://linzhiqiu.github.io/papers/genai_bench ; Project page: https://linzhiqiu.github.io/papers/genai_bench ; GenAI-Bench was first introduced in arXiv:2404.01291. This article extends it with an additional GenAI-Rank benchmark.

  3. arXiv:2406.10714  [pdf, other]

    cs.RO cs.LG

    Planning with Adaptive World Models for Autonomous Driving

    Authors: Arun Balajee Vasudevan, Neehar Peri, Jeff Schneider, Deva Ramanan

    Abstract: Motion planning is crucial for safe navigation in complex urban environments. Historically, motion planners (MPs) have been evaluated with procedurally-generated simulators like CARLA. However, such synthetic benchmarks do not capture real-world multi-agent interactions. nuPlan, a recently released MP benchmark, addresses this limitation by augmenting real-world driving logs with closed-loop simul…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: Project Page: https://arunbalajeev.github.io/world_models_planning/world_model_paper.html

  4. arXiv:2406.10115  [pdf, other]

    cs.CV cs.LG cs.RO

    Shelf-Supervised Multi-Modal Pre-Training for 3D Object Detection

    Authors: Mehar Khurana, Neehar Peri, Deva Ramanan, James Hays

    Abstract: State-of-the-art 3D object detectors are often trained on massive labeled datasets. However, annotating 3D bounding boxes remains prohibitively expensive and time-consuming, particularly for LiDAR. Instead, recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels. Contemporary methods adapt best practices for self-supervised…

    Submitted 14 June, 2024; originally announced June 2024.

  5. arXiv:2406.02659  [pdf, other]

    q-bio.NC cs.AI cs.CV

    Neural Representations of Dynamic Visual Stimuli

    Authors: Jacob Yeung, Andrew F. Luo, Gabriel Sarch, Margaret M. Henderson, Deva Ramanan, Michael J. Tarr

    Abstract: Humans experience the world through constantly changing visual stimuli, where scenes can shift and move, change in appearance, and vary in distance. The dynamic nature of visual perception is a fundamental aspect of our daily lives, yet the large majority of research on object and scene processing, particularly using fMRI, has focused on static stimuli. While studies of static image perception are…

    Submitted 4 June, 2024; originally announced June 2024.

  6. arXiv:2404.11554  [pdf, other]

    cs.CV

    Predicting Long-horizon Futures by Conditioning on Geometry and Time

    Authors: Tarasha Khurana, Deva Ramanan

    Abstract: Our work explores the task of generating future sensor observations conditioned on the past. We are motivated by 'predictive coding' concepts from neuroscience as well as robotic applications such as self-driving vehicles. Predictive video modeling is challenging because the future may be multi-modal and learning at scale remains computationally expensive for video processing. To address both chal…

    Submitted 17 April, 2024; originally announced April 2024.

    Comments: Project page: http://www.cs.cmu.edu/~tkhurana/depthforecasting/

  7. arXiv:2404.01291  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    Evaluating Text-to-Visual Generation with Image-to-Text Generation

    Authors: Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan

    Abstract: Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reas…

    Submitted 18 June, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

    Comments: We open-source our data, model, and code at: https://github.com/linzhiqiu/t2v_metrics ; Project page: https://linzhiqiu.github.io/papers/vqascore

  8. arXiv:2403.13129  [pdf, other]

    cs.CV cs.RO

    Better Call SAL: Towards Learning to Segment Anything in Lidar

    Authors: Aljoša Ošep, Tim Meinhardt, Francesco Ferroni, Neehar Peri, Deva Ramanan, Laura Leal-Taixé

    Abstract: We propose the SAL (Segment Anything in Lidar) method, consisting of a text-promptable zero-shot model for segmenting and classifying any object in Lidar, and a pseudo-labeling engine that facilitates model training without manual supervision. While the established paradigm for Lidar Panoptic Segmentation (LPS) relies on manual supervision for a ha…

    Submitted 19 March, 2024; originally announced March 2024.

  9. arXiv:2403.04739  [pdf, other]

    cs.CV

    I Can't Believe It's Not Scene Flow!

    Authors: Ishan Khatri, Kyle Vedder, Neehar Peri, Deva Ramanan, James Hays

    Abstract: Current scene flow methods broadly fail to describe motion on small objects, and current scene flow evaluation protocols hide this failure by averaging over many points, with most drawn from larger objects. To fix this evaluation failure, we propose a new evaluation protocol, Bucket Normalized EPE, which is class-aware and speed-normalized, enabling contextualized error comparisons between object types…

    Submitted 7 March, 2024; originally announced March 2024.

    Comments: 13 pages, 3 pages of citations, 2 pages of supplemental

  10. arXiv:2402.14817  [pdf, other]

    cs.CV cs.LG

    Cameras as Rays: Pose Estimation via Ray Diffusion

    Authors: Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, Shubham Tulsiani

    Abstract: Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparsely sampled views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. This representation allows for a tight coupling with spatia…

    Submitted 4 April, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Comments: In ICLR 2024 (oral). v2-3: updated references. Project webpage: https://jasonyzhang.com/RayDiffusion

  11. arXiv:2402.13251  [pdf, other]

    cs.GR cs.CV cs.LG

    FlashTex: Fast Relightable Mesh Texturing with LightControlNet

    Authors: Kangle Deng, Timothy Omernick, Alexander Weiss, Deva Ramanan, Jun-Yan Zhu, Tinghui Zhou, Maneesh Agrawala

    Abstract: Manually creating textures for 3D meshes is time-consuming, even for expert visual content creators. We propose a fast approach for automatically texturing an input 3D mesh based on a user-provided text prompt. Importantly, our approach disentangles lighting from surface material/reflectance in the resulting texture so that the mesh can be properly relit and rendered in any lighting environment. W…

    Submitted 22 April, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: Project page: https://flashtex.github.io/

  12. arXiv:2402.12394  [pdf, other]

    cs.HC cs.AI cs.LG eess.IV

    Improving Model's Interpretability and Reliability using Biomarkers

    Authors: Gautam Rajendrakumar Gare, Tom Fox, Beam Chansangavej, Amita Krishnan, Ricardo Luis Rodriguez, Bennett P deBoisblanc, Deva Kannan Ramanan, John Michael Galeotti

    Abstract: Accurate and interpretable diagnostic models are crucial in the safety-critical field of medicine. We investigate the interpretability of our proposed biomarker-based lung ultrasound diagnostic pipeline to enhance clinicians' diagnostic capabilities. The objective of this study is to assess whether explanations from a decision tree classifier, utilizing biomarkers, can improve users' ability to id…

    Submitted 16 February, 2024; originally announced February 2024.

    Comments: Accepted at BIAS 2023 Conference

  13. arXiv:2401.12425  [pdf, other]

    cs.CV cs.CL cs.LG

    The Neglected Tails in Vision-Language Models

    Authors: Shubham Parashar, Zhiqiu Lin, Tian Liu, Xiangjue Dong, Yanan Li, Deva Ramanan, James Caverlee, Shu Kong

    Abstract: Vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive accuracy on ImageNet (60-80%), its performance drops below 10% for more than ten concepts like night snake, presumably due to their limited presence in the pretraining data. However, measuring the frequency of concepts in…

    Submitted 22 May, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

    Comments: Project Page: https://shubhamprshr27.github.io/neglected-tails-of-vlms/

  14. arXiv:2312.14494  [pdf, other]

    cs.CV

    Revisiting Few-Shot Object Detection with Vision-Language Models

    Authors: Anish Madan, Neehar Peri, Shu Kong, Deva Ramanan

    Abstract: The era of vision-language models (VLMs) trained on large web-scale datasets challenges conventional formulations of "open-world" perception. In this work, we revisit the task of few-shot object detection (FSOD) in the context of recent foundational VLMs. First, we point out that zero-shot VLMs such as GroundingDINO significantly outperform state-of-the-art few-shot detectors (48 vs. 33 AP) on COC…

    Submitted 14 June, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

  15. arXiv:2312.12433  [pdf, other]

    cs.CV cs.AI cs.LG

    TAO-Amodal: A Benchmark for Tracking Any Object Amodally

    Authors: Cheng-Yen Hsieh, Kaihua Chen, Achal Dave, Tarasha Khurana, Deva Ramanan

    Abstract: Amodal perception, the ability to comprehend complete object structures from partial visibility, is a fundamental skill, even for infants. Its significance extends to applications like autonomous driving, where a clear understanding of heavily occluded objects is essential. However, modern detection and tracking algorithms often overlook this critical capability, perhaps due to the prevalence of …

    Submitted 2 April, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

    Comments: Project Page: https://tao-amodal.github.io

  16. arXiv:2312.10986  [pdf, other]

    cs.CV cs.RO

    Long-Tailed 3D Detection via 2D Late Fusion

    Authors: Yechi Ma, Neehar Peri, Shuoquan Wei, Wei Hua, Deva Ramanan, Yanan Li, Shu Kong

    Abstract: Long-Tailed 3D Object Detection (LT3D) addresses the problem of accurately detecting objects from both common and rare classes. Contemporary multi-modal detectors achieve low AP on rare classes (e.g., CMT only achieves 9.4 AP on stroller), presumably because training detectors end-to-end with significant class imbalance is challenging. To address this limitation, we delve into a simple late-fusion…

    Submitted 14 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  17. arXiv:2312.03160  [pdf, other]

    cs.CV cs.GR cs.LG

    HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces

    Authors: Haithem Turki, Vasu Agrawal, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder, Deva Ramanan, Michael Zollhöfer, Christian Richardt

    Abstract: Neural radiance fields provide state-of-the-art view synthesis quality but tend to be slow to render. One reason is that they make use of volume rendering, thus requiring many samples (and model queries) per ray at render time. Although this representation is flexible and easy to optimize, most real-world objects can be modeled more efficiently with surfaces instead of volumes, requiring far fewer…

    Submitted 27 March, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: CVPR 2024. Project page: https://haithemturki.com/hybrid-nerf/

  18. arXiv:2312.02126  [pdf, other]

    cs.CV cs.AI cs.RO

    SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM

    Authors: Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, Jonathon Luiten

    Abstract: Dense simultaneous localization and mapping (SLAM) is crucial for robotics and augmented reality applications. However, current methods are often hampered by the non-volumetric or implicit way they represent a scene. This work introduces SplaTAM, an approach that, for the first time, leverages explicit volumetric representations, i.e., 3D Gaussians, to enable high-fidelity reconstruction from a si…

    Submitted 16 April, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: CVPR 2024. Website: https://spla-tam.github.io/

  19. arXiv:2312.00252  [pdf, other]

    cs.CV cs.GR cs.LG

    PyNeRF: Pyramidal Neural Radiance Fields

    Authors: Haithem Turki, Michael Zollhöfer, Christian Richardt, Deva Ramanan

    Abstract: Neural Radiance Fields (NeRFs) can be dramatically accelerated by spatial grid representations. However, they do not explicitly reason about scale and so introduce aliasing artifacts when reconstructing scenes captured at different camera distances. Mip-NeRF and its extensions propose scale-aware renderers that project volumetric frustums rather than point samples but such approaches rely on posit…

    Submitted 30 November, 2023; originally announced December 2023.

    Comments: NeurIPS 2023. Project page: https://haithemturki.com/pynerf/

  20. arXiv:2310.12464  [pdf, other]

    cs.CV cs.RO

    Lidar Panoptic Segmentation and Tracking without Bells and Whistles

    Authors: Abhinav Agarwalla, Xuhua Huang, Jason Ziglar, Francesco Ferroni, Laura Leal-Taixé, James Hays, Aljoša Ošep, Deva Ramanan

    Abstract: State-of-the-art lidar panoptic segmentation (LPS) methods follow a bottom-up, segmentation-centric approach wherein they build upon semantic segmentation networks by utilizing clustering to obtain object instances. In this paper, we re-think this approach and propose a surprisingly simple yet effective detection-centric network for both LPS and tracking. Our network is modular by design and optimized…

    Submitted 19 October, 2023; originally announced October 2023.

    Comments: IROS 2023. Code at https://github.com/abhinavagarwalla/most-lps

  21. arXiv:2310.05882  [pdf, other]

    cs.HC

    Evaluating a VR System for Collecting Safety-Critical Vehicle-Pedestrian Interactions

    Authors: Erica Weng, Kenta Mukoya, Deva Ramanan, Kris Kitani

    Abstract: Autonomous vehicles (AVs) require comprehensive and reliable pedestrian trajectory data to ensure safe operation. However, obtaining data of safety-critical scenarios such as jaywalking and near-collisions, or uncommon agents such as children, disabled pedestrians, and vulnerable road users poses logistical and ethical challenges. This paper evaluates a Virtual Reality (VR) system designed to coll…

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: In submission to CHI 2024

  22. arXiv:2310.01351  [pdf, other]

    cs.CV

    Streaming Motion Forecasting for Autonomous Driving

    Authors: Ziqi Pang, Deva Ramanan, Mengtian Li, Yu-Xiong Wang

    Abstract: Trajectory forecasting is a widely-studied problem for autonomous navigation. However, existing benchmarks evaluate forecasting based on independent snapshots of trajectories, which are not representative of real-world applications that operate on a continuous stream of data. To bridge this gap, we introduce a benchmark that continuously queries future trajectories on streaming data and we refer t…

    Submitted 2 October, 2023; originally announced October 2023.

    Comments: IROS 2023, 8 pages, 9 figures

  23. arXiv:2309.05950  [pdf, other]

    cs.CL cs.CV cs.LG cs.MM

    Language Models as Black-Box Optimizers for Vision-Language Models

    Authors: Shihong Liu, Zhiqiu Lin, Samuel Yu, Ryan Lee, Tiffany Ling, Deepak Pathak, Deva Ramanan

    Abstract: Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. As such, we aim to develop a black-box approach to optimize VLMs through natural language prompts, thereb…

    Submitted 13 May, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

    Comments: Published at CVPR 2024. Project site: https://llm-can-optimize-vlm.github.io/

  24. arXiv:2308.09713  [pdf, other]

    cs.CV

    Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis

    Authors: Jonathon Luiten, Georgios Kopanas, Bastian Leibe, Deva Ramanan

    Abstract: We present a method that simultaneously addresses the tasks of dynamic scene novel-view synthesis and six degree-of-freedom (6-DOF) tracking of all dense scene elements. We follow an analysis-by-synthesis framework, inspired by recent work that models scenes as a collection of 3D Gaussians which are optimized to reconstruct input images via differentiable rendering. To model dynamic scenes, we all…

    Submitted 18 August, 2023; originally announced August 2023.

  25. arXiv:2308.09105  [pdf, other]

    cs.CV cs.LG

    Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation

    Authors: Shengcao Cao, Mengtian Li, James Hays, Deva Ramanan, Yi-Xiong Wang, Liang-Yan Gui

    Abstract: Resource-constrained perception systems such as edge computing and vision-for-robotics require vision models to be both accurate and lightweight in computation and memory usage. While knowledge distillation is a proven strategy to enhance the performance of lightweight classification models, its application to structured outputs like object detection and instance segmentation remains a complicated…

    Submitted 17 August, 2023; originally announced August 2023.

    Comments: ICML 2023

  26. arXiv:2308.04054  [pdf, other]

    cs.CV cs.RO

    An Empirical Analysis of Range for 3D Object Detection

    Authors: Neehar Peri, Mengtian Li, Benjamin Wilson, Yu-Xiong Wang, James Hays, Deva Ramanan

    Abstract: LiDAR-based 3D detection plays a vital role in autonomous navigation. Surprisingly, although autonomous vehicles (AVs) must detect both near-field objects (for collision avoidance) and far-field objects (for longer-term planning), contemporary benchmarks focus only on near-field 3D detection. However, AVs must detect far-field objects for safe navigation. In this paper, we present an empirical ana…

    Submitted 8 August, 2023; originally announced August 2023.

    Comments: Accepted to ICCV 2023 Workshop - Robustness and Reliability of Autonomous Vehicles in the Open-World

  27. arXiv:2306.14035  [pdf, other]

    cs.CV

    Thinking Like an Annotator: Generation of Dataset Labeling Instructions

    Authors: Nadine Chang, Francesco Ferroni, Michael J. Tarr, Martial Hebert, Deva Ramanan

    Abstract: Large-scale datasets are essential to modern-day deep learning. Advocates argue that understanding these methods requires dataset transparency (e.g. "dataset curation, motivation, composition, collection process, etc..."). However, almost no one has suggested the release of the detailed definitions and visual category examples provided to annotators - information critical to understanding the stru…

    Submitted 24 June, 2023; originally announced June 2023.

  28. arXiv:2306.01879  [pdf, other]

    cs.CV cs.AI cs.CL

    Revisiting the Role of Language Priors in Vision-Language Models

    Authors: Zhiqiu Lin, Xinyue Chen, Deepak Pathak, Pengchuan Zhang, Deva Ramanan

    Abstract: Vision-language models (VLMs) are impactful in part because they can be applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We study generative VLMs that are trained for next-word generation given an image. We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks. O…

    Submitted 15 May, 2024; v1 submitted 2 June, 2023; originally announced June 2023.

    Comments: Published at ICML 2024. Website: https://linzhiqiu.github.io/papers/visual_gpt_score/

  29. arXiv:2305.10424  [pdf, other]

    cs.CV cs.LG

    ZeroFlow: Scalable Scene Flow via Distillation

    Authors: Kyle Vedder, Neehar Peri, Nathaniel Chodosh, Ishan Khatri, Eric Eaton, Dinesh Jayaraman, Yang Liu, Deva Ramanan, James Hays

    Abstract: Scene flow estimation is the task of describing the 3D motion field between temporally successive point clouds. State-of-the-art methods use strong priors and test-time optimization techniques, but require on the order of tens of seconds to process full-size point clouds, making them unusable as computer vision primitives for real-time applications such as open world object detection. Feedforward…

    Submitted 14 March, 2024; v1 submitted 17 May, 2023; originally announced May 2023.

    Comments: Accepted to ICLR 2024. 9 pages, 4 pages of citations, 6 pages of Supplemental. Project page with data releases is at http://vedder.io/zeroflow.html

  30. arXiv:2305.07528  [pdf, other]

    cs.CV cs.AI

    WEDGE: A multi-weather autonomous driving dataset built from generative vision-language models

    Authors: Aboli Marathe, Deva Ramanan, Rahee Walambe, Ketan Kotecha

    Abstract: The open road poses many challenges to autonomous perception, including poor visibility from extreme weather conditions. Models trained on good-weather datasets frequently fail at detection in these out-of-distribution settings. To aid adversarial robustness in perception, we introduce WEDGE (WEather images by DALL-E GEneration): a synthetic dataset generated with a vision-language generative mode…

    Submitted 12 May, 2023; originally announced May 2023.

    Comments: Accepted in Vision Datasets Understanding at CVPR 2023

  31. arXiv:2305.06351  [pdf, other]

    cs.CV cs.GR

    Reconstructing Animatable Categories from Videos

    Authors: Gengshan Yang, Chaoyang Wang, N Dinesh Reddy, Deva Ramanan

    Abstract: Building animatable 3D models is challenging due to the need for 3D scans, laborious registration, and manual rigging, which are difficult to scale to arbitrary categories. Recently, differentiable rendering provides a pathway to obtain high-quality 3D models from monocular videos, but these are limited to rigid categories or single instances. We present RAC that builds category 3D models from mon…

    Submitted 10 May, 2023; originally announced May 2023.

    Comments: Project page: https://gengshan-y.github.io/rac-www/

  32. arXiv:2305.06292  [pdf, other]

    cs.RO cs.LG

    Joint Metrics Matter: A Better Standard for Trajectory Forecasting

    Authors: Erica Weng, Hana Hoshino, Deva Ramanan, Kris Kitani

    Abstract: Multi-modal trajectory forecasting methods commonly evaluate using single-agent metrics (marginal metrics), such as minimum Average Displacement Error (ADE) and Final Displacement Error (FDE), which fail to capture joint performance of multiple interacting agents. Only focusing on marginal metrics can lead to unnatural predictions, such as colliding trajectories or diverging trajectories for peopl…

    Submitted 11 October, 2023; v1 submitted 10 May, 2023; originally announced May 2023.

    Comments: Published as a conference paper at ICCV 2023

  33. arXiv:2305.04926  [pdf, other]

    cs.CV

    RelPose++: Recovering 6D Poses from Sparse-view Observations

    Authors: Amy Lin, Jason Y. Zhang, Deva Ramanan, Shubham Tulsiani

    Abstract: We address the task of estimating 6D camera poses from sparse-view image sets (2-8 images). This task is a vital pre-processing stage for nearly all contemporary (neural) reconstruction algorithms but remains challenging given sparse views, especially for objects with visual symmetries and texture-less surfaces. We build on the recent RelPose framework which learns a network that infers distributi…

    Submitted 18 December, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

    Comments: Project webpage: https://amyxlase.github.io/relpose-plus-plus (Accepted to 3DV 2024)

  34. arXiv:2304.14389  [pdf, other]

    cs.RO

    SLoMo: A General System for Legged Robot Motion Imitation from Casual Videos

    Authors: John Z. Zhang, Shuo Yang, Gengshan Yang, Arun L. Bishop, Deva Ramanan, Zachary Manchester

    Abstract: We present SLoMo: a first-of-its-kind framework for transferring skilled motions from casually captured "in the wild" video footage of humans and animals to legged robots. SLoMo works in three stages: 1) synthesize a physically plausible reconstructed key-point trajectory from monocular videos; 2) optimize a dynamically feasible reference trajectory for the robot offline that includes body and foo…

    Submitted 5 September, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

    Comments: Accepted at RA-L 2023, with ICRA 2024 option

  35. arXiv:2304.12317  [pdf, other]

    cs.CV cs.GR cs.LG

    Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis

    Authors: Chonghyuk Song, Gengshan Yang, Kangle Deng, Jun-Yan Zhu, Deva Ramanan

    Abstract: We explore the task of embodied view synthesis from monocular videos of deformable scenes. Given a minute-long RGBD video of people interacting with their pets, we render the scene from novel camera trajectories derived from the in-scene motion of actors: (1) egocentric cameras that simulate the point of view of a target actor and (2) 3rd-person cameras that follow the actor. Building such a syste…

    Submitted 2 October, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

    Comments: ICCV 2023 camera-ready version. Project page with code, models, and data: https://andrewsonga.github.io/totalrecon

  36. arXiv:2304.02150  [pdf, other]

    cs.CV

    Re-Evaluating LiDAR Scene Flow for Autonomous Driving

    Authors: Nathaniel Chodosh, Deva Ramanan, Simon Lucey

    Abstract: Popular benchmarks for self-supervised LiDAR scene flow (stereoKITTI and FlyingThings3D) have unrealistic rates of dynamic motion, unrealistic correspondences, and unrealistic sampling patterns. As a result, progress on these benchmarks is misleading and may cause researchers to focus on the wrong problems. We evaluate a suite of top methods on a suite of real-world datasets (Argoverse 2.0, Waymo…

    Submitted 20 December, 2023; v1 submitted 4 April, 2023; originally announced April 2023.

    Comments: WACV 2024

  37. arXiv:2303.15390  [pdf, other]

    cs.CV

    Learning to Zoom and Unzoom

    Authors: Chittesh Thavamani, Mengtian Li, Francesco Ferroni, Deva Ramanan

    Abstract: Many perception systems in mobile computing, autonomous navigation, and AR/VR face strict compute constraints that are particularly challenging for high-resolution input images. Previous works propose nonuniform downsamplers that "learn to zoom" on salient image regions, reducing compute while retaining task-relevant image information. However, for tasks with spatial labels (such as 2D/3D object d…

    Submitted 27 March, 2023; originally announced March 2023.

    Comments: CVPR 2023. Code and additional visuals available at https://tchittesh.github.io/lzu/

  38. arXiv:2303.14536  [pdf, other]

    cs.CV cs.GR cs.LG

    SUDS: Scalable Urban Dynamic Scenes

    Authors: Haithem Turki, Jason Y. Zhang, Francesco Ferroni, Deva Ramanan

    Abstract: We extend neural radiance fields (NeRFs) to dynamic large-scale urban scenes. Prior work tends to reconstruct single video clips of short durations (up to 10 seconds). Two reasons are that such methods (a) tend to scale linearly with the number of moving objects and input videos because a separate model is built for each and (b) tend to require supervision via 3D bounding boxes and panoptic labels…

    Submitted 25 March, 2023; originally announced March 2023.

    Comments: CVPR 2023. Project page: https://haithemturki.com/suds/

  39. arXiv:2302.13130  [pdf, other]

    cs.CV eess.SP

    Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting

    Authors: Tarasha Khurana, Peiyun Hu, David Held, Deva Ramanan

    Abstract: Predicting how the world can evolve in the future is crucial for motion planning in autonomous systems. Classical methods are limited because they rely on costly human annotations in the form of semantic class labels, bounding boxes, and tracks or HD maps of cities to plan their motion and thus are difficult to scale to large unlabeled datasets. One promising self-supervised task is 3D point cloud…

    Submitted 30 April, 2023; v1 submitted 25 February, 2023; originally announced February 2023.

    Comments: CVPR 2023. Project page: https://www.cs.cmu.edu/~tkhurana/ff4d/index.html Code: https://github.com/tarashakhurana/4d-occ-forecasting

  40. arXiv:2302.08509  [pdf, other]

    cs.CV cs.GR cs.LG

    3D-aware Conditional Image Synthesis

    Authors: Kangle Deng, Gengshan Yang, Deva Ramanan, Jun-Yan Zhu

    Abstract: We propose pix2pix3D, a 3D-aware conditional generative model for controllable photorealistic image synthesis. Given a 2D label map, such as a segmentation or edge map, our model learns to synthesize a corresponding image from different viewpoints. To enable explicit 3D user control, we extend conditional generative models with neural radiance fields. Given widely-available monocular images and la…

    Submitted 1 May, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

    Comments: Project Page: https://www.cs.cmu.edu/~pix2pix3D/

  41. arXiv:2301.06267  [pdf, other]

    cs.CV cs.AI cs.LG cs.SD eess.AS

    Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

    Authors: Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, Deva Ramanan

    Abstract: The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, w…

    Submitted 2 August, 2023; v1 submitted 16 January, 2023; originally announced January 2023.

    Comments: CVPR 2023. Project website: https://linzhiqiu.github.io/papers/cross_modal/

  42. arXiv:2301.04224  [pdf, other]

    cs.CV cs.LG

    Pix2Map: Cross-modal Retrieval for Inferring Street Maps from Images

    Authors: Xindi Wu, KwunFung Lau, Francesco Ferroni, Aljoša Ošep, Deva Ramanan

    Abstract: Self-driving vehicles rely on urban street maps for autonomous navigation. In this paper, we introduce Pix2Map, a method for inferring urban street map topology directly from ego-view images, as needed to continually update and expand existing maps. This is a challenging task, as we need to infer a complex urban road topology directly from raw image data. The main insight of this paper is that thi…

    Submitted 9 April, 2023; v1 submitted 10 January, 2023; originally announced January 2023.

    Comments: 12 pages, 8 figures

  43. arXiv:2301.02657  [pdf, other]

    cs.CV cs.AI cs.LG

    TarViS: A Unified Approach for Target-based Video Segmentation

    Authors: Ali Athar, Alexander Hermans, Jonathon Luiten, Deva Ramanan, Bastian Leibe

    Abstract: The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state-of-the-art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied t…

    Submitted 10 May, 2023; v1 submitted 6 January, 2023; originally announced January 2023.

    Comments: Accepted to CVPR'23 (Highlight). Code is available at: https://github.com/Ali2500/TarViS

    ACM Class: I.4.6; I.4.8; I.4.10

  44. arXiv:2301.00493  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting

    Authors: Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, James Hays

    Abstract: We introduce Argoverse 2 (AV2) - a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated Sensor Dataset contains 1,000 sequences of multimodal data, encompassing high-resolution imagery from seven ring cameras, and two stereo cameras in addition to lidar point clouds, and 6-DOF map-aligned pose. Sequences contain 3D cuboid annotations for 26…

    Submitted 1 January, 2023; originally announced January 2023.

    Comments: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks

  45. arXiv:2211.13858  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Far3Det: Towards Far-Field 3D Detection

    Authors: Shubham Gupta, Jeet Kanjani, Mengtian Li, Francesco Ferroni, James Hays, Deva Ramanan, Shu Kong

    Abstract: We focus on the task of far-field 3D detection (Far3Det) of objects beyond a certain distance from an observer, e.g., $>$50m. Far3Det is particularly important for autonomous vehicles (AVs) operating at highway speeds, which require detections of far-field obstacles to ensure sufficient braking distances. However, contemporary AV benchmarks such as nuScenes underemphasize this problem because they…

    Submitted 24 November, 2022; originally announced November 2022.

    Comments: WACV 2023. 12 pages, 8 figures, 10 tables

  46. arXiv:2211.08691  [pdf, other]

    cs.CV cs.RO

    Towards Long-Tailed 3D Detection

    Authors: Neehar Peri, Achal Dave, Deva Ramanan, Shu Kong

    Abstract: Contemporary autonomous vehicle (AV) benchmarks have advanced techniques for training 3D detectors, particularly on large-scale lidar data. Surprisingly, although semantic class labels naturally follow a long-tailed distribution, contemporary benchmarks focus on only a few common classes (e.g., pedestrian and car) and neglect many rare classes in-the-tail (e.g., debris and stroller). However, AVs…

    Submitted 19 May, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

    Comments: This work has been accepted to the Conference on Robot Learning (CoRL) 2022

  47. arXiv:2211.04625  [pdf, other]

    cs.CV

    Soft Augmentation for Image Classification

    Authors: Yang Liu, Shen Yan, Laura Leal-Taixé, James Hays, Deva Ramanan

    Abstract: Modern neural networks are over-parameterized and thus rely on strong regularization such as data augmentation and weight decay to reduce overfitting and improve generalization. The dominant form of data augmentation applies invariant transforms, where the learning target of a sample is invariant to the transform applied to that sample. We draw inspiration from human visual classification studies…

    Submitted 23 January, 2024; v1 submitted 8 November, 2022; originally announced November 2022.

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023 (pp. 16241-16250)

  48. arXiv:2210.10774  [pdf, other]

    cs.CV

    Learning to Discover and Detect Objects

    Authors: Vladimir Fomenko, Ismail Elezi, Deva Ramanan, Laura Leal-Taixé, Aljoša Ošep

    Abstract: We tackle the problem of novel class discovery and localization (NCDL). In this setting, we assume a source dataset with supervision for only some object classes. Instances of other classes need to be discovered, classified, and localized automatically based on visual similarity without any human supervision. To tackle NCDL, we propose a two-stage object detection network Region-based NCDL (RNCDL)…

    Submitted 30 November, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: Accepted to NeurIPS 2022, Homepage: https://vlfom.github.io/RNCDL/

  49. arXiv:2210.04993  [pdf, other]

    cs.CV cs.AI cs.LG

    Continual Learning with Evolving Class Ontologies

    Authors: Zhiqiu Lin, Deepak Pathak, Yu-Xiong Wang, Deva Ramanan, Shu Kong

    Abstract: Lifelong learners must recognize concept vocabularies that evolve over time. A common yet underexplored scenario is learning with class labels that continually refine/expand old classes. For example, humans learn to recognize ${\tt dog}$ before dog breeds. In practical settings, dataset $\textit{versioning}$ often introduces refinement to ontologies, such as autonomous vehicle benchmarks that refi…

    Submitted 14 December, 2022; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: NeurIPS 2022; Website: https://linzhiqiu.github.io/papers/leco/

  50. arXiv:2210.01917  [pdf, other]

    cs.CV cs.RO

    Differentiable Raycasting for Self-supervised Occupancy Forecasting

    Authors: Tarasha Khurana, Peiyun Hu, Achal Dave, Jason Ziglar, David Held, Deva Ramanan

    Abstract: Motion planning for safe autonomous driving requires learning how the environment around an ego-vehicle evolves with time. Ego-centric perception of driveable regions in a scene not only changes with the motion of actors in the environment, but also with the movement of the ego-vehicle itself. Self-supervised representations proposed for large-scale planning, such as ego-centric freespace, confoun…

    Submitted 18 October, 2022; v1 submitted 4 October, 2022; originally announced October 2022.

    Comments: ECCV 2022. Code available at https://github.com/tarashakhurana/emergent-occ-forecasting