Showing 1–38 of 38 results for author: Girdhar, R

  1. arXiv:2404.05206  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

    Authors: Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

    Abstract: We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when…

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: Accepted at CVPR 2024. Project page: https://vision.cs.utexas.edu/projects/soundingactions

  2. arXiv:2402.03290  [pdf, other]

    cs.CV cs.AI cs.LG

    InstanceDiffusion: Instance-level Control for Image Generation

    Authors: Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, Ishan Misra

    Abstract: Text-to-image diffusion models produce high quality images but do not offer control over individual instances in the image. We introduce InstanceDiffusion that adds precise instance-level control to text-to-image diffusion models. InstanceDiffusion supports free-form language conditions per instance and allows flexible ways to specify instance locations such as simple single points, scribbles, bou…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: Preprint; Project page: https://people.eecs.berkeley.edu/~xdwang/projects/InstDiff/

  3. arXiv:2312.04552  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    Generating Illustrated Instructions

    Authors: Sachit Menon, Ishan Misra, Rohit Girdhar

    Abstract: We introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs. We identify desiderata unique to this task, and formalize it through a suite of automatic and human evaluation metrics, designed to measure the validity, consistency, and efficacy of the generations. We combine the power of large language models (LLMs) together with strong text…

    Submitted 12 April, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

    Comments: Accepted to CVPR 2024. Project website: http://facebookresearch.github.io/IllustratedInstructions. Code reproduction: https://github.com/sachit-menon/generating-illustrated-instructions-reproduction

  4. arXiv:2311.18827  [pdf, other]

    cs.GR cs.AI cs.CV cs.LG cs.MM

    Motion-Conditioned Image Animation for Video Editing

    Authors: Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, Samaneh Azadi

    Abstract: We introduce MoCA, a Motion-Conditioned Image Animation approach for video editing. It leverages a simple decomposition of the video editing problem into image editing followed by motion-conditioned image animation. Furthermore, given the lack of robust evaluation datasets for video editing, we introduce a new benchmark that measures edit capability across a wide variety of tasks, such as object r…

    Submitted 30 November, 2023; originally announced November 2023.

    Comments: Project page: https://facebookresearch.github.io/MoCA

  5. arXiv:2311.10709  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG cs.MM

    Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

    Authors: Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra

    Abstract: We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions--adjusted noise schedules for diffusion, and multi-stage training--that enable us to directly generate high quality and high resolut…

    Submitted 17 November, 2023; originally announced November 2023.

    Comments: Project page: https://emu-video.metademolab.com

  6. arXiv:2308.14710  [pdf, other]

    cs.CV cs.AI cs.LG

    VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

    Authors: Xudong Wang, Ishan Misra, Ziyun Zeng, Rohit Girdhar, Trevor Darrell

    Abstract: Existing approaches to unsupervised video instance segmentation typically rely on motion estimates and experience difficulties tracking small or divergent motions. We present VideoCutLER, a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos. Our key insight is that using high-quality pseudo ma…

    Submitted 28 August, 2023; originally announced August 2023.

    Comments: Preprint. Code: https://github.com/facebookresearch/CutLER

  7. arXiv:2305.05665  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    ImageBind: One Embedding Space To Bind Them All

    Authors: Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

    Abstract: We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their…

    Submitted 31 May, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

    Comments: CVPR 2023 (Highlighted Paper). Website: https://imagebind.metademolab.com/ Code/Models: https://github.com/facebookresearch/ImageBind

  8. arXiv:2303.13496  [pdf, other]

    cs.CV cs.AI cs.LG

    The effectiveness of MAE pre-pretraining for billion-scale pretraining

    Authors: Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

    Abstract: This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has on…

    Submitted 24 January, 2024; v1 submitted 23 March, 2023; originally announced March 2023.

    Comments: ICCV 2023. Models available at https://github.com/facebookresearch/maws/

  9. arXiv:2302.07960  [pdf, other]

    cs.LG cs.HC

    Learning to Substitute Ingredients in Recipes

    Authors: Bahare Fatemi, Quentin Duval, Rohit Girdhar, Michal Drozdzal, Adriana Romero-Soriano

    Abstract: Recipe personalization through ingredient substitution has the potential to help people meet their dietary needs and preferences, avoid potential allergens, and ease culinary exploration in everyone's kitchen. To address ingredient substitution, we build a benchmark, composed of a dataset of substitution pairs with standardized splits, evaluation metrics, and baselines. We further introduce Graph-…

    Submitted 15 February, 2023; originally announced February 2023.

  10. arXiv:2301.11320  [pdf, other]

    cs.CV cs.AI cs.LG

    Cut and Learn for Unsupervised Object Detection and Instance Segmentation

    Authors: Xudong Wang, Rohit Girdhar, Stella X. Yu, Ishan Misra

    Abstract: We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models. We leverage the property of self-supervised models to 'discover' objects without supervision and amplify it to train a state-of-the-art localization model without any human labels. CutLER first uses our proposed MaskCut approach to generate coarse masks for multiple objects in a…

    Submitted 26 January, 2023; originally announced January 2023.

    Comments: Tech report. Project page: http://people.eecs.berkeley.edu/~xdwang/projects/CutLER/. Code is available at https://github.com/facebookresearch/CutLER

  11. arXiv:2301.02311  [pdf, other]

    cs.CV

    HierVL: Learning Hierarchical Video-Language Embeddings

    Authors: Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman

    Abstract: Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos acc…

    Submitted 8 June, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: CVPR 2023

  12. arXiv:2301.02307  [pdf, other]

    cs.CV

    What You Say Is What You Show: Visual Narration Detection in Instructional Videos

    Authors: Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman

    Abstract: Narrated "how-to" videos have emerged as a promising data source for a wide range of learning problems, from learning visual representations to training robot policies. However, this data is extremely noisy, as the narrations do not always describe the actions demonstrated in the video. To address this problem we introduce the novel task of visual narration detection, which entails determining w…

    Submitted 18 July, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: Technical Report

  13. arXiv:2212.04501  [pdf, other]

    cs.CV

    Learning Video Representations from Large Language Models

    Authors: Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar

    Abstract: We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual informatio…

    Submitted 8 December, 2022; originally announced December 2022.

    Comments: Tech report. Project page: https://facebookresearch.github.io/LaViLa; Code is available at http://github.com/facebookresearch/LaViLa

  14. arXiv:2206.08356  [pdf, other]

    cs.CV cs.AI cs.LG stat.ML

    OmniMAE: Single Model Masked Pretraining on Images and Videos

    Authors: Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

    Abstract: Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work studies these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse pe…

    Submitted 31 May, 2023; v1 submitted 16 June, 2022; originally announced June 2022.

    Comments: CVPR 2023. Code/models: https://github.com/facebookresearch/omnivore

  15. arXiv:2201.08377  [pdf, other]

    cs.CV cs.AI cs.IR cs.LG

    Omnivore: A Single Model for Many Visual Modalities

    Authors: Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra

    Abstract: Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is tra…

    Submitted 30 March, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

    Comments: Accepted at CVPR 2022 (Oral Presentation)

  16. arXiv:2201.02605  [pdf, other]

    cs.CV

    Detecting Twenty-thousand Classes using Image-level Supervision

    Authors: Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra

    Abstract: Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of con…

    Submitted 29 July, 2022; v1 submitted 7 January, 2022; originally announced January 2022.

    Comments: ECCV 2022 camera ready. Code is available at https://github.com/facebookresearch/Detic

  17. arXiv:2112.10764  [pdf, other]

    cs.CV cs.AI cs.LG

    Mask2Former for Video Instance Segmentation

    Authors: Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, Alexander G. Schwing

    Abstract: We find Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline. In this report, we show universal image segmentation architectures trivially generalize to video segmentation by directly predicting 3D segmentation volumes. Specifically, Mask2Former sets a new state-of-the-art of 60.4 AP on YouT…

    Submitted 20 December, 2021; originally announced December 2021.

    Comments: Code and models: https://github.com/facebookresearch/Mask2Former

  18. arXiv:2112.01527  [pdf, other]

    cs.CV cs.AI cs.LG

    Masked-attention Mask Transformer for Universal Image Segmentation

    Authors: Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar

    Abstract: Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmenta…

    Submitted 15 June, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

    Comments: CVPR 2022. Project page/code/models: https://bowenc0221.github.io/mask2former

  19. arXiv:2110.07058  [pdf, other]

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons…

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  20. arXiv:2109.08141  [pdf, other]

    cs.CV cs.AI cs.LG

    An End-to-End Transformer Model for 3D Object Detection

    Authors: Ishan Misra, Rohit Girdhar, Armand Joulin

    Abstract: We propose 3DETR, an end-to-end Transformer based object detection model for 3D point clouds. Compared to existing detection methods that employ a number of 3D-specific inductive biases, 3DETR requires minimal modifications to the vanilla Transformer block. Specifically, we find that a standard Transformer with non-parametric queries and Fourier positional embeddings is competitive with specialize…

    Submitted 16 September, 2021; originally announced September 2021.

    Comments: Accepted at ICCV 2021

  21. arXiv:2106.02036  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    Anticipative Video Transformer

    Authors: Rohit Girdhar, Kristen Grauman

    Abstract: We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions. We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features. Compared to existing temporal a…

    Submitted 22 September, 2021; v1 submitted 3 June, 2021; originally announced June 2021.

    Comments: ICCV 2021. Ranked #1 in CVPR'21 EPIC-Kitchens-100 Action Anticipation challenge. Webpage/code/models: http://facebookresearch.github.io/AVT

  22. arXiv:2105.06461  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    3D Spatial Recognition without Spatially Labeled 3D

    Authors: Zhongzheng Ren, Ishan Misra, Alexander G. Schwing, Rohit Girdhar

    Abstract: We introduce WyPR, a Weakly-supervised framework for Point cloud Recognition, requiring only scene-level class tags as supervision. WyPR jointly addresses three core 3D recognition tasks: point-level semantic segmentation, 3D proposal generation, and 3D object detection, coupling their predictions through self and cross-task consistency losses. We show that in conjunction with standard multiple-in…

    Submitted 13 May, 2021; originally announced May 2021.

    Comments: CVPR 2021

  23. arXiv:2102.10336  [pdf, other]

    cs.AI cs.LG

    Physical Reasoning Using Dynamics-Aware Models

    Authors: Eltayeb Ahmed, Anton Bakhtin, Laurens van der Maaten, Rohit Girdhar

    Abstract: A common approach to solving physical reasoning tasks is to train a value learner on example tasks. A limitation of such an approach is that it requires learning about object dynamics solely from reward values assigned to the final state of a rollout of the environment. This study aims to address this limitation by augmenting the reward value with self-supervised signals about object dynamics. Spe…

    Submitted 1 September, 2021; v1 submitted 20 February, 2021; originally announced February 2021.

    Comments: ICML 2021 Workshop on Self-Supervised Learning for Reasoning and Perception; Webpage/Code: https://facebookresearch.github.io/DynamicsAware

  24. arXiv:2101.02691  [pdf, other]

    cs.CV

    Self-Supervised Pretraining of 3D Features on any Point-Cloud

    Authors: Zaiwei Zhang, Rohit Girdhar, Armand Joulin, Ishan Misra

    Abstract: Pretraining on large labeled datasets is a prerequisite to achieve good performance in many computer vision tasks like 2D object recognition, video classification etc. However, pretraining is not widely used for 3D recognition tasks where state-of-the-art methods train models from scratch. A primary reason is the lack of large annotated datasets because 3D data is both difficult to acquire and tim…

    Submitted 7 January, 2021; originally announced January 2021.

  25. arXiv:2012.07512  [pdf]

    cs.CL cs.LG cs.SI

    Linguistic Classification using Instance-Based Learning

    Authors: Priya S. Nayak, Rhythm Girdhar, Shreekanth M. Prabhu

    Abstract: Traditionally linguists have organized languages of the world as language families modelled as trees. In this work we take a contrarian approach and question the tree-based model that is rather restrictive. For example, the affinity that Sanskrit independently has with languages across Indo-European languages is better illustrated using a network model. We can say the same about inter-relationship…

    Submitted 1 December, 2020; originally announced December 2020.

    Comments: 8 pages, 3 papers

  26. arXiv:2006.10734  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Forward Prediction for Physical Reasoning

    Authors: Rohit Girdhar, Laura Gustafson, Aaron Adcock, Laurens van der Maaten

    Abstract: Physical reasoning requires forward prediction: the ability to forecast what will happen next given some initial world state. We study the performance of state-of-the-art forward-prediction models in the complex physical-reasoning tasks of the PHYRE benchmark. We do so by incorporating models that operate on object or pixel-based representations of the world into simple physical-reasoning agents.…

    Submitted 29 March, 2021; v1 submitted 18 June, 2020; originally announced June 2020.

    Comments: Webpage/code/models: https://facebookresearch.github.io/phyre-fwd/

  27. arXiv:2006.07203   

    cs.CV

    Video Understanding as Machine Translation

    Authors: Bruno Korbar, Fabio Petroni, Rohit Girdhar, Lorenzo Torresani

    Abstract: With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations. Most prior work formulates the objective as a contrastive metric learning problem between the modalities. To enable effective learning, however, these strategies require a careful selection of positi…

    Submitted 17 September, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

    Comments: The authors have temporarily withdrawn this paper to reassess some of the experimental results

  28. arXiv:1911.03083  [pdf, other]

    cs.CV cs.CL

    Are we asking the right questions in MovieQA?

    Authors: Bhavan Jasani, Rohit Girdhar, Deva Ramanan

    Abstract: Joint vision and language tasks like visual question answering are fascinating because they explore high-level understanding, but at the same time, can be more prone to language biases. In this paper, we explore the biases in the MovieQA dataset and propose a strikingly simple model which can exploit them. We find that using the right word embedding is of utmost importance. By using an appropriate…

    Submitted 8 November, 2019; originally announced November 2019.

    Comments: Spotlight presentation at CLVL workshop, ICCV 2019. Project page: https://bhavanj.github.io/MovieQAWithoutMovies/

  29. arXiv:1910.04744  [pdf, other]

    cs.CV

    CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning

    Authors: Rohit Girdhar, Deva Ramanan

    Abstract: Computer vision has undergone a dramatic revolution in performance, driven in large part through deep features trained on large-scale supervised datasets. However, much of these improvements have focused on static image analysis; video understanding has seen rather modest improvements. Even though new datasets and spatiotemporal models have been proposed, simple frame-by-frame classification metho…

    Submitted 4 April, 2020; v1 submitted 10 October, 2019; originally announced October 2019.

    Comments: ICLR 2020 (oral). Webpage/code/data: https://rohitgirdhar.github.io/CATER

  30. arXiv:1910.04742  [pdf, other]

    cs.CV cs.LG

    MetaPix: Few-Shot Video Retargeting

    Authors: Jessica Lee, Deva Ramanan, Rohit Girdhar

    Abstract: We address the task of unsupervised retargeting of human actions from one video to another. We consider the challenging setting where only a few frames of the target is available. The core of our approach is a conditional generative model that can transcode input skeletal poses (automatically extracted with an off-the-shelf pose estimator) to output target frames. However, it is challenging to bui…

    Submitted 24 March, 2020; v1 submitted 10 October, 2019; originally announced October 2019.

    Comments: Short version accepted to NeurIPS'19 MetaLearn Workshop. Full version accepted to ICLR 2020. Webpage: https://imjal.github.io/MetaPix/

  31. arXiv:1901.09244  [pdf, other]

    cs.CV

    DistInit: Learning Video Representations Without a Single Labeled Video

    Authors: Rohit Girdhar, Du Tran, Lorenzo Torresani, Deva Ramanan

    Abstract: Video recognition models have progressed significantly over the past few years, evolving from shallow classifiers trained on hand-crafted features to deep spatiotemporal networks. However, labeled video data required to train such models have not been able to keep up with the ever-increasing depth and sophistication of these networks. In this work, we propose an alternative approach to learning vi…

    Submitted 20 August, 2019; v1 submitted 26 January, 2019; originally announced January 2019.

    Comments: ICCV 2019

  32. arXiv:1812.02707  [pdf, other]

    cs.CV

    Video Action Transformer Network

    Authors: Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman

    Abstract: We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people…

    Submitted 17 May, 2019; v1 submitted 6 December, 2018; originally announced December 2018.

    Comments: CVPR 2019

  33. arXiv:1807.10066  [pdf, other]

    cs.CV

    A Better Baseline for AVA

    Authors: Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman

    Abstract: We introduce a simple baseline for action localization on the AVA dataset. The model builds upon the Faster R-CNN bounding box detection framework, adapted to operate on pure spatiotemporal features - in our case produced exclusively by an I3D model pretrained on Kinetics. This model obtains 21.9% average AP on the validation set of AVA v2.1, up from 14.5% for the best RGB spatiotemporal model use…

    Submitted 26 July, 2018; originally announced July 2018.

    Comments: ActivityNet Workshop (AVA Challenge), CVPR 2018

  34. arXiv:1804.03080  [pdf, other]

    cs.CV

    Binge Watching: Scaling Affordance Learning from Sitcoms

    Authors: Xiaolong Wang, Rohit Girdhar, Abhinav Gupta

    Abstract: In recent years, there has been a renewed interest in jointly modeling perception and action. At the core of this investigation is the idea of modeling affordances (Affordances are opportunities of interaction in the scene. In other words, it represents what actions can the object be used for). However, when it comes to predicting affordances, even the state of the art approaches still do not use a…

    Submitted 9 April, 2018; originally announced April 2018.

    Comments: CVPR 2017, project page: http://www.cs.cmu.edu/~xiaolonw/affordance.html

  35. arXiv:1712.09184  [pdf, other]

    cs.CV

    Detect-and-Track: Efficient Pose Estimation in Videos

    Authors: Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, Du Tran

    Abstract: This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video. We propose an extremely lightweight yet highly effective approach that builds upon the latest advancements in human detection and video understanding. Our method operates in two-stages: keypoint estimation in frames or short clips, followed by lightweight tracking to generate keypoint p…

    Submitted 2 May, 2018; v1 submitted 26 December, 2017; originally announced December 2017.

    Comments: In CVPR 2018. Ranked first in ICCV 2017 PoseTrack challenge (keypoint tracking in videos). Code: https://github.com/facebookresearch/DetectAndTrack and webpage: https://rohitgirdhar.github.io/DetectAndTrack/

  36. arXiv:1711.01467  [pdf, other]

    cs.CV

    Attentional Pooling for Action Recognition

    Authors: Rohit Girdhar, Deva Ramanan

    Abstract: We introduce a simple yet surprisingly powerful model to incorporate attention in action recognition and human object interaction tasks. Our proposed attention module can be trained with or without extra supervision, and gives a sizable boost in accuracy while keeping the network size and computational cost nearly the same. It leads to significant improvements over state of the art base architectu…

    Submitted 29 December, 2017; v1 submitted 4 November, 2017; originally announced November 2017.

    Comments: In NIPS 2017. Project page: https://rohitgirdhar.github.io/AttentionalPoolingAction/

  37. arXiv:1704.02895  [pdf, other]

    cs.CV

    ActionVLAD: Learning spatio-temporal aggregation for action classification

    Authors: Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell

    Abstract: In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks with learnable spatio-temporal feature aggregation. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different…

    Submitted 10 April, 2017; originally announced April 2017.

    Comments: Accepted to CVPR 2017. Project page: https://rohitgirdhar.github.io/ActionVLAD/

  38. arXiv:1603.08637  [pdf, other]

    cs.CV

    Learning a Predictable and Generative Vector Representation for Objects

    Authors: Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, Abhinav Gupta

    Abstract: What is a good vector representation of an object? We believe that it should be generative in 3D, in the sense that it can produce new 3D objects; as well as be predictable from 2D, in the sense that it can be perceived from 2D images. We propose a novel architecture, called the TL-embedding network, to learn an embedding space with these properties. The network consists of two components: (a) an…

    Submitted 31 August, 2016; v1 submitted 29 March, 2016; originally announced March 2016.

    Comments: To appear in ECCV 2016. Project webpage: rohitgirdhar.github.io/GenerativePredictableVoxels/