
Showing 1–50 of 67 results for author: Misra, I

  1. arXiv:2402.03290  [pdf, other]

    cs.CV cs.AI cs.LG

    InstanceDiffusion: Instance-level Control for Image Generation

    Authors: Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, Ishan Misra

    Abstract: Text-to-image diffusion models produce high quality images but do not offer control over individual instances in the image. We introduce InstanceDiffusion that adds precise instance-level control to text-to-image diffusion models. InstanceDiffusion supports free-form language conditions per instance and allows flexible ways to specify instance locations such as simple single points, scribbles, bou… ▽ More

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: Preprint; Project page: https://people.eecs.berkeley.edu/~xdwang/projects/InstDiff/

  2. arXiv:2312.17681  [pdf, other]

    cs.CV cs.MM

    FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

    Authors: Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Peter Vajda, Diana Marculescu

    Abstract: Diffusion models have transformed the image-to-image (I2I) synthesis and are now permeating into videos. However, the advancement of video-to-video (V2V) synthesis has been hampered by the challenge of maintaining temporal consistency across video frames. This paper proposes a consistent V2V synthesis framework by jointly leveraging spatial conditions and temporal optical flow clues within the sou… ▽ More

    Submitted 29 December, 2023; originally announced December 2023.

    Comments: Project website: https://jeff-liangf.github.io/projects/flowvid/

  3. arXiv:2312.16894  [pdf]

    cs.CV

    Chaurah: A Smart Raspberry Pi based Parking System

    Authors: Soumya Ranjan Choudhaury, Aditya Narendra, Ashutosh Mishra, Ipsit Misra

    Abstract: The widespread usage of cars and other large, heavy vehicles necessitates the development of an effective parking infrastructure. Additionally, algorithms for detection and recognition of number plates are often used to identify automobiles all around the world where standardized plate sizes and fonts are enforced, making recognition an effortless task. As a result, both kinds of data can be combi… ▽ More

    Submitted 28 December, 2023; originally announced December 2023.

    Comments: 13 Pages, 9 Figures, Accepted at ICCCT-23

  4. arXiv:2312.04552  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    Generating Illustrated Instructions

    Authors: Sachit Menon, Ishan Misra, Rohit Girdhar

    Abstract: We introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs. We identify desiderata unique to this task, and formalize it through a suite of automatic and human evaluation metrics, designed to measure the validity, consistency, and efficacy of the generations. We combine the power of large language models (LLMs) together with strong text… ▽ More

    Submitted 12 April, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

    Comments: Accepted to CVPR 2024. Project website: http://facebookresearch.github.io/IllustratedInstructions. Code reproduction: https://github.com/sachit-menon/generating-illustrated-instructions-reproduction

  5. arXiv:2311.16098  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    On Bringing Robots Home

    Authors: Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, Lerrel Pinto

    Abstract: Throughout history, we have successfully integrated various machines into our homes. Dishwashers, laundry machines, stand mixers, and robot vacuums are a few recent examples. However, these machines excel at performing only a single task effectively. The concept of a "generalist machine" in homes - a domestic assistant that can adapt and learn from our needs, all while remaining cost-effective - h… ▽ More

    Submitted 27 November, 2023; originally announced November 2023.

    Comments: Project website and videos are available at https://dobb-e.com, technical documentation for getting started is available at https://docs.dobb-e.com, and code is released at https://github.com/notmahi/dobb-e

  6. arXiv:2311.10709  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG cs.MM

    Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

    Authors: Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra

    Abstract: We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions--adjusted noise schedules for diffusion, and multi-stage training--that enable us to directly generate high quality and high resolut… ▽ More

    Submitted 17 November, 2023; originally announced November 2023.

    Comments: Project page: https://emu-video.metademolab.com

  7. arXiv:2311.10708  [pdf, other]

    cs.CV cs.LG

    SelfEval: Leveraging the discriminative nature of generative models for evaluation

    Authors: Sai Saketh Rambhatla, Ishan Misra

    Abstract: In this work, we show that text-to-image generative models can be 'inverted' to assess their own text-image understanding capabilities in a completely automated manner. Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts, making the generative model directly applicable to discriminative tasks. Using SelfEval, we repurpose standard… ▽ More

    Submitted 17 November, 2023; originally announced November 2023.

  8. arXiv:2308.14710  [pdf, other]

    cs.CV cs.AI cs.LG

    VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

    Authors: Xudong Wang, Ishan Misra, Ziyun Zeng, Rohit Girdhar, Trevor Darrell

    Abstract: Existing approaches to unsupervised video instance segmentation typically rely on motion estimates and experience difficulties tracking small or divergent motions. We present VideoCutLER, a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos. Our key insight is that using high-quality pseudo ma… ▽ More

    Submitted 28 August, 2023; originally announced August 2023.

    Comments: Preprint. Code: https://github.com/facebookresearch/CutLER

  9. arXiv:2306.07969  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    GeneCIS: A Benchmark for General Conditional Image Similarity

    Authors: Sagar Vaze, Nicolas Carion, Ishan Misra

    Abstract: We argue that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically. This contrasts with most representation learning methods, supervised or self-supervised, which learn a fixed embedding function and hence implicitly assume a single notion of similarity. For instance, models trained on ImageNet are biased towards object categories, while… ▽ More

    Submitted 13 June, 2023; originally announced June 2023.

    Comments: CVPR 2023 (Highlighted Paper). Project page at https://sgvaze.github.io/genecis/

  10. arXiv:2305.05665  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    ImageBind: One Embedding Space To Bind Them All

    Authors: Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

    Abstract: We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their… ▽ More

    Submitted 31 May, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

    Comments: CVPR 2023 (Highlighted Paper). Website: https://imagebind.metademolab.com/ Code/Models: https://github.com/facebookresearch/ImageBind

  11. arXiv:2304.07193  [pdf, other]

    cs.CV

    DINOv2: Learning Robust Visual Features without Supervision

    Authors: Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, et al. (1 additional author not shown)

    Abstract: The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pr… ▽ More

    Submitted 2 February, 2024; v1 submitted 14 April, 2023; originally announced April 2023.

  12. arXiv:2304.05387  [pdf, other]

    cs.CV

    MOST: Multiple Object localization with Self-supervised Transformers for object discovery

    Authors: Sai Saketh Rambhatla, Ishan Misra, Rama Chellappa, Abhinav Shrivastava

    Abstract: We tackle the challenging task of unsupervised object localization in this work. Recently, transformers trained with self-supervised learning have been shown to exhibit object localization properties without being trained for this task. In this work, we present Multiple Object localization with Self-supervised Transformers (MOST) that uses features of transformers trained using self-supervised lea… ▽ More

    Submitted 26 August, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

    Comments: Accepted to ICCV2023 as an Oral. Project webpage: https://rssaketh.github.io/most

  13. arXiv:2303.13496  [pdf, other]

    cs.CV cs.AI cs.LG

    The effectiveness of MAE pre-pretraining for billion-scale pretraining

    Authors: Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

    Abstract: This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has on… ▽ More

    Submitted 24 January, 2024; v1 submitted 23 March, 2023; originally announced March 2023.

    Comments: ICCV 2023. Models available at https://github.com/facebookresearch/maws/

  14. arXiv:2302.14483  [pdf, other]

    cs.LG cs.CV stat.ML

    RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data

    Authors: Sangwoo Mo, Jong-Chyi Su, Chih-Yao Ma, Mido Assran, Ishan Misra, Licheng Yu, Sean Bell

    Abstract: Semi-supervised learning aims to train a model using limited labels. State-of-the-art semi-supervised methods for image classification such as PAWS rely on self-supervised representations learned with large-scale unlabeled but curated data. However, PAWS is often less effective when using real-world unlabeled data that is uncurated, e.g., contains out-of-class data. We propose RoPAWS, a robust ext… ▽ More

    Submitted 28 February, 2023; originally announced February 2023.

    Comments: ICLR 2023

  15. arXiv:2301.11320  [pdf, other]

    cs.CV cs.AI cs.LG

    Cut and Learn for Unsupervised Object Detection and Instance Segmentation

    Authors: Xudong Wang, Rohit Girdhar, Stella X. Yu, Ishan Misra

    Abstract: We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models. We leverage the property of self-supervised models to 'discover' objects without supervision and amplify it to train a state-of-the-art localization model without any human labels. CutLER first uses our proposed MaskCut approach to generate coarse masks for multiple objects in a… ▽ More

    Submitted 26 January, 2023; originally announced January 2023.

    Comments: Tech report. Project page: http://people.eecs.berkeley.edu/~xdwang/projects/CutLER/. Code is available at https://github.com/facebookresearch/CutLER

  16. arXiv:2301.11100  [pdf, other]

    cs.CV cs.CY cs.HC

    Vision-Language Models Performing Zero-Shot Tasks Exhibit Gender-based Disparities

    Authors: Melissa Hall, Laura Gustafson, Aaron Adcock, Ishan Misra, Candace Ross

    Abstract: We explore the extent to which zero-shot vision-language models exhibit gender bias for different vision tasks. Vision models traditionally required task-specific labels for representing concepts, as well as finetuning; zero-shot models like CLIP instead perform tasks with an open-vocabulary, meaning they do not need a fixed set of labels, by using text embeddings to represent concepts. With these… ▽ More

    Submitted 26 January, 2023; originally announced January 2023.

  17. arXiv:2301.10026  [pdf]

    cs.CY cs.AI cs.LG

    From Robots to Books: An Introduction to Smart Applications of AI in Education (AIEd)

    Authors: Shubham Ojha, Aditya Narendra, Siddharth Mohapatra, Ipsit Misra

    Abstract: The world around us has undergone a radical transformation due to rapid technological advancement in recent decades. The industry of the future generation is evolving, and artificial intelligence is the following change in the making popularly known as Industry 4.0. Indeed, experts predict that artificial intelligence(AI) will be the main force behind the following significant virtual shift in the… ▽ More

    Submitted 11 January, 2023; originally announced January 2023.

    Comments: In Preparation for Conference Submission, 9 Pages, 5 Tables, 1 Figure

  18. arXiv:2301.09451  [pdf, other]

    cs.CV cs.AI cs.LG

    A Simple Recipe for Competitive Low-compute Self supervised Vision Models

    Authors: Quentin Duval, Ishan Misra, Nicolas Ballas

    Abstract: Self-supervised methods in vision have been mostly focused on large architectures as they seem to suffer from a significant performance drop for smaller architectures. In this paper, we propose a simple self-supervised distillation technique that can train high performance low-compute neural networks. Our main insight is that existing joint-embedding based SSL methods can be repurposed for knowled… ▽ More

    Submitted 23 January, 2023; originally announced January 2023.

  19. arXiv:2301.08243  [pdf, other]

    cs.CV cs.AI cs.LG eess.IV

    Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

    Authors: Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas

    Abstract: This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target block… ▽ More

    Submitted 13 April, 2023; v1 submitted 19 January, 2023; originally announced January 2023.

    Comments: 2023 IEEE/CVF International Conference on Computer Vision

  20. arXiv:2212.04501  [pdf, other]

    cs.CV

    Learning Video Representations from Large Language Models

    Authors: Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar

    Abstract: We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual informatio… ▽ More

    Submitted 8 December, 2022; originally announced December 2022.

    Comments: Tech report. Project page: https://facebookresearch.github.io/LaViLa; Code is available at http://github.com/facebookresearch/LaViLa

  21. arXiv:2210.07277  [pdf, other]

    cs.LG cs.AI cs.CV

    The Hidden Uniform Cluster Prior in Self-Supervised Learning

    Authors: Mahmoud Assran, Randall Balestriero, Quentin Duval, Florian Bordes, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Nicolas Ballas

    Abstract: A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that in the formulation of all these methods is an overlooked prior to learn features that enable uniform clustering of the data. While this prior has led to remarkably semantic representations when pretraining on class-bal… ▽ More

    Submitted 13 October, 2022; originally announced October 2022.

  22. arXiv:2210.07181  [pdf, other]

    cs.CV cs.LG cs.RO

    MonoNeRF: Learning Generalizable NeRFs from Monocular Videos without Camera Pose

    Authors: Yang Fu, Ishan Misra, Xiaolong Wang

    Abstract: We propose a generalizable neural radiance fields - MonoNeRF, that can be trained on large-scale monocular videos of moving in static scenes without any ground-truth annotations of depth and camera poses. MonoNeRF follows an Autoencoder-based architecture, where the encoder estimates the monocular depth and the camera pose, and the decoder constructs a Multiplane NeRF representation based on the d… ▽ More

    Submitted 4 June, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: ICML 2023 camera ready version. Project page: https://oasisyang.github.io/mononerf

  23. arXiv:2206.08356  [pdf, other]

    cs.CV cs.AI cs.LG stat.ML

    OmniMAE: Single Model Masked Pretraining on Images and Videos

    Authors: Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

    Abstract: Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work studies these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse pe… ▽ More

    Submitted 31 May, 2023; v1 submitted 16 June, 2022; originally announced June 2022.

    Comments: CVPR 2023. Code/models: https://github.com/facebookresearch/omnivore

  24. arXiv:2204.07141  [pdf, other]

    cs.LG cs.AI cs.CV eess.IV

    Masked Siamese Networks for Label-Efficient Learning

    Authors: Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas

    Abstract: We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our approach matches the representation of an image view containing randomly masked patches to the representation of the original unmasked image. This self-supervised pre-training strategy is particularly scalable when applied to Vision Transformers since only the unmasked patches are… ▽ More

    Submitted 14 April, 2022; originally announced April 2022.

  25. arXiv:2202.08360  [pdf, other]

    cs.CV cs.AI cs.CY

    Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision

    Authors: Priya Goyal, Quentin Duval, Isaac Seessel, Mathilde Caron, Ishan Misra, Levent Sagun, Armand Joulin, Piotr Bojanowski

    Abstract: Discriminative self-supervised learning allows training models on any random group of internet images, and possibly recover salient information that helps differentiate between the images. Applied to ImageNet, this leads to object centric features that perform on par with supervised features on most object-centric downstream tasks. In this work, we question if using this ability, we can learn any… ▽ More

    Submitted 22 February, 2022; v1 submitted 16 February, 2022; originally announced February 2022.

  26. arXiv:2202.08325  [pdf, other]

    cs.LG cs.CV

    A Data-Augmentation Is Worth A Thousand Samples: Exact Quantification From Analytical Augmented Sample Moments

    Authors: Randall Balestriero, Ishan Misra, Yann LeCun

    Abstract: Data-Augmentation (DA) is known to improve performance across tasks and datasets. We propose a method to theoretically analyze the effect of DA and study questions such as: how many augmented samples are needed to correctly estimate the information encoded by that DA? How does the augmentation policy impact the final parameters of a model? We derive several quantities in close-form, such as the ex… ▽ More

    Submitted 16 February, 2022; originally announced February 2022.

  27. arXiv:2201.08377  [pdf, other]

    cs.CV cs.AI cs.IR cs.LG

    Omnivore: A Single Model for Many Visual Modalities

    Authors: Rohit Girdhar, Mannat Singh, Nikhila Ravi, Laurens van der Maaten, Armand Joulin, Ishan Misra

    Abstract: Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data. Instead, in this paper, we propose a single model which excels at classifying images, videos, and single-view 3D data using exactly the same model parameters. Our 'Omnivore' model leverages the flexibility of transformer-based architectures and is tra… ▽ More

    Submitted 30 March, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

    Comments: Accepted at CVPR 2022 (Oral Presentation)

  28. arXiv:2201.02605  [pdf, other]

    cs.CV

    Detecting Twenty-thousand Classes using Image-level Supervision

    Authors: Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra

    Abstract: Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of con… ▽ More

    Submitted 29 July, 2022; v1 submitted 7 January, 2022; originally announced January 2022.

    Comments: ECCV 2022 camera ready. Code is available at https://github.com/facebookresearch/Detic

  29. arXiv:2112.10764  [pdf, other]

    cs.CV cs.AI cs.LG

    Mask2Former for Video Instance Segmentation

    Authors: Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, Alexander G. Schwing

    Abstract: We find Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline. In this report, we show universal image segmentation architectures trivially generalize to video segmentation by directly predicting 3D segmentation volumes. Specifically, Mask2Former sets a new state-of-the-art of 60.4 AP on YouT… ▽ More

    Submitted 20 December, 2021; originally announced December 2021.

    Comments: Code and models: https://github.com/facebookresearch/Mask2Former

  30. arXiv:2112.01527  [pdf, other]

    cs.CV cs.AI cs.LG

    Masked-attention Mask Transformer for Universal Image Segmentation

    Authors: Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar

    Abstract: Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmenta… ▽ More

    Submitted 15 June, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

    Comments: CVPR 2022. Project page/code/models: https://bowenc0221.github.io/mask2former

  31. arXiv:2110.03336  [pdf, other]

    cs.LG stat.ML

    Frame Averaging for Invariant and Equivariant Network Design

    Authors: Omri Puny, Matan Atzmon, Heli Ben-Hamu, Ishan Misra, Aditya Grover, Edward J. Smith, Yaron Lipman

    Abstract: Many machine learning tasks involve learning functions that are known to be invariant or equivariant to certain symmetries of the input data. However, it is often challenging to design neural network architectures that respect these symmetries while being expressive and computationally efficient. For example, Euclidean motion invariant/equivariant graph or point cloud neural networks. We introduce… ▽ More

    Submitted 15 March, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

  32. arXiv:2109.08141  [pdf, other]

    cs.CV cs.AI cs.LG

    An End-to-End Transformer Model for 3D Object Detection

    Authors: Ishan Misra, Rohit Girdhar, Armand Joulin

    Abstract: We propose 3DETR, an end-to-end Transformer based object detection model for 3D point clouds. Compared to existing detection methods that employ a number of 3D-specific inductive biases, 3DETR requires minimal modifications to the vanilla Transformer block. Specifically, we find that a standard Transformer with non-parametric queries and Fourier positional embeddings is competitive with specialize… ▽ More

    Submitted 16 September, 2021; originally announced September 2021.

    Comments: Accepted at ICCV 2021

  33. arXiv:2106.05392  [pdf, other]

    cs.CV

    Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

    Authors: Mandela Patrick, Dylan Campbell, Yuki M. Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, João F. Henriques

    Abstract: In video transformers, the time dimension is often treated in the same way as the two spatial dimensions. However, in a scene where objects or the camera may move, a physical point imaged at one location in frame $t$ may be entirely unrelated to what is found at that location in frame $t+k$. These temporal correspondences should be modeled to facilitate learning about dynamic scenes. To this end,… ▽ More

    Submitted 23 October, 2021; v1 submitted 9 June, 2021; originally announced June 2021.

    Comments: NeurIPS 2021 (Oral). Project page: https://facebookresearch.github.io/Motionformer

  34. arXiv:2105.06461  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    3D Spatial Recognition without Spatially Labeled 3D

    Authors: Zhongzheng Ren, Ishan Misra, Alexander G. Schwing, Rohit Girdhar

    Abstract: We introduce WyPR, a Weakly-supervised framework for Point cloud Recognition, requiring only scene-level class tags as supervision. WyPR jointly addresses three core 3D recognition tasks: point-level semantic segmentation, 3D proposal generation, and 3D object detection, coupling their predictions through self and cross-task consistency losses. We show that in conjunction with standard multiple-in… ▽ More

    Submitted 13 May, 2021; originally announced May 2021.

    Comments: CVPR 2021

  35. arXiv:2104.14294  [pdf, other]

    cs.CV

    Emerging Properties in Self-Supervised Vision Transformers

    Authors: Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin

    Abstract: In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentatio… ▽ More

    Submitted 24 May, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

    Comments: 21 pages

  36. arXiv:2104.13963  [pdf, other]

    cs.CV cs.AI cs.LG eess.IV

    Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples

    Authors: Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, Michael Rabbat

    Abstract: This paper proposes a novel method of learning by predicting view assignments with support samples (PAWS). The method trains a model to minimize a consistency loss, which ensures that different views of the same unlabeled instance are assigned similar pseudo-labels. The pseudo-labels are generated non-parametrically, by comparing the representations of the image views to those of a set of randomly… ▽ More

    Submitted 30 July, 2021; v1 submitted 28 April, 2021; originally announced April 2021.

    Journal ref: ICCV 2021

  37. arXiv:2104.12763  [pdf, other]

    cs.CV cs.CL cs.LG

    MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

    Authors: Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion

    Abstract: Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this… ▽ More

    Submitted 11 October, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

  38. arXiv:2103.15916  [pdf, other]

    cs.CV

    Robust Audio-Visual Instance Discrimination

    Authors: Pedro Morgado, Ishan Misra, Nuno Vasconcelos

    Abstract: We present a self-supervised learning method to learn audio and video representations. Prior work uses the natural correspondence between audio and video to define a standard cross-modal instance discrimination task, where a model is trained to match representations from the two modalities. However, the standard approach introduces two sources of training noise. First, audio-visual correspondences… ▽ More

    Submitted 29 March, 2021; originally announced March 2021.

  39. arXiv:2103.10211  [pdf, other]

    cs.CV

    Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning

    Authors: Mandela Patrick, Yuki M. Asano, Bernie Huang, Ishan Misra, Florian Metze, Joao Henriques, Andrea Vedaldi

    Abstract: The quality of the image representations obtained from self-supervised learning depends strongly on the type of data augmentations used in the learning formulation. Recent papers have ported these methods from still images to videos and found that leveraging both audio and video signals yields strong gains; however, they did not find that spatial augmentations such as cropping, which are very impo… ▽ More

    Submitted 27 October, 2021; v1 submitted 18 March, 2021; originally announced March 2021.

    Comments: Accepted to ICCV 2021. Code at https://github.com/facebookresearch/GDT

  40. arXiv:2103.03230  [pdf, other]

    cs.CV cs.AI cs.LG q-bio.NC

    Barlow Twins: Self-Supervised Learning via Redundancy Reduction

    Authors: Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny

    Abstract: Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We… ▽ More

    Submitted 14 June, 2021; v1 submitted 4 March, 2021; originally announced March 2021.

    Comments: 13 pages, 6 figures, to appear at ICML 2021

  41. arXiv:2103.01988  [pdf, other]

    cs.CV cs.AI

    Self-supervised Pretraining of Visual Features in the Wild

    Authors: Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, Piotr Bojanowski

    Abstract: Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods. These results have been achieved in a control environment, that is the highly curated ImageNet dataset. However, the premise of self-supervised learning is that it can learn from any random image and from any unbounded dataset. In this work, we explore if self-supervision lives…

    Submitted 5 March, 2021; v1 submitted 2 March, 2021; originally announced March 2021.

  42. arXiv:2101.02691  [pdf, other]

    cs.CV

    Self-Supervised Pretraining of 3D Features on any Point-Cloud

    Authors: Zaiwei Zhang, Rohit Girdhar, Armand Joulin, Ishan Misra

    Abstract: Pretraining on large labeled datasets is a prerequisite to achieve good performance in many computer vision tasks like 2D object recognition, video classification etc. However, pretraining is not widely used for 3D recognition tasks where state-of-the-art methods train models from scratch. A primary reason is the lack of large annotated datasets because 3D data is both difficult to acquire and tim…

    Submitted 7 January, 2021; originally announced January 2021.

  43. arXiv:2011.13046  [pdf, other]

    cs.CV

    Can Temporal Information Help with Contrastive Self-Supervised Learning?

    Authors: Yutong Bai, Haoqi Fan, Ishan Misra, Ganesh Venkatesh, Yongyi Lu, Yuyin Zhou, Qihang Yu, Vikas Chandra, Alan Yuille

    Abstract: Leveraging temporal information has been regarded as essential for developing video understanding models. However, how to properly incorporate temporal information into the recent successful instance discrimination based contrastive self-supervised learning (CSL) framework remains unclear. As an intuitive solution, we find that directly applying temporal augmentations does not help, or even impair…

    Submitted 25 November, 2020; originally announced November 2020.

  44. arXiv:2006.09882  [pdf, other]

    cs.CV

    Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

    Authors: Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin

    Abstract: Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of…

    Submitted 8 January, 2021; v1 submitted 17 June, 2020; originally announced June 2020.

    Comments: NeurIPS 2020

  45. arXiv:2004.12943  [pdf, other]

    cs.CV

    Audio-Visual Instance Discrimination with Cross-Modal Agreement

    Authors: Pedro Morgado, Nuno Vasconcelos, Ishan Misra

    Abstract: We present a self-supervised learning approach to learn audio-visual representations from video and audio. Our method uses contrastive learning for cross-modal discrimination of video from audio and vice-versa. We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio. With this simple but powerf…

    Submitted 29 March, 2021; v1 submitted 27 April, 2020; originally announced April 2020.

  46. arXiv:2004.03867  [pdf, other]

    eess.IV cs.CV

    S2A: Wasserstein GAN with Spatio-Spectral Laplacian Attention for Multi-Spectral Band Synthesis

    Authors: Litu Rout, Indranil Misra, S Manthira Moorthi, Debajyoti Dhar

    Abstract: Intersection of adversarial learning and satellite image processing is an emerging field in remote sensing. In this study, we intend to address synthesis of high resolution multi-spectral satellite imagery using adversarial learning. Guided by the discovery of attention mechanism, we regulate the process of band synthesis through spatio-spectral Laplacian attention. Further, we use Wasserstein GAN…

    Submitted 8 April, 2020; originally announced April 2020.

    Comments: Computer Vision and Pattern Recognition (CVPR) Workshop on Large Scale Computer Vision for Remote Sensing Imagery

  47. arXiv:2001.03615  [pdf, other]

    cs.CV

    In Defense of Grid Features for Visual Question Answering

    Authors: Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen

    Abstract: Popularized as 'bottom-up' attention, bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision and language tasks like visual question answering (VQA). However, it is not clear whether the advantages of regions (e.g. better localization) are the key reasons for the success of bottom-up attention. In this pa…

    Submitted 2 April, 2020; v1 submitted 10 January, 2020; originally announced January 2020.

    Journal ref: CVPR, 2020

  48. arXiv:1912.03330  [pdf, other]

    cs.CV cs.LG

    ClusterFit: Improving Generalization of Visual Representations

    Authors: Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, Dhruv Mahajan

    Abstract: Pre-training convolutional neural networks with weakly-supervised and self-supervised strategies is becoming increasingly popular for several computer vision tasks. However, due to the lack of strong discriminative signals, these learned representations may overfit to the pre-training objective (e.g., hashtag prediction) and not generalize well to downstream tasks. In this work, we present a simpl…

    Submitted 6 December, 2019; originally announced December 2019.

  49. arXiv:1912.01991  [pdf, other]

    cs.CV cs.LG

    Self-Supervised Learning of Pretext-Invariant Representations

    Authors: Ishan Misra, Laurens van der Maaten

    Abstract: The goal of self-supervised learning from images is to construct image representations that are semantically meaningful via pretext tasks that do not require semantic annotations for a large training set of images. Many pretext tasks lead to representations that are covariant with image transformations. We argue that, instead, semantic representations ought to be invariant under such transformatio…

    Submitted 4 December, 2019; originally announced December 2019.

  50. arXiv:1906.02729  [pdf, other]

    cs.CV

    3D-RelNet: Joint Object and Relational Network for 3D Prediction

    Authors: Nilesh Kulkarni, Ishan Misra, Shubham Tulsiani, Abhinav Gupta

    Abstract: We propose an approach to predict the 3D shape and pose for the objects present in a scene. Existing learning based methods that pursue this goal make independent predictions per object, and do not leverage the relationships amongst them. We argue that reasoning about these relationships is crucial, and present an approach to incorporate these in a 3D prediction framework. In addition to independe…

    Submitted 4 March, 2020; v1 submitted 6 June, 2019; originally announced June 2019.

    Comments: Project page: https://nileshkulkarni.github.io/relative3d/