Showing 1–50 of 73 results for author: Girshick, R

  1. arXiv:2406.20083  [pdf, other]

    cs.RO cs.CV

    PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators

    Authors: Kuo-Hao Zeng, Zichen Zhang, Kiana Ehsani, Rose Hendrix, Jordi Salvador, Alvaro Herrasti, Ross Girshick, Aniruddha Kembhavi, Luca Weihs

    Abstract: We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real-world without adaptation despite being trained purely in simulation. PoliFormer uses a foundational vision transformer encoder with a causal transformer decoder enabling long-term memory and reasoning. It is trained for hundreds of mil…

    Submitted 28 June, 2024; originally announced June 2024.

  2. arXiv:2304.02643  [pdf, other]

    cs.CV cs.AI cs.LG

    Segment Anything

    Authors: Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick

    Abstract: We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and…

    Submitted 5 April, 2023; originally announced April 2023.

    Comments: Project web-page: https://segment-anything.com

  3. arXiv:2303.13496  [pdf, other]

    cs.CV cs.AI cs.LG

    The effectiveness of MAE pre-pretraining for billion-scale pretraining

    Authors: Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

    Abstract: This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has on…

    Submitted 24 January, 2024; v1 submitted 23 March, 2023; originally announced March 2023.

    Comments: ICCV 2023. Models available at https://github.com/facebookresearch/maws/

  4. arXiv:2203.16527  [pdf, other]

    cs.CV

    Exploring Plain Vision Transformer Backbones for Object Detection

    Authors: Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He

    Abstract: We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) i…

    Submitted 10 June, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Tech report. arXiv v2: add RetinaNet results

  5. arXiv:2201.08371  [pdf, other]

    cs.CV

    Revisiting Weakly Supervised Pre-Training of Visual Perception Models

    Authors: Mannat Singh, Laura Gustafson, Aaron Adcock, Vinicius de Freitas Reis, Bugra Gedik, Raj Prateek Kosaraju, Dhruv Mahajan, Ross Girshick, Piotr Dollár, Laurens van der Maaten

    Abstract: Model pre-training is a cornerstone of modern visual recognition systems. Although fully supervised pre-training on datasets like ImageNet is still the de-facto standard, recent studies suggest that large-scale weakly supervised pre-training can outperform fully supervised approaches. This paper revisits weakly-supervised pre-training of models using hashtag supervision with modern versions of res…

    Submitted 2 April, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

    Comments: CVPR 2022

  6. arXiv:2111.11429  [pdf, other]

    cs.CV

    Benchmarking Detection Transfer Learning with Vision Transformers

    Authors: Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollár, Kaiming He, Ross Girshick

    Abstract: Object detection is a central downstream task used to test if pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive. These difficulties (e.g., architectural incompatibility, slow training, high memory consum…

    Submitted 22 November, 2021; originally announced November 2021.

  7. arXiv:2111.09887  [pdf, other]

    cs.CV cs.LG

    PyTorchVideo: A Deep Learning Library for Video Understanding

    Authors: Haoqi Fan, Tullie Murrell, Heng Wang, Kalyan Vasudev Alwala, Yanghao Li, Yilei Li, Bo Xiong, Nikhila Ravi, Meng Li, Haichuan Yang, Jitendra Malik, Ross Girshick, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, Christoph Feichtenhofer

    Abstract: We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools including multimodal data loading, transformations, and models tha…

    Submitted 18 November, 2021; originally announced November 2021.

    Comments: Technical report

  8. arXiv:2111.06377  [pdf, other]

    cs.CV

    Masked Autoencoders Are Scalable Vision Learners

    Authors: Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick

    Abstract: This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), a…

    Submitted 19 December, 2021; v1 submitted 11 November, 2021; originally announced November 2021.

    Comments: Tech report. arXiv v2: add more transfer learning results; v3: add robustness evaluation

  9. arXiv:2106.14881  [pdf, other]

    cs.CV

    Early Convolutions Help Transformers See Better

    Authors: Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, Ross Girshick

    Abstract: Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is impl…

    Submitted 25 October, 2021; v1 submitted 28 June, 2021; originally announced June 2021.

    Comments: NeurIPS 2021

  10. arXiv:2104.14558  [pdf, other]

    cs.CV cs.AI cs.LG

    A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

    Authors: Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, Kaiming He

    Abstract: We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) d…

    Submitted 29 April, 2021; originally announced April 2021.

    Comments: CVPR 2021

  11. arXiv:2103.16562  [pdf, other]

    cs.CV

    Boundary IoU: Improving Object-Centric Image Segmentation Evaluation

    Authors: Bowen Cheng, Ross Girshick, Piotr Dollár, Alexander C. Berg, Alexander Kirillov

    Abstract: We present Boundary IoU (Intersection-over-Union), a new segmentation evaluation measure focused on boundary quality. We perform an extensive analysis across different error types and object sizes and show that Boundary IoU is significantly more sensitive than the standard Mask IoU measure to boundary errors for large objects and does not over-penalize errors on smaller objects. The new quality me…

    Submitted 30 March, 2021; originally announced March 2021.

    Comments: CVPR 2021, project page: https://bowenc0221.github.io/boundary-iou

  12. arXiv:2103.06877  [pdf, other]

    cs.CV cs.LG

    Fast and Accurate Model Scaling

    Authors: Piotr Dollár, Mannat Singh, Ross Girshick

    Abstract: In this work we analyze strategies for convolutional neural network scaling; that is, the process of scaling a base convolutional network to endow it with greater computational complexity and consequently representational power. Example scaling strategies may include increasing model width, depth, resolution, etc. While various scaling strategies exist, their tradeoffs are not fully understood. Ex…

    Submitted 11 March, 2021; originally announced March 2021.

    Comments: CVPR 2021

  13. arXiv:2102.01066  [pdf, other]

    cs.CV

    Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details

    Authors: Achal Dave, Piotr Dollár, Deva Ramanan, Alexander Kirillov, Ross Girshick

    Abstract: By design, average precision (AP) for object detection aims to treat all classes independently: AP is computed independently per category and averaged. On one hand, this is desirable as it treats all classes equally. On the other hand, it ignores cross-category confidence calibration, a key property in real-world use cases. Unfortunately, under important conditions (i.e., large vocabulary, high in…

    Submitted 15 March, 2022; v1 submitted 1 February, 2021; originally announced February 2021.

  14. arXiv:2005.07850  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Large scale weakly and semi-supervised learning for low-resource video ASR

    Authors: Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on th…

    Submitted 6 August, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

  15. arXiv:2003.13678  [pdf, other]

    cs.CV cs.LG

    Designing Network Design Spaces

    Authors: Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár

    Abstract: In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elev…

    Submitted 30 March, 2020; originally announced March 2020.

    Comments: CVPR 2020

  16. arXiv:2003.12056  [pdf, other]

    cs.CV cs.LG

    Are Labels Necessary for Neural Architecture Search?

    Authors: Chenxi Liu, Piotr Dollár, Kaiming He, Ross Girshick, Alan Yuille, Saining Xie

    Abstract: Existing neural network architectures in computer vision -- whether designed by humans or by machines -- were typically found using both images and their associated labels. In this paper, we ask the question: can we find high-quality neural architectures using only images, but no human-annotated labels? To answer this question, we first define a new setup called Unsupervised Neural Architecture Se…

    Submitted 3 August, 2020; v1 submitted 26 March, 2020; originally announced March 2020.

    Comments: To appear in ECCV 2020 as spotlight. Code release: https://github.com/facebookresearch/unnas

  17. arXiv:2003.04297  [pdf, other]

    cs.CV

    Improved Baselines with Momentum Contrastive Learning

    Authors: Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He

    Abstract: Contrastive unsupervised learning has recently shown encouraging progress, e.g., in Momentum Contrast (MoCo) and SimCLR. In this note, we verify the effectiveness of two of SimCLR's design improvements by implementing them in the MoCo framework. With simple modifications to MoCo---namely, using an MLP projection head and more data augmentation---we establish stronger baselines that outperform SimC…

    Submitted 9 March, 2020; originally announced March 2020.

    Comments: Tech report, 2 pages + references

  18. arXiv:1912.08193  [pdf, other]

    cs.CV

    PointRend: Image Segmentation as Rendering

    Authors: Alexander Kirillov, Yuxin Wu, Kaiming He, Ross Girshick

    Abstract: We present a new method for efficient high-quality image segmentation of objects and scenes. By analogizing classical computer graphics methods for efficient rendering with over- and undersampling challenges faced in pixel labeling tasks, we develop a unique perspective of image segmentation as a rendering problem. From this vantage, we present the PointRend (Point-based Rendering) neural network…

    Submitted 16 February, 2020; v1 submitted 17 December, 2019; originally announced December 2019.

    Comments: Technical Report

  19. arXiv:1912.00998  [pdf, ps, other]

    cs.CV

    A Multigrid Method for Efficiently Training Video Models

    Authors: Chao-Yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, Philipp Krähenbühl

    Abstract: Training competitive deep video models is an order of magnitude slower than training their counterpart image models. Slow training causes long research cycles, which hinders progress in video understanding research. Following standard practice for training image models, video model training assumes a fixed mini-batch shape: a specific number of clips, frames, and spatial size. However, what is the…

    Submitted 9 June, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

    Comments: CVPR 2020

  20. arXiv:1911.05722  [pdf, other]

    cs.CV

    Momentum Contrast for Unsupervised Visual Representation Learning

    Authors: Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick

    Abstract: We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linea…

    Submitted 23 March, 2020; v1 submitted 13 November, 2019; originally announced November 2019.

    Comments: CVPR 2020 camera-ready. Code: https://github.com/facebookresearch/moco

  21. arXiv:1910.12367  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Training ASR models by Generation of Contextual Information

    Authors: Kritika Singh, Dmytro Okhonko, Jun Liu, Yongqiang Wang, Frank Zhang, Ross Girshick, Sergey Edunov, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data. However, in many applications and locales, only moderate amounts of data are available, which has led to a surge in semi- and weakly-supervised learning research. In this paper, we conduct a large-scale study evaluating the effectiveness of weakly-supervised lea…

    Submitted 14 February, 2020; v1 submitted 27 October, 2019; originally announced October 2019.

  22. arXiv:1908.05656  [pdf, other]

    cs.LG cs.AI stat.ML

    PHYRE: A New Benchmark for Physical Reasoning

    Authors: Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, Ross Girshick

    Abstract: Understanding and reasoning about physics is an important ability of intelligent agents. We develop the PHYRE benchmark for physical reasoning that contains a set of simple classical mechanics puzzles in a 2D physical environment. The benchmark is designed to encourage the development of learning algorithms that are sample-efficient and generalize well across puzzles. We test several modern learni…

    Submitted 15 August, 2019; originally announced August 2019.

  23. arXiv:1908.03195  [pdf, other]

    cs.CV

    LVIS: A Dataset for Large Vocabulary Instance Segmentation

    Authors: Agrim Gupta, Piotr Dollár, Ross Girshick

    Abstract: Progress on object detection is enabled by datasets that focus the research community's attention on open challenges. This process led us from simple images to complex scenes and from bounding boxes to segmentation masks. In this work, we introduce LVIS (pronounced `el-vis'): a new dataset for Large Vocabulary Instance Segmentation. We plan to collect ~2 million high-quality instance segmentation…

    Submitted 15 September, 2019; v1 submitted 8 August, 2019; originally announced August 2019.

    Comments: Extension of the CVPR'19 paper describing release v0.5, the LVIS Challenge, and baseline results

  24. arXiv:1904.01569  [pdf, other]

    cs.CV cs.LG

    Exploring Randomly Wired Neural Networks for Image Recognition

    Authors: Saining Xie, Alexander Kirillov, Ross Girshick, Kaiming He

    Abstract: Neural networks for image recognition have evolved through extensive manual design from simple chain-like models to structures with multiple wiring paths. The success of ResNets and DenseNets is due in large part to their innovative wiring plans. Now, neural architecture search (NAS) studies are exploring the joint optimization of wiring and operation types, however, the space of possible wirings…

    Submitted 8 April, 2019; v1 submitted 2 April, 2019; originally announced April 2019.

    Comments: Technical report

  25. arXiv:1903.12174  [pdf, other]

    cs.CV

    TensorMask: A Foundation for Dense Object Segmentation

    Authors: Xinlei Chen, Ross Girshick, Kaiming He, Piotr Dollár

    Abstract: Sliding-window object detectors that generate bounding-box object predictions over a dense, regular grid have advanced rapidly and proven popular. In contrast, modern instance segmentation approaches are dominated by methods that first detect object bounding boxes, and then crop and segment these regions, as popularized by Mask R-CNN. In this work, we investigate the paradigm of dense sliding-wind…

    Submitted 27 August, 2019; v1 submitted 28 March, 2019; originally announced March 2019.

    Comments: accepted to ICCV

  26. arXiv:1901.02446  [pdf, other]

    cs.CV

    Panoptic Feature Pyramid Networks

    Authors: Alexander Kirillov, Ross Girshick, Kaiming He, Piotr Dollár

    Abstract: The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-of-the-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we…

    Submitted 10 April, 2019; v1 submitted 8 January, 2019; originally announced January 2019.

    Comments: accepted to CVPR 2019

  27. arXiv:1812.05038  [pdf, other]

    cs.CV

    Long-Term Feature Banks for Detailed Video Understanding

    Authors: Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, Ross Girshick

    Abstract: To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank---supportive information extracted over the entire span of a video---to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments dem…

    Submitted 17 April, 2019; v1 submitted 12 December, 2018; originally announced December 2018.

    Comments: Code and models are available at https://github.com/facebookresearch/video-long-term-feature-banks

  28. arXiv:1811.08883  [pdf, other]

    cs.CV

    Rethinking ImageNet Pre-training

    Authors: Kaiming He, Ross Girshick, Piotr Dollár

    Abstract: We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization. The results are no worse than their ImageNet pre-training counterparts even when using the hyper-parameters of the baseline system (Mask R-CNN) that were optimized for fine-tuning pre-trained models, with the sole exception of increasing the numb…

    Submitted 21 November, 2018; originally announced November 2018.

    Comments: Technical report

  29. arXiv:1805.00932  [pdf, ps, other]

    cs.CV

    Exploring the Limits of Weakly Supervised Pretraining

    Authors: Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, Laurens van der Maaten

    Abstract: State-of-the-art visual perception models for a wide range of tasks rely on supervised pretraining. ImageNet classification is the de facto pretraining task for these models. Yet, ImageNet is now nearly ten years old and is by modern standards "small". Even so, relatively little is known about the behavior of pretraining with datasets that are multiple orders of magnitude larger. The reasons are o…

    Submitted 2 May, 2018; originally announced May 2018.

    Comments: Technical report

  30. arXiv:1801.05401  [pdf, other]

    cs.CV

    Low-Shot Learning from Imaginary Data

    Authors: Yu-Xiong Wang, Ross Girshick, Martial Hebert, Bharath Hariharan

    Abstract: Humans can quickly learn new visual concepts, perhaps because they can easily visualize or imagine what novel objects look like from different views. Incorporating this ability to hallucinate novel instances of new concepts might help machine vision systems perform better low-shot learning, i.e., learning concepts from few examples. We present a novel approach to low-shot learning that uses this i…

    Submitted 2 April, 2018; v1 submitted 16 January, 2018; originally announced January 2018.

    Comments: CVPR 2018 camera-ready version

  31. arXiv:1801.00868  [pdf, other]

    cs.CV

    Panoptic Segmentation

    Authors: Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, Piotr Dollár

    Abstract: We propose and study a task we name panoptic segmentation (PS). Panoptic segmentation unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance). The proposed task requires generating a coherent scene segmentation that is rich and complete, an important step toward real-world vision systems.…

    Submitted 10 April, 2019; v1 submitted 2 January, 2018; originally announced January 2018.

    Comments: accepted to CVPR 2019

  32. arXiv:1712.04440  [pdf, other]

    cs.CV

    Data Distillation: Towards Omni-Supervised Learning

    Authors: Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, Kaiming He

    Abstract: We investigate omni-supervised learning, a special regime of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data. Omni-supervised learning is lower-bounded by performance on existing labeled datasets, offering the potential to surpass state-of-the-art fully supervised methods. To exploit the omni-supervised setting, we pro…

    Submitted 12 December, 2017; originally announced December 2017.

    Comments: tech report

  33. arXiv:1712.01238  [pdf, other]

    cs.CV cs.CL cs.LG

    Learning by Asking Questions

    Authors: Ishan Misra, Ross Girshick, Rob Fergus, Martial Hebert, Abhinav Gupta, Laurens van der Maaten

    Abstract: We introduce an interactive learning framework for the development and testing of intelligent visual systems, called learning-by-asking (LBA). We explore LBA in context of the Visual Question Answering (VQA) task. LBA differs from standard VQA training in that most questions are not observed during training time, and the learner must ask questions it wants answers to. Thus, LBA more closely mimics…

    Submitted 4 December, 2017; originally announced December 2017.

  34. arXiv:1711.10370  [pdf, other]

    cs.CV

    Learning to Segment Every Thing

    Authors: Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, Ross Girshick

    Abstract: Most methods for object instance segmentation require all training examples to be labeled with segmentation masks. This requirement makes it expensive to annotate new categories and has restricted instance segmentation models to ~100 well-annotated classes. The goal of this paper is to propose a new partially supervised training paradigm, together with a novel weight transfer function, that enable…

    Submitted 27 March, 2018; v1 submitted 28 November, 2017; originally announced November 2017.

  35. arXiv:1711.07971  [pdf, other]

    cs.CV

    Non-local Neural Networks

    Authors: Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He

    Abstract: Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, our non-local operation computes the response at a position as a weighted sum of the features a…

    Submitted 13 April, 2018; v1 submitted 21 November, 2017; originally announced November 2017.

    Comments: CVPR 2018, code is available at: https://github.com/facebookresearch/video-nonlocal-net

  36. arXiv:1708.02002  [pdf, other]

    cs.CV

    Focal Loss for Dense Object Detection

    Authors: Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár

    Abstract: The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In…

    Submitted 7 February, 2018; v1 submitted 7 August, 2017; originally announced August 2017.

  37. arXiv:1706.02677  [pdf, other]

    cs.CV cs.DC cs.LG

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Authors: Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He

    Abstract: Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large,…

    Submitted 30 April, 2018; v1 submitted 8 June, 2017; originally announced June 2017.

    Comments: Tech report (v2: correct typos)

  38. arXiv:1705.03633  [pdf, other]

    cs.CV cs.CL cs.LG

    Inferring and Executing Programs for Visual Reasoning

    Authors: Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

    Abstract: Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes. As a result, these black-box models often learn to exploit biases in the data rather than learning to perform visual reasoning. Inspired by module networks, this paper proposes a model for visual reasoning that consists of a p…

    Submitted 10 May, 2017; originally announced May 2017.

  39. arXiv:1704.07333  [pdf, other]

    cs.CV

    Detecting and Recognizing Human-Object Interactions

    Authors: Georgia Gkioxari, Ross Girshick, Piotr Dollár, Kaiming He

    Abstract: To understand the visual world, a machine must not only recognize individual object instances but also how they interact. Humans are often at the center of such interactions and detecting human-object interactions is an important practical and scientific problem. In this paper, we address the task of detecting <human, verb, object> triplets in challenging everyday photos. We propose a novel model…

    Submitted 26 March, 2018; v1 submitted 24 April, 2017; originally announced April 2017.

  40. arXiv:1703.06870  [pdf, other]

    cs.CV

    Mask R-CNN

    Authors: Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick

    Abstract: We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognit…

    Submitted 24 January, 2018; v1 submitted 20 March, 2017; originally announced March 2017.

    Comments: open source; appendix on more results

  41. arXiv:1612.06890  [pdf, other]

    cs.CV cs.CL cs.LG

    CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

    Authors: Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

    Abstract: When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help, but have strong biases that models can exploit to correctly answer questions without reasoning. They also conflate multiple sources of error, making it hard to pi…

    Submitted 20 December, 2016; originally announced December 2016.

  42. arXiv:1612.06370  [pdf, other]

    cs.CV cs.AI cs.LG cs.NE stat.ML

    Learning Features by Watching Objects Move

    Authors: Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, Bharath Hariharan

    Abstract: This paper presents a novel yet intuitive approach to unsupervised feature learning. Inspired by the human visual system, we explore whether low-level motion-based grouping cues can be used to learn an effective visual representation. Specifically, we use unsupervised motion-based segmentation on videos to obtain segments, which we use as 'pseudo ground truth' to train a convolutional network to s…

    Submitted 12 April, 2017; v1 submitted 19 December, 2016; originally announced December 2016.

    Comments: CVPR 2017

  43. arXiv:1612.03144  [pdf, other]

    cs.CV

    Feature Pyramid Networks for Object Detection

    Authors: Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie

    Abstract: Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A…

    Submitted 19 April, 2017; v1 submitted 9 December, 2016; originally announced December 2016.

  44. arXiv:1611.05431  [pdf, other]

    cs.CV

    Aggregated Residual Transformations for Deep Neural Networks

    Authors: Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He

    Abstract: We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call "cardinality" (the size…

    Submitted 10 April, 2017; v1 submitted 16 November, 2016; originally announced November 2016.

    Comments: Accepted to CVPR 2017. Code and models: https://github.com/facebookresearch/ResNeXt

  45. arXiv:1606.02819  [pdf, other]

    cs.CV

    Low-shot Visual Recognition by Shrinking and Hallucinating Features

    Authors: Bharath Hariharan, Ross Girshick

    Abstract: Low-shot visual learning---the ability to recognize novel object categories from very few examples---is a hallmark of human visual intelligence. Existing machine learning approaches fail to generalize in the same way. To make progress on this foundational problem, we present a low-shot learning benchmark on complex images that mimics challenges faced by recognition systems in the wild. We then pro…

    Submitted 4 November, 2017; v1 submitted 9 June, 2016; originally announced June 2016.

    Comments: ICCV 2017 spotlight

  46. arXiv:1604.03968  [pdf, other]

    cs.CL cs.AI cs.CV

    Visual Storytelling

    Authors: Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, Margaret Mitchell

    Abstract: We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND v.1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benc…

    Submitted 13 April, 2016; originally announced April 2016.

    Comments: to appear in NAACL 2016

  47. arXiv:1604.03650  [pdf, other]

    cs.CV

    Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks

    Authors: Junyuan Xie, Ross Girshick, Ali Farhadi

    Abstract: As 3D movie viewing becomes mainstream and Virtual Reality (VR) market emerges, the demand for 3D contents is growing rapidly. Producing 3D videos, however, remains challenging. In this paper we propose to use deep neural networks for automatically converting 2D videos and images to stereoscopic 3D format. In contrast to previous automatic 2D-to-3D conversion algorithms, which have separate stages…

    Submitted 13 April, 2016; originally announced April 2016.

  48. arXiv:1604.03540  [pdf, other]

    cs.CV cs.LG

    Training Region-based Object Detectors with Online Hard Example Mining

    Authors: Abhinav Shrivastava, Abhinav Gupta, Ross Girshick

    Abstract: The field of object detection has made significant advances riding on the wave of region-based ConvNets, but their training procedure still includes many heuristics and hyperparameters that are costly to tune. We present a simple yet surprisingly effective online hard example mining (OHEM) algorithm for training region-based ConvNet detectors. Our motivation is the same as it has always been -- de…

    Submitted 12 April, 2016; originally announced April 2016.

    Comments: To appear in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. (oral)

  49. arXiv:1512.06974  [pdf, other]

    cs.CV

    Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels

    Authors: Ishan Misra, C. Lawrence Zitnick, Margaret Mitchell, Ross Girshick

    Abstract: When human annotators are given a choice about what to label in an image, they apply their own subjective judgments on what to ignore and what to mention. We refer to these noisy "human-centric" annotations as exhibiting human reporting bias. Examples of such annotations include image tags and keywords found on photo sharing sites, or in datasets containing image captions. In this paper, we use th…

    Submitted 12 April, 2016; v1 submitted 22 December, 2015; originally announced December 2015.

    Comments: To appear in CVPR 2016

  50. arXiv:1512.04143  [pdf, other]

    cs.CV

    Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks

    Authors: Sean Bell, C. Lawrence Zitnick, Kavita Bala, Ross Girshick

    Abstract: It is well known that contextual and multi-scale representations are important for accurate visual recognition. In this paper we present the Inside-Outside Net (ION), an object detector that exploits information both inside and outside the region of interest. Contextual information outside the region of interest is integrated using spatial recurrent neural networks. Inside, we use skip pooling to…

    Submitted 13 December, 2015; originally announced December 2015.