Showing 1–45 of 45 results for author: Feichtenhofer, C

  1. arXiv:2311.05613  [pdf, other]

    cs.CV

    Window Attention is Bugged: How not to Interpolate Position Embeddings

    Authors: Daniel Bolya, Chaitanya Ryali, Judy Hoffman, Christoph Feichtenhofer

    Abstract: Window attention, position embeddings, and high resolution finetuning are core concepts in the modern transformer era of computer vision. However, we find that naively combining these near ubiquitous components can have a detrimental effect on performance. The issue is simple: interpolating position embeddings while using window attention is wrong. We study two state-of-the-art methods that have t…

    Submitted 9 November, 2023; originally announced November 2023.

    Comments: Preprint. Code release will be coming in the future

  2. arXiv:2309.16671  [pdf, other]

    cs.CV cs.CL

    Demystifying CLIP Data

    Authors: Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

    Abstract: Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been…

    Submitted 7 April, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

    Comments: 17 pages. arXiv admin note: text overlap with arXiv:2103.00020 by other authors

  3. arXiv:2306.00989  [pdf, other]

    cs.CV cs.LG

    Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

    Authors: Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer

    Abstract: Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraini…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: ICML 2023 Oral version. Code+Models: https://github.com/facebookresearch/hiera

  4. arXiv:2304.03283  [pdf, other]

    cs.CV

    Diffusion Models as Masked Autoencoders

    Authors: Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, Christoph Feichtenhofer

    Abstract: There has been a longstanding belief that generation can facilitate a true understanding of visual data. In line with this, we revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models. While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffus…

    Submitted 6 April, 2023; originally announced April 2023.

    Comments: Tech report. Project page: https://weichen582.github.io/diffmae.html

  5. arXiv:2304.01199  [pdf, other]

    cs.CV

    On the Benefits of 3D Pose and Tracking for Human Action Recognition

    Authors: Jathushan Rajasegaran, Georgios Pavlakos, Angjoo Kanazawa, Christoph Feichtenhofer, Jitendra Malik

    Abstract: In this work we study the benefits of using tracking and 3D poses for action recognition. To achieve this, we take the Lagrangian view on analysing actions over a trajectory of human motion rather than at a fixed point in space. Taking this stand allows us to use the tracklets of people to predict their actions. In this spirit, first we show the benefits of using 3D pose to infer actions, and stud…

    Submitted 7 August, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

    Comments: CVPR2023 (project page: https://brjathu.github.io/LART)

  6. arXiv:2303.13496  [pdf, other]

    cs.CV cs.AI cs.LG

    The effectiveness of MAE pre-pretraining for billion-scale pretraining

    Authors: Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

    Abstract: This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has on…

    Submitted 24 January, 2024; v1 submitted 23 March, 2023; originally announced March 2023.

    Comments: ICCV 2023. Models available at https://github.com/facebookresearch/maws/

  7. arXiv:2302.04869  [pdf, other]

    cs.CV cs.AI

    Reversible Vision Transformers

    Authors: Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong, Christoph Feichtenhofer, Jitendra Malik

    Abstract: We present Reversible Vision Transformers, a memory efficient architecture design for visual recognition. By decoupling the GPU memory requirement from the depth of the model, Reversible Vision Transformers enable scaling up architectures with efficient memory usage. We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants and benchmark exte…

    Submitted 9 February, 2023; originally announced February 2023.

    Comments: Oral at CVPR 2022, updated version

  8. arXiv:2301.08247  [pdf, other]

    cs.CV

    Multiview Compressive Coding for 3D Reconstruction

    Authors: Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, Georgia Gkioxari

    Abstract: A central goal of visual recognition is to understand objects and scenes from a single image. 2D recognition has witnessed tremendous progress thanks to large-scale learning and general-purpose representations. Comparatively, 3D poses new challenges stemming from occlusions not depicted in the image. Prior works try to overcome these by inferring from multiple views or rely on scarce CAD models an…

    Submitted 19 January, 2023; originally announced January 2023.

    Comments: Project page: https://mcc3d.github.io/

  9. arXiv:2301.02241  [pdf, other]

    cs.CV cs.CL

    CiT: Curation in Training for Effective Vision-Language Data

    Authors: Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

    Abstract: Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford. This paper trades generality for efficiency and presents Curation in Training (CiT), a simple and efficient vision-text learning algorithm that couples a data objective into training. CiT automatically yields quality data to speed-up contr…

    Submitted 5 January, 2023; originally announced January 2023.

    Comments: Technical Report

  10. arXiv:2212.08071  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    MAViL: Masked Audio-Video Learners

    Authors: Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, Christoph Feichtenhofer

    Abstract: We present Masked Audio-Video Learners (MAViL) to train audio-visual representations. Our approach learns with three complementary forms of self-supervision: (1) reconstruction of masked audio and video input data, (2) intra- and inter-modal contrastive learning with masking, and (3) self-training by reconstructing joint audio-video contextualized features learned from the first two objectives. Pr…

    Submitted 17 July, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

    Comments: Technical report

  11. arXiv:2212.00794  [pdf, other]

    cs.CV

    Scaling Language-Image Pre-training via Masking

    Authors: Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He

    Abstract: We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and contrast more samples per iteration with similar memory footprint. It leads to a favorable trade-off between accu…

    Submitted 30 March, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

    Comments: Tech report; arXiv v2: update scaling results and add code repo

  12. arXiv:2210.09461  [pdf, other]

    cs.CV

    Token Merging: Your ViT But Faster

    Authors: Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman

    Abstract: We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and…

    Submitted 1 March, 2023; v1 submitted 17 October, 2022; originally announced October 2022.

    Comments: Accepted ICLR 2023 Oral (top 5%) [final v2]. This version includes stable diffusion experiments. See code at https://github.com/facebookresearch/ToMe

  13. arXiv:2207.06405  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Masked Autoencoders that Listen

    Authors: Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

    Abstract: This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded conte…

    Submitted 12 January, 2023; v1 submitted 13 July, 2022; originally announced July 2022.

    Comments: Accepted at NeurIPS 2022

  14. arXiv:2205.09113  [pdf, other]

    cs.CV cs.LG

    Masked Autoencoders As Spatiotemporal Learners

    Authors: Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He

    Abstract: This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (only except for patch and positional emb…

    Submitted 21 October, 2022; v1 submitted 18 May, 2022; originally announced May 2022.

  15. arXiv:2201.08383  [pdf, other]

    cs.CV

    MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

    Authors: Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

    Abstract: While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of a video without hitting the computation or memory bottlenecks. In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at…

    Submitted 30 November, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

    Comments: Technical report. arXiv v2: add link to code

  16. arXiv:2201.03545  [pdf, other]

    cs.CV

    A ConvNet for the 2020s

    Authors: Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie

    Abstract: The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) th…

    Submitted 2 March, 2022; v1 submitted 10 January, 2022; originally announced January 2022.

    Comments: CVPR 2022; Code: https://github.com/facebookresearch/ConvNeXt

  17. arXiv:2112.09133  [pdf, other]

    cs.CV cs.LG

    Masked Feature Prediction for Self-Supervised Visual Pre-Training

    Authors: Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, Christoph Feichtenhofer

    Abstract: We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance…

    Submitted 12 January, 2023; v1 submitted 16 December, 2021; originally announced December 2021.

    Comments: Technical report. arXiv v2: update AVA results (details in Appendix E)

  18. arXiv:2112.01526  [pdf, other]

    cs.CV

    MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

    Authors: Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

    Abstract: In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and K…

    Submitted 30 March, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

    Comments: CVPR 2022 Camera Ready

  19. arXiv:2111.09887  [pdf, other]

    cs.CV cs.LG

    PyTorchVideo: A Deep Learning Library for Video Understanding

    Authors: Haoqi Fan, Tullie Murrell, Heng Wang, Kalyan Vasudev Alwala, Yanghao Li, Yilei Li, Bo Xiong, Nikhila Ravi, Meng Li, Haichuan Yang, Jitendra Malik, Ross Girshick, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, Christoph Feichtenhofer

    Abstract: We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools including multimodal data loading, transformations, and models tha…

    Submitted 18 November, 2021; originally announced November 2021.

    Comments: Technical report

  20. arXiv:2110.07058  [pdf, other]

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons…

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  21. arXiv:2109.14084  [pdf, other]

    cs.CV cs.CL

    VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

    Authors: Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer

    Abstract: We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including se…

    Submitted 1 October, 2021; v1 submitted 28 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021

  22. arXiv:2106.05392  [pdf, other]

    cs.CV

    Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

    Authors: Mandela Patrick, Dylan Campbell, Yuki M. Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, João F. Henriques

    Abstract: In video transformers, the time dimension is often treated in the same way as the two spatial dimensions. However, in a scene where objects or the camera may move, a physical point imaged at one location in frame $t$ may be entirely unrelated to what is found at that location in frame $t+k$. These temporal correspondences should be modeled to facilitate learning about dynamic scenes. To this end,…

    Submitted 23 October, 2021; v1 submitted 9 June, 2021; originally announced June 2021.

    Comments: NeurIPS 2021 (Oral). Project page: https://facebookresearch.github.io/Motionformer

  23. arXiv:2105.09996  [pdf, other]

    cs.CV cs.CL

    VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

    Authors: Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer

    Abstract: We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both for a variety of end tasks. Existing pre-training are task-specific by adopting either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks or more complex multitask learning with two unimodal encoders, limiting early c…

    Submitted 30 September, 2021; v1 submitted 20 May, 2021; originally announced May 2021.

    Comments: 9 pages, ACL Findings 2021

  24. arXiv:2104.14558  [pdf, other]

    cs.CV cs.AI cs.LG

    A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

    Authors: Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, Kaiming He

    Abstract: We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) d…

    Submitted 29 April, 2021; originally announced April 2021.

    Comments: CVPR 2021

  25. arXiv:2104.11227  [pdf, other]

    cs.CV cs.AI cs.LG

    Multiscale Vision Transformers

    Authors: Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer

    Abstract: We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution.…

    Submitted 22 April, 2021; originally announced April 2021.

    Comments: Technical report

  26. arXiv:2104.00682  [pdf, other]

    cs.CV cs.AI cs.LG

    Multiview Pseudo-Labeling for Semi-supervised Learning from Video

    Authors: Bo Xiong, Haoqi Fan, Kristen Grauman, Christoph Feichtenhofer

    Abstract: We present a multiview pseudo-labeling approach to video learning, a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video. The complementary views help obtain more reliable pseudo-labels on unlabeled video, to learn stronger video representations than from purely supervised data. Though our method capitalizes on multip…

    Submitted 1 April, 2021; originally announced April 2021.

    Comments: Technical report

  27. arXiv:2101.02702  [pdf, other]

    cs.CV

    TrackFormer: Multi-Object Tracking with Transformers

    Authors: Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, Christoph Feichtenhofer

    Abstract: The challenging task of multi-object tracking (MOT) requires simultaneous reasoning about track initialization, identity, and spatio-temporal trajectories. We formulate this task as a frame-to-frame set prediction problem and introduce TrackFormer, an end-to-end trainable MOT approach based on an encoder-decoder Transformer architecture. Our model achieves data association between frames via atten…

    Submitted 29 April, 2022; v1 submitted 7 January, 2021; originally announced January 2021.

  28. arXiv:2004.04730  [pdf, other]

    cs.CV

    X3D: Expanding Architectures for Efficient Video Recognition

    Authors: Christoph Feichtenhofer

    Abstract: This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth. Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that good accuracy to complexity trade-off i…

    Submitted 9 April, 2020; originally announced April 2020.

    Comments: CVPR 2020 (Oral)

  29. arXiv:2004.03580  [pdf, other]

    cs.CV

    Feature Pyramid Grids

    Authors: Kai Chen, Yuhang Cao, Chen Change Loy, Dahua Lin, Christoph Feichtenhofer

    Abstract: Feature pyramid networks have been widely adopted in the object detection literature to improve feature representations for better handling of variations in scale. In this paper, we present Feature Pyramid Grids (FPG), a deep multi-pathway feature pyramid, that represents the feature scale-space as a regular grid of parallel bottom-up pathways which are fused by multi-directional lateral connectio…

    Submitted 7 April, 2020; originally announced April 2020.

    Comments: Technical report

  30. arXiv:2001.08740  [pdf, other]

    cs.CV

    Audiovisual SlowFast Networks for Video Recognition

    Authors: Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer

    Abstract: We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcom…

    Submitted 8 March, 2020; v1 submitted 23 January, 2020; originally announced January 2020.

    Comments: Technical report

  31. arXiv:2001.04583  [pdf, other]

    cs.CV

    EGO-TOPO: Environment Affordances from Egocentric Video

    Authors: Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman

    Abstract: First-person video naturally brings the use of a physical environment to the forefront, since it shows the camera wearer interacting fluidly in a space based on his intentions. However, current methods largely separate the observed actions from the persistent space itself. We introduce a model for environment affordances that is learned directly from egocentric video. The main idea is to gain a hu…

    Submitted 27 March, 2020; v1 submitted 13 January, 2020; originally announced January 2020.

    Comments: Published in CVPR 2020, project page: http://vision.cs.utexas.edu/projects/ego-topo/

  32. arXiv:1912.00998  [pdf, ps, other]

    cs.CV

    A Multigrid Method for Efficiently Training Video Models

    Authors: Chao-Yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, Philipp Krähenbühl

    Abstract: Training competitive deep video models is an order of magnitude slower than training their counterpart image models. Slow training causes long research cycles, which hinders progress in video understanding research. Following standard practice for training image models, video model training assumes a fixed mini-batch shape: a specific number of clips, frames, and spatial size. However, what is the…

    Submitted 9 June, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

    Comments: CVPR 2020

  33. arXiv:1906.04016  [pdf, other]

    cs.CV

    Learning Temporal Pose Estimation from Sparsely-Labeled Videos

    Authors: Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

    Abstract: Modern approaches for multi-person pose estimation in video require large amounts of dense annotations. However, labeling every frame in a video is costly and labor intensive. To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation. Given a pa…

    Submitted 11 December, 2019; v1 submitted 6 June, 2019; originally announced June 2019.

    Comments: Accepted to NeurIPS 2019

  34. arXiv:1906.01963  [pdf, other]

    cs.CV

    Grounded Human-Object Interaction Hotspots from Video (Extended Abstract)

    Authors: Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman

    Abstract: Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements. We propose an approach to learn human-object interaction "hotspots" directly from video. Rather than treat affordances as a manually supervised semantic segmentation task, our approach learns about interactions by watching v…

    Submitted 3 June, 2019; originally announced June 2019.

    Comments: arXiv admin note: substantial text overlap with arXiv:1812.04558

  35. Modeling Human Motion with Quaternion-based Neural Networks

    Authors: Dario Pavllo, Christoph Feichtenhofer, Michael Auli, David Grangier

    Abstract: Previous work on predicting or generating 3D human pose sequences regresses either joint rotations or joint positions. The former strategy is prone to error accumulation along the kinematic chain, as well as discontinuities when using Euler angles or exponential maps as parameterizations. The latter requires re-projection onto skeleton constraints to avoid bone stretching and invalid configuration…

    Submitted 26 October, 2019; v1 submitted 21 January, 2019; originally announced January 2019.

    Comments: Follow-up work of arXiv:1805.06485. This is a pre-print of an article published in IJCV. The final authenticated version is available online at https://doi.org/10.1007/s11263-019-01245-6

    Journal ref: International Journal of Computer Vision (Special Issue on Machine Vision with Deep Learning), 2019. Online ISSN: 1573-1405

  36. arXiv:1812.05038  [pdf, other]

    cs.CV

    Long-Term Feature Banks for Detailed Video Understanding

    Authors: Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, Ross Girshick

    Abstract: To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank---supportive information extracted over the entire span of a video---to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments dem…

    Submitted 17 April, 2019; v1 submitted 12 December, 2018; originally announced December 2018.

    Comments: Code and models are available at https://github.com/facebookresearch/video-long-term-feature-banks

  37. arXiv:1812.04558  [pdf, other]

    cs.CV

    Grounded Human-Object Interaction Hotspots from Video

    Authors: Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman

    Abstract: Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements. We propose an approach to learn human-object interaction "hotspots" directly from video. Rather than treat affordances as a manually supervised semantic segmentation task, our approach learns about interactions by watching v…

    Submitted 2 April, 2019; v1 submitted 11 December, 2018; originally announced December 2018.

  38. arXiv:1812.04172  [pdf, other]

    cs.CV

    Learning Discriminative Motion Features Through Detection

    Authors: Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

    Abstract: Despite huge success in the image domain, modern detection models such as Faster R-CNN have not been used nearly as much for video analysis. This is arguably due to the fact that detection models are designed to operate on single frames and as a result do not have a mechanism for learning motion representations directly from video. We propose a learning procedure that allows detection models such…

    Submitted 10 December, 2018; originally announced December 2018.

  39. arXiv:1812.03982  [pdf, other]

    cs.CV

    SlowFast Networks for Video Recognition

    Authors: Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He

    Abstract: We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our…

    Submitted 29 October, 2019; v1 submitted 10 December, 2018; originally announced December 2018.

    Comments: Technical report

  40. arXiv:1811.11742  [pdf, other]

    cs.CV

    3D human pose estimation in video with temporal convolutions and semi-supervised training

    Authors: Dario Pavllo, Christoph Feichtenhofer, David Grangier, Michael Auli

    Abstract: In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses and finally back-pro…

    Submitted 29 March, 2019; v1 submitted 28 November, 2018; originally announced November 2018.

    Comments: CVPR 2019

  41. arXiv:1802.07094  [pdf, other]

    cs.CV

    Camera-based vehicle velocity estimation from monocular video

    Authors: Moritz Kampelmühler, Michael G. Müller, Christoph Feichtenhofer

    Abstract: This paper documents the winning entry at the CVPR2017 vehicle velocity estimation challenge. Velocity estimation is an emerging task in autonomous driving which has not yet been thoroughly explored. The goal is to estimate the relative velocity of a specific vehicle from a sequence of images. In this paper, we present a light-weight approach for directly regressing vehicle velocities from their t…

    Submitted 20 February, 2018; originally announced February 2018.

    Comments: 8 pages, 5 figures, in CVWW2018

  42. arXiv:1801.01415  [pdf, other]

    cs.CV

    What have we learned from deep representations for action recognition?

    Authors: Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes, Andrew Zisserman

    Abstract: As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing what two-stream models have learned in order to recognize actions in video. We show that local detectors for appearance and…

    Submitted 4 January, 2018; originally announced January 2018.

    Comments: This document is best viewed in Adobe Reader where figures play on click. Supplementary material can be downloaded at http://feichtenhofer.github.io/action_vis.pdf

  43. arXiv:1710.03958  [pdf, other]

    cs.CV

    Detect to Track and Track to Detect

    Authors: Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

    Abstract: Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous det…

    Submitted 7 March, 2018; v1 submitted 11 October, 2017; originally announced October 2017.

    Comments: ICCV 2017. Code and models: https://github.com/feichtenhofer/Detect-Track Results: https://www.robots.ox.ac.uk/~vgg/research/detect-track/

  44. arXiv:1611.02155  [pdf, other]

    cs.CV

    Spatiotemporal Residual Networks for Video Action Recognition

    Authors: Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes

    Abstract: Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination of these two approaches. Our novel architecture generalizes ResNets for the spatiotemporal domain by introduc…

    Submitted 7 November, 2016; originally announced November 2016.

    Comments: NIPS 2016

  45. arXiv:1604.06573  [pdf, other]

    cs.CV

    Convolutional Two-Stream Network Fusion for Video Action Recognition

    Authors: Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

    Abstract: Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fus…

    Submitted 26 September, 2016; v1 submitted 22 April, 2016; originally announced April 2016.

    Comments: in Proc. CVPR 2016