Showing 1–50 of 68 results for author: Torresani, L

Searching in archive cs.
  1. arXiv:2407.01408  [pdf, other]

    cs.CV cs.AI cs.LG

    Semantic Compositions Enhance Vision-Language Contrastive Learning

    Authors: Maxwell Aladago, Lorenzo Torresani, Soroush Vosoughi

    Abstract: In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes in zero-shot image classification, cross-modal retrieval, and linear evaluation tasks. We show that the zero-shot classification and retrieval capabilities of…

    Submitted 1 July, 2024; originally announced July 2024.

  2. arXiv:2404.16222  [pdf, other]

    cs.CV

    Step Differences in Instructional Video

    Authors: Tushar Nagarajan, Lorenzo Torresani

    Abstract: Comparing a user video to a reference how-to video is a key requirement for AR/VR technology delivering personalized assistance tailored to the user's progress. However, current approaches for language-based assistance can only answer questions about a single video. We propose an approach that first automatically generates large amounts of visual instruction tuning data involving pairs of videos f…

    Submitted 27 June, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  3. arXiv:2402.13250  [pdf, other]

    cs.CV

    Video ReCap: Recursive Captioning of Hour-Long Videos

    Authors: Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius

    Abstract: Most video captioning models are designed to process short video clips of few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process v…

    Submitted 16 May, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: Accepted by CVPR 2024

  4. arXiv:2311.18259  [pdf, other]

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from…

    Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

  5. arXiv:2307.12854  [pdf, other]

    cs.CV

    Multiscale Video Pretraining for Long-Term Activity Forecasting

    Authors: Reuben Tan, Matthias De Lange, Michael Iuzzolino, Bryan A. Plummer, Kate Saenko, Karl Ridgeway, Lorenzo Torresani

    Abstract: Long-term activity forecasting is an especially challenging research problem because it requires understanding the temporal relationships between observed actions, as well as the variability and complexity of human activities. Despite relying on strong supervision via expensive human annotations, state-of-the-art forecasting approaches often generalize poorly to unseen data. To alleviate this issu…

    Submitted 24 July, 2023; originally announced July 2023.

  6. arXiv:2306.03802  [pdf, other]

    cs.CV cs.AI

    Learning to Ground Instructional Articles in Videos through Narrations

    Authors: Effrosyni Mavroudi, Triantafyllos Afouras, Lorenzo Torresani

    Abstract: In this paper we present an approach for localizing steps of procedural activities in narrated how-to videos. To deal with the scarcity of labeled data at scale, we source the step descriptions from a language knowledge base (wikiHow) containing instructional articles for a large variety of procedural tasks. Without any form of manual supervision, our model learns to temporally ground the steps of…

    Submitted 6 June, 2023; originally announced June 2023.

    Comments: 17 pages, 4 figures and 10 tables

  7. arXiv:2303.05503  [pdf, other]

    cs.CV cs.AI cs.LG

    Open-world Instance Segmentation: Top-down Learning with Bottom-up Supervision

    Authors: Tarun Kalluri, Weiyao Wang, Heng Wang, Manmohan Chandraker, Lorenzo Torresani, Du Tran

    Abstract: Many top-down architectures for instance segmentation achieve significant success when trained and tested on pre-defined closed-world taxonomy. However, when deployed in the open world, they exhibit notable bias towards seen classes and suffer from significant performance drop. In this work, we propose a novel approach for open world instance segmentation called bottom-Up and top-Down Open-world S…

    Submitted 13 May, 2024; v1 submitted 9 March, 2023; originally announced March 2023.

    Comments: L3D-IVU Workshop, CVPR 2024. Project page: https://tarun005.github.io/UDOS

  8. arXiv:2302.08063  [pdf, other]

    cs.CV

    MINOTAUR: Multi-task Video Grounding From Multimodal Queries

    Authors: Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran

    Abstract: Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences. These tasks differ in the type of inputs (only video, or video-query pair where query is an image region or sentence) and outputs (temporal segments or spatio-temporal tubes). However, at their core they require the same fundamental understanding of the video, i…

    Submitted 17 March, 2023; v1 submitted 15 February, 2023; originally announced February 2023.

    Comments: 22 pages, 8 figures and 13 tables

  9. arXiv:2302.01891  [pdf, other]

    cs.CV

    Egocentric Video Task Translation @ Ego4D Challenge 2022

    Authors: Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani

    Abstract: This technical report describes the EgoTask Translation approach that explores relations among a set of egocentric video tasks in the Ego4D challenge. To improve the primary task of interest, we propose to leverage existing models developed for other related tasks and design a task translator that learns to "translate" auxiliary task features to the primary task. With no modification to the base…

    Submitted 3 February, 2023; originally announced February 2023.

    Comments: The technical report of ECCV@2022 Ego4D challenge

  10. arXiv:2301.02311  [pdf, other]

    cs.CV

    HierVL: Learning Hierarchical Video-Language Embeddings

    Authors: Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman

    Abstract: Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos acc…

    Submitted 8 June, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: CVPR 2023

  11. arXiv:2301.02307  [pdf, other]

    cs.CV

    What You Say Is What You Show: Visual Narration Detection in Instructional Videos

    Authors: Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman

    Abstract: Narrated "how-to" videos have emerged as a promising data source for a wide range of learning problems, from learning visual representations to training robot policies. However, this data is extremely noisy, as the narrations do not always describe the actions demonstrated in the video. To address this problem we introduce the novel task of visual narration detection, which entails determining w…

    Submitted 18 July, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: Technical Report

  12. arXiv:2301.01380  [pdf, other]

    cs.CV

    Ego-Only: Egocentric Action Detection without Exocentric Transferring

    Authors: Huiyu Wang, Mitesh Kumar Singh, Lorenzo Torresani

    Abstract: We present Ego-Only, the first approach that enables state-of-the-art action detection on egocentric (first-person) videos without any form of exocentric (third-person) transferring. Despite the content and appearance gap separating the two domains, large-scale exocentric transferring has been the default choice for egocentric action detection. This is because prior works found that egocentric mod…

    Submitted 19 May, 2023; v1 submitted 3 January, 2023; originally announced January 2023.

  13. arXiv:2212.06301  [pdf, other]

    cs.CV

    Egocentric Video Task Translation

    Authors: Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani

    Abstract: Different video understanding tasks are typically treated in isolation, and even with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another). However, in wearable cameras, the immersive egocentric perspective of a person engaging with the world around them presents an interconnected web of video understanding tasks -- hand-object manipulations, naviga…

    Submitted 6 April, 2023; v1 submitted 12 December, 2022; originally announced December 2022.

    Comments: Accepted by CVPR 2023 (Highlight), Project website: https://vision.cs.utexas.edu/projects/egot2/

  14. arXiv:2209.06185  [pdf]

    cs.CV

    HistoPerm: A Permutation-Based View Generation Approach for Improving Histopathologic Feature Representation Learning

    Authors: Joseph DiPalma, Lorenzo Torresani, Saeed Hassanpour

    Abstract: Deep learning has been effective for histology image analysis in digital pathology. However, many current deep learning approaches require large, strongly- or weakly-labeled images and regions of interest, which can be time-consuming and resource-intensive to obtain. To address this challenge, we present HistoPerm, a view generation method for representation learning using joint embedding architec…

    Submitted 5 April, 2023; v1 submitted 13 September, 2022; originally announced September 2022.

  15. arXiv:2203.16795  [pdf, other]

    cs.CV

    Deformable Video Transformer

    Authors: Jue Wang, Lorenzo Torresani

    Abstract: Video transformers have recently emerged as an effective alternative to convolutional networks for action classification. However, most prior video transformers adopt either global space-time attention or hand-defined strategies to compare patches within and across frames. These fixed attention schemes not only have high computational cost but, by comparing patches at predetermined locations, they…

    Submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted in CVPR 2022

  16. arXiv:2201.11866  [pdf, other]

    eess.IV cs.CV

    Calibrating Histopathology Image Classifiers using Label Smoothing

    Authors: Jerry Wei, Lorenzo Torresani, Jason Wei, Saeed Hassanpour

    Abstract: The classification of histopathology images fundamentally differs from traditional image classification tasks because histopathology images naturally exhibit a range of diagnostic features, resulting in a diverse range of annotator agreement levels. However, examples with high annotator disagreement are often either assigned the majority label or discarded entirely when training histopathology ima…

    Submitted 27 January, 2022; originally announced January 2022.

  17. arXiv:2201.10990  [pdf, other]

    cs.CV

    Learning To Recognize Procedural Activities with Distant Supervision

    Authors: Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani

    Abstract: In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes. Accurately categorizing these activities requires not only recognizing the individual steps that compose the task but also capturing their temporal d…

    Submitted 16 June, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

    Comments: CVPR 2022. Code will be released here https://github.com/facebookresearch/video-distant-supervision

  18. arXiv:2112.03340  [pdf, other]

    cs.CV cs.LG

    Label Hallucination for Few-Shot Classification

    Authors: Yiren Jian, Lorenzo Torresani

    Abstract: Few-shot classification requires adapting knowledge learned from a large annotated base dataset to recognize novel unseen classes, each represented by few labeled examples. In such a scenario, pretraining a network with high capacity on the large dataset and then finetuning it on the few examples causes severe overfitting. At the same time, training a simple linear classifier on top of "frozen" fe…

    Submitted 6 December, 2021; originally announced December 2021.

    Comments: Accepted by AAAI 2022. Code is available: https://github.com/yiren-jian/LabelHalluc

  19. arXiv:2110.07058  [pdf, other]

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons…

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  20. arXiv:2106.09212  [pdf, other]

    cs.CV cs.AI

    Long-Short Temporal Contrastive Learning of Video Transformers

    Authors: Jue Wang, Gedas Bertasius, Du Tran, Lorenzo Torresani

    Abstract: Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only da…

    Submitted 31 March, 2022; v1 submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted in CVPR 2022

  21. arXiv:2104.01198  [pdf, other]

    cs.CV

    Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories

    Authors: Xitong Yang, Haoqi Fan, Lorenzo Torresani, Larry Davis, Heng Wang

    Abstract: The standard way of training video models entails sampling at each iteration a single clip from a video and optimizing the clip prediction with respect to the video-level label. We argue that a single clip may not have enough temporal coverage to exhibit the label to recognize, since video datasets are often weakly labeled with categorical information but without dense temporal annotations. Furthe…

    Submitted 2 April, 2021; originally announced April 2021.

    Comments: CVPR 2021

  22. arXiv:2102.06291  [pdf, other]

    cs.SD cs.LG eess.AS eess.IV

    A Multi-View Approach To Audio-Visual Speaker Verification

    Authors: Leda Sarı, Kritika Singh, Jiatong Zhou, Lorenzo Torresani, Nayan Singhal, Yatharth Saraf

    Abstract: Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input. In these cases, the visual stream provides complementary information and can often be leveraged in conjunction with the acoustics of speech to improve verification performance. In this study, we explore audio-visual approaches to speaker verification…

    Submitted 11 February, 2021; originally announced February 2021.

  23. arXiv:2102.05095  [pdf, other]

    cs.CV

    Is Space-Time Attention All You Need for Video Understanding?

    Authors: Gedas Bertasius, Heng Wang, Lorenzo Torresani

    Abstract: We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attentio…

    Submitted 9 June, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

    Comments: Accepted to ICML 2021

  24. arXiv:2101.12355  [pdf, other]

    eess.IV cs.CV

    A Petri Dish for Histopathology Image Analysis

    Authors: Jerry Wei, Arief Suriawinata, Bing Ren, Xiaoying Liu, Mikhail Lisovsky, Louis Vaickus, Charles Brown, Michael Baker, Naofumi Tomita, Lorenzo Torresani, Jason Wei, Saeed Hassanpour

    Abstract: With the rise of deep learning, there has been increased interest in using neural networks for histopathology image analysis, a field that investigates the properties of biopsy or resected specimens traditionally manually examined under a microscope by pathologists. However, challenges such as limited data, costly annotation, and processing high-resolution and variable-size images make it difficul…

    Submitted 27 March, 2021; v1 submitted 28 January, 2021; originally announced January 2021.

    Comments: In proceedings of Artificial Intelligence in Medicine (AIME) 2021

  25. arXiv:2101.12059  [pdf, other]

    cs.CV cs.CL

    VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

    Authors: Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, Lorenzo Torresani

    Abstract: We present Vx2Text, a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio. In order to leverage transformer networks, which have been shown to be effective at modeling language, each modality is first converted into a set of language embeddings by a learnable tokenizer. This allows our approach to perform multimodal fusion in the language s…

    Submitted 29 January, 2021; v1 submitted 28 January, 2021; originally announced January 2021.

    Comments: Work in progress

  26. arXiv:2101.06475  [pdf, other]

    cs.LG cs.AI

    Slot Machines: Discovering Winning Combinations of Random Weights in Neural Networks

    Authors: Maxwell Mbabilla Aladago, Lorenzo Torresani

    Abstract: In contrast to traditional weight optimization in a continuous space, we demonstrate the existence of effective random networks whose weights are never updated. By selecting a weight among a fixed set of random values for each individual connection, our method uncovers combinations of random weights that match the performance of traditionally-trained networks of the same capacity. We refer to our…

    Submitted 8 June, 2021; v1 submitted 16 January, 2021; originally announced January 2021.

  27. arXiv:2101.04170  [pdf]

    eess.IV cs.CV

    Resolution-Based Distillation for Efficient Histology Image Classification

    Authors: Joseph DiPalma, Arief A. Suriawinata, Laura J. Tafe, Lorenzo Torresani, Saeed Hassanpour

    Abstract: Developing deep learning models to analyze histology images has been computationally challenging, as the massive size of the images causes excessive strain on all parts of the computing pipeline. This paper proposes a novel deep learning-based methodology for improving the computational efficiency of histology image classification. The proposed approach is robust when used with images that have re…

    Submitted 11 January, 2021; originally announced January 2021.

  28. arXiv:2009.13698  [pdf, other]

    cs.CV

    Learn like a Pathologist: Curriculum Learning by Annotator Agreement for Histopathology Image Classification

    Authors: Jerry Wei, Arief Suriawinata, Bing Ren, Xiaoying Liu, Mikhail Lisovsky, Louis Vaickus, Charles Brown, Michael Baker, Mustafa Nasir-Moin, Naofumi Tomita, Lorenzo Torresani, Jason Wei, Saeed Hassanpour

    Abstract: Applying curriculum learning requires both a range of difficulty in data and a method for determining the difficulty of examples. In many tasks, however, satisfying these requirements can be a formidable challenge. In this paper, we contend that histopathology image classification is a compelling use case for curriculum learning. Based on the nature of histopathology images, a range of difficulty…

    Submitted 28 September, 2020; originally announced September 2020.

  29. arXiv:2007.07306  [pdf, other]

    cs.CV

    COBE: Contextualized Object Embeddings from Narrated Instructional Video

    Authors: Gedas Bertasius, Lorenzo Torresani

    Abstract: Many objects in the real world undergo dramatic variations in visual appearance. For example, a tomato may be red or green, sliced or chopped, fresh or fried, liquid or solid. Training a single detector to accurately recognize tomatoes in all these different states is challenging. On the other hand, contextual cues (e.g., the presence of a knife, a cutting board, a strainer or a pan) are often str…

    Submitted 29 October, 2020; v1 submitted 14 July, 2020; originally announced July 2020.

    Comments: NeurIPS 2020

  30. arXiv:2007.04755  [pdf, other]

    cs.CV

    Generalized Few-Shot Video Classification with Video Retrieval and Feature Generation

    Authors: Yongqin Xian, Bruno Korbar, Matthijs Douze, Lorenzo Torresani, Bernt Schiele, Zeynep Akata

    Abstract: Few-shot learning aims to recognize novel classes from a few examples. Although significant progress has been made in the image domain, few-shot video classification is relatively unexplored. We argue that previous methods underestimate the importance of video feature learning and propose to learn spatiotemporal features using a 3D CNN. Proposing a two-stage approach that learns video features on…

    Submitted 13 October, 2021; v1 submitted 9 July, 2020; originally announced July 2020.

    Comments: Accepted by TPAMI in October, 2021

  31. arXiv:2006.07203   

    cs.CV

    Video Understanding as Machine Translation

    Authors: Bruno Korbar, Fabio Petroni, Rohit Girdhar, Lorenzo Torresani

    Abstract: With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations. Most prior work formulates the objective as a contrastive metric learning problem between the modalities. To enable effective learning, however, these strategies require a careful selection of positi…

    Submitted 17 September, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

    Comments: The authors have temporarily withdrawn this paper to reassess some of the experimental results

  32. arXiv:2003.00605  [pdf, other]

    cs.LG cs.AI stat.ML

    Stein Variational Inference for Discrete Distributions

    Authors: Jun Han, Fan Ding, Xianglong Liu, Lorenzo Torresani, Jian Peng, Qiang Liu

    Abstract: Gradient-based approximate inference methods, such as Stein variational gradient descent (SVGD), provide simple and general-purpose inference engines for differentiable continuous distributions. However, existing forms of SVGD cannot be directly applied to discrete distributions. In this work, we fill this gap by proposing a simple yet general framework that transforms discrete distributions to eq…

    Submitted 1 March, 2020; originally announced March 2020.

    Comments: AISTATS 2020

  33. arXiv:1912.04573  [pdf, other]

    cs.CV

    Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation

    Authors: Gedas Bertasius, Lorenzo Torresani

    Abstract: We introduce a method for simultaneously classifying, segmenting and tracking object instances in a video sequence. Our method, named MaskProp, adapts the popular Mask R-CNN to video by adding a mask propagation branch that propagates frame-level object instance masks from each video frame to all the other frames in a video clip. This allows our system to predict clip-level instance tracks with re…

    Submitted 9 July, 2021; v1 submitted 10 December, 2019; originally announced December 2019.

    Comments: CVPR 2020 Best Paper Nominee

  34. arXiv:1912.04487  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    Listen to Look: Action Recognition by Previewing Audio

    Authors: Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani

    Abstract: In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an ImgAud2Vid framework that hallucinates clip-level features by distilling from lighter modalit…

    Submitted 28 March, 2020; v1 submitted 9 December, 2019; originally announced December 2019.

    Comments: Appears in CVPR 2020; Project page: http://vision.cs.utexas.edu/projects/listen_to_look/

  35. arXiv:1911.12667  [pdf, other]

    cs.CV

    Self-Supervised Learning by Cross-Modal Audio-Video Clustering

    Authors: Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran

    Abstract: Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning.…

    Submitted 26 October, 2020; v1 submitted 28 November, 2019; originally announced November 2019.

    Comments: Accepted to NeurIPS 2020 (spotlight presentation)

  36. arXiv:1907.08340  [pdf, other]

    cs.CV cs.LG

    Only Time Can Tell: Discovering Temporal Data for Temporal Modeling

    Authors: Laura Sevilla-Lara, Shengxin Zha, Zhicheng Yan, Vedanuj Goswami, Matt Feiszli, Lorenzo Torresani

    Abstract: Understanding temporal information and how the visual world changes over time is a fundamental ability of intelligent systems. In video understanding, temporal information is at the core of many current challenges, including compression, efficient inference, motion estimation or summarization. However, in current video datasets it has been observed that action classes can often be recognized witho…

    Submitted 29 October, 2019; v1 submitted 18 July, 2019; originally announced July 2019.

  37. arXiv:1906.04016  [pdf, other]

    cs.CV

    Learning Temporal Pose Estimation from Sparsely-Labeled Videos

    Authors: Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

    Abstract: Modern approaches for multi-person pose estimation in video require large amounts of dense annotations. However, labeling every frame in a video is costly and labor intensive. To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation. Given a pa…

    Submitted 11 December, 2019; v1 submitted 6 June, 2019; originally announced June 2019.

    Comments: Accepted to NeurIPS 2019

  38. arXiv:1906.03857  [pdf, other]

    cs.CV

    UniDual: A Unified Model for Image and Video Understanding

    Authors: Yufei Wang, Du Tran, Lorenzo Torresani

    Abstract: Although a video is effectively a sequence of images, visual perception systems typically model images and videos separately, thus failing to exploit the correlation and the synergy provided by these two media. While a few prior research efforts have explored the benefits of leveraging still-image datasets for video analysis, or vice-versa, most of these attempts have been limited to pretraining a…

    Submitted 12 June, 2019; v1 submitted 10 June, 2019; originally announced June 2019.

  39. arXiv:1906.03349  [pdf, other]

    cs.CV

    Video Modeling with Correlation Networks

    Authors: Heng Wang, Du Tran, Lorenzo Torresani, Matt Feiszli

    Abstract: Motion is a salient cue to recognize actions in video. Modern action recognition models leverage motion information either explicitly by using optical flow as input or implicitly by means of 3D convolutional filters that simultaneously capture appearance and motion information. This paper proposes an alternative approach based on a learnable correlation operator that can be used to establish frame…

    Submitted 26 May, 2020; v1 submitted 7 June, 2019; originally announced June 2019.

  40. arXiv:1904.05410  [pdf, other]

    cs.CV

    Attentive Action and Context Factorization

    Authors: Yang Wang, Vinh Tran, Gedas Bertasius, Lorenzo Torresani, Minh Hoai

    Abstract: We propose a method for human action recognition, one that can localize the spatiotemporal regions that 'define' the actions. This is a challenging task due to the subtlety of human actions in video and the co-occurrence of contextual elements. To address this challenge, we utilize conjugate samples of human actions, which are video clips that are contextually similar to human action samples but d…

    Submitted 10 April, 2019; originally announced April 2019.

    Comments: 10 pages, 6 figures

  41. arXiv:1904.04289  [pdf, other]

    cs.CV

    SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition

    Authors: Bruno Korbar, Du Tran, Lorenzo Torresani

    Abstract: While many action recognition datasets consist of collections of brief, trimmed videos each containing a relevant action, videos in the real-world (e.g., on YouTube) exhibit very different properties: they are often several minutes long, where brief relevant clips are often interleaved with segments of extended duration containing little change. Applying densely an action recognition system to eve…

    Submitted 30 August, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

  42. arXiv:1904.02811  [pdf, other]

    cs.CV cs.AI

    Video Classification with Channel-Separated Convolutional Networks

    Authors: Du Tran, Heng Wang, Lorenzo Torresani, Matt Feiszli

    Abstract: Group convolution has been shown to offer great computational savings in various 2D convolutional architectures for image classification. It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D gro…

    Submitted 18 November, 2019; v1 submitted 4 April, 2019; originally announced April 2019.

  43. arXiv:1901.09244  [pdf, other]

    cs.CV

    DistInit: Learning Video Representations Without a Single Labeled Video

    Authors: Rohit Girdhar, Du Tran, Lorenzo Torresani, Deva Ramanan

    Abstract: Video recognition models have progressed significantly over the past few years, evolving from shallow classifiers trained on hand-crafted features to deep spatiotemporal networks. However, labeled video data required to train such models have not been able to keep up with the ever-increasing depth and sophistication of these networks. In this work, we propose an alternative approach to learning vi…

    Submitted 20 August, 2019; v1 submitted 26 January, 2019; originally announced January 2019.

    Comments: ICCV 2019

  44. arXiv:1812.04172  [pdf, other]

    cs.CV

    Learning Discriminative Motion Features Through Detection

    Authors: Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

    Abstract: Despite huge success in the image domain, modern detection models such as Faster R-CNN have not been used nearly as much for video analysis. This is arguably due to the fact that detection models are designed to operate on single frames and as a result do not have a mechanism for learning motion representations directly from video. We propose a learning procedure that allows detection models such…

    Submitted 10 December, 2018; originally announced December 2018.

  45. MaskConnect: Connectivity Learning by Gradient Descent

    Authors: Karim Ahmed, Lorenzo Torresani

    Abstract: Although deep networks have recently emerged as the model of choice for many computer vision problems, in order to yield good results they often require time-consuming architecture search. To combat the complexity of design choices, prior work has adopted the principle of modularized design which consists in defining the network in terms of a composition of topologically identical or similar build…

    Submitted 27 July, 2018; originally announced July 2018.

    Comments: ECCV 2018. arXiv admin note: substantial text overlap with arXiv:1709.09582

    Journal ref: ECCV 2018

  46. arXiv:1807.00230  [pdf, other]

    cs.CV

    Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

    Authors: Bruno Korbar, Du Tran, Lorenzo Torresani

    Abstract: There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredien…

    Submitted 9 November, 2018; v1 submitted 30 June, 2018; originally announced July 2018.

    Comments: Note: Changed name - added experiments

  47. arXiv:1803.05549  [pdf, other]

    cs.CV

    Object Detection in Video with Spatiotemporal Sampling Networks

    Authors: Gedas Bertasius, Lorenzo Torresani, Jianbo Shi

    Abstract: We propose a Spatiotemporal Sampling Network (STSN) that uses deformable convolutions across time for object detection in videos. Our STSN performs object detection in a video frame by learning to spatially sample features from the adjacent frames. This naturally renders the approach robust to occlusion or motion blur in individual frames. Our framework does not require additional supervision, as…

    Submitted 24 July, 2018; v1 submitted 14 March, 2018; originally announced March 2018.

  48. arXiv:1712.09374  [pdf, other]

    cs.CV cs.AI

    HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization

    Authors: Hang Zhao, Antonio Torralba, Lorenzo Torresani, Zhicheng Yan

    Abstract: This paper presents a new large-scale dataset for recognition and temporal localization of human actions collected from Web videos. We refer to it as HACS (Human Action Clips and Segments). We leverage both consensus and disagreement among visual classifiers to automatically mine candidate short clips from unlabeled videos, which are subsequently validated by human annotators. The resulting datase…

    Submitted 4 September, 2019; v1 submitted 26 December, 2017; originally announced December 2017.

  49. arXiv:1712.09184  [pdf, other]

    cs.CV

    Detect-and-Track: Efficient Pose Estimation in Videos

    Authors: Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, Du Tran

    Abstract: This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video. We propose an extremely lightweight yet highly effective approach that builds upon the latest advancements in human detection and video understanding. Our method operates in two-stages: keypoint estimation in frames or short clips, followed by lightweight tracking to generate keypoint p…

    Submitted 2 May, 2018; v1 submitted 26 December, 2017; originally announced December 2017.

    Comments: In CVPR 2018. Ranked first in ICCV 2017 PoseTrack challenge (keypoint tracking in videos). Code: https://github.com/facebookresearch/DetectAndTrack and webpage: https://rohitgirdhar.github.io/DetectAndTrack/

  50. arXiv:1711.11248  [pdf, other]

    cs.CV

    A Closer Look at Spatiotemporal Convolutions for Action Recognition

    Authors: Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri

    Abstract: In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of r…

    Submitted 11 April, 2018; v1 submitted 30 November, 2017; originally announced November 2017.