
Showing 1–29 of 29 results for author: Modolo, D

  1. arXiv:2404.05136  [pdf, other]

    cs.CV cs.AI

    Self-Supervised Multi-Object Tracking with Path Consistency

    Authors: Zijia Lu, Bing Shuai, Yanbei Chen, Zhenlin Xu, Davide Modolo

    Abstract: In this paper, we propose a novel concept of path consistency to learn robust object matching without using manual object identity supervision. Our key idea is that, to track an object through frames, we can obtain multiple different association results from a model by varying the frames it can observe, i.e., skipping frames in observation. As the differences in observations do not alter the identi…

    Submitted 7 April, 2024; originally announced April 2024.

    Comments: Accepted at CVPR 2024

  2. arXiv:2404.05016  [pdf, other]

    cs.CV

    Hyperbolic Learning with Synthetic Captions for Open-World Detection

    Authors: Fanjie Kong, Yanbei Chen, Jiarui Cai, Davide Modolo

    Abstract: Open-world detection poses significant challenges, as it requires the detection of any object using either object class labels or free-form texts. Existing related works often use large-scale manually annotated caption datasets for training, which are extremely expensive to collect. Instead, we propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary description…

    Submitted 7 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  3. arXiv:2312.06598  [pdf, other]

    cs.CV

    Early Action Recognition with Action Prototypes

    Authors: Guglielmo Camporese, Alessandro Bergamo, Xunyu Lin, Joseph Tighe, Davide Modolo

    Abstract: Early action recognition is an important and challenging problem that enables the recognition of an action from a partially observed video stream where the activity is potentially unfinished or even not started. In this work, we propose a novel model that learns a prototypical representation of the full action for each class and uses it to regularize the architecture and the visual representations…

    Submitted 11 December, 2023; originally announced December 2023.

  4. arXiv:2311.01646  [pdf, other]

    cs.CV cs.LG

    SemiGPC: Distribution-Aware Label Refinement for Imbalanced Semi-Supervised Learning Using Gaussian Processes

    Authors: Abdelhak Lemkhenter, Manchen Wang, Luca Zancato, Gurumurthy Swaminathan, Paolo Favaro, Davide Modolo

    Abstract: In this paper we introduce SemiGPC, a distribution-aware label refinement strategy based on Gaussian Processes where the predictions of the model are derived from the labels posterior distribution. Differently from other buffer-based semi-supervised methods such as CoMatch and SimMatch, our SemiGPC includes a normalization term that addresses imbalances in the global data distribution while mainta…

    Submitted 2 November, 2023; originally announced November 2023.

  5. arXiv:2310.00099  [pdf, other]

    cs.CV

    Denoising and Selecting Pseudo-Heatmaps for Semi-Supervised Human Pose Estimation

    Authors: Zhuoran Yu, Manchen Wang, Yanbei Chen, Paolo Favaro, Davide Modolo

    Abstract: We propose a new semi-supervised learning design for human pose estimation that revisits the popular dual-student framework and enhances it in two ways. First, we introduce a denoising scheme to generate reliable pseudo-heatmaps as targets for learning from unlabeled data. This uses multi-view augmentations and a threshold-and-refine procedure to produce a pool of pseudo-heatmaps. Second, we select t…

    Submitted 29 September, 2023; originally announced October 2023.

  6. arXiv:2309.11445  [pdf, other]

    cs.CV

    SkeleTR: Towards Skeleton-based Action Recognition in the Wild

    Authors: Haodong Duan, Mingze Xu, Bing Shuai, Davide Modolo, Zhuowen Tu, Joseph Tighe, Alessandro Bergamo

    Abstract: We present SkeleTR, a new framework for skeleton-based action recognition. In contrast to prior work, which focuses mainly on controlled environments, we target more general scenarios that typically involve a variable number of people and various forms of interaction between people. SkeleTR works with a two-stage paradigm. It first models the intra-person skeleton dynamics for each skeleton sequen…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: ICCV 2023

  7. arXiv:2306.16048  [pdf, other]

    cs.CV cs.AI

    Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity

    Authors: Zhenlin Xu, Yi Zhu, Tiffany Deng, Abhay Mittal, Yanbei Chen, Manchen Wang, Paolo Favaro, Joseph Tighe, Davide Modolo

    Abstract: This paper presents novel benchmarks for evaluating vision-language models (VLMs) in zero-shot recognition, focusing on granularity and specificity. Although VLMs excel in tasks like image captioning, they face challenges in open-world settings. Our benchmarks test VLMs' consistency in understanding concepts across semantic granularity levels and their response to varying text specificity. Finding…

    Submitted 18 June, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

    Comments: CVPR 2024 MMFM workshop

  8. arXiv:2306.04849  [pdf, other]

    cs.CV

    ScaleDet: A Scalable Multi-Dataset Object Detector

    Authors: Yanbei Chen, Manchen Wang, Abhay Mittal, Zhenlin Xu, Paolo Favaro, Joseph Tighe, Davide Modolo

    Abstract: Multi-dataset training provides a viable solution for exploiting heterogeneous large-scale datasets without extra annotation cost. In this work, we propose a scalable multi-dataset detector (ScaleDet) that can scale up its generalization across datasets when increasing the number of training datasets. Unlike existing multi-dataset learners that mostly rely on manual relabelling efforts or sophisti…

    Submitted 7 June, 2023; originally announced June 2023.

    Comments: CVPR 2023

  9. arXiv:2305.07019  [pdf, other]

    cs.CV cs.AI cs.CL

    Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts

    Authors: Zhaoyang Zhang, Yantao Shen, Kunyu Shi, Zhaowei Cai, Jun Fang, Siqi Deng, Hao Yang, Davide Modolo, Zhuowen Tu, Stefano Soatto

    Abstract: We present a vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks which may interfere with each other, resulting in a single model which we named Musketeer. The integration of knowledge across heterogeneous tasks is enabled by a novel feature called Task Explanation Prompt (TEP). With rich and structured information such as tas…

    Submitted 14 March, 2024; v1 submitted 11 May, 2023; originally announced May 2023.

  10. arXiv:2208.05688  [pdf, other]

    cs.CV cs.AI cs.LG

    Semi-supervised Vision Transformers at Scale

    Authors: Zhaowei Cai, Avinash Ravichandran, Paolo Favaro, Manchen Wang, Davide Modolo, Rahul Bhotika, Zhuowen Tu, Stefano Soatto

    Abstract: We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of the ViT architectures to different tasks. To tackle this problem, we propose a new SSL pipeline, consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised fine-tuning stage, we adop…

    Submitted 11 August, 2022; originally announced August 2022.

  11. arXiv:2205.11710  [pdf, other]

    cs.CV

    SCVRL: Shuffled Contrastive Video Representation Learning

    Authors: Michael Dorkenwald, Fanyi Xiao, Biagio Brattoli, Joseph Tighe, Davide Modolo

    Abstract: We propose SCVRL, a novel contrastive-based framework for self-supervised learning for videos. Differently from previous contrastive learning-based methods that mostly focus on learning visual semantics (e.g., CVRL), SCVRL is capable of learning both semantic and motion patterns. For that, we reformulate the popular shuffling pretext task within a modern contrastive learning paradigm. We show that ou…

    Submitted 23 May, 2022; originally announced May 2022.

    Comments: CVPR 2022 - L3DIVU workshop

  12. arXiv:2204.03101  [pdf, other]

    cs.CV

    Hierarchical Self-supervised Representation Learning for Movie Understanding

    Authors: Fanyi Xiao, Kaustav Kundu, Joseph Tighe, Davide Modolo

    Abstract: Most self-supervised video representation learning approaches focus on action recognition. In contrast, in this paper we focus on self-supervised video learning for movie understanding and propose a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of our hierarchical movie understanding model (based on [37]). Specifically, we propose to pretrain the low-…

    Submitted 6 April, 2022; originally announced April 2022.

    Comments: CVPR 2022

  13. arXiv:2204.00746  [pdf, other]

    cs.CV

    What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions

    Authors: A S M Iftekhar, Hao Chen, Kaustav Kundu, Xinyu Li, Joseph Tighe, Davide Modolo

    Abstract: We propose a novel one-stage Transformer-based semantic and spatial refined transformer (SSRT) to solve the Human-Object Interaction detection task, which requires localizing humans and objects and predicting their interactions. Differently from previous Transformer-based HOI approaches, which mostly focus on improving the design of the decoder outputs for the final detection, SSRT introduces two…

    Submitted 25 May, 2022; v1 submitted 1 April, 2022; originally announced April 2022.

    Comments: CVPR 2022 Oral

  14. arXiv:2203.05553  [pdf, other]

    cs.CV

    Transfer of Representations to Video Label Propagation: Implementation Factors Matter

    Authors: Daniel McKee, Zitong Zhan, Bing Shuai, Davide Modolo, Joseph Tighe, Svetlana Lazebnik

    Abstract: This work studies feature representations for dense label propagation in video, with a focus on recently proposed methods that learn video correspondence using self-supervised signals such as colorization or temporal cycle consistency. In the literature, these methods have been evaluated with an array of inconsistent settings, making it difficult to discern trends or compare performance fairly. St…

    Submitted 10 March, 2022; originally announced March 2022.

  15. arXiv:2108.08836  [pdf, other]

    cs.CV

    Multi-Object Tracking with Hallucinated and Unlabeled Videos

    Authors: Daniel McKee, Bing Shuai, Andrew Berneshawi, Manchen Wang, Davide Modolo, Svetlana Lazebnik, Joseph Tighe

    Abstract: In this paper, we explore learning end-to-end deep neural trackers without tracking annotations. This is important as large-scale training data is essential for training deep neural trackers while tracking annotations are expensive to acquire. In place of tracking annotations, we first hallucinate videos from images with bounding box annotations using zoom-in/out motion transformations to obtain f…

    Submitted 19 August, 2021; originally announced August 2021.

  16. arXiv:2106.09703  [pdf, other]

    cs.CV

    MaCLR: Motion-aware Contrastive Learning of Representations for Videos

    Authors: Fanyi Xiao, Joseph Tighe, Davide Modolo

    Abstract: We present MaCLR, a novel method to explicitly perform cross-modal self-supervised video representation learning from visual and motion modalities. Compared to previous video representation learning methods that mostly focus on learning motion cues implicitly from RGB inputs, MaCLR enriches standard contrastive learning objectives for RGB video clips with a cross-modal learning objective between…

    Submitted 20 July, 2022; v1 submitted 17 June, 2021; originally announced June 2021.

    Comments: ECCV 2022

  17. arXiv:2105.11595  [pdf, other]

    cs.CV

    SiamMOT: Siamese Multi-Object Tracking

    Authors: Bing Shuai, Andrew Berneshawi, Xinyu Li, Davide Modolo, Joseph Tighe

    Abstract: In this paper, we focus on improving online multi-object tracking (MOT). In particular, we introduce a region-based Siamese Multi-Object Tracking network, which we name SiamMOT. SiamMOT includes a motion model that estimates the instance's movement between two frames such that detected instances are associated. To explore how the motion modelling affects its tracking capability, we present two var…

    Submitted 24 May, 2021; originally announced May 2021.

    Journal ref: CVPR 2021

  18. arXiv:2104.00969  [pdf, other]

    cs.CV

    TubeR: Tubelet Transformer for Video Action Detection

    Authors: Jiaojiao Zhao, Yanyi Zhang, Xinyu Li, Hao Chen, Shuai Bing, Mingze Xu, Chunhui Liu, Kaustav Kundu, Yuanjun Xiong, Davide Modolo, Ivan Marsic, Cees G. M. Snoek, Joseph Tighe

    Abstract: We propose TubeR: a simple solution for spatio-temporal video action detection. Different from existing methods that depend on either an off-line actor detector or hand-designed actor-positional hypotheses like proposals or anchors, we propose to directly detect an action tubelet in a video by simultaneously performing action localization and recognition from a single representation. TubeR learns…

    Submitted 10 May, 2022; v1 submitted 2 April, 2021; originally announced April 2021.

    Comments: Accepted at CVPR 2022 (Oral)

  19. arXiv:2104.00179  [pdf, other]

    cs.CV

    Selective Feature Compression for Efficient Activity Recognition Inference

    Authors: Chunhui Liu, Xinyu Li, Hao Chen, Davide Modolo, Joseph Tighe

    Abstract: Most action recognition solutions rely on dense sampling to precisely cover the informative temporal clip. Extensively searching the temporal region is expensive for a real-world application. In this work, we focus on improving the inference efficiency of current action recognition backbones on trimmed videos, and illustrate that one action model can also cover the informative region by dropping non-…

    Submitted 29 July, 2021; v1 submitted 31 March, 2021; originally announced April 2021.

    Comments: Accepted by ICCV 2021

  20. arXiv:2004.07786  [pdf, other]

    cs.CV

    Multi-Object Tracking with Siamese Track-RCNN

    Authors: Bing Shuai, Andrew G. Berneshawi, Davide Modolo, Joseph Tighe

    Abstract: Multi-object tracking systems often consist of a combination of a detector, a short-term linker, a re-identification feature extractor and a solver that takes the output from these separate components and makes a final prediction. Differently, this work aims to unify all these in a single tracking system. Towards this, we propose Siamese Track-RCNN, a two-stage detect-and-track framework which con…

    Submitted 16 April, 2020; originally announced April 2020.

  21. arXiv:2003.13759  [pdf, other]

    cs.CV

    Understanding the impact of mistakes on background regions in crowd counting

    Authors: Davide Modolo, Bing Shuai, Rahul Rama Varior, Joseph Tighe

    Abstract: Every crowd counting researcher has likely observed their model output wrong positive predictions on image regions not containing any person. But how often do these mistakes happen? Are our models negatively affected by this? In this paper we analyze this problem in depth. In order to understand its magnitude, we present an extensive analysis on five of the most important crowd counting datasets.…

    Submitted 30 March, 2020; originally announced March 2020.

  22. arXiv:2003.13743  [pdf, other]

    cs.CV

    Combining detection and tracking for human pose estimation in videos

    Authors: Manchen Wang, Joseph Tighe, Davide Modolo

    Abstract: We propose a novel top-down approach that tackles the problem of multi-person human pose estimation and tracking in videos. In contrast to existing top-down approaches, our method is not limited by the performance of its person detector and can predict the poses of person instances not localized. It achieves this capability by propagating known person locations forward and backward in time and sea…

    Submitted 30 March, 2020; originally announced March 2020.

    Comments: Accepted to CVPR 2020 as oral

  23. arXiv:1908.07625  [pdf, other]

    cs.CV

    Action recognition with spatial-temporal discriminative filter banks

    Authors: Brais Martinez, Davide Modolo, Yuanjun Xiong, Joseph Tighe

    Abstract: Action recognition has seen a dramatic performance improvement in the last few years. Most of the current state-of-the-art literature either aims at improving performance through changes to the backbone CNN network, or they explore different trade-offs between computational efficiency and performance, again through altering the backbone network. However, almost all of these works maintain the same…

    Submitted 20 August, 2019; originally announced August 2019.

    Comments: ICCV 2019 Accepted Paper

  24. arXiv:1901.06026  [pdf, other]

    cs.CV

    Multi-Scale Attention Network for Crowd Counting

    Authors: Rahul Rama Varior, Bing Shuai, Joseph Tighe, Davide Modolo

    Abstract: In crowd counting datasets, people appear at different scales, depending on their distance from the camera. To address this issue, we propose a novel multi-branch scale-aware attention network that exploits the hierarchical structure of convolutional neural networks and generates, in a single forward pass, multi-scale density predictions from different layers of the architecture. To aggregate thes…

    Submitted 25 July, 2019; v1 submitted 17 January, 2019; originally announced January 2019.

  25. arXiv:1703.09529  [pdf, other]

    cs.CV

    Objects as context for detecting their semantic parts

    Authors: Abel Gonzalez-Garcia, Davide Modolo, Vittorio Ferrari

    Abstract: We present a semantic part detection approach that effectively leverages object information. We use the object appearance and its class as indicators of what parts to expect. We also model the expected relative location of parts inside the objects based on their appearance. We achieve this with a new network module, called OffsetNet, that efficiently predicts a variable number of part locations wit…

    Submitted 27 March, 2018; v1 submitted 28 March, 2017; originally announced March 2017.

  26. Learning Semantic Part-Based Models from Google Images

    Authors: Davide Modolo, Vittorio Ferrari

    Abstract: We propose a technique to train semantic part-based models of object classes from Google Images. Our models encompass the appearance of parts and their spatial arrangement on the object, specific to each viewpoint. We learn these rich models by collecting training instances for both parts and objects, and automatically connecting the two levels. Our framework works incrementally, by learning from…

    Submitted 6 July, 2017; v1 submitted 11 September, 2016; originally announced September 2016.

  27. arXiv:1607.03738  [pdf, other]

    cs.CV

    Do semantic parts emerge in Convolutional Neural Networks?

    Authors: Abel Gonzalez-Garcia, Davide Modolo, Vittorio Ferrari

    Abstract: Semantic object parts can be useful for several visual recognition tasks. Lately, these tasks have been addressed using Convolutional Neural Networks (CNN), achieving outstanding results. In this work we study whether CNNs learn semantic parts in their internal representation. We investigate the responses of convolutional filters and try to associate their stimuli with semantic parts. We perform t…

    Submitted 20 September, 2017; v1 submitted 13 July, 2016; originally announced July 2016.

  28. arXiv:1503.00787  [pdf, other]

    cs.CV

    Context Forest for efficient object detection with large mixture models

    Authors: Davide Modolo, Alexander Vezhnevets, Vittorio Ferrari

    Abstract: We present Context Forest (ConF), a technique for predicting properties of the objects in an image based on its global appearance. Compared to standard nearest-neighbour techniques, ConF is more accurate, fast and memory efficient. We train ConF to predict which aspects of an object class are likely to appear in a given image (e.g. which viewpoint). This enables to speed-up multi-component object…

    Submitted 2 March, 2015; originally announced March 2015.

  29. arXiv:1503.00783  [pdf, other]

    cs.CV

    Joint calibration of Ensemble of Exemplar SVMs

    Authors: Davide Modolo, Alexander Vezhnevets, Olga Russakovsky, Vittorio Ferrari

    Abstract: We present a method for calibrating the Ensemble of Exemplar SVMs model. Unlike the standard approach, which calibrates each SVM independently, our method optimizes their joint performance as an ensemble. We formulate joint calibration as a constrained optimization problem and devise an efficient optimization algorithm to find its global optimum. The algorithm dynamically discards parts of the sol…

    Submitted 24 April, 2015; v1 submitted 2 March, 2015; originally announced March 2015.