
Showing 1–10 of 10 results for author: Soldan, M

  1. arXiv:2405.17146  [pdf, other]

    cs.CV

    Compressed-Language Models for Understanding Compressed File Formats: a JPEG Exploration

    Authors: Juan C. Pérez, Alejandro Pardo, Mattia Soldan, Hani Itani, Juan Leon-Alcazar, Bernard Ghanem

    Abstract: This study investigates whether Compressed-Language Models (CLMs), i.e. language models operating on raw byte streams from Compressed File Formats (CFFs), can understand files compressed by CFFs. We focus on the JPEG format as a representative CFF, given its commonality and its representativeness of key concepts in compression, such as entropy coding and run-length encoding. We test if CLMs unders…

    Submitted 27 May, 2024; originally announced May 2024.

  2. arXiv:2404.03477  [pdf, other]

    cs.CV

    Towards Automated Movie Trailer Generation

    Authors: Dawit Mureja Argaw, Mattia Soldan, Alejandro Pardo, Chen Zhao, Fabian Caba Heilbron, Joon Son Chung, Bernard Ghanem

    Abstract: Movie trailers are an essential tool for promoting films and attracting audiences. However, the process of creating trailers can be time-consuming and expensive. To streamline this process, we propose an automatic trailer generation framework that generates plausible trailers from a full movie by automating shot selection and composition. Our approach draws inspiration from machine translation tec…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024

  3. arXiv:2304.02934  [pdf, other]

    cs.CV

    Boundary-Denoising for Video Activity Localization

    Authors: Mengmeng Xu, Mattia Soldan, Jialin Gao, Shuming Liu, Juan-Manuel Pérez-Rúa, Bernard Ghanem

    Abstract: Video activity localization aims at understanding the semantic content in long untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because temporal activities are continuous in time, and…

    Submitted 6 April, 2023; originally announced April 2023.

  4. arXiv:2302.13372  [pdf, other]

    cs.CV cs.AI cs.LG

    Localizing Moments in Long Video Via Multimodal Guidance

    Authors: Wayner Barrios, Mattia Soldan, Alberto Mario Ceballos-Arroyo, Fabian Caba Heilbron, Bernard Ghanem

    Abstract: The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with interesting findings: current grounding methods alone fail at tackling this challenging task and setup due to their inability to process long video sequences. In this paper, we propos…

    Submitted 15 October, 2023; v1 submitted 26 February, 2023; originally announced February 2023.

    Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2023

  5. arXiv:2207.01622  [pdf, other]

    cs.CV

    Egocentric Video-Language Pretraining @ Ego4D Challenge 2022

    Authors: Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

    Abstract: In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for four Ego4D challenge tasks, including Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR). Especially, we exploit the recently released Ego4D dataset \cite{grauman2021ego4d} to pioneer Egocentric VLP from pretraining dataset, pre…

    Submitted 3 August, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

    Comments: Preprint. 4 pages, 2 figures, 5 tables. Code: https://github.com/showlab/EgoVLP. The Ego4D challenge technical report of EgoVLP arXiv:2206.01670. See EPIC challenge technical report arXiv:2207.01334 for overlap

  6. arXiv:2206.01670  [pdf, other]

    cs.CV cs.AI

    Egocentric Video-Language Pretraining

    Authors: Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

    Abstract: Video-Language Pretraining (VLP), which aims to learn transferable representation to advance a wide range of video-text downstream tasks, has recently received increasing attention. Best performing works rely on large-scale, 3rd-person video-text datasets, such as HowTo100M. In this work, we exploit the recently released Ego4D dataset to pioneer Egocentric VLP along three directions. (i) We create…

    Submitted 12 October, 2022; v1 submitted 3 June, 2022; originally announced June 2022.

    Comments: Accepted by NeurIPS 2022. Double champions at Ego4D and EPIC-Kitchens, CVPR 2022 challenges. 23 pages, 13 figures, 12 tables. Code: https://github.com/showlab/EgoVLP

  7. arXiv:2112.00431  [pdf, other]

    cs.CV cs.AI

    MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

    Authors: Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba Heilbron, Chen Zhao, Silvio Giancola, Bernard Ghanem

    Abstract: The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-t…

    Submitted 28 March, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

    Comments: 12 Pages, 6 Figures, 7 Tables

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition CVPR 2022

  8. arXiv:2011.10132  [pdf, other]

    cs.CV cs.CL

    VLG-Net: Video-Language Graph Matching Network for Video Grounding

    Authors: Mattia Soldan, Mengmeng Xu, Sisi Qu, Jesper Tegner, Bernard Ghanem

    Abstract: Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. The solution to this challenging task demands understanding videos' and queries' semantic content and the fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge into an algorithmic graph matching problem. Fueled by recent a…

    Submitted 16 August, 2021; v1 submitted 19 November, 2020; originally announced November 2020.

    Comments: 14 pages, 7 figures. In proceedings of the ICCV21 workshop: AI for Creative Video Editing and Understanding 2021

  9. arXiv:1911.08608  [pdf, other]

    eess.SP cs.HC cs.LG

    Seq2Seq RNN based Gait Anomaly Detection from Smartphone Acquired Multimodal Motion Data

    Authors: Riccardo Bonetto, Mattia Soldan, Alberto Lanaro, Simone Milani, Michele Rossi

    Abstract: Smartphones and wearable devices are fast growing technologies that, in conjunction with advances in wireless sensor hardware, are enabling ubiquitous sensing applications. Wearables are suitable for indoor and outdoor scenarios, can be placed on many parts of the human body and can integrate a large number of sensors capable of gathering physiological and behavioral biometric information. Here, w…

    Submitted 19 November, 2019; originally announced November 2019.

  10. arXiv:1907.12763  [pdf, other]

    cs.CV cs.CL

    Finding Moments in Video Collections Using Natural Language

    Authors: Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, Bryan Russell

    Abstract: We introduce the task of retrieving relevant video moments from a large corpus of untrimmed, unsegmented videos given a natural language query. Our task poses unique challenges as a system must efficiently identify both the relevant videos and localize the relevant moments in the videos. To address these challenges, we propose SpatioTemporal Alignment with Language (STAL), a model that represents…

    Submitted 23 February, 2022; v1 submitted 30 July, 2019; originally announced July 2019.