Showing 1–9 of 9 results for author: Korbar, B

Searching in archive cs.
  1. arXiv:2401.12039  [pdf, other]

    cs.CV cs.SD eess.AS

    Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling

    Authors: Bruno Korbar, Jaesung Huh, Andrew Zisserman

    Abstract: The goal of this paper is automatic character-aware subtitle generation. Given a video and a minimal amount of metadata, we propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps, and the character speaking identified. The key idea is to first use audio-visual cues to select a set of high-precision audio exemplars for each character, and the…

    Submitted 22 January, 2024; originally announced January 2024.

    Comments: Accepted for publication in ICASSP 2024

  2. arXiv:2312.11897  [pdf, other]

    cs.CV

    Text-Conditioned Resampler For Long Form Video Understanding

    Authors: Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari

    Abstract: In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to a LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can pro…

    Submitted 25 March, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

  3. arXiv:2210.14601  [pdf, other]

    cs.CV

    End-to-end Tracking with a Multi-query Transformer

    Authors: Bruno Korbar, Andrew Zisserman

    Abstract: Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time. Our aim in this paper is to move beyond tracking-by-detection approaches, which perform well on datasets where the object classes are known, to class-agnostic tracking that also performs well for unknown object classes. To this end,…

    Submitted 26 October, 2022; originally announced October 2022.

  4. arXiv:2007.04755  [pdf, other]

    cs.CV

    Generalized Few-Shot Video Classification with Video Retrieval and Feature Generation

    Authors: Yongqin Xian, Bruno Korbar, Matthijs Douze, Lorenzo Torresani, Bernt Schiele, Zeynep Akata

    Abstract: Few-shot learning aims to recognize novel classes from a few examples. Although significant progress has been made in the image domain, few-shot video classification is relatively unexplored. We argue that previous methods underestimate the importance of video feature learning and propose to learn spatiotemporal features using a 3D CNN. Proposing a two-stage approach that learns video features on…

    Submitted 13 October, 2021; v1 submitted 9 July, 2020; originally announced July 2020.

    Comments: Accepted by TPAMI in October, 2021

  5. arXiv:2006.07203   

    cs.CV

    Video Understanding as Machine Translation

    Authors: Bruno Korbar, Fabio Petroni, Rohit Girdhar, Lorenzo Torresani

    Abstract: With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations. Most prior work formulates the objective as a contrastive metric learning problem between the modalities. To enable effective learning, however, these strategies require a careful selection of positi…

    Submitted 17 September, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

    Comments: The authors have temporarily withdrawn this paper to reassess some of the experimental results

  6. arXiv:1911.12667  [pdf, other]

    cs.CV

    Self-Supervised Learning by Cross-Modal Audio-Video Clustering

    Authors: Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran

    Abstract: Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning.…

    Submitted 26 October, 2020; v1 submitted 28 November, 2019; originally announced November 2019.

    Comments: Accepted to NeurIPS 2020 (spotlight presentation)

  7. arXiv:1904.04289  [pdf, other]

    cs.CV

    SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition

    Authors: Bruno Korbar, Du Tran, Lorenzo Torresani

    Abstract: While many action recognition datasets consist of collections of brief, trimmed videos each containing a relevant action, videos in the real-world (e.g., on YouTube) exhibit very different properties: they are often several minutes long, where brief relevant clips are often interleaved with segments of extended duration containing little change. Applying densely an action recognition system to eve…

    Submitted 30 August, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

  8. arXiv:1807.00230  [pdf, other]

    cs.CV

    Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

    Authors: Bruno Korbar, Du Tran, Lorenzo Torresani

    Abstract: There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredien…

    Submitted 9 November, 2018; v1 submitted 30 June, 2018; originally announced July 2018.

    Comments: Note: Changed name - added experiments

  9. arXiv:1703.01550  [pdf, other]

    cs.CV

    Deep-Learning for Classification of Colorectal Polyps on Whole-Slide Images

    Authors: Bruno Korbar, Andrea M. Olofson, Allen P. Miraflor, Katherine M. Nicka, Matthew A. Suriawinata, Lorenzo Torresani, Arief A. Suriawinata, Saeed Hassanpour

    Abstract: Histopathological characterization of colorectal polyps is an important principle for determining the risk of colorectal cancer and future rates of surveillance for patients. This characterization is time-intensive, requires years of specialized training, and suffers from significant inter-observer and intra-observer variability. In this work, we built an automatic image-understanding method that…

    Submitted 12 April, 2017; v1 submitted 4 March, 2017; originally announced March 2017.