
Showing 1–7 of 7 results for author: Iashin, V

  1. arXiv:2401.16423  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Synchformer: Efficient Synchronization from Sparse Cues

    Authors: Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman

    Abstract: Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: Extended version of the ICASSP 24 paper. Project page: https://www.robots.ox.ac.uk/~vgg/research/synchformer/ Code: https://github.com/v-iashin/Synchformer

  2. arXiv:2210.07055  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors

    Authors: Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman

    Abstract: The objective of this paper is audio-visual synchronisation of general videos 'in the wild'. For such videos, the events that may be harnessed for synchronisation cues may be spatially small and may occur only infrequently during a many seconds-long video clip, i.e. the synchronisation signal is 'sparse in space and time'. This contrasts with the case of synchronising videos of talking heads, wher…

    Submitted 13 October, 2022; originally announced October 2022.

    Comments: Accepted as a spotlight presentation for the BMVC 2022. Code: https://github.com/v-iashin/SparseSync Project page: https://v-iashin.github.io/SparseSync

  3. arXiv:2110.08791  [pdf, other]

    cs.CV cs.AI cs.LG cs.SD eess.AS

    Taming Visually Guided Sound Generation

    Authors: Vladimir Iashin, Esa Rahtu

    Abstract: Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos in less time than it tak…

    Submitted 17 October, 2021; originally announced October 2021.

    Comments: Accepted as an oral presentation for the BMVC 2021. Code: https://github.com/v-iashin/SpecVQGAN Project page: https://v-iashin.github.io/SpecVQGAN

  4. arXiv:2107.12719  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    The CORSMAL benchmark for the prediction of the properties of containers

    Authors: Alessio Xompero, Santiago Donaher, Vladimir Iashin, Francesca Palermo, Gökhan Solak, Claudio Coppola, Reina Ishikawa, Yuichi Nagao, Ryo Hachiuma, Qi Liu, Fan Feng, Chuanlin Lan, Rosa H. M. Chan, Guilherme Christmann, Jyun-Ting Song, Gonuguntla Neeharika, Chinnakotla Krishna Teja Reddy, Dinesh Jain, Bakhtawar Ur Rehman, Andrea Cavallaro

    Abstract: The contactless estimation of the weight of a container and the amount of its content manipulated by a person are key pre-requisites for safe human-to-robot handovers. However, opaqueness and transparencies of the container and the content, and variability of materials, shapes, and sizes, make this estimation difficult. In this paper, we present a range of methods and an open framework to benchmar…

    Submitted 21 April, 2022; v1 submitted 27 July, 2021; originally announced July 2021.

    Comments: Authors' post-print accepted for publication in IEEE Access, see https://doi.org/10.1109/ACCESS.2022.3166906 . 14 pages, 6 tables, 7 figures

    Journal ref: IEEE Access, vol. 10, 2022, 1-15

  5. arXiv:2012.01311  [pdf, other]

    cs.CV cs.HC cs.LG cs.RO cs.SD

    Top-1 CORSMAL Challenge 2020 Submission: Filling Mass Estimation Using Multi-modal Observations of Human-robot Handovers

    Authors: Vladimir Iashin, Francesca Palermo, Gökhan Solak, Claudio Coppola

    Abstract: Human-robot object handover is a key skill for the future of human-robot collaboration. CORSMAL 2020 Challenge focuses on the perception part of this problem: the robot needs to estimate the filling mass of a container held by a human. Although there are powerful methods in image processing and audio processing individually, answering such a problem requires processing data from multiple sensors t…

    Submitted 2 December, 2020; originally announced December 2020.

    Comments: Code: https://github.com/v-iashin/CORSMAL Docker: https://hub.docker.com/r/iashin/corsmal

  6. arXiv:2005.08271  [pdf, other]

    cs.CV cs.CL cs.LG cs.SD eess.AS

    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

    Authors: Vladimir Iashin, Esa Rahtu

    Abstract: Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting only visual features, while completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they show poor results or demonstrate the importance on a dataset with a specific domain. In this paper, we introduce Bi-modal Tr…

    Submitted 11 August, 2020; v1 submitted 17 May, 2020; originally announced May 2020.

    Comments: Accepted by BMVC 2020. More experiments. Code: https://github.com/v-iashin/bmt Project page: https://v-iashin.github.io/bmt

  7. arXiv:2003.07758  [pdf, other]

    cs.CV cs.CL cs.LG cs.SD eess.AS eess.IV

    Multi-modal Dense Video Captioning

    Authors: Vladimir Iashin, Esa Rahtu

    Abstract: Dense video captioning is a task of localizing interesting events from an untrimmed video and producing textual description (captions) for each localized event. Most of the previous works in dense video captioning are solely based on visual information and completely ignore the audio track. However, audio, and speech, in particular, are vital cues for a human observer in understanding an environme…

    Submitted 5 May, 2020; v1 submitted 17 March, 2020; originally announced March 2020.

    Comments: To appear in the proceedings of CVPR Workshops 2020; Code: https://github.com/v-iashin/MDVC Project Page: https://v-iashin.github.io/mdvc