Showing 1–14 of 14 results for author: Rahtu, E

  1. arXiv:2401.16423  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Synchformer: Efficient Synchronization from Sparse Cues

    Authors: Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman

    Abstract: Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: Extended version of the ICASSP 24 paper. Project page: https://www.robots.ox.ac.uk/~vgg/research/synchformer/ Code: https://github.com/v-iashin/Synchformer

  2. arXiv:2401.10761  [pdf, other]

    eess.IV cs.CV

    NN-VVC: Versatile Video Coding boosted by self-supervisedly learned image coding for machines

    Authors: Jukka I. Ahonen, Nam Le, Honglei Zhang, Antti Hallapuro, Francesco Cricri, Hamed Rezazadegan Tavakoli, Miska M. Hannuksela, Esa Rahtu

    Abstract: The recent progress in artificial intelligence has led to an ever-increasing usage of images and videos by machine analysis algorithms, mainly neural networks. Nonetheless, compression, storage and transmission of media have traditionally been designed considering human beings as the viewers of the content. Recent research on image and video coding for machine analysis has progressed mainly in two…

    Submitted 19 January, 2024; originally announced January 2024.

    Comments: ISM 2023 Best paper award winner version

  3. Bridging the gap between image coding for machines and humans

    Authors: Nam Le, Honglei Zhang, Francesco Cricri, Ramin G. Youvalari, Hamed Rezazadegan Tavakoli, Emre Aksu, Miska M. Hannuksela, Esa Rahtu

    Abstract: Image coding for machines (ICM) aims at reducing the bitrate required to represent an image while minimizing the drop in machine vision analysis accuracy. In many use cases, such as surveillance, it is also important that the visual quality is not drastically deteriorated by the compression process. Recent works on using neural network (NN) based ICM codecs have shown significant coding gains agai…

    Submitted 19 January, 2024; originally announced January 2024.

    Journal ref: IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 2022, pp. 3411-3415

  4. arXiv:2210.07055  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors

    Authors: Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman

    Abstract: The objective of this paper is audio-visual synchronisation of general videos 'in the wild'. For such videos, the events that may be harnessed for synchronisation cues may be spatially small and may occur only infrequently during a many seconds-long video clip, i.e. the synchronisation signal is 'sparse in space and time'. This contrasts with the case of synchronising videos of talking heads, wher…

    Submitted 13 October, 2022; originally announced October 2022.

    Comments: Accepted as a spotlight presentation for the BMVC 2022. Code: https://github.com/v-iashin/SparseSync Project page: https://v-iashin.github.io/SparseSync

  5. arXiv:2112.08767  [pdf, other]

    eess.IV cs.CV cs.LG

    Adaptation and Attention for Neural Video Coding

    Authors: Nannan Zou, Honglei Zhang, Francesco Cricri, Ramin G. Youvalari, Hamed R. Tavakoli, Jani Lainema, Emre Aksu, Miska Hannuksela, Esa Rahtu

    Abstract: Neural image coding represents now the state-of-the-art image compression approach. However, a lot of work is still to be done in the video domain. In this work, we propose an end-to-end learned video codec that introduces several architectural novelties as well as training novelties, revolving around the concepts of adaptation and attention. Our codec is organized as an intra-frame codec paired w…

    Submitted 16 December, 2021; originally announced December 2021.

  6. arXiv:2110.08791  [pdf, other]

    cs.CV cs.AI cs.LG cs.SD eess.AS

    Taming Visually Guided Sound Generation

    Authors: Vladimir Iashin, Esa Rahtu

    Abstract: Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos in less time than it tak…

    Submitted 17 October, 2021; originally announced October 2021.

    Comments: Accepted as an oral presentation for the BMVC 2021. Code: https://github.com/v-iashin/SpecVQGAN Project page: https://v-iashin.github.io/SpecVQGAN

  7. arXiv:2109.08867  [pdf, other]

    cs.CV cs.SD eess.AS

    V-SlowFast Network for Efficient Visual Sound Separation

    Authors: Lingyu Zhu, Esa Rahtu

    Abstract: The objective of this paper is to perform visual sound separation: i) we study visual sound separation on spectrograms of different temporal resolutions; ii) we propose a new light yet efficient three-stream framework V-SlowFast that operates on Visual frame, Slow spectrogram, and Fast spectrogram. The Slow spectrogram captures the coarse temporal resolution while the Fast spectrogram contains the…

    Submitted 21 September, 2021; v1 submitted 18 September, 2021; originally announced September 2021.

    Comments: total 21 pages: main paper 8 pages, references 3 pages, supplementary material 10 pages

  8. Image coding for machines: an end-to-end learned approach

    Authors: Nam Le, Honglei Zhang, Francesco Cricri, Ramin Ghaznavi-Youvalari, Esa Rahtu

    Abstract: Over recent years, deep learning-based computer vision systems have been applied to images at an ever-increasing pace, oftentimes representing the only type of consumption for those images. Given the dramatic explosion in the number of images generated per day, a question arises: how much better would an image codec targeting machine-consumption perform against state-of-the-art codecs targeting hu…

    Submitted 30 August, 2021; v1 submitted 23 August, 2021; originally announced August 2021.

    Comments: Fixed a couple of mistakes since the version accepted in IEEE ICASSP2021

    Journal ref: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2021), 2021, pp. 1590-1594

  9. Learned Image Coding for Machines: A Content-Adaptive Approach

    Authors: Nam Le, Honglei Zhang, Francesco Cricri, Ramin Ghaznavi-Youvalari, Hamed Rezazadegan Tavakoli, Esa Rahtu

    Abstract: Today, according to the Cisco Annual Internet Report (2018-2023), the fastest-growing category of Internet traffic is machine-to-machine communication. In particular, machine-to-machine communication of images and videos represents a new challenge and opens up new perspectives in the context of data compression. One possible solution approach consists of adapting current human-targeted image and v…

    Submitted 13 October, 2021; v1 submitted 23 August, 2021; originally announced August 2021.

    Comments: Fig 4 correction

    Journal ref: 2021 IEEE International Conference on Multimedia and Expo (ICME), 2021, pp. 1-6

  10. arXiv:2007.16054  [pdf, other]

    eess.IV cs.CV cs.LG cs.MM stat.ML

    Learning to Learn to Compress

    Authors: Nannan Zou, Honglei Zhang, Francesco Cricri, Hamed R. Tavakoli, Jani Lainema, Miska Hannuksela, Emre Aksu, Esa Rahtu

    Abstract: In this paper we present an end-to-end meta-learned system for image compression. Traditional machine learning based approaches to image compression train one or more neural network for generalization performance. However, at inference time, the encoder or the latent tensor output by the encoder can be optimized for each test image. This optimization can be regarded as a form of adaptation or bene…

    Submitted 1 May, 2021; v1 submitted 31 July, 2020; originally announced July 2020.

  11. arXiv:2005.08271  [pdf, other]

    cs.CV cs.CL cs.LG cs.SD eess.AS

    A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer

    Authors: Vladimir Iashin, Esa Rahtu

    Abstract: Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting only visual features, while completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they show poor results or demonstrate the importance on a dataset with a specific domain. In this paper, we introduce Bi-modal Tr…

    Submitted 11 August, 2020; v1 submitted 17 May, 2020; originally announced May 2020.

    Comments: Accepted by BMVC 2020. More experiments. Code: https://github.com/v-iashin/bmt Project page: https://v-iashin.github.io/bmt

  12. arXiv:2004.09226  [pdf, other]

    eess.IV cs.CV cs.LG

    End-to-End Learning for Video Frame Compression with Self-Attention

    Authors: Nannan Zou, Honglei Zhang, Francesco Cricri, Hamed R. Tavakoli, Jani Lainema, Emre Aksu, Miska Hannuksela, Esa Rahtu

    Abstract: One of the core components of conventional (i.e., non-learned) video codecs consists of predicting a frame from a previously-decoded frame, by leveraging temporal correlations. In this paper, we propose an end-to-end learned system for compressing video frames. Instead of relying on pixel-space motion (as with optical flow), our system learns deep embeddings of frames and encodes their difference…

    Submitted 20 April, 2020; originally announced April 2020.

  13. arXiv:2004.04548  [pdf, other]

    cs.CV cs.LG eess.IV

    Sequential View Synthesis with Transformer

    Authors: Phong Nguyen-Ha, Lam Huynh, Esa Rahtu, Janne Heikkila

    Abstract: This paper addresses the problem of novel view synthesis by means of neural rendering, where we are interested in predicting the novel view at an arbitrary camera pose based on a given set of input images from other viewpoints. Using the known query pose and input poses, we create an ordered set of observations that leads to the target view. Thus, the problem of single novel view synthesis is refo…

    Submitted 22 September, 2020; v1 submitted 9 April, 2020; originally announced April 2020.

    Comments: Code is available at: https://github.com/phongnhhn92/TransformerGQN; Supplementary material: https://bit.ly/3kEgnzU

  14. arXiv:2003.07758  [pdf, other]

    cs.CV cs.CL cs.LG cs.SD eess.AS eess.IV

    Multi-modal Dense Video Captioning

    Authors: Vladimir Iashin, Esa Rahtu

    Abstract: Dense video captioning is a task of localizing interesting events from an untrimmed video and producing textual description (captions) for each localized event. Most of the previous works in dense video captioning are solely based on visual information and completely ignore the audio track. However, audio, and speech, in particular, are vital cues for a human observer in understanding an environme…

    Submitted 5 May, 2020; v1 submitted 17 March, 2020; originally announced March 2020.

    Comments: To appear in the proceedings of CVPR Workshops 2020; Code: https://github.com/v-iashin/MDVC Project Page: https://v-iashin.github.io/mdvc