Search | arXiv e-print repository

JDLL: A library to run Deep Learning models on Java bioimage informatics platforms

Authors: Carlos Garcia Lopez de Haro, Stephane Dallongeville, Thomas Musset, Estibaliz Gomez de Mariscal, Daniel Sage, Wei Ouyang, Arrate Munoz-Barrutia, Jean-Yves Tinevez, Jean-Christophe Olivo-Marin

Abstract: We present JDLL, an agile Java library that offers a comprehensive toolset/API to unify the development of high-end applications of DL for bioimage analysis and to streamline their installation and maintenance. JDLL provides all the functions required to consume DL models seamlessly, without being burdened by the configuration of the Python-based DL frameworks, within Java bioimage informatics platforms. Moreover, it allows the deployment of pre-trained models in the Bioimage Model Zoo (BMZ) by shipping the logic to connect to the BMZ website, download and run a selected model inference.

Submitted 25 September, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: New version with new figure and updated links

arXiv:2306.00489 [pdf, other]

Speech inpainting: Context-based speech synthesis guided by video

Authors: Juan F. Montesinos, Daniel Michelsanti, Gloria Haro, Zheng-Hua Tan, Jesper Jensen

Abstract: Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with speech sounds. This motivates studies that incorporate the visual modality to enhance an acoustic speech signal or even restore missing audio information. Specifically, this paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the speech in a corrupted audio segment in a way that it is consistent with the corresponding visual content and the uncorrupted audio context. We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio. It outperforms the previous state-of-the-art audio-visual model and audio-only baselines. We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.

Submitted 1 June, 2023; originally announced June 2023.

Comments: Accepted in Interspeech23

arXiv:2204.02090 [pdf, other]

VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices

Authors: Venkatesh S. Kadandale, Juan F. Montesinos, Gloria Haro

Abstract: In this paper, we address the problem of lip-voice synchronisation in videos containing human face and voice. Our approach is based on determining if the lips motion and the voice in a video are synchronised or not, depending on their audio-visual correspondence score. We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models in the audio-visual synchronisation task on the standard lip-reading speech benchmark dataset LRS2. While the existing methods focus mainly on lip synchronisation in speech videos, we also consider the special case of the singing voice. The singing voice is a more challenging use case for synchronisation due to sustained vowel sounds. We also investigate the relevance of lip synchronisation models trained on speech datasets in the context of singing voice. Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task to outperform a baseline audio-visual model which was trained end-to-end. The demos, source code, and the pre-trained models are available on https://ipcv.github.io/VocaLiST/

Submitted 30 June, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: Paper accepted to Interspeech 2022; Project Page: https://ipcv.github.io/VocaLiST/

arXiv:2203.04099 [pdf, other]

VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

Authors: Juan F. Montesinos, Venkatesh S. Kadandale, Gloria Haro

Abstract: This paper presents an audio-visual approach for voice separation which produces state-of-the-art results at a low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present different ablation studies and comparison to state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation in the task of singing voice separation. The demos, code, and weights are available in https://ipcv.github.io/VoViT/

Submitted 19 July, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

Comments: Accepted to ECCV 2022

arXiv:2106.00359 [pdf, other]

Learning Football Body-Orientation as a Matter of Classification

Authors: Adrià Arbués-Sangüesa, Adrián Martín, Paulino Granero, Coloma Ballester, Gloria Haro

Abstract: Orientation is a crucial skill for football players that becomes a differential factor in a large set of events, especially the ones involving passes. However, existing orientation estimation methods, which are based on computer-vision techniques, still have a lot of room for improvement. To the best of our knowledge, this article presents the first deep learning model for estimating orientation directly from video footage. By approaching this challenge as a classification problem where classes correspond to orientation bins, and by introducing a cyclic loss function, a well-known convolutional network is refined to provide player orientation data. The model is trained by using ground-truth orientation data obtained from wearable EPTS devices, which are individually compensated with respect to the perceived orientation in the current frame. The obtained results outperform previous methods; in particular, the absolute median error is less than 12 degrees per player. An ablation study is included in order to show the potential generalization to any kind of football video footage.

Submitted 1 June, 2021; originally announced June 2021.

Comments: Accepted in the AI for Sports Analytics Workshop at IJCAI 2021

arXiv:2104.09946 [pdf, other]

A cappella: Audio-visual Singing Voice Separation

Authors: Juan F. Montesinos, Venkatesh S. Kadandale, Gloria Haro

Abstract: The task of isolating a target singing voice in music videos has useful applications. In this work, we explore the single-channel singing voice separation problem from a multimodal perspective, by jointly learning from audio and visual modalities. To do so, we present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube. We also propose an audio-visual convolutional network based on graphs which achieves state-of-the-art singing voice separation results on our dataset and compare it against its audio-only counterpart, U-Net, and a state-of-the-art audio-visual speech separation model. We evaluate the models in the following challenging setups: i) presence of overlapping voices in the audio mixtures, ii) the target voice set to lower volume levels in the mix, and iii) combination of i) and ii), the third being the most challenging evaluation setup. We demonstrate that our model outperforms the baseline models in the singing voice separation task in the most challenging evaluation setup. The code, the pre-trained models, and the dataset are publicly available at https://ipcv.github.io/Acappella/

Submitted 18 October, 2021; v1 submitted 20 April, 2021; originally announced April 2021.

Comments: Paper accepted at The 32nd British Machine Vision Conference, BMVC 2021

arXiv:2006.07931 [pdf, other]

Solos: A Dataset for Audio-Visual Music Analysis

Authors: Juan F. Montesinos, Olga Slizovskaia, Gloria Haro

Abstract: In this paper, we present a new dataset of music performance videos which can be used for training machine learning methods for multiple tasks such as audio-visual blind source separation and localization, cross-modal correspondences, cross-modal generation and, in general, any audio-visual self-supervised task. These videos, gathered from YouTube, consist of solo musical performances of 13 different instruments. Compared to previously proposed audio-visual datasets, Solos is cleaner since a large number of its recordings are auditions and manually checked recordings, ensuring there is no background noise or effects added in the video post-processing. Besides, it is, to the best of our knowledge, the only dataset that contains the whole set of instruments present in the URMP dataset, a high-quality dataset of 44 audio-visual recordings of multi-instrument classical music pieces with individual audio tracks. URMP was intended to be used for source separation; thus, we evaluate the performance on the URMP dataset of two different source-separation models trained on Solos. The dataset is publicly available at https://juanfmontesinos.github.io/Solos/

Submitted 6 August, 2020; v1 submitted 14 June, 2020; originally announced June 2020.

Comments: Rephrased some sentences. Explanation about OpenPose. Minor grammatical errors

arXiv:2004.03873 [pdf, other]

doi 10.1109/TASLP.2021.3082331

Conditioned Source Separation for Music Instrument Performances

Authors: Olga Slizovskaia, Gloria Haro, Emilia Gómez

Abstract: In music source separation, the number of sources may vary for each piece and some of the sources may belong to the same family of instruments, thus sharing timbral characteristics and making the sources more correlated. This leads to additional challenges in the source separation problem. This paper proposes a source separation method for multiple musical instruments sounding simultaneously and explores how much additional information apart from the audio stream can lift the quality of source separation. We explore conditioning techniques at different levels of a primary source separation network and utilize two extra modalities of data, namely presence or absence of instruments in the mixture, and the corresponding video stream data.

Submitted 7 July, 2021; v1 submitted 8 April, 2020; originally announced April 2020.

Comments: 14 pages, 5 figures, under review

arXiv:2004.02541 [pdf, other]

Vocoder-Based Speech Synthesis from Silent Videos

Authors: Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emilia Gómez, Zheng-Hua Tan, Jesper Jensen

Abstract: Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence determines an extremely low speech intelligibility for untrained lip readers. In this paper, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm. To improve speech reconstruction performance, our model is also trained to predict text information in a multi-task learning fashion and it is able to simultaneously reconstruct and recognise speech in real time. The results in terms of estimated speech quality and intelligibility show the effectiveness of our method, which exhibits an improvement over existing video-to-speech approaches.

Submitted 15 August, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

Comments: Accepted to Interspeech 2020

arXiv:1907.01813 [pdf, other]

A Case Study of Deep-Learned Activations via Hand-Crafted Audio Features

Authors: Olga Slizovskaia, Emilia Gómez, Gloria Haro

Abstract: The explainability of Convolutional Neural Networks (CNNs) is a particularly challenging task in all areas of application, and it is notably under-researched in the music and audio domain. In this paper, we approach explainability by exploiting the knowledge we have on hand-crafted audio features. Our study focuses on a well-defined MIR task, the recognition of musical instruments from user-generated music recordings. We compute the similarity between a set of traditional audio features and representations learned by CNNs. We also propose a technique for measuring the similarity between activation maps and audio features which are typically presented in the form of a matrix, such as chromagrams or spectrograms. We observe that some neurons' activations correspond to well-known classical audio features. In particular, for shallow layers, we found similarities between activations and harmonic and percussive components of the spectrum. For deeper layers, we compare chromagrams with high-level activation maps as well as loudness and onset rate with deep-learned embeddings.

Submitted 3 July, 2019; originally announced July 2019.

Comments: The 2018 Joint Workshop on Machine Learning for Music, The Federated Artificial Intelligence Meeting (FAIM), Joint workshop program of ICML, IJCAI/ECAI, and AAMAS, Stockholm, Sweden, Saturday, July 14th, 2018

arXiv:1811.01850 [pdf, other]

End-to-End Sound Source Separation Conditioned On Instrument Labels

Authors: Olga Slizovskaia, Leo Kim, Gloria Haro, Emilia Gomez

Abstract: Can we perform an end-to-end music source separation with a variable number of sources using a deep learning model? We present an extension of the Wave-U-Net model which allows end-to-end monaural source separation with a non-fixed number of sources. Furthermore, we propose multiplicative conditioning with instrument labels at the bottleneck of the Wave-U-Net and show its effect on the separation results. This approach leads to other types of conditioning such as audio-visual source separation and score-informed source separation.

Submitted 9 May, 2019; v1 submitted 5 November, 2018; originally announced November 2018.

Comments: 5 pages, 2 figures, 2 tables, ICASSP 2019

Showing 1–11 of 11 results for author: Haro, G