Search | arXiv e-print repository

Direction Specific Ambisonics Source Separation with End-To-End Deep Learning

Authors: Francesc Lluís, Nils Meyer-Kahlen, Vasileios Chatziioannou, Alex Hofmann

Abstract: Ambisonics is a scene-based spatial audio format that has several useful features compared to object-based formats, such as efficient whole scene rotation and versatility. However, it does not provide direct access to the individual source signals, so that these have to be separated from the mixture when required. Typically, this is done with linear spherical harmonics (SH) beamforming. In this pa… ▽ More Ambisonics is a scene-based spatial audio format that has several useful features compared to object-based formats, such as efficient whole scene rotation and versatility. However, it does not provide direct access to the individual source signals, so that these have to be separated from the mixture when required. Typically, this is done with linear spherical harmonics (SH) beamforming. In this paper, we explore deep-learning-based source separation on static Ambisonics mixtures. In contrast to most source separation approaches, which separate a fixed number of sources of specific sound types, we focus on separating arbitrary sound from specific directions. Specifically, we propose three operating modes that combine a source separation neural network with SH beamforming: refinement, implicit, and mixed mode. We show that a neural network can implicitly associate conditioning directions with the spatial information contained in the Ambisonics scene to extract specific sources. We evaluate the performance of the three proposed approaches and compare them to SH beamforming on musical mixtures generated with the musdb18 dataset, as well as with mixtures generated with the FUSS dataset for universal source separation, under both anechoic and room conditions. Results show that the proposed approaches offer improved separation performance and spatial selectivity compared to conventional SH beamforming. △ Less

Submitted 20 June, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

Comments: Code and listening examples: https://github.com/francesclluis/direction-ambisonics-source-separation

Journal ref: Acta Acustica 2023, 7, 29

arXiv:2104.12462 [pdf, other]

doi 10.1186/s13636-022-00265-4

Points2Sound: From mono to binaural audio using 3D point cloud scenes

Authors: Francesc Lluís, Vasileios Chatziioannou, Alex Hofmann

Abstract: For immersive applications, the generation of binaural sound that matches its visual counterpart is crucial to bring meaningful experiences to people in a virtual environment. Recent studies have shown the possibility of using neural networks for synthesizing binaural audio from mono audio by using 2D visual information as guidance. Extending this approach by guiding the audio with 3D visual infor… ▽ More For immersive applications, the generation of binaural sound that matches its visual counterpart is crucial to bring meaningful experiences to people in a virtual environment. Recent studies have shown the possibility of using neural networks for synthesizing binaural audio from mono audio by using 2D visual information as guidance. Extending this approach by guiding the audio with 3D visual information and operating in the waveform domain may allow for a more accurate auralization of a virtual audio scene. We propose Points2Sound, a multi-modal deep learning model which generates a binaural version from mono audio using 3D point cloud scenes. Specifically, Points2Sound consists of a vision network and an audio network. The vision network uses 3D sparse convolutions to extract a visual feature from the point cloud scene. Then, the visual feature conditions the audio network, which operates in the waveform domain, to synthesize the binaural version. Results show that 3D visual information can successfully guide multi-modal deep learning models for the task of binaural synthesis. We also investigate how 3D point cloud attributes, learning objectives, different reverberant conditions, and several types of mono mixture signals affect the binaural audio synthesis performance of Points2Sound for the different numbers of sound sources present in the scene. △ Less

Submitted 19 May, 2023; v1 submitted 26 April, 2021; originally announced April 2021.

Comments: Code, data, and listening examples: https://github.com/francesclluis/points2sound

Journal ref: EURASIP Journal on Audio, Speech, and Music Processing 2022 (1), 1-15

arXiv:2102.02028 [pdf, other]

Music source separation conditioned on 3D point clouds

Authors: Francesc Lluís, Vasileios Chatziioannou, Alex Hofmann

Abstract: Recently, significant progress has been made in audio source separation by the application of deep learning techniques. Current methods that combine both audio and visual information use 2D representations such as images to guide the separation process. However, in order to (re)-create acoustically correct scenes for 3D virtual/augmented reality applications from recordings of real music ensembles… ▽ More Recently, significant progress has been made in audio source separation by the application of deep learning techniques. Current methods that combine both audio and visual information use 2D representations such as images to guide the separation process. However, in order to (re)-create acoustically correct scenes for 3D virtual/augmented reality applications from recordings of real music ensembles, detailed information about each sound source in the 3D environment is required. This demand, together with the proliferation of 3D visual acquisition systems like LiDAR or rgb-depth cameras, stimulates the creation of models that can guide the audio separation using 3D visual information. This paper proposes a multi-modal deep learning model to perform music source separation conditioned on 3D point clouds of music performance recordings. This model extracts visual features using 3D sparse convolutions, while audio features are extracted using dense convolutions. A fusion module combines the extracted features to finally perform the audio source separation. It is shown, that the presented model can distinguish the musical instruments from a single 3D point cloud frame, and perform source separation qualitatively similar to a reference case, where manually assigned instrument labels are provided. △ Less

Submitted 3 February, 2021; originally announced February 2021.

Showing 1–3 of 3 results for author: Chatziioannou, V