-
A Demand-Driven Perspective on Generative Audio AI
Authors:
Sangshin Oh,
Minsung Kang,
Hyeongi Moon,
Keunwoo Choi,
Ben Sangbae Chon
Abstract:
To achieve successful deployment of AI research, it is crucial to understand the demands of the industry. In this paper, we present the results of a survey conducted with professional audio engineers, in order to determine research priorities and define various research tasks. We also summarize the current challenges in audio quality and controllability based on the survey. Our analysis emphasizes that the availability of datasets is currently the main bottleneck for achieving high-quality audio generation. Finally, we suggest potential solutions for some revealed issues with empirical evidence.
Submitted 9 July, 2023;
originally announced July 2023.
-
FALL-E: A Foley Sound Synthesis Model and Strategies
Authors:
Minsung Kang,
Sangshin Oh,
Hyeongi Moon,
Kyungyun Lee,
Ben Sangbae Chon
Abstract:
This paper introduces FALL-E, a foley synthesis system and its training/inference strategies. The FALL-E model employs a cascaded approach comprising low-resolution spectrogram generation, spectrogram super-resolution, and a vocoder. We trained every sound-related model from scratch using our extensive datasets, and utilized a pre-trained language model. We conditioned the model with dataset-specific texts, enabling it to learn sound quality and recording environment based on text input. Moreover, we leveraged external language models to improve text descriptions of our datasets and performed prompt engineering for quality, coherence, and diversity. FALL-E was evaluated by an objective measure as well as listening tests in the DCASE 2023 challenge Task 7. The submission placed second on average, with the best score for diversity, second place for audio quality, and third place for class fitness.
Submitted 10 August, 2023; v1 submitted 16 June, 2023;
originally announced June 2023.
-
Room Impulse Response Estimation in a Multiple Source Environment
Authors:
Kyungyun Lee,
Jeonghun Seo,
Keunwoo Choi,
Sangmoon Lee,
Ben Sangbae Chon
Abstract:
In real-world acoustic scenarios, multiple sound sources are often present in a room. These sources are situated in various locations and produce sounds that reach the listener from multiple directions. The presence of multiple sources in a room creates new challenges in estimating the room impulse response (RIR), as each source has a unique RIR that depends on its location and orientation. Therefore, when the input signal is a mixture of multiple reverberated sources, the questions of which RIR should be predicted and how to predict it arise. To address these, we propose a new task of predicting a "representative" RIR for a room in a multiple source environment and present a training method to achieve this goal. In contrast to a model trained in a single source environment, our method shows robust performance regardless of the number of sources in the environment.
Submitted 25 May, 2023;
originally announced May 2023.
-
MedleyVox: An Evaluation Dataset for Multiple Singing Voices Separation
Authors:
Chang-Bin Jeon,
Hyeongi Moon,
Keunwoo Choi,
Ben Sangbae Chon,
Kyogu Lee
Abstract:
Separation of multiple singing voices into individual voices is a rarely studied area in music source separation research. The absence of a benchmark dataset has hindered its progress. In this paper, we present an evaluation dataset and provide baseline studies for multiple singing voices separation. First, we introduce MedleyVox, an evaluation dataset for multiple singing voices separation. We specify the problem definition in this dataset by categorizing it into i) unison, ii) duet, iii) main vs. rest, and iv) N-singing separation. Second, to overcome the absence of existing multi-singing datasets for training purposes, we present a strategy for constructing multiple singing mixtures using various single-singing datasets. Third, we propose the improved super-resolution network (iSRNet), which greatly enhances initial estimates of separation networks. Jointly trained with the Conv-TasNet and the multi-singing mixture construction strategy, the proposed iSRNet achieved comparable performance to ideal time-frequency masks on duet and unison subsets of MedleyVox. Audio samples, the dataset, and code are available on our website (https://github.com/jeonchangbin49/MedleyVox).
Submitted 4 May, 2023; v1 submitted 14 November, 2022;
originally announced November 2022.
-
GSEP: A robust vocal and accompaniment separation system using gated CBHG module and loudness normalization
Authors:
Soochul Park,
Ben Sangbae Chon
Abstract:
In audio signal processing research, source separation has long been a popular topic, and the recent adoption of deep neural networks has brought a significant improvement in performance. This improvement has encouraged the industry to productize audio deep-learning-based products and services, including karaoke in music streaming apps and dialogue enhancement in UHDTV. For these early markets, we defined a set of design principles for the vocal and accompaniment separation model in terms of robustness, quality, and cost. In this paper, we introduce GSEP (Gaudio source SEParation system), a robust vocal and accompaniment separation system using a Gated-CBHG module, mask warping, and loudness normalization. Experiments verified that the proposed system satisfies all three principles and outperforms state-of-the-art systems in both objective measures and subjective assessment.
Submitted 22 February, 2021; v1 submitted 22 October, 2020;
originally announced October 2020.