-
On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
Authors:
Jinchuan Tian,
Yifan Peng,
William Chen,
Kwanghee Choi,
Karen Livescu,
Shinji Watanabe
Abstract:
The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models. To this end, OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways. In this study, we advance the OWSM series by introducing OWSM v3.2, which improves on prior models by investigating and addressing the impacts of this data heterogeneity. Our study begins with a detailed analysis of each dataset, from which we derive two key strategies: data filtering with proxy task to enhance data quality, and the incorporation of punctuation and true-casing using an open large language model (LLM). With all other configurations staying the same, OWSM v3.2 improves performance over the OWSM v3.1 baseline while using 15% less training data.
Submitted 13 June, 2024;
originally announced June 2024.
-
Self-Supervised Speech Representations are More Phonetic than Semantic
Authors:
Kwanghee Choi,
Ankita Pasad,
Tomohiko Nakamura,
Satoru Fukayama,
Karen Livescu,
Shinji Watanabe
Abstract:
Self-supervised speech models (S3Ms) have become an effective backbone for speech applications. Various analyses suggest that S3Ms encode linguistic properties. In this work, we seek a more fine-grained analysis of the word-level linguistic properties encoded in S3Ms. Specifically, we curate a novel dataset of near homophone (phonetically similar) and synonym (semantically similar) word pairs and measure the similarities between S3M word representation pairs. Our study reveals that S3M representations consistently and significantly exhibit more phonetic than semantic similarity. Further, we question whether widely used intent classification datasets such as Fluent Speech Commands and Snips Smartlights are adequate for measuring semantic abilities. Our simple baseline, using only the word identity, surpasses S3M-based models. This corroborates our findings and suggests that high scores on these datasets do not necessarily guarantee the presence of semantic content.
Submitted 12 June, 2024;
originally announced June 2024.
-
Wireless Information and Energy Transfer in the Era of 6G Communications
Authors:
Constantinos Psomas,
Konstantinos Ntougias,
Nikita Shanin,
Dongfang Xu,
Kenneth MacSporran Mayer,
Nguyen Minh Tran,
Laura Cottatellucci,
Kae Won Choi,
Dong In Kim,
Robert Schober,
Ioannis Krikidis
Abstract:
Wireless information and energy transfer (WIET) represents an emerging paradigm which employs controllable transmission of radio-frequency signals for the dual purpose of data communication and wireless charging. As such, WIET is widely regarded as an enabler of envisioned 6G use cases that rely on energy-sustainable Internet-of-Things (IoT) networks, such as smart cities and smart grids. Meeting the quality-of-service demands of WIET, in terms of both data transfer and power delivery, requires effective co-design of the information and energy signals. In this article, we present the main principles and design aspects of WIET, focusing on its integration in 6G networks. First, we discuss how conventional communication notions such as resource allocation and waveform design need to be revisited in the context of WIET. Next, we consider various candidate 6G technologies that can boost WIET efficiency, namely, holographic multiple-input multiple-output, near-field beamforming, terahertz communication, intelligent reflecting surfaces (IRSs), and reconfigurable (fluid) antenna arrays. We introduce respective WIET design methods, analyze the promising performance gains of these WIET systems, and discuss challenges, open issues, and future research directions. Finally, a near-field energy beamforming scheme and a power-based IRS beamforming algorithm are experimentally validated using a wireless energy transfer testbed. The vision of WIET in communication systems has been gaining momentum in recent years, with constant progress with respect to theoretical but also practical aspects. The comprehensive overview of the state of the art of WIET presented in this paper highlights the potentials of WIET systems as well as their overall benefits in 6G networks.
Submitted 16 May, 2024; v1 submitted 29 April, 2024;
originally announced April 2024.
-
A Comparative Analysis of Poetry Reading Audio: Singing, Narrating, or Somewhere In Between?
Authors:
Kahyun Choi,
Minje Kim
Abstract:
This paper provides a computational analysis of poetry reading audio signals at a large scale to unveil the musicality within professionally-read poems. Although the acoustic characteristics of other types of spoken language have been extensively studied, most of the literature is limited to narrative speech or singing voice, discussing how different they are from each other. In this work, we develop signal processing methods, which are tailored to capture the unique acoustic characteristics of poetry reading based on their silence patterns, temporal variations of local pitch, and beat stability. Our large-scale statistical analyses on three big corpora, each of which consists of narration (LibriSpeech), singing voice (Intonation), and poetry reading (from The Poetry Foundation), discover that poetry reading does share some musical characteristics with singing voice, although it may also resemble narrative speech.
Submitted 31 March, 2024;
originally announced April 2024.
-
Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependant
Authors:
Modan Tailleur,
Junwon Lee,
Mathieu Lagrange,
Keunwoo Choi,
Laurie M. Heller,
Keisuke Imoto,
Yuki Okamoto
Abstract:
This paper explores whether considering alternative domain-specific embeddings to calculate the Fréchet Audio Distance (FAD) metric can help the FAD to correlate better with perceptual ratings of environmental sounds. We used embeddings from VGGish, PANNs, MS-CLAP, L-CLAP, and MERT, which are tailored for either music or environmental sound evaluation. The FAD scores were calculated for sounds from the DCASE 2023 Task 7 dataset. Using perceptual data from the same task, we find that PANNs-WGM-LogMel produces the best correlation between FAD scores and perceptual ratings of both audio quality and perceived fit with a Spearman correlation higher than 0.5. We also find that music-specific embeddings resulted in significantly lower results. Interestingly, VGGish, the embedding used for the original Fréchet calculation, yielded a correlation below 0.1. These results underscore the critical importance of the choice of embedding for the FAD metric design.
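For reference, the Fréchet distance underlying FAD is usually computed between Gaussians fitted to the embeddings of a reference set and of an evaluated set. The standard closed form is sketched below; the symbols $\mu_r, \Sigma_r$ (reference mean and covariance) and $\mu_e, \Sigma_e$ (evaluated-set mean and covariance) are notation assumed here, not taken from the abstract.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_e \rVert_2^2
  + \operatorname{tr}\!\left( \Sigma_r + \Sigma_e - 2 \left( \Sigma_r \Sigma_e \right)^{1/2} \right)
```

Because the means and covariances are estimated in the embedding space, swapping the embedding model changes every term of the score, which is consistent with the embedding dependence reported in the abstract.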
Submitted 26 March, 2024;
originally announced March 2024.
-
CNN-based End-to-End Adaptive Controller with Stability Guarantees
Authors:
Myeongseok Ryu,
Kyunghwan Choi
Abstract:
This letter proposes a convolutional neural network (CNN)-based adaptive controller with three notable features: 1) it determines control input directly from historical sensor data (in an end-to-end process); 2) it learns the desired control policy during real-time implementation without using a pretrained network (in an online adaptive manner); and 3) the asymptotic tracking error convergence is proven during the learning process (to deliver a stability guarantee). An adaptive law for learning the desired control policy is derived using the gradient descent optimization method, and its stability is analyzed based on the Lyapunov approach. A simulation study using a control-affine nonlinear system demonstrated that the proposed controller exhibits these features, and its performance can be tuned by manipulating the design parameters. In addition, it is shown that the proposed controller has a superior tracking performance to that of a deep neural network (DNN)-based adaptive controller.
Submitted 6 March, 2024;
originally announced March 2024.
-
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
Authors:
Yifan Peng,
Jinchuan Tian,
William Chen,
Siddhant Arora,
Brian Yan,
Yui Sudo,
Muhammad Shakeel,
Kwanghee Choi,
Jiatong Shi,
Xuankai Chang,
Jee-weon Jung,
Shinji Watanabe
Abstract:
Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder architectures. This work aims to improve the performance and efficiency of OWSM without additional data. We present a series of E-Branchformer-based models named OWSM v3.1, ranging from 100M to 1B parameters. OWSM v3.1 outperforms its predecessor, OWSM v3, in most evaluation benchmarks, while showing an improved inference speed of up to 25%. We further reveal the emergent ability of OWSM v3.1 in zero-shot contextual biasing speech recognition. We also provide a model trained on a subset of data with low license restrictions. We will publicly release the code, pre-trained models, and training logs.
Submitted 16 June, 2024; v1 submitted 29 January, 2024;
originally announced January 2024.
-
Understanding Probe Behaviors through Variational Bounds of Mutual Information
Authors:
Kwanghee Choi,
Jee-weon Jung,
Shinji Watanabe
Abstract:
With the success of self-supervised representations, researchers seek a better understanding of the information encapsulated within a representation. Among various interpretability methods, we focus on classification-based linear probing. We aim to foster a solid understanding and provide guidelines for linear probing by constructing a novel mathematical framework leveraging information theory. First, we connect probing with the variational bounds of mutual information (MI) to relax the probe design, equating linear probing with fine-tuning. Then, we investigate empirical behaviors and practices of probing through our mathematical framework. We analyze the layer-wise performance curve being convex, which seemingly violates the data processing inequality. However, we show that the intermediate representations can have the biggest MI estimate because of the tradeoff between better separability and decreasing MI. We further suggest that the margin of linearly separable representations can be a criterion for measuring the "goodness of representation." We also compare accuracy with MI as the measuring criteria. Finally, we empirically validate our claims by observing the self-supervised speech models on retaining word and phoneme information.
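As background for the link between probing and variational bounds, one standard lower bound on MI (often attributed to Barber and Agakov) is sketched below. The notation is assumed here: $X$ is the representation, $Y$ the label, and $q(y \mid x)$ the probe's predictive distribution. The paper may rely on a different bound or notation.

```latex
I(X; Y) = H(Y) - H(Y \mid X)
  \;\geq\; H(Y) + \mathbb{E}_{p(x, y)}\!\left[ \log q(y \mid x) \right]
```

Under this reading, the expectation term is the negative cross-entropy loss of the probe, so a lower probe loss gives a tighter lower bound on the MI between representation and label.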
Submitted 15 December, 2023;
originally announced December 2023.
-
Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study
Authors:
Xuankai Chang,
Brian Yan,
Kwanghee Choi,
Jeeweon Jung,
Yichen Lu,
Soumi Maiti,
Roshan Sharma,
Jiatong Shi,
Jinchuan Tian,
Shinji Watanabe,
Yuya Fujita,
Takashi Maekaku,
Pengcheng Guo,
Yao-Fei Cheng,
Pavel Denisov,
Kohei Saijo,
Hsiu-Hsuan Wang
Abstract:
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations, which significantly compresses the size of speech data. Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length. Hence, training time is significantly reduced while retaining notable performance. In this study, we undertake a comprehensive and systematic exploration into the application of discrete units within end-to-end speech processing models. Experiments on 12 automatic speech recognition, 3 speech translation, and 1 spoken language understanding corpora demonstrate that discrete units achieve reasonably good results in almost all the settings. We intend to release our configurations and trained models to foster future research efforts.
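The de-duplication step mentioned above can be illustrated with a minimal sketch, assuming it means collapsing runs of consecutive identical discrete units into one unit. The function name `deduplicate` and the sample unit IDs are hypothetical and not taken from the paper.

```python
from itertools import groupby


def deduplicate(units):
    """Collapse each run of consecutive identical units into a single unit."""
    return [unit for unit, _ in groupby(units)]


# Hypothetical frame-level unit IDs (e.g. cluster indices), one per frame.
frame_units = [5, 5, 5, 2, 2, 7, 5, 5]
compressed = deduplicate(frame_units)
print(compressed)  # [5, 2, 7, 5]
```

Only adjacent repeats are merged, so a unit that recurs later in the sequence (the trailing 5 here) is preserved. Subword modeling, also mentioned in the abstract, would then be applied on top of the shortened sequence and is not shown.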
Submitted 27 September, 2023;
originally announced September 2023.
-
A novel approach for holographic 3D content generation without depth map
Authors:
Hakdong Kim,
Minkyu Jee,
Yurim Lee,
Kyudam Choi,
MinSung Yoon,
Cheongwon Kim
Abstract:
In preparation for observing holographic 3D content, acquiring a set of RGB color and depth map images per scene is necessary to generate computer-generated holograms (CGHs) when using the fast Fourier transform (FFT) algorithm. However, in real-world situations, these paired formats of RGB color and depth map images are not always fully available. We propose a deep learning-based method to synthesize the volumetric digital holograms using only the given RGB image, so that we can overcome environments where RGB color and depth map images are partially provided. The proposed method uses only the input of RGB image to estimate its depth map and then generate its CGH sequentially. Through experiments, we demonstrate that the volumetric hologram generated through our proposed model is more accurate than that of competitive models, under the situation that only RGB color data can be provided.
Submitted 26 September, 2023;
originally announced September 2023.
-
Self-Calibrating, Fully Differentiable NLOS Inverse Rendering
Authors:
Kiseok Choi,
Inchul Kim,
Dongyoung Choi,
Julio Marco,
Diego Gutierrez,
Min H. Kim
Abstract:
Existing time-resolved non-line-of-sight (NLOS) imaging methods reconstruct hidden scenes by inverting the optical paths of indirect illumination measured at visible relay surfaces. These methods are prone to reconstruction artifacts due to inversion ambiguities and capture noise, which are typically mitigated through the manual selection of filtering functions and parameters. We introduce a fully-differentiable end-to-end NLOS inverse rendering pipeline that self-calibrates the imaging parameters during the reconstruction of hidden scenes, using as input only the measured illumination while working both in the time and frequency domains. Our pipeline extracts a geometric representation of the hidden scene from NLOS volumetric intensities and estimates the time-resolved illumination at the relay wall produced by such geometric information using differentiable transient rendering. We then use gradient descent to optimize imaging parameters by minimizing the error between our simulated time-resolved illumination and the measured illumination. Our end-to-end differentiable pipeline couples diffraction-based volumetric NLOS reconstruction with path-space light transport and a simple ray marching technique to extract detailed, dense sets of surface points and normals of hidden scenes. We demonstrate the robustness of our method to consistently reconstruct geometry and albedo, even under significant noise levels.
Submitted 25 September, 2023; v1 submitted 21 September, 2023;
originally announced September 2023.
-
The Biased Journey of MSD_AUDIO.ZIP
Authors:
Haven Kim,
Keunwoo Choi,
Mateusz Modrzejewski,
Cynthia C. S. Liem
Abstract:
The equitable distribution of academic data is crucial for ensuring equal research opportunities, and ultimately further progress. Yet, due to the complexity of using the API for audio data that corresponds to the Million Song Dataset along with its misreporting (before 2016) and the discontinuation of this API (after 2016), access to this data has become restricted to those within certain affiliations that are connected peer-to-peer. In this paper, we delve into this issue, drawing insights from the experiences of 22 individuals who either attempted to access the data or played a role in its creation. With this, we hope to initiate more critical dialogue and more thoughtful consideration with regard to access privilege in the MIR community.
Submitted 1 December, 2023; v1 submitted 30 August, 2023;
originally announced August 2023.
-
LP-MusicCaps: LLM-Based Pseudo Music Captioning
Authors:
SeungHeon Doh,
Keunwoo Choi,
Jongpil Lee,
Juhan Nam
Abstract:
Automatic music captioning, which generates natural language descriptions for given music tracks, holds significant potential for enhancing the understanding and organization of large volumes of musical data. Despite its importance, researchers face challenges due to the costly and time-consuming collection process of existing music-language datasets, which are limited in size. To address this data scarcity issue, we propose the use of large language models (LLMs) to artificially generate the description sentences from large-scale tag datasets. This results in approximately 2.2M captions paired with 0.5M audio clips. We term it Large Language Model based Pseudo music caption dataset, shortly, LP-MusicCaps. We conduct a systemic evaluation of the large-scale music captioning dataset with various quantitative evaluation metrics used in the field of natural language processing as well as human evaluation. In addition, we trained a transformer-based music captioning model with the dataset and evaluated it under zero-shot and transfer-learning settings. The results demonstrate that our proposed approach outperforms the supervised baseline model.
Submitted 30 July, 2023;
originally announced July 2023.
-
HCLAS-X: Hierarchical and Cascaded Lyrics Alignment System Using Multimodal Cross-Correlation
Authors:
Minsung Kang,
Soochul Park,
Keunwoo Choi
Abstract:
In this work, we address the challenge of lyrics alignment, which involves aligning the lyrics and vocal components of songs. This problem requires the alignment of two distinct modalities, namely text and audio. To overcome this challenge, we propose a model that is trained in a supervised manner, utilizing the cross-correlation matrix of latent representations between vocals and lyrics. Our system is designed in a hierarchical and cascaded manner. It predicts synced time first on a sentence-level and subsequently on a word-level. This design enables the system to process long sequences, as the cross-correlation uses quadratic memory with respect to sequence length. In our experiments, we demonstrate that our proposed system achieves a significant improvement in mean average error, showcasing its robustness in comparison to the previous state-of-the-art model. Additionally, we conduct a qualitative analysis of the system after successfully deploying it in several music streaming services.
Submitted 10 July, 2023;
originally announced July 2023.
-
A Demand-Driven Perspective on Generative Audio AI
Authors:
Sangshin Oh,
Minsung Kang,
Hyeongi Moon,
Keunwoo Choi,
Ben Sangbae Chon
Abstract:
To achieve successful deployment of AI research, it is crucial to understand the demands of the industry. In this paper, we present the results of a survey conducted with professional audio engineers, in order to determine research priorities and define various research tasks. We also summarize the current challenges in audio quality and controllability based on the survey. Our analysis emphasizes that the availability of datasets is currently the main bottleneck for achieving high-quality audio generation. Finally, we suggest potential solutions for some revealed issues with empirical evidence.
Submitted 9 July, 2023;
originally announced July 2023.
-
Speech Intelligibility Assessment of Dysarthric Speech by using Goodness of Pronunciation with Uncertainty Quantification
Authors:
Eun Jung Yeo,
Kwanghee Choi,
Sunhee Kim,
Minhwa Chung
Abstract:
This paper proposes an improved Goodness of Pronunciation (GoP) that utilizes Uncertainty Quantification (UQ) for automatic speech intelligibility assessment for dysarthric speech. Current GoP methods rely heavily on neural network-driven overconfident predictions, which is unsuitable for assessing dysarthric speech due to its significant acoustic differences from healthy speech. To alleviate the problem, UQ techniques were used on GoP by 1) normalizing the phoneme prediction (entropy, margin, maxlogit, logit-margin) and 2) modifying the scoring function (scaling, prior normalization). As a result, prior-normalized maxlogit GoP achieves the best performance, with a relative increase of 5.66%, 3.91%, and 23.65% compared to the baseline GoP for English, Korean, and Tamil, respectively. Furthermore, phoneme analysis is conducted to identify which phoneme scores significantly correlate with intelligibility scores in each language.
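The four prediction-level scores named above (entropy, margin, maxlogit, logit-margin) can be sketched with their generic uncertainty-quantification definitions, computed from the logits of a single frame. This is an illustration under assumptions: the function names and sample logits are hypothetical, and the abstract does not specify how the paper folds these scores into GoP (target phoneme selection, scaling, prior normalization), so none of that is reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uncertainty_scores(logits):
    """Generic uncertainty scores for one frame (needs at least two classes)."""
    probs = softmax(logits)
    top_probs = sorted(probs, reverse=True)
    top_logits = sorted(logits, reverse=True)
    return {
        # Shannon entropy of the predictive distribution (in nats).
        "entropy": -sum(p * math.log(p) for p in probs if p > 0.0),
        # Gap between the two most probable classes.
        "margin": top_probs[0] - top_probs[1],
        # Largest raw logit, before softmax normalization.
        "maxlogit": top_logits[0],
        # Gap between the two largest raw logits.
        "logit_margin": top_logits[0] - top_logits[1],
    }


# Hypothetical logits over three phoneme classes for one frame.
frame_logits = [2.0, 0.5, -1.0]
scores = uncertainty_scores(frame_logits)
```

The logit-based scores skip the softmax, which is one common way to avoid the saturation of probabilities that overconfident networks produce.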
Submitted 28 May, 2023;
originally announced May 2023.
-
Room Impulse Response Estimation in a Multiple Source Environment
Authors:
Kyungyun Lee,
Jeonghun Seo,
Keunwoo Choi,
Sangmoon Lee,
Ben Sangbae Chon
Abstract:
In real-world acoustic scenarios, there often are multiple sound sources present in a room. These sources are situated in various locations and produce sounds that reach the listener from multiple directions. The presence of multiple sources in a room creates new challenges in estimating the room impulse response (RIR) as each source has a unique RIR, dependent on its location and orientation. Therefore, issues of determining which RIR should be predicted and how to predict it arise, when the input signal is a mixture of multiple reverberated sources. To address these, we propose a new task of predicting a "representative" RIR for a room in a multiple source environment and present a training method to achieve this goal. In contrast to the model trained in a single source environment, our method shows robust performance, regardless of the number of sources in the environment.
Submitted 25 May, 2023;
originally announced May 2023.
-
Foley Sound Synthesis at the DCASE 2023 Challenge
Authors:
Keunwoo Choi,
Jaekwon Im,
Laurie Heller,
Brian McFee,
Keisuke Imoto,
Yuki Okamoto,
Mathieu Lagrange,
Shinosuke Takamichi
Abstract:
The addition of Foley sound effects during post-production is a common technique used to enhance the perceived acoustic properties of multimedia content. Traditionally, Foley sound has been produced by human Foley artists, which involves manual recording and mixing of sound. However, recent advances in sound synthesis and generative models have generated interest in machine-assisted or automatic Foley synthesis techniques. To promote further research in this area, we have organized a challenge in DCASE 2023: Task 7 - Foley Sound Synthesis. Our challenge aims to provide a standardized evaluation framework that is both rigorous and efficient, allowing for the evaluation of different Foley synthesis systems. We received 17 submissions, and performed both objective and subjective evaluation to rank them according to three criteria: audio quality, fit-to-category, and diversity. Through this challenge, we hope to encourage active participation from the research community and advance the state-of-the-art in automatic Foley synthesis. In this technical report, we provide a detailed overview of the Foley sound synthesis challenge, including task definition, dataset, baseline, evaluation scheme and criteria, challenge result, and discussion.
Submitted 28 September, 2023; v1 submitted 24 April, 2023;
originally announced April 2023.
-
Unsupervised Speech Representation Pooling Using Vector Quantization
Authors:
Jeongkyun Park,
Kwanghee Choi,
Hyunjun Heo,
Hyung-Min Park
Abstract:
With the advent of general-purpose speech representations from large-scale self-supervised models, applying a single model to multiple downstream tasks is becoming a de-facto approach. However, the pooling problem remains; the length of speech representations is inherently variable. The naive average pooling is often used, even though it ignores the characteristics of speech, such as differently lengthed phonemes. Hence, we design a novel pooling method to squash acoustically similar representations via vector quantization, which does not require additional training, unlike attention-based pooling. Further, we evaluate various unsupervised pooling methods on various self-supervised models. We gather diverse methods scattered around speech and text to evaluate on various tasks: keyword spotting, speaker identification, intent classification, and emotion recognition. Finally, we quantitatively and qualitatively analyze our method, comparing it with supervised pooling methods.
Submitted 8 April, 2023;
originally announced April 2023.
-
Textless Speech-to-Music Retrieval Using Emotion Similarity
Authors:
SeungHeon Doh,
Minz Won,
Keunwoo Choi,
Juhan Nam
Abstract:
We introduce a framework that recommends music based on the emotions of speech. In content creation and daily life, speech contains information about human emotions, which can be enhanced by music. Our framework focuses on a cross-domain retrieval system to bridge the gap between speech and music via emotion labels. We explore different speech representations and report their impact on different speech types, including acting voice and wake-up words. We also propose an emotion similarity regularization term in cross-domain retrieval tasks. By incorporating the regularization term into training, similar speech-and-music pairs in the emotion space are closer in the joint embedding space. Our comprehensive experimental results show that the proposed model is effective in textless speech-to-music retrieval.
Submitted 18 March, 2023;
originally announced March 2023.
-
Context-Based Trit-Plane Coding for Progressive Image Compression
Authors:
Seungmin Jeon,
Kwang Pyo Choi,
Youngo Park,
Chang-Su Kim
Abstract:
Trit-plane coding enables deep progressive image compression, but it cannot use autoregressive context models. In this paper, we propose the context-based trit-plane coding (CTC) algorithm to achieve progressive compression more compactly. First, we develop the context-based rate reduction module to estimate trit probabilities of latent elements accurately and thus encode the trit-planes compactly. Second, we develop the context-based distortion reduction module to refine partial latent tensors from the trit-planes and improve the reconstructed image quality. Third, we propose a retraining scheme for the decoder to attain better rate-distortion tradeoffs. Extensive experiments show that CTC outperforms the baseline trit-plane codec significantly in BD-rate on the Kodak lossless dataset, while increasing the time complexity only marginally. Our codes are available at https://github.com/seungminjeon-github/CTC.
Submitted 13 March, 2023; v1 submitted 10 March, 2023;
originally announced March 2023.
-
Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training
Authors:
Kin Wai Cheuk,
Keunwoo Choi,
Qiuqiang Kong,
Bochen Li,
Minz Won,
Ju-Chiang Wang,
Yun-Ning Hung,
Dorien Herremans
Abstract:
In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. The joint training of the transcription and source separation modules serves to improve the performance of both tasks. The instrument module is optional and can be directly controlled by human users. This makes Jointist a flexible user-controllable framework. Our challenging problem formulation makes the model highly useful in the real world given that modern popular music typically consists of multiple instruments. Its novelty, however, necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the proposed model from various aspects, providing a new evaluation perspective for multi-instrument transcription. Our subjective listening study shows that Jointist achieves state-of-the-art performance on popular music, outperforming existing multi-instrument transcription models such as MT3. We conducted experiments on several downstream tasks and found that the proposed method improved transcription by more than 1 percentage point (ppt.), source separation by 5 SDR, downbeat detection by 1.8 ppt., chord recognition by 1.4 ppt., and key estimation by 1.4 ppt., when utilizing transcription results obtained from Jointist.
Demo available at https://jointist.github.io/Demo.
Submitted 1 February, 2023; v1 submitted 1 February, 2023;
originally announced February 2023.
-
Toward Universal Text-to-Music Retrieval
Authors:
SeungHeon Doh,
Minz Won,
Keunwoo Choi,
Juhan Nam
Abstract:
This paper introduces effective design choices for text-to-music retrieval systems. An ideal text-based retrieval system would support various input queries such as pre-defined tags, unseen tags, and sentence-level descriptions. In reality, most previous works mainly focused on a single query type (tag or sentence) which may not generalize to another input type. Hence, we review recent text-based music retrieval systems using our proposed benchmark in two main aspects: input text representation and training objectives. Our findings enable a universal text-to-music retrieval system that achieves comparable retrieval performances in both tag- and sentence-level inputs. Furthermore, the proposed multimodal representation generalizes to 9 different downstream music classification tasks. We present the code and demo online.
Submitted 26 November, 2022;
originally announced November 2022.
-
MedleyVox: An Evaluation Dataset for Multiple Singing Voices Separation
Authors:
Chang-Bin Jeon,
Hyeongi Moon,
Keunwoo Choi,
Ben Sangbae Chon,
Kyogu Lee
Abstract:
Separation of multiple singing voices into individual voices is a rarely studied area in music source separation research. The absence of a benchmark dataset has hindered its progress. In this paper, we present an evaluation dataset and provide baseline studies for multiple singing voices separation. First, we introduce MedleyVox, an evaluation dataset for multiple singing voices separation. We specify the problem definition in this dataset by categorizing it into i) unison, ii) duet, iii) main vs. rest, and iv) N-singing separation. Second, to overcome the absence of existing multi-singing datasets for training purposes, we present a strategy for constructing multiple singing mixtures using various single-singing datasets. Third, we propose the improved super-resolution network (iSRNet), which greatly enhances initial estimates of separation networks. Jointly trained with the Conv-TasNet and the multi-singing mixture construction strategy, the proposed iSRNet achieved comparable performance to ideal time-frequency masks on duet and unison subsets of MedleyVox. Audio samples, the dataset, and codes are available on our website (https://github.com/jeonchangbin49/MedleyVox).
Submitted 4 May, 2023; v1 submitted 14 November, 2022;
originally announced November 2022.
-
Automatic Severity Classification of Dysarthric speech by using Self-supervised Model with Multi-task Learning
Authors:
Eun Jung Yeo,
Kwanghee Choi,
Sunhee Kim,
Minhwa Chung
Abstract:
Automatic assessment of dysarthric speech is essential for sustained treatments and rehabilitation. However, obtaining atypical speech is challenging, often leading to data scarcity issues. To tackle the problem, we propose a novel automatic severity assessment method for dysarthric speech, using the self-supervised model in conjunction with multi-task learning. Wav2vec 2.0 XLS-R is jointly trained for two different tasks: severity classification and auxiliary automatic speech recognition (ASR). For the baseline experiments, we employ hand-crafted acoustic features and machine learning classifiers such as SVM, MLP, and XGBoost. Evaluated on the Korean dysarthric speech QoLT database, our model outperforms the traditional baseline methods, with a relative increase of 1.25% in F1-score. In addition, the proposed model surpasses the model trained without the ASR head, achieving a 10.61% relative improvement. Furthermore, we present how multi-task learning affects the severity classification performance by analyzing the latent representations and regularization effect.
Submitted 28 April, 2023; v1 submitted 27 October, 2022;
originally announced October 2022.
-
Opening the Black Box of wav2vec Feature Encoder
Authors:
Kwanghee Choi,
Eun Jung Yeo
Abstract:
Self-supervised models, namely, wav2vec and its variants, have shown promising results in various downstream tasks in the speech domain. However, their inner workings are poorly understood, calling for in-depth analyses of what the model learns. In this paper, we concentrate on the convolutional feature encoder, whose latent space is often speculated to represent discrete acoustic units. To analyze the embedding space in a reductive manner, we feed the model synthesized audio signals, each of which is a sum of simple sine waves. Through extensive experiments, we conclude that various information is embedded inside the feature encoder representations: (1) fundamental frequency, (2) formants, and (3) amplitude, packed with (4) sufficient temporal detail. Further, the information incorporated inside the latent representations is analogous to spectrograms but with a fundamental difference: latent representations construct a metric space so that closer representations imply acoustic similarity.
Submitted 27 October, 2022;
originally announced October 2022.
-
Cross-lingual Dysarthria Severity Classification for English, Korean, and Tamil
Authors:
Eun Jung Yeo,
Kwanghee Choi,
Sunhee Kim,
Minhwa Chung
Abstract:
This paper proposes a cross-lingual classification method for English, Korean, and Tamil, which employs both language-independent features and language-unique features. First, we extract thirty-nine features from diverse speech dimensions such as voice quality, pronunciation, and prosody. Second, feature selection is applied to identify the optimal feature set for each language. A set of shared features and a set of distinctive features are distinguished by comparing the feature selection results of the three languages. Lastly, automatic severity classification is performed, utilizing the two feature sets. Notably, the proposed method removes different features for each language to prevent the negative effect of unique features on other languages. Accordingly, the eXtreme Gradient Boosting (XGBoost) algorithm is employed for classification, due to its strength in imputing missing data. In order to validate the effectiveness of our proposed method, two baseline experiments are conducted: experiments using the intersection set of mono-lingual feature sets (Intersection) and experiments using the union set of mono-lingual feature sets (Union). According to the experimental results, our method achieves better performance with a 67.14% F1 score, compared to 64.52% for the Intersection experiment and 66.74% for the Union experiment. Further, the proposed method attains better performance than mono-lingual classification for all three languages, achieving 17.67%, 2.28%, and 7.79% relative percentage increases for English, Korean, and Tamil, respectively. The results indicate that commonly shared features and language-specific features must be considered separately for cross-language dysarthria severity classification.
Submitted 26 September, 2022;
originally announced September 2022.
-
Frequency Reversal Alamouti Code-Based FBMC with Resilience to Inter-Antenna Frequency Offsets
Authors:
Cheng-Yu Lin,
Borching Su,
Kwonhue Choi
Abstract:
Transmit diversity schemes for filter bank multicarrier (FBMC) are known to be challenging. No existing schemes have considered the presence of inter-antenna frequency offset (IAFO), which will result in performance degradation. In this letter, a new transmit scheme built on a frequency reversal Alamouti code (FRAC) structure is proposed to address the issue of IAFO and is proven to inherently cancel the inter-antenna inter-carrier interference (ICI) while preserving spatial diversity. Moreover, the proposed FRAC structure is applicable in frequency-selective channels. Numerical results show that the proposed scheme undergoes negligible bit error rate (BER) degradation even with considerable IAFOs.
Submitted 14 September, 2022;
originally announced September 2022.
-
Foundations of Wireless Information and Power Transfer: Theory, Prototypes, and Experiments
Authors:
Bruno Clerckx,
Junghoon Kim,
Kae Won Choi,
Dong In Kim
Abstract:
As wireless has disrupted communications, wireless will also disrupt the delivery of energy. Future wireless networks will be equipped with (radiative) wireless power transfer (WPT) capability and exploit radio waves to carry both energy and information through a unified wireless information and power transfer (WIPT). Such networks will make the best use of the RF spectrum and radiation as well as the network infrastructure for the dual purpose of communicating and energizing. Consequently, those networks will enable trillions of future low-power devices to sense, compute, connect, and energize anywhere, anytime, and on the move. In this paper, we review the foundations of such future systems. We first give an overview of the fundamental theoretical building blocks of WPT and WIPT. Then we discuss some state-of-the-art experimental setups and prototypes of both WPT and WIPT and contrast theoretical and experimental results. We draw special attention to how the integration of RF, signal, and system designs in WPT and WIPT leads to new theoretical and experimental design challenges for both microwave and communication engineers and highlight some promising solutions. Topics and experimental testbeds discussed include closed-loop WPT and WIPT architectures with beamforming, waveform, channel acquisition, and single/multi-antenna energy harvester, centralized and distributed WPT, reconfigurable metasurfaces and intelligent surfaces for WPT, transmitter and receiver architecture for WIPT, modulation, and rate-energy trade-off. Moreover, we highlight important theoretical and experimental research directions to be addressed for WPT and WIPT to become a foundational technology of future wireless networks.
Submitted 8 September, 2022;
originally announced September 2022.
-
A Proposal for Foley Sound Synthesis Challenge
Authors:
Keunwoo Choi,
Sangshin Oh,
Minsung Kang,
Brian McFee
Abstract:
"Foley" refers to sound effects that are added to multimedia during post-production to enhance its perceived acoustic properties, e.g., by simulating the sounds of footsteps, ambient environmental sounds, or visible objects on the screen. While foley is traditionally produced by foley artists, there is increasing interest in automatic or machine-assisted techniques building upon recent advances in sound synthesis and generative models. To foster more participation in this growing research area, we propose a challenge for automatic foley synthesis. Through case studies on successful previous challenges in audio and machine learning, we set the goals of the proposed challenge: rigorous, unified, and efficient evaluation of different foley synthesis systems, with an overarching goal of drawing active participation from the research community. We outline the details and design considerations of a foley sound synthesis challenge, including task definition, dataset requirements, and evaluation criteria.
Submitted 21 July, 2022;
originally announced July 2022.
-
Distilling a Pretrained Language Model to a Multilingual ASR Model
Authors:
Kwanghee Choi,
Hyung-Min Park
Abstract:
Multilingual speech data often suffer from long-tailed language distribution, resulting in performance degradation. However, multilingual text data is much easier to obtain, yielding a more useful general language model. Hence, we are motivated to distill the rich knowledge embedded inside a well-trained teacher text model to the student speech model. We propose a novel method called Distilling a Language model to a Speech model (Distill-L2S), which aligns the latent representations of two different modalities. The subtle differences are handled by the shrinking mechanism, nearest-neighbor interpolation, and a learnable linear projection layer. We demonstrate the effectiveness of our distillation method by applying it to the multilingual automatic speech recognition (ASR) task. We distill the transformer-based cross-lingual language model (InfoXLM) while fine-tuning the large-scale multilingual ASR model (XLSR-wav2vec 2.0) for each language. We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.
Submitted 25 June, 2022;
originally announced June 2022.
-
Jointist: Joint Learning for Multi-instrument Transcription and Its Applications
Authors:
Kin Wai Cheuk,
Keunwoo Choi,
Qiuqiang Kong,
Bochen Li,
Minz Won,
Amy Hung,
Ju-Chiang Wang,
Dorien Herremans
Abstract:
In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of the instrument recognition module that conditions the other modules: the transcription module that outputs instrument-specific piano rolls, and the source separation module that utilizes instrument information and transcription results.
The instrument conditioning is designed for an explicit multi-instrument functionality while the connection between the transcription and source separation modules is for better transcription performance. Our challenging problem formulation makes the model highly useful in the real world given that modern popular music typically consists of multiple instruments. However, its novelty necessitates a new perspective on how to evaluate such a model. During the experiment, we assess the model from various aspects, providing a new evaluation perspective for multi-instrument transcription. We also argue that transcription models can be utilized as a preprocessing module for other music analysis tasks. In the experiment on several downstream tasks, the symbolic representation provided by our transcription model turned out to be helpful to spectrograms in solving downbeat detection, chord recognition, and key estimation.
Submitted 28 June, 2022; v1 submitted 21 June, 2022;
originally announced June 2022.
-
Lightweight Image Enhancement Network for Mobile Devices Using Self-Feature Extraction and Dense Modulation
Authors:
Sangwook Baek,
Yongsup Park,
Youngo Park,
Jungmin Lee,
Kwangpyo Choi
Abstract:
Convolutional neural network (CNN) based image enhancement methods such as super-resolution and detail enhancement have achieved remarkable performance. However, the large number of operations, including convolutions, and of parameters within these networks demands high computing power and large memory resources, which limits applications with on-device requirements. A lightweight image enhancement network should restore details, texture, and structural information from low-resolution input images while keeping their fidelity. To address these issues, a lightweight image enhancement network is proposed. The proposed network includes a self-feature extraction module, which produces modulation parameters from the low-quality image itself and provides them to modulate the features in the network. Also, a dense modulation block is proposed as the unit block of the proposed network, which uses dense connections of concatenated features applied in modulation layers. Experimental results demonstrate better performance over existing approaches in terms of both quantitative and qualitative evaluations.
Submitted 2 May, 2022;
originally announced May 2022.
-
Mitigating Mismatch Compression in Differential Local Field Potentials
Authors:
Vineet Tiruvadi,
Sam James,
Bryan Howell,
Mosadoluwa Obatusin,
Andrea Crowell,
Patricio Riva-Posse,
Ki Sueng Choi,
Allison Waters,
Robert E. Gross,
Cameron C. McIntyre,
Helen S. Mayberg,
Robert Butera
Abstract:
Bidirectional deep brain stimulation (bdDBS) devices capable of recording differential local field potentials (dLFP) enable neural recordings alongside clinical therapy. Efforts to identify objective signals of various brain disorders, or disease readouts, are challenging in dLFP, especially during active DBS. In this report, we identified, characterized, and mitigated a major source of distortion in dLFP that we introduce as mismatch compression (MC). MC occurs secondary to impedance mismatches across the dLFP channel, resulting in incomplete rejection of artifacts and downstream amplifier gain compression. Using in silico and in vitro models, we demonstrate that MC accounts for impedance-related distortions sensitive to DBS amplitude. We then use these models to develop and validate a mitigation strategy for MC that is provided as an open-source library for more reliable oscillatory disease readouts.
Submitted 7 April, 2022;
originally announced April 2022.
-
DPICT: Deep Progressive Image Compression Using Trit-Planes
Authors:
Jae-Han Lee,
Seungmin Jeon,
Kwang Pyo Choi,
Youngo Park,
Chang-Su Kim
Abstract:
We propose the deep progressive image compression using trit-planes (DPICT) algorithm, which is the first learning-based codec supporting fine granular scalability (FGS). First, we transform an image into a latent tensor using an analysis network. Then, we represent the latent tensor in ternary digits (trits) and encode it into a compressed bitstream trit-plane by trit-plane in the decreasing order of significance. Moreover, within each trit-plane, we sort the trits according to their rate-distortion priorities and transmit more important information first. Since the compression network is less optimized for the cases of using fewer trit-planes, we develop a postprocessing network for refining reconstructed images at low rates. Experimental results show that DPICT outperforms conventional progressive codecs significantly, while enabling FGS transmission. Codes are available at https://github.com/jaehanlee-mcl/DPICT.
Submitted 6 May, 2022; v1 submitted 12 December, 2021;
originally announced December 2021.
-
Semi-Supervised Music Tagging Transformer
Authors:
Minz Won,
Keunwoo Choi,
Xavier Serra
Abstract:
We present Music Tagging Transformer that is trained with a semi-supervised approach. The proposed model captures local acoustic characteristics in shallow convolutional layers, then temporally summarizes the sequence of the extracted features using stacked self-attention layers. Through a careful model assessment, we first show that the proposed architecture outperforms the previous state-of-the-art music tagging models that are based on convolutional neural networks under a supervised scheme.
The Music Tagging Transformer is further improved by noisy student training, a semi-supervised approach that leverages both labeled and unlabeled data combined with data augmentation. To the best of our knowledge, this is the first attempt to utilize the entire audio of the Million Song Dataset.
Submitted 26 November, 2021;
originally announced November 2021.
-
Music Classification: Beyond Supervised Learning, Towards Real-world Applications
Authors:
Minz Won,
Janne Spijkervet,
Keunwoo Choi
Abstract:
Music classification is a music information retrieval (MIR) task to classify music items to labels such as genre, mood, and instruments. It is also closely related to other concepts such as music similarity and musical preference. In this tutorial, we put our focus on two directions - the recent training schemes beyond supervised learning and the successful application of music classification models.
The target audience for this web book is researchers and practitioners who are interested in state-of-the-art music classification research and building real-world applications. We assume the audience is familiar with the basic machine learning concepts.
In this book, we present three lectures as follows: 1. Music classification overview: Task definition, applications, existing approaches, datasets, 2. Beyond supervised learning: Semi- and self-supervised learning for music classification, 3. Towards real-world applications: Less-discussed, yet important research issues in practice.
Submitted 2 December, 2021; v1 submitted 22 November, 2021;
originally announced November 2021.
-
A Novel TSK Fuzzy System Incorporating Multi-view Collaborative Transfer Learning for Personalized Epileptic EEG Detection
Authors:
Andong Li,
Zhaohong Deng,
Qiongdan Lou,
Kup-Sze Choi,
Hongbin Shen,
Shitong Wang
Abstract:
In clinical practice, electroencephalography (EEG) plays an important role in the diagnosis of epilepsy. EEG-based computer-aided diagnosis of epilepsy can greatly improve the accuracy of epilepsy detection while reducing the workload of physicians. However, personalized epileptic EEG detection (i.e., training a detection model for a specific person) faces many challenges in practical applications, including the difficulty of extracting effective features from a single view, the undesirable but common lack of sufficient training data, and the absence of any guarantee that training and test data are identically distributed. To solve these problems, we propose a TSK fuzzy system-based epilepsy detection algorithm that integrates multi-view collaborative transfer learning. To address the limitation of single-view features, multi-view learning ensures the diversity of features by extracting them from different views. The lack of training data for building a personalized detection model is tackled by leveraging knowledge from the source domain (reference scene) to enhance performance in the target domain (current scene of interest), where the mismatch of data distributions between the two domains is resolved with an adaptation technique based on maximum mean discrepancy. Notably, the transfer learning and multi-view feature extraction are performed at the same time. Furthermore, the fuzzy rules of the TSK fuzzy system equip the model with strong fuzzy logic inference capability. Hence, the proposed method has the potential to detect epileptic EEG signals effectively, which is demonstrated by positive results from a large number of experiments on the CHB-MIT dataset.
Submitted 11 November, 2021;
originally announced November 2021.
-
Temporal Knowledge Distillation for On-device Audio Classification
Authors:
Kwanghee Choi,
Martin Kersner,
Jacob Morton,
Buru Chang
Abstract:
Improving the performance of on-device audio classification models remains a challenge given the computational limits of the mobile environment. Many studies leverage knowledge distillation to boost predictive performance by transferring the knowledge from large models to on-device models. However, most lack a mechanism to distill the essence of the temporal information, which is crucial to audio classification tasks, or often require a similar architecture. In this paper, we propose a new knowledge distillation method designed to incorporate the temporal knowledge embedded in the attention weights of large transformer-based models into on-device models. Our distillation method is applicable to various types of architectures, including non-attention-based architectures such as CNNs or RNNs, while retaining the original network architecture during inference. Through extensive experiments on both an audio event detection dataset and a noisy keyword spotting dataset, we show that our proposed method improves the predictive performance across diverse on-device architectures.
Submitted 5 February, 2022; v1 submitted 26 October, 2021;
originally announced October 2021.
-
SpecTNT: a Time-Frequency Transformer for Music Audio
Authors:
Wei-Tsung Lu,
Ju-Chiang Wang,
Minz Won,
Keunwoo Choi,
Xuchen Song
Abstract:
Transformers have drawn attention in the MIR field for their remarkable performance shown in natural language processing and computer vision. However, prior works in the audio processing domain mostly use the Transformer as a temporal feature aggregator that acts similarly to RNNs. In this paper, we propose SpecTNT, a Transformer-based architecture to model both spectral and temporal sequences of an input time-frequency representation. Specifically, we introduce a novel variant of the Transformer-in-Transformer (TNT) architecture. In each SpecTNT block, a spectral Transformer extracts frequency-related features into the frequency class token (FCT) for each frame. Later, the FCTs are linearly projected and added to the temporal embeddings (TEs), which aggregate useful information from the FCTs. Then, a temporal Transformer processes the TEs to exchange information across the time axis. By stacking the SpecTNT blocks, we build the SpecTNT model to learn the representation for music signals. In experiments, SpecTNT demonstrates state-of-the-art performance in music tagging and vocal melody extraction, and shows competitive performance for chord recognition. The effectiveness of SpecTNT and other design choices is further examined through ablation studies.
Submitted 18 October, 2021;
originally announced October 2021.
-
Design and Implementation of 5.8GHz RF Wireless Power Transfer System
Authors:
Je Hyeon Park,
Nguyen Minh Tran,
Sa Il Hwang,
Dong In Kim,
Kae Won Choi
Abstract:
In this paper, we present a 5.8 GHz radio-frequency (RF) wireless power transfer (WPT) system that consists of 64 transmit antennas and 16 receive antennas. Unlike inductive or resonant coupling-based near-field WPT, RF WPT has a great advantage in powering low-power internet of things (IoT) devices because of its capability for long-range wireless power transfer. We also propose a beam scanning algorithm that can effectively transfer power whether the receiver is located in the radiative near-field zone or the far-field zone. The proposed beam scanning algorithm is verified with a real-life WPT testbed that we implemented ourselves. Through experiments, we confirm that the implemented 5.8 GHz RF WPT system is able to transfer 3.67 mW at a distance of 25 meters with the proposed beam scanning algorithm. Moreover, the results show that the proposed algorithm can effectively cover the radiative near-field region, unlike conventional scanning schemes, which are designed under the assumption of far-field WPT.
Submitted 6 October, 2021;
originally announced October 2021.
-
Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation
Authors:
Qiuqiang Kong,
Yin Cao,
Haohe Liu,
Keunwoo Choi,
Yuxuan Wang
Abstract:
Deep neural network based methods have been successfully applied to music source separation. They typically learn a mapping from a mixture spectrogram to a set of source spectrograms, all with magnitudes only. This approach has several limitations: 1) its incorrect phase reconstruction degrades the performance, 2) it limits the magnitude of masks between 0 and 1 while we observe that 22% of time-frequency bins have ideal ratio mask values of over 1 in a popular dataset, MUSDB18, 3) its potential on very deep architectures is under-explored. Our proposed system is designed to overcome these. First, we propose to estimate phases by estimating complex ideal ratio masks (cIRMs) where we decouple the estimation of cIRMs into magnitude and phase estimations. Second, we extend the separation method to effectively allow the magnitude of the mask to be larger than 1. Finally, we propose a residual UNet architecture with up to 143 layers. Our proposed system achieves a state-of-the-art MSS result on the MUSDB18 dataset, in particular an SDR of 8.98 dB on vocals, outperforming the previous best performance of 7.24 dB. The source code is available at: https://github.com/bytedance/music_source_separation
Submitted 11 September, 2021;
originally announced September 2021.
-
Reconfigurable Intelligent Surface-Aided Wireless Power Transfer Systems: Analysis and Implementation
Authors:
Nguyen Minh Tran,
Muhammad Miftahul Amri,
Je Hyeon Park,
Dong In Kim,
Kae Won Choi
Abstract:
Reconfigurable intelligent surface (RIS) is a promising technology for RF wireless power transfer (WPT) as it is capable of beamforming and beam focusing without using active and power-hungry components. In this paper, we propose a multi-tile RIS beam scanning (MTBS) algorithm for powering up internet-of-things (IoT) devices. Considering the hardware limitations of the IoT devices, the proposed algorithm requires only power information to enable the beam focusing capability of the RIS. Specifically, we first divide the RIS into smaller RIS tiles. Then, all RIS tiles and the phased array transmitter are iteratively scanned and optimized to maximize the receive power. We analyze the proposed algorithm in detail and build a simulator to verify it. Furthermore, we have built a real-life testbed of RIS-aided WPT systems to validate the algorithm. The experimental results show that the proposed MTBS algorithm can properly control the transmission phase of the transmitter and the reflection phase of the RIS to focus the power at the receiver. Consequently, after executing the algorithm, the receive power improves by about 20 dB compared to the case in which all unit cells of the RIS are in the OFF state. Through experiments, we confirm that the RIS with the MTBS algorithm can greatly enhance the power transfer efficiency.
Submitted 13 March, 2022; v1 submitted 12 June, 2021;
originally announced June 2021.
-
Deep Learning-based High-precision Depth Map Estimation from Missing Viewpoints for 360 Degree Digital Holography
Authors:
Hakdong Kim,
Heonyeong Lim,
Minkyu Jee,
Yurim Lee,
Jisoo Jeong,
Kyudam Choi,
MinSung Yoon,
Cheongwon Kim
Abstract:
In this paper, we propose a novel convolutional neural network model to extract highly precise depth maps from missing viewpoints, which is especially well suited to generating holographic 3D content. The depth map is an essential element for phase extraction, which is required for the synthesis of computer-generated holograms (CGHs). The proposed model, called HDD Net, uses MSE as the loss function for better depth map estimation performance, and uses bilinear interpolation in the upsampling layer with ReLU as the activation function. We design and prepare a total of 8,192 multi-view images, each with a resolution of 640 by 360, for the deep learning study. The proposed model estimates depth maps through feature extraction and upsampling. For quantitative assessment, we compare the estimated depth maps with the ground truths using PSNR, ACC, and RMSE. We also compare the CGH patterns made from the estimated depth maps with those made from the ground truths. Furthermore, we present experimental results that test the quality of the estimated depth maps by directly reconstructing holographic 3D image scenes from the CGHs.
Submitted 8 March, 2021;
originally announced March 2021.
-
Listen, Read, and Identify: Multimodal Singing Language Identification of Music
Authors:
Keunwoo Choi,
Yuxuan Wang
Abstract:
We propose a multimodal singing language classification model that uses both audio content and textual metadata. LRID-Net, the proposed model, takes an audio signal and a language probability vector estimated from the metadata and outputs the probabilities of the target languages. Optionally, LRID-Net can be equipped with modality dropout to handle a missing modality. In the experiment, we trained several LRID-Nets with varying modality dropout configurations and tested them with various combinations of input modalities. The experimental results demonstrate that using multimodal input improves performance. The results also suggest that adopting modality dropout does not degrade the performance of the model when all modalities are present, while enabling the model to handle missing-modality cases to some extent.
Submitted 27 July, 2021; v1 submitted 2 March, 2021;
originally announced March 2021.
-
Large-Scale MIDI-based Composer Classification
Authors:
Qiuqiang Kong,
Keunwoo Choi,
Yuxuan Wang
Abstract:
Music classification is a task to classify a music piece into labels such as genres or composers. We propose large-scale MIDI-based composer classification systems using GiantMIDI-Piano, a transcription-based dataset. We propose to use piano rolls, onset rolls, and velocity rolls as input representations and use deep neural networks as classifiers. To our knowledge, we are the first to investigate the composer classification problem with up to 100 composers. By using convolutional recurrent neural networks as models, our MIDI-based composer classification system achieves 10-composer and 100-composer classification accuracies of 0.648 and 0.385 (evaluated on 30-second clips) and 0.739 and 0.489 (evaluated on music pieces), respectively. Our MIDI-based composer classification system outperforms several audio-based baseline classification systems, indicating the effectiveness of using compact MIDI representations for composer classification.
Submitted 28 October, 2020;
originally announced October 2020.
-
Deep Composer Classification Using Symbolic Representation
Authors:
Sunghyeon Kim,
Hyeyoon Lee,
Sunjong Park,
**ho Lee,
Keunwoo Choi
Abstract:
In this study, we train deep neural networks to classify composers in the symbolic domain. The model takes a two-channel two-dimensional input, i.e., onset and note activations of a time-pitch representation converted from MIDI recordings, and performs single-label classification. In experiments conducted on the MAESTRO dataset, we report an F1 value of 0.8333 for the classification of 13 classical composers.
Submitted 26 October, 2020; v1 submitted 2 October, 2020;
originally announced October 2020.
-
Dereverberation using joint estimation of dry speech signal and acoustic system
Authors:
Sanna Wager,
Keunwoo Choi,
Simon Durand
Abstract:
The purpose of speech dereverberation is to remove quality-degrading effects of a time-invariant impulse response filter from the signal. In this report, we describe an approach to speech dereverberation that involves joint estimation of the dry speech signal and of the room impulse response. We explore deep learning models that apply to each task separately, and how these can be combined in a joint model with shared parameters.
Submitted 24 July, 2020;
originally announced July 2020.
-
Joint Orthogonal Band and Power Allocation for Energy Fairness in WPT System with Nonlinear Logarithmic Energy Harvesting Model
Authors:
Jaeseob Han,
Gyeong Ho Lee,
Sangdon Park,
Jun Kyun Choi
Abstract:
Wireless power transmission (WPT) is expected to play an important role in Internet of Things (IoT) services by enabling the perpetual operation of IoT sensors. However, prolonging the IoT network's lifetime requires an efficient resource allocation algorithm; in particular, energy fairness among IoT sensors has been a critical challenge for WPT systems. In this paper, considering energy fairness as the minimum received energy of all energy poverty IoT sensors (EPISs), we allocate orthogonal frequency bands to several EPISs and transfer RF power on each orthogonal band using energy beamforming. Based on energy poverty, we propose an orthogonal frequency band assignment rule that grants priority to the EPISs with less received energy. We also formulate two transmission power allocation problems that incorporate the nonlinear logarithmic energy harvesting (EH) model. First, the total received power maximization (TRPM) problem is presented and solved by combining the well-known Karush-Kuhn-Tucker (KKT) conditions with a modified water-filling algorithm. Second, the common received power maximization (CRPM) problem is formulated and the optimal solution is derived using the iterative bisection search method. To apply the bisection search method to the problem, this paper proposes a method of specifying the search range of the solution for an objective function defined as the sum of monotonic functions. In the numerical results, assuming that EPIS mobility follows a one-dimensional random walk model, the effect of EPIS mobility on the minimum received energy of all EPISs is presented. Finally, the performance of the proposed resource allocation schemes is verified by comparing them with other resource allocation schemes, such as round robin and equal power distribution.
Submitted 30 March, 2020;
originally announced March 2020.
-
Encoding Musical Style with Transformer Autoencoders
Authors:
Kristy Choi,
Curtis Hawthorne,
Ian Simon,
Monica Dinculescu,
Jesse Engel
Abstract:
We consider the problem of learning high-level controls over the global structure of generated sequences, particularly in the context of symbolic music generation with complex language models. In this work, we present the Transformer autoencoder, which aggregates encodings of the input data across time to obtain a global representation of style from a given performance. We show it is possible to combine this global representation with other temporally distributed embeddings, enabling improved control over the separate aspects of performance style and melody. Empirically, we demonstrate the effectiveness of our method on various music generation tasks on the MAESTRO dataset and a YouTube dataset with 10,000+ hours of piano performances, where we achieve improvements in terms of log-likelihood and mean listening scores as compared to baselines.
Submitted 30 June, 2020; v1 submitted 10 December, 2019;
originally announced December 2019.