Exploiting Time-Frequency Conformers for Music Audio Enhancement

Authors: Yunkee Chae, Junghyun Koo, Sungho Lee, Kyogu Lee

Abstract: With the proliferation of video platforms on the internet, recording musical performances by mobile devices has become commonplace. However, these recordings often suffer from degradation such as noise and reverberation, which negatively impact the listening experience. Consequently, the necessity for music audio enhancement (referred to as music enhancement from this point onward), involving the transformation of degraded audio recordings into pristine high-quality music, has surged to augment the auditory experience. To address this issue, we propose a music enhancement system based on the Conformer architecture that has demonstrated outstanding performance in speech enhancement tasks. Our approach explores the attention mechanisms of the Conformer and examines their performance to discover the best approach for the music enhancement task. Our experimental results show that our proposed model achieves state-of-the-art performance on single-stem music enhancement. Furthermore, our system can perform general music enhancement with multi-track mixtures, which has not been examined in previous work.

Submitted 24 August, 2023; originally announced August 2023.

Comments: Accepted by ACM Multimedia 2023

arXiv:2307.12576

Self-refining of Pseudo Labels for Music Source Separation with Noisy Labeled Data

Authors: Junghyun Koo, Yunkee Chae, Chang-Bin Jeon, Kyogu Lee

Abstract: Music source separation (MSS) faces challenges due to the limited availability of correctly-labeled individual instrument tracks. With the push to acquire larger datasets to improve MSS performance, the inevitability of encountering mislabeled individual instrument tracks becomes a significant challenge to address. This paper introduces an automated technique for refining the labels in a partially mislabeled dataset. Our proposed self-refining technique, employed with a noisy-labeled dataset, results in only a 1% accuracy degradation in multi-label instrument recognition compared to a classifier trained on a clean-labeled dataset. The study demonstrates the importance of refining noisy-labeled data in MSS model training and shows that utilizing the refined dataset leads to comparable results derived from a clean-labeled dataset. Notably, upon only access to a noisy dataset, MSS models trained on a self-refined dataset even outperform those trained on a dataset refined with a classifier trained on clean labels.

Submitted 24 July, 2023; originally announced July 2023.

Comments: 24th International Society for Music Information Retrieval Conference (ISMIR 2023)

arXiv:2305.13108

Debiased Automatic Speech Recognition for Dysarthric Speech via Sample Reweighting with Sample Affinity Test

Authors: Eungbeom Kim, Yunkee Chae, Jaeheon Sim, Kyogu Lee

Abstract: Automatic speech recognition systems based on deep learning are mainly trained under empirical risk minimization (ERM). Since ERM utilizes the averaged performance on the data samples regardless of a group such as healthy or dysarthric speakers, ASR systems are unaware of the performance disparities across the groups. This results in biased ASR systems whose performance differences among groups are severe. In this study, we aim to improve the ASR system in terms of group robustness for dysarthric speakers. To achieve our goal, we present a novel approach, sample reweighting with sample affinity test (Re-SAT). Re-SAT systematically measures the debiasing helpfulness of the given data sample and then mitigates the bias by debiasing helpfulness-based sample reweighting. Experimental results demonstrate that Re-SAT contributes to improved ASR performance on dysarthric speech without performance degradation on healthy speech.
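The contrast between an ERM-style average and a reweighted loss can be illustrated with a few lines of Python. This is only a minimal sketch of generic sample reweighting, not the Re-SAT procedure: the loss values, the group sizes, and the weights below are all hypothetical, and the abstract does not specify how the sample affinity test produces its weights.

```python
# Generic sample reweighting sketch. All numbers are hypothetical and are
# NOT taken from the paper; the weights are hand-picked placeholders, not
# the output of a sample affinity test.

def mean_loss(losses):
    """ERM-style objective: plain average over all samples."""
    return sum(losses) / len(losses)


def reweighted_loss(losses, weights):
    """Weighted average of per-sample losses (weights need not sum to 1)."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Nine hypothetical "healthy" samples with low loss and one hypothetical
# "dysarthric" sample with high loss.
healthy = [0.1] * 9
dysarthric = [1.0]
losses = healthy + dysarthric

# Placeholder weights that give the minority sample as much total weight
# as the nine majority samples combined.
weights = [1.0] * 9 + [9.0]

erm = mean_loss(losses)
rew = reweighted_loss(losses, weights)
print(f"ERM average: {erm:.2f}, reweighted: {rew:.2f}")
```

With these made-up numbers the plain average is dominated by the majority group, while the reweighted objective exposes the high loss on the minority sample, which is the kind of disparity the abstract says ERM does not see.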

Submitted 27 June, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: Accepted by Interspeech 2023

arXiv:2305.09167

Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion

Authors: Xintao Zhao, Shuai Wang, Yang Chao, Zhiyong Wu, Helen Meng

Abstract: Nowadays, recognition-synthesis-based methods have been quite popular with voice conversion (VC). By introducing linguistics features with good disentangling characters extracted from an automatic speech recognition (ASR) model, the VC performance achieved considerable breakthroughs. Recently, self-supervised learning (SSL) methods trained with a large-scale unannotated speech corpus have been applied to downstream tasks focusing on the content information, which is suitable for VC tasks. However, a huge amount of speaker information in SSL representations degrades timbre similarity and the quality of converted speech significantly. To address this problem, we proposed a high-similarity any-to-one voice conversion method with the input of SSL representations. We incorporated adversarial training mechanisms in the synthesis module using external unannotated corpora. Two auxiliary discriminators were trained to distinguish whether a sequence of mel-spectrograms has been converted by the acoustic model and whether a sequence of content embeddings contains speaker information from external corpora. Experimental results show that our proposed method achieves comparable similarity and higher naturalness than the supervised method, which needs a huge amount of annotated corpora for training and is applicable to improve similarity for VC methods with other SSL representations as input.

Submitted 16 May, 2023; originally announced May 2023.

Comments: Accepted by ICME 2023

arXiv:2304.14496

Restoring Original Signal From Pile-up Signal using Deep Learning

Authors: C. H. Kim, S. Ahn, K. Y. Chae, J. Hooker, G. V. Rogachev

Abstract: Pile-up signals are frequently produced in experimental physics. They create inaccurate physics data with high uncertainty and cause various problems. Therefore, the correction to pile-up signals is crucially required. In this study, we implemented a deep learning method to restore the original signals from the pile-up signals. We showed that a deep learning model could accurately reconstruct the original signal waveforms from the pile-up waveforms. By substituting the pile-up signals with the original signals predicted by the model, the energy and timing resolutions of the data are notably enhanced. The model implementation significantly improved the quality of the particle identification plot and particle tracks. This method is applicable to similar problems, such as separating multiple signals or correcting pile-up signals with other types of noises and backgrounds.

Submitted 24 April, 2023; originally announced April 2023.

arXiv:2211.07951

Show Me the Instruments: Musical Instrument Retrieval from Mixture Audio

Authors: Kyungsu Kim, Minju Park, Haesun Joung, Yunkee Chae, Yeongbeom Hong, Seonghyeon Go, Kyogu Lee

Abstract: As digital music production has become mainstream, the selection of appropriate virtual instruments plays a crucial role in determining the quality of music. To search the musical instrument samples or virtual instruments that make one's desired sound, music producers use their ears to listen and compare each instrument sample in their collection, which is time-consuming and inefficient. In this paper, we call this task Musical Instrument Retrieval and propose a method for retrieving desired musical instruments using reference music mixture as a query. The proposed model consists of the Single-Instrument Encoder and the Multi-Instrument Encoder, both based on convolutional neural networks. The Single-Instrument Encoder is trained to classify the instruments used in single-track audio, and we take its penultimate layer's activation as the instrument embedding. The Multi-Instrument Encoder is trained to estimate multiple instrument embeddings using the instrument embeddings computed by the Single-Instrument Encoder as a set of target embeddings. For more generalized training and realistic evaluation, we also propose a new dataset called Nlakh. Experimental results showed that the Single-Instrument Encoder was able to learn the mapping from the audio signal of unseen instruments to the instrument embedding space and the Multi-Instrument Encoder was able to extract multiple embeddings from the mixture of music and retrieve the desired instruments successfully.
The code used for the experiment and audio samples are available at: https://github.com/minju0821/musical_instrument_retrieval

Submitted 15 November, 2022; originally announced November 2022.

Comments: 5 pages, 4 figures, submitted to ICASSP 2023

arXiv:2206.06730

Automated Precision Localization of Peripherally Inserted Central Catheter Tip through Model-Agnostic Multi-Stage Networks

Authors: Subin Park, Yoon Ki Cha, Soyoung Park, Kyung-Su Kim, Myung Jin Chung

Abstract: Peripherally inserted central catheters (PICCs) have been widely used as one of the representative central venous lines (CVCs) due to their long-term intravascular access with low infectivity. However, PICCs have a fatal drawback of a high frequency of tip mispositions, increasing the risk of puncture, embolism, and complications such as cardiac arrhythmias. To automatically and precisely detect it, various attempts have been made by using the latest deep learning (DL) technologies. However, even with these approaches, it is still practically difficult to determine the tip location because the multiple fragments phenomenon (MFP) occurs in the process of predicting and extracting the PICC line required before predicting the tip. This study aimed to develop a system generally applied to existing models and to restore the PICC line more exactly by removing the MFs of the model output, thereby precisely localizing the actual tip position for detecting its disposition. To achieve this, we proposed a multi-stage DL-based framework post-processing the PICC line extraction result of the existing technology. The performance was compared by each root mean squared error (RMSE) and MFP incidence rate according to whether or not MFCN is applied to five conventional models. In internal validation, when MFCN was applied to the existing single model, MFP was improved by an average of 45%. The RMSE was improved by over 63% from an average of 26.85mm (17.16 to 35.80mm) to 9.72mm (9.37 to 10.98mm).
In external validation, when MFCN was applied, the MFP incidence rate decreased by an average of 32% and the RMSE decreased by an average of 65%. Therefore, by applying the proposed MFCN, we observed the significant/consistent detection performance improvement of PICC tip location compared to the existing model.
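As a quick arithmetic check of the internal-validation figure quoted in the abstract, the relative RMSE reduction can be recomputed from the two reported averages. The snippet is a minimal sketch; the function and variable names are illustrative and do not come from the paper.

```python
def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Average RMSE values reported in the abstract, in millimetres.
rmse_before_mm = 26.85
rmse_after_mm = 9.72

reduction = relative_reduction(rmse_before_mm, rmse_after_mm)
print(f"Relative RMSE reduction: {reduction:.1%}")
```

The result falls between 63% and 64%, which agrees with the abstract's statement that the RMSE "was improved by over 63%".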

Submitted 14 June, 2022; originally announced June 2022.

Comments: Subin Park and Yoon Ki Cha have contributed equally to this work as the co-first author. Kyung-Su Kim ([email protected]) and Myung Jin Chung ([email protected]) have contributed equally to this work as the co-corresponding author

arXiv:2102.07883

doi 10.1109/TIP.2022.3145242

Pre-demosaic Graph-based Light Field Image Compression

Authors: Yung-Hsuan Chao, Haoran Hong, Gene Cheung, Antonio Ortega

Abstract: An unfocused plenoptic light field (LF) camera places an array of microlenses in front of an image sensor in order to separately capture different directional rays arriving at an image pixel. Using a conventional Bayer pattern, data captured at each pixel is a single color component (R, G or B). The sensed data then undergoes demosaicking (interpolation of RGB components per pixel) and conversion to an array of sub-aperture images (SAIs). In this paper, we propose a new LF image coding scheme based on graph lifting transform (GLT), where the acquired sensor data are coded in the original captured form without pre-processing. Specifically, we directly map raw sensed color data to the SAIs, resulting in sparsely distributed color pixels on 2D grids, and perform demosaicking at the receiver after decoding. To exploit spatial correlation among the sparse pixels, we propose a novel intra-prediction scheme, where the prediction kernel is determined according to the local gradient estimated from already coded neighboring pixel blocks. We then connect the pixels by forming a graph, modeling the prediction residuals statistically as a Gaussian Markov Random Field (GMRF). The optimal edge weights are computed via a graph learning method using a set of training SAIs. The residual data is encoded via low-complexity GLT. Experiments show that at high PSNRs -- important for archiving and instant storage scenarios -- our method significantly outperformed a conventional light field image coding scheme with demosaicking followed by High Efficiency Video Coding (HEVC).

Submitted 6 January, 2022; v1 submitted 15 February, 2021; originally announced February 2021.

Comments: 13 pages, 12 figures, 6 tables, Accepted by IEEE Transactions on Image Processing

arXiv:1909.00952

doi 10.1109/TIP.2020.3026627

Graph-based Transforms for Video Coding

Authors: Hilmi E. Egilmez, Yung-Hsuan Chao, Antonio Ortega

Abstract: In many state-of-the-art compression systems, signal transformation is an integral part of the encoding and decoding process, where transforms provide compact representations for the signals of interest. This paper introduces a class of transforms called graph-based transforms (GBTs) for video compression, and proposes two different techniques to design GBTs. In the first technique, we formulate an optimization problem to learn graphs from data and provide solutions for optimal separable and nonseparable GBT designs, called GL-GBTs. The optimality of the proposed GL-GBTs is also theoretically analyzed based on Gaussian-Markov random field (GMRF) models for intra and inter predicted block signals. The second technique develops edge-adaptive GBTs (EA-GBTs) in order to flexibly adapt transforms to block signals with image edges (discontinuities). The advantages of EA-GBTs are both theoretically and empirically demonstrated. Our experimental results demonstrate that the proposed transforms can significantly outperform the traditional Karhunen-Loeve transform (KLT).
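For readers unfamiliar with graph-based transforms, the following sketch may help. It assumes the common construction in which the transform basis is the set of eigenvectors of a graph Laplacian, and it covers only the simplest special case: a path graph with unit edge weights, for which the eigenvectors are known to be the DCT-II basis vectors. It does not implement the learned (GL-GBT) or edge-adaptive (EA-GBT) designs described in the abstract, and all function names are illustrative.

```python
import math


def path_laplacian(n):
    """Combinatorial Laplacian L = D - W of an n-node path graph with unit weights."""
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        lap[i][i] += 1.0
        lap[i + 1][i + 1] += 1.0
        lap[i][i + 1] -= 1.0
        lap[i + 1][i] -= 1.0
    return lap


def dct2_vector(n, k):
    """Unnormalized k-th DCT-II basis vector of length n."""
    return [math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)]


def matvec(matrix, vector):
    """Dense matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def max_eigen_residual(n):
    """Largest entry of |L u_k - lambda_k u_k| over all k, with
    lambda_k = 2 - 2 cos(pi k / n)."""
    lap = path_laplacian(n)
    worst = 0.0
    for k in range(n):
        u = dct2_vector(n, k)
        lam = 2.0 - 2.0 * math.cos(math.pi * k / n)
        lu = matvec(lap, u)
        worst = max(worst, max(abs(a - lam * b) for a, b in zip(lu, u)))
    return worst


# A residual at floating-point rounding level confirms that the DCT-II
# vectors are eigenvectors of the uniform path-graph Laplacian.
print(max_eigen_residual(8))
```

Changing the edge weights, for example weakening the edge across an image discontinuity, changes the Laplacian and therefore the eigenvector basis, which is the intuition behind adapting the transform to the block content.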

Submitted 18 September, 2020; v1 submitted 3 September, 2019; originally announced September 2019.

Comments: To appear in IEEE Trans. on Image Processing (14 pages)

Showing 1–9 of 9 results for author: Chae, Y