Search | arXiv e-print repository

REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR

Authors: Liang-Hsuan Tseng, En-Pei Hu, Cheng-Han Chiang, Yuan Tseng, Hung-yi Lee, Lin-shan Lee, Shao-Hua Sun

Abstract: Unsupervised automatic speech recognition (ASR) aims to learn the map** between the speech signal and its corresponding textual transcription without the supervision of paired speech-text data. A word/phoneme in the speech signal is represented by a segment of speech signal with variable length and unknown boundary, and this segmental structure makes learning the map** between speech and text… ▽ More Unsupervised automatic speech recognition (ASR) aims to learn the map** between the speech signal and its corresponding textual transcription without the supervision of paired speech-text data. A word/phoneme in the speech signal is represented by a segment of speech signal with variable length and unknown boundary, and this segmental structure makes learning the map** between speech and text challenging, especially without paired data. In this paper, we propose REBORN,Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR. REBORN alternates between (1) training a segmentation model that predicts the boundaries of the segmental structures in speech signals and (2) training the phoneme prediction model, whose input is the speech feature segmented by the segmentation model, to predict a phoneme transcription. Since supervised data for training the segmentation model is not available, we use reinforcement learning to train the segmentation model to favor segmentations that yield phoneme sequence predictions with a lower perplexity. We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech. We comprehensively analyze why the boundaries learned by REBORN improve the unsupervised ASR performance. △ Less

Submitted 28 May, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

arXiv:2401.13463 [pdf, other]

SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering

Authors: Chyi-Jiunn Lin, Guan-Ting Lin, Yung-Sung Chuang, Wei-Lun Wu, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Lin-shan Lee

Abstract: Spoken Question Answering (SQA) is essential for machines to reply to user's question by finding the answer span within a given spoken passage. SQA has been previously achieved without ASR to avoid recognition errors and Out-of-Vocabulary (OOV) problems. However, the real-world problem of Open-domain SQA (openSQA), in which the machine needs to first retrieve passages that possibly contain the ans… ▽ More Spoken Question Answering (SQA) is essential for machines to reply to user's question by finding the answer span within a given spoken passage. SQA has been previously achieved without ASR to avoid recognition errors and Out-of-Vocabulary (OOV) problems. However, the real-world problem of Open-domain SQA (openSQA), in which the machine needs to first retrieve passages that possibly contain the answer from a spoken archive in addition, was never considered. This paper proposes the first known end-to-end framework, Speech Dense Passage Retriever (SpeechDPR), for the retrieval component of the openSQA problem. SpeechDPR learns a sentence-level semantic representation by distilling knowledge from the cascading model of unsupervised ASR (UASR) and text dense retriever (TDR). No manually transcribed speech data is needed. Initial experiments showed performance comparable to the cascading model of UASR and TDR, and significantly better when UASR was poor, verifying this approach is more robust to speech recognition errors. △ Less

Submitted 18 March, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

Comments: Accepted at ICASSP 2024

arXiv:2312.09799 [pdf, other]

doi 10.1109/OJCAS.2023.3344094

IQNet: Image Quality Assessment Guided Just Noticeable Difference Prefiltering For Versatile Video Coding

Authors: Yu-Han Sun, Chiang Lo-Hsuan Lee, Tian-Sheuan Chang

Abstract: Image prefiltering with just noticeable distortion (JND) improves coding efficiency in a visual lossless way by filtering the perceptually redundant information prior to compression. However, real JND cannot be well modeled with inaccurate masking equations in traditional approaches or image-level subject tests in deep learning approaches. Thus, this paper proposes a fine-grained JND prefiltering… ▽ More Image prefiltering with just noticeable distortion (JND) improves coding efficiency in a visual lossless way by filtering the perceptually redundant information prior to compression. However, real JND cannot be well modeled with inaccurate masking equations in traditional approaches or image-level subject tests in deep learning approaches. Thus, this paper proposes a fine-grained JND prefiltering dataset guided by image quality assessment for accurate block-level JND modeling. The dataset is constructed from decoded images to include coding effects and is also perceptually enhanced with block overlap and edge preservation. Furthermore, based on this dataset, we propose a lightweight JND prefiltering network, IQNet, which can be applied directly to different quantization cases with the same model and only needs 3K parameters. The experimental results show that the proposed approach to Versatile Video Coding could yield maximum/average bitrate savings of 41\%/15\% and 53\%/19\% for all-intra and low-delay P configurations, respectively, with negligible subjective quality loss. Our method demonstrates higher perceptual quality and a model size that is an order of magnitude smaller than previous deep learning methods. △ Less

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2307.11226 [pdf, other]

BLISS: Interplanetary Exploration with Swarms of Low-Cost Spacecraft

Authors: Alexander N. Alvara, Lydia Lee, Emmanuel Sin, Nathan Lambert, Andrew J. Westphal, Kristofer S. J. Pister

Abstract: Leveraging advancements in micro-scale technology, we propose a fleet of autonomous, low-cost, small solar sails for interplanetary exploration. The Berkeley Low-cost Interplanetary Solar Sail (BLISS) project aims to utilize small-scale technologies to create a fleet of tiny interplanetary femto-spacecraft for rapid, low-cost exploration of the inner solar system. This paper describes the hardware… ▽ More Leveraging advancements in micro-scale technology, we propose a fleet of autonomous, low-cost, small solar sails for interplanetary exploration. The Berkeley Low-cost Interplanetary Solar Sail (BLISS) project aims to utilize small-scale technologies to create a fleet of tiny interplanetary femto-spacecraft for rapid, low-cost exploration of the inner solar system. This paper describes the hardware required to build a nearly 10 g spacecraft using a 1 m$^2$ solar sail steered by micro-electromechanical systems (MEMS) inchworm actuators. The trajectory control to a NEO, here 101955 Bennu, is detailed along with the low-level actuation control of the solar sail and the specifications of proposed onboard communication and computation. Two other applications are also shortly considered: sample return from dozens of Jupiter-family comets and interstellar comet rendezvous and imaging. The paper concludes by discussing the fundamental scaling limits and future directions for steerable autonomous miniature solar sails with onboard custom computers and sensors. △ Less

Submitted 20 July, 2023; originally announced July 2023.

Comments: 16 pages, 13 figures, 5 tables, 23 equations, and just over 10 years

arXiv:2306.01296 [pdf, other]

doi 10.21437/Interspeech.2023-361

Improved Training for End-to-End Streaming Automatic Speech Recognition Model with Punctuation

Authors: Hanbyul Kim, Seunghyun Seo, Lukas Lee, Seolki Baek

Abstract: Punctuated text prediction is crucial for automatic speech recognition as it enhances readability and impacts downstream natural language processing tasks. In streaming scenarios, the ability to predict punctuation in real-time is particularly desirable but presents a difficult technical challenge. In this work, we propose a method for predicting punctuated text from input speech using a chunk-bas… ▽ More Punctuated text prediction is crucial for automatic speech recognition as it enhances readability and impacts downstream natural language processing tasks. In streaming scenarios, the ability to predict punctuation in real-time is particularly desirable but presents a difficult technical challenge. In this work, we propose a method for predicting punctuated text from input speech using a chunk-based Transformer encoder trained with Connectionist Temporal Classification (CTC) loss. The acoustic model trained with long sequences by concatenating the input and target sequences can learn punctuation marks attached to the end of sentences more effectively. Additionally, by combining CTC losses on the chunks and utterances, we achieved both the improved F1 score of punctuation prediction and Word Error Rate (WER). △ Less

Submitted 2 June, 2023; originally announced June 2023.

Comments: Accepted at INTERSPEECH 2023

Journal ref: Proc. INTERSPEECH 2023, 1653-1657

arXiv:2303.13752 [pdf, other]

doi 10.1609/aaai.v37i12.26659

Leveraging Old Knowledge to Continually Learn New Classes in Medical Images

Authors: Evelyn Chee, Mong Li Lee, Wynne Hsu

Abstract: Class-incremental continual learning is a core step towards develo** artificial intelligence systems that can continuously adapt to changes in the environment by learning new concepts without forgetting those previously learned. This is especially needed in the medical domain where continually learning from new incoming data is required to classify an expanded set of diseases. In this work, we f… ▽ More Class-incremental continual learning is a core step towards develo** artificial intelligence systems that can continuously adapt to changes in the environment by learning new concepts without forgetting those previously learned. This is especially needed in the medical domain where continually learning from new incoming data is required to classify an expanded set of diseases. In this work, we focus on how old knowledge can be leveraged to learn new classes without catastrophic forgetting. We propose a framework that comprises of two main components: (1) a dynamic architecture with expanding representations to preserve previously learned features and accommodate new features; and (2) a training procedure alternating between two objectives to balance the learning of new features while maintaining the model's performance on old classes. Experiment results on multiple medical datasets show that our solution is able to achieve superior performance over state-of-the-art baselines in terms of class accuracy and forgetting. △ Less

Submitted 23 March, 2023; originally announced March 2023.

Comments: Accepted to AAAI23

arXiv:2212.13034 [pdf]

Kidney and Kidney Tumour Segmentation in CT Images

Authors: Qi Ming How, Hoi Leong Lee

Abstract: Automatic segmentation of kidney and kidney tumour in Computed Tomography (CT) images is essential, as it uses less time as compared to the current gold standard of manual segmentation. However, many hospitals are still reliant on manual study and segmentation of CT images by medical practitioners because of its higher accuracy. Thus, this study focuses on the development of an approach for automa… ▽ More Automatic segmentation of kidney and kidney tumour in Computed Tomography (CT) images is essential, as it uses less time as compared to the current gold standard of manual segmentation. However, many hospitals are still reliant on manual study and segmentation of CT images by medical practitioners because of its higher accuracy. Thus, this study focuses on the development of an approach for automatic kidney and kidney tumour segmentation in contrast-enhanced CT images. A method based on Convolutional Neural Network (CNN) was proposed, where a 3D U-Net segmentation model was developed and trained to delineate the kidney and kidney tumour from CT scans. Each CT image was pre-processed before inputting to the CNN, and the effect of down-sampled and patch-wise input images on the model performance was analysed. The proposed method was evaluated on the publicly available 2021 Kidney and Kidney Tumour Segmentation Challenge (KiTS21) dataset. The method with the best performing model recorded an average training Dice score of 0.6129, with the kidney and kidney tumour Dice scores of 0.7923 and 0.4344, respectively. For testing, the model obtained a kidney Dice score of 0.8034, and a kidney tumour Dice score of 0.4713, with an average Dice score of 0.6374. △ Less

Submitted 26 December, 2022; originally announced December 2022.

arXiv:2212.13032 [pdf]

Diagnosis of COVID-19 based on Chest Radiography

Authors: Mei Gah Lim, Hoi Leong Lee

Abstract: The Coronavirus disease 2019 (COVID-19) was first identified in Wuhan, China, in early December 2019 and now becoming a pandemic. When COVID-19 patients undergo radiography examination, radiologists can observe the present of radiographic abnormalities from their chest X-ray (CXR) images. In this study, a deep convolutional neural network (CNN) model was proposed to aid radiologists in diagnosing… ▽ More The Coronavirus disease 2019 (COVID-19) was first identified in Wuhan, China, in early December 2019 and now becoming a pandemic. When COVID-19 patients undergo radiography examination, radiologists can observe the present of radiographic abnormalities from their chest X-ray (CXR) images. In this study, a deep convolutional neural network (CNN) model was proposed to aid radiologists in diagnosing COVID-19 patients. First, this work conducted a comparative study on the performance of modified VGG-16, ResNet-50 and DenseNet-121 to classify CXR images into normal, COVID-19 and viral pneumonia. Then, the impact of image augmentation on the classification results was evaluated. The publicly available COVID-19 Radiography Database was used throughout this study. After comparison, ResNet-50 achieved the highest accuracy with 95.88%. Next, after training ResNet-50 with rotation, translation, horizontal flip, intensity shift and zoom augmented dataset, the accuracy dropped to 80.95%. Furthermore, an ablation study on the effect of image augmentation on the classification results found that the combinations of rotation and intensity shift augmentation methods obtained an accuracy higher than baseline, which is 96.14%. Finally, ResNet-50 with rotation and intensity shift augmentations performed the best and was proposed as the final classification model in this work. These findings demonstrated that the proposed classification model can provide a promising result for COVID-19 diagnosis. △ Less

Submitted 26 December, 2022; originally announced December 2022.

arXiv:2205.03247 [pdf, other]

Musical Score Following and Audio Alignment

Authors: Lin Hao Lee

Abstract: Real-time tracking of the position of a musical performance on a musical score, i.e. score following, can be useful in music practice, performance and production. Example applications of such technology include computer-aided accompaniment and automatic page turning. Score following is a challenging task, especially when considering deviations in performance data from the score stemming from mista… ▽ More Real-time tracking of the position of a musical performance on a musical score, i.e. score following, can be useful in music practice, performance and production. Example applications of such technology include computer-aided accompaniment and automatic page turning. Score following is a challenging task, especially when considering deviations in performance data from the score stemming from mistakes or expressive choices. In this project, the extensive research present in the field is first explored before two open-source evaluation testbenches for score following--one quantitative and the other qualitative--are introduced. A new way of obtaining quantitative testbench data is proposed, and the QualScofo dataset for qualitative benchmarking is introduced. Subsequently, three different score followers, each of a different class, are implemented. First, a beat-based follower for an interactive conductor application--the TuneApp Conductor--is created to demonstrate an entertaining application of score following. Then, an Approximate String Matching (ASM) non-real-time follower is implemented to complement the quantitative testbench and provide more technical background details of score following. Finally, a Constant Q-Transform (CQT) Dynamic Time War** (DTW) score follower robust against major challenges in score following (such as polyphonic music and performance deviations) is outlined and implemented; it is shown that this CQT-based approach consistently and significantly outperforms a commonly used FFT-based approach in extracting audio features for score following. △ Less

Submitted 6 May, 2022; originally announced May 2022.

Comments: Imperial College London MEng Final Year Project Report

arXiv:2204.00176 [pdf, other]

Better Intermediates Improve CTC Inference

Authors: Tatsuya Komatsu, Yusuke Fujita, Jaesong Lee, Lukas Lee, Shinji Watanabe, Yusuke Kida

Abstract: This paper proposes a method for improved CTC inference with searched intermediates and multi-pass conditioning. The paper first formulates self-conditioned CTC as a probabilistic model with an intermediate prediction as a latent representation and provides a tractable conditioning framework. We then propose two new conditioning methods based on the new formulation: (1) Searched intermediate condi… ▽ More This paper proposes a method for improved CTC inference with searched intermediates and multi-pass conditioning. The paper first formulates self-conditioned CTC as a probabilistic model with an intermediate prediction as a latent representation and provides a tractable conditioning framework. We then propose two new conditioning methods based on the new formulation: (1) Searched intermediate conditioning that refines intermediate predictions with beam-search, (2) Multi-pass conditioning that uses predictions of previous inference for conditioning the next inference. These new approaches enable better conditioning than the original self-conditioned CTC during inference and improve the final performance. Experiments with the LibriSpeech dataset show relative 3%/12% performance improvement at the maximum in test clean/other sets compared to the original self-conditioned CTC. △ Less

Submitted 31 March, 2022; originally announced April 2022.

Comments: 5 pages, submitted INTERSPEECH2022

arXiv:2203.16868 [pdf, other]

Memory-Efficient Training of RNN-Transducer with Sampled Softmax

Authors: Jaesong Lee, Lukas Lee, Shinji Watanabe

Abstract: RNN-Transducer has been one of promising architectures for end-to-end automatic speech recognition. Although RNN-Transducer has many advantages including its strong accuracy and streaming-friendly property, its high memory consumption during training has been a critical problem for development. In this work, we propose to apply sampled softmax to RNN-Transducer, which requires only a small subset… ▽ More RNN-Transducer has been one of promising architectures for end-to-end automatic speech recognition. Although RNN-Transducer has many advantages including its strong accuracy and streaming-friendly property, its high memory consumption during training has been a critical problem for development. In this work, we propose to apply sampled softmax to RNN-Transducer, which requires only a small subset of vocabulary during training thus saves its memory consumption. We further extend sampled softmax to optimize memory consumption for a minibatch, and employ distributions of auxiliary CTC losses for sampling vocabulary to improve model accuracy. We present experimental results on LibriSpeech, AISHELL-1, and CSJ-APS, where sampled softmax greatly reduces memory consumption and still maintains the accuracy of the baseline model. △ Less

Submitted 31 March, 2022; originally announced March 2022.

Comments: Submitted to INTERSPEECH 2022

arXiv:2203.04911 [pdf, other]

DUAL: Discrete Spoken Unit Adaptive Learning for Textless Spoken Question Answering

Authors: Guan-Ting Lin, Yung-Sung Chuang, Ho-Lam Chung, Shu-wen Yang, Hsuan-Jui Chen, Shuyan Dong, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Lin-shan Lee

Abstract: Spoken Question Answering (SQA) is to find the answer from a spoken document given a question, which is crucial for personal assistants when replying to the queries from the users. Existing SQA methods all rely on Automatic Speech Recognition (ASR) transcripts. Not only does ASR need to be trained with massive annotated data that are time and cost-prohibitive to collect for low-resourced languages… ▽ More Spoken Question Answering (SQA) is to find the answer from a spoken document given a question, which is crucial for personal assistants when replying to the queries from the users. Existing SQA methods all rely on Automatic Speech Recognition (ASR) transcripts. Not only does ASR need to be trained with massive annotated data that are time and cost-prohibitive to collect for low-resourced languages, but more importantly, very often the answers to the questions include name entities or out-of-vocabulary words that cannot be recognized correctly. Also, ASR aims to minimize recognition errors equally over all words, including many function words irrelevant to the SQA task. Therefore, SQA without ASR transcripts (textless) is always highly desired, although known to be very difficult. This work proposes Discrete Spoken Unit Adaptive Learning (DUAL), leveraging unlabeled data for pre-training and fine-tuned by the SQA downstream task. The time intervals of spoken answers can be directly predicted from spoken documents. We also release a new SQA benchmark corpus, NMSQA, for data with more realistic scenarios. We empirically showed that DUAL yields results comparable to those obtained by cascading ASR and text QA model and robust to real-world data. Our code and model will be open-sourced. △ Less

Submitted 21 June, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

Comments: Accepted by Interspeech 2022

arXiv:2111.13486 [pdf, other]

When Creators Meet the Metaverse: A Survey on Computational Arts

Authors: Lik-Hang Lee, Zijun Lin, Rui Hu, Zhengya Gong, Abhishek Kumar, Tangyao Li, Sijia Li, Pan Hui

Abstract: The metaverse, enormous virtual-physical cyberspace, has brought unprecedented opportunities for artists to blend every corner of our physical surroundings with digital creativity. This article conducts a comprehensive survey on computational arts, in which seven critical topics are relevant to the metaverse, describing novel artworks in blended virtual-physical realities. The topics first cover t… ▽ More The metaverse, enormous virtual-physical cyberspace, has brought unprecedented opportunities for artists to blend every corner of our physical surroundings with digital creativity. This article conducts a comprehensive survey on computational arts, in which seven critical topics are relevant to the metaverse, describing novel artworks in blended virtual-physical realities. The topics first cover the building elements for the metaverse, e.g., virtual scenes and characters, auditory, textual elements. Next, several remarkable types of novel creations in the expanded horizons of metaverse cyberspace have been reflected, such as immersive arts, robotic arts, and other user-centric approaches fuelling contemporary creative outputs. Finally, we propose several research agendas: democratising computational arts, digital privacy, and safety for metaverse artists, ownership recognition for digital artworks, technological challenges, and so on. The survey also serves as introductory material for artists and metaverse technologists to begin creations in the realm of surrealistic cyberspace. △ Less

Submitted 26 November, 2021; originally announced November 2021.

Comments: Submitted to ACM Computing Surveys, 36 pages

ACM Class: A.1; K.0

arXiv:2108.12719 [pdf, other]

A Dual Adversarial Calibration Framework for Automatic Fetal Brain Biometry

Authors: Yuan Gao, Lok Hin Lee, Richard Droste, Rachel Craik, Sridevi Beriwal, Aris Papageorghiou, Alison Noble

Abstract: This paper presents a novel approach to automatic fetal brain biometry motivated by needs in low- and medium- income countries. Specifically, we leverage high-end (HE) ultrasound images to build a biometry solution for low-cost (LC) point-of-care ultrasound images. We propose a novel unsupervised domain adaptation approach to train deep models to be invariant to significant image distribution shif… ▽ More This paper presents a novel approach to automatic fetal brain biometry motivated by needs in low- and medium- income countries. Specifically, we leverage high-end (HE) ultrasound images to build a biometry solution for low-cost (LC) point-of-care ultrasound images. We propose a novel unsupervised domain adaptation approach to train deep models to be invariant to significant image distribution shift between the image types. Our proposed method, which employs a Dual Adversarial Calibration (DAC) framework, consists of adversarial pathways which enforce model invariance to; i) adversarial perturbations in the feature space derived from LC images, and ii) appearance domain discrepancy. Our Dual Adversarial Calibration method estimates transcerebellar diameter and head circumference on images from low-cost ultrasound devices with a mean absolute error (MAE) of 2.43mm and 1.65mm, compared with 7.28 mm and 5.65 mm respectively for SOTA. △ Less

Submitted 28 August, 2021; originally announced August 2021.

Comments: CVAMD ICCV 2021

arXiv:2104.01616 [pdf, other]

Towards Lifelong Learning of End-to-end ASR

Authors: Heng-Jui Chang, Hung-yi Lee, Lin-shan Lee

Abstract: Automatic speech recognition (ASR) technologies today are primarily optimized for given datasets; thus, any changes in the application environment (e.g., acoustic conditions or topic domains) may inevitably degrade the performance. We can collect new data describing the new environment and fine-tune the system, but this naturally leads to higher error rates for the earlier datasets, referred to as… ▽ More Automatic speech recognition (ASR) technologies today are primarily optimized for given datasets; thus, any changes in the application environment (e.g., acoustic conditions or topic domains) may inevitably degrade the performance. We can collect new data describing the new environment and fine-tune the system, but this naturally leads to higher error rates for the earlier datasets, referred to as catastrophic forgetting. The concept of lifelong learning (LLL) aiming to enable a machine to sequentially learn new tasks from new datasets describing the changing real world without forgetting the previously learned knowledge is thus brought to attention. This paper reports, to our knowledge, the first effort to extensively consider and analyze the use of various approaches of LLL in end-to-end (E2E) ASR, including proposing novel methods in saving data for past domains to mitigate the catastrophic forgetting problem. An overall relative reduction of 28.7% in WER was achieved compared to the fine-tuning baseline when sequentially learning on three very different benchmark corpora. This can be the first step toward the highly desired ASR technologies capable of synchronizing with the continuously changing real world. △ Less

Submitted 2 July, 2021; v1 submitted 4 April, 2021; originally announced April 2021.

Comments: Interspeech 2021. We acknowledge the support of Salesforce Research Deep Learning Grant

arXiv:2102.11677 [pdf, other]

Cell abundance aware deep learning for cell detection on highly imbalanced pathological data

Authors: Yeman Brhane Hagos, Catherine SY Lecat, Dominic Patel, Lydia Lee, Thien-An Tran, Manuel Rodriguez- Justo, Kwee Yong, Yinyin Yuan

Abstract: Automated analysis of tissue sections allows a better understanding of disease biology and may reveal biomarkers that could guide prognosis or treatment selection. In digital pathology, less abundant cell types can be of biological significance, but their scarcity can result in biased and sub-optimal cell detection model. To minimize the effect of cell imbalance on cell detection, we proposed a de… ▽ More Automated analysis of tissue sections allows a better understanding of disease biology and may reveal biomarkers that could guide prognosis or treatment selection. In digital pathology, less abundant cell types can be of biological significance, but their scarcity can result in biased and sub-optimal cell detection model. To minimize the effect of cell imbalance on cell detection, we proposed a deep learning pipeline that considers the abundance of cell types during model training. Cell weight images were generated, which assign larger weights to less abundant cells and used the weights to regularize Dice overlap loss function. The model was trained and evaluated on myeloma bone marrow trephine samples. Our model obtained a cell detection F1-score of 0.78, a 2% increase compared to baseline models, and it outperformed baseline models at detecting rare cell types. We found that scaling deep learning loss function by the abundance of cells improves cell detection performance. Our results demonstrate the importance of incorporating domain knowledge on deep learning methods for pathological data with class imbalance. △ Less

Submitted 23 February, 2021; originally announced February 2021.

Comments: Accepted at The IEEE International Symposium on Biomedical Imaging (ISBI) 2021, 5 pages, 5 figures

arXiv:2010.14150 [pdf, other]

FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention

Authors: Yist Y. Lin, Chung-Ming Chien, Jheng-Hao Lin, Hung-yi Lee, Lin-shan Lee

Abstract: Any-to-any voice conversion aims to convert the voice from and to any speakers even unseen during training, which is much more challenging compared to one-to-one or many-to-many tasks, but much more attractive in real-world scenarios. In this paper we proposed FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0, while the spectra… ▽ More Any-to-any voice conversion aims to convert the voice from and to any speakers even unseen during training, which is much more challenging compared to one-to-one or many-to-many tasks, but much more attractive in real-world scenarios. In this paper we proposed FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0, while the spectral features of the utterance(s) from the target speaker are obtained from log mel-spectrograms. By aligning the hidden structures of the two different feature spaces with a two-stage training process, FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance, all based on the attention mechanism of Transformer as verified with analysis on attention maps, and is accomplished end-to-end. This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information and doesn't require parallel data. Objective evaluation based on speaker verification and subjective evaluation with MOS both showed that this approach outperformed SOTA approaches, such as AdaIN-VC and AutoVC. △ Less

Submitted 3 May, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

Comments: To appear in the proceedings of ICASSP 2021, equal contribution from first two authors

arXiv:2005.08781 [pdf, other]

Defending Your Voice: Adversarial Attack on Voice Conversion

Authors: Chien-yu Huang, Yist Y. Lin, Hung-yi Lee, Lin-shan Lee

Abstract: Substantial improvements have been achieved in recent years in voice conversion, which converts the speaker characteristics of an utterance into those of another speaker without changing the linguistic content of the utterance. Nonetheless, the improved conversion technologies also led to concerns about privacy and authentication. It thus becomes highly desired to be able to prevent one's voice fr… ▽ More Substantial improvements have been achieved in recent years in voice conversion, which converts the speaker characteristics of an utterance into those of another speaker without changing the linguistic content of the utterance. Nonetheless, the improved conversion technologies also led to concerns about privacy and authentication. It thus becomes highly desired to be able to prevent one's voice from being improperly utilized with such voice conversion technologies. This is why we report in this paper the first known attempt to perform adversarial attack on voice conversion. We introduce human imperceptible noise into the utterances of a speaker whose voice is to be defended. Given these adversarial examples, voice conversion models cannot convert other utterances so as to sound like being produced by the defended speaker. Preliminary experiments were conducted on two currently state-of-the-art zero-shot voice conversion models. Objective and subjective evaluation results in both white-box and black-box scenarios are reported. It was shown that the speaker characteristics of the converted utterances were made obviously different from those of the defended speaker, while the adversarial examples of the defended speaker are not distinguishable from the authentic utterances. △ Less

Submitted 4 May, 2021; v1 submitted 18 May, 2020; originally announced May 2020.

Comments: Accepted by SLT 2021

arXiv:2005.01972 [pdf, other]

End-to-end Whispered Speech Recognition with Frequency-weighted Approaches and Pseudo Whisper Pre-training

Authors: Heng-Jui Chang, Alexander H. Liu, Hung-yi Lee, Lin-shan Lee

Abstract: Whispering is an important mode of human speech, but no end-to-end recognition results for it were reported yet, probably due to the scarcity of available whispered speech data. In this paper, we present several approaches for end-to-end (E2E) recognition of whispered speech considering the special characteristics of whispered speech and the scarcity of data. This includes a frequency-weighted Spe… ▽ More Whispering is an important mode of human speech, but no end-to-end recognition results for it were reported yet, probably due to the scarcity of available whispered speech data. In this paper, we present several approaches for end-to-end (E2E) recognition of whispered speech considering the special characteristics of whispered speech and the scarcity of data. This includes a frequency-weighted SpecAugment policy and a frequency-divided CNN feature extractor for better capturing the high-frequency structures of whispered speech, and a layer-wise transfer learning approach to pre-train a model with normal or normal-to-whispered converted speech then fine-tune it with whispered speech to bridge the gap between whispered and normal speech. We achieve an overall relative reduction of 19.8% in PER and 44.4% in CER on a relatively small whispered TIMIT corpus. The results indicate as long as we have a good E2E model pre-trained on normal or pseudo-whispered speech, a relatively small set of whispered speech may suffice to obtain a reasonably good E2E whispered speech recognizer. △ Less

Submitted 8 November, 2020; v1 submitted 5 May, 2020; originally announced May 2020.

Comments: Accepted to IEEE SLT 2021

Journal ref: 2021 IEEE Spoken Language Technology Workshop (SLT)

arXiv:2004.06048 [pdf]

doi 10.1109/JESTPE.2020.2986372

Highly-Efficient Single-Switch-Regulated Resonant Wireless Power Receiver with Hybrid Modulation

Authors: Kerui Li, Albert Ting Leung Lee, Siew-Chong Tan, Ron Shu Yuen Hui

Abstract: In this paper, a highly-efficient single-switch-regulated resonant wireless power receiver with hybrid modulation is proposed. To achieve both high efficiency and good output voltage regulation, phase shift and pulse width hybrid modulation are simultaneously applied. The soft switching operation in this topology is achieved by the cycle-by-cycle phase shift adjustment between the input current an… ▽ More In this paper, a highly-efficient single-switch-regulated resonant wireless power receiver with hybrid modulation is proposed. To achieve both high efficiency and good output voltage regulation, phase shift and pulse width hybrid modulation are simultaneously applied. The soft switching operation in this topology is achieved by the cycle-by-cycle phase shift adjustment between the input current and the gate drive signal and also attributed to the reactive components such as the series-compensated secondary coil and the parasitic capacitor of the active switch . The soft switching operation also leads to high efficiency and low EMI. By adjusting the duty ratio of the switch, tight regulation of the output voltage can be attained. The steady-state and dynamic models of the resonant receiver with hybrid modulation are analytically derived in order to properly design the feedback controller. An experimental setup of a two-coil wireless power transfer system, including the hardware prototype of the proposed receiver, is constructed for experimental verification. The experimental results show the effectiveness of the soft-switching operation in the receiver with high efficiency while maintaining good regulation of the output voltage, regardless of line and load variations. △ Less

Submitted 5 January, 2021; v1 submitted 9 April, 2020; originally announced April 2020.

Comments: in IEEE Journal of Emerging and Selected Topics in Power Electronics. 2020

arXiv:1910.12740 [pdf, other]

Sequence-to-sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding

Authors: Alexander H. Liu, Tzu-Wei Sung, Shun-Po Chuang, Hung-yi Lee, Lin-shan Lee

Abstract: In this paper, we investigate the benefit that off-the-shelf word embedding can bring to the sequence-to-sequence (seq-to-seq) automatic speech recognition (ASR). We first introduced the word embedding regularization by maximizing the cosine similarity between a transformed decoder feature and the target word embedding. Based on the regularized decoder, we further proposed the fused decoding mecha… ▽ More In this paper, we investigate the benefit that off-the-shelf word embedding can bring to the sequence-to-sequence (seq-to-seq) automatic speech recognition (ASR). We first introduced the word embedding regularization by maximizing the cosine similarity between a transformed decoder feature and the target word embedding. Based on the regularized decoder, we further proposed the fused decoding mechanism. This allows the decoder to consider the semantic consistency during decoding by absorbing the information carried by the transformed decoder feature, which is learned to be close to the target word embedding. Initial results on LibriSpeech demonstrated that pre-trained word embedding can significantly lower ASR recognition error with a negligible cost, and the choice of word embedding algorithms among Skip-gram, CBOW and BERT is important. △ Less

Submitted 5 February, 2020; v1 submitted 28 October, 2019; originally announced October 2019.

Comments: ICASSP 2020

arXiv:1910.12729 [pdf, other]

Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning

Authors: Alexander H. Liu, Tao Tu, Hung-yi Lee, Lin-shan Lee

Abstract: In this paper we propose a Sequential Representation Quantization AutoEncoder (SeqRQ-AE) to learn from primarily unpaired audio data and produce sequences of representations very close to phoneme sequences of speech utterances. This is achieved by proper temporal segmentation to make the representations phoneme-synchronized, and proper phonetic clustering to have total number of distinct represent… ▽ More In this paper we propose a Sequential Representation Quantization AutoEncoder (SeqRQ-AE) to learn from primarily unpaired audio data and produce sequences of representations very close to phoneme sequences of speech utterances. This is achieved by proper temporal segmentation to make the representations phoneme-synchronized, and proper phonetic clustering to have total number of distinct representations close to the number of phonemes. Map** between the distinct representations and phonemes is learned from a small amount of annotated paired data. Preliminary experiments on LJSpeech demonstrated the learned representations for vowels have relative locations in latent space in good parallel to that shown in the IPA vowel chart defined by linguistics experts. With less than 20 minutes of annotated speech, our method outperformed existing methods on phoneme recognition and is able to synthesize intelligible speech that beats our baseline model. △ Less

Submitted 5 February, 2020; v1 submitted 28 October, 2019; originally announced October 2019.

Comments: ICASSP 2020, equal contribution from first two authors

arXiv:1910.12706 [pdf, ps, other]

Interrupted and cascaded permutation invariant training for speech separation

Authors: Gene-** Yang, Szu-Lin Wu, Yao-Wen Mao, Hung-yi Lee, Lin-shan Lee

Abstract: Permutation Invariant Training (PIT) has long been a step** stone method for training speech separation model in handling the label ambiguity problem. With PIT selecting the minimum cost label assignments dynamically, very few studies considered the separation problem to be optimizing both the model parameters and the label assignments, but focused on searching for good model architecture and pa… ▽ More Permutation Invariant Training (PIT) has long been a step** stone method for training speech separation model in handling the label ambiguity problem. With PIT selecting the minimum cost label assignments dynamically, very few studies considered the separation problem to be optimizing both the model parameters and the label assignments, but focused on searching for good model architecture and parameters. In this paper, we investigate instead for a given model architecture the various flexible label assignment strategies for training the model, rather than directly using PIT. Surprisingly, we discover a significant performance boost compared to PIT is possible if the model is trained with fixed label assignments and a good set of labels is chosen. With fixed label training cascaded between two sections of PIT, we achieved the state-of-the-art performance on WSJ0-2mix without changing the model architecture at all. △ Less

Submitted 28 October, 2019; originally announced October 2019.

arXiv:1910.11559 [pdf, other]

SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering

Authors: Yung-Sung Chuang, Chi-Liang Liu, Hung-Yi Lee, Lin-shan Lee

Abstract: While various end-to-end models for spoken language understanding tasks have been explored recently, this paper is probably the first known attempt to challenge the very difficult task of end-to-end spoken question answering (SQA). Learning from the very successful BERT model for various text processing tasks, here we proposed an audio-and-text jointly learned SpeechBERT model. This model outperfo… ▽ More While various end-to-end models for spoken language understanding tasks have been explored recently, this paper is probably the first known attempt to challenge the very difficult task of end-to-end spoken question answering (SQA). Learning from the very successful BERT model for various text processing tasks, here we proposed an audio-and-text jointly learned SpeechBERT model. This model outperformed the conventional approach of cascading ASR with the following text question answering (TQA) model on datasets including ASR errors in answer spans, because the end-to-end model was shown to be able to extract information out of audio data before ASR produced errors. When ensembling the proposed end-to-end model with the cascade architecture, even better performance was achieved. In addition to the potential of end-to-end SQA, the SpeechBERT can also be considered for many other spoken language understanding tasks just as BERT for many text processing tasks. △ Less

Submitted 11 August, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

Comments: Interspeech 2020

arXiv:1910.10416 [pdf, other]

6G Massive Radio Access Networks: Key Issues, Technologies, and Future Challenges

Authors: Ying Loong Lee, Donghong Qin, Li-Chun Wang, Gek Hong, Sim

Abstract: Driven by the emerging use cases in massive access future networks, there is a need for technological advancements and evolutions for wireless communications beyond the fifth-generation (5G) networks. In particular, we envisage the upcoming sixth-generation (6G) networks to consist of numerous devices demanding extremely high-performance interconnections even under strenuous scenarios such as dive… ▽ More Driven by the emerging use cases in massive access future networks, there is a need for technological advancements and evolutions for wireless communications beyond the fifth-generation (5G) networks. In particular, we envisage the upcoming sixth-generation (6G) networks to consist of numerous devices demanding extremely high-performance interconnections even under strenuous scenarios such as diverse mobility, extreme density, and dynamic environment. To cater for such a demand, investigation on flexible and sustainable radio access network (RAN) techniques capable of supporting highly diverse requirements and massive connectivity is of utmost importance. To this end, this paper first outlines the key driving applications for 6G, including smart city and factory, which trigger the transformation of existing RAN techniques. We then examine and provide in-depth discussions on several critical performance requirements (i.e., the level of flexibility, the support for massive interconnectivity, and energy efficiency), issues, enabling technologies, and challenges in designing 6G massive RANs. We conclude the article by providing several artificial-intelligence-based approaches to overcome future challenges. △ Less

Submitted 23 October, 2019; originally announced October 2019.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:1904.07845 [pdf, other]

Improved Speech Separation with Time-and-Frequency Cross-domain Joint Embedding and Clustering

Authors: Gene-** Yang, Chao-I Tuan, Hung-Yi Lee, Lin-shan Lee

Abstract: Speech separation has been very successful with deep learning techniques. Substantial effort has been reported based on approaches over spectrogram, which is well known as the standard time-and-frequency cross-domain representation for speech signals. It is highly correlated to the phonetic structure of speech, or "how the speech sounds" when perceived by human, but primarily frequency domain feat… ▽ More Speech separation has been very successful with deep learning techniques. Substantial effort has been reported based on approaches over spectrogram, which is well known as the standard time-and-frequency cross-domain representation for speech signals. It is highly correlated to the phonetic structure of speech, or "how the speech sounds" when perceived by human, but primarily frequency domain features carrying temporal behaviour. Very impressive work achieving speech separation over time domain was reported recently, probably because waveforms in time domain may describe the different realizations of speech in a more precise way than spectrogram. In this paper, we propose a framework properly integrating the above two directions, ho** to achieve both purposes. We construct a time-and-frequency feature map by concatenating the 1-dim convolution encoded feature map (for time domain) and the spectrogram (for frequency domain), which was then processed by an embedding network and clustering approaches very similar to those used in time and frequency domain prior works. In this way, the information in the time and frequency domains, as well as the interactions between them, can be jointly considered during embedding and clustering. Very encouraging results (state-of-the-art to our knowledge) were obtained with WSJ0-2mix dataset in preliminary experiments. △ Less

Submitted 16 April, 2019; originally announced April 2019.

Comments: Submitted to Interspeech 2019

arXiv:1904.05078 [pdf, other]

From Semi-supervised to Almost-unsupervised Speech Recognition with Very-low Resource by Jointly Learning Phonetic Structures from Audio and Text Embeddings

Authors: Yi-Chen Chen, Sung-Feng Huang, Hung-yi Lee, Lin-shan Lee

Abstract: Producing a large amount of annotated speech data for training ASR systems remains difficult for more than 95% of languages all over the world which are low-resourced. However, we note human babies start to learn the language by the sounds (or phonetic structures) of a small number of exemplar words, and "generalize" such knowledge to other words without hearing a large amount of data. We initiate… ▽ More Producing a large amount of annotated speech data for training ASR systems remains difficult for more than 95% of languages all over the world which are low-resourced. However, we note human babies start to learn the language by the sounds (or phonetic structures) of a small number of exemplar words, and "generalize" such knowledge to other words without hearing a large amount of data. We initiate some preliminary work in this direction. Audio Word2Vec is used to learn the phonetic structures from spoken words (signal segments), while another autoencoder is used to learn the phonetic structures from text words. The relationships among the above two can be learned jointly, or separately after the above two are well trained. This relationship can be used in speech recognition with very low resource. In the initial experiments on the TIMIT dataset, only 2.1 hours of speech data (in which 2500 spoken words were annotated and the rest unlabeled) gave a word error rate of 44.6%, and this number can be reduced to 34.2% if 4.1 hr of speech data (in which 20000 spoken words were annotated) were given. These results are not satisfactory, but a good starting point. △ Less

Submitted 10 April, 2019; originally announced April 2019.

arXiv:1904.04100 [pdf, other]

Completely Unsupervised Speech Recognition By A Generative Adversarial Network Harmonized With Iteratively Refined Hidden Markov Models

Authors: Kuan-Yu Chen, Che-** Tsai, Da-Rong Liu, Hung-Yi Lee, Lin-shan Lee

Abstract: Producing a large annotated speech corpus for training ASR systems remains difficult for more than 95% of languages all over the world which are low-resourced, but collecting a relatively big unlabeled data set for such languages is more achievable. This is why some initial effort have been reported on completely unsupervised speech recognition learned from unlabeled data only, although with relat… ▽ More Producing a large annotated speech corpus for training ASR systems remains difficult for more than 95% of languages all over the world which are low-resourced, but collecting a relatively big unlabeled data set for such languages is more achievable. This is why some initial effort have been reported on completely unsupervised speech recognition learned from unlabeled data only, although with relatively high error rates. In this paper, we develop a Generative Adversarial Network (GAN) to achieve this purpose, in which a Generator and a Discriminator learn from each other iteratively to improve the performance. We further use a set of Hidden Markov Models (HMMs) iteratively refined from the machine generated labels to work in harmony with the GAN. The initial experiments on TIMIT data set achieve an phone error rate of 33.1%, which is 8.5% lower than the previous state-of-the-art. △ Less

Submitted 23 August, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

Comments: Accepted by Interspeech 2019

arXiv:1810.12566 [pdf, other]

Almost-unsupervised Speech Recognition with Close-to-zero Resource Based on Phonetic Structures Learned from Very Small Unpaired Speech and Text Data

Authors: Yi-Chen Chen, Chia-Hao Shen, Sung-Feng Huang, Hung-yi Lee, Lin-shan Lee

Abstract: Producing a large amount of annotated speech data for training ASR systems remains difficult for more than 95% of languages all over the world which are low-resourced. However, we note human babies start to learn the language by the sounds of a small number of exemplar words without hearing a large amount of data. We initiate some preliminary work in this direction in this paper. Audio Word2Vec is… ▽ More Producing a large amount of annotated speech data for training ASR systems remains difficult for more than 95% of languages all over the world which are low-resourced. However, we note human babies start to learn the language by the sounds of a small number of exemplar words without hearing a large amount of data. We initiate some preliminary work in this direction in this paper. Audio Word2Vec is used to obtain embeddings of spoken words which carry phonetic information extracted from the signals. An autoencoder is used to generate embeddings of text words based on the articulatory features for the phoneme sequences. Both sets of embeddings for spoken and text words describe similar phonetic structures among words in their respective latent spaces. A map** relation from the audio embeddings to text embeddings actually gives the word-level ASR. This can be learned by aligning a small number of spoken words and the corresponding text words in the embedding spaces. In the initial experiments only 200 annotated spoken words and one hour of speech data without annotation gave a word accuracy of 27.5%, which is low but a good starting point. △ Less

Submitted 30 October, 2018; originally announced October 2018.

arXiv:1808.03113 [pdf, other]

Rhythm-Flexible Voice Conversion without Parallel Data Using Cycle-GAN over Phoneme Posteriorgram Sequences

Authors: Cheng-chieh Yeh, Po-chun Hsu, Ju-chieh Chou, Hung-yi Lee, Lin-shan Lee

Abstract: Speaking rate refers to the average number of phonemes within some unit time, while the rhythmic patterns refer to duration distributions for realizations of different phonemes within different phonetic structures. Both are key components of prosody in speech, which is different for different speakers. Models like cycle-consistent adversarial network (Cycle-GAN) and variational auto-encoder (VAE)… ▽ More Speaking rate refers to the average number of phonemes within some unit time, while the rhythmic patterns refer to duration distributions for realizations of different phonemes within different phonetic structures. Both are key components of prosody in speech, which is different for different speakers. Models like cycle-consistent adversarial network (Cycle-GAN) and variational auto-encoder (VAE) have been successfully applied to voice conversion tasks without parallel data. However, due to the neural network architectures and feature vectors chosen for these approaches, the length of the predicted utterance has to be fixed to that of the input utterance, which limits the flexibility in mimicking the speaking rates and rhythmic patterns for the target speaker. On the other hand, sequence-to-sequence learning model was used to remove the above length constraint, but parallel training data are needed. In this paper, we propose an approach utilizing sequence-to-sequence model trained with unsupervised Cycle-GAN to perform the transformation between the phoneme posteriorgram sequences for different speakers. In this way, the length constraint mentioned above is removed to offer rhythm-flexible voice conversion without requiring parallel data. Preliminary evaluation on two datasets showed very encouraging results. △ Less

Submitted 9 August, 2018; originally announced August 2018.

Comments: 8 pages, 6 figures, Submitted to SLT 2018

arXiv:1807.08089 [pdf, other]

Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval

Authors: Yi-Chen Chen, Sung-Feng Huang, Chia-Hao Shen, Hung-yi Lee, Lin-shan Lee

Abstract: Word embedding or Word2Vec has been successful in offering semantics for text words learned from the context of words. Audio Word2Vec was shown to offer phonetic structures for spoken words (signal segments for words) learned from signals within spoken words. This paper proposes a two-stage framework to perform phonetic-and-semantic embedding on spoken words considering the context of the spoken w… ▽ More Word embedding or Word2Vec has been successful in offering semantics for text words learned from the context of words. Audio Word2Vec was shown to offer phonetic structures for spoken words (signal segments for words) learned from signals within spoken words. This paper proposes a two-stage framework to perform phonetic-and-semantic embedding on spoken words considering the context of the spoken words. Stage 1 performs phonetic embedding with speaker characteristics disentangled. Stage 2 then performs semantic embedding in addition. We further propose to evaluate the phonetic-and-semantic nature of the audio embeddings obtained in Stage 2 by parallelizing with text embeddings. In general, phonetic structure and semantics inevitably disturb each other. For example the words "brother" and "sister" are close in semantics but very different in phonetic structure, while the words "brother" and "bother" are in the other way around. But phonetic-and-semantic embedding is attractive, as shown in the initial experiments on spoken document retrieval. Not only spoken documents including the spoken query can be retrieved based on the phonetic structures, but spoken documents semantically related to the query but not including the query can also be retrieved based on the semantics. △ Less

Submitted 19 January, 2019; v1 submitted 21 July, 2018; originally announced July 2018.

Comments: Accepted at SLT2018

arXiv:1804.05306 [pdf, other]

Transcribing Lyrics From Commercial Song Audio: The First Step Towards Singing Content Processing

Authors: Che-** Tsai, Yi-Lin Tuan, Lin-shan Lee

Abstract: Spoken content processing (such as retrieval and browsing) is maturing, but the singing content is still almost completely left out. Songs are human voice carrying plenty of semantic information just as speech, and may be considered as a special type of speech with highly flexible prosody. The various problems in song audio, for example the significantly changing phone duration over highly flexibl… ▽ More Spoken content processing (such as retrieval and browsing) is maturing, but the singing content is still almost completely left out. Songs are human voice carrying plenty of semantic information just as speech, and may be considered as a special type of speech with highly flexible prosody. The various problems in song audio, for example the significantly changing phone duration over highly flexible pitch contours, make the recognition of lyrics from song audio much more difficult. This paper reports an initial attempt towards this goal. We collected music-removed version of English songs directly from commercial singing content. The best results were obtained by TDNN-LSTM with data augmentation with 3-fold speed perturbation plus some special approaches. The WER achieved (73.90%) was significantly lower than the baseline (96.21%), but still relatively high. △ Less

Submitted 15 April, 2018; originally announced April 2018.

Comments: Accepted as a conference paper at ICASSP 2018

arXiv:1804.02812 [pdf, other]

Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations

Authors: Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee, Lin-shan Lee

Abstract: Recently, cycle-consistent adversarial network (Cycle-GAN) has been successfully applied to voice conversion to a different speaker without parallel data, although in those approaches an individual model is needed for each target speaker. In this paper, we propose an adversarial learning framework for voice conversion, with which a single model can be trained to convert the voice to many different… ▽ More Recently, cycle-consistent adversarial network (Cycle-GAN) has been successfully applied to voice conversion to a different speaker without parallel data, although in those approaches an individual model is needed for each target speaker. In this paper, we propose an adversarial learning framework for voice conversion, with which a single model can be trained to convert the voice to many different speakers, all without parallel data, by separating the speaker characteristics from the linguistic content in speech signals. An autoencoder is first trained to extract speaker-independent latent representations and speaker embedding separately using another auxiliary speaker classifier to regularize the latent representation. The decoder then takes the speaker-independent latent representation and the target speaker embedding as the input to generate the voice of the target speaker with the linguistic content of the source utterance. The quality of decoder output is further improved by patching with the residual signal produced by another pair of generator and discriminator. A target speaker set size of 20 was tested in the preliminary experiments, and very good voice quality was obtained. Conventional voice conversion metrics are reported. We also show that the speaker information has been properly reduced from the latent representations. △ Less

Submitted 24 June, 2018; v1 submitted 9 April, 2018; originally announced April 2018.

Comments: Accepted to Interspeech 2018

Showing 1–33 of 33 results for author: Lee, L