-
Continual Learning Optimizations for Auto-regressive Decoder of Multilingual ASR systems
Authors:
Chin Yuen Kwok,
Jia Qi Yip,
Eng Siong Chng
Abstract:
Continual Learning (CL) involves fine-tuning pre-trained models with new data while maintaining the performance on the pre-trained data. This is particularly relevant for expanding multilingual ASR (MASR) capabilities. However, existing CL methods, mainly designed for computer vision and reinforcement learning tasks, often yield sub-optimal results when directly applied to MASR. We hypothesise tha…
▽ More
Continual Learning (CL) involves fine-tuning pre-trained models with new data while maintaining the performance on the pre-trained data. This is particularly relevant for expanding multilingual ASR (MASR) capabilities. However, existing CL methods, mainly designed for computer vision and reinforcement learning tasks, often yield sub-optimal results when directly applied to MASR. We hypothesise that this is because CL of the auto-regressive decoder in the MASR model is difficult. To verify this, we propose four optimizations on the decoder. They include decoder-layer gradient surgery, freezing unused token embeddings, suppressing output of newly added tokens, and learning rate re-scaling. Our experiments on adapting Whisper to 10 unseen languages from the Common Voice dataset demonstrate that these optimizations reduce the Average Word Error Rate (AWER) of pretrained languages from 14.2% to 12.4% compared with Experience Replay, without compromising the AWER of new languages.
△ Less
Submitted 4 July, 2024;
originally announced July 2024.
-
Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
Authors:
Yuchen Hu,
Chen Chen,
Siyin Wang,
Eng Siong Chng,
Chao Zhang
Abstract:
In this paper, we propose reverse inference optimization (RIO), a simple and effective method designed to enhance the robustness of autoregressive-model-based zero-shot text-to-speech (TTS) systems using reinforcement learning from human feedback (RLHF). To assess the quality of speech produced by the TTS system without human annotations, RIO introduces a novel concept termed as reverse inference…
▽ More
In this paper, we propose reverse inference optimization (RIO), a simple and effective method designed to enhance the robustness of autoregressive-model-based zero-shot text-to-speech (TTS) systems using reinforcement learning from human feedback (RLHF). To assess the quality of speech produced by the TTS system without human annotations, RIO introduces a novel concept termed as reverse inference based on the Bayesian principle, which suggests that a high-quality generated speech should be able to be used as a prompt for subsequent generation using the same TTS model. By leveraging reverse inference as the standard to select exemplars used in RLHF from the speech samples generated by the TTS system itself, RIO steers the subsequent optimization towards a direction of enhancing the TTS robustness. The RIO framework, comprising sampling, automatic annotating, and learning, obviates the need for a reward model or pairwise preference data, and significantly improves the stability of zero-shot TTS performance by reducing the discrepancies between training and inference conditions. Our experimental results verify that RIO can effectively improve both subjective and objective metrics, including mean opinion scores, word error rates, and speaker similarity. Remarkably, RIO can also diminish the incidence of bad outputs to nearly zero percent, rivalling the robustness when using ground-truth speech as the prompt.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection
Authors:
Duc-Tuan Truong,
Ruijie Tao,
Tuan Nguyen,
Hieu-Thi Luong,
Kong Aik Lee,
Eng Siong Chng
Abstract:
Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to the convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in sp…
▽ More
Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to the convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we proposed a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on the ASVspoof 2021 show that with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. Further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
Towards Audio Codec-based Speech Separation
Authors:
Jia Qi Yip,
Shengkui Zhao,
Dianwen Ng,
Eng Siong Chng,
Bin Ma
Abstract:
Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet been applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them imp…
▽ More
Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet been applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them impractical for many edge computing use cases. However, SS is a waveform-masking task where compression tends to introduce distortions that severely impact performance. Here we propose a novel task of Audio Codec-based SS, where SS is performed within the embedding space of a NAC, and propose a new model, Codecformer, to address this task. At inference, Codecformer achieves a 52x reduction in MAC while producing separation performance comparable to a cloud deployment of Sepformer. This method charts a new direction for performing efficient SS in practical scenarios.
△ Less
Submitted 5 July, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
Dataset-Distillation Generative Model for Speech Emotion Recognition
Authors:
Fabian Ritter-Gutierrez,
Kuan-Po Huang,
Jeremy H. M Wong,
Dianwen Ng,
Hung-yi Lee,
Nancy F. Chen,
Eng Siong Chng
Abstract:
Deep learning models for speech rely on large datasets, presenting computational challenges. Yet, performance hinges on training data size. Dataset Distillation (DD) aims to learn a smaller dataset without much performance degradation when training with it. DD has been investigated in computer vision but not yet in speech. This paper presents the first approach for DD to speech targeting Speech Em…
▽ More
Deep learning models for speech rely on large datasets, presenting computational challenges. Yet, performance hinges on training data size. Dataset Distillation (DD) aims to learn a smaller dataset without much performance degradation when training with it. DD has been investigated in computer vision but not yet in speech. This paper presents the first approach for DD to speech targeting Speech Emotion Recognition on IEMOCAP. We employ Generative Adversarial Networks (GANs) not to mimic real data but to distil key discriminative information of IEMOCAP that is useful for downstream training. The GAN then replaces the original dataset and can sample custom synthetic dataset sizes. It performs comparably when following the original class imbalance but improves performance by 0.3% absolute UAR with balanced classes. It also reduces dataset storage and accelerates downstream training by 95% in both cases and reduces speaker information which could help for a privacy application.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback
Authors:
Chen Chen,
Yuchen Hu,
Wen Wu,
Helin Wang,
Eng Siong Chng,
Chao Zhang
Abstract:
In recent years, text-to-speech (TTS) technology has witnessed impressive advancements, particularly with large-scale training datasets, showcasing human-level speech quality and impressive zero-shot capabilities on unseen speakers. However, despite human subjective evaluations, such as the mean opinion score (MOS), remaining the gold standard for assessing the quality of synthetic speech, even st…
▽ More
In recent years, text-to-speech (TTS) technology has witnessed impressive advancements, particularly with large-scale training datasets, showcasing human-level speech quality and impressive zero-shot capabilities on unseen speakers. However, despite human subjective evaluations, such as the mean opinion score (MOS), remaining the gold standard for assessing the quality of synthetic speech, even state-of-the-art TTS approaches have kept human feedback isolated from training that resulted in mismatched training objectives and evaluation metrics. In this work, we investigate a novel topic of integrating subjective human evaluation into the TTS training loop. Inspired by the recent success of reinforcement learning from human feedback, we propose a comprehensive sampling-annotating-learning framework tailored to TTS optimization, namely uncertainty-aware optimization (UNO). Specifically, UNO eliminates the need for a reward model or preference data by directly maximizing the utility of speech generations while considering the uncertainty that lies in the inherent variability in subjective human speech perception and evaluations. Experimental results of both subjective and objective evaluations demonstrate that UNO considerably improves the zero-shot performance of TTS models in terms of MOS, word error rate, and speaker similarity. Additionally, we present a remarkable ability of UNO that it can adapt to the desired speaking style in emotional TTS seamlessly and flexibly.
△ Less
Submitted 2 June, 2024;
originally announced June 2024.
-
Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models
Authors:
Yuchen Hu,
Chen Chen,
Chao-Han Huck Yang,
Chengwei Qin,
Pin-Yu Chen,
Eng Siong Chng,
Chao Zhang
Abstract:
We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architecture with auto-regressive decoding (e.g., Whisper, Canary). Specifica…
▽ More
We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architecture with auto-regressive decoding (e.g., Whisper, Canary). Specifically, we propose a novel indicator that empirically integrates step-wise information during decoding to assess the token-level quality of pseudo labels without ground truth, thereby guiding model updates for effective unsupervised adaptation. Experimental results show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains, and it sometimes even approaches the upper-bound performance of supervised adaptation. Surprisingly, we also observe that STAR prevents the adapted model from the common catastrophic forgetting problem without recalling source-domain data. Furthermore, STAR exhibits high data efficiency that only requires less than one-hour unlabeled data, and seamless generality to alternative large speech models and speech translation tasks. Our code aims to open source to the research communities.
△ Less
Submitted 23 May, 2024;
originally announced May 2024.
-
Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models
Authors:
Yuchen Hu,
Chen Chen,
Chengwei Qin,
Qiushi Zhu,
Eng Siong Chng,
Ruizhe Li
Abstract:
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and rich information in the N-best list, GER shows great effectiveness in enhancing ASR results. However, it still suf…
▽ More
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and rich information in the N-best list, GER shows great effectiveness in enhancing ASR results. However, it still suffers from two limitations: 1) LLMs are unaware of the source speech during GER, which may lead to results that are grammatically correct but violate the source speech content, 2) N-best hypotheses usually only vary in a few tokens, making it redundant to send all of them for GER, which could confuse LLM about which tokens to focus on and thus lead to increased miscorrection. In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. First, we introduce a multimodal LLM (i.e., SpeechGPT) to receive source speech as extra input to improve the fidelity of correction output. Then, we reformat GER as a cloze test with logits calibration to remove the input information redundancy and simplify GER with clear instructions. Experiments show that ClozeGER achieves a new breakthrough over vanilla GER on 9 popular ASR datasets.
△ Less
Submitted 16 May, 2024;
originally announced May 2024.
-
Aligning Speech to Languages to Enhance Code-switching Speech Recognition
Authors:
Hexin Liu,
Xiangyu Zhang,
Leibny Paola Garcia,
Andy W. H. Khong,
Eng Siong Chng,
Shinji Watanabe
Abstract:
Code-switching (CS) refers to the switching of languages within a speech signal and results in language confusion for automatic speech recognition (ASR). To address language confusion, we propose the language alignment loss that performs frame-level language identification using pseudo language labels learned from the ASR decoder. This eliminates the need for frame-level language annotations. To f…
▽ More
Code-switching (CS) refers to the switching of languages within a speech signal and results in language confusion for automatic speech recognition (ASR). To address language confusion, we propose the language alignment loss that performs frame-level language identification using pseudo language labels learned from the ASR decoder. This eliminates the need for frame-level language annotations. To further tackle the complex token alternatives for language modeling in bilingual scenarios, we propose to employ large language models via a generative error correction method. A linguistic hint that incorporates language information (derived from the proposed language alignment loss and decoded hypotheses) is introduced to guide the prompting of large language models. The proposed methods are evaluated on the SEAME dataset and data from the ASRU 2019 Mandarin-English code-switching speech recognition challenge. The incorporation of the proposed language alignment loss demonstrates a higher CS-ASR performance with only a negligible increase in the number of parameters on both datasets compared to the baseline model. This work also highlights the efficacy of language alignment loss in balancing primary-language-dominant bilingual data during training, with an 8.6% relative improvement on the ASRU dataset compared to the baseline model. Performance evaluation using large language models reveals the advantage of the linguistic hint by achieving 14.1% and 5.5% relative improvement on test sets of the ASRU and SEAME datasets, respectively.
△ Less
Submitted 9 March, 2024;
originally announced March 2024.
-
Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model
Authors:
Xiangyu Zhang,
Daijiao Liu,
Hexin Liu,
Qiquan Zhang,
Hanyu Meng,
Leibny Paola Garcia,
Eng Siong Chng,
Lina Yao
Abstract:
Recently, Denoising Diffusion Probabilistic Models (DDPMs) have attained leading performances across a diverse range of generative tasks. However, in the field of speech synthesis, although DDPMs exhibit impressive performance, their long training duration and substantial inference costs hinder practical deployment. Existing approaches primarily focus on enhancing inference speed, while approaches…
▽ More
Recently, Denoising Diffusion Probabilistic Models (DDPMs) have attained leading performances across a diverse range of generative tasks. However, in the field of speech synthesis, although DDPMs exhibit impressive performance, their long training duration and substantial inference costs hinder practical deployment. Existing approaches primarily focus on enhancing inference speed, while approaches to accelerate training a key factor in the costs associated with adding or customizing voices often necessitate complex modifications to the model, compromising their universal applicability. To address the aforementioned challenges, we propose an inquiry: is it possible to enhance the training/inference speed and performance of DDPMs by modifying the speech signal itself? In this paper, we double the training and inference speed of Speech DDPMs by simply redirecting the generative target to the wavelet domain. This method not only achieves comparable or superior performance to the original model in speech synthesis tasks but also demonstrates its versatility. By investigating and utilizing different wavelet bases, our approach proves effective not just in speech synthesis, but also in speech enhancement.
△ Less
Submitted 16 February, 2024;
originally announced February 2024.
-
Preconditioners for the Stochastic Training of Implicit Neural Representations
Authors:
Shin-Fang Chng,
Hemanth Saratchandran,
Simon Lucey
Abstract:
Implicit neural representations have emerged as a powerful technique for encoding complex continuous multidimensional signals as neural networks, enabling a wide range of applications in computer vision, robotics, and geometry. While Adam is commonly used for training due to its stochastic proficiency, it entails lengthy training durations. To address this, we explore alternative optimization tech…
▽ More
Implicit neural representations have emerged as a powerful technique for encoding complex continuous multidimensional signals as neural networks, enabling a wide range of applications in computer vision, robotics, and geometry. While Adam is commonly used for training due to its stochastic proficiency, it entails lengthy training durations. To address this, we explore alternative optimization techniques for accelerated training without sacrificing accuracy. Traditional second-order optimizers like L-BFGS are suboptimal in stochastic settings, making them unsuitable for large-scale data sets. Instead, we propose stochastic training using curvature-aware diagonal preconditioners, showcasing their effectiveness across various signal modalities such as images, shape reconstruction, and Neural Radiance Fields (NeRF).
△ Less
Submitted 13 February, 2024;
originally announced February 2024.
-
GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators
Authors:
Yuchen Hu,
Chen Chen,
Chao-Han Huck Yang,
Ruizhe Li,
Dong Zhang,
Zhehuai Chen,
Eng Siong Chng
Abstract:
Recent advances in large language models (LLMs) have stepped forward the development of multilingual speech and machine translation by its reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the divers…
▽ More
Recent advances in large language models (LLMs) have stepped forward the development of multilingual speech and machine translation by its reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them less optimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely "GenTranslate", which builds upon LLMs to generate better results from the diverse translation versions in N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release a HypoTranslate dataset that contains over 592K hypotheses-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that our GenTranslate significantly outperforms the state-of-the-art model.
△ Less
Submitted 16 May, 2024; v1 submitted 10 February, 2024;
originally announced February 2024.
-
Analyzing the Neural Tangent Kernel of Periodically Activated Coordinate Networks
Authors:
Hemanth Saratchandran,
Shin-Fang Chng,
Simon Lucey
Abstract:
Recently, neural networks utilizing periodic activation functions have been proven to demonstrate superior performance in vision tasks compared to traditional ReLU-activated networks. However, there is still a limited understanding of the underlying reasons for this improved performance. In this paper, we aim to address this gap by providing a theoretical understanding of periodically activated ne…
▽ More
Recently, neural networks utilizing periodic activation functions have been proven to demonstrate superior performance in vision tasks compared to traditional ReLU-activated networks. However, there is still a limited understanding of the underlying reasons for this improved performance. In this paper, we aim to address this gap by providing a theoretical understanding of periodically activated networks through an analysis of their Neural Tangent Kernel (NTK). We derive bounds on the minimum eigenvalue of their NTK in the finite width setting, using a fairly general network architecture which requires only one wide layer that grows at least linearly with the number of data samples. Our findings indicate that periodically activated networks are \textit{notably more well-behaved}, from the NTK perspective, than ReLU activated networks. Additionally, we give an application to the memorization capacity of such networks and verify our theoretical predictions empirically. Our study offers a deeper understanding of the properties of periodically activated neural networks and their potential in the field of deep learning.
△ Less
Submitted 7 February, 2024;
originally announced February 2024.
-
Architectural Strategies for the optimization of Physics-Informed Neural Networks
Authors:
Hemanth Saratchandran,
Shin-Fang Chng,
Simon Lucey
Abstract:
Physics-informed neural networks (PINNs) offer a promising avenue for tackling both forward and inverse problems in partial differential equations (PDEs) by incorporating deep learning with fundamental physics principles. Despite their remarkable empirical success, PINNs have garnered a reputation for their notorious training challenges across a spectrum of PDEs. In this work, we delve into the in…
▽ More
Physics-informed neural networks (PINNs) offer a promising avenue for tackling both forward and inverse problems in partial differential equations (PDEs) by incorporating deep learning with fundamental physics principles. Despite their remarkable empirical success, PINNs have garnered a reputation for their notorious training challenges across a spectrum of PDEs. In this work, we delve into the intricacies of PINN optimization from a neural architecture perspective. Leveraging the Neural Tangent Kernel (NTK), our study reveals that Gaussian activations surpass several alternate activations when it comes to effectively training PINNs. Building on insights from numerical linear algebra, we introduce a preconditioned neural architecture, showcasing how such tailored architectures enhance the optimization process. Our theoretical findings are substantiated through rigorous validation against established PDEs within the scientific literature.
△ Less
Submitted 4 February, 2024;
originally announced February 2024.
-
Cross-Modality and Within-Modality Regularization for Audio-Visual DeepFake Detection
Authors:
Heqing Zou,
Meng Shen,
Yuchen Hu,
Chen Chen,
Eng Siong Chng,
Deepu Rajan
Abstract:
Audio-visual deepfake detection scrutinizes manipulations in public video using complementary multimodal cues. Current methods, which train on fused multimodal data for multimodal targets face challenges due to uncertainties and inconsistencies in learned representations caused by independent modality manipulations in deepfake videos. To address this, we propose cross-modality and within-modality…
▽ More
Audio-visual deepfake detection scrutinizes manipulations in public video using complementary multimodal cues. Current methods, which train on fused multimodal data for multimodal targets face challenges due to uncertainties and inconsistencies in learned representations caused by independent modality manipulations in deepfake videos. To address this, we propose cross-modality and within-modality regularization to preserve modality distinctions during multimodal representation learning. Our approach includes an audio-visual transformer module for modality correspondence and a cross-modality regularization module to align paired audio-visual signals, preserving modality distinctions. Simultaneously, a within-modality regularization module refines unimodal representations with modality-specific targets to retain modal-specific details. Experimental results on the public audio-visual dataset, FakeAVCeleb, demonstrate the effectiveness and competitiveness of our approach.
△ Less
Submitted 11 January, 2024;
originally announced January 2024.
-
ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge
Authors:
He Wang,
Pengcheng Guo,
Yue Li,
Ao Zhang,
Jiayao Sun,
Lei Xie,
Wei Chen,
Pan Zhou,
Hui Bu,
Xin Xu,
Binbin Zhang,
Zhuo Chen,
Jian Wu,
Longbiao Wang,
Eng Siong Chng,
Sun Li
Abstract:
To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. This challenge collects over 100 hours of multi-channel speech data recorded inside a new energy vehicle and 40 hours…
▽ More
To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. This challenge collects over 100 hours of multi-channel speech data recorded inside a new energy vehicle and 40 hours of noise for data augmentation. Two tracks, including automatic speech recognition (ASR) and automatic speech diarization and recognition (ASDR) are set up, using character error rate (CER) and concatenated minimum permutation character error rate (cpCER) as evaluation metrics, respectively. Overall, the ICMC-ASR Challenge attracts 98 participating teams and receives 53 valid results in both tracks. In the end, first-place team USTCiflytek achieves a CER of 13.16% in the ASR track and a cpCER of 21.48% in the ASDR track, showing an absolute improvement of 13.08% and 51.4% compared to our challenge baseline, respectively.
△ Less
Submitted 20 February, 2024; v1 submitted 7 January, 2024;
originally announced January 2024.
-
Noise robust distillation of self-supervised speech models via correlation metrics
Authors:
Fabian Ritter-Gutierrez,
Kuan-Po Huang,
Dianwen Ng,
Jeremy H. M. Wong,
Hung-yi Lee,
Eng Siong Chng,
Nancy F. Chen
Abstract:
Compared to large speech foundation models, small distilled models exhibit degraded noise robustness. The student's robustness can be improved by introducing noise at the inputs during pre-training. Despite this, using the standard distillation loss still yields a student with degraded performance. Thus, this paper proposes improving student robustness via distillation with correlation metrics. Te…
▽ More
Compared to large speech foundation models, small distilled models exhibit degraded noise robustness. The student's robustness can be improved by introducing noise at the inputs during pre-training. Despite this, using the standard distillation loss still yields a student with degraded performance. Thus, this paper proposes improving student robustness via distillation with correlation metrics. Teacher behavior is learned by maximizing the teacher and student cross-correlation matrix between their representations towards identity. Noise robustness is encouraged via the student's self-correlation minimization. The proposed method is agnostic of the teacher model and consistently outperforms the previous approach. This work also proposes an heuristic to weigh the importance of the two correlation terms automatically. Experiments show consistently better clean and noise generalization on Intent Classification, Keyword Spotting, and Automatic Speech Recognition tasks on SUPERB Challenge.
△ Less
Submitted 19 December, 2023;
originally announced December 2023.
-
Adapting OpenAI's Whisper for Speech Recognition on Code-Switch Mandarin-English SEAME and ASRU2019 Datasets
Authors:
Yuhang Yang,
Yizhou Peng,
Xionghu Zhong,
Hao Huang,
Eng Siong Chng
Abstract:
This paper details the experimental results of adapting the OpenAI's Whisper model for Code-Switch Mandarin-English Speech Recognition (ASR) on the SEAME and ASRU2019 corpora. We conducted 2 experiments: a) using adaptation data from 1 to 100/200 hours to demonstrate effectiveness of adaptation, b) examining different language ID setup on Whisper prompt.
The Mixed Error Rate results show that th…
▽ More
This paper details the experimental results of adapting the OpenAI's Whisper model for Code-Switch Mandarin-English Speech Recognition (ASR) on the SEAME and ASRU2019 corpora. We conducted 2 experiments: a) using adaptation data from 1 to 100/200 hours to demonstrate effectiveness of adaptation, b) examining different language ID setup on Whisper prompt.
The Mixed Error Rate results show that the amount of adaptation data may be as low as $1\sim10$ hours to achieve saturation in performance gain (SEAME) while the ASRU task continued to show performance with more adaptation data ($>$100 hours). For the language prompt, the results show that although various prompting strategies initially produce different outcomes, adapting the Whisper model with code-switch data uniformly improves its performance.
These results may be relevant also to the community when applying Whisper for related tasks of adapting to new target domains.
△ Less
Submitted 29 November, 2023;
originally announced November 2023.
-
Generative error correction for code-switching speech recognition using large language models
Authors:
Chen Chen,
Yuchen Hu,
Chao-Han Huck Yang,
Hexin Liu,
Sabato Marco Siniscalchi,
Eng Siong Chng
Abstract:
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite the recent advances in automatic speech recognition (ASR), CS-ASR is still a challenging task ought to the grammatical structure complexity of the phenomenon and the data scarcity of specific training corpus. In this work, we propose to leverage large language models (LLMs) and lis…
▽ More
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite the recent advances in automatic speech recognition (ASR), CS-ASR is still a challenging task ought to the grammatical structure complexity of the phenomenon and the data scarcity of specific training corpus. In this work, we propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR to address the CS problem. Specifically, we first employ multiple well-trained ASR models for N-best hypotheses generation, with the aim of increasing the diverse and informative elements in the set of hypotheses. Next, we utilize the LLMs to learn the hypotheses-to-transcription (H2T) map** by adding a trainable low-rank adapter. Such a generative error correction (GER) method directly predicts the accurate transcription according to its expert linguistic knowledge and N-best hypotheses, resulting in a paradigm shift from the traditional language model rescoring or error correction techniques. Experimental evidence demonstrates that GER significantly enhances CS-ASR accuracy, in terms of reduced mixed error rate (MER). Furthermore, LLMs show remarkable data efficiency for H2T learning, providing a potential solution to the data scarcity problem of CS-ASR in low-resource languages.
△ Less
Submitted 17 October, 2023;
originally announced October 2023.
-
Multi-Body Neural Scene Flow
Authors:
Kavisha Vidanapathirana,
Shin-Fang Chng,
Xueqian Li,
Simon Lucey
Abstract:
The test-time optimization of scene flow - using a coordinate network as a neural prior - has gained popularity due to its simplicity, lack of dataset bias, and state-of-the-art performance. We observe, however, that although coordinate networks capture general motions by implicitly regularizing the scene flow predictions to be spatially smooth, the neural prior by itself is unable to identify the…
▽ More
The test-time optimization of scene flow - using a coordinate network as a neural prior - has gained popularity due to its simplicity, lack of dataset bias, and state-of-the-art performance. We observe, however, that although coordinate networks capture general motions by implicitly regularizing the scene flow predictions to be spatially smooth, the neural prior by itself is unable to identify the underlying multi-body rigid motions present in real-world data. To address this, we show that multi-body rigidity can be achieved without the cumbersome and brittle strategy of constraining the $SE(3)$ parameters of each rigid body as done in previous works. This is achieved by regularizing the scene flow optimization to encourage isometry in flow predictions for rigid bodies. This strategy enables multi-body rigidity in scene flow while maintaining a continuous flow field, hence allowing dense long-term scene flow integration across a sequence of point clouds. We conduct extensive experiments on real-world datasets and demonstrate that our approach outperforms the state-of-the-art in 3D scene flow and long-term point-wise 4D trajectory prediction. The code is available at: https://github.com/kavisha725/MBNSF.
△ Less
Submitted 6 February, 2024; v1 submitted 16 October, 2023;
originally announced October 2023.
-
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models
Authors:
Chen Chen,
Yuchen Hu,
Chao-Han Huck Yang,
Sabato Macro Siniscalchi,
Pin-Yu Chen,
Eng Siong Chng
Abstract:
Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuit…
▽ More
Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. This approach is a paradigm shift from the traditional language model rescoring strategy that can only select one candidate hypothesis as the output transcription. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions across prevalent speech domains. Given this dataset, we examine three types of error correction techniques based on LLMs with varying amounts of labeled hypotheses-transcription pairs, which gains a significant word error rate (WER) reduction. Experimental evidence demonstrates the proposed technique achieves a breakthrough by surpassing the upper bound of traditional re-ranking based methods. More surprisingly, LLM with reasonable prompt and its generative capability can even correct those tokens that are missing in N-best list. We make our results publicly accessible for reproducible pipelines with released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs.
△ Less
Submitted 16 October, 2023; v1 submitted 27 September, 2023;
originally announced September 2023.
-
Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification
Authors:
Duc-Tuan Truong,
Ruijie Tao,
Jia Qi Yip,
Kong Aik Lee,
Eng Siong Chng
Abstract:
Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, the conventional label-level KD overlooks the significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for automa…
▽ More
Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, the conventional label-level KD overlooks the significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for automatic speaker verification. In this paper, we first demonstrate that leveraging a larger number of training non-target speakers improves the performance of automatic speaker verification models. Inspired by this finding about the importance of non-target speakers' knowledge, we modified the conventional label-level KD by disentangling and emphasizing the classification probabilities of non-target speakers during knowledge distillation. The proposed method is applied to three different student model architectures and achieves an average of 13.67% improvement in EER on the VoxCeleb dataset compared to embedding-level and conventional label-level KD methods.
△ Less
Submitted 14 January, 2024; v1 submitted 26 September, 2023;
originally announced September 2023.
-
SPGM: Prioritizing Local Features for enhanced speech separation performance
Authors:
Jia Qi Yip,
Shengkui Zhao,
Yukun Ma,
Chongjia Ni,
Chong Zhang,
Hao Wang,
Trung Hieu Nguyen,
Kun Zhou,
Dianwen Ng,
Eng Siong Chng,
Bin Ma
Abstract:
Dual-path is a popular architecture for speech separation models (e.g. Sepformer) which splits long sequences into overlap** chunks for its intra- and inter-blocks that separately model intra-chunk local features and inter-chunk global relationships. However, it has been found that inter-blocks, which comprise half a dual-path model's parameters, contribute minimally to performance. Thus, we pro…
▽ More
Dual-path is a popular architecture for speech separation models (e.g. Sepformer) which splits long sequences into overlap** chunks for its intra- and inter-blocks that separately model intra-chunk local features and inter-chunk global relationships. However, it has been found that inter-blocks, which comprise half a dual-path model's parameters, contribute minimally to performance. Thus, we propose the Single-Path Global Modulation (SPGM) block to replace inter-blocks. SPGM is named after its structure consisting of a parameter-free global pooling module followed by a modulation module comprising only 2% of the model's total parameters. The SPGM block allows all transformer layers in the model to be dedicated to local feature modelling, making the overall model single-path. SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4 dB SI-SDRi on Libri2Mix, exceeding the performance of Sepformer by 0.5 dB and 0.3 dB respectively and matches the performance of recent SOTA models with up to 8 times fewer parameters. Model and weights are available at huggingface.co/yipjiaqi/spgm
△ Less
Submitted 10 March, 2024; v1 submitted 21 September, 2023;
originally announced September 2023.
-
Are Soft Prompts Good Zero-shot Learners for Speech Recognition?
Authors:
Dianwen Ng,
Chong Zhang,
Ruixi Zhang,
Yukun Ma,
Fabian Ritter-Gutierrez,
Trung Hieu Nguyen,
Chongjia Ni,
Shengkui Zhao,
Eng Siong Chng,
Bin Ma
Abstract:
Large self-supervised pre-trained speech models require computationally expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple parameter-efficient alternative by utilizing minimal soft prompt guidance, enhancing portability while also maintaining competitive performance. However, not many people understand how and why this is so. In this study, we aim to deepen our understa…
▽ More
Large self-supervised pre-trained speech models require computationally expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple parameter-efficient alternative by utilizing minimal soft prompt guidance, enhancing portability while also maintaining competitive performance. However, not many people understand how and why this is so. In this study, we aim to deepen our understanding of this emerging method by investigating the role of soft prompts in automatic speech recognition (ASR). Our findings highlight their role as zero-shot learners in improving ASR performance but also make them vulnerable to malicious modifications. Soft prompts aid generalization but are not obligatory for inference. We also identify two primary roles of soft prompts: content refinement and noise information enhancement, which enhances robustness against background noise. Additionally, we propose an effective modification on noise prompts to show that they are capable of zero-shot learning on adapting to out-of-distribution noise environments.
△ Less
Submitted 17 September, 2023;
originally announced September 2023.
-
Codec Data Augmentation for Time-domain Heart Sound Classification
Authors:
Ansh Mishra,
Jia Qi Yip,
Eng Siong Chng
Abstract:
Heart auscultations are a low-cost and effective way of detecting valvular heart diseases early, which can save lives. Nevertheless, it has been difficult to scale this screening method since the effectiveness of auscultations is dependent on the skill of doctors. As such, there has been increasing research interest in the automatic classification of heart sounds using deep learning algorithms. Ho…
▽ More
Heart auscultations are a low-cost and effective way of detecting valvular heart diseases early, which can save lives. Nevertheless, it has been difficult to scale this screening method since the effectiveness of auscultations is dependent on the skill of doctors. As such, there has been increasing research interest in the automatic classification of heart sounds using deep learning algorithms. However, it is currently difficult to develop good heart sound classification models due to the limited data available for training. In this work, we propose a simple time domain approach, to the heart sound classification problem with a base classification error rate of 0.8 and show that augmentation of the data through codec simulation can improve the classification error rate to 0.2. With data augmentation, our approach outperforms the existing time-domain CNN-BiLSTM baseline model. Critically, our experiments show that codec data augmentation is effective in getting around the data limitation.
△ Less
Submitted 14 September, 2023;
originally announced September 2023.
-
Noise-aware Speech Enhancement using Diffusion Probabilistic Model
Authors:
Yuchen Hu,
Chen Chen,
Ruizhe Li,
Qiushi Zhu,
Eng Siong Chng
Abstract:
With recent advances of diffusion model, generative speech enhancement (SE) has attracted a surge of research interest due to its great potential for unseen testing noises. However, existing efforts mainly focus on inherent properties of clean speech, underexploiting the varying noise information in real world. In this paper, we propose a noise-aware speech enhancement (NASE) approach that extract…
▽ More
With recent advances of diffusion model, generative speech enhancement (SE) has attracted a surge of research interest due to its great potential for unseen testing noises. However, existing efforts mainly focus on inherent properties of clean speech, underexploiting the varying noise information in real world. In this paper, we propose a noise-aware speech enhancement (NASE) approach that extracts noise-specific information to guide the reverse process in diffusion model. Specifically, we design a noise classification (NC) model to produce acoustic embedding as a noise conditioner to guide the reverse denoising process. Meanwhile, a multi-task learning scheme is devised to jointly optimize SE and NC tasks to enhance the noise specificity of conditioner. NASE is shown to be a plug-and-play module that can be generalized to any diffusion SE models. Experiments on VB-DEMAND dataset show that NASE effectively improves multiple mainstream diffusion SE models, especially on unseen noises.
△ Less
Submitted 4 June, 2024; v1 submitted 16 July, 2023;
originally announced July 2023.
-
MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition
Authors:
Yuchen Hu,
Chen Chen,
Ruizhe Li,
Heqing Zou,
Eng Siong Chng
Abstract:
Audio-visual speech recognition (AVSR) attracts a surge of research interest recently by leveraging multimodal signals to understand human speech. Mainstream approaches addressing this task have developed sophisticated architectures and techniques for multi-modality fusion and representation learning. However, the natural heterogeneity of different modalities causes distribution gap between their…
▽ More
Audio-visual speech recognition (AVSR) attracts a surge of research interest recently by leveraging multimodal signals to understand human speech. Mainstream approaches addressing this task have developed sophisticated architectures and techniques for multi-modality fusion and representation learning. However, the natural heterogeneity of different modalities causes distribution gap between their representations, making it challenging to fuse them. In this paper, we aim to learn the shared representations across modalities to bridge their gap. Different from existing similar methods on other multimodal tasks like sentiment analysis, we focus on the temporal contextual dependencies considering the sequence-to-sequence task setting of AVSR. In particular, we propose an adversarial network to refine frame-level modality-invariant representations (MIR-GAN), which captures the commonality across modalities to ease the subsequent multimodal fusion process. Extensive experiments on public benchmarks LRS3 and LRS2 show that our approach outperforms the state-of-the-arts.
△ Less
Submitted 18 June, 2023;
originally announced June 2023.
-
Hearing Lips in Noise: Universal Viseme-Phoneme Map** and Transfer for Robust Audio-Visual Speech Recognition
Authors:
Yuchen Hu,
Ruizhe Li,
Chen Chen,
Chengwei Qin,
Qiushi Zhu,
Eng Siong Chng
Abstract:
Audio-visual speech recognition (AVSR) provides a promising solution to ameliorate the noise-robustness of audio-only speech recognition with visual information. However, most existing efforts still focus on audio modality to improve robustness considering its dominance in AVSR task, with noise adaptation techniques such as front-end denoise processing. Though effective, these methods are usually…
▽ More
Audio-visual speech recognition (AVSR) provides a promising solution to ameliorate the noise-robustness of audio-only speech recognition with visual information. However, most existing efforts still focus on audio modality to improve robustness considering its dominance in AVSR task, with noise adaptation techniques such as front-end denoise processing. Though effective, these methods are usually faced with two practical challenges: 1) lack of sufficient labeled noisy audio-visual training data in some real-world scenarios and 2) less optimal model generality to unseen testing noises. In this work, we investigate the noise-invariant visual modality to strengthen robustness of AVSR, which can adapt to any testing noises while without dependence on noisy training data, a.k.a., unsupervised noise adaptation. Inspired by human perception mechanism, we propose a universal viseme-phoneme map** (UniVPM) approach to implement modality transfer, which can restore clean audio from visual signals to enable speech recognition under any noisy conditions. Extensive experiments on public benchmarks LRS3 and LRS2 show that our approach achieves the state-of-the-art under various noisy as well as clean conditions. In addition, we also outperform previous state-of-the-arts on visual speech recognition task.
△ Less
Submitted 18 June, 2023;
originally announced June 2023.
-
A Neural State-Space Model Approach to Efficient Speech Separation
Authors:
Chen Chen,
Chao-Han Huck Yang,
Kai Li,
Yuchen Hu,
Pin-Jui Ku,
Eng Siong Chng
Abstract:
In this work, we introduce S4M, a new efficient speech separation framework based on neural state-space models (SSM). Motivated by linear time-invariant systems for sequence modeling, our SSM-based approach can efficiently model input signals into a format of linear ordinary differential equations (ODEs) for representation learning. To extend the SSM technique into speech separation tasks, we firs…
▽ More
In this work, we introduce S4M, a new efficient speech separation framework based on neural state-space models (SSM). Motivated by linear time-invariant systems for sequence modeling, our SSM-based approach can efficiently model input signals into a format of linear ordinary differential equations (ODEs) for representation learning. To extend the SSM technique into speech separation tasks, we first decompose the input mixture into multi-scale representations with different resolutions. This mechanism enables S4M to learn globally coherent separation and reconstruction. The experimental results show that S4M performs comparably to other separation backbones in terms of SI-SDRi, while having a much lower model complexity with significantly fewer trainable parameters. In addition, our S4M-tiny model (1.8M parameters) even surpasses attention-based Sepformer (26.0M parameters) in noisy conditions with only 9.2 of multiply-accumulate operation (MACs).
△ Less
Submitted 26 May, 2023;
originally announced May 2023.
-
ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention
Authors:
Jia Qi Yip,
Tuan Truong,
Dianwen Ng,
Chong Zhang,
Yukun Ma,
Trung Hieu Nguyen,
Chongjia Ni,
Shengkui Zhao,
Eng Siong Chng,
Bin Ma
Abstract:
In this paper, we propose ACA-Net, a lightweight, global context-aware speaker embedding extractor for Speaker Verification (SV) that improves upon existing work by using Asymmetric Cross Attention (ACA) to replace temporal pooling. ACA is able to distill large, variable-length sequences into small, fixed-sized latents by attending a small query to large key and value matrices. In ACA-Net, we buil…
▽ More
In this paper, we propose ACA-Net, a lightweight, global context-aware speaker embedding extractor for Speaker Verification (SV) that improves upon existing work by using Asymmetric Cross Attention (ACA) to replace temporal pooling. ACA is able to distill large, variable-length sequences into small, fixed-sized latents by attending a small query to large key and value matrices. In ACA-Net, we build a Multi-Layer Aggregation (MLA) block using ACA to generate fixed-sized identity vectors from variable-length inputs. Through global attention, ACA-Net acts as an efficient global feature extractor that adapts to temporal variability unlike existing SV models that apply a fixed function for pooling over the temporal dimension which may obscure information about the signal's non-stationary temporal variability. Our experiments on the WSJ0-1talker show ACA-Net outperforms a strong baseline by 5\% relative improvement in EER using only 1/5 of the parameters.
△ Less
Submitted 20 May, 2023;
originally announced May 2023.
-
Noise-Aware Speech Separation with Contrastive Learning
Authors:
Zizheng Zhang,
Chen Chen,
Hsin-Hung Chen,
Xiang Liu,
Yuchen Hu,
Eng Siong Chng
Abstract:
Recently, speech separation (SS) task has achieved remarkable progress driven by deep learning technique. However, it is still challenging to separate target speech from noisy mixture, as the neural model is vulnerable to assign background noise to each speaker. In this paper, we propose a noise-aware SS (NASS) method, which aims to improve the speech quality for separated signals under noisy cond…
▽ More
Recently, speech separation (SS) task has achieved remarkable progress driven by deep learning technique. However, it is still challenging to separate target speech from noisy mixture, as the neural model is vulnerable to assign background noise to each speaker. In this paper, we propose a noise-aware SS (NASS) method, which aims to improve the speech quality for separated signals under noisy conditions. Specifically, NASS views background noise as an additional output and predicts it along with other speakers in a mask-based manner. To effectively denoise, we introduce patch-wise contrastive learning (PCL) between noise and speaker representations from the decoder input and encoder output. PCL loss aims to minimize the mutual information between predicted noise and other speakers at multiple-patch level to suppress the noise information in separated signals. Experimental results show that NASS achieves 1 to 2dB SI-SNRi or SDRi over DPRNN and Sepformer on WHAM! and LibriMix noisy datasets, with less than 0.1M parameter increase.
△ Less
Submitted 8 January, 2024; v1 submitted 18 May, 2023;
originally announced May 2023.
-
UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning
Authors:
Heqing Zou,
Meng Shen,
Chen Chen,
Yuchen Hu,
Deepu Rajan,
Eng Siong Chng
Abstract:
Multimodal learning aims to imitate human beings to acquire complementary information from multiple modalities for various downstream tasks. However, traditional aggregation-based multimodal fusion methods ignore the inter-modality relationship, treat each modality equally, suffer sensor noise, and thus reduce multimodal learning performance. In this work, we propose a novel multimodal contrastive…
▽ More
Multimodal learning aims to imitate human beings to acquire complementary information from multiple modalities for various downstream tasks. However, traditional aggregation-based multimodal fusion methods ignore the inter-modality relationship, treat each modality equally, suffer sensor noise, and thus reduce multimodal learning performance. In this work, we propose a novel multimodal contrastive method to explore more reliable multimodal representations under the weak supervision of unimodal predicting. Specifically, we first capture task-related unimodal representations and the unimodal predictions from the introduced unimodal predicting task. Then the unimodal representations are aligned with the more effective one by the designed multimodal contrastive method under the supervision of the unimodal predictions. Experimental results with fused features on two image-text classification benchmarks UPMC-Food-101 and N24News show that our proposed Unimodality-Supervised MultiModal Contrastive UniS-MMC learning method outperforms current state-of-the-art multimodal methods. The detailed ablation study and analysis further demonstrate the advantage of our proposed method.
△ Less
Submitted 16 May, 2023;
originally announced May 2023.
-
Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
Authors:
Yuchen Hu,
Ruizhe Li,
Chen Chen,
Heqing Zou,
Qiushi Zhu,
Eng Siong Chng
Abstract:
Audio-visual speech recognition (AVSR) research has gained a great success recently by improving the noise-robustness of audio-only automatic speech recognition (ASR) with noise-invariant visual information. However, most existing AVSR approaches simply fuse the audio and visual features by concatenation, without explicit interactions to capture the deep correlations between them, which results in…
▽ More
Audio-visual speech recognition (AVSR) research has gained a great success recently by improving the noise-robustness of audio-only automatic speech recognition (ASR) with noise-invariant visual information. However, most existing AVSR approaches simply fuse the audio and visual features by concatenation, without explicit interactions to capture the deep correlations between them, which results in sub-optimal multimodal representations for downstream speech recognition task. In this paper, we propose a cross-modal global interaction and local alignment (GILA) approach for AVSR, which captures the deep audio-visual (A-V) correlations from both global and local perspectives. Specifically, we design a global interaction model to capture the A-V complementary relationship on modality level, as well as a local alignment approach to model the A-V temporal consistency on frame level. Such a holistic view of cross-modal correlations enable better multimodal representations for AVSR. Experiments on public benchmarks LRS3 and LRS2 show that our GILA outperforms the supervised learning state-of-the-art.
△ Less
Submitted 16 May, 2023;
originally announced May 2023.
-
Curvature-Aware Training for Coordinate Networks
Authors:
Hemanth Saratchandran,
Shin-Fang Chng,
Sameera Ramasinghe,
Lachlan MacDonald,
Simon Lucey
Abstract:
Coordinate networks are widely used in computer vision due to their ability to represent signals as compressed, continuous entities. However, training these networks with first-order optimizers can be slow, hindering their use in real-time applications. Recent works have opted for shallow voxel-based representations to achieve faster training, but this sacrifices memory efficiency. This work propo…
▽ More
Coordinate networks are widely used in computer vision due to their ability to represent signals as compressed, continuous entities. However, training these networks with first-order optimizers can be slow, hindering their use in real-time applications. Recent works have opted for shallow voxel-based representations to achieve faster training, but this sacrifices memory efficiency. This work proposes a solution that leverages second-order optimization methods to significantly reduce training times for coordinate networks while maintaining their compressibility. Experiments demonstrate the effectiveness of this approach on various signal modalities, such as audio, images, videos, shape reconstruction, and neural radiance fields.
△ Less
Submitted 15 May, 2023;
originally announced May 2023.
-
Contrastive Speech Mixup for Low-resource Keyword Spotting
Authors:
Dianwen Ng,
Ruixi Zhang,
Jia Qi Yip,
Chong Zhang,
Yukun Ma,
Trung Hieu Nguyen,
Chongjia Ni,
Eng Siong Chng,
Bin Ma
Abstract:
Most of the existing neural-based models for keyword spotting (KWS) in smart devices require thousands of training samples to learn a decent audio representation. However, with the rising demand for smart devices to become more personalized, KWS models need to adapt quickly to smaller user samples. To tackle this challenge, we propose a contrastive speech mixup (CosMix) learning algorithm for low-…
▽ More
Most of the existing neural-based models for keyword spotting (KWS) in smart devices require thousands of training samples to learn a decent audio representation. However, with the rising demand for smart devices to become more personalized, KWS models need to adapt quickly to smaller user samples. To tackle this challenge, we propose a contrastive speech mixup (CosMix) learning algorithm for low-resource KWS. CosMix introduces an auxiliary contrastive loss to the existing mixup augmentation technique to maximize the relative similarity between the original pre-mixed samples and the augmented samples. The goal is to inject enhancing constraints to guide the model towards simpler but richer content-based speech representations from two augmented views (i.e. noisy mixed and clean pre-mixed utterances). We conduct our experiments on the Google Speech Command dataset, where we trim the size of the training set to as small as 2.5 mins per keyword to simulate a low-resource condition. Our experimental results show a consistent improvement in the performance of multiple models, which exhibits the effectiveness of our method.
△ Less
Submitted 1 May, 2023;
originally announced May 2023.
-
Leveraging Audio-Tagging Assisted Sound Event Detection using Weakified Strong Labels and Frequency Dynamic Convolutions
Authors:
Tanmay Khandelwal,
Rohan Kumar Das,
Andrew Koh,
Eng Siong Chng
Abstract:
Jointly learning from a small labeled set and a larger unlabeled set is an active research topic under semi-supervised learning (SSL). In this paper, we propose a novel SSL method based on a two-stage framework for leveraging a large unlabeled in-domain set. Stage-1 of our proposed framework focuses on audio-tagging (AT), which assists the sound event detection (SED) system in Stage-2. The AT syst…
▽ More
Jointly learning from a small labeled set and a larger unlabeled set is an active research topic under semi-supervised learning (SSL). In this paper, we propose a novel SSL method based on a two-stage framework for leveraging a large unlabeled in-domain set. Stage-1 of our proposed framework focuses on audio-tagging (AT), which assists the sound event detection (SED) system in Stage-2. The AT system is trained utilizing a strongly labeled set converted into weak predictions referred to as weakified set, a weakly labeled set, and an unlabeled set. This AT system then infers on the unlabeled set to generate reliable pseudo-weak labels, which are used with the strongly and weakly labeled set to train a frequency dynamic convolutional recurrent neural network-based SED system at Stage-2 in a supervised manner. Our system outperforms the baseline by 45.5% in terms of polyphonic sound detection score on the DESED real validation set.
△ Less
Submitted 25 April, 2023;
originally announced April 2023.
-
Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR
Authors:
Yuchen Hu,
Chen Chen,
Qiushi Zhu,
Eng Siong Chng
Abstract:
Automatic speech recognition (ASR) has gained remarkable successes thanks to recent advances of deep learning, but it usually degrades significantly under real-world noisy conditions. Recent works introduce speech enhancement (SE) as front-end to improve speech quality, which is proved effective but may not be optimal for downstream ASR due to speech distortion problem. Based on that, latest works…
▽ More
Automatic speech recognition (ASR) has gained remarkable successes thanks to recent advances of deep learning, but it usually degrades significantly under real-world noisy conditions. Recent works introduce speech enhancement (SE) as front-end to improve speech quality, which is proved effective but may not be optimal for downstream ASR due to speech distortion problem. Based on that, latest works combine SE and currently popular self-supervised learning (SSL) to alleviate distortion and improve noise robustness. Despite the effectiveness, the speech distortion caused by conventional SE still cannot be cleared out. In this paper, we propose a self-supervised framework named Wav2code to implement a feature-level SE with reduced distortions for noise-robust ASR. First, in pre-training stage the clean speech representations from SSL model are sent to lookup a discrete codebook via nearest-neighbor feature matching, the resulted code sequence are then exploited to reconstruct the original clean representations, in order to store them in codebook as prior. Second, during finetuning we propose a Transformer-based code predictor to accurately predict clean codes by modeling the global dependency of input noisy representations, which enables discovery and restoration of high-quality clean representations with reduced distortions. Furthermore, we propose an interactive feature fusion network to combine original noisy and the restored clean representations to consider both fidelity and quality, resulting in more informative features for downstream ASR. Finally, experiments on both synthetic and real noisy datasets demonstrate that Wav2code can solve the speech distortion and improve ASR performance under various noisy conditions, resulting in stronger robustness.
△ Less
Submitted 18 April, 2024; v1 submitted 11 April, 2023;
originally announced April 2023.
-
deHuBERT: Disentangling Noise in a Self-supervised Model for Robust Speech Recognition
Authors:
Dianwen Ng,
Ruixi Zhang,
Jia Qi Yip,
Zhao Yang,
**jie Ni,
Chong Zhang,
Yukun Ma,
Chongjia Ni,
Eng Siong Chng,
Bin Ma
Abstract:
Existing self-supervised pre-trained speech models have offered an effective way to leverage massive unannotated corpora to build good automatic speech recognition (ASR). However, many current models are trained on a clean corpus from a single source, which tends to do poorly when noise is present during testing. Nonetheless, it is crucial to overcome the adverse influence of noise for real-world…
▽ More
Existing self-supervised pre-trained speech models have offered an effective way to leverage massive unannotated corpora to build good automatic speech recognition (ASR). However, many current models are trained on a clean corpus from a single source, which tends to do poorly when noise is present during testing. Nonetheless, it is crucial to overcome the adverse influence of noise for real-world applications. In this work, we propose a novel training framework, called deHuBERT, for noise reduction encoding inspired by H. Barlow's redundancy-reduction principle. The new framework improves the HuBERT training algorithm by introducing auxiliary losses that drive the self- and cross-correlation matrix between pairwise noise-distorted embeddings towards identity matrix. This encourages the model to produce noise-agnostic speech representations. With this method, we report improved robustness in noisy environments, including unseen noises, without impairing the performance on the clean set.
△ Less
Submitted 28 February, 2023;
originally announced February 2023.
-
Metric-oriented Speech Enhancement using Diffusion Probabilistic Model
Authors:
Chen Chen,
Yuchen Hu,
Weiwei Weng,
Eng Siong Chng
Abstract:
Deep neural network based speech enhancement technique focuses on learning a noisy-to-clean transformation supervised by paired training data. However, the task-specific evaluation metric (e.g., PESQ) is usually non-differentiable and can not be directly constructed in the training criteria. This mismatch between the training objective and evaluation metric likely results in sub-optimal performanc…
▽ More
Deep neural network based speech enhancement technique focuses on learning a noisy-to-clean transformation supervised by paired training data. However, the task-specific evaluation metric (e.g., PESQ) is usually non-differentiable and can not be directly constructed in the training criteria. This mismatch between the training objective and evaluation metric likely results in sub-optimal performance. To alleviate it, we propose a metric-oriented speech enhancement method (MOSE), which leverages the recent advances in the diffusion probabilistic model and integrates a metric-oriented training strategy into its reverse process. Specifically, we design an actor-critic based framework that considers the evaluation metric as a posterior reward, thus guiding the reverse process to the metric-increasing direction. The experimental results demonstrate that MOSE obviously benefits from metric-oriented training and surpasses the generative baselines in terms of all evaluation metrics.
△ Less
Submitted 23 February, 2023;
originally announced February 2023.
-
Unsupervised Noise adaptation using Data Simulation
Authors:
Chen Chen,
Yuchen Hu,
Heqing Zou,
Linhui Sun,
Eng Siong Chng
Abstract:
Deep neural network based speech enhancement approaches aim to learn a noisy-to-clean transformation using a supervised learning paradigm. However, such a trained-well transformation is vulnerable to unseen noises that are not included in training set. In this work, we focus on the unsupervised noise adaptation problem in speech enhancement, where the ground truth of target domain data is complete…
▽ More
Deep neural network based speech enhancement approaches aim to learn a noisy-to-clean transformation using a supervised learning paradigm. However, such a trained-well transformation is vulnerable to unseen noises that are not included in training set. In this work, we focus on the unsupervised noise adaptation problem in speech enhancement, where the ground truth of target domain data is completely unavailable. Specifically, we propose a generative adversarial network based method to efficiently learn a converse clean-to-noisy transformation using a few minutes of unpaired target domain data. Then this transformation is utilized to generate sufficient simulated data for domain adaptation of the enhancement model. Experimental results show that our method effectively mitigates the domain mismatch between training and test sets, and surpasses the best baseline by a large margin.
△ Less
Submitted 23 February, 2023;
originally announced February 2023.
-
Gradient Remedy for Multi-Task Learning in End-to-End Noise-Robust Speech Recognition
Authors:
Yuchen Hu,
Chen Chen,
Ruizhe Li,
Qiushi Zhu,
Eng Siong Chng
Abstract:
Speech enhancement (SE) is proved effective in reducing noise from noisy speech signals for downstream automatic speech recognition (ASR), where multi-task learning strategy is employed to jointly optimize these two tasks. However, the enhanced speech learned by SE objective may not always yield good ASR results. From the optimization view, there sometimes exists interference between the gradients…
▽ More
Speech enhancement (SE) is proved effective in reducing noise from noisy speech signals for downstream automatic speech recognition (ASR), where multi-task learning strategy is employed to jointly optimize these two tasks. However, the enhanced speech learned by SE objective may not always yield good ASR results. From the optimization view, there sometimes exists interference between the gradients of SE and ASR tasks, which could hinder the multi-task learning and finally lead to sub-optimal ASR performance. In this paper, we propose a simple yet effective approach called gradient remedy (GR) to solve interference between task gradients in noise-robust speech recognition, from perspectives of both angle and magnitude. Specifically, we first project the SE task's gradient onto a dynamic surface that is at acute angle to ASR gradient, in order to remove the conflict between them and assist in ASR optimization. Furthermore, we adaptively rescale the magnitude of two gradients to prevent the dominant ASR task from being misled by SE gradient. Experimental results show that the proposed approach well resolves the gradient interference and achieves relative word error rate (WER) reductions of 9.3% and 11.1% over multi-task learning baseline, on RATS and CHiME-4 datasets, respectively. Our code is available at GitHub.
△ Less
Submitted 3 May, 2023; v1 submitted 22 February, 2023;
originally announced February 2023.
-
Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation
Authors:
Yuchen Hu,
Chen Chen,
Heqing Zou,
Xionghu Zhong,
Eng Siong Chng
Abstract:
Recent studies in neural network-based monaural speech separation (SS) have achieved a remarkable success thanks to increasing ability of long sequence modeling. However, they would degrade significantly when put under realistic noisy conditions, as the background noise could be mistaken for speaker's speech and thus interfere with the separated sources. To alleviate this problem, we propose a nov…
▽ More
Recent studies in neural network-based monaural speech separation (SS) have achieved a remarkable success thanks to increasing ability of long sequence modeling. However, they would degrade significantly when put under realistic noisy conditions, as the background noise could be mistaken for speaker's speech and thus interfere with the separated sources. To alleviate this problem, we propose a novel network to unify speech enhancement and separation with gradient modulation to improve noise-robustness. Specifically, we first build a unified network by combining speech enhancement (SE) and separation modules, with multi-task learning for optimization, where SE is supervised by parallel clean mixture to reduce noise for downstream speech separation. Furthermore, in order to avoid suppressing valid speaker information when reducing noise, we propose a gradient modulation (GM) strategy to harmonize the SE and SS tasks from optimization view. Experimental results show that our approach achieves the state-of-the-art on large-scale Libri2Mix- and Libri3Mix-noisy datasets, with SI-SNRi results of 16.0 dB and 15.8 dB respectively. Our code is available at GitHub.
△ Less
Submitted 21 February, 2023;
originally announced February 2023.
-
Probabilistic Back-ends for Online Speaker Recognition and Clustering
Authors:
Alexey Sholokhov,
Nikita Kuzmin,
Kong Aik Lee,
Eng Siong Chng
Abstract:
This paper focuses on multi-enrollment speaker recognition which naturally occurs in the task of online speaker clustering, and studies the properties of different scoring back-ends in this scenario. First, we show that popular cosine scoring suffers from poor score calibration with a varying number of enrollment utterances. Second, we propose a simple replacement for cosine scoring based on an ex…
▽ More
This paper focuses on multi-enrollment speaker recognition which naturally occurs in the task of online speaker clustering, and studies the properties of different scoring back-ends in this scenario. First, we show that popular cosine scoring suffers from poor score calibration with a varying number of enrollment utterances. Second, we propose a simple replacement for cosine scoring based on an extremely constrained version of probabilistic linear discriminant analysis (PLDA). The proposed model improves over the cosine scoring for multi-enrollment recognition while kee** the same performance in the case of one-to-one comparisons. Finally, we consider an online speaker clustering task where each step naturally involves multi-enrollment recognition. We propose an online clustering algorithm allowing us to take benefits from the PLDA model such as the ability to handle uncertainty and better score calibration. Our experiments demonstrate the effectiveness of the proposed algorithm.
△ Less
Submitted 19 February, 2023;
originally announced February 2023.
-
Improving Spoken Language Identification with Map-Mix
Authors:
Shangeth Rajaa,
Kriti Anandan,
Swaraj Dalmia,
Tarun Gupta,
Eng Siong Chng
Abstract:
The pre-trained multi-lingual XLSR model generalizes well for language identification after fine-tuning on unseen languages. However, the performance significantly degrades when the languages are not very distinct from each other, for example, in the case of dialects. Low resource dialect classification remains a challenging problem to solve. We present a new data augmentation method that leverage…
▽ More
The pre-trained multi-lingual XLSR model generalizes well for language identification after fine-tuning on unseen languages. However, the performance significantly degrades when the languages are not very distinct from each other, for example, in the case of dialects. Low resource dialect classification remains a challenging problem to solve. We present a new data augmentation method that leverages model training dynamics of individual data points to improve sampling for latent mixup. The method works well in low-resource settings where generalization is paramount. Our datamaps-based mixup technique, which we call Map-Mix improves weighted F1 scores by 2% compared to the random mixup baseline and results in a significantly well-calibrated model. The code for our method is open sourced on https://github.com/skit-ai/Map-Mix.
△ Less
Submitted 16 February, 2023;
originally announced February 2023.
-
Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning
Authors:
Chen Chen,
Yuchen Hu,
Qiang Zhang,
Heqing Zou,
Beier Zhu,
Eng Siong Chng
Abstract:
Audio-visual speech recognition (AVSR) has gained remarkable success for ameliorating the noise-robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on audio modality as it is much easier to recognize than video modality in clean conditions. As a result, th…
▽ More
Audio-visual speech recognition (AVSR) has gained remarkable success for ameliorating the noise-robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on audio modality as it is much easier to recognize than video modality in clean conditions. As a result, the AVSR model underestimates the importance of visual stream in face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, where the agent dynamically harmonizes modality-invariant and modality-specific representations in the auto-regressive decoding process. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages the MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art in both clean and various noisy conditions. Furthermore, we demonstrate the better generality of MSRL system than other baselines when test set contains unseen noises.
△ Less
Submitted 2 February, 2023; v1 submitted 10 December, 2022;
originally announced December 2022.
-
The ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge (ICSRC): Dataset, Tracks, Baseline and Results
Authors:
Ao Zhang,
Fan Yu,
Kaixun Huang,
Lei Xie,
Longbiao Wang,
Eng Siong Chng,
Hui Bu,
Binbin Zhang,
Wei Chen,
Xin Xu
Abstract:
This paper summarizes the outcomes from the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge (ICSRC). We first address the necessity of the challenge and then introduce the associated dataset collected from a new-energy vehicle (NEV) covering a variety of cockpit acoustic conditions and linguistic contents. We then describe the track arrangement and the baseline system. Specifically, w…
▽ More
This paper summarizes the outcomes from the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge (ICSRC). We first address the necessity of the challenge and then introduce the associated dataset collected from a new-energy vehicle (NEV) covering a variety of cockpit acoustic conditions and linguistic contents. We then describe the track arrangement and the baseline system. Specifically, we set up two tracks in terms of allowed model/system size to investigate resource-constrained and -unconstrained setups, targeting to vehicle embedded as well as cloud ASR systems respectively. Finally we summarize the challenge results and provide the major observations from the submitted systems.
△ Less
Submitted 3 November, 2022;
originally announced November 2022.
-
Speech-text based multi-modal training with bidirectional attention for improved speech recognition
Authors:
Yuhang Yang,
Haihua Xu,
Hao Huang,
Eng Siong Chng,
Sheng Li
Abstract:
To let the state-of-the-art end-to-end ASR model enjoy data efficiency, as well as much more unpaired text data by multi-modal training, one needs to address two problems: 1) the synchronicity of feature sampling rates between speech and language (aka text data); 2) the homogeneity of the learned representations from two encoders. In this paper we propose to employ a novel bidirectional attention…
▽ More
To let the state-of-the-art end-to-end ASR model enjoy data efficiency, as well as much more unpaired text data by multi-modal training, one needs to address two problems: 1) the synchronicity of feature sampling rates between speech and language (aka text data); 2) the homogeneity of the learned representations from two encoders. In this paper we propose to employ a novel bidirectional attention mechanism (BiAM) to jointly learn both ASR encoder (bottom layers) and text encoder with a multi-modal learning method. The BiAM is to facilitate feature sampling rate exchange, realizing the quality of the transformed features for the one kind to be measured in another space, with diversified objective functions. As a result, the speech representations are enriched with more linguistic information, while the representations generated by the text encoder are more similar to corresponding speech ones, and therefore the shared ASR models are more amenable for unpaired text data pretraining. To validate the efficacy of the proposed method, we perform two categories of experiments with or without extra unpaired text data. Experimental results on Librispeech corpus show it can achieve up to 6.15% word error rate reduction (WERR) with only paired data learning, while 9.23% WERR when more unpaired text data is employed.
△ Less
Submitted 1 November, 2022;
originally announced November 2022.
-
I2CR: Improving Noise Robustness on Keyword Spotting Using Inter-Intra Contrastive Regularization
Authors:
Dianwen Ng,
Jia Qi Yip,
Tanmay Surana,
Zhao Yang,
Chong Zhang,
Yukun Ma,
Chongjia Ni,
Eng Siong Chng,
Bin Ma
Abstract:
Noise robustness in keyword spotting remains a challenge as many models fail to overcome the heavy influence of noises, causing the deterioration of the quality of feature embeddings. We proposed a contrastive regularization method called Inter-Intra Contrastive Regularization (I2CR) to improve the feature representations by guiding the model to learn the fundamental speech information specific to…
▽ More
Noise robustness in keyword spotting remains a challenge as many models fail to overcome the heavy influence of noises, causing the deterioration of the quality of feature embeddings. We proposed a contrastive regularization method called Inter-Intra Contrastive Regularization (I2CR) to improve the feature representations by guiding the model to learn the fundamental speech information specific to the cluster. This involves maximizing the similarity across Intra and Inter samples of the same class. As a result, it pulls the instances closer to more generalized representations that form more prominent clusters and reduces the adverse impact of noises. We show that our method provides consistent improvements in accuracy over different backbone model architectures under different noise environments. We also demonstrate that our proposed framework has improved the accuracy of unseen out-of-domain noises and unseen variant noise SNRs. This indicates the significance of our work with the overall refinement in noise robustness.
△ Less
Submitted 13 September, 2022;
originally announced September 2022.
-
On Quantizing Implicit Neural Representations
Authors:
Cameron Gordon,
Shin-Fang Chng,
Lachlan MacDonald,
Simon Lucey
Abstract:
The role of quantization within implicit/coordinate neural networks is still not fully understood. We note that using a canonical fixed quantization scheme during training produces poor performance at low-rates due to the network weight distributions changing over the course of training. In this work, we show that a non-uniform quantization of neural weights can lead to significant improvements. S…
▽ More
The role of quantization within implicit/coordinate neural networks is still not fully understood. We note that using a canonical fixed quantization scheme during training produces poor performance at low-rates due to the network weight distributions changing over the course of training. In this work, we show that a non-uniform quantization of neural weights can lead to significant improvements. Specifically, we demonstrate that a clustered quantization enables improved reconstruction. Finally, by characterising a trade-off between quantization and network capacity, we demonstrate that it is possible (while memory inefficient) to reconstruct signals using binary neural networks. We demonstrate our findings experimentally on 2D image reconstruction and 3D radiance fields; and show that simple quantization methods and architecture search can achieve compression of NeRF to less than 16kb with minimal loss in performance (323x smaller than the original NeRF).
△ Less
Submitted 1 September, 2022;
originally announced September 2022.
-
DENT-DDSP: Data-efficient noisy speech generator using differentiable digital signal processors for explicit distortion modelling and noise-robust speech recognition
Authors:
Z. Guo,
C. Chen,
E. S. Chng
Abstract:
The performances of automatic speech recognition (ASR) systems degrade drastically under noisy conditions. Explicit distortion modelling (EDM), as a feature compensation step, is able to enhance ASR systems under such conditions by simulating the in-domain noisy speeches from the clean counterparts. Yet, existing distortion models are either non-trainable or unexplainable and often lack controllab…
▽ More
The performances of automatic speech recognition (ASR) systems degrade drastically under noisy conditions. Explicit distortion modelling (EDM), as a feature compensation step, is able to enhance ASR systems under such conditions by simulating the in-domain noisy speeches from the clean counterparts. Yet, existing distortion models are either non-trainable or unexplainable and often lack controllability and generalization ability. In this paper, we propose a fully explainable and controllable model: DENT-DDSP to achieve EDM. DENT-DDSP utilizes novel differentiable digital signal processing (DDSP) components and requires only 10 seconds of training data to achieve high fidelity. The experiment shows that the simulated noisy data from DENT-DDSP achieves the highest simulation fidelity compared to other baseline models in terms of multi-scale spectral loss (MSSL). Moreover, to validate whether the data simulated by DENT-DDSP are able to replace the scarce in-domain noisy data in the noise-robust ASR tasks, several downstream ASR models with the same architecture are trained using the simulated data and the real data. The experiment shows that the model trained with the simulated noisy data from DENT-DDSP achieves similar performances to the benchmark with a 2.7\% difference in terms of word error rate (WER). The code of the model is released online.
△ Less
Submitted 1 August, 2022;
originally announced August 2022.