Showing 1–50 of 90 results for author: Chng, S

  1. arXiv:2407.02243  [pdf, other]

    cs.CL cs.SD eess.AS

    Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization

    Authors: Yuchen Hu, Chen Chen, Siyin Wang, Eng Siong Chng, Chao Zhang

    Abstract: In this paper, we propose reverse inference optimization (RIO), a simple and effective method designed to enhance the robustness of autoregressive-model-based zero-shot text-to-speech (TTS) systems using reinforcement learning from human feedback (RLHF). To assess the quality of speech produced by the TTS system without human annotations, RIO introduces a novel concept termed as reverse inference…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 12 pages, Work in progress

  2. arXiv:2406.17376  [pdf, other]

    cs.SD cs.AI eess.AS

    Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

    Authors: Duc-Tuan Truong, Ruijie Tao, Tuan Nguyen, Hieu-Thi Luong, Kong Aik Lee, Eng Siong Chng

    Abstract: Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to the convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in sp…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

  3. arXiv:2406.12434  [pdf, other]

    cs.SD cs.LG eess.AS

    Towards Audio Codec-based Speech Separation

    Authors: Jia Qi Yip, Shengkui Zhao, Dianwen Ng, Eng Siong Chng, Bin Ma

    Abstract: Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet to be applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them imp…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: This paper was accepted by Interspeech 2024, Blue Sky Track

  4. arXiv:2406.02963  [pdf, other]

    cs.SD eess.AS

    Dataset-Distillation Generative Model for Speech Emotion Recognition

    Authors: Fabian Ritter-Gutierrez, Kuan-Po Huang, Jeremy H. M. Wong, Dianwen Ng, Hung-yi Lee, Nancy F. Chen, Eng Siong Chng

    Abstract: Deep learning models for speech rely on large datasets, presenting computational challenges. Yet, performance hinges on training data size. Dataset Distillation (DD) aims to learn a smaller dataset without much performance degradation when training with it. DD has been investigated in computer vision but not yet in speech. This paper presents the first approach for DD to speech targeting Speech Em…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  5. arXiv:2406.00654  [pdf, other]

    cs.CL cs.SD eess.AS

    Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback

    Authors: Chen Chen, Yuchen Hu, Wen Wu, Helin Wang, Eng Siong Chng, Chao Zhang

    Abstract: In recent years, text-to-speech (TTS) technology has witnessed impressive advancements, particularly with large-scale training datasets, showcasing human-level speech quality and impressive zero-shot capabilities on unseen speakers. However, despite human subjective evaluations, such as the mean opinion score (MOS), remaining the gold standard for assessing the quality of synthetic speech, even st…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: 19 pages, Preprint

  6. arXiv:2405.14161  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Chengwei Qin, Pin-Yu Chen, Eng Siong Chng, Chao Zhang

    Abstract: We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architecture with auto-regressive decoding (e.g., Whisper, Canary). Specifica…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 23 pages, Preprint

  7. arXiv:2405.10025  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models

    Authors: Yuchen Hu, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng, Ruizhe Li

    Abstract: Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and rich information in the N-best list, GER shows great effectiveness in enhancing ASR results. However, it still suf…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: 14 pages, Accepted by ACL 2024

  8. arXiv:2402.10642  [pdf, other]

    eess.AS cs.AI

    Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model

    Authors: Xiangyu Zhang, Daijiao Liu, Hexin Liu, Qiquan Zhang, Hanyu Meng, Leibny Paola Garcia, Eng Siong Chng, Lina Yao

    Abstract: Recently, Denoising Diffusion Probabilistic Models (DDPMs) have attained leading performances across a diverse range of generative tasks. However, in the field of speech synthesis, although DDPMs exhibit impressive performance, their long training duration and substantial inference costs hinder practical deployment. Existing approaches primarily focus on enhancing inference speed, while approaches…

    Submitted 16 February, 2024; originally announced February 2024.

  9. arXiv:2402.08784  [pdf, other]

    cs.CV cs.LG

    Preconditioners for the Stochastic Training of Implicit Neural Representations

    Authors: Shin-Fang Chng, Hemanth Saratchandran, Simon Lucey

    Abstract: Implicit neural representations have emerged as a powerful technique for encoding complex continuous multidimensional signals as neural networks, enabling a wide range of applications in computer vision, robotics, and geometry. While Adam is commonly used for training due to its stochastic proficiency, it entails lengthy training durations. To address this, we explore alternative optimization tech…

    Submitted 13 February, 2024; originally announced February 2024.

    Comments: The first two authors contributed equally

  10. arXiv:2402.06894  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

    Authors: Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, Eng Siong Chng

    Abstract: Recent advances in large language models (LLMs) have stepped forward the development of multilingual speech and machine translation by its reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the divers…

    Submitted 16 May, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

    Comments: 18 pages, Accepted by ACL 2024. This work is open sourced at: https://github.com/YUCHEN005/GenTranslate

  11. arXiv:2402.04783  [pdf, other]

    cs.LG

    Analyzing the Neural Tangent Kernel of Periodically Activated Coordinate Networks

    Authors: Hemanth Saratchandran, Shin-Fang Chng, Simon Lucey

    Abstract: Recently, neural networks utilizing periodic activation functions have been proven to demonstrate superior performance in vision tasks compared to traditional ReLU-activated networks. However, there is still a limited understanding of the underlying reasons for this improved performance. In this paper, we aim to address this gap by providing a theoretical understanding of periodically activated ne…

    Submitted 7 February, 2024; originally announced February 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2402.02711

  12. arXiv:2402.02711  [pdf, other]

    cs.LG

    Architectural Strategies for the optimization of Physics-Informed Neural Networks

    Authors: Hemanth Saratchandran, Shin-Fang Chng, Simon Lucey

    Abstract: Physics-informed neural networks (PINNs) offer a promising avenue for tackling both forward and inverse problems in partial differential equations (PDEs) by incorporating deep learning with fundamental physics principles. Despite their remarkable empirical success, PINNs have garnered a reputation for their notorious training challenges across a spectrum of PDEs. In this work, we delve into the in…

    Submitted 4 February, 2024; originally announced February 2024.

  13. arXiv:2401.05746  [pdf, other]

    cs.MM

    Cross-Modality and Within-Modality Regularization for Audio-Visual DeepFake Detection

    Authors: Heqing Zou, Meng Shen, Yuchen Hu, Chen Chen, Eng Siong Chng, Deepu Rajan

    Abstract: Audio-visual deepfake detection scrutinizes manipulations in public video using complementary multimodal cues. Current methods, which train on fused multimodal data for multimodal targets face challenges due to uncertainties and inconsistencies in learned representations caused by independent modality manipulations in deepfake videos. To address this, we propose cross-modality and within-modality…

    Submitted 11 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  14. arXiv:2401.03473  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge

    Authors: He Wang, Pengcheng Guo, Yue Li, Ao Zhang, Jiayao Sun, Lei Xie, Wei Chen, Pan Zhou, Hui Bu, Xin Xu, Binbin Zhang, Zhuo Chen, Jian Wu, Longbiao Wang, Eng Siong Chng, Sun Li

    Abstract: To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. This challenge collects over 100 hours of multi-channel speech data recorded inside a new energy vehicle and 40 hours…

    Submitted 20 February, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

    Comments: Accepted at ICASSP 2024

  15. arXiv:2312.12153  [pdf, other]

    cs.SD eess.AS

    Noise robust distillation of self-supervised speech models via correlation metrics

    Authors: Fabian Ritter-Gutierrez, Kuan-Po Huang, Dianwen Ng, Jeremy H. M. Wong, Hung-yi Lee, Eng Siong Chng, Nancy F. Chen

    Abstract: Compared to large speech foundation models, small distilled models exhibit degraded noise robustness. The student's robustness can be improved by introducing noise at the inputs during pre-training. Despite this, using the standard distillation loss still yields a student with degraded performance. Thus, this paper proposes improving student robustness via distillation with correlation metrics. Te…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: 6 pages

  16. arXiv:2310.13013  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Generative error correction for code-switching speech recognition using large language models

    Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Hexin Liu, Sabato Marco Siniscalchi, Eng Siong Chng

    Abstract: Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite the recent advances in automatic speech recognition (ASR), CS-ASR is still a challenging task owing to the grammatical structure complexity of the phenomenon and the data scarcity of specific training corpus. In this work, we propose to leverage large language models (LLMs) and lis…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP2024

  17. arXiv:2310.10301  [pdf, other]

    cs.CV cs.RO

    Multi-Body Neural Scene Flow

    Authors: Kavisha Vidanapathirana, Shin-Fang Chng, Xueqian Li, Simon Lucey

    Abstract: The test-time optimization of scene flow - using a coordinate network as a neural prior - has gained popularity due to its simplicity, lack of dataset bias, and state-of-the-art performance. We observe, however, that although coordinate networks capture general motions by implicitly regularizing the scene flow predictions to be spatially smooth, the neural prior by itself is unable to identify the…

    Submitted 6 February, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

    Comments: Accepted for 3DV'2024 (oral)

  18. arXiv:2309.15701  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

    Authors: Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, Eng Siong Chng

    Abstract: Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuit…

    Submitted 16 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to NeurIPS 2023, 24 pages. Datasets and Benchmarks Track. Added the first Mandarin and code-switching (zh-cn and en-us) results from the LLM-based generative ASR error correction to Table 8 on Page 21

  19. Emphasized Non-Target Speaker Knowledge in Knowledge Distillation for Automatic Speaker Verification

    Authors: Duc-Tuan Truong, Ruijie Tao, Jia Qi Yip, Kong Aik Lee, Eng Siong Chng

    Abstract: Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, the conventional label-level KD overlooks the significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for automa…

    Submitted 14 January, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted by ICASSP 2024

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10336-10340

  20. arXiv:2309.12608  [pdf, other]

    eess.AS cs.SD

    SPGM: Prioritizing Local Features for enhanced speech separation performance

    Authors: Jia Qi Yip, Shengkui Zhao, Yukun Ma, Chongjia Ni, Chong Zhang, Hao Wang, Trung Hieu Nguyen, Kun Zhou, Dianwen Ng, Eng Siong Chng, Bin Ma

    Abstract: Dual-path is a popular architecture for speech separation models (e.g. Sepformer) which splits long sequences into overlapping chunks for its intra- and inter-blocks that separately model intra-chunk local features and inter-chunk global relationships. However, it has been found that inter-blocks, which comprise half a dual-path model's parameters, contribute minimally to performance. Thus, we pro…

    Submitted 10 March, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

    Comments: This paper was accepted by ICASSP 2024

  21. arXiv:2309.09413  [pdf, other]

    cs.SD eess.AS

    Are Soft Prompts Good Zero-shot Learners for Speech Recognition?

    Authors: Dianwen Ng, Chong Zhang, Ruixi Zhang, Yukun Ma, Fabian Ritter-Gutierrez, Trung Hieu Nguyen, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma

    Abstract: Large self-supervised pre-trained speech models require computationally expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple parameter-efficient alternative by utilizing minimal soft prompt guidance, enhancing portability while also maintaining competitive performance. However, not many people understand how and why this is so. In this study, we aim to deepen our understa…

    Submitted 17 September, 2023; originally announced September 2023.

  22. arXiv:2309.07466  [pdf, other]

    eess.AS cs.SD

    Codec Data Augmentation for Time-domain Heart Sound Classification

    Authors: Ansh Mishra, Jia Qi Yip, Eng Siong Chng

    Abstract: Heart auscultations are a low-cost and effective way of detecting valvular heart diseases early, which can save lives. Nevertheless, it has been difficult to scale this screening method since the effectiveness of auscultations is dependent on the skill of doctors. As such, there has been increasing research interest in the automatic classification of heart sounds using deep learning algorithms. Ho…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted by ICAICTA 2023

  23. arXiv:2307.08029  [pdf, other]

    eess.AS cs.LG cs.SD

    Noise-aware Speech Enhancement using Diffusion Probabilistic Model

    Authors: Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng

    Abstract: With recent advances of diffusion model, generative speech enhancement (SE) has attracted a surge of research interest due to its great potential for unseen testing noises. However, existing efforts mainly focus on inherent properties of clean speech, underexploiting the varying noise information in real world. In this paper, we propose a noise-aware speech enhancement (NASE) approach that extract…

    Submitted 4 June, 2024; v1 submitted 16 July, 2023; originally announced July 2023.

    Comments: 5 pages, 2 figures, Accepted by InterSpeech 2024

  24. arXiv:2306.10567  [pdf, other]

    eess.AS cs.CV cs.MM cs.SD

    MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition

    Authors: Yuchen Hu, Chen Chen, Ruizhe Li, Heqing Zou, Eng Siong Chng

    Abstract: Audio-visual speech recognition (AVSR) attracts a surge of research interest recently by leveraging multimodal signals to understand human speech. Mainstream approaches addressing this task have developed sophisticated architectures and techniques for multi-modality fusion and representation learning. However, the natural heterogeneity of different modalities causes distribution gap between their…

    Submitted 18 June, 2023; originally announced June 2023.

    Comments: 14 pages, 5 figures, Accepted by ACL 2023

  25. arXiv:2306.10563  [pdf, other]

    eess.AS cs.CV cs.MM cs.SD

    Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition

    Authors: Yuchen Hu, Ruizhe Li, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng

    Abstract: Audio-visual speech recognition (AVSR) provides a promising solution to ameliorate the noise-robustness of audio-only speech recognition with visual information. However, most existing efforts still focus on audio modality to improve robustness considering its dominance in AVSR task, with noise adaptation techniques such as front-end denoise processing. Though effective, these methods are usually…

    Submitted 18 June, 2023; originally announced June 2023.

    Comments: 19 pages, 9 figures, Accepted by ACL 2023

  26. arXiv:2305.16932  [pdf, other]

    cs.SD cs.CL eess.AS

    A Neural State-Space Model Approach to Efficient Speech Separation

    Authors: Chen Chen, Chao-Han Huck Yang, Kai Li, Yuchen Hu, Pin-Jui Ku, Eng Siong Chng

    Abstract: In this work, we introduce S4M, a new efficient speech separation framework based on neural state-space models (SSM). Motivated by linear time-invariant systems for sequence modeling, our SSM-based approach can efficiently model input signals into a format of linear ordinary differential equations (ODEs) for representation learning. To extend the SSM technique into speech separation tasks, we firs…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted by InterSpeech 2023

  27. arXiv:2305.12121  [pdf, other]

    cs.SD cs.LG eess.AS

    ACA-Net: Towards Lightweight Speaker Verification using Asymmetric Cross Attention

    Authors: Jia Qi Yip, Tuan Truong, Dianwen Ng, Chong Zhang, Yukun Ma, Trung Hieu Nguyen, Chongjia Ni, Shengkui Zhao, Eng Siong Chng, Bin Ma

    Abstract: In this paper, we propose ACA-Net, a lightweight, global context-aware speaker embedding extractor for Speaker Verification (SV) that improves upon existing work by using Asymmetric Cross Attention (ACA) to replace temporal pooling. ACA is able to distill large, variable-length sequences into small, fixed-sized latents by attending a small query to large key and value matrices. In ACA-Net, we buil…

    Submitted 20 May, 2023; originally announced May 2023.

    Comments: Accepted to INTERSPEECH 2023

  28. arXiv:2305.10761  [pdf, other]

    cs.SD eess.AS

    Noise-Aware Speech Separation with Contrastive Learning

    Authors: Zizheng Zhang, Chen Chen, Hsin-Hung Chen, Xiang Liu, Yuchen Hu, Eng Siong Chng

    Abstract: Recently, speech separation (SS) task has achieved remarkable progress driven by deep learning technique. However, it is still challenging to separate target speech from noisy mixture, as the neural model is vulnerable to assign background noise to each speaker. In this paper, we propose a noise-aware SS (NASS) method, which aims to improve the speech quality for separated signals under noisy cond…

    Submitted 8 January, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: 5 pages, 3 figures, ICASSP 2024

  29. arXiv:2305.09299  [pdf, other]

    cs.CV cs.CL

    UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning

    Authors: Heqing Zou, Meng Shen, Chen Chen, Yuchen Hu, Deepu Rajan, Eng Siong Chng

    Abstract: Multimodal learning aims to imitate human beings to acquire complementary information from multiple modalities for various downstream tasks. However, traditional aggregation-based multimodal fusion methods ignore the inter-modality relationship, treat each modality equally, suffer sensor noise, and thus reduce multimodal learning performance. In this work, we propose a novel multimodal contrastive…

    Submitted 16 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Findings

  30. arXiv:2305.09212  [pdf, other]

    eess.AS cs.CV cs.MM cs.SD

    Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition

    Authors: Yuchen Hu, Ruizhe Li, Chen Chen, Heqing Zou, Qiushi Zhu, Eng Siong Chng

    Abstract: Audio-visual speech recognition (AVSR) research has gained a great success recently by improving the noise-robustness of audio-only automatic speech recognition (ASR) with noise-invariant visual information. However, most existing AVSR approaches simply fuse the audio and visual features by concatenation, without explicit interactions to capture the deep correlations between them, which results in…

    Submitted 16 May, 2023; originally announced May 2023.

    Comments: 12 pages, 5 figures, Accepted by IJCAI 2023

  31. arXiv:2305.08552  [pdf, other]

    cs.CV

    Curvature-Aware Training for Coordinate Networks

    Authors: Hemanth Saratchandran, Shin-Fang Chng, Sameera Ramasinghe, Lachlan MacDonald, Simon Lucey

    Abstract: Coordinate networks are widely used in computer vision due to their ability to represent signals as compressed, continuous entities. However, training these networks with first-order optimizers can be slow, hindering their use in real-time applications. Recent works have opted for shallow voxel-based representations to achieve faster training, but this sacrifices memory efficiency. This work propo…

    Submitted 15 May, 2023; originally announced May 2023.

  32. arXiv:2305.01170  [pdf, other]

    cs.SD eess.AS

    Contrastive Speech Mixup for Low-resource Keyword Spotting

    Authors: Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Chong Zhang, Yukun Ma, Trung Hieu Nguyen, Chongjia Ni, Eng Siong Chng, Bin Ma

    Abstract: Most of the existing neural-based models for keyword spotting (KWS) in smart devices require thousands of training samples to learn a decent audio representation. However, with the rising demand for smart devices to become more personalized, KWS models need to adapt quickly to smaller user samples. To tackle this challenge, we propose a contrastive speech mixup (CosMix) learning algorithm for low-…

    Submitted 1 May, 2023; originally announced May 2023.

    Comments: Accepted by ICASSP 2023

  33. arXiv:2304.04974  [pdf, other]

    eess.AS cs.LG cs.SD

    Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR

    Authors: Yuchen Hu, Chen Chen, Qiushi Zhu, Eng Siong Chng

    Abstract: Automatic speech recognition (ASR) has gained remarkable successes thanks to recent advances of deep learning, but it usually degrades significantly under real-world noisy conditions. Recent works introduce speech enhancement (SE) as front-end to improve speech quality, which is proved effective but may not be optimal for downstream ASR due to speech distortion problem. Based on that, latest works…

    Submitted 18 April, 2024; v1 submitted 11 April, 2023; originally announced April 2023.

    Comments: 12 pages, 7 figures, IEEE/ACM Transactions on Audio, Speech, and Language Processing

  34. arXiv:2302.14597  [pdf, other]

    cs.SD eess.AS

    deHuBERT: Disentangling Noise in a Self-supervised Model for Robust Speech Recognition

    Authors: Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Zhao Yang, Jinjie Ni, Chong Zhang, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma

    Abstract: Existing self-supervised pre-trained speech models have offered an effective way to leverage massive unannotated corpora to build good automatic speech recognition (ASR). However, many current models are trained on a clean corpus from a single source, which tends to do poorly when noise is present during testing. Nonetheless, it is crucial to overcome the adverse influence of noise for real-world…

    Submitted 28 February, 2023; originally announced February 2023.

    Comments: Accepted by ICASSP 2023

  35. arXiv:2302.11989  [pdf, other]

    cs.SD cs.CL eess.AS

    Metric-oriented Speech Enhancement using Diffusion Probabilistic Model

    Authors: Chen Chen, Yuchen Hu, Weiwei Weng, Eng Siong Chng

    Abstract: Deep neural network based speech enhancement technique focuses on learning a noisy-to-clean transformation supervised by paired training data. However, the task-specific evaluation metric (e.g., PESQ) is usually non-differentiable and can not be directly constructed in the training criteria. This mismatch between the training objective and evaluation metric likely results in sub-optimal performanc…

    Submitted 23 February, 2023; originally announced February 2023.

    Comments: Accepted by ICASSP2023

  36. arXiv:2302.11981  [pdf, other]

    cs.SD cs.AI eess.AS

    Unsupervised Noise adaptation using Data Simulation

    Authors: Chen Chen, Yuchen Hu, Heqing Zou, Linhui Sun, Eng Siong Chng

    Abstract: Deep neural network based speech enhancement approaches aim to learn a noisy-to-clean transformation using a supervised learning paradigm. However, such a trained-well transformation is vulnerable to unseen noises that are not included in training set. In this work, we focus on the unsupervised noise adaptation problem in speech enhancement, where the ground truth of target domain data is complete…

    Submitted 23 February, 2023; originally announced February 2023.

    Comments: Accepted by ICASSP2023

  37. arXiv:2302.11362  [pdf, other]

    eess.AS cs.LG cs.SD

    Gradient Remedy for Multi-Task Learning in End-to-End Noise-Robust Speech Recognition

    Authors: Yuchen Hu, Chen Chen, Ruizhe Li, Qiushi Zhu, Eng Siong Chng

    Abstract: Speech enhancement (SE) is proved effective in reducing noise from noisy speech signals for downstream automatic speech recognition (ASR), where multi-task learning strategy is employed to jointly optimize these two tasks. However, the enhanced speech learned by SE objective may not always yield good ASR results. From the optimization view, there sometimes exists interference between the gradients…

    Submitted 3 May, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: 5 pages, 5 figures, Accepted by ICASSP 2023

  38. arXiv:2302.11131  [pdf, other]

    eess.AS cs.LG cs.SD

    Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation

    Authors: Yuchen Hu, Chen Chen, Heqing Zou, Xionghu Zhong, Eng Siong Chng

    Abstract: Recent studies in neural network-based monaural speech separation (SS) have achieved a remarkable success thanks to increasing ability of long sequence modeling. However, they would degrade significantly when put under realistic noisy conditions, as the background noise could be mistaken for speaker's speech and thus interfere with the separated sources. To alleviate this problem, we propose a nov…

    Submitted 21 February, 2023; originally announced February 2023.

    Comments: 5 pages, 5 figures, Accepted by ICASSP 2023

  39. arXiv:2302.09523  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Probabilistic Back-ends for Online Speaker Recognition and Clustering

    Authors: Alexey Sholokhov, Nikita Kuzmin, Kong Aik Lee, Eng Siong Chng

    Abstract: This paper focuses on multi-enrollment speaker recognition which naturally occurs in the task of online speaker clustering, and studies the properties of different scoring back-ends in this scenario. First, we show that popular cosine scoring suffers from poor score calibration with a varying number of enrollment utterances. Second, we propose a simple replacement for cosine scoring based on an ex…

    Submitted 19 February, 2023; originally announced February 2023.

    Comments: Accepted to ICASSP 2023

  40. arXiv:2302.08229  [pdf, other]

    cs.LG cs.CL

    Improving Spoken Language Identification with Map-Mix

    Authors: Shangeth Rajaa, Kriti Anandan, Swaraj Dalmia, Tarun Gupta, Eng Siong Chng

    Abstract: The pre-trained multi-lingual XLSR model generalizes well for language identification after fine-tuning on unseen languages. However, the performance significantly degrades when the languages are not very distinct from each other, for example, in the case of dialects. Low resource dialect classification remains a challenging problem to solve. We present a new data augmentation method that leverage…

    Submitted 16 February, 2023; originally announced February 2023.

    Comments: Accepted at ICASSP 2023

  41. arXiv:2212.05301  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning

    Authors: Chen Chen, Yuchen Hu, Qiang Zhang, Heqing Zou, Beier Zhu, Eng Siong Chng

    Abstract: Audio-visual speech recognition (AVSR) has gained remarkable success for ameliorating the noise-robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on audio modality as it is much easier to recognize than video modality in clean conditions. As a result, th…

    Submitted 2 February, 2023; v1 submitted 10 December, 2022; originally announced December 2022.

    Comments: Accepted by AAAI2023

  42. arXiv:2211.01585  [pdf, other]

    cs.SD eess.AS

    The ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge (ICSRC): Dataset, Tracks, Baseline and Results

    Authors: Ao Zhang, Fan Yu, Kaixun Huang, Lei Xie, Longbiao Wang, Eng Siong Chng, Hui Bu, Binbin Zhang, Wei Chen, Xin Xu

    Abstract: This paper summarizes the outcomes from the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge (ICSRC). We first address the necessity of the challenge and then introduce the associated dataset collected from a new-energy vehicle (NEV) covering a variety of cockpit acoustic conditions and linguistic contents. We then describe the track arrangement and the baseline system. Specifically, w…

    Submitted 3 November, 2022; originally announced November 2022.

    Comments: Accepted by ISCSLP2022

  43. arXiv:2211.00325  [pdf, other]

    eess.AS cs.CL cs.SD

    Speech-text based multi-modal training with bidirectional attention for improved speech recognition

    Authors: Yuhang Yang, Haihua Xu, Hao Huang, Eng Siong Chng, Sheng Li

    Abstract: To let the state-of-the-art end-to-end ASR model enjoy data efficiency, as well as much more unpaired text data by multi-modal training, one needs to address two problems: 1) the synchronicity of feature sampling rates between speech and language (aka text data); 2) the homogeneity of the learned representations from two encoders. In this paper we propose to employ a novel bidirectional attention…

    Submitted 1 November, 2022; originally announced November 2022.

    Comments: 5 pages, 3 figures, 3 tables

  44. arXiv:2209.06360  [pdf, other]

    cs.SD eess.AS

    I2CR: Improving Noise Robustness on Keyword Spotting Using Inter-Intra Contrastive Regularization

    Authors: Dianwen Ng, Jia Qi Yip, Tanmay Surana, Zhao Yang, Chong Zhang, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma

    Abstract: Noise robustness in keyword spotting remains a challenge as many models fail to overcome the heavy influence of noises, causing the deterioration of the quality of feature embeddings. We proposed a contrastive regularization method called Inter-Intra Contrastive Regularization (I2CR) to improve the feature representations by guiding the model to learn the fundamental speech information specific to…

    Submitted 13 September, 2022; originally announced September 2022.

  45. arXiv:2209.01019  [pdf, other]

    cs.CV

    On Quantizing Implicit Neural Representations

    Authors: Cameron Gordon, Shin-Fang Chng, Lachlan MacDonald, Simon Lucey

    Abstract: The role of quantization within implicit/coordinate neural networks is still not fully understood. We note that using a canonical fixed quantization scheme during training produces poor performance at low-rates due to the network weight distributions changing over the course of training. In this work, we show that a non-uniform quantization of neural weights can lead to significant improvements. S…

    Submitted 1 September, 2022; originally announced September 2022.

    Comments: 10 pages, 10 figures

  46. arXiv:2208.00987  [pdf, other]

    eess.AS cs.SD

    DENT-DDSP: Data-efficient noisy speech generator using differentiable digital signal processors for explicit distortion modelling and noise-robust speech recognition

    Authors: Z. Guo, C. Chen, E. S. Chng

    Abstract: The performances of automatic speech recognition (ASR) systems degrade drastically under noisy conditions. Explicit distortion modelling (EDM), as a feature compensation step, is able to enhance ASR systems under such conditions by simulating the in-domain noisy speeches from the clean counterparts. Yet, existing distortion models are either non-trainable or unexplainable and often lack controllab…

    Submitted 1 August, 2022; originally announced August 2022.

  47. arXiv:2207.07429  [pdf, other]

    cs.SD cs.AI eess.AS

    Continual Learning For On-Device Environmental Sound Classification

    Authors: Yang Xiao, Xubo Liu, James King, Arshdeep Singh, Eng Siong Chng, Mark D. Plumbley, Wenwu Wang

    Abstract: Continuously learning new classes without catastrophic forgetting is a challenging problem for on-device environmental sound classification given the restrictions on computation resources (e.g., model size, running memory). To address this issue, we propose a simple and efficient continual learning method. Our method selects the historical data for the training by measuring the per-sample classifi…

    Submitted 18 July, 2022; v1 submitted 15 July, 2022; originally announced July 2022.

    Comments: The first two authors contributed equally; 5 pages, one figure; submitted to DCASE2022 Workshop

  48. arXiv:2207.04177  [pdf, other]

    eess.AS cs.SD

    Intermediate-layer output Regularization for Attention-based Speech Recognition with Shared Decoder

    Authors: Jicheng Zhang, Yizhou Peng, Haihua Xu, Yi He, Eng Siong Chng, Hao Huang

    Abstract: Intermediate layer output (ILO) regularization by means of multitask training on encoder side has been shown to be an effective approach to yielding improved results on a wide range of end-to-end ASR frameworks. In this paper, we propose a novel method to do ILO regularized training differently. Instead of using conventional multitask methods that entail more training overhead, we directly make th…

    Submitted 8 July, 2022; originally announced July 2022.

    Comments: 5 pages. Submitted to INTERSPEECH 2022

  49. arXiv:2207.04176  [pdf, other]

    eess.AS cs.CL cs.SD

    Internal Language Model Estimation based Language Model Fusion for Cross-Domain Code-Switching Speech Recognition

    Authors: Yizhou Peng, Yufei Liu, Jicheng Zhang, Haihua Xu, Yi He, Hao Huang, Eng Siong Chng

    Abstract: Internal Language Model Estimation (ILME) based language model (LM) fusion has been shown significantly improved recognition results over conventional shallow fusion in both intra-domain and cross-domain speech recognition tasks. In this paper, we attempt to apply our ILME method to cross-domain code-switching speech recognition (CSSR) work. Specifically, our curiosity comes from several aspects.…

    Submitted 8 July, 2022; originally announced July 2022.

    Comments: 5 pages. Submitted to INTERSPEECH 2022

  50. arXiv:2206.14659  [pdf, other]

    cs.SD cs.CL cs.IR eess.AS

    Language-Based Audio Retrieval with Converging Tied Layers and Contrastive Loss

    Authors: Andrew Koh, Eng Siong Chng

    Abstract: In this paper, we tackle the new Language-Based Audio Retrieval task proposed in DCASE 2022. Firstly, we introduce a simple, scalable architecture which ties both the audio and text encoder together. Secondly, we show that using this architecture along with contrastive loss allows the model to significantly beat the performance of the baseline model. Finally, in addition to having an extremely low…

    Submitted 29 June, 2022; originally announced June 2022.