Showing 51–98 of 98 results for author: Chng, S

  1. arXiv:2208.00935  [pdf, other]

    q-bio.QM eess.AS

    Amino Acid Classification in 2D NMR Spectra via Acoustic Signal Embeddings

    Authors: Jia Qi Yip, Dianwen Ng, Bin Ma, Konstantin Pervushin, Eng Siong Chng

    Abstract: Nuclear Magnetic Resonance (NMR) is used in structural biology to experimentally determine the structure of proteins, which is used in many areas of biology and is an important part of drug development. Unfortunately, NMR data can cost thousands of dollars per sample to collect and it can take a specialist weeks to assign the observed resonances to specific chemical groups. There has thus been gro…

    Submitted 1 August, 2022; originally announced August 2022.

  2. arXiv:2207.07429  [pdf, other]

    cs.SD cs.AI eess.AS

    Continual Learning For On-Device Environmental Sound Classification

    Authors: Yang Xiao, Xubo Liu, James King, Arshdeep Singh, Eng Siong Chng, Mark D. Plumbley, Wenwu Wang

    Abstract: Continuously learning new classes without catastrophic forgetting is a challenging problem for on-device environmental sound classification given the restrictions on computation resources (e.g., model size, running memory). To address this issue, we propose a simple and efficient continual learning method. Our method selects the historical data for the training by measuring the per-sample classifi…

    Submitted 18 July, 2022; v1 submitted 15 July, 2022; originally announced July 2022.

    Comments: The first two authors contributed equally, 5 pages one figure, submitted to DCASE2022 Workshop

  3. arXiv:2207.04177  [pdf, other]

    eess.AS cs.SD

    Intermediate-layer output Regularization for Attention-based Speech Recognition with Shared Decoder

    Authors: Jicheng Zhang, Yizhou Peng, Haihua Xu, Yi He, Eng Siong Chng, Hao Huang

    Abstract: Intermediate layer output (ILO) regularization by means of multitask training on encoder side has been shown to be an effective approach to yielding improved results on a wide range of end-to-end ASR frameworks. In this paper, we propose a novel method to do ILO regularized training differently. Instead of using conventional multitask methods that entail more training overhead, we directly make th…

    Submitted 8 July, 2022; originally announced July 2022.

    Comments: 5 pages. Submitted to INTERSPEECH 2022

  4. arXiv:2207.04176  [pdf, other]

    eess.AS cs.CL cs.SD

    Internal Language Model Estimation based Language Model Fusion for Cross-Domain Code-Switching Speech Recognition

    Authors: Yizhou Peng, Yufei Liu, Jicheng Zhang, Haihua Xu, Yi He, Hao Huang, Eng Siong Chng

    Abstract: Internal Language Model Estimation (ILME) based language model (LM) fusion has been shown significantly improved recognition results over conventional shallow fusion in both intra-domain and cross-domain speech recognition tasks. In this paper, we attempt to apply our ILME method to cross-domain code-switching speech recognition (CSSR) work. Specifically, our curiosity comes from several aspects.…

    Submitted 8 July, 2022; originally announced July 2022.

    Comments: 5 pages. Submitted to INTERSPEECH 2022

  5. arXiv:2206.14659  [pdf, other]

    cs.SD cs.CL cs.IR eess.AS

    Language-Based Audio Retrieval with Converging Tied Layers and Contrastive Loss

    Authors: Andrew Koh, Eng Siong Chng

    Abstract: In this paper, we tackle the new Language-Based Audio Retrieval task proposed in DCASE 2022. Firstly, we introduce a simple, scalable architecture which ties both the audio and text encoder together. Secondly, we show that using this architecture along with contrastive loss allows the model to significantly beat the performance of the baseline model. Finally, in addition to having an extremely low…

    Submitted 29 June, 2022; originally announced June 2022.

  6. arXiv:2204.06260  [pdf, other]

    cs.CL cs.SD eess.AS

    Self-critical Sequence Training for Automatic Speech Recognition

    Authors: Chen Chen, Yuchen Hu, Nana Hou, Xiaofeng Qi, Heqing Zou, Eng Siong Chng

    Abstract: Although automatic speech recognition (ASR) task has gained remarkable success by sequence-to-sequence models, there are two main mismatches between its training and testing that might lead to performance degradation: 1) The typically used cross-entropy criterion aims to maximize log-likelihood of the training data, while the performance is evaluated by word error rate (WER), not log-likelihood; 2…

    Submitted 13 April, 2022; originally announced April 2022.

    Comments: Accepted by ICASSP 2022

  7. arXiv:2204.05735  [pdf, other]

    cs.CV

    GARF: Gaussian Activated Radiance Fields for High Fidelity Reconstruction and Pose Estimation

    Authors: Shin-Fang Chng, Sameera Ramasinghe, Jamie Sherrah, Simon Lucey

    Abstract: Despite Neural Radiance Fields (NeRF) showing compelling results in photorealistic novel views synthesis of real-world scenes, most existing approaches require accurate prior camera poses. Although approaches for jointly recovering the radiance field and camera pose exist (BARF), they rely on a cumbersome coarse-to-fine auxiliary positional embedding to ensure good performance. We present Gaussian…

    Submitted 12 April, 2022; originally announced April 2022.

    Comments: Project page: https://sfchng.github.io/garf/

  8. arXiv:2204.05445  [pdf, other]

    cs.SD eess.AS

    Small Footprint Multi-channel ConvMixer for Keyword Spotting with Centroid Based Awareness

    Authors: Dianwen Ng, Jin Hui Pang, Yang Xiao, Biao Tian, Qiang Fu, Eng Siong Chng

    Abstract: It is critical for a keyword spotting model to have a small footprint as it typically runs on-device with low computational resources. However, maintaining the previous SOTA performance with reduced model size is challenging. In addition, a far-field and noisy environment with multiple signals interference aggravates the problem causing the accuracy to degrade significantly. In this paper, we pres…

    Submitted 11 April, 2022; originally announced April 2022.

    Comments: submitted to INTERSPEECH 2022

  9. arXiv:2203.16361  [pdf, other]

    cs.SD cs.CL eess.AS

    Rainbow Keywords: Efficient Incremental Learning for Online Spoken Keyword Spotting

    Authors: Yang Xiao, Nana Hou, Eng Siong Chng

    Abstract: Catastrophic forgetting is a thorny challenge when updating keyword spotting (KWS) models after deployment. This problem will be more challenging if KWS models are further required for edge devices due to their limited memory. To alleviate such an issue, we propose a novel diversity-aware incremental learning method named Rainbow Keywords (RK). Specifically, the proposed RK approach introduces a d…

    Submitted 30 June, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted to Interspeech 2022

  10. arXiv:2203.15526  [pdf, other]

    cs.SD cs.CL eess.AS

    Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning

    Authors: Chen Chen, Nana Hou, Yuchen Hu, Heqing Zou, Xiaofeng Qi, Eng Siong Chng

    Abstract: Automated Audio captioning (AAC) is a cross-modal task that generates natural language to describe the content of input audio. Most prior works usually extract single-modality acoustic features and are therefore sub-optimal for the cross-modal decoding task. In this work, we propose a novel AAC system called CLIP-AAC to learn interactive cross-modality representation with both acoustic and textual…

    Submitted 12 April, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Submitted to Interspeech 2022

  11. arXiv:2203.15326  [pdf, other]

    cs.SD cs.AI eess.AS

    Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information

    Authors: Heqing Zou, Yuke Si, Chen Chen, Deepu Rajan, Eng Siong Chng

    Abstract: Speech Emotion Recognition (SER) aims to help the machine to understand human's subjective emotion from only audio information. However, extracting and utilizing comprehensive in-depth audio information is still a challenging task. In this paper, we propose an end-to-end speech emotion recognition system using multi-level acoustic information with a newly designed co-attention module. We firstly e…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted by ICASSP 2022

  12. arXiv:2203.15321  [pdf, other]

    cs.SD cs.CL eess.AS

    Noise-robust Speech Recognition with 10 Minutes Unparalleled In-domain Data

    Authors: Chen Chen, Nana Hou, Yuchen Hu, Shashank Shirol, Eng Siong Chng

    Abstract: Noise-robust speech recognition systems require large amounts of training data including noisy speech data and corresponding transcripts to achieve state-of-the-art performances in face of various practical environments. However, such plenty of in-domain data is not always available in the real-life world. In this paper, we propose a generative adversarial network to simulate noisy spectrum from t…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted by ICASSP2022

  13. arXiv:2203.14838  [pdf, other]

    eess.AS cs.LG cs.SD

    Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition

    Authors: Yuchen Hu, Nana Hou, Chen Chen, Eng Siong Chng

    Abstract: Automatic speech recognition (ASR) systems degrade significantly under noisy conditions. Recently, speech enhancement (SE) is introduced as front-end to reduce noise for ASR, but it also suppresses some important speech information, i.e., over-suppression. To alleviate this, we propose a dual-path style learning approach for end-to-end noise-robust speech recognition (DPSL-ASR). Specifically, we f…

    Submitted 27 May, 2023; v1 submitted 28 March, 2022; originally announced March 2022.

    Comments: 5 pages, 3 figures, Accepted by InterSpeech 2023

  14. arXiv:2202.09995  [pdf, other]

    eess.AS cs.SD

    L-SpEx: Localized Target Speaker Extraction

    Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

    Abstract: Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from the location or direction of the target speaker. However, these studies assume that the target speaker's location is known in advance or detected by an extra visual cue, e.g., face image or video. In this…

    Submitted 21 February, 2022; originally announced February 2022.

    Comments: Accepted in ICASSP 2022

  15. ConvMixer: Feature Interactive Convolution with Curriculum Learning for Small Footprint and Noisy Far-field Keyword Spotting

    Authors: Dianwen Ng, Yunqi Chen, Biao Tian, Qiang Fu, Eng Siong Chng

    Abstract: Building efficient architecture in neural speech processing is paramount to success in keyword spotting deployment. However, it is very challenging for lightweight models to achieve noise robustness with concise neural operations. In a real-world application, the user environment is typically noisy and may also contain reverberations. We proposed a novel feature interactive convolutional model wit…

    Submitted 15 January, 2022; originally announced January 2022.

    Comments: submitted to ICASSP 2022

  16. arXiv:2110.08545  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    A Unified Speaker Adaptation Approach for ASR

    Authors: Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma

    Abstract: Transformer models have been used in automatic speech recognition (ASR) successfully and yields state-of-the-art results. However, its performance is still affected by speaker mismatch between training and test data. Further finetuning a trained model with target speaker data is the most natural approach for adaptation, but it takes a lot of compute and may cause catastrophic forgetting to the exi…

    Submitted 16 October, 2021; originally announced October 2021.

    Comments: Accepted by EMNLP 2021

  17. arXiv:2110.05267  [pdf, other]

    eess.AS cs.LG cs.SD

    Interactive Feature Fusion for End-to-End Noise-Robust Speech Recognition

    Authors: Yuchen Hu, Nana Hou, Chen Chen, Eng Siong Chng

    Abstract: Speech enhancement (SE) aims to suppress the additive noise from a noisy speech signal to improve the speech's perceptual quality and intelligibility. However, the over-suppression phenomenon in the enhanced speech might degrade the performance of downstream automatic speech recognition (ASR) task due to the missing latent information. To alleviate such problem, we propose an interactive feature f…

    Submitted 7 April, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

    Comments: 5 pages, 7 figures, Accepted by ICASSP 2022

  18. arXiv:2110.03573  [pdf, other]

    eess.AS

    Minimum word error training for non-autoregressive Transformer-based code-switching ASR

    Authors: Yizhou Peng, Jicheng Zhang, Haihua Xu, Hao Huang, Eng Siong Chng

    Abstract: Non-autoregressive end-to-end ASR framework might be potentially appropriate for code-switching recognition task thanks to its inherent property that present output token being independent of historical ones. However, it still under-performs the state-of-the-art autoregressive ASR frameworks. In this paper, we propose various approaches to boosting the performance of a CTC-mask-based nonautoregres…

    Submitted 7 October, 2021; originally announced October 2021.

    Comments: Submit to ICASSP 2021

  19. arXiv:2108.04692  [pdf, other]

    cs.CL cs.SD eess.AS

    Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization

    Authors: Andrew Koh, Fuzhao Xue, Eng Siong Chng

    Abstract: In this paper, we examine the use of Transfer Learning using Pretrained Audio Neural Networks (PANNs), and propose an architecture that is able to better leverage the acoustic features provided by PANNs for the Automated Audio Captioning Task. We also introduce a novel self-supervised objective, Reconstruction Latent Space Similarity Regularization (RLSSR). The RLSSR module supplements the trainin…

    Submitted 10 August, 2021; originally announced August 2021.

    Comments: to be submitted to icassp 2022

    MSC Class: 68T50 ACM Class: I.2.7

  20. arXiv:2107.10701  [pdf, other]

    eess.AS cs.SD

    Multitask-Based Joint Learning Approach To Robust ASR For Radio Communication Speech

    Authors: Duo Ma, Nana Hou, Van Tung Pham, Haihua Xu, Eng Siong Chng

    Abstract: To realize robust end-to-end Automatic Speech Recognition(E2E ASR) under radio communication condition, we propose a multitask-based method to joint train a Speech Enhancement (SE) module as the front-end and an E2E ASR model as the back-end in this paper. One of the advantage of the proposed method is that the entire system can be trained from scratch. Different from prior works, either component…

    Submitted 22 July, 2021; originally announced July 2021.

    Comments: 7 pages, 3 figures, Submitted to APSIPA2021

  21. arXiv:2106.08211  [pdf, other]

    eess.AS

    E2E-based Multi-task Learning Approach to Joint Speech and Accent Recognition

    Authors: Jicheng Zhang, Yizhou Peng, Pham Van Tung, Haihua Xu, Hao Huang, Eng Siong Chng

    Abstract: In this paper, we propose a single multi-task learning framework to perform End-to-End (E2E) speech recognition (ASR) and accent recognition (AR) simultaneously. The proposed framework is not only more compact but can also yield comparable or even better results than standalone systems. Specifically, we found that the overall performance is predominantly determined by the ASR task, and the E2E-bas…

    Submitted 15 June, 2021; originally announced June 2021.

  22. arXiv:2103.08292  [pdf, other]

    cs.CV

    Rotation Coordinate Descent for Fast Globally Optimal Rotation Averaging

    Authors: Álvaro Parra, Shin-Fang Chng, Tat-Jun Chin, Anders Eriksson, Ian Reid

    Abstract: Under mild conditions on the noise level of the measurements, rotation averaging satisfies strong duality, which enables global solutions to be obtained via semidefinite programming (SDP) relaxation. However, generic solvers for SDP are rather slow in practice, even on rotation averaging instances of moderate size, thus developing specialised algorithms is vital. In this paper, we present a fast a…

    Submitted 15 March, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

    Comments: Accepted to CVPR 2021 as an oral presentation

  23. arXiv:2101.05056  [pdf, other]

    cs.SD cs.CL cs.LG

    End-to-End Speaker Height and age estimation using Attention Mechanism with LSTM-RNN

    Authors: Manav Kaushik, Van Tung Pham, Eng Siong Chng

    Abstract: Automatic height and age estimation of speakers using acoustic features is widely used for the purpose of human-computer interaction, forensics, etc. In this work, we propose a novel approach of using attention mechanism to build an end-to-end architecture for height and age estimation. The attention mechanism is combined with Long Short-Term Memory(LSTM) encoder which is able to capture long-term…

    Submitted 13 January, 2021; originally announced January 2021.

    Comments: 5 Pages

  24. An Embarrassingly Simple Model for Dialogue Relation Extraction

    Authors: Fuzhao Xue, Aixin Sun, Hao Zhang, Jinjie Ni, Eng Siong Chng

    Abstract: Dialogue relation extraction (RE) is to predict the relation type of two entities mentioned in a dialogue. In this paper, we propose a simple yet effective model named SimpleRE for the RE task. SimpleRE captures the interrelations among multiple relations in a dialogue through a novel input format named BERT Relation Token Sequence (BRS). In BRS, multiple [CLS] tokens are used to capture possible…

    Submitted 24 January, 2022; v1 submitted 27 December, 2020; originally announced December 2020.

    Comments: Accepted by ICASSP 2022

  25. GDPNet: Refining Latent Multi-View Graph for Relation Extraction

    Authors: Fuzhao Xue, Aixin Sun, Hao Zhang, Eng Siong Chng

    Abstract: Relation Extraction (RE) is to predict the relation type of two entities that are mentioned in a piece of text, e.g., a sentence or a dialogue. When the given text is long, it is challenging to identify indicative words for the relation prediction. Recent advances on RE task are from BERT-based sequence modeling and graph-based modeling of relationships among the tokens in the sequence. In this pa…

    Submitted 12 December, 2020; originally announced December 2020.

    Comments: To appear at AAAI 2021

  26. arXiv:2011.09624  [pdf, other]

    eess.AS cs.LG

    Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals

    Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

    Abstract: Speaker extraction requires a sample speech from the target speaker as the reference. However, enrolling a speaker with a long speech is not practical. We propose a speaker extraction technique, that performs in multiple stages to take full advantage of short reference speech sample. The extracted speech in early stages is used as the reference speech for late stages. For the first time, we use fr…

    Submitted 2 April, 2021; v1 submitted 18 November, 2020; originally announced November 2020.

    Comments: Accepted in ICASSP 2021

  27. arXiv:2010.12143  [pdf, other]

    cs.SD eess.AS

    Enriching Under-Represented Named-Entities To Improve Speech Recognition Performance

    Authors: Tingzhi Mao, Yerbolat Khassanov, Van Tung Pham, Haihua Xu, Hao Huang, Aishan Wumaier, Eng Siong Chng

    Abstract: Automatic speech recognition (ASR) for under-represented named-entity (UR-NE) is challenging due to such named-entities (NE) have insufficient instances and poor contextual coverage in the training data to learn reliable estimates and representations. In this paper, we propose approaches to enriching UR-NEs to improve speech recognition performance. Specifically, our first priority is to ensure th…

    Submitted 22 October, 2020; originally announced October 2020.

  28. arXiv:2010.11483  [pdf, other]

    eess.AS cs.SD

    Multilingual Approach to Joint Speech and Accent Recognition with DNN-HMM Framework

    Authors: Yizhou Peng, Jicheng Zhang, Haobo Zhang, Haihua Xu, Hao Huang, Eng Siong Chng

    Abstract: Human can recognize speech, as well as the peculiar accent of the speech simultaneously. However, present state-of-the-art ASR system can rarely do that. In this paper, we propose a multilingual approach to recognizing English speech, and related accent that speaker conveys using DNN-HMM framework. Specifically, we assume different accents of English as different languages. We then merge them toge…

    Submitted 8 May, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: 5 pages, Conference

  29. arXiv:2009.11795  [pdf, other]

    cs.CL cs.LG

    Adapting BERT for Word Sense Disambiguation with Gloss Selection Objective and Example Sentences

    Authors: Boon Peng Yap, Andrew Koh, Eng Siong Chng

    Abstract: Domain adaptation or transfer learning using pre-trained language models such as BERT has proven to be an effective approach for many natural language processing tasks. In this work, we propose to formulate word sense disambiguation as a relevance ranking task, and fine-tune BERT on sequence-pair ranking task to select the most probable sense definition given a context sentence and a list of candi…

    Submitted 1 October, 2020; v1 submitted 24 September, 2020; originally announced September 2020.

    Comments: Accepted to appear in Findings of EMNLP 2020

  30. arXiv:2006.07094  [pdf, other]

    eess.AS

    Monolingual Data Selection Analysis for English-Mandarin Hybrid Code-switching Speech Recognition

    Authors: Haobo Zhang, Haihua Xu, Van Tung Pham, Hao Huang, Eng Siong Chng

    Abstract: In this paper, we conduct data selection analysis in building an English-Mandarin code-switching (CS) speech recognition (CSSR) system, which is aimed for a real CSSR contest in China. The overall training sets have three subsets, i.e., a code-switching data set, an English (LibriSpeech) and a Mandarin data set respectively. The code-switching data are Mandarin dominated. First of all, it is found…

    Submitted 13 September, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

    Comments: 5 pages, conference, Accepted by Interspeech2020

  31. arXiv:2006.06986  [pdf, other]

    cs.CV

    Quantum Robust Fitting

    Authors: Tat-Jun Chin, David Suter, Shin-Fang Chng, James Quach

    Abstract: Many computer vision applications need to recover structure from imperfect measurements of the real world. The task is often solved by robustly fitting a geometric model onto noisy and outlier-contaminated data. However, recent theoretical analyses indicate that many commonly used formulations of robust fitting in computer vision are not amenable to tractable solution and approximation. In this pa…

    Submitted 9 October, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

    Comments: Appears in: Asian Conference on Computer Vision 2020 (ACCV 2020)

  32. arXiv:2005.10407  [pdf, other]

    eess.AS cs.LG cs.SD

    Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

    Authors: Zhiping Zeng, Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Eng Siong Chng, Chongjia Ni, Bin Ma

    Abstract: In this work, we study leveraging extra text data to improve low-resource end-to-end ASR under cross-lingual transfer learning setting. To this end, we extend our prior work [1], and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data due to the L…

    Submitted 28 May, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

  33. arXiv:2005.08742  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Approaches to Improving Recognition of Underrepresented Named Entities in Hybrid ASR Systems

    Authors: Tingzhi Mao, Yerbolat Khassanov, Van Tung Pham, Haihua Xu, Hao Huang, Eng Siong Chng

    Abstract: In this paper, we present a series of complementary approaches to improve the recognition of underrepresented named entities (NE) in hybrid ASR systems without compromising overall word error rate performance. The underrepresented words correspond to rare or out-of-vocabulary (OOV) words in the training data, and thereby can't be modeled reliably. We begin with graphemic lexicon which allows to dr…

    Submitted 18 May, 2020; originally announced May 2020.

  34. arXiv:2005.04686  [pdf, other]

    eess.AS cs.SD

    SpEx+: A Complete Time Domain Speaker Extraction Network

    Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

    Abstract: Speaker extraction aims to extract the target speech signal from a multi-talker environment given a target speaker's reference speech. We recently proposed a time-domain solution, SpEx, that avoids the phase estimation in frequency-domain approaches. Unfortunately, SpEx is not fully a time-domain solution since it performs time-domain speech encoding for speaker extraction, while taking frequency-…

    Submitted 17 August, 2020; v1 submitted 10 May, 2020; originally announced May 2020.

    Comments: accepted in INTERSPEECH 2020

  35. Time-domain speaker extraction network

    Authors: Chenglin Xu, Wei Rao, Eng Siong Chng, Haizhou Li

    Abstract: Speaker extraction is to extract a target speaker's voice from multi-talker speech. It simulates humans' cocktail party effect or the selective listening ability. The prior work mostly performs speaker extraction in frequency domain, then reconstructs the signal with some phase approximation. The inaccuracy of phase estimation is inherent to the frequency domain processing, that affects the qualit…

    Submitted 29 April, 2020; originally announced April 2020.

    Comments: Published in ASRU 2019. arXiv admin note: text overlap with arXiv:2004.08326

  36. arXiv:2004.08326  [pdf, other]

    eess.AS cs.CL cs.SD

    SpEx: Multi-Scale Time Domain Speaker Extraction Network

    Authors: Chenglin Xu, Wei Rao, Eng Siong Chng, Haizhou Li

    Abstract: Speaker extraction aims to mimic humans' selective auditory attention by extracting a target speaker's voice from a multi-talker environment. It is common to perform the extraction in frequency-domain, and reconstruct the time-domain signal from the extracted magnitude and estimated phase spectra. However, such an approach is adversely affected by the inherent difficulty of phase estimation. Inspi…

    Submitted 17 April, 2020; originally announced April 2020.

    Comments: ACCEPTED in IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP)

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020

  37. arXiv:1912.00863  [pdf, other]

    cs.CL eess.AS

    Independent language modeling architecture for end-to-end ASR

    Authors: Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, Haizhou Li

    Abstract: The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network (subnet), which incorporates the role of the language model (LM), is conditioned on the encoder output. This means that the acoustic encoder and the language mo…

    Submitted 25 November, 2019; originally announced December 2019.

  38. arXiv:1904.07386  [pdf, other]

    eess.AS cs.CL cs.SD

    I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences

    Authors: Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Hitoshi Yamamoto, Koji Okabe, Ville Vestman, Jing Huang, Guohong Ding, Hanwu Sun, Anthony Larcher, Rohan Kumar Das, Haizhou Li, Mickael Rouvier, Pierre-Michel Bousquet, Wei Rao, Qing Wang, Chunlei Zhang, Fahimeh Bahmaninezhad, Hector Delgado, Jose Patino, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Koichi Shinoda, et al. (21 additional authors not shown)

    Abstract: The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of I4U consortium into NIST SRE series of evaluation. The primary objective of the current paper is to summarize the res…

    Submitted 15 April, 2019; originally announced April 2019.

    Comments: 5 pages

  39. Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data

    Authors: Yerbolat Khassanov, Haihua Xu, Van Tung Pham, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma

    Abstract: The lack of code-switch training data is one of the major concerns in the development of end-to-end code-switching automatic speech recognition (ASR) models. In this work, we propose a method to train an improved end-to-end code-switching ASR using only monolingual data. Our method encourages the distributions of output token embeddings of monolingual languages to be similar, and hence, promotes t…

    Submitted 31 July, 2019; v1 submitted 7 April, 2019; originally announced April 2019.

    Comments: 5 pages, 3 figures, accepted to INTERSPEECH 2019

  40. Enriching Rare Word Representations in Neural Language Models by Embedding Matrix Augmentation

    Authors: Yerbolat Khassanov, Zhiping Zeng, Van Tung Pham, Haihua Xu, Eng Siong Chng

    Abstract: The neural language models (NLM) achieve strong generalization capability by learning the dense representation of words and using them to estimate probability distribution function. However, learning the representation of rare words is a challenging problem causing the NLM to produce unreliable probability estimates. To address this problem, we propose a method to enrich representations of rare wo…

    Submitted 31 July, 2019; v1 submitted 7 April, 2019; originally announced April 2019.

    Comments: 5 pages, 2 figures, accepted to INTERSPEECH 2019

  41. arXiv:1903.09952  [pdf, other]

    eess.AS cs.CL cs.SD

    Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss

    Authors: Chenglin Xu, Wei Rao, Eng Siong Chng, Haizhou Li

    Abstract: The SpeakerBeam-FE (SBF) method is proposed for speaker extraction. It attempts to overcome the problem of unknown number of speakers in an audio recording during source separation. The mask approximation loss of SBF is sub-optimal, which doesn't calculate direct signal reconstruction error and consider the speech context. To address these problems, this paper proposes a magnitude and temporal spe…

    Submitted 24 March, 2019; originally announced March 2019.

    Comments: Accepted in ICASSP 2019

  42. arXiv:1902.03705  [pdf, other]

    eess.AS cs.SD

    A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data

    Authors: Xiaohai Tian, Eng Siong Chng, Haizhou Li

    Abstract: In a typical voice conversion system, vocoder is commonly used for speech-to-features analysis and features-to-speech synthesis. However, vocoder can be a source of speech quality degradation. This paper presents a vocoder-free voice conversion approach using WaveNet for non-parallel training data. Instead of dealing with the intermediate features, the proposed approach utilizes the WaveNet to map…

    Submitted 17 September, 2019; v1 submitted 10 February, 2019; originally announced February 2019.

    Comments: 5 pages, 4 figures, This paper is submitted to INTERSPEECH 2019

  43. arXiv:1902.02546  [pdf, other]

    eess.AS cs.SD

    Target Speaker Extraction for Overlapped Multi-Talker Speaker Verification

    Authors: Wei Rao, Chenglin Xu, Eng Siong Chng, Haizhou Li

    Abstract: The performance of speaker verification degrades significantly when the test speech is corrupted by interference speakers. Speaker diarization does well to separate speakers if the speakers are temporally overlapped. However, if multi-talkers speak at the same time, we need the technique to separate the speech in the spectral domain. This paper proposes an overlapped multi-talker speaker verificat…

    Submitted 7 February, 2019; originally announced February 2019.

    Comments: 5 pages, 3 figures. This paper is submitted to Interspeech 2019

  44. arXiv:1811.00241  [pdf, other]

    cs.CL

    On the End-to-End Solution to Mandarin-English Code-switching Speech Recognition

    Authors: Zhiping Zeng, Yerbolat Khassanov, Van Tung Pham, Haihua Xu, Eng Siong Chng, Haizhou Li

    Abstract: Code-switching (CS) refers to a linguistic phenomenon where a speaker uses different languages in an utterance or between alternating utterances. In this work, we study end-to-end (E2E) approaches to the Mandarin-English code-switching speech recognition (CSSR) task. We first examine the effectiveness of using data augmentation and byte-pair encoding (BPE) subword units. More importantly, we propo…

    Submitted 11 July, 2019; v1 submitted 1 November, 2018; originally announced November 2018.

    Comments: Accepted for Interspeech 2019

  45. Unsupervised and Efficient Vocabulary Expansion for Recurrent Neural Network Language Models in ASR

    Authors: Yerbolat Khassanov, Eng Siong Chng

    Abstract: In automatic speech recognition (ASR) systems, recurrent neural network language models (RNNLM) are used to rescore a word lattice or N-best hypotheses list. Due to the expensive training, the RNNLM's vocabulary set accommodates only small shortlist of most frequent words. This leads to suboptimal performance if an input speech contains many out-of-shortlist (OOS) words. An effective solution is t…

    Submitted 27 June, 2018; originally announced June 2018.

    Comments: 5 pages, 1 figure, accepted at INTERSPEECH 2018

  46. arXiv:1806.06200  [pdf, other]

    cs.CL

    Study of Semi-supervised Approaches to Improving English-Mandarin Code-Switching Speech Recognition

    Authors: Pengcheng Guo, Haihua Xu, Lei Xie, Eng Siong Chng

    Abstract: In this paper, we present our overall efforts to improve the performance of a code-switching speech recognition system using semi-supervised training methods from lexicon learning to acoustic modeling, on the South East Asian Mandarin-English (SEAME) data. We first investigate semi-supervised lexicon learning approach to adapt the canonical lexicon, which is meant to alleviate the heavily accented…

    Submitted 16 June, 2018; originally announced June 2018.

    Comments: 5 pages, 3 figures, INTERSPEECH 2018

  47. arXiv:1602.02950  [pdf, other]

    cs.LG cs.SD

    Spoofing detection under noisy conditions: a preliminary investigation and an initial database

    Authors: Xiaohai Tian, Zhizheng Wu, Xiong Xiao, Eng Siong Chng, Haizhou Li

    Abstract: Spoofing detection for automatic speaker verification (ASV), which is to discriminate between live speech and attacks, has received increasing attentions recently. However, all the previous studies have been done on the clean data without significant additive noise. To simulate the real-life scenarios, we perform a preliminary investigation of spoofing detection under additive noisy conditions, an…

    Submitted 9 February, 2016; originally announced February 2016.

    Comments: Submitted to Odyssey: The Speaker and Language Recognition Workshop 2016

  48. High quality voice conversion using prosodic and high-resolution spectral features

    Authors: Hy Quy Nguyen, Siu Wa Lee, Xiaohai Tian, Minghui Dong, Eng Siong Chng

    Abstract: Voice conversion methods have advanced rapidly over the last decade. Studies have shown that speaker characteristics are captured by spectral feature as well as various prosodic features. Most existing conversion methods focus on the spectral feature as it directly represents the timbre characteristics, while some conversion methods have focused only on the prosodic feature represented by the fund…

    Submitted 6 December, 2015; originally announced December 2015.