Showing 1–50 of 57 results for author: Woodland, P

  1. arXiv:2407.02007  [pdf, other]

    eess.AS

    SOT Triggered Neural Clustering for Speaker Attributed ASR

    Authors: Xianrui Zheng, Guangzhi Sun, Chao Zhang, Philip C. Woodland

    Abstract: This paper introduces a novel approach to speaker-attributed ASR transcription using a neural clustering method. With a parallel processing mechanism, diarisation and ASR can be applied simultaneously, helping to prevent the accumulation of errors from one sub-system to the next in a cascaded system. This is achieved by the use of ASR, trained using a serialised output training method, together wi…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: To appear in Interspeech 2024

  2. arXiv:2406.06420  [pdf, other]

    cs.LG

    An Improved Empirical Fisher Approximation for Natural Gradient Descent

    Authors: Xiaodong Wu, Wenyi Yu, Chao Zhang, Philip Woodland

    Abstract: Approximate Natural Gradient Descent (NGD) methods are an important family of optimisers for deep learning models, which use approximate Fisher information matrices to pre-condition gradients during training. The empirical Fisher (EF) method approximates the Fisher information matrix empirically by reusing the per-sample gradients collected during back-propagation. Despite its ease of implementati…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 33 pages, 11 figures, 7 tables

  3. arXiv:2406.04541  [pdf, other]

    cs.CL eess.AS

    Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation

    Authors: Keqi Deng, Philip C. Woodland

    Abstract: While the neural transducer is popular for online speech recognition, simultaneous speech translation (SST) requires both streaming and re-ordering capabilities. This paper presents the LS-Transducer-SST, a label-synchronous neural transducer for SST, which naturally possesses these two properties. The LS-Transducer-SST dynamically decides when to emit translation tokens based on an Auto-regressiv…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted by ACL 2024 Main Conference

  4. arXiv:2406.00522  [pdf, other]

    eess.AS cs.SD

    Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning

    Authors: Keqi Deng, Guangzhi Sun, Philip C. Woodland

    Abstract: Wav2Prompt is proposed which allows straightforward integration between spoken input and a text-based large language model (LLM). Wav2Prompt uses a simple training process with only the same data used to train an automatic speech recognition (ASR) model. After training, Wav2Prompt learns continuous representations from speech and uses them as LLM prompts. To avoid task over-fitting issues found in…

    Submitted 1 June, 2024; originally announced June 2024.

  5. arXiv:2405.20064  [pdf, other]

    eess.AS cs.SD

    1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem

    Authors: Mingjie Chen, Hezhao Zhang, Yuanchao Li, Jiachen Luo, Wen Wu, Ziyang Ma, Peter Bell, Catherine Lai, Joshua Reiss, Lin Wang, Philip C. Woodland, Xie Chen, Huy Phan, Thomas Hain

    Abstract: Speech emotion recognition is a challenging classification task with natural emotional speech, especially when the distribution of emotion types is imbalanced in the training and test data. In this case, it is more difficult for a model to learn to separate minority classes, resulting in those sometimes being ignored or frequently misclassified. Previous work has utilised class weighted loss for t…

    Submitted 30 May, 2024; originally announced May 2024.

  6. arXiv:2405.13684  [pdf, other]

    cs.CL

    CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models

    Authors: Guangzhi Sun, Potsawee Manakul, Adian Liusie, Kunat Pipatanakul, Chao Zhang, Phil Woodland, Mark Gales

    Abstract: Multimodal foundation models are prone to hallucination, generating outputs that either contradict the input or are not grounded by factual information. Given the diversity in architectures, training data and instruction tuning techniques, there can be large variations in systems' susceptibility to hallucinations. To assess system hallucination robustness, hallucination ranking approaches have bee…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 21 pages. Preprint

  7. arXiv:2402.12862  [pdf, other]

    cs.CL

    Handling Ambiguity in Emotion: From Out-of-Domain Detection to Distribution Estimation

    Authors: Wen Wu, Bo Li, Chao Zhang, Chung-Cheng Chiu, Qiujia Li, Junwen Bai, Tara N. Sainath, Philip C. Woodland

    Abstract: The subjective perception of emotion leads to inconsistent labels from human annotators. Typically, utterances lacking majority-agreed labels are excluded when training an emotion classifier, which causes problems when encountering ambiguous emotional expressions during testing. This paper investigates three methods to handle ambiguous emotion. First, we show that incorporating utterances without m…

    Submitted 20 February, 2024; originally announced February 2024.

  8. Parameter Efficient Finetuning for Speech Emotion Recognition and Domain Adaptation

    Authors: Nineli Lashkarashvili, Wen Wu, Guangzhi Sun, Philip C. Woodland

    Abstract: Foundation models have shown superior performance for speech emotion recognition (SER). However, given the limited data in emotion corpora, finetuning all parameters of large pre-trained models for SER can be both resource-intensive and susceptible to overfitting. This paper investigates parameter-efficient finetuning (PEFT) for SER. Various PEFT adaptors are systematically studied for both classi…

    Submitted 18 February, 2024; originally announced February 2024.

    Journal ref: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 10986-10990

  9. arXiv:2312.09100  [pdf, other]

    eess.AS cs.SD

    FastInject: Injecting Unpaired Text Data into CTC-based ASR training

    Authors: Keqi Deng, Philip C. Woodland

    Abstract: Recently, connectionist temporal classification (CTC)-based end-to-end (E2E) automatic speech recognition (ASR) models have achieved impressive results, especially with the development of self-supervised learning. However, E2E ASR models trained on paired speech-text data often suffer from domain shifts from training to testing. To alleviate this issue, this paper proposes a flat-start joint train…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP2024

  10. arXiv:2311.11353  [pdf, other]

    eess.AS

    Label-Synchronous Neural Transducer for Adaptable Online E2E Speech Recognition

    Authors: Keqi Deng, Philip C. Woodland

    Abstract: Although end-to-end (E2E) automatic speech recognition (ASR) has shown state-of-the-art recognition accuracy, it tends to be implicitly biased towards the training data distribution which can degrade generalisation. This paper proposes a label-synchronous neural transducer (LS-Transducer), which provides a natural approach to domain adaptation based on text-only data. The LS-Transducer extracts a…

    Submitted 19 November, 2023; originally announced November 2023.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  11. arXiv:2311.07418  [pdf, other]

    cs.CL cs.SD eess.AS

    Speech-based Slot Filling using Large Language Models

    Authors: Guangzhi Sun, Shutong Feng, Dongcheng Jiang, Chao Zhang, Milica Gašić, Philip C. Woodland

    Abstract: Recently, advancements in large language models (LLMs) have shown an unprecedented ability across various language tasks. This paper investigates the potential application of LLMs to slot filling with noisy ASR transcriptions, via both in-context learning and task-specific fine-tuning. Dedicated prompt designs and fine-tuning approaches are proposed to improve the robustness of LLMs for slot filli…

    Submitted 13 November, 2023; originally announced November 2023.

  12. arXiv:2310.04791  [pdf, other]

    eess.AS cs.LG cs.SD

    Conditional Diffusion Model for Target Speaker Extraction

    Authors: Theodor Nguyen, Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C Woodland

    Abstract: We propose DiffSpEx, a generative target speaker extraction method based on score-based generative modelling through stochastic differential equations. DiffSpEx deploys a continuous-time stochastic diffusion process in the complex short-time Fourier transform domain, starting from the target speaker source and converging to a Gaussian distribution centred on the mixture of sources. For the reverse…

    Submitted 7 October, 2023; originally announced October 2023.

    Comments: 5 pages, 4 figures, submitted to ICASSP 2024

  13. arXiv:2310.00486  [pdf, other]

    cs.CL cs.HC cs.LG

    It HAS to be Subjective: Human Annotator Simulation via Zero-shot Density Estimation

    Authors: Wen Wu, Wenlin Chen, Chao Zhang, Philip C. Woodland

    Abstract: Human annotator simulation (HAS) serves as a cost-effective substitute for human evaluation such as data annotation and system assessment. Human perception and behaviour during human evaluation exhibit inherent variability due to diverse cognitive processes and subjective interpretations, which should be taken into account in modelling to better mimic the way people perceive and interact with the…

    Submitted 30 September, 2023; originally announced October 2023.

    Comments: Code available at: https://github.com/W-Wu/HAS_CNF

  14. arXiv:2308.13345  [pdf, other]

    eess.AS cs.CL cs.SD

    Decoupled Structure for Improved Adaptability of End-to-End Models

    Authors: Keqi Deng, Philip C. Woodland

    Abstract: Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications. The E2E ASR model implicitly learns an internal language model (LM) which characterises the training distribution of the source domain, and the E2E trainable n…

    Submitted 25 August, 2023; originally announced August 2023.

  15. Integrating Emotion Recognition with Speech Recognition and Speaker Diarisation for Conversations

    Authors: Wen Wu, Chao Zhang, Philip C. Woodland

    Abstract: Although automatic emotion recognition (AER) has recently drawn significant research interest, most current AER studies use manually segmented utterances, which are usually unavailable for dialogue systems. This paper proposes integrating AER with automatic speech recognition (ASR) and speaker diarisation (SD) in a jointly-trained system. Distinct output layers are built for four sub-tasks includi…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Interspeech 2023

  16. arXiv:2307.03088  [pdf, other]

    eess.AS

    Label-Synchronous Neural Transducer for End-to-End ASR

    Authors: Keqi Deng, Philip C. Woodland

    Abstract: Neural transducers provide a natural way of streaming ASR. However, they augment output sequences with blank tokens which leads to challenges for domain adaptation using text data. This paper proposes a label-synchronous neural transducer (LS-Transducer), which extracts a label-level encoder representation before combining it with the prediction network output. Hence blank tokens are no longer nee…

    Submitted 11 October, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

  17. arXiv:2307.01764  [pdf, other]

    cs.CL

    Knowledge-Aware Audio-Grounded Generative Slot Filling for Limited Annotated Data

    Authors: Guangzhi Sun, Chao Zhang, Ivan Vulić, Paweł Budzianowski, Philip C. Woodland

    Abstract: Manually annotating fine-grained slot-value labels for task-oriented dialogue (ToD) systems is an expensive and time-consuming endeavour. This motivates research into slot-filling methods that operate with limited amounts of labelled data. Moreover, the majority of current work on ToD is based solely on text as the input modality, neglecting the additional challenges of imperfect automatic speech…

    Submitted 4 July, 2023; originally announced July 2023.

    Comments: to submit to CS&L

  18. Estimating the Uncertainty in Emotion Attributes using Deep Evidential Regression

    Authors: Wen Wu, Chao Zhang, Philip C. Woodland

    Abstract: In automatic emotion recognition (AER), labels assigned by different human annotators to the same utterance are often inconsistent due to the inherent complexity of emotion and the subjectivity of perception. Though deterministic labels generated by averaging or voting are often used as the ground truth, this ignores the intrinsic uncertainty revealed by the inconsistent labels. This paper proposes…

    Submitted 11 June, 2023; originally announced June 2023.

    Comments: Accepted by ACL 2023

    Journal ref: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2023

  19. arXiv:2306.01942  [pdf, other]

    cs.CL cs.SD eess.AS

    Can Contextual Biasing Remain Effective with Whisper and GPT-2?

    Authors: Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C. Woodland

    Abstract: End-to-end automatic speech recognition (ASR) and large language models, such as Whisper and GPT-2, have recently been scaled to use vast amounts of training data. Despite the large amount of training data, infrequent content words that occur in a particular task may still exhibit poor ASR performance, with contextual biasing a possible remedy. This paper investigates the effectiveness of neural c…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: To appear in Interspeech 2023

  20. arXiv:2305.18824  [pdf, other]

    cs.CL cs.SD eess.AS

    Graph Neural Networks for Contextual ASR with the Tree-Constrained Pointer Generator

    Authors: Guangzhi Sun, Chao Zhang, Phil Woodland

    Abstract: The incorporation of biasing words obtained through contextual knowledge is of paramount importance in automatic speech recognition (ASR) applications. This paper proposes an innovative method for achieving end-to-end contextual ASR using graph neural network (GNN) encodings based on the tree-constrained pointer generator method. GNN node encodings facilitate lookahead for future word pieces in th…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  21. Self-supervised representations in speech-based depression detection

    Authors: Wen Wu, Chao Zhang, Philip C. Woodland

    Abstract: This paper proposes handling training data sparsity in speech-based automatic depression detection (SDD) using foundation models pre-trained with self-supervised learning (SSL). An analysis of SSL representations derived from different layers of pre-trained foundation models is first presented for SDD, which provides insight into suitable indicators for depression detection. Knowledge transfer is the…

    Submitted 6 July, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

  22. arXiv:2304.00871  [pdf, other]

    eess.AS

    Self-Supervised Learning-Based Source Separation for Meeting Data

    Authors: Yuang Li, Xianrui Zheng, Philip C. Woodland

    Abstract: Source separation can improve automatic speech recognition (ASR) under multi-party meeting scenarios by extracting single-speaker signals from overlapped speech. Despite the success of self-supervised learning models in single-channel source separation, most studies have focused on simulated setups. In this paper, seven SSL models were compared on both simulated and real-world corpora. Then, we pr…

    Submitted 3 April, 2023; originally announced April 2023.

    Comments: To appear in Proc. ICASSP2023

  23. arXiv:2303.10917  [pdf, other]

    eess.AS cs.SD

    Knowledge Distillation from Multiple Foundation Models for End-to-End Speech Recognition

    Authors: Xiaoyu Yang, Qiujia Li, Chao Zhang, Philip C. Woodland

    Abstract: Although large foundation models pre-trained by self-supervised learning have achieved state-of-the-art performance in many tasks including automatic speech recognition (ASR), knowledge distillation (KD) is often required in practice to transfer the knowledge learned by large teacher models into much smaller student models with affordable computation and memory costs. This paper proposes a novel t…

    Submitted 20 March, 2023; originally announced March 2023.

  24. arXiv:2302.08579  [pdf, other]

    eess.AS cs.SD

    Adaptable End-to-End ASR Models using Replaceable Internal LMs and Residual Softmax

    Authors: Keqi Deng, Philip C. Woodland

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) implicitly learns the token sequence distribution of paired audio-transcript training data. However, it still suffers from domain shifts from training to testing, and domain adaptation is still challenging. To alleviate this problem, this paper designs a replaceable internal language model (RILM) method, which makes it feasible to directly replac…

    Submitted 14 March, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

    Comments: Accepted by ICASSP2023

  25. Distribution-based Emotion Recognition in Conversation

    Authors: Wen Wu, Chao Zhang, Philip C. Woodland

    Abstract: Automatic emotion recognition in conversation (ERC) is crucial for emotion-aware conversational artificial intelligence. This paper proposes a distribution-based framework that formulates ERC as a sequence-to-sequence problem for emotion distribution estimation. The inherent ambiguity of emotions and the subjectivity of human perception lead to disagreements in emotion labels, which is handled nat…

    Submitted 9 November, 2022; originally announced November 2022.

    Comments: To appear in SLT 2022

    Journal ref: 2022 IEEE Spoken Language Technology Workshop (SLT)

  26. arXiv:2211.02536  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    Biased Self-supervised learning for ASR

    Authors: Florian L. Kreyssig, Yangyang Shi, Jinxi Guo, Leda Sari, Abdelrahman Mohamed, Philip C. Woodland

    Abstract: Self-supervised learning via masked prediction pre-training (MPPT) has shown impressive performance on a range of speech-processing tasks. This paper proposes a method to bias self-supervised learning towards a specific task. The core idea is to slightly finetune the model that is used to obtain the target sequence. This leads to better performance and a substantial increase in training speed. Fur…

    Submitted 4 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  27. arXiv:2210.16554  [pdf, other]

    cs.CL cs.SD eess.AS

    End-to-end Spoken Language Understanding with Tree-constrained Pointer Generator

    Authors: Guangzhi Sun, Chao Zhang, Philip C. Woodland

    Abstract: End-to-end spoken language understanding (SLU) suffers from the long-tail word problem. This paper exploits contextual biasing, a technique to improve the speech recognition of rare words, in end-to-end SLU systems. Specifically, a tree-constrained pointer generator (TCPGen), a powerful and efficient biasing model component, is studied, which leverages a slot shortlist with corresponding entities…

    Submitted 14 March, 2023; v1 submitted 29 October, 2022; originally announced October 2022.

    Comments: 5 pages, to appear in ICASSP 2023

  28. arXiv:2210.13576  [pdf, ps, other]

    cs.SD eess.AS

    Spectral Clustering-aware Learning of Embeddings for Speaker Diarisation

    Authors: Evonne P. C. Lee, Guangzhi Sun, Chao Zhang, Philip C. Woodland

    Abstract: In speaker diarisation, speaker embedding extraction models often suffer from the mismatch between their training loss functions and the speaker clustering method. In this paper, we propose the method of spectral clustering-aware learning of embeddings (SCALE) to address the mismatch. Specifically, besides an angular prototypical (AP) loss, SCALE uses a novel affinity matrix loss which directly m…

    Submitted 14 March, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

    Comments: To appear in ICASSP 2023, 5 pages

  29. arXiv:2207.03852  [pdf, other]

    eess.AS cs.SD

    Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription

    Authors: Xianrui Zheng, Chao Zhang, Philip C. Woodland

    Abstract: Self-supervised-learning-based pre-trained models for speech data, such as Wav2Vec 2.0 (W2V2), have become the backbone of many speech tasks. In this paper, to achieve speaker diarisation and speech recognition using a single model, a tandem multitask training (TMT) method is proposed to fine-tune W2V2. For speaker diarisation, the tasks of voice activity detection (VAD) and speaker classification…

    Submitted 8 July, 2022; originally announced July 2022.

    Comments: To appear in Interspeech 2022

  30. arXiv:2207.00857  [pdf, other]

    cs.SD cs.CL eess.AS

    Tree-constrained Pointer Generator with Graph Neural Network Encodings for Contextual Speech Recognition

    Authors: Guangzhi Sun, Chao Zhang, Philip C. Woodland

    Abstract: Incorporating biasing words obtained as contextual knowledge is critical for many automatic speech recognition (ASR) applications. This paper proposes the use of graph neural network (GNN) encodings in a tree-constrained pointer generator (TCPGen) component for end-to-end contextual ASR. By encoding the biasing words in the prefix-tree with a tree-based GNN, lookahead for future wordpieces in end-…

    Submitted 2 July, 2022; originally announced July 2022.

    Comments: To appear in Interspeech 2022. arXiv admin note: text overlap with arXiv:2205.09058

  31. arXiv:2205.09058  [pdf, other]

    cs.CL cs.SD eess.AS

    Minimising Biasing Word Errors for Contextual ASR with the Tree-Constrained Pointer Generator

    Authors: Guangzhi Sun, Chao Zhang, Philip C Woodland

    Abstract: Contextual knowledge is essential for reducing speech recognition errors on high-valued long-tail words. This paper proposes a novel tree-constrained pointer generator (TCPGen) component that enables end-to-end ASR models to bias towards a list of long-tail words obtained using external contextual information. With only a small overhead in memory use and computation cost, TCPGen can structure thou…

    Submitted 23 May, 2022; v1 submitted 18 May, 2022; originally announced May 2022.

    Comments: This work has been submitted to the IEEE Transactions on Audio, Speech, and Language Processing for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  32. Estimating the Uncertainty in Emotion Class Labels with Utterance-Specific Dirichlet Priors

    Authors: Wen Wu, Chao Zhang, Xixin Wu, Philip C. Woodland

    Abstract: Emotion recognition is a key attribute for artificial intelligence systems that need to naturally interact with humans. However, the task definition is still an open problem due to the inherent ambiguity of emotions. In this paper, a novel Bayesian training loss based on per-utterance Dirichlet prior distributions is proposed for verbal emotion recognition, which models the uncertainty in one-hot…

    Submitted 17 November, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

    Journal ref: IEEE Transactions on Affective Computing ( Volume: 14, Issue: 4, 01 Oct.-Dec. 2023)

  33. arXiv:2110.03334  [pdf, ps, other]

    eess.AS

    Knowledge Distillation for Neural Transducers from Large Self-Supervised Pre-trained Models

    Authors: Xiaoyu Yang, Qiujia Li, Philip C. Woodland

    Abstract: Self-supervised pre-training is an effective approach to leveraging a large amount of unlabelled data to reduce word error rates (WERs) of automatic speech recognition (ASR) systems. Since it is impractical to use large pre-trained models for many real-world ASR applications, it is desirable to have a much smaller model while retaining the performance of the pre-trained model. In this paper, we pr…

    Submitted 2 March, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: Accepted as a conference paper at ICASSP 2022

  34. arXiv:2110.03327  [pdf, other]

    eess.AS cs.LG

    Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition

    Authors: Qiujia Li, Yu Zhang, David Qiu, Yanzhang He, Liangliang Cao, Philip C. Woodland

    Abstract: As end-to-end automatic speech recognition (ASR) models reach promising performance, various downstream tasks rely on good confidence estimators for these systems. Recent research has shown that model-based confidence estimators have a significant advantage over using the output softmax probabilities. If the input data to the speech recogniser is from mismatched acoustic and linguistic conditions,…

    Submitted 2 March, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: Accepted as a conference paper at ICASSP 2022

  35. arXiv:2109.00627  [pdf, other]

    cs.CL cs.SD

    Tree-constrained Pointer Generator for End-to-end Contextual Speech Recognition

    Authors: Guangzhi Sun, Chao Zhang, Philip C. Woodland

    Abstract: Contextual knowledge is important for real-world automatic speech recognition (ASR) applications. In this paper, a novel tree-constrained pointer generator (TCPGen) component is proposed that incorporates such knowledge as a list of biasing words into both attention-based encoder-decoder and transducer end-to-end ASR models in a neural-symbolic way. TCPGen structures the biasing words into an effi…

    Submitted 17 September, 2021; v1 submitted 1 September, 2021; originally announced September 2021.

    Comments: To appear in ASRU 2021

  36. arXiv:2108.07789  [pdf, other]

    cs.CL cs.SD eess.AS

    Adapting GPT, GPT-2 and BERT Language Models for Speech Recognition

    Authors: Xianrui Zheng, Chao Zhang, Philip C. Woodland

    Abstract: Language models (LMs) pre-trained on massive amounts of text, in particular bidirectional encoder representations from Transformers (BERT), generative pre-training (GPT), and GPT-2, have become a key technology for many natural language processing tasks. In this paper, we present results using fine-tuned GPT, GPT-2, and their combination for automatic speech recognition (ASR). Unlike unidirectiona…

    Submitted 1 October, 2021; v1 submitted 29 July, 2021; originally announced August 2021.

    Comments: To appear in ASRU 2021

  37. arXiv:2107.00764  [pdf, other]

    eess.AS

    Combining Frame-Synchronous and Label-Synchronous Systems for Speech Recognition

    Authors: Qiujia Li, Chao Zhang, Philip C. Woodland

    Abstract: Commonly used automatic speech recognition (ASR) systems can be classified into frame-synchronous and label-synchronous categories, based on whether the speech is decoded on a per-frame or per-label basis. Frame-synchronous systems, such as traditional hidden Markov model systems, can easily incorporate existing knowledge and can support streaming ASR applications. Label-synchronous systems, based…

    Submitted 1 July, 2021; originally announced July 2021.

    Comments: Submitted to IEEE/ACM Transactions on Audio Speech and Language Processing

  38. arXiv:2103.14152  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Residual Energy-Based Models for End-to-End Speech Recognition

    Authors: Qiujia Li, Yu Zhang, Bo Li, Liangliang Cao, Philip C. Woodland

    Abstract: End-to-end models with auto-regressive decoders have shown impressive results for automatic speech recognition (ASR). These models formulate the sequence-level probability as a product of the conditional probabilities of all individual tokens given their histories. However, the performance of locally normalised models can be sub-optimal because of factors such as exposure bias. Consequently, the m…

    Submitted 23 June, 2021; v1 submitted 25 March, 2021; originally announced March 2021.

    Comments: To appear in Proc. Interspeech 2021

  39. arXiv:2103.07554  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    A Distributed Optimisation Framework Combining Natural Gradient with Hessian-Free for Discriminative Sequence Training

    Authors: Adnan Haider, Chao Zhang, Florian L. Kreyssig, Philip C. Woodland

    Abstract: This paper presents a novel natural gradient and Hessian-free (NGHF) optimisation framework for neural network training that can operate efficiently in a distributed manner. It relies on the linear conjugate gradient (CG) algorithm to combine the natural gradient (NG) method with local curvature information from Hessian-free (HF) or other second-order methods. A solution to a numerical issue in CG…

    Submitted 12 March, 2021; originally announced March 2021.

  40. arXiv:2102.06474  [pdf, other]

    cs.CL cs.AI

    Transformer Language Models with LSTM-based Cross-utterance Information Representation

    Authors: G. Sun, C. Zhang, P. C. Woodland

    Abstract: The effective incorporation of cross-utterance information has the potential to improve language models (LMs) for automatic speech recognition (ASR). To extract more powerful and robust cross-utterance representations for the Transformer LM (TLM), this paper proposes the R-TLM which uses hidden states in a long short-term memory (LSTM) LM. To encode the cross-utterance information, the R-TLM incor…

    Submitted 12 February, 2021; originally announced February 2021.

  41. arXiv:2102.06467  [pdf, other]

    cs.SD cs.LG eess.AS eess.IV

    Content-Aware Speaker Embeddings for Speaker Diarisation

    Authors: G. Sun, D. Liu, C. Zhang, P. C. Woodland

    Abstract: Recent speaker diarisation systems often convert variable length speech segments into fixed-length vector representations for speaker clustering, which are known as speaker embeddings. In this paper, the content-aware speaker embeddings (CASE) approach is proposed, which extends the input of the speaker classifier to include not only acoustic features but also their corresponding speech content, v…

    Submitted 12 February, 2021; originally announced February 2021.

  42. arXiv:2010.14102  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Emotion recognition by fusing time synchronous and time asynchronous representations

    Authors: Wen Wu, Chao Zhang, Philip C. Woodland

    Abstract: In this paper, a novel two-branch neural network model structure is proposed for multimodal emotion recognition, which consists of a time synchronous branch (TSB) and a time asynchronous branch (TAB). To capture correlations between each word and its acoustic realisation, the TSB combines speech and text modalities at each input window frame and then does pooling across time to form a single embed…

    Submitted 22 July, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Journal ref: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6269-6273

  43. Combination of Deep Speaker Embeddings for Diarisation

    Authors: Guangzhi Sun, Chao Zhang, Phil Woodland

    Abstract: Significant progress has recently been made in speaker diarisation after the introduction of d-vectors as speaker embeddings extracted from neural network (NN) speaker classifiers for clustering speech segments. To extract better-performing and more robust speaker embeddings, this paper proposes a c-vector method by combining multiple sets of complementary d-vectors derived from systems with diffe…

    Submitted 7 May, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Manuscript accepted by Neural Networks

  44. arXiv:2010.11428  [pdf, other]

    eess.AS cs.CL cs.LG

    Confidence Estimation for Attention-based Sequence-to-sequence Models for Speech Recognition

    Authors: Qiujia Li, David Qiu, Yu Zhang, Bo Li, Yanzhang He, Philip C. Woodland, Liangliang Cao, Trevor Strohman

    Abstract: For various speech-related tasks, confidence scores from a speech recogniser are a useful measure to assess the quality of transcriptions. In traditional hidden Markov model-based automatic speech recognition (ASR) systems, confidence scores can be reliably obtained from word posteriors in decoding lattices. However, for an ASR system with an auto-regressive decoder, such as an attention-based seq…

    Submitted 23 October, 2020; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Submitted to ICASSP 2021

  45. arXiv:2009.01008  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Cross-Utterance Language Models with Acoustic Error Sampling

    Authors: G. Sun, C. Zhang, P. C. Woodland

    Abstract: The effective exploitation of richer contextual information in language models (LMs) is a long-standing research problem for automatic speech recognition (ASR). A cross-utterance LM (CULM) is proposed in this paper, which augments the input to a standard long short-term memory (LSTM) LM with a context vector derived from past and future utterances using an extraction network. The extraction networ…

    Submitted 19 August, 2020; originally announced September 2020.

    Comments: 5 pages

  46. arXiv:2008.03756  [pdf, ps, other]

    eess.AS cs.SD

    Cosine-Distance Virtual Adversarial Training for Semi-Supervised Speaker-Discriminative Acoustic Embeddings

    Authors: Florian L. Kreyssig, Philip C. Woodland

    Abstract: In this paper, we propose a semi-supervised learning (SSL) technique for training deep neural networks (DNNs) to generate speaker-discriminative acoustic embeddings (speaker embeddings). Obtaining large amounts of speaker recognition training data can be difficult for desired target domains, especially under privacy constraints. The proposed technique reduces requirements for labelled data by lev…

    Submitted 9 August, 2020; originally announced August 2020.

    Comments: Accepted to Interspeech 2020

  47. arXiv:1911.03970  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Improved Large-margin Softmax Loss for Speaker Diarisation

    Authors: Yassir Fathullah, Chao Zhang, Philip C. Woodland

    Abstract: Speaker diarisation systems nowadays use embeddings generated from speech segments in a bottleneck layer, which need to be discriminative for unseen speakers. It is well-known that large-margin training can improve the generalisation ability to unseen data, and its use in such open-set problems has been widespread. Therefore, this paper introduces a general approach to the large-margin softm…

    Submitted 6 July, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

    Comments: ICASSP 2020

    Journal ref: ICASSP 2020, Barcelona, Spain, 2020, pp. 7104-7108

  48. arXiv:1910.09703  [pdf, other]

    eess.AS cs.CL cs.CV cs.LG cs.SD

    Discriminative Neural Clustering for Speaker Diarisation

    Authors: Qiujia Li, Florian L. Kreyssig, Chao Zhang, Philip C. Woodland

    Abstract: In this paper, we propose Discriminative Neural Clustering (DNC) that formulates data clustering with a maximum number of clusters as a supervised sequence-to-sequence learning problem. Compared to traditional unsupervised clustering algorithms, DNC learns clustering patterns from training data without requiring an explicit definition of a similarity measure. An implementation of DNC based on the…

    Submitted 23 November, 2020; v1 submitted 21 October, 2019; originally announced October 2019.

    Comments: Accepted as a conference paper at the 8th IEEE Spoken Language Technology Workshop (SLT 2021)

  49. arXiv:1909.06614  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Integrating Source-channel and Attention-based Sequence-to-sequence Models for Speech Recognition

    Authors: Qiujia Li, Chao Zhang, Philip C. Woodland

    Abstract: This paper proposes a novel automatic speech recognition (ASR) framework called Integrated Source-Channel and Attention (ISCA) that combines the advantages of traditional systems based on the noisy source-channel model (SC) and end-to-end style systems using attention-based sequence-to-sequence models. The traditional SC system framework includes hidden Markov models and connectionist temporal cla…

    Submitted 1 October, 2019; v1 submitted 14 September, 2019; originally announced September 2019.

    Comments: To appear in Proc. ASRU2019, December 14-18, 2019, Sentosa, Singapore

  50. arXiv:1906.11047  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Multi-Span Acoustic Modelling using Raw Waveform Signals

    Authors: Patrick von Platen, Chao Zhang, Philip Woodland

    Abstract: Traditional automatic speech recognition (ASR) systems often use an acoustic model (AM) built on handcrafted acoustic features, such as log Mel-filter bank (FBANK) values. Recent studies found that AMs with convolutional neural networks (CNNs) can directly use the raw waveform signal as input. Given sufficient training data, these AMs can yield a competitive word error rate (WER) to those built on…

    Submitted 3 October, 2019; v1 submitted 21 June, 2019; originally announced June 2019.

    Comments: To appear in INTERSPEECH 2019