Showing 1–45 of 45 results for author: Hain, T

  1. arXiv:2406.09153

    cs.CL cs.SD eess.AS

    LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks

    Authors: Amit Meghanani, Thomas Hain

    Abstract: Self-supervised learning (SSL)-based speech models are extensively used for full-stack speech processing. However, it has been observed that improving SSL-based speech representations using unlabeled speech for content-related tasks is challenging and computationally expensive. Recent attempts have been made to address this issue with cost-effective self-supervised fine-tuning (SSFT) approaches. C…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  2. arXiv:2406.08914

    cs.SD cs.LG eess.AS

    Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition

    Authors: William Ravenscroft, George Close, Stefan Goetze, Thomas Hain, Mohammad Soleymanpour, Anurag Chowdhury, Mark C. Fuhs

    Abstract: One solution to automatic speech recognition (ASR) of overlapping speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domai…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 5 pages, 3 Figures, 3 Tables, Accepted for Interspeech 2024

  3. arXiv:2406.07162

    cs.SD cs.AI cs.CL cs.MM eess.AS

    EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark

    Authors: Ziyang Ma, Mingjie Chen, Hezhao Zhang, Zhisheng Zheng, Wenxi Chen, Xiquan Li, Jiaxin Ye, Xie Chen, Thomas Hain

    Abstract: Speech emotion recognition (SER) is an important part of human-computer interaction, receiving extensive attention from both industry and academia. However, the current research field of SER has long suffered from the following problems: 1) There are few reasonable and universal splits of the datasets, making comparing different models and methods difficult. 2) No commonly used benchmark covers nu…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024. GitHub Repository: https://github.com/emo-box/EmoBox

  4. arXiv:2405.20064

    eess.AS cs.SD

    1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem

    Authors: Mingjie Chen, Hezhao Zhang, Yuanchao Li, Jiachen Luo, Wen Wu, Ziyang Ma, Peter Bell, Catherine Lai, Joshua Reiss, Lin Wang, Philip C. Woodland, Xie Chen, Huy Phan, Thomas Hain

    Abstract: Speech emotion recognition is a challenging classification task with natural emotional speech, especially when the distribution of emotion types is imbalanced in the training and test data. In this case, it is more difficult for a model to learn to separate minority classes, resulting in those sometimes being ignored or frequently misclassified. Previous work has utilised class weighted loss for t…

    Submitted 30 May, 2024; originally announced May 2024.

  5. arXiv:2404.16743

    cs.CL cs.SD eess.AS

    Automatic Speech Recognition System-Independent Word Error Rate Estimation

    Authors: Chanho Park, Mingjie Chen, Thomas Hain

    Abstract: Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems. In many applications, it is of interest to estimate WER given a pair of a speech utterance and a transcript. Previous work on WER estimation focused on building models that are trained with a specific ASR system in mind (referred to as ASR system-dependent). Thes…

    Submitted 26 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Accepted to LREC-COLING 2024 (long)

  6. arXiv:2403.11732

    cs.SD eess.AS

    Hallucination in Perceptual Metric-Driven Speech Enhancement Networks

    Authors: George Close, Thomas Hain, Stefan Goetze

    Abstract: Within the area of speech enhancement, there is an ongoing interest in the creation of neural systems which explicitly aim to improve the perceptual quality of the processed audio. In concert with this is the topic of non-intrusive (i.e. without clean reference) speech quality prediction, for which neural networks are trained to predict human-assigned quality labels directly from distorted audio.…

    Submitted 24 May, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: Accepted for EUSIPCO 2024

  7. arXiv:2403.08738

    cs.CL cs.SD eess.AS

    Improving Acoustic Word Embeddings through Correspondence Training of Self-supervised Speech Representations

    Authors: Amit Meghanani, Thomas Hain

    Abstract: Acoustic word embeddings (AWEs) are vector representations of spoken words. An effective method for obtaining AWEs is the Correspondence Auto-Encoder (CAE). In the past, the CAE method has been associated with traditional MFCC features. Representations obtained from self-supervised learning (SSL)-based speech models such as HuBERT, Wav2vec2, etc., are outperforming MFCC in many downstream tasks. H…

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: Accepted to EACL 2024 Main Conference, Long paper

  8. arXiv:2403.06260

    cs.CL cs.SD eess.AS

    SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations

    Authors: Amit Meghanani, Thomas Hain

    Abstract: There is a growing interest in cost-effective self-supervised fine-tuning (SSFT) of self-supervised learning (SSL)-based speech models to obtain task-specific representations. These task-specific representations are used for robust performance on various downstream tasks by fine-tuning on the labelled data. This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE…

    Submitted 10 March, 2024; originally announced March 2024.

    Comments: Accepted at ICASSP 2024

  9. arXiv:2402.04805

    eess.AS

    Progressive unsupervised domain adaptation for ASR using ensemble models and multi-stage training

    Authors: Rehan Ahmad, Muhammad Umar Farooq, Thomas Hain

    Abstract: In Automatic Speech Recognition (ASR), teacher-student (T/S) training has shown to perform well for domain adaptation with small amount of training data. However, adaption without ground-truth labels is still challenging. A previous study has shown the effectiveness of using ensemble teacher models in T/S training for unsupervised domain adaptation (UDA) but its performance still lags behind compa…

    Submitted 7 February, 2024; originally announced February 2024.

  10. arXiv:2401.13611

    cs.SD cs.AI eess.AS

    Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired Users using Intermediate ASR Features and Human Memory Models

    Authors: Rhiannon Mogridge, George Close, Robert Sutherland, Thomas Hain, Jon Barker, Stefan Goetze, Anton Ragni

    Abstract: Neural networks have been successfully used for non-intrusive speech intelligibility prediction. Recently, the use of feature representations sourced from intermediate layers of pre-trained self-supervised and weakly-supervised models has been found to be particularly useful for this task. This work combines the use of Whisper ASR decoder layer representations as neural network input features with…

    Submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted paper. IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Seoul, Korea, April 2024

  11. arXiv:2312.08979

    cs.SD eess.AS

    Multi-CMGAN+/+: Leveraging Multi-Objective Speech Quality Metric Prediction for Speech Enhancement

    Authors: George Close, William Ravenscroft, Thomas Hain, Stefan Goetze

    Abstract: Neural network based approaches to speech enhancement have shown to be particularly powerful, being able to leverage a data-driven approach to result in a significant performance gain versus other approaches. Such approaches are reliant on artificially created labelled training data such that the neural model can be trained using intrusive loss functions which compare the output of the model with…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: Accepted @ ICASSP 2024

  12. arXiv:2310.18865

    cs.CL cs.SD eess.AS

    MUST: A Multilingual Student-Teacher Learning approach for low-resource speech recognition

    Authors: Muhammad Umar Farooq, Rehan Ahmad, Thomas Hain

    Abstract: Student-teacher learning or knowledge distillation (KD) has been previously used to address data scarcity issue for training of speech recognition (ASR) systems. However, a limitation of KD training is that the student model classes must be a proper or improper subset of the teacher model classes. It prevents distillation from even acoustically similar languages if the character sets are not same.…

    Submitted 28 October, 2023; originally announced October 2023.

    Comments: Accepted for IEEE ASRU 2023

  13. arXiv:2310.08225

    eess.AS cs.CL cs.SD

    Fast Word Error Rate Estimation Using Self-Supervised Representations For Speech And Text

    Authors: Chanho Park, Chengsong Lu, Mingjie Chen, Thomas Hain

    Abstract: The quality of automatic speech recognition (ASR) is typically measured by word error rate (WER). WER estimation is a task aiming to predict the WER of an ASR system, given a speech utterance and a transcription. This task has gained increasing attention while advanced ASR systems are trained on large amounts of data. In this case, WER estimation becomes necessary in many scenarios, for example, s…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: 5 pages

  14. arXiv:2310.06125

    cs.SD cs.AI cs.LG eess.AS

    On Time Domain Conformer Models for Monaural Speech Separation in Noisy Reverberant Acoustic Environments

    Authors: William Ravenscroft, Stefan Goetze, Thomas Hain

    Abstract: Speech separation remains an important topic for multi-speaker technology researchers. Convolution augmented transformers (conformers) have performed well for many speech processing tasks but have been under-researched for speech separation. Most recent state-of-the-art (SOTA) separation models have been time-domain audio separation networks (TasNets). A number of successful models have made use o…

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: Accepted at ASRU Workshop 2023

  15. arXiv:2307.14502

    eess.AS cs.LG cs.SD

    The Effect of Spoken Language on Speech Enhancement using Self-Supervised Speech Representation Loss Functions

    Authors: George Close, Thomas Hain, Stefan Goetze

    Abstract: Recent work in the field of speech enhancement (SE) has involved the use of self-supervised speech representations (SSSRs) as feature transformations in loss functions. However, in prior work, very little attention has been paid to the relationship between the language of the audio used to train the self-supervised representation and that used to train the SE system. Enhancement models trained usi…

    Submitted 20 October, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

    Comments: Accepted at WASPAA 2023

  16. arXiv:2307.13423

    cs.SD cs.LG eess.AS

    Non Intrusive Intelligibility Predictor for Hearing Impaired Individuals using Self Supervised Speech Representations

    Authors: George Close, Thomas Hain, Stefan Goetze

    Abstract: Self-supervised speech representations (SSSRs) have been successfully applied to a number of speech-processing tasks, e.g. as feature extractor for speech quality (SQ) prediction, which is, in turn, relevant for assessment and training speech enhancement systems for users with normal or impaired hearing. However, exact knowledge of why and how quality-related information is encoded well in such re…

    Submitted 7 December, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

    Comments: Accepted @ ASRU 2023 SPARKS workshop

  17. arXiv:2306.17500

    cs.SD cs.AI cs.LG eess.AS

    Empirical Interpretation of the Relationship Between Speech Acoustic Context and Emotion Recognition

    Authors: Anna Ollerenshaw, Md Asif Jalal, Rosanna Milner, Thomas Hain

    Abstract: Speech emotion recognition (SER) is vital for obtaining emotional intelligence and understanding the contextual meaning of speech. Variations of consonant-vowel (CV) phonemic boundaries can enrich acoustic context with linguistic cues, which impacts SER. In practice, speech emotions are treated as single labels over an acoustic segment for a given time duration. However, phone boundaries within sp…

    Submitted 30 June, 2023; originally announced June 2023.

  18. arXiv:2306.08577

    cs.CL cs.SD eess.AS

    Learning Cross-lingual Mappings for Data Augmentation to Improve Low-Resource Speech Recognition

    Authors: Muhammad Umar Farooq, Thomas Hain

    Abstract: Exploiting cross-lingual resources is an effective way to compensate for data scarcity of low resource languages. Recently, a novel multilingual model fusion technique has been proposed where a model is trained to learn cross-lingual acoustic-phonetic similarities as a mapping function. However, handcrafted lexicons have been used to train hybrid DNN-HMM ASR systems. To remove this dependency, we…

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: Accepted for Interspeech 2023

  19. arXiv:2304.07142

    cs.SD cs.AI cs.LG cs.NE eess.AS

    On Data Sampling Strategies for Training Neural Network Speech Separation Models

    Authors: William Ravenscroft, Stefan Goetze, Thomas Hain

    Abstract: Speech separation remains an important area of multi-speaker signal processing. Deep neural network (DNN) models have attained the best performance on many speech separation benchmarks. Some of these models can take significant time to train and have high memory requirements. Previous work has proposed shortening training examples to address these issues but the impact of this on model performance…

    Submitted 16 June, 2023; v1 submitted 14 April, 2023; originally announced April 2023.

    Comments: Accepted for EUSIPCO 2023

  20. arXiv:2303.00550

    eess.AS cs.SD

    Towards domain generalisation in ASR with elitist sampling and ensemble knowledge distillation

    Authors: Rehan Ahmad, Md Asif Jalal, Muhammad Umar Farooq, Anna Ollerenshaw, Thomas Hain

    Abstract: Knowledge distillation has widely been used for model compression and domain adaptation for speech applications. In the presence of multiple teachers, knowledge can easily be transferred to the student by averaging the models output. However, previous research shows that the student do not adapt well with such combination. This paper propose to use an elitist sampling strategy at the output of ens…

    Submitted 1 March, 2023; originally announced March 2023.

  21. arXiv:2301.04388

    cs.SD cs.CL cs.LG eess.AS

    Perceive and predict: self-supervised speech representation based loss functions for speech enhancement

    Authors: George Close, William Ravenscroft, Thomas Hain, Stefan Goetze

    Abstract: Recent work in the domain of speech enhancement has explored the use of self-supervised speech representations to aid in the training of neural speech enhancement models. However, much of this work focuses on using the deepest or final outputs of self supervised speech representation models, rather than the earlier feature encodings. The use of self supervised representations in such a way is ofte…

    Submitted 26 June, 2023; v1 submitted 11 January, 2023; originally announced January 2023.

    Comments: 4 pages, accepted at ICASSP 2023

    Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  22. arXiv:2211.02000

    cs.SD cs.CL eess.AS

    Dynamic Kernels and Channel Attention for Low Resource Speaker Verification

    Authors: Anna Ollerenshaw, Md Asif Jalal, Thomas Hain

    Abstract: State-of-the-art speaker verification frameworks have typically focused on developing models with increasingly deeper (more layers) and wider (number of channels) models to improve their verification performance. Instead, this paper proposes an approach to increase the model resolution capability using attention-based dynamic kernels in a convolutional neural network to adapt the model parameters…

    Submitted 27 February, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

  23. arXiv:2211.01993

    cs.CL cs.SD eess.AS

    Probing Statistical Representations For End-To-End ASR

    Authors: Anna Ollerenshaw, Md Asif Jalal, Thomas Hain

    Abstract: End-to-End automatic speech recognition (ASR) models aim to learn a generalised speech representation to perform recognition. In this domain there is little research to analyse internal representation dependencies and their relationship to modelling approaches. This paper investigates cross-domain language model dependencies within transformer architectures using SVCCA and uses these insights to e…

    Submitted 3 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  24. arXiv:2210.15305

    cs.SD cs.AI eess.AS

    Deformable Temporal Convolutional Networks for Monaural Noisy Reverberant Speech Separation

    Authors: William Ravenscroft, Stefan Goetze, Thomas Hain

    Abstract: Speech separation models are used for isolating individual speakers in many speech processing applications. Deep learning models have been shown to lead to state-of-the-art (SOTA) results on a number of speech separation benchmarks. One such class of models known as temporal convolutional networks (TCNs) has shown promising results for speech separation tasks. A limitation of these models is that…

    Submitted 10 March, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: Accepted for ICASSP 2023

  25. Unsupervised data selection for Speech Recognition with contrastive loss ratios

    Authors: Chanho Park, Rehan Ahmad, Thomas Hain

    Abstract: This paper proposes an unsupervised data selection method by using a submodular function based on contrastive loss ratios of target and training data sets. A model using a contrastive loss function is trained on both sets. Then the ratio of frame-level losses for each model is used by a submodular function. By using the submodular function, a training set for automatic speech recognition matching…

    Submitted 25 July, 2022; originally announced July 2022.

    Comments: 5 pages, accepted by ICASSP 2022

    Journal ref: IEEE Int. Conf. Acoust. Speech Signal Process. (2022) 8587-8591

  26. arXiv:2207.03391

    cs.CL eess.AS

    Non-Linear Pairwise Language Mappings for Low-Resource Multilingual Acoustic Model Fusion

    Authors: Muhammad Umar Farooq, Darshan Adiga Haniya Narayana, Thomas Hain

    Abstract: Multilingual speech recognition has drawn significant attention as an effective way to compensate data scarcity for low-resource languages. End-to-end (e2e) modelling is preferred over conventional hybrid systems, mainly because of no lexicon requirement. However, hybrid DNN-HMMs still outperform e2e models in limited data scenarios. Furthermore, the problem of manual lexicon creation has been all…

    Submitted 7 July, 2022; originally announced July 2022.

    Comments: Accepted for Interspeech 2022

  27. arXiv:2207.03390

    cs.CL eess.AS

    Investigating the Impact of Cross-lingual Acoustic-Phonetic Similarities on Multilingual Speech Recognition

    Authors: Muhammad Umar Farooq, Thomas Hain

    Abstract: Multilingual automatic speech recognition (ASR) systems mostly benefit low resource languages but suffer degradation in performance across several languages relative to their monolingual counterparts. Limited studies have focused on understanding the languages behaviour in the multilingual speech recognition setups. In this paper, a novel data-driven approach is proposed to investigate the cross-l…

    Submitted 7 July, 2022; originally announced July 2022.

    Comments: Accepted for Interspeech 2022

  28. A cross-corpus study on speech emotion recognition

    Authors: Rosanna Milner, Md Asif Jalal, Raymond W. M. Ng, Thomas Hain

    Abstract: For speech emotion datasets, it has been difficult to acquire large quantities of reliable data and acted emotions may be over the top compared to less expressive emotions displayed in everyday life. Lately, larger datasets with natural emotions have been created. Instead of ignoring smaller, acted datasets, this study investigates whether information learnt from acted emotions is useful for detec…

    Submitted 5 July, 2022; originally announced July 2022.

    Comments: ASRU 2019

    Journal ref: IEEE Workshop on Automatic Speech Recognition and Understanding 2019

  29. Insights on Neural Representations for End-to-End Speech Recognition

    Authors: Anna Ollerenshaw, Md Asif Jalal, Thomas Hain

    Abstract: End-to-end automatic speech recognition (ASR) models aim to learn a generalised speech representation. However, there are limited tools available to understand the internal functions and the effect of hierarchical dependencies within the model architecture. It is crucial to understand the correlations between the layer-wise representations, to derive insights on the relationship between neural rep…

    Submitted 19 May, 2022; originally announced May 2022.

    Comments: Submitted to Interspeech 2021

    Journal ref: Proc. Interspeech 2021, 4079-4083

  30. arXiv:2205.08455

    cs.SD cs.LG eess.AS

    Utterance Weighted Multi-Dilation Temporal Convolutional Networks for Monaural Speech Dereverberation

    Authors: William Ravenscroft, Stefan Goetze, Thomas Hain

    Abstract: Speech dereverberation is an important stage in many speech technology applications. Recent work in this area has been dominated by deep neural network models. Temporal convolutional networks (TCNs) are deep learning models that have been proposed for sequence modelling in the task of dereverberating speech. In this work a weighted multi-dilation depthwise-separable convolution is proposed to repl…

    Submitted 22 July, 2022; v1 submitted 17 May, 2022; originally announced May 2022.

    Comments: Accepted at IWAENC 2022

  31. arXiv:2204.06439

    cs.SD cs.LG eess.AS

    Receptive Field Analysis of Temporal Convolutional Networks for Monaural Speech Dereverberation

    Authors: William Ravenscroft, Stefan Goetze, Thomas Hain

    Abstract: Speech dereverberation is often an important requirement in robust speech processing tasks. Supervised deep learning (DL) models give state-of-the-art performance for single-channel speech dereverberation. Temporal convolutional networks (TCNs) are commonly used for sequence modelling in speech enhancement tasks. A feature of TCNs is that they have a receptive field (RF) dependent on the specific…

    Submitted 1 July, 2022; v1 submitted 13 April, 2022; originally announced April 2022.

    Comments: Accepted at EUSIPCO 2022

  32. arXiv:2203.17172

    eess.AS

    Efficient Non-Autoregressive GAN Voice Conversion using VQWav2vec Features and Dynamic Convolution

    Authors: Mingjie Chen, Yanghao Zhou, Heyan Huang, Thomas Hain

    Abstract: It was shown recently that a combination of ASR and TTS models yield highly competitive performance on standard voice conversion tasks such as the Voice Conversion Challenge 2020 (VCC2020). To obtain good performance both models require pretraining on large amounts of data, thereby obtaining large models that are potentially inefficient in use. In this work we present a model that is significantly…

    Submitted 31 March, 2022; originally announced March 2022.

    Comments: submitted to interspeech 2022

  33. arXiv:2203.12369

    cs.SD cs.LG eess.AS

    MetricGAN+/-: Increasing Robustness of Noise Reduction on Unseen Data

    Authors: George Close, Thomas Hain, Stefan Goetze

    Abstract: Training of speech enhancement systems often does not incorporate knowledge of human perception and thus can lead to unnatural sounding results. Incorporating psychoacoustically motivated speech perception metrics as part of model training via a predictor network has recently gained interest. However, the performance of such predictors is limited by the distribution of metric scores that appear in…

    Submitted 15 June, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

    Comments: 5 pages, 4 figures, Accepted to EUSIPCO 2022

  34. arXiv:2010.16071

    cs.SD cs.CL cs.LG eess.AS

    T-vectors: Weakly Supervised Speaker Identification Using Hierarchical Transformer Model

    Authors: Yanpei Shi, Mingjie Chen, Qiang Huang, Thomas Hain

    Abstract: Identifying multiple speakers without knowing where a speaker's voice is in a recording is a challenging task. This paper proposes a hierarchical network with transformer encoders and memory mechanism to address this problem. The proposed model contains a frame-level encoder and segment-level encoder, both of them make use of the transformer encoder block. The multi-head attention mechanism in the…

    Submitted 29 October, 2020; originally announced October 2020.

    Comments: Submitted to ICASSP2021. arXiv admin note: text overlap with arXiv:2005.07817

  35. arXiv:2010.11646

    cs.SD eess.AS

    Towards Low-Resource StarGAN Voice Conversion using Weight Adaptive Instance Normalization

    Authors: Mingjie Chen, Yanpei Shi, Thomas Hain

    Abstract: Many-to-many voice conversion with non-parallel training data has seen significant progress in recent years. StarGAN-based models have been interests of voice conversion. However, most of the StarGAN-based methods only focused on voice conversion experiments for the situations where the number of speakers was small, and the amount of training data was large. In this work, we aim at improving the d…

    Submitted 10 April, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted by ICASSP2021

  36. arXiv:2010.11286

    eess.AS cs.SD

    Improving Audio Anomalies Recognition Using Temporal Convolutional Attention Network

    Authors: Qiang Huang, Thomas Hain

    Abstract: Anomalous audio in speech recordings is often caused by speaker voice distortion, external noise, or even electric interferences. These obstacles have become a serious problem in some fields, such as high-quality music mixing and speech processing. In this paper, a novel approach using a temporal convolutional attention network (TCAN) is proposed to tackle this problem. The use of temporal convent…

    Submitted 9 February, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

    Comments: 5 pages, accepted by ICASSP'2021

  37. arXiv:2008.06892

    eess.AS cs.SD

    Unsupervised Acoustic Unit Representation Learning for Voice Conversion using WaveNet Auto-encoders

    Authors: Mingjie Chen, Thomas Hain

    Abstract: Unsupervised representation learning of speech has been of keen interest in recent years, which is for example evident in the wide interest of the ZeroSpeech challenges. This work presents a new method for learning frame level representations based on WaveNet auto-encoders. Of particular interest in the ZeroSpeech Challenge 2019 were models with discrete latent variable such as the Vector Quantize…

    Submitted 16 August, 2020; originally announced August 2020.

    Comments: To be presented in Interspeech 2020

  38. arXiv:2005.08053

    eess.AS cs.CL cs.SD

    Exploration of Audio Quality Assessment and Anomaly Localisation Using Attention Models

    Authors: Qiang Huang, Thomas Hain

    Abstract: Many applications of speech technology require more and more audio data. Automatic assessment of the quality of the collected recordings is important to ensure they meet the requirements of the related applications. However, effective and high performing assessment remains a challenging task without a clean reference. In this paper, a novel model for audio quality assessment is proposed by jointly…

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Submitted to InterSpeech 2020

  39. arXiv:2005.07818

    eess.AS cs.CL cs.SD

    Speaker Re-identification with Speaker Dependent Speech Enhancement

    Authors: Yanpei Shi, Qiang Huang, Thomas Hain

    Abstract: While the use of deep neural networks has significantly boosted speaker recognition performance, it is still challenging to separate speakers in poor acoustic environments. Here speech enhancement methods have traditionally allowed improved performance. The recent works have shown that adapting speech enhancement can lead to further gains. This paper introduces a novel approach that cascades speec…

    Submitted 27 August, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

    Comments: Accepted for presentation at Interspeech2020

  40. arXiv:2005.07817

    eess.AS cs.CL cs.SD

    Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification

    Authors: Yanpei Shi, Qiang Huang, Thomas Hain

    Abstract: Identifying multiple speakers without knowing where a speaker's voice is in a recording is a challenging task. In this paper, a hierarchical attention network is proposed to solve a weakly labelled speaker identification problem. The use of a hierarchical structure, consisting of a frame-level encoder and a segment-level encoder, aims to learn speaker related information locally and globally. Spee…

    Submitted 27 August, 2020; v1 submitted 15 May, 2020; originally announced May 2020.

    Comments: Accepted for presentation at Interspeech2020

  41. arXiv:2001.06397

    cs.SD cs.CL cs.LG eess.AS

    Supervised Speaker Embedding De-Mixing in Two-Speaker Environment

    Authors: Yanpei Shi, Thomas Hain

    Abstract: Separating different speaker properties from a multi-speaker environment is challenging. Instead of separating a two-speaker signal in signal space like speech source separation, a speaker embedding de-mixing approach is proposed. The proposed approach separates different speaker properties from a two-speaker signal in embedding space. The proposed approach contains two steps. In step one, the cle…

    Submitted 5 February, 2021; v1 submitted 14 January, 2020; originally announced January 2020.

    Comments: Published at SLT2021

  42. arXiv:2001.05031

    cs.CL cs.LG cs.SD eess.AS

    Robust Speaker Recognition Using Speech Enhancement And Attention Model

    Authors: Yanpei Shi, Qiang Huang, Thomas Hain

    Abstract: In this paper, a novel architecture for speaker recognition is proposed by cascading speech enhancement and speaker processing. Its aim is to improve speaker recognition performance when speech signals are corrupted by noise. Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural net…

    Submitted 22 May, 2020; v1 submitted 14 January, 2020; originally announced January 2020.

    Comments: Accepted by Odyssey 2020

  43. arXiv:1910.07900

    cs.CL cs.LG cs.SD eess.AS

    H-VECTORS: Utterance-level Speaker Embedding Using A Hierarchical Attention Model

    Authors: Yanpei Shi, Qiang Huang, Thomas Hain

    Abstract: In this paper, a hierarchical attention network to generate utterance-level embeddings (H-vectors) for speaker identification is proposed. Since different parts of an utterance may have different contributions to speaker identities, the use of hierarchical structure aims to learn speaker related information locally and globally. In the proposed approach, frame-level encoder and attention are appli…

    Submitted 19 October, 2019; v1 submitted 17 October, 2019; originally announced October 2019.

  44. arXiv:1910.07601

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Contextual Joint Factor Acoustic Embeddings

    Authors: Yanpei Shi, Thomas Hain

    Abstract: Embedding acoustic information into fixed length representations is of interest for a whole range of applications in speech and audio technology. Two novel unsupervised approaches to generate acoustic embeddings by modelling of acoustic context are proposed. The first approach is a contextual joint factor synthesis encoder, where the encoder in an encoder/decoder framework is trained to extract jo…

    Submitted 5 February, 2021; v1 submitted 16 October, 2019; originally announced October 2019.

    Comments: Published at SLT2021

  45. arXiv:1909.11200

    eess.AS cs.AI cs.CL cs.SD

    Improving Noise Robustness In Speaker Identification Using A Two-Stage Attention Model

    Authors: Yanpei Shi, Qiang Huang, Thomas Hain

    Abstract: While the use of deep neural networks has significantly boosted speaker recognition performance, it is still challenging to separate speakers in poor acoustic environments. To improve robustness of speaker recognition system performance in noise, a novel two-stage attention mechanism which can be used in existing architectures such as Time Delay Neural Networks (TDNNs) and Convolutional Neural Net…

    Submitted 15 May, 2020; v1 submitted 24 September, 2019; originally announced September 2019.

    Comments: Submitted to Interspeech2020