Showing 1–45 of 45 results for author: Renals, S

Searching in archive cs.
  1. arXiv:2406.00898  [pdf, other]

    cs.SD cs.CL eess.AS

    Phonetic Error Analysis of Raw Waveform Acoustic Models with Parametric and Non-Parametric CNNs

    Authors: Erfan Loweimi, Andrea Carmantini, Peter Bell, Steve Renals, Zoran Cvetkovic

    Abstract: In this paper, we analyse the error patterns of the raw waveform acoustic models in TIMIT's phone recognition task. Our analysis goes beyond the conventional phone error rate (PER) metric. We categorise the phones into three groups: {affricate, diphthong, fricative, nasal, plosive, semi-vowel, vowel, silence}, {consonant, vowel+, silence}, and {voiced, unvoiced, silence} and compute the PER for e…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: 5 pages, 6 figures, 3 tables

  2. arXiv:2110.08634  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Towards Robust Waveform-Based Acoustic Models

    Authors: Dino Oglic, Zoran Cvetkovic, Peter Sollich, Steve Renals, Bin Yu

    Abstract: We study the problem of learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. First, we characterize data augmentation theoretically as an instance of vicinal risk minimization, wh…

    Submitted 29 June, 2022; v1 submitted 16 October, 2021; originally announced October 2021.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022

  3. arXiv:2105.15162  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD eess.IV

    Automatic audiovisual synchronisation for ultrasound tongue imaging

    Authors: Aciel Eshky, Joanne Cleland, Manuel Sam Ribeiro, Eleanor Sugden, Korin Richmond, Steve Renals

    Abstract: Ultrasound tongue imaging is used to visualise the intra-oral articulators during speech production. It is utilised in a range of applications, including speech and language therapy and phonetics research. Ultrasound and speech audio are recorded simultaneously, and in order to correctly use this data, the two modalities should be correctly synchronised. Synchronisation is achieved using specialis…

    Submitted 31 May, 2021; originally announced May 2021.

    Comments: 18 pages, 10 figures. Manuscript accepted at Speech Communication

  4. arXiv:2103.00333  [pdf, other]

    eess.AS cs.CL cs.SD q-bio.QM

    Silent versus modal multi-speaker speech recognition from ultrasound and video

    Authors: Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

    Abstract: We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode misma…

    Submitted 27 February, 2021; originally announced March 2021.

    Comments: 5 pages, 5 figures, Submitted to Interspeech 2021

  5. arXiv:2103.00324  [pdf, ps, other]

    eess.AS cs.CL cs.SD q-bio.NC

    Exploiting ultrasound tongue imaging for the automatic detection of speech articulation errors

    Authors: Manuel Sam Ribeiro, Joanne Cleland, Aciel Eshky, Korin Richmond, Steve Renals

    Abstract: Speech sound disorders are a common communication impairment in childhood. Because speech disorders can negatively affect the lives and the development of children, clinical intervention is often recommended. To help with diagnosis and treatment, clinicians use instrumented methods such as spectrograms or ultrasound tongue imaging to analyse speech articulations. Analysis with these methods can be…

    Submitted 27 February, 2021; originally announced March 2021.

    Comments: 15 pages, 9 figures, 6 tables

    Journal ref: Speech Communication, Volume 128, April 2021, Pages 24-34

  6. arXiv:2102.04697  [pdf, other]

    eess.AS cs.AI cs.SD

    Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers

    Authors: Shucong Zhang, Cong-Thanh Do, Rama Doddipatla, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Although the lower layers of a deep neural network learn features which are transferable across datasets, these layers are not transferable within the same dataset. That is, in general, freezing the trained feature extractor (the lower layers) and retraining the classifier (the upper layers) on the same dataset leads to worse performance. In this paper, for the first time, we show that the frozen…

    Submitted 9 February, 2021; originally announced February 2021.

    Comments: Accepted by ICASSP 2021

  7. arXiv:2011.09804  [pdf, other]

    eess.AS cs.CL cs.CV cs.SD eess.IV

    TaL: a synchronised multi-speaker corpus of ultrasound tongue imaging, audio, and lip videos

    Authors: Manuel Sam Ribeiro, Jennifer Sanger, Jing-Xuan Zhang, Aciel Eshky, Alan Wrench, Korin Richmond, Steve Renals

    Abstract: We present the Tongue and Lips corpus (TaL), a multi-speaker corpus of audio, ultrasound tongue imaging, and lip videos. TaL consists of two parts: TaL1 is a set of six recording sessions of one professional voice talent, a male native speaker of English; TaL80 is a set of recording sessions of 81 native speakers of English without voice talent experience. Overall, the corpus contains 24 hours of…

    Submitted 19 November, 2020; originally announced November 2020.

    Comments: 8 pages, 4 figures, Accepted to SLT2021, IEEE Spoken Language Technology Workshop

  8. arXiv:2011.04906  [pdf, other]

    cs.CL cs.SD eess.AS

    On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers

    Authors: Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Self-attention models such as Transformers, which can capture temporal relationships without being limited by the distance between events, have given competitive speech recognition results. However, we note the range of the learned context increases from the lower to upper self-attention layers, whilst acoustic events often happen within short time spans in a left-to-right order. This leads to a q…

    Submitted 8 November, 2020; originally announced November 2020.

    Comments: arXiv admin note: substantial text overlap with arXiv:2005.13895

  9. arXiv:2011.04004  [pdf, other]

    cs.CL cs.SD eess.AS

    Stochastic Attention Head Removal: A simple and effective method for improving Transformer Based ASR Models

    Authors: Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Recently, Transformer based models have shown competitive automatic speech recognition (ASR) performance. One key factor in the success of these models is the multi-head attention mechanism. However, for trained models, we have previously observed that many attention matrices are close to diagonal, indicating the redundancy of the corresponding attention heads. We have also found that some archite…

    Submitted 6 April, 2021; v1 submitted 8 November, 2020; originally announced November 2020.

  10. arXiv:2010.14269  [pdf, other]

    cs.SD cs.LG eess.AS

    Leveraging speaker attribute information using multi task learning for speaker verification and diarization

    Authors: Chau Luu, Peter Bell, Steve Renals

    Abstract: Deep speaker embeddings have become the leading method for encoding speaker identity in speaker recognition tasks. The embedding space should ideally capture the variations between all possible speakers, encoding the multiple acoustic aspects that make up a speaker's identity, whilst being robust to non-speaker acoustic variation. Deep speaker embeddings are normally trained discriminatively, pred…

    Submitted 23 April, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: Submitted to Interspeech 2021

  11. arXiv:2008.06580  [pdf, other]

    eess.AS cs.CL cs.SD

    Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

    Authors: Peter Bell, Joachim Fainberg, Ondrej Klejch, Jinyu Li, Steve Renals, Pawel Swietojanski

    Abstract: We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data au…

    Submitted 28 February, 2021; v1 submitted 14 August, 2020; originally announced August 2020.

    Comments: Total of 31 pages, 27 figures. Associated repository: https://github.com/pswietojanski/ojsp_adaptation_review_2020

    Journal ref: IEEE Open Journal of Signal Processing, vol. 2, pp. 33-66, 2021

  12. arXiv:2008.03403  [pdf, other]

    eess.AS cs.CL cs.SD

    Word Error Rate Estimation Without ASR Output: e-WER2

    Authors: Ahmed Ali, Steve Renals

    Abstract: Measuring the performance of automatic speech recognition (ASR) systems requires manually transcribed data in order to compute the word error rate (WER), which is often time-consuming and expensive. In this paper, we continue our effort in estimating WER using acoustic, lexical and phonotactic features. Our novel approach to estimate the WER uses a multistream end-to-end architecture. We report re…

    Submitted 7 August, 2020; originally announced August 2020.

  13. arXiv:2005.13895  [pdf, other]

    eess.AS cs.CL cs.SD

    When Can Self-Attention Be Replaced by Feed Forward Layers?

    Authors: Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Recently, self-attention models such as Transformers have given competitive results compared to recurrent neural network systems in speech recognition. The key factor for the outstanding performance of self-attention models is their ability to capture temporal relationships without being limited by the distance between two related events. However, we note that the range of the learned context prog…

    Submitted 28 May, 2020; originally announced May 2020.

  14. arXiv:2003.13551  [pdf]

    cs.CL

    European Language Grid: An Overview

    Authors: Georg Rehm, Maria Berger, Ela Elsholz, Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Stelios Piperidis, Miltos Deligiannis, Dimitris Galanis, Katerina Gkirtzou, Penny Labropoulou, Kalina Bontcheva, David Jones, Ian Roberts, Jan Hajic, Jana Hamrlová, Lukáš Kačena, Khalid Choukri, Victoria Arranz, Andrejs Vasiļjevs, Orians Anvari, Andis Lagzdiņš, Jūlija Meļņika, Gerhard Backfried, Erinç Dikici, et al. (11 additional authors not shown)

    Abstract: With 24 official EU and many additional languages, multilingualism in Europe and an inclusive Digital Single Market can only be enabled through Language Technologies (LTs). European LT business is dominated by hundreds of SMEs and a few large players. Many are world-class, with technologies that outperform the global players. However, European LT business is also fragmented, by nation states, lang…

    Submitted 30 March, 2020; originally announced March 2020.

    Comments: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). To appear

  15. arXiv:2002.00453  [pdf, other]

    cs.SD cs.LG eess.AS

    DropClass and DropAdapt: Dropping classes for deep speaker representation learning

    Authors: Chau Luu, Peter Bell, Steve Renals

    Abstract: Many recent works on deep speaker embeddings train their feature extraction networks on large classification tasks, distinguishing between all speakers in a training set. Empirically, this has been shown to produce speaker-discriminative embeddings, even for unseen speakers. However, it is not clear that this is the optimal means of training embeddings that generalize well. This work proposes two…

    Submitted 2 February, 2020; originally announced February 2020.

    Comments: Submitted to Speaker Odyssey 2020

  16. arXiv:1910.14443  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Multi-scale Octave Convolutions for Robust Speech Recognition

    Authors: Joanna Rownicka, Peter Bell, Steve Renals

    Abstract: We propose a multi-scale octave convolution layer to learn robust speech representations efficiently. Octave convolutions were introduced by Chen et al [1] in the computer vision field to reduce the spatial redundancy of the feature maps by decomposing the output of a convolutional layer into feature maps at two different spatial resolutions, one octave apart. This approach improved the efficiency…

    Submitted 31 October, 2019; originally announced October 2019.

    Comments: submitted to ICASSP2020

  17. Channel adversarial training for speaker verification and diarization

    Authors: Chau Luu, Peter Bell, Steve Renals

    Abstract: Previous work has encouraged domain-invariance in deep speaker embedding by adversarially classifying the dataset or labelled environment to which the generated features belong. We propose a training strategy which aims to produce features that are invariant at the granularity of the recording or channel, a finer grained objective than dataset- or environment-invariance. By training an adversary t…

    Submitted 25 October, 2019; originally announced October 2019.

    Comments: Submitted to IEEE ICASSP 2020

  18. arXiv:1910.10605  [pdf, ps, other]

    cs.CL cs.LG eess.AS

    Speaker Adaptive Training using Model Agnostic Meta-Learning

    Authors: Ondřej Klejch, Joachim Fainberg, Peter Bell, Steve Renals

    Abstract: Speaker adaptive training (SAT) of neural network acoustic models learns models in a way that makes them more suitable for adaptation to test conditions. Conventionally, model-based speaker adaptive training is performed by having a set of speaker dependent parameters that are jointly optimised with speaker independent parameters in order to remove speaker variation. However, this does not scale w…

    Submitted 23 October, 2019; originally announced October 2019.

    Comments: Accepted to IEEE ASRU 2019

  19. arXiv:1909.13759  [pdf, other]

    eess.AS cs.CL cs.SD

    Acoustic Model Adaptation from Raw Waveforms with SincNet

    Authors: Joachim Fainberg, Ondřej Klejch, Erfan Loweimi, Peter Bell, Steve Renals

    Abstract: Raw waveform acoustic modelling has recently gained interest due to neural networks' ability to learn feature extraction, and the potential for finding better representations for a given scenario than hand-crafted features. SincNet has been proposed to reduce the number of parameters required in raw-waveform modelling, by restricting the filter functions, rather than having to learn every tap of e…

    Submitted 30 September, 2019; originally announced September 2019.

    Comments: Accepted to IEEE ASRU 2019

  20. arXiv:1909.13537  [pdf, other]

    cs.CL cs.SD eess.AS

    Embeddings for DNN speaker adaptive training

    Authors: Joanna Rownicka, Peter Bell, Steve Renals

    Abstract: In this work, we investigate the use of embeddings for speaker-adaptive training of DNNs (DNN-SAT) focusing on a small amount of adaptation data per speaker. DNN-SAT can be viewed as learning a mapping from each embedding to transformation parameters that are applied to the shared parameters of the DNN. We investigate different approaches to applying these transformations, and find that with a goo…

    Submitted 30 September, 2019; originally announced September 2019.

    Comments: Accepted at ASRU 2019

  21. arXiv:1907.01413  [pdf, other]

    eess.AS cs.CL cs.CV cs.LG cs.SD eess.IV

    Speaker-independent classification of phonetic segments from raw ultrasound in child speech

    Authors: Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

    Abstract: Ultrasound tongue imaging (UTI) provides a convenient way to visualize the vocal tract during speech production. UTI is increasingly being used for speech therapy, making it important to develop automatic methods to assist various time-consuming manual tasks currently performed by speech therapists. A key challenge is to generalize the automatic processing of ultrasound tongue images to previously…

    Submitted 1 July, 2019; originally announced July 2019.

    Comments: 5 pages, 4 figures, published in ICASSP2019 (IEEE International Conference on Acoustics, Speech and Signal Processing, 2019)

  22. arXiv:1907.00818  [pdf, other]

    eess.AS cs.CL cs.SD eess.IV

    Ultrasound tongue imaging for diarization and alignment of child speech therapy sessions

    Authors: Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

    Abstract: We investigate the automatic processing of child speech therapy sessions using ultrasound visual biofeedback, with a specific focus on complementing acoustic features with ultrasound images of the tongue for the tasks of speaker diarization and time-alignment of target words. For speaker diarization, we propose an ultrasound-based time-domain signal which we call estimated tongue activity. For wor…

    Submitted 15 August, 2019; v1 submitted 1 July, 2019; originally announced July 2019.

    Comments: 5 pages, 3 figures, Accepted for publication at Interspeech 2019

  23. arXiv:1907.00758  [pdf, other]

    cs.CL cs.CV cs.LG cs.SD eess.AS eess.IV

    Synchronising audio and ultrasound by learning cross-modal embeddings

    Authors: Aciel Eshky, Manuel Sam Ribeiro, Korin Richmond, Steve Renals

    Abstract: Audiovisual synchronisation is the task of determining the time offset between speech audio and a video recording of the articulators. In child speech therapy, audio and ultrasound videos of the tongue are captured using instruments which rely on hardware to synchronise the two modalities at recording time. Hardware synchronisation can fail in practice, and no mechanism exists to synchronise the s…

    Submitted 27 November, 2019; v1 submitted 1 July, 2019; originally announced July 2019.

    Comments: 5 pages, 1 figure, 4 tables; Interspeech 2019 with the following edits: 1) Loss and accuracy upon convergence were accidentally reported from an older model. Now updated with model described throughout the paper. All other results remain unchanged. 2) Max true offset in the training data corrected from 179ms to 1789ms. 3) Detectability "boundary/range" renamed to detectability "thresholds"

  24. arXiv:1906.11521  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Lattice-Based Unsupervised Test-Time Adaptation of Neural Network Acoustic Models

    Authors: Ondrej Klejch, Joachim Fainberg, Peter Bell, Steve Renals

    Abstract: Acoustic model adaptation to unseen test recordings aims to reduce the mismatch between training and testing conditions. Most adaptation schemes for neural network models require the use of an initial one-best transcription for the test data, generated by an unadapted model, in order to estimate the adaptation transform. It has been found that adaptation methods using discriminative objective func…

    Submitted 27 June, 2019; originally announced June 2019.

  25. arXiv:1905.13150  [pdf, other]

    cs.CL cs.SD eess.AS

    Lattice-based lightly-supervised acoustic model training

    Authors: Joachim Fainberg, Ondřej Klejch, Steve Renals, Peter Bell

    Abstract: In the broadcast domain there is an abundance of related text data and partial transcriptions, such as closed captions and subtitles. This text data can be used for lightly supervised training, in which text matching the audio is selected using an existing speech recognition model. Current approaches to light supervision typically filter the data based on matching error rates between the transcrip…

    Submitted 13 July, 2019; v1 submitted 30 May, 2019; originally announced May 2019.

    Comments: Proc. INTERSPEECH 2019

  26. arXiv:1904.08378  [pdf, other]

    cs.LG cs.NE stat.ML

    Dynamic Evaluation of Transformer Language Models

    Authors: Ben Krause, Emmanuel Kahembwe, Iain Murray, Steve Renals

    Abstract: This research note combines two methods that have recently improved the state of the art in language modeling: Transformers and dynamic evaluation. Transformers use stacked layers of self-attention that allow them to capture long range dependencies in sequential data. Dynamic evaluation fits models to the recent sequence history, allowing them to assign higher probabilities to re-occurring sequent…

    Submitted 17 April, 2019; originally announced April 2019.

  27. arXiv:1811.04708  [pdf, other]

    cs.CL

    Analyzing deep CNN-based utterance embeddings for acoustic model adaptation

    Authors: Joanna Rownicka, Peter Bell, Steve Renals

    Abstract: We explore why deep convolutional neural networks (CNNs) with small two-dimensional kernels, primarily used for modeling spatial relations in images, are also effective in speech recognition. We analyze the representations learned by deep CNNs and compare them with deep neural network (DNN) representations and i-vectors, in the context of acoustic model adaptation. To explore whether interpretable…

    Submitted 12 November, 2018; originally announced November 2018.

    Comments: accepted to SLT 2018

  28. arXiv:1709.07484  [pdf, other]

    cs.CL

    WERd: Using Social Text Spelling Variants for Evaluating Dialectal Speech Recognition

    Authors: Ahmed Ali, Preslav Nakov, Peter Bell, Steve Renals

    Abstract: We study the problem of evaluating automatic speech recognition (ASR) systems that target dialectal speech input. A major challenge in this case is that the orthography of dialects is typically not standardized. From an ASR evaluation perspective, this means that there is no clear gold standard for the expected output, and several possible outputs could be considered correct according to different…

    Submitted 21 September, 2017; originally announced September 2017.

    Comments: ASRU-2017

    MSC Class: 68T10 ACM Class: I.2.7

  29. arXiv:1709.07432  [pdf, other]

    cs.NE cs.CL

    Dynamic Evaluation of Neural Sequence Models

    Authors: Ben Krause, Emmanuel Kahembwe, Iain Murray, Steve Renals

    Abstract: We present methodology for using dynamic evaluation to improve neural sequence models. Models are adapted to recent history via a gradient descent based mechanism, causing them to assign higher probabilities to re-occurring sequential patterns. Dynamic evaluation outperforms existing adaptation approaches in our comparisons. Dynamic evaluation improves the state-of-the-art word-level perplexities…

    Submitted 25 October, 2017; v1 submitted 21 September, 2017; originally announced September 2017.

  30. arXiv:1709.07276  [pdf, other]

    cs.CL

    Speech Recognition Challenge in the Wild: Arabic MGB-3

    Authors: Ahmed Ali, Stephan Vogel, Steve Renals

    Abstract: This paper describes the Arabic MGB-3 Challenge - Arabic Speech Recognition in the Wild. Unlike last year's Arabic MGB-2 Challenge, for which the recognition task was based on more than 1,200 hours of broadcast TV news recordings from Aljazeera Arabic TV programs, MGB-3 emphasises dialectal Arabic using a multi-genre collection of Egyptian YouTube videos. Seven genres were used for the data collectio…

    Submitted 21 September, 2017; originally announced September 2017.

  31. End-to-End Neural Segmental Models for Speech Recognition

    Authors: Hao Tang, Liang Lu, Lingpeng Kong, Kevin Gimpel, Karen Livescu, Chris Dyer, Noah A. Smith, Steve Renals

    Abstract: Segmental models are an alternative to frame-based models for sequence prediction, where hypothesized path weights are based on entire segment scores rather than a single frame at a time. Neural segmental models are segmental models that use neural network-based weight functions. Neural segmental models have achieved competitive results for speech recognition, and their end-to-end training has bee…

    Submitted 15 August, 2017; v1 submitted 1 August, 2017; originally announced August 2017.

  32. Small-footprint Highway Deep Neural Networks for Speech Recognition

    Authors: Liang Lu, Steve Renals

    Abstract: State-of-the-art speech recognition systems typically employ neural network acoustic models. However, compared to Gaussian mixture models, deep neural network (DNN) based acoustic models often have many more model parameters, making it challenging for them to be deployed on resource-constrained platforms, such as mobile devices. In this paper, we study the application of the recently proposed high…

    Submitted 25 April, 2017; v1 submitted 18 October, 2016; originally announced October 2016.

    Comments: 9 pages, 6 figures. Accepted to IEEE/ACM Transactions on Audio, Speech and Language Processing, 2017. arXiv admin note: text overlap with arXiv:1608.00892, arXiv:1607.01963

  33. arXiv:1609.07959  [pdf, other]

    cs.NE stat.ML

    Multiplicative LSTM for sequence modelling

    Authors: Ben Krause, Liang Lu, Iain Murray, Steve Renals

    Abstract: We introduce multiplicative LSTM (mLSTM), a recurrent neural network architecture for sequence modelling that combines the long short-term memory (LSTM) and multiplicative recurrent neural network architectures. mLSTM is characterised by its ability to have different recurrent transition functions for each possible input, which we argue makes it more expressive for autoregressive density estimatio…

    Submitted 12 October, 2017; v1 submitted 26 September, 2016; originally announced September 2016.

  34. arXiv:1609.05650  [pdf, other]

    cs.CL

    Multi-view Dimensionality Reduction for Dialect Identification of Arabic Broadcast Speech

    Authors: Sameer Khurana, Ahmed Ali, Steve Renals

    Abstract: In this work, we present a new Vector Space Model (VSM) of speech utterances for the task of spoken dialect identification. Generally, DID systems are built using two sets of features that are extracted from speech utterances; acoustic and phonetic. The acoustic and phonetic features are used to form vector representations of speech utterances in an attempt to encode information about the spoken d…

    Submitted 19 September, 2016; originally announced September 2016.

  35. arXiv:1609.05625  [pdf, other]

    cs.CL

    The MGB-2 Challenge: Arabic Multi-Dialect Broadcast Media Recognition

    Authors: Ahmed Ali, Peter Bell, James Glass, Yacine Messaoui, Hamdy Mubarak, Steve Renals, Yifan Zhang

    Abstract: This paper describes the Arabic Multi-Genre Broadcast (MGB-2) Challenge for SLT-2016. Unlike last year's English MGB Challenge, which focused on recognition of diverse TV genres, this year, the challenge has an emphasis on handling the diversity in dialect in Arabic speech. Audio data comes from 19 distinct programmes from the Aljazeera Arabic TV channel between March 2005 and December 2015. Progr…

    Submitted 31 August, 2019; v1 submitted 19 September, 2016; originally announced September 2016.

  36. arXiv:1608.00892  [pdf, other]

    cs.CL

    Knowledge Distillation for Small-footprint Highway Networks

    Authors: Liang Lu, Michelle Guo, Steve Renals

    Abstract: Deep learning has significantly advanced the state of the art in speech recognition in the past few years. However, compared to conventional Gaussian mixture acoustic models, neural network models are usually much larger, and are therefore not very deployable in embedded devices. Previously, we investigated a compact highway deep neural network (HDNN) for acoustic modelling, which is a type of depth-g…

    Submitted 20 December, 2016; v1 submitted 2 August, 2016; originally announced August 2016.

    Comments: 5 pages, 2 figures, accepted to ICASSP 2017

  37. arXiv:1604.01221  [pdf]

    cs.CL

    Character-Level Neural Translation for Multilingual Media Monitoring in the SUMMA Project

    Authors: Guntis Barzdins, Steve Renals, Didzis Gosko

    Abstract: The paper steps outside the comfort-zone of the traditional NLP tasks like automatic speech recognition (ASR) and machine translation (MT) to address two novel problems arising in the automated multilingual news monitoring: segmentation of the TV and radio program ASR transcripts into individual stories, and clustering of the individual stories coming from various sources and languages into stor…

    Submitted 5 April, 2016; originally announced April 2016.

    Comments: LREC-2016 submission

  38. Differentiable Pooling for Unsupervised Acoustic Model Adaptation

    Authors: Pawel Swietojanski, Steve Renals

    Abstract: We present a deep neural network (DNN) acoustic model that includes parametrised and differentiable pooling operators. Unsupervised acoustic model adaptation is cast as the problem of updating the decision boundaries implemented by each pooling operator. In particular, we experiment with two types of pooling parametrisations: learned $L_p$-norm pooling and weighted Gaussian pooling, in which the w…

    Submitted 13 July, 2016; v1 submitted 31 March, 2016; originally announced March 2016.

    Comments: 11 pages, 7 Tables, 7 Figures in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, num. 11, 2016

  39. arXiv:1603.00223  [pdf, other]

    cs.CL cs.LG cs.NE

    Segmental Recurrent Neural Networks for End-to-end Speech Recognition

    Authors: Liang Lu, Lingpeng Kong, Chris Dyer, Noah A. Smith, Steve Renals

    Abstract: We study the segmental recurrent neural network for end-to-end acoustic modelling. This model connects the segmental conditional random field (CRF) with a recurrent neural network (RNN) used for feature extraction. Compared to most previous CRF-based acoustic models, it does not rely on an external system to provide features or segmentation boundaries. Instead, this model marginalises out all the…

    Submitted 20 June, 2016; v1 submitted 1 March, 2016; originally announced March 2016.

    Comments: 5 pages, 2 figures, accepted by Interspeech 2016

  40. Learning Hidden Unit Contributions for Unsupervised Acoustic Model Adaptation

    Authors: Pawel Swietojanski, Jinyu Li, Steve Renals

    Abstract: This work presents a broad study on the adaptation of neural network acoustic models by means of learning hidden unit contributions (LHUC) -- a method that linearly re-combines hidden units in a speaker- or environment-dependent manner using small amounts of unsupervised adaptation data. We also extend LHUC to a speaker adaptive training (SAT) framework that leads to a more adaptable DNN acoustic…

    Submitted 13 July, 2016; v1 submitted 12 January, 2016; originally announced January 2016.

    Comments: 14 pages, 9 Tables, 11 Figures in IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 24, Num. 8, 2016

  41. arXiv:1512.04280  [pdf, other]

    cs.CL cs.LG cs.NE

    Small-footprint Deep Neural Networks with Highway Connections for Speech Recognition

    Authors: Liang Lu, Steve Renals

    Abstract: For speech recognition, deep neural networks (DNNs) have significantly improved the recognition accuracy on most benchmark datasets and application domains. However, compared to the conventional Gaussian mixture models, DNN-based acoustic models usually have a much larger number of model parameters, making it challenging for their applications in resource constrained platforms, e.g., mobile devic…

    Submitted 14 June, 2017; v1 submitted 14 December, 2015; originally announced December 2015.

    Comments: 5 pages, 3 figures, fixed typo, accepted by Interspeech 2016

  42. arXiv:1509.06928  [pdf, ps, other]

    cs.CL

    Automatic Dialect Detection in Arabic Broadcast Speech

    Authors: Ahmed Ali, Najim Dehak, Patrick Cardinal, Sameer Khurana, Sree Harsha Yella, James Glass, Peter Bell, Steve Renals

    Abstract: We investigate different approaches for dialect identification in Arabic broadcast speech, using phonetic, lexical features obtained from a speech recognition system, and acoustic features using the i-vector framework. We studied both generative and discriminative classifiers, and we combined these features using a multi-class Support Vector Machine (SVM). We validated our results on an Arabic/Engli…

    Submitted 10 August, 2016; v1 submitted 23 September, 2015; originally announced September 2015.

  43. Tied Probabilistic Linear Discriminant Analysis for Speech Recognition

    Authors: Liang Lu, Steve Renals

    Abstract: Acoustic models using probabilistic linear discriminant analysis (PLDA) capture the correlations within feature vectors using subspaces which do not vastly expand the model. This allows high dimensional and correlated feature spaces to be used, without requiring the estimation of multiple high dimension covariance matrices. In this letter we extend the recently presented PLDA mixture model for spe…

    Submitted 4 November, 2014; originally announced November 2014.

  44. Information Extraction from Broadcast News

    Authors: Yoshihiko Gotoh, Steve Renals

    Abstract: This paper discusses the development of trainable statistical models for extracting content from television and radio news broadcasts. In particular we concentrate on statistical finite state models for identifying proper names and other named entities in broadcast speech. Two models are presented: the first represents name class information as a word attribute; the second represents both word-w…

    Submitted 30 March, 2000; originally announced March 2000.

    Comments: 12 pages, 3 figures, Philosophical Transactions of the Royal Society of London, series A: Mathematical, Physical and Engineering Sciences, vol. 358, 2000

    ACM Class: I.2.7

  45. arXiv:cs/0003081  [pdf, ps, other]

    cs.CL

    Variable Word Rate N-grams

    Authors: Yoshihiko Gotoh, Steve Renals

    Abstract: The rate of occurrence of words is not uniform but varies from document to document. Despite this observation, parameters for conventional n-gram language models are usually derived using the assumption of a constant word rate. In this paper we investigate the use of a variable word rate assumption, modelled by a Poisson distribution or a continuous mixture of Poissons. We present an approach to e…

    Submitted 29 March, 2000; originally announced March 2000.

    Comments: 4 pages, 4 figures, ICASSP-2000

    ACM Class: I.2.7