-
Fine-Tuned Self-Supervised Speech Representations for Language Diarization in Multilingual Code-Switched Speech
Authors:
Geoffrey Frost,
Emily Morris,
Joshua Jansen van Vüren,
Thomas Niesler
Abstract:
Annotating a multilingual code-switched corpus is a painstaking process requiring specialist linguistic expertise. This is partly due to the large number of language combinations that may appear within and across utterances, which might require several annotators with different linguistic expertise to consider an utterance sequentially. This is time-consuming and costly. It would be useful if the spoken languages in an utterance and the boundaries thereof were known before annotation commences, to allow segments to be assigned to the relevant language experts in parallel. To address this, we investigate the development of a continuous multilingual language diarizer using fine-tuned speech representations extracted from a large pre-trained self-supervised architecture (WavLM). We experiment with a code-switched corpus consisting of five South African languages (isiZulu, isiXhosa, Setswana, Sesotho and English) and show substantial diarization error rate improvements for language families, language groups, and individual languages over baseline systems.
Submitted 15 December, 2023;
originally announced December 2023.
-
TB or not TB? Acoustic cough analysis for tuberculosis classification
Authors:
Geoffrey Frost,
Grant Theron,
Thomas Niesler
Abstract:
In this work, we explore recurrent neural network architectures for tuberculosis (TB) cough classification. In contrast to previous unsuccessful attempts to implement deep architectures in this domain, we show that a basic bidirectional long short-term memory network (BiLSTM) can achieve improved performance. In addition, we show that by performing greedy feature selection in conjunction with a newly-proposed attention-based architecture that learns patient invariant features, substantially better generalisation can be achieved compared to a baseline and other considered architectures. Furthermore, this attention mechanism allows an inspection of the temporal regions of the audio signal considered to be important for classification to be performed. Finally, we develop a neural style transfer technique to infer idealised inputs which can subsequently be analysed. We find distinct differences between the idealised power spectra of TB and non-TB coughs, which provide clues about the origin of the features in the audio signal.
Submitted 2 September, 2022;
originally announced September 2022.
-
Automatic Tuberculosis and COVID-19 cough classification using deep learning
Authors:
Madhurananda Pahar,
Marisa Klopper,
Byron Reeve,
Rob Warren,
Grant Theron,
Andreas Diacon,
Thomas Niesler
Abstract:
We present a deep learning based automatic cough classifier which can discriminate tuberculosis (TB) coughs from COVID-19 coughs and healthy coughs. Both TB and COVID-19 are respiratory diseases, contagious, have cough as a predominant symptom and claim thousands of lives each year. The cough audio recordings were collected at both indoor and outdoor settings and also uploaded using smartphones from subjects around the globe, thus containing various levels of noise. These cough data include 1.68 hours of TB coughs, 18.54 minutes of COVID-19 coughs and 1.69 hours of healthy coughs from 47 TB patients, 229 COVID-19 patients and 1498 healthy patients and were used to train and evaluate a CNN, an LSTM and a Resnet50. These three deep architectures were also pre-trained on 2.14 hours of sneeze, 2.91 hours of speech and 2.79 hours of noise for improved performance. The class imbalance in our dataset was addressed by using the SMOTE data balancing technique and by using performance metrics such as F1-score and AUC. Our study shows that the highest F1-scores of 0.9259 and 0.8631 have been achieved from a pre-trained Resnet50 for two-class (TB vs COVID-19) and three-class (TB vs COVID-19 vs healthy) cough classification tasks, respectively. The application of deep transfer learning has improved the classifiers' performance and makes them more robust as they generalise better over the cross-validation folds. Their performances exceed the TB triage test requirements set by the World Health Organisation (WHO). The features producing the best performance contain higher-order MFCCs, suggesting that the differences between TB and COVID-19 coughs are not perceivable by the human ear. This type of cough audio classification is non-contact, cost-effective and can easily be deployed on a smartphone, thus it can be an excellent tool for both TB and COVID-19 screening.
Submitted 10 September, 2022; v1 submitted 11 May, 2022;
originally announced May 2022.
-
Accelerometer-based Bed Occupancy Detection for Automatic, Non-invasive Long-term Cough Monitoring
Authors:
Madhurananda Pahar,
Igor Miranda,
Andreas Diacon,
Thomas Niesler
Abstract:
We present a new machine learning based bed-occupancy detection system that uses the accelerometer signal captured by a bed-attached consumer smartphone. Automatic bed-occupancy detection is necessary for automatic long-term cough monitoring, since the time for which the monitored patient occupies the bed is required to accurately calculate a cough rate. Accelerometer measurements are more cost-effective and less intrusive than alternatives such as video monitoring or pressure sensors. A 249-hour dataset of manually-labelled acceleration signals gathered from seven patients undergoing treatment for tuberculosis (TB) was compiled for experimentation. These signals are characterised by brief activity bursts interspersed with long periods of little or no activity, even when the bed is occupied. To process them effectively, we propose an architecture consisting of three interconnected components. An occupancy-change detector locates instances at which bed occupancy is likely to have changed, an occupancy-interval detector classifies periods between detected occupancy changes and an occupancy-state detector corrects falsely-identified occupancy changes. Using long short-term memory (LSTM) networks, this architecture was demonstrated to achieve an AUC of 0.94. When integrated into a complete cough monitoring system, the daily cough rate of a patient undergoing TB treatment was determined over a period of 14 days. As the colony forming unit (CFU) counts decreased and the time to positivity (TPP) increased, the measured cough rate decreased, indicating effective TB treatment. This provides a first indication that automatic cough monitoring based on bed-mounted accelerometer measurements may present a non-invasive, non-intrusive and cost-effective means of monitoring long-term recovery of TB patients.
Submitted 13 March, 2022; v1 submitted 8 February, 2022;
originally announced February 2022.
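Illustrative note (not from the paper): the abstract above states that bed-occupancy time is needed to calculate a cough rate accurately. A minimal sketch of that normalisation follows, assuming cough timestamps and occupied intervals are already available in seconds. The function name and the input format are hypothetical and are not taken from the authors' system.

```python
def cough_rate_per_hour(cough_times, occupied_intervals):
    """Return coughs per hour of bed occupancy.

    cough_times: cough timestamps in seconds (hypothetical input format).
    occupied_intervals: non-overlapping (start, end) pairs in seconds
    during which the bed was classified as occupied.
    """
    # Total time the bed was occupied, in seconds.
    occupied_seconds = sum(end - start for start, end in occupied_intervals)
    if occupied_seconds <= 0:
        raise ValueError("no occupied time; cough rate is undefined")

    # Count only coughs that fall inside an occupied interval.
    counted = sum(
        1
        for t in cough_times
        if any(start <= t < end for start, end in occupied_intervals)
    )
    return counted / (occupied_seconds / 3600.0)


# Two occupied half-hour intervals; the cough at t=2000 s falls outside both.
rate = cough_rate_per_hour([10, 20, 2000, 5000], [(0, 1800), (3600, 5400)])
```

Dividing by occupied time rather than by elapsed time keeps the rate from being diluted by periods when the patient was out of bed.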
-
Mathematical Content Browsing for Print-Disabled Readers Based on Virtual-World Exploration and Audio-Visual Sensory substitution
Authors:
Rynhardt Kruger,
Febe de Wet,
Thomas Niesler
Abstract:
Documents containing mathematical content remain largely inaccessible to blind and visually impaired readers because they are predominantly published as untagged PDF which does not include the semantic data necessary for effective accessibility. We present a browsing approach for print-disabled readers specifically aimed at such mathematical content. This approach draws on the navigational mechanisms often used to explore the virtual worlds of text adventure games with audio-visual sensory substitution for graphical content. The relative spatial placement of the elements of an equation is represented as a virtual world, so that the reader can navigate from element to element. Text elements are announced conventionally using synthesised speech while graphical elements, such as roots and fraction lines, are rendered using a modification of the vOICe algorithm. The virtual world allows the reader to interactively discover the spatial structure of the equation, while the rendition of graphical elements as sound allows the shape and identity of elements that cannot be synthesised as speech to be discovered and recognised. The browsing approach was evaluated by eleven blind and fourteen sighted participants in a user trial that included the identification of twelve equations extracted from PDF documents. Overall, equations were identified completely correctly in 78% of cases (74% and 83% respectively for blind and sighted subjects). If partial correctness is considered, the performance is substantially higher. We conclude that the integration of a spatial model represented as a virtual world in conjunction with audio-visual sensory substitution for non-textual elements can be an effective way for blind and visually impaired readers to read currently inaccessible mathematical content in PDF documents.
Submitted 3 February, 2022;
originally announced February 2022.
-
Wake-Cough: cough spotting and cougher identification for personalised long-term cough monitoring
Authors:
Madhurananda Pahar,
Marisa Klopper,
Byron Reeve,
Rob Warren,
Grant Theron,
Andreas Diacon,
Thomas Niesler
Abstract:
We present 'wake-cough', an application of wake-word spotting to coughs using a Resnet50 and the identification of coughers using i-vectors, for the purpose of a long-term, personalised cough monitoring system. Coughs, recorded in a quiet (73±5 dB) and noisy (34±17 dB) environment, were used to extract i-vectors, x-vectors and d-vectors, used as features to the classifiers. The system achieves 90.02% accuracy when using an MLP to discriminate between 51 coughers using 2-sec long cough segments in the noisy environment. When discriminating between 5 and 14 coughers using longer (100 sec) segments in the quiet environment, this accuracy improves to 99.78% and 98.39% respectively. Unlike speech, i-vectors outperform x-vectors and d-vectors in identifying coughers. These coughs were added as an extra class to the Google Speech Commands dataset and features were extracted by preserving the end-to-end time-domain information in a trigger phrase. The highest accuracy of 88.58% is achieved in spotting coughs among 35 other trigger phrases using a Resnet50. Thus, wake-cough represents a personalised, non-intrusive cough monitoring system, which is power-efficient as on-device wake-word detection can keep a smartphone-based monitoring device mostly dormant. This makes wake-cough extremely attractive in multi-bed ward environments to monitor patients' long-term recovery from lung ailments such as tuberculosis (TB) and COVID-19.
Submitted 10 September, 2022; v1 submitted 7 October, 2021;
originally announced October 2021.
-
Automatic non-invasive Cough Detection based on Accelerometer and Audio Signals
Authors:
Madhurananda Pahar,
Igor Miranda,
Andreas Diacon,
Thomas Niesler
Abstract:
We present an automatic non-invasive way of detecting cough events based on both accelerometer and audio signals.
The acceleration signals are captured by a smartphone firmly attached to the patient's bed, using its integrated accelerometer.
The audio signals are captured simultaneously by the same smartphone using an external microphone.
We have compiled a manually-annotated dataset containing such simultaneously-captured acceleration and audio signals for approximately 6000 cough and 68000 non-cough events from 14 adult male patients in a tuberculosis clinic.
LR, SVM and MLP are evaluated as baseline classifiers and compared with deep architectures such as CNN, LSTM, and Resnet50 using a leave-one-out cross-validation scheme.
We find that the studied classifiers can use either acceleration or audio signals to distinguish between coughing and other activities including sneezing, throat-clearing, and movement on the bed with high accuracy.
However, in all cases, the deep neural networks outperform the shallow classifiers by a clear margin and the Resnet50 offers the best performance by achieving an AUC exceeding 0.98 and 0.99 for acceleration and audio signals respectively.
While audio-based classification consistently offers a better performance than acceleration-based classification, we observe that the difference is very small for the best systems.
Since the acceleration signal requires less processing power, and since the need to record audio is sidestepped and thus privacy is inherently secured, and since the recording device is attached to the bed and not worn, an accelerometer-based highly accurate non-invasive cough detector may represent a more convenient and readily accepted method in long-term cough monitoring.
Submitted 31 August, 2021;
originally announced September 2021.
-
COVID-19 Detection in Cough, Breath and Speech using Deep Transfer Learning and Bottleneck Features
Authors:
Madhurananda Pahar,
Marisa Klopper,
Robin Warren,
Thomas Niesler
Abstract:
We present an experimental investigation into the effectiveness of transfer learning and bottleneck feature extraction in detecting COVID-19 from audio recordings of cough, breath and speech.
This type of screening is non-contact, does not require specialist medical expertise or laboratory facilities and can be deployed on inexpensive consumer hardware.
We use datasets that contain recordings of coughing, sneezing, speech and other noises, but do not contain COVID-19 labels, to pre-train three deep neural networks: a CNN, an LSTM and a Resnet50.
These pre-trained networks are subsequently either fine-tuned using smaller datasets of coughing with COVID-19 labels in the process of transfer learning, or are used as bottleneck feature extractors.
Results show that a Resnet50 classifier trained by this transfer learning process delivers optimal or near-optimal performance across all datasets, achieving areas under the receiver operating characteristic curve (ROC AUC) of 0.98, 0.94 and 0.92 for the three sound classes (coughs, breaths and speech) respectively.
This indicates that coughs carry the strongest COVID-19 signature, followed by breath and speech.
Our results also show that applying transfer learning and extracting bottleneck features using the larger datasets without COVID-19 labels not only improved performance, but also reduced the standard deviation of the classifier AUCs among the outer folds of the leave-p-out cross-validation, indicating better generalisation.
We conclude that deep transfer learning and bottleneck feature extraction can improve COVID-19 cough, breath and speech audio classification, yielding automatic classifiers with higher accuracy.
Submitted 17 August, 2021; v1 submitted 2 April, 2021;
originally announced April 2021.
-
Automatic Cough Classification for Tuberculosis Screening in a Real-World Environment
Authors:
Madhurananda Pahar,
Marisa Klopper,
Byron Reeve,
Grant Theron,
Rob Warren,
Thomas Niesler
Abstract:
Objective: The automatic discrimination between the coughing sounds produced by patients with tuberculosis (TB) and those produced by patients with other lung ailments.
Approach: We present experiments based on a dataset of 1358 forced cough recordings obtained in a developing-world clinic from 16 patients with confirmed active pulmonary TB and 35 patients suffering from respiratory conditions suggestive of TB but confirmed to be TB negative. Using nested cross-validation, we have trained and evaluated five machine learning classifiers: logistic regression (LR), support vector machines (SVM), k-nearest neighbour (KNN), multilayer perceptrons (MLP) and convolutional neural networks (CNN).
Main Results: Although classification is possible in all cases, the best performance is achieved using LR. In combination with feature selection by sequential forward selection (SFS), our best LR system achieves an area under the ROC curve (AUC) of 0.94 using 23 features selected from a set of 78 high-resolution mel-frequency cepstral coefficients (MFCCs). This system achieves a sensitivity of 93% at a specificity of 95% and thus exceeds the 90% sensitivity at 70% specificity specification considered by the World Health Organisation (WHO) as a minimal requirement for a community-based TB triage test.
Significance: The automatic classification of cough audio sounds, when applied to symptomatic patients requiring investigation for TB, can meet the WHO triage specifications for the identification of patients who should undergo expensive molecular downstream testing. This makes it a promising and viable means of low cost, easily deployable frontline screening for TB, which can benefit especially developing countries with a heavy TB burden.
Submitted 17 October, 2021; v1 submitted 23 March, 2021;
originally announced March 2021.
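Illustrative note (not from the paper): the abstract above compares an operating point of 93% sensitivity at 95% specificity with the 90% sensitivity at 70% specificity triage specification. The sketch below shows how such a check could be expressed. The function names and the confusion-matrix counts are hypothetical and chosen only to reproduce the quoted percentages; they are not the study's data.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Compute (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of positive cases detected
    specificity = tn / (tn + fp)  # fraction of negative cases correctly cleared
    return sensitivity, specificity


def meets_triage_spec(sensitivity, specificity,
                      min_sensitivity=0.90, min_specificity=0.70):
    """Check an operating point against the thresholds quoted in the abstract."""
    return sensitivity >= min_sensitivity and specificity >= min_specificity


# Hypothetical counts that yield the operating point reported in the abstract.
sens, spec = sensitivity_specificity(tp=93, fn=7, tn=95, fp=5)
ok = meets_triage_spec(sens, spec)
```

Both thresholds must hold at the same operating point; a classifier that reaches 90% sensitivity only by letting specificity fall below 70% would not satisfy the check.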
-
Deep Neural Network based Cough Detection using Bed-mounted Accelerometer Measurements
Authors:
Madhurananda Pahar,
Igor Miranda,
Andreas Diacon,
Thomas Niesler
Abstract:
We have performed cough detection based on measurements from an accelerometer attached to the patient's bed. This form of monitoring is less intrusive than body-attached accelerometer sensors, and sidesteps privacy concerns encountered when using audio for cough detection. For our experiments, we have compiled a manually-annotated dataset containing the acceleration signals of approximately 6000 cough and 68000 non-cough events from 14 adult male patients in a tuberculosis clinic. As classifiers, we have considered convolutional neural networks (CNN), long short-term memory (LSTM) networks, and a residual neural network (Resnet50). We find that all classifiers are able to distinguish between the acceleration signals due to coughing and those due to other activities including sneezing, throat-clearing and movement in the bed with high accuracy. The Resnet50 performs the best, achieving an area under the ROC curve (AUC) exceeding 0.98 in cross-validation experiments. We conclude that high-accuracy cough monitoring based only on measurements from the accelerometer in a consumer smartphone is possible. Since the need to gather audio is avoided and therefore privacy is inherently protected, and since the accelerometer is attached to the bed and not worn, this form of monitoring may represent a more convenient and readily accepted method of long-term patient cough monitoring.
Submitted 9 February, 2021;
originally announced February 2021.
-
COVID-19 Cough Classification using Machine Learning and Global Smartphone Recordings
Authors:
Madhurananda Pahar,
Marisa Klopper,
Robin Warren,
Thomas Niesler
Abstract:
We present a machine learning based COVID-19 cough classifier which can discriminate COVID-19 positive coughs from both COVID-19 negative and healthy coughs recorded on a smartphone. This type of screening is non-contact, easy to apply, and can reduce the workload in testing centres as well as limit transmission by recommending early self-isolation to those who have a cough suggestive of COVID-19. The datasets used in this study include subjects from all six continents and contain both forced and natural coughs, indicating that the approach is widely applicable. The publicly available Coswara dataset contains 92 COVID-19 positive and 1079 healthy subjects, while the second smaller dataset was collected mostly in South Africa and contains 18 COVID-19 positive and 26 COVID-19 negative subjects who have undergone a SARS-CoV laboratory test. Both datasets indicate that COVID-19 positive coughs are 15%-20% shorter than non-COVID coughs. Dataset skew was addressed by applying the synthetic minority oversampling technique (SMOTE). A leave-p-out cross-validation scheme was used to train and evaluate seven machine learning classifiers: LR, KNN, SVM, MLP, CNN, LSTM and Resnet50. Our results show that although all classifiers were able to identify COVID-19 coughs, the best performance was exhibited by the Resnet50 classifier, which was best able to discriminate between the COVID-19 positive and the healthy coughs with an area under the ROC curve (AUC) of 0.98. An LSTM classifier was best able to discriminate between the COVID-19 positive and COVID-19 negative coughs, with an AUC of 0.94 after selecting the best 13 features from a sequential forward selection (SFS). Since this type of cough audio classification is cost-effective and easy to deploy, it is potentially a useful and viable means of non-contact COVID-19 screening.
Submitted 14 June, 2021; v1 submitted 2 December, 2020;
originally announced December 2020.
-
Multilingual Bottleneck Features for Improving ASR Performance of Code-Switched Speech in Under-Resourced Languages
Authors:
Trideba Padhi,
Astik Biswas,
Febe De Wet,
Ewald van der Westhuizen,
Thomas Niesler
Abstract:
In this work, we explore the benefits of using multilingual bottleneck features (mBNF) in acoustic modelling for the automatic speech recognition of code-switched (CS) speech in African languages. The unavailability of annotated corpora in the languages of interest has always been a primary challenge when developing speech recognition systems for this severely under-resourced type of speech. Hence, it is worthwhile to investigate the potential of using speech corpora available for other better-resourced languages to improve speech recognition performance. To achieve this, we train an mBNF extractor using nine Southern Bantu languages that form part of the freely available multilingual NCHLT corpus. We append these mBNFs to the existing MFCCs, pitch features and i-vectors to train acoustic models for automatic speech recognition (ASR) in the target code-switched languages. Our results show that the inclusion of the mBNF features leads to clear performance improvements over a baseline trained without the mBNFs for code-switched English-isiZulu, English-isiXhosa, English-Sesotho and English-Setswana speech.
Submitted 31 October, 2020;
originally announced November 2020.
-
Semi-supervised acoustic modelling for five-lingual code-switched ASR using automatically-segmented soap opera speech
Authors:
N. Wilkinson,
A. Biswas,
E. Yılmaz,
F. de Wet,
E. van der Westhuizen,
T. R. Niesler
Abstract:
This paper considers the impact of automatic segmentation on the fully-automatic, semi-supervised training of automatic speech recognition (ASR) systems for five-lingual code-switched (CS) speech. Four automatic segmentation techniques were evaluated in terms of the recognition performance of an ASR system trained on the resulting segments in a semi-supervised manner. The system's output was compared with the recognition rates achieved by a semi-supervised system trained on manually assigned segments. Three of the automatic techniques use a newly proposed convolutional neural network (CNN) model for framewise classification, and include a novel form of HMM smoothing of the CNN outputs. Automatic segmentation was applied in combination with automatic speaker diarization. The best-performing segmentation technique was also tested without speaker diarization. An evaluation based on 248 unsegmented soap opera episodes indicated that voice activity detection (VAD) based on a CNN followed by Gaussian mixture model-hidden Markov model smoothing (CNN-GMM-HMM) yields the best ASR performance. The semi-supervised system trained with the resulting segments achieved an overall WER improvement of 1.1% absolute over the system trained with manually created segments. Furthermore, we found that system performance improved even further when the automatic segmentation was used in conjunction with speaker diarization.
Submitted 8 April, 2020;
originally announced April 2020.
-
Semi-supervised acoustic and language model training for English-isiZulu code-switched speech recognition
Authors:
A. Biswas,
F. de Wet,
E. van der Westhuizen,
T. R. Niesler
Abstract:
We present an analysis of semi-supervised acoustic and language model training for English-isiZulu code-switched ASR using soap opera speech. Approximately 11 hours of untranscribed multilingual speech was transcribed automatically using four bilingual code-switching transcription systems operating in English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho. These transcriptions were incorporated into the acoustic and language model training sets. Results showed that the TDNN-F acoustic models benefit from the additional semi-supervised data and that even better performance could be achieved by including additional CNN layers. Using these CNN-TDNN-F acoustic models, a first iteration of semi-supervised training achieved an absolute mixed-language WER reduction of 3.4%, and a further 2.2% after a second iteration. Although the languages in the untranscribed data were unknown, the best results were obtained when all automatically transcribed data was used for training and not just the utterances classified as English-isiZulu. Despite reducing perplexity, the semi-supervised language model was not able to improve the ASR performance.
Submitted 5 April, 2020;
originally announced April 2020.
-
Semi-supervised Development of ASR Systems for Multilingual Code-switched Speech in Under-resourced Languages
Authors:
Astik Biswas,
Emre Yılmaz,
Febe de Wet,
Ewald van der Westhuizen,
Thomas Niesler
Abstract:
This paper reports on the semi-supervised development of acoustic and language models for under-resourced, code-switched speech in five South African languages. Two approaches are considered. The first constructs four separate bilingual automatic speech recognisers (ASRs) corresponding to four different language pairs between which speakers switch frequently. The second uses a single, unified, five-lingual ASR system that represents all the languages (English, isiZulu, isiXhosa, Setswana and Sesotho). We evaluate the effectiveness of these two approaches when used to add additional data to our extremely sparse training sets. Results indicate that batch-wise semi-supervised training yields better results than a non-batch-wise approach. Furthermore, while the separate bilingual systems achieved better recognition performance than the unified system, they benefited more from pseudo-labels generated by the five-lingual system than from those generated by the bilingual systems.
Submitted 6 March, 2020;
originally announced March 2020.
-
Improved low-resource Somali speech recognition by semi-supervised acoustic and language model training
Authors:
Astik Biswas,
Raghav Menon,
Ewald van der Westhuizen,
Thomas Niesler
Abstract:
We present improvements in automatic speech recognition (ASR) for Somali, a currently extremely under-resourced language. This forms part of a continuing United Nations (UN) effort to employ ASR-based keyword spotting systems to support humanitarian relief programmes in rural Africa. Using just 1.57 hours of annotated speech data as a seed corpus, we increase the pool of training data by applying semi-supervised training to 17.55 hours of untranscribed speech. We make use of factorised time-delay neural networks (TDNN-F) for acoustic modelling, since these have recently been shown to be effective in resource-scarce situations. Three semi-supervised training passes were performed, where the decoded output from each pass was used for acoustic model training in the subsequent pass. The automatic transcriptions from the best performing pass were used for language model augmentation. To ensure the quality of automatic transcriptions, decoder confidence is used as a threshold. The acoustic and language models obtained from the semi-supervised approach show significant improvement in terms of WER and perplexity compared to the baseline. Incorporating the automatically generated transcriptions yields a 6.55% improvement in language model perplexity. The use of 17.55 hours of Somali acoustic data in semi-supervised training shows an improvement of 7.74% relative over the baseline.
Submitted 5 July, 2019;
originally announced July 2019.
-
Semi-supervised acoustic model training for five-lingual code-switched ASR
Authors:
Astik Biswas,
Emre Yılmaz,
Febe de Wet,
Ewald van der Westhuizen,
Thomas Niesler
Abstract:
This paper presents recent progress in the acoustic modelling of under-resourced code-switched (CS) speech in multiple South African languages. We consider two approaches. The first constructs separate bilingual acoustic models corresponding to language pairs (English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho). The second constructs a single unified five-lingual acoustic model representing all the languages (English, isiZulu, isiXhosa, Setswana and Sesotho). For these two approaches we consider the effectiveness of semi-supervised training to increase the size of the very sparse acoustic training sets. Using approximately 11 hours of untranscribed speech, we show that both approaches benefit from semi-supervised training. The bilingual TDNN-F acoustic models also benefit from the addition of CNN layers (CNN-TDNN-F), while the five-lingual system does not show any significant improvement. Furthermore, because English is common to all language pairs in our data, it dominates when training a unified language model, leading to improved English ASR performance at the expense of the other languages. Nevertheless, the five-lingual model offers flexibility because it can process more than two languages simultaneously, and is therefore an attractive option as an automatic transcription system in a semi-supervised training pipeline.
Submitted 15 October, 2019; v1 submitted 20 June, 2019;
originally announced June 2019.
-
Feature exploration for almost zero-resource ASR-free keyword spotting using a multilingual bottleneck extractor and correspondence autoencoders
Authors:
Raghav Menon,
Herman Kamper,
Ewald van der Westhuizen,
John Quinn,
Thomas Niesler
Abstract:
We compare features for dynamic time warping (DTW) when used to bootstrap keyword spotting (KWS) in an almost zero-resource setting. Such quickly-deployable systems aim to support United Nations (UN) humanitarian relief efforts in parts of Africa with severely under-resourced languages. Our objective is to identify acoustic features that provide acceptable KWS performance in such environments. As a supervised resource, we restrict ourselves to a small, easily acquired and independently compiled set of isolated keywords. For feature extraction, a multilingual bottleneck feature (BNF) extractor, trained on well-resourced out-of-domain languages, is integrated with a correspondence autoencoder (CAE) trained on extremely sparse in-domain data. On their own, BNFs and CAE features are shown to achieve a more than 2% absolute performance improvement over baseline MFCCs. However, by using BNFs as input to the CAE, even better performance is achieved, with a more than 11% absolute improvement in ROC AUC over MFCCs and more than twice as many top-10 retrievals for two evaluated languages, English and Luganda. We conclude that integrating BNFs with the CAE allows both large out-of-domain and sparse in-domain resources to be exploited for improved ASR-free keyword spotting.
Submitted 12 July, 2019; v1 submitted 14 November, 2018;
originally announced November 2018.
-
Direction of Arrival Estimation of Wide-band Signals with Planar Microphone Arrays
Authors:
Rudolf Byker,
Thomas Niesler
Abstract:
An approach to the estimation of the Direction of Arrival (DOA) of wide-band signals with a planar microphone array is presented. Our algorithm estimates an unambiguous DOA using a single planar array in which the microphones are placed fairly close together and the sound source is expected to be in the far field. The algorithm uses the ambiguous DOA estimates obtained from microphone pairs in the array to determine an unambiguous DOA estimate for the array as a whole. The required pair-wise DOAs may be calculated using Time Delay Estimations (TDEs), which may in turn be calculated using cross-correlation, making the algorithm suitable for wide-band signals. No a priori knowledge of the true Sound Source Location (SSL) is required. Simulations show that the algorithm is robust against noise in the input data. An average ratio of approximately 3:1 exists between the input DOA errors and the output DOA error. Field tests with a moving sound source provided DOA estimates with standard deviations between 20.4 and 15.2 degrees.
Submitted 16 November, 2018;
originally announced November 2018.
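The pair-wise step mentioned in the abstract above (a time delay estimated by cross-correlation, then converted to a DOA for one microphone pair) can be illustrated with a minimal sketch. This is not the authors' algorithm: it only shows the standard far-field relation cos(theta) = c * tau / d for a single pair, and why a single pair yields an ambiguous (mirrored) angle. The function names, the speed of sound value, and the brute-force lag search are assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def best_lag(x, y, max_lag):
    """Return the lag (in samples) that maximises sum_n x[n] * y[n + lag].

    A brute-force cross-correlation search, used here as a simple
    time delay estimator (hypothetical helper, not from the paper).
    """
    best, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = sum(
            x[n] * y[n + lag]
            for n in range(len(x))
            if 0 <= n + lag < len(y)
        )
        if val > best_val:
            best, best_val = lag, val
    return best


def pair_doa(delay_s, spacing_m):
    """Far-field DOA candidates (radians) for one microphone pair.

    The angle is measured from the axis joining the two microphones.
    A single pair cannot tell +theta from -theta, so both are returned.
    """
    cos_theta = SPEED_OF_SOUND * delay_s / spacing_m
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against noisy delays
    theta = math.acos(cos_theta)
    return theta, -theta


# Usage: a pulse that reaches the second microphone two samples later.
x = [0.0, 1.0, 0.0, 0.0, 0.0]
y = [0.0, 0.0, 0.0, 1.0, 0.0]
lag = best_lag(x, y, max_lag=3)
candidates = pair_doa(lag / 48000.0, spacing_m=0.05)  # 48 kHz sampling assumed
```

Resolving which of the mirrored candidates is correct requires combining several pairs, which is the part the paper addresses and this sketch does not attempt.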
-
Cluster Size Management in Multi-Stage Agglomerative Hierarchical Clustering of Acoustic Speech Segments
Authors:
Lerato Lerato,
Thomas Niesler
Abstract:
Agglomerative hierarchical clustering (AHC) requires only the similarity between objects to be known. This is attractive when clustering signals of varying length, such as speech, which are not readily represented in fixed-dimensional vector space. However, AHC is characterised by $O(N^2)$ space and time complexity, making it infeasible for partitioning large datasets. This has recently been addressed by an approach based on the iterative re-clustering of independent subsets of the larger dataset. We show that, due to its iterative nature, this procedure can sometimes lead to unchecked growth of individual subsets, thereby compromising its effectiveness. We propose the integration of a simple space management strategy into the iterative process, and show experimentally that this leads to no loss in performance in terms of F-measure while guaranteeing that a threshold space complexity is not breached.
Submitted 30 October, 2018;
originally announced October 2018.
-
Feature Trajectory Dynamic Time Warping for Clustering of Speech Segments
Authors:
Lerato Lerato,
Thomas Niesler
Abstract:
Dynamic time warping (DTW) can be used to compute the similarity between two sequences of generally differing length. We propose a modification to DTW that performs individual and independent pairwise alignment of feature trajectories. The modified technique, termed feature trajectory dynamic time warping (FTDTW), is applied as a similarity measure in the agglomerative hierarchical clustering of speech segments. Experiments using MFCC and PLP parametrisations extracted from TIMIT and from the Spoken Arabic Digit Dataset (SADD) show consistent and statistically significant improvements in the quality of the resulting clusters in terms of F-measure and normalised mutual information (NMI).
Submitted 30 October, 2018;
originally announced October 2018.
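The contrast between classical DTW and the per-trajectory alignment described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: classical DTW warps whole feature vectors along one shared path, while the second function aligns each feature dimension on its own. The function names, the absolute-difference local cost, and the choice to sum the per-dimension costs are assumptions; the paper may combine or normalise them differently.

```python
def dtw_cost(a, b, dist):
    """Classical DTW: minimum cumulative cost of aligning sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] holds the best cost of aligning a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            D[i][j] = local + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal dimension."""
    return sum((p - q) ** 2 for p, q in zip(u, v)) ** 0.5


def ftdtw_cost(a, b):
    """Align every feature trajectory independently with 1-D DTW.

    The per-dimension costs are summed here; that combination rule is
    an assumption made for this sketch.
    """
    dims = len(a[0])
    total = 0.0
    for d in range(dims):
        traj_a = [frame[d] for frame in a]
        traj_b = [frame[d] for frame in b]
        total += dtw_cost(traj_a, traj_b, lambda p, q: abs(p - q))
    return total


# Usage: two segments whose feature dimensions change at different times.
seg_a = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
seg_b = [[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
joint = dtw_cost(seg_a, seg_b, euclidean)   # one warping path for all dimensions
per_traj = ftdtw_cost(seg_a, seg_b)         # one warping path per dimension
```

In this toy case the per-trajectory cost is lower than the joint cost, because each dimension is free to warp independently of the other.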
-
Building a Unified Code-Switching ASR System for South African Languages
Authors:
Emre Yılmaz,
Astik Biswas,
Ewald van der Westhuizen,
Febe de Wet,
Thomas Niesler
Abstract:
We present our first efforts towards building a single multilingual automatic speech recognition (ASR) system that can process code-switching (CS) speech in five languages spoken within the same population. This contrasts with related prior work which focuses on the recognition of CS speech in bilingual scenarios. Recently, we have compiled a small five-language corpus of South African soap opera speech which contains examples of CS between 5 languages occurring in various contexts such as using English as the matrix language and switching to other indigenous languages. The ASR system presented in this work is trained on 4 corpora containing English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho CS speech. The interpolation of multiple language models trained on these language pairs enables the ASR system to hypothesize mixed word sequences from these 5 languages. We evaluate various state-of-the-art acoustic models trained on this 5-lingual training data and report ASR accuracy and language recognition performance on the development and test sets of the South African multilingual soap opera corpus.
Submitted 28 July, 2018;
originally announced July 2018.
-
Automatic Speech Recognition for Humanitarian Applications in Somali
Authors:
Raghav Menon,
Astik Biswas,
Armin Saeb,
John Quinn,
Thomas Niesler
Abstract:
We present our first efforts in building an automatic speech recognition system for Somali, an under-resourced language, using 1.57 hrs of annotated speech for acoustic model training. The system is part of an ongoing effort by the United Nations (UN) to implement keyword spotting systems supporting humanitarian relief programmes in parts of Africa where languages are severely under-resourced. We evaluate several types of acoustic model, including recent neural architectures. Language model data augmentation using a combination of recurrent neural networks (RNN) and long short-term memory neural networks (LSTMs) as well as the perturbation of acoustic data are also considered. We find that both types of data augmentation are beneficial to performance, with our best system using a combination of convolutional neural networks (CNNs), time-delay neural networks (TDNNs) and bi-directional long short term memory (BLSTMs) to achieve a word error rate of 53.75%.
Submitted 23 July, 2018;
originally announced July 2018.
-
ASR-free CNN-DTW keyword spotting using multilingual bottleneck features for almost zero-resource languages
Authors:
Raghav Menon,
Herman Kamper,
Emre Yilmaz,
John Quinn,
Thomas Niesler
Abstract:
We consider multilingual bottleneck features (BNFs) for nearly zero-resource keyword spotting. This forms part of a United Nations effort using keyword spotting to support humanitarian relief programmes in parts of Africa where languages are severely under-resourced. We use 1920 isolated keywords (40 types, 34 minutes) as exemplars for dynamic time warping (DTW) template matching, which is performed on a much larger body of untranscribed speech. These DTW costs are used as targets for a convolutional neural network (CNN) keyword spotter, giving a much faster system than direct DTW. Here we consider how available data from well-resourced languages can improve this CNN-DTW approach. We show that multilingual BNFs trained on ten languages improve the area under the ROC curve of a CNN-DTW system by 10.9% absolute relative to the MFCC baseline. By combining low-resource DTW-based supervision with information from well-resourced languages, CNN-DTW is a competitive option for low-resource keyword spotting.
Submitted 23 July, 2018;
originally announced July 2018.
-
Fast ASR-free and almost zero-resource keyword spotting using DTW and CNNs for humanitarian monitoring
Authors:
Raghav Menon,
Herman Kamper,
John Quinn,
Thomas Niesler
Abstract:
We use dynamic time warping (DTW) as supervision for training a convolutional neural network (CNN) based keyword spotting system using a small set of spoken isolated keywords. The aim is to allow rapid deployment of a keyword spotting system in a new language to support urgent United Nations (UN) relief programmes in parts of Africa where languages are extremely under-resourced and the development of annotated speech resources is infeasible. First, we use 1920 recorded keywords (40 keyword types, 34 minutes of speech) as exemplars in a DTW-based template matching system and apply it to untranscribed broadcast speech. Then, we use the resulting DTW scores as targets to train a CNN on the same unlabelled speech. In this way we use just 34 minutes of labelled speech, but leverage a large amount of unlabelled data for training. While the resulting CNN keyword spotter cannot match the performance of the DTW-based system, it substantially outperforms a CNN classifier trained only on the keywords, improving the area under the ROC curve from 0.54 to 0.64. Because our CNN system is several orders of magnitude faster at runtime than the DTW system, it represents the most viable keyword spotter on this extremely limited dataset.
Submitted 25 June, 2018;
originally announced June 2018.