Reducing the Sampling Burden of Fourier Sensing with a Non-rectangular Field-of-View

Authors: Nicholas Dwork, Erin K. Englund, Alex J. Barker

Abstract: With Fourier sensing, it is commonly the case that the field-of-view (FOV), the area of space to be imaged, is known prior to reconstruction. To date, reconstruction algorithms have focused on FOVs with simple geometries: a rectangle or a hexagon. This yields sampling patterns that are more burdensome than necessary. Due to the reduced area of imaging possible with an arbitrary (e.g., non-rectangular) FOV, the number of samples required for a high-quality image is reduced. However, when an arbitrary FOV has been considered, the reconstruction algorithm is computationally expensive. In this manuscript, we present a method to reduce the sampling pattern for an arbitrary FOV with an accompanying direct (non-iterative) reconstruction algorithm. We also present a method to decrease the computational cost of the (iterative) model-based reconstruction (MBR) algorithm. We present results using MRI data of an ankle, a pineapple, and a brain.

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2405.05980 [pdf]

Overcoming challenges of translating deep-learning models for glioblastoma: the ZGBM consortium

Authors: Haris Shuaib, Gareth J Barker, Peter Sasieni, Enrico De Vita, Alysha Chelliah, Roman Andrei, Keyoumars Ashkan, Erica Beaumont, Lucy Brazil, Chris Rowland-Hill, Yue Hui Lau, Aysha Luis, James Powell, Angela Swampillai, Sean Tenant, Stefanie C Thust, Stephen Wastling, Tom Young, Thomas C Booth

Abstract: Objective: To report imaging protocol and scheduling variance in routine care of glioblastoma patients in order to demonstrate challenges of integrating deep-learning models in glioblastoma care pathways. Additionally, to understand the most common imaging studies and image contrasts to inform the development of potentially robust deep-learning models. Methods: MR imaging data were analysed from a random sample of five patients from the prospective cohort across five participating sites of the ZGBM consortium. Reported clinical and treatment data alongside DICOM header information were analysed to understand treatment pathway imaging schedules. Results: All sites perform all structural imaging at every stage in the pathway except for the presurgical study, where in some sites only contrast-enhanced T1-weighted imaging is performed. Diffusion MRI is the most common non-structural imaging type, performed at every site. Conclusion: The imaging protocol and scheduling varies across the UK, making it challenging to develop machine-learning models that could perform robustly at other centres. Structural imaging is performed most consistently across all centres. Advances in knowledge: Successful translation of deep-learning models will likely be based on structural post-treatment imaging unless there is significant effort made to standardise non-structural or peri-operative imaging protocols and schedules.

Submitted 7 May, 2024; originally announced May 2024.

arXiv:2402.01413 [pdf, other]

Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge

Authors: Simon Leglaive, Matthieu Fraticelli, Hend ElGhazaly, Léonie Borne, Mostafa Sadeghi, Scott Wisdom, Manuel Pariente, John R. Hershey, Daniel Pressnitzer, Jon P. Barker

Abstract: Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals. However, the synthetic training conditions may not accurately reflect real-world conditions encountered during testing. This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain. To tackle this issue, the UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models. Specifically, this test domain corresponds to the CHiME-5 dataset, characterized by real multi-speaker and conversational speech recordings made in noisy and reverberant domestic environments, for which ground-truth clean speech signals are not available. In this paper, we present the objective and subjective evaluations of the systems that were submitted to the CHiME-7 UDASE task, and we provide an analysis of the results. This analysis reveals a limited correlation between subjective ratings and several supervised nonintrusive performance metrics recently proposed for speech enhancement. Conversely, the results suggest that more traditional intrusive objective metrics can be used for in-domain performance evaluation using the reverberant LibriCHiME-5 dataset developed for the challenge. The subjective evaluation indicates that all systems successfully reduced the background noise, but always at the expense of increased distortion. Out of the four speech enhancement methods evaluated subjectively, only one demonstrated an improvement in overall quality compared to the unprocessed noisy speech, highlighting the difficulty of the task. The tools and audio material created for the CHiME-7 UDASE task are shared with the community.

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.13611 [pdf, ps, other]

Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired Users using Intermediate ASR Features and Human Memory Models

Authors: Rhiannon Mogridge, George Close, Robert Sutherland, Thomas Hain, Jon Barker, Stefan Goetze, Anton Ragni

Abstract: Neural networks have been successfully used for non-intrusive speech intelligibility prediction. Recently, the use of feature representations sourced from intermediate layers of pre-trained self-supervised and weakly-supervised models has been found to be particularly useful for this task. This work combines the use of Whisper ASR decoder layer representations as neural network input features with an exemplar-based, psychologically motivated model of human memory to predict human intelligibility ratings for hearing-aid users. Substantial performance improvement over an established intrusive HASPI baseline system is found, including on enhancement systems and listeners unseen in the training data, with a root mean squared error of 25.3 compared with the baseline of 28.7.

Submitted 24 January, 2024; originally announced January 2024.

Comments: Accepted paper. IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Seoul, Korea, April 2024

arXiv:2311.14490 [pdf, other]

Overview of the 2023 ICASSP SP Clarity Challenge: Speech Enhancement for Hearing Aids

Authors: Trevor J. Cox, Jon Barker, Will Bailey, Simone Graetzer, Michael A. Akeroyd, John F. Culling, Graham Naylor

Abstract: This paper reports on the design and outcomes of the ICASSP SP Clarity Challenge: Speech Enhancement for Hearing Aids. The scenario was a listener attending to a target speaker in a noisy, domestic environment. There were multiple interferers and head rotation by the listener. The challenge extended the second Clarity Enhancement Challenge (CEC2) by fixing the amplification stage of the hearing aid; evaluating with a combined metric for speech intelligibility and quality; and providing two evaluation sets, one based on simulation and the other on real-room measurements. Five teams improved on the baseline system for the simulated evaluation set, but the performance on the measured evaluation set was much poorer. Investigations are on-going to determine the exact cause of the mismatch between the simulated and measured data sets. The presence of transducer noise in the measurements, lower order Ambisonics harming the ability for systems to exploit binaural cues and the differences between real and simulated room impulse responses are suggested causes.

Submitted 24 November, 2023; originally announced November 2023.

Comments: ICASSP 2023

arXiv:2310.19817 [pdf, other]

Intelligibility prediction with a pretrained noise-robust automatic speech recognition model

Authors: Zehai Tu, Ning Ma, Jon Barker

Abstract: This paper describes two intelligibility prediction systems derived from a pretrained noise-robust automatic speech recognition (ASR) model for the second Clarity Prediction Challenge (CPC2). One system is intrusive and leverages the hidden representations of the ASR model. The other system is non-intrusive and makes predictions with derived ASR uncertainty. The ASR model is only pretrained with a simulated noisy speech corpus and does not take advantage of the CPC2 data. For that reason, the intelligibility prediction systems are robust to unseen scenarios given the accurate prediction performance on the CPC2 evaluation.

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2310.05799 [pdf, other]

The First Cadenza Signal Processing Challenge: Improving Music for Those With a Hearing Loss

Authors: Gerardo Roa Dabike, Scott Bannister, Jennifer Firth, Simone Graetzer, Rebecca Vos, Michael A. Akeroyd, Jon Barker, Trevor J. Cox, Bruno Fazenda, Alinka Greasley, William Whitmer

Abstract: The Cadenza project aims to improve the audio quality of music for those who have a hearing loss. This is being done through a series of signal processing challenges, to foster better and more inclusive technologies. In the first round, two common listening scenarios are considered: listening to music over headphones, and with a hearing aid in a car. The first scenario is cast as a demixing-remixing problem, where the music is decomposed into vocals, bass, drums and other components. These can then be intelligently remixed in a personalized way, to increase the audio quality for a person who has a hearing loss. In the second scenario, music is coming from car loudspeakers, and the music has to be enhanced to overcome the masking effect of the car noise. This is done by taking into account the music, the hearing ability of the listener, the hearing aid and the speed of the car. The audio quality of the submissions will be evaluated using the Hearing Aid Audio Quality Index (HAAQI) for objective assessment and by a panel of people with hearing loss for subjective evaluation.

Submitted 9 October, 2023; originally announced October 2023.

arXiv:2310.03480 [pdf, other]

The ICASSP SP Cadenza Challenge: Music Demixing/Remixing for Hearing Aids

Authors: Gerardo Roa Dabike, Michael A. Akeroyd, Scott Bannister, Jon Barker, Trevor J. Cox, Bruno Fazenda, Jennifer Firth, Simone Graetzer, Alinka Greasley, Rebecca R. Vos, William M. Whitmer

Abstract: This paper reports on the design and results of the 2024 ICASSP SP Cadenza Challenge: Music Demixing/Remixing for Hearing Aids. The Cadenza project is working to enhance the audio quality of music for those with a hearing loss. The scenario for the challenge was listening to stereo reproduction over loudspeakers via hearing aids. The task was to: decompose pop/rock music into vocal, drums, bass and other (VDBO); rebalance the different tracks with specified gains and then remix back to stereo. End-to-end approaches were also accepted. 17 systems were submitted by 11 teams. Causal systems performed more poorly than non-causal approaches. 9 systems beat the baseline. A common approach was to fine-tune pretrained demixing models. The best approach used an ensemble of models.

Submitted 29 January, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

Comments: 2-page paper for ICASSP 2024 SP Grand Challenge

arXiv:2205.01751 [pdf, other]

doi 10.21437/Interspeech.2022-11359

On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training

Authors: Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker

Abstract: In this paper, we explore an improved framework to train a monoaural neural enhancement model for robust speech recognition. The designed training framework extends the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data. It is found that the unpaired clean speech is crucial to improve quality of separated speech from real noisy speech. The proposed method also performs remixing of processed and unprocessed signals to alleviate the processing artifacts. Experiments on the single-channel CHiME-3 real test sets show that the proposed method improves significantly in terms of speech recognition performance over the enhancement system trained either on the mismatched simulated data in a supervised fashion or on the matched real data in an unsupervised fashion. Between 16% and 39% relative WER reduction has been achieved by the proposed system compared to the unprocessed signal using end-to-end and hybrid acoustic models without retraining on distorted data.

Submitted 20 September, 2022; v1 submitted 3 May, 2022; originally announced May 2022.

Comments: Accepted to INTERSPEECH 2022

arXiv:2204.04288 [pdf, other]

Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction

Authors: Zehai Tu, Ning Ma, Jon Barker

Abstract: Non-intrusive intelligibility prediction is important for its application in realistic scenarios, where a clean reference signal is difficult to access. The construction of many non-intrusive predictors requires either ground truth intelligibility labels or clean reference signals for supervised learning. In this work, we leverage an unsupervised uncertainty estimation method for predicting speech intelligibility, which does not require intelligibility labels or reference signals to train the predictor. Our experiments demonstrate that the uncertainty from state-of-the-art end-to-end automatic speech recognition (ASR) models is highly correlated with speech intelligibility. The proposed method is evaluated on two databases and the results show that the unsupervised uncertainty measures of ASR models are more correlated with speech intelligibility from listening results than the predictions made by widely used intrusive methods.

Submitted 6 July, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: Accepted to INTERSPEECH 2022

arXiv:2204.04287 [pdf, other]

Exploiting Hidden Representations from a DNN-based Speech Recogniser for Speech Intelligibility Prediction in Hearing-impaired Listeners

Authors: Zehai Tu, Ning Ma, Jon Barker

Abstract: An accurate objective speech intelligibility prediction algorithm is of great interest for many applications such as speech enhancement for hearing aids. Most algorithms measure the signal-to-noise ratios or correlations between the acoustic features of clean reference signals and degraded signals. However, these hand-picked acoustic features are usually not explicitly correlated with recognition. Meanwhile, deep neural network (DNN) based automatic speech recogniser (ASR) is approaching human performance in some speech recognition tasks. This work leverages the hidden representations from DNN-based ASR as features for speech intelligibility prediction in hearing-impaired listeners. The experiments based on a hearing aid intelligibility database show that the proposed method could make better prediction than a widely used short-time objective intelligibility (STOI) based binaural measure.

Submitted 6 July, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: Accepted to INTERSPEECH 2022

arXiv:2204.04284 [pdf, other]

Auditory-Based Data Augmentation for End-to-End Automatic Speech Recognition

Authors: Zehai Tu, Jack Deadman, Ning Ma, Jon Barker

Abstract: End-to-end models have achieved significant improvement on automatic speech recognition. One common method to improve performance of these models is expanding the data-space through data augmentation. Meanwhile, human auditory inspired front-ends have also demonstrated improvement for automatic speech recognisers. In this work, a well-verified auditory-based model, which can simulate various hearing abilities, is investigated for the purpose of data augmentation for end-to-end speech recognition. By introducing the auditory model into the data augmentation process, end-to-end systems are encouraged to ignore variation from the signal that cannot be heard and thereby focus on robust features for speech recognition. Two mechanisms in the auditory model, spectral smearing and loudness recruitment, are studied on the LibriSpeech dataset with a transformer-based end-to-end model. The results show that the proposed augmentation methods can bring statistically significant improvement on the performance of the state-of-the-art SpecAugment.

Submitted 8 April, 2022; originally announced April 2022.

arXiv:2202.00011 [pdf, other]

Leveraging Bitstream Metadata for Fast, Accurate, Generalized Compressed Video Quality Enhancement

Authors: Max Ehrlich, Jon Barker, Namitha Padmanabhan, Larry Davis, Andrew Tao, Bryan Catanzaro, Abhinav Shrivastava

Abstract: Video compression is a central feature of the modern internet powering technologies from social media to video conferencing. While video compression continues to mature, for many compression settings, quality loss is still noticeable. These settings nevertheless have important applications to the efficient transmission of videos over bandwidth constrained or otherwise unstable connections. In this work, we develop a deep learning architecture capable of restoring detail to compressed videos which leverages the underlying structure and motion information embedded in the video bitstream. We show that this improves restoration accuracy compared to prior compression correction methods and is competitive when compared with recent deep-learning-based video compression methods on rate-distortion while achieving higher throughput. Furthermore, we condition our model on quantization data which is readily available in the bitstream. This allows our single model to handle a variety of different compression quality settings which required an ensemble of models in prior work.

Submitted 30 October, 2023; v1 submitted 31 January, 2022; originally announced February 2022.

Comments: WACV 2024

arXiv:2112.00556 [pdf, other]

Semi-Supervised Surface Anomaly Detection of Composite Wind Turbine Blades From Drone Imagery

Authors: Jack. W. Barker, Neelanjan Bhowmik, Toby. P. Breckon

Abstract: Within commercial wind energy generation, the monitoring and predictive maintenance of wind turbine blades in-situ is a crucial task, for which remote monitoring via aerial survey from an Unmanned Aerial Vehicle (UAV) is commonplace. Turbine blades are susceptible to both operational and weather-based damage over time, reducing the energy efficiency output of turbines. In this study, we address automating the otherwise time-consuming task of both blade detection and extraction, together with fault detection within UAV-captured turbine blade inspection imagery. We propose BladeNet, an application-based, robust dual architecture to perform both unsupervised turbine blade detection and extraction, followed by super-pixel generation using the Simple Linear Iterative Clustering (SLIC) method to produce regional clusters. These clusters are then processed by a suite of semi-supervised detection methods. Our dual architecture detects surface faults of glass fibre composite material blades with high aptitude while requiring minimal prior manual image annotation. BladeNet produces an Average Precision (AP) of 0.995 across our Ørsted blade inspection dataset for offshore wind turbines and 0.223 across the Danish Technical University (DTU) NordTank turbine blade inspection dataset. BladeNet also obtains an AUC of 0.639 for surface anomaly detection across the Ørsted blade inspection dataset.

Submitted 1 December, 2021; originally announced December 2021.

Comments: In-proceedings at 2022 17th International Conference on Computer Vision Theory and Applications (VISAPP)

arXiv:2106.07843 [pdf, other]

Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation

Authors: Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker

Abstract: In this paper, we introduce a novel semi-supervised learning framework for end-to-end speech separation. The proposed method first uses mixtures of unseparated sources and the mixture invariant training (MixIT) criterion to train a teacher model. The teacher model then estimates separated sources that are used to train a student model with standard permutation invariant training (PIT). The student model can be fine-tuned with supervised data, i.e., paired artificial mixtures and clean speech sources, and further improved via model distillation. Experiments with single- and multi-channel mixtures show that the teacher-student training resolves the over-separation problem observed in the original MixIT method. Further, the semi-supervised performance is comparable to a fully-supervised separation system trained using ten times the amount of supervised data.

Submitted 9 September, 2021; v1 submitted 14 June, 2021; originally announced June 2021.

Comments: Accepted to Interspeech 2021

arXiv:2106.04639 [pdf, other]

Optimising Hearing Aid Fittings for Speech in Noise with a Differentiable Hearing Loss Model

Authors: Zehai Tu, Ning Ma, Jon Barker

Abstract: Current hearing aids normally provide amplification based on a general prescriptive fitting, and the benefits provided by the hearing aids vary among different listening environments despite the inclusion of a noise suppression feature. Motivated by this fact, this paper proposes a data-driven machine learning technique to develop hearing aid fittings that are customised to speech in different noisy environments. A differentiable hearing loss model is proposed and used to optimise fittings with back-propagation. The customisation is reflected on the data of speech in different noise with also the consideration of noise suppression. The objective evaluation shows the advantages of optimised custom fittings over general prescriptive fittings.

Submitted 8 June, 2021; originally announced June 2021.

Comments: Accepted to Interspeech 2021

arXiv:2102.10376 [pdf, other]

The Use of Voice Source Features for Sung Speech Recognition

Authors: Gerardo Roa Dabike, Jon Barker

Abstract: In this paper, we ask whether vocal source features (pitch, shimmer, jitter, etc.) can improve the performance of automatic sung speech recognition, arguing that conclusions previously drawn from spoken speech studies may not be valid in the sung speech domain. We first use a parallel singing/speaking corpus (NUS-48E) to illustrate differences in sung vs spoken voicing characteristics including pitch range, syllable duration, vibrato, jitter and shimmer. We then use this analysis to inform speech recognition experiments on the sung speech DSing corpus, using a state-of-the-art acoustic model and augmenting conventional features with various voice source parameters. Experiments are run with three standard (increasingly large) training sets, DSing1 (15.1 hours), DSing3 (44.7 hours) and DSing30 (149.1 hours). Pitch combined with degree of voicing produces a significant decrease in WER from 38.1% to 36.7% when training with DSing1; however, smaller decreases in WER observed when training with the larger, more varied DSing3 and DSing30 sets were not seen to be statistically significant. Voicing quality characteristics did not improve recognition performance, although analysis suggests that they do contribute to an improved discrimination between voiced/unvoiced phoneme pairs.

Submitted 23 February, 2021; v1 submitted 20 February, 2021; originally announced February 2021.

Comments: Accepted to ICASSP 2021

arXiv:2102.03762 [pdf, other]

doi 10.1109/ICASSP39728.2021.9414092

Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism

Authors: Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker

Abstract: In this paper, we present a novel multi-channel speech extraction system to simultaneously extract multiple clean individual sources from a mixture in noisy and reverberant environments. The proposed method is built on an improved multi-channel time-domain speech separation network which employs speaker embeddings to identify and extract multiple targets without label permutation ambiguity. To efficiently inform the speaker information to the extraction model, we propose a new speaker conditioning mechanism by designing an additional speaker branch for receiving external speaker embeddings. Experiments on 2-channel WHAMR! data show that the proposed system improves by 9% relative the source separation performance over a strong multi-channel baseline, and it increases the speech recognition accuracy by more than 16% relative over the same baseline.

Submitted 7 February, 2021; originally announced February 2021.

Comments: Accepted for ICASSP 2021

MSC Class: 68T10

arXiv:2011.05958 [pdf, other]

doi 10.1109/ICASSP40776.2020.9053833

On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments

Authors: Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker

Abstract: This paper introduces a new method for multi-channel time domain speech separation in reverberant environments. A fully-convolutional neural network structure has been used to directly separate speech from multiple microphone recordings, with no need for conventional spatial feature extraction. To reduce the influence of reverberation on spatial feature extraction, a dereverberation pre-processing method has been applied to further improve the separation performance. A spatialized version of the wsj0-2mix dataset has been simulated to evaluate the proposed system. Both source separation and speech recognition performance of the separated signals have been evaluated objectively. Experiments show that the proposed fully-convolutional network improves the source separation metric and the word error rate (WER) by more than 13% and 50% relative, respectively, over a reference system with conventional features. Applying dereverberation as pre-processing to the proposed system can further reduce the WER by 29% relative using an acoustic model trained on clean and reverberated data.

Submitted 11 November, 2020; originally announced November 2020.

Comments: Presented at IEEE ICASSP 2020

MSC Class: 68T10

Journal ref: Proc. ICASSP (2020) 6389-6393

arXiv:2006.11140 [pdf, other]

Clarity: Machine Learning Challenges to Revolutionise Hearing Device Processing

Authors: Simone Graetzer, Michael Akeroyd, Jon P. Barker, Trevor J. Cox, John F. Culling, Graham Naylor, Eszter Porter, Rhoddy Viveros Muñoz

Abstract: In the Clarity project, we will run a series of machine learning challenges to revolutionise speech processing for hearing devices. Over five years, there will be three paired challenges. Each pair will consist of a competition focussed on hearing-device processing ("enhancement") and another focussed on speech perception modelling ("prediction"). The enhancement challenges will deliver new and improved approaches for hearing device signal processing for speech. The parallel prediction challenges will develop and improve methods for predicting speech intelligibility and quality for hearing impaired listeners. This Engineering and Physical Sciences Research Council (EPSRC) funded project involves researchers from the Universities of Sheffield, Salford, Nottingham and Cardiff in conjunction with the Hearing Industry Research Consortium, Action on Hearing Loss, Amazon, and Honda. To register interest in the challenges, go to www.claritychallenge.org.

Submitted 17 August, 2020; v1 submitted 19 June, 2020; originally announced June 2020.

Comments: 3 pages, 2 figures

arXiv:2004.09249 [pdf, other]

CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

Authors: Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant

Abstract: Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous CHiME-5 recordings except for accurate array synchronization. The material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.

Submitted 2 May, 2020; v1 submitted 20 April, 2020; originally announced April 2020.

arXiv:1911.08216 [pdf, other]

On the Impact of Object and Sub-component Level Segmentation Strategies for Supervised Anomaly Detection within X-ray Security Imagery

Authors: Neelanjan Bhowmik, Yona Falinie A. Gaus, Samet Akcay, Jack W. Barker, Toby P. Breckon

Abstract: X-ray security screening is in widespread use to maintain transportation security against a wide range of potential threat profiles. Of particular interest is the recent focus on the use of automated screening approaches, including the potential of anomaly detection as a methodology for concealment detection within complex electronic items. Here we address this problem considering varying segmentation strategies to enable the use of both object level and sub-component level anomaly detection via the use of secondary convolutional neural network (CNN) architectures. Relative performance is evaluated over an extensive dataset of exemplar cluttered X-ray imagery, with a focus on consumer electronics items. We find that sub-component level segmentation produces marginally superior performance in the secondary anomaly detection via classification stage, with a true positive rate of ~98% of anomalies and a ~3% false positive rate.

Submitted 19 November, 2019; originally announced November 2019.

arXiv:1808.00060 [pdf, other]

doi 10.21437/Interspeech.2018-2516

DNN driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation

Authors: Mandar Gogate, Ahsan Adeel, Ricard Marxer, Jon Barker, Amir Hussain

Abstract: The human auditory cortex excels at selectively suppressing background noise to focus on a target speaker. The process of selective attention in the brain is known to contextually exploit the available audio and visual cues to better focus on the target speaker while filtering out other noises. In this study, we propose a novel deep neural network (DNN) based audiovisual (AV) mask estimation model. The proposed AV mask estimation model contextually integrates the temporal dynamics of both audio and noise-immune visual features for improved mask estimation and speech separation. For optimal AV feature extraction and ideal binary mask (IBM) estimation, a hybrid DNN architecture is exploited to leverage the complementary strengths of a stacked long short term memory (LSTM) and convolution LSTM network. The comparative simulation results in terms of speech quality and intelligibility demonstrate significant performance improvement of our proposed AV mask estimation model as compared to audio-only and visual-only mask estimation approaches for both speaker dependent and independent scenarios.

Submitted 31 July, 2018; originally announced August 2018.

Comments: Accepted for Interspeech 2018, 5 pages, 4 figures

ACM Class: I.5; I.4; I.2

arXiv:1803.10609 [pdf, ps, other]

The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines

Authors: Jon Barker, Shinji Watanabe, Emmanuel Vincent, Jan Trmal

Abstract: The CHiME challenge series aims to advance robust automatic speech recognition (ASR) technology by promoting research at the interface of speech and language processing, signal processing, and machine learning. This paper introduces the 5th CHiME Challenge, which considers the task of distant multi-microphone conversational ASR in real home environments. Speech material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech and recorded by 6 Kinect microphone arrays and 4 binaural microphone pairs. The challenge features a single-array track and a multiple-array track and, for each track, distinct rankings will be produced for systems focusing on robustness with respect to distant-microphone capture vs. systems attempting to address all aspects of the task including conversational language modeling. We discuss the rationale for the challenge and provide a detailed description of the data collection procedure, the task, and the baseline systems for array synchronization, speech enhancement, and conventional and end-to-end ASR.

Submitted 28 March, 2018; originally announced March 2018.

Showing 1–24 of 24 results for author: Barker, J