Search | arXiv e-print repository

arXiv:2009.02940 [pdf, ps, other]

Deep Learning-Based Single-Ended Objective Quality Measures for Time-Scale Modified Audio

Authors: Timothy Roberts, Aaron Nicolson, Kuldip K. Paliwal

Abstract: Objective evaluation of audio processed with Time-Scale Modification (TSM) is seeing a resurgence of interest. Recently, a labelled time-scaled audio dataset was used to train an objective measure for TSM evaluation. This DE measure was an extension of Perceptual Evaluation of Audio Quality, and required reference and test signals. In this paper, two single-ended objective quality measures for time-scaled audio are proposed that do not require a reference signal. Data driven features are created by either a convolutional neural network (CNN) or a bidirectional gated recurrent unit (BGRU) network and fed to a fully-connected network to predict subjective mean opinion scores. The proposed CNN and BGRU measures achieve an average Root Mean Squared Error of 0.608 and 0.576, and a mean Pearson correlation of 0.771 and 0.794, respectively. The proposed measures are used to evaluate TSM algorithms, and comparisons are provided for 16 TSM implementations. The objective measure is available at https://www.github.com/zygurt/TSM.

Submitted 7 September, 2020; originally announced September 2020.

Comments: 13 pages, 11 figures, Submitted to The Journal of the Acoustical Society of America

arXiv:2006.06153 [pdf, ps, other]

doi 10.1121/10.0003753

An Objective Measure of Quality for Time-Scale Modification of Audio

Authors: Timothy Roberts, Kuldip K. Paliwal

Abstract: Objective evaluation of audio processed with Time-Scale Modification (TSM) remains an open problem. Recently, a dataset of time-scaled audio with subjective quality labels was published and used to create an initial objective measure of quality. In this paper, an improved objective measure of quality for time-scaled audio is proposed. The measure uses hand-crafted features and a fully connected network to predict subjective mean opinion scores. Basic and Advanced Perceptual Evaluation of Audio Quality features are used in addition to nine features specific to TSM artefacts. Six methods of alignment are explored, with interpolation of the reference magnitude spectrum to the length of the test magnitude spectrum giving the best performance. The proposed measure achieves a mean Root Mean Squared Error of 0.487 and a mean Pearson correlation of 0.865, equivalent to 98th and 82nd percentiles of subjective sessions respectively. The proposed measure is used to evaluate time-scale modification algorithms, finding that Elastique gives the highest objective quality for Solo instrument and voice signals, while the Identity Phase-Locking Phase Vocoder gives the highest objective quality for music signals and the best overall quality. The objective measure is available at https://www.github.com/zygurt/TSM.

Submitted 10 June, 2020; originally announced June 2020.

Comments: 12 pages, 7 figures, Submitted to The Journal of the Acoustical Society of America, Currently under review

arXiv:2006.00848 [pdf, ps, other]

doi 10.1121/10.0001567

A time-scale modification dataset with subjective quality labels

Authors: Timothy Roberts, Kuldip K. Paliwal

Abstract: Time Scale Modification (TSM) is a well-researched field; however, no effective objective measure of quality exists. This paper details the creation, subjective evaluation, and analysis of a dataset for use in the development of an objective measure of quality for TSM. Comprised of two parts, the training component contains 88 source files processed using six TSM methods at 10 time scales, while the testing component contains 20 source files processed using three additional methods at four time scales. The source material contains speech, solo harmonic and percussive instruments, sound effects, and a range of music genres. Ratings (42 529) were collected from 633 sessions using laboratory and remote collection methods. Analysis of results shows no correlation between age and quality of rating; expert and non-expert listeners to be equivalent; minor differences between participants with and without hearing issues; and minimal differences between testing modalities. A comparison of published objective measures and subjective scores shows the objective measures to be poor indicators of subjective quality. Initial results for a retrained objective measure of quality are presented with results approaching average root mean squared error loss and Pearson correlation values of subjective sessions. The labeled dataset is available at http://ieee-dataport.org/1987.

Submitted 15 July, 2020; v1 submitted 1 June, 2020; originally announced June 2020.

Comments: 12 Pages, 13 Figures, Published in The Journal of the Acoustical Society of America (Vol.148, Issue 1), For associated dataset, see http://ieee-dataport.org/1987

Journal ref: J. Acoust. Soc. Am. 148(1). pp. 201-210 (2020)

arXiv:2002.12794 [pdf, other]

Deep Residual-Dense Lattice Network for Speech Enhancement

Authors: Mohammad Nikzad, Aaron Nicolson, Yongsheng Gao, Jun Zhou, Kuldip K. Paliwal, Fanhua Shang

Abstract: Convolutional neural networks (CNNs) with residual links (ResNets) and causal dilated convolutional units have been the network of choice for deep learning approaches to speech enhancement. While residual links improve gradient flow during training, feature diminution of shallow layer outputs can occur due to repetitive summations with deeper layer outputs. One strategy to improve feature re-usage is to fuse both ResNets and densely connected CNNs (DenseNets). DenseNets, however, over-allocate parameters for feature re-usage. Motivated by this, we propose the residual-dense lattice network (RDL-Net), which is a new CNN for speech enhancement that employs both residual and dense aggregations without over-allocating parameters for feature re-usage. This is managed through the topology of the RDL blocks, which limit the number of outputs used for dense aggregations. Our extensive experimental investigation shows that RDL-Nets are able to achieve a higher speech enhancement performance than CNNs that employ residual and/or dense aggregations. RDL-Nets also use substantially fewer parameters and have a lower computational requirement. Furthermore, we demonstrate that RDL-Nets outperform many state-of-the-art deep learning approaches to speech enhancement.

Submitted 26 February, 2020; originally announced February 2020.

Comments: 8 pages, Accepted by AAAI-2020

arXiv:1912.12023 [pdf, other]

Monaural Speech Enhancement Using a Multi-Branch Temporal Convolutional Network

Authors: Qiquan Zhang, Aaron Nicolson, Mingjiang Wang, Kuldip K. Paliwal, Chenxu Wang

Abstract: Deep learning has achieved substantial improvement on single-channel speech enhancement tasks. However, the performance of multi-layer perceptron (MLP)-based methods is limited by their ability to capture long-term history information. Recurrent neural networks (RNNs), e.g., the long short-term memory (LSTM) model, are able to capture long-term temporal dependencies, but come with the issues of high latency and training complexity. To address these issues, the temporal convolutional network (TCN) was proposed to replace RNNs in various sequence modeling tasks. In this paper we propose a novel TCN model that employs a multi-branch structure, called multi-branch TCN (MB-TCN), for monaural speech enhancement. The MB-TCN exploits a split-transform-aggregate design, which is expected to obtain strong representational power at a low computational complexity. Inspired by the TCN, the MB-TCN model incorporates a one-dimensional causal dilated CNN and residual learning to expand receptive fields for capturing long-term temporal contextual information. Our extensive experimental investigation suggests that MB-TCNs outperform residual long short-term memory networks (ResLSTMs), temporal convolutional networks (TCNs), and CNNs that employ dense aggregations in terms of speech intelligibility and quality, while providing superior parameter efficiency. Furthermore, our experimental results demonstrate that our proposed MB-TCN model is able to outperform multiple state-of-the-art deep learning-based speech enhancement methods in terms of five widely used objective metrics.

Submitted 17 May, 2020; v1 submitted 27 December, 2019; originally announced December 2019.

Comments: There are some inappropriate descriptions. These descriptions exist on many pages

arXiv:1910.11969 [pdf, other]

Sum-Product Networks for Robust Automatic Speaker Identification

Authors: Aaron Nicolson, Kuldip K. Paliwal

Abstract: We introduce sum-product networks (SPNs) for robust speech processing through a simple robust automatic speaker identification (ASI) task. SPNs are deep probabilistic graphical models capable of answering multiple probabilistic queries. We show that SPNs are able to remain robust by using the marginal probability density function (PDF) of the spectral features that reliably represent speech. Though current SPN toolkits and learning algorithms are in their infancy, we aim to show that SPNs have the potential to become a useful tool for robust speech processing in the future. SPN speaker models are evaluated here on real-world non-stationary and coloured noise sources at multiple signal-to-noise ratio (SNR) levels. In terms of ASI accuracy, we find that SPN speaker models are more robust than two recent convolutional neural network (CNN)-based ASI systems. Additionally, SPN speaker models consist of significantly fewer parameters than their CNN-based counterparts. The results indicate that SPN speaker models could be a robust, parameter-efficient alternative for ASI. Additionally, this work demonstrates that SPNs have potential in related tasks, such as robust automatic speech recognition (ASR) and automatic speaker verification (ASV). Availability: The SPN ASI system is available at https://github.com/anicolson/SPN-ASI.

Submitted 13 August, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

Comments: Proc. Interspeech 2020

arXiv:1906.07319 [pdf, other]

Deep Xi as a Front-End for Robust Automatic Speech Recognition

Authors: Aaron Nicolson, Kuldip K. Paliwal

Abstract: Current front-ends for robust automatic speech recognition (ASR) include masking- and mapping-based deep learning approaches to speech enhancement. A recently proposed deep learning approach to a priori SNR estimation, called Deep Xi, [...] mapping-based approaches. Motivated by this, we investigate Deep Xi [...] mapping-based deep learning front-ends. The results presented in this work show that Deep Xi is a viable front-end, and is able to significantly increase the robustness of an ASR system. Availability: Deep Xi is available at: https://github.com/anicolson/DeepXi

Submitted 27 January, 2020; v1 submitted 17 June, 2019; originally announced June 2019.

Showing 1–7 of 7 results for author: Paliwal, K K