-
Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation
Authors:
Kun Wei,
Bei Li,
Hang Lv,
Quan Lu,
Ning Jiang,
Lei Xie
Abstract:
Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversational level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.
Submitted 27 April, 2024; v1 submitted 22 October, 2023;
originally announced October 2023.
-
Asca: less audio data is more insightful
Authors:
Xiang Li,
Junhao Chen,
Chao Li,
Hongwu Lv
Abstract:
Audio recognition in specialized areas such as birdsong and submarine acoustics faces challenges in large-scale pre-training because sampling environments and specificity requirements limit the number of available samples. While the Transformer model excels in audio recognition, its dependence on vast amounts of data becomes restrictive in resource-limited settings. To address this, we introduce the Audio Spectrogram Convolution Attention (ASCA) model based on CoAtNet, integrating a Transformer-convolution hybrid architecture, novel network design, and attention techniques, further augmented with data enhancement and regularization strategies. On BirdCLEF2023 and AudioSet (Balanced), ASCA achieved accuracies of 81.2% and 35.1%, respectively, significantly outperforming competing methods. The unique structure of our model enriches output, enabling generalization across various audio detection tasks. Our code can be found at https://github.com/LeeCiang/ASCA.
Submitted 23 September, 2023;
originally announced September 2023.
-
Estimating Brain Age with Global and Local Dependencies
Authors:
Yanwu Yang,
Xutao Guo,
Zhikai Chang,
Chenfei Ye,
Yang Xiang,
Haiyan Lv,
Ting Ma
Abstract:
Brain age has been shown to be a phenotype relevant to cognitive performance and brain disease. Achieving accurate brain age prediction is an essential prerequisite for optimizing the predicted brain-age difference as a biomarker. As a comprehensive biological characteristic, brain age is hard to estimate accurately with models that rely on feature engineering and local processing, such as local convolution and recurrent operations that process one local neighborhood at a time. In contrast, Vision Transformers learn global attentive interactions among patch tokens, introducing less inductive bias and modeling long-range dependencies. Building on this, we propose a novel network for learning brain age with global and local dependencies, where the corresponding representations are captured by Successive Permuted Transformer (SPT) and convolution blocks. The SPT brings computational efficiency and locates 3D spatial information indirectly by continuously encoding 2D slices from different views. Finally, we collect a large cohort of 22645 subjects with ages ranging from 14 to 97, and our network performs best among a series of deep learning methods, yielding a mean absolute error (MAE) of 2.855 on the validation set and 2.911 on an independent test set.
Submitted 19 September, 2022;
originally announced September 2022.
-
Minimizing Sequential Confusion Error in Speech Command Recognition
Authors:
Zhanheng Yang,
Hang Lv,
Xiong Wang,
Ao Zhang,
Lei Xie
Abstract:
Speech command recognition (SCR) has been commonly used on resource-constrained devices to achieve a hands-free user experience. However, in real applications, confusion among commands with similar pronunciations often happens due to the limited capacity of small models deployed on edge devices, which drastically affects the user experience. In this paper, inspired by the advances of discriminative training in speech recognition, we propose a novel minimize sequential confusion error (MSCE) training criterion particularly for SCR, aiming to alleviate the command confusion problem. Specifically, we aim to improve the ability to discriminate the target command from other commands on the basis of MCE discriminative criteria. We define the likelihood of different commands through connectionist temporal classification (CTC). During training, we propose several strategies that use prior knowledge to create a confusing sequence set for similar-sounding commands instead of creating the whole non-target command set, which saves training resources and effectively reduces command confusion errors. Specifically, we design and compare three different strategies for confusing set construction. With our proposed method, we achieve a relative reduction of 33.7% in False Reject Rate (FRR) at 0.01 False Alarm Rate (FAR) and of 18.28% in confusion errors on our collected speech command set.
Submitted 4 July, 2022;
originally announced July 2022.
-
Identification of diffracted vortex beams at different propagation distances using deep learning
Authors:
Heng Lv,
Yan Guo,
Zi-Xiang Yang,
Chunling Ding,
Wu-Hao Cai,
Chenglong You,
Rui-Bo Jin
Abstract:
Orbital angular momentum of light is regarded as a valuable resource in quantum technology, especially in quantum communication and quantum sensing and ranging. However, the OAM state of light is susceptible to undesirable experimental conditions such as propagation distance and phase distortions, which hinders the potential for the realistic implementation of relevant technologies. In this article, we exploit an enhanced deep learning neural network to identify different OAM modes of light at multiple propagation distances with phase distortions. Specifically, our trained deep learning neural network can efficiently identify the vortex beam's topological charge and propagation distance with 97% accuracy. Our technique has important implications for OAM based communication and sensing protocols.
Submitted 30 March, 2022;
originally announced March 2022.
-
WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit
Authors:
Binbin Zhang,
Di Wu,
Zhendong Peng,
Xingchen Song,
Zhuoyuan Yao,
Hang Lv,
Lei Xie,
Chao Yang,
Fuping Pan,
Jianwei Niu
Abstract:
Recently, we made available WeNet, a production-oriented end-to-end speech recognition toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address the streaming and non-streaming decoding modes in a single model. To further improve ASR performance and facilitate various production requirements, in this paper, we present WeNet 2.0 with four important updates. (1) We propose U2++, a unified two-pass framework with bidirectional attention decoders, which includes the future contextual information by a right-to-left attention decoder to improve the representative ability of the shared encoder and the performance during the rescoring stage. (2) We introduce an n-gram based language model and a WFST-based decoder into WeNet 2.0, promoting the use of rich text data in production scenarios. (3) We design a unified contextual biasing framework, which leverages user-specific context (e.g., contact lists) to provide rapid adaptation ability for production and improves ASR accuracy in both with-LM and without-LM scenarios. (4) We design a unified IO to support large-scale data for effective model training. In summary, the brand-new WeNet 2.0 achieves up to 10% relative recognition performance improvement over the original WeNet on various corpora and makes available several important production-oriented features.
Submitted 5 July, 2022; v1 submitted 29 March, 2022;
originally announced March 2022.
-
Uncertainty Quantification in Medical Image Segmentation with Multi-decoder U-Net
Authors:
Yanwu Yang,
Xutao Guo,
Yiwei Pan,
Pengcheng Shi,
Haiyan Lv,
Ting Ma
Abstract:
Accurate medical image segmentation is crucial for diagnosis and analysis. However, models without calibrated uncertainty estimates might lead to errors in downstream analysis and exhibit low levels of robustness. Estimating the uncertainty in the measurement is vital to making definite, informed conclusions. In particular, it is difficult for both models and radiologists to make accurate predictions on ambiguous areas and focus boundaries, and even harder to reach a consensus with multiple annotations. In this work, the uncertainty in these areas is studied, which carries significant information about anatomical structure and is as important as segmentation performance. We approach medical image segmentation uncertainty quantification by measuring segmentation performance with multiple annotations in a supervised learning manner and propose a U-Net based architecture with multiple decoders, where the image representation is encoded with the same encoder, and the segmentation referring to each annotation is estimated with multiple decoders. In addition, a cross-loss function is proposed for bridging the gap between different branches. The proposed architecture is trained in an end-to-end manner and is able to improve predictive uncertainty estimates. With fewer parameters, the model achieves performance comparable to that of the integrated training model that was the runner-up in the MICCAI-QUBIQ 2020 challenge.
Submitted 14 September, 2021;
originally announced September 2021.
-
An Asynchronous WFST-Based Decoder For Automatic Speech Recognition
Authors:
Hang Lv,
Zhehuai Chen,
Hainan Xu,
Daniel Povey,
Lei Xie,
Sanjeev Khudanpur
Abstract:
We introduce an asynchronous dynamic decoder, which adopts an efficient A* algorithm to incorporate big language models in one-pass decoding for large vocabulary continuous speech recognition. Unlike standard one-pass decoding with an on-the-fly composition decoder, which might induce a significant computation overhead, the asynchronous dynamic decoder has a novel design with two fronts, one performing "exploration" and the other "backfill". The computation of the two fronts alternates in the decoding process, resulting in more effective pruning than standard one-pass decoding with an on-the-fly composition decoder. Experiments show that the proposed decoder is notably faster than standard one-pass decoding with an on-the-fly composition decoder, and the acceleration becomes more pronounced as data complexity increases.
Submitted 16 March, 2021;
originally announced March 2021.
-
Wake Word Detection with Streaming Transformers
Authors:
Yiming Wang,
Hang Lv,
Daniel Povey,
Lei Xie,
Sanjeev Khudanpur
Abstract:
Modern wake word detection systems usually rely on neural networks for acoustic modeling. Transformers have recently shown superior performance over LSTM and convolutional networks in various sequence modeling tasks with their better temporal modeling power. However, it is not clear whether this advantage still holds for short-range temporal modeling like wake word detection. Besides, the vanilla Transformer is not directly applicable to the task due to its non-streaming nature and its quadratic time and space complexity. In this paper, we explore the performance of several variants of chunk-wise streaming Transformers tailored for wake word detection in a recently proposed LF-MMI system, including looking ahead to the next chunk, gradient stopping, different positional embedding methods, and adding same-layer dependency between chunks. Our experiments on the Mobvoi wake word dataset demonstrate that our proposed Transformer model outperforms the baseline convolution network by 25% on average in false rejection rate at the same false alarm rate with a comparable model size, while still maintaining linear complexity w.r.t. the sequence length.
Submitted 8 February, 2021;
originally announced February 2021.
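As an illustration of why chunk-wise attention keeps the cost linear in the sequence length, the following minimal Python sketch builds a boolean attention mask. The function name `chunk_attention_mask`, the `left_chunks` parameter, and the exact choice of visible chunks are assumptions made for illustration; they are not taken from the paper.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks=1, look_ahead=False):
    """Return mask[q][k] == True when query frame q may attend to key frame k.

    Hypothetical rule (an assumption, not the paper's exact scheme):
    a frame sees its own chunk, `left_chunks` chunks to the left, and,
    if `look_ahead` is set, the next chunk as well.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        first = q_chunk - left_chunks                  # oldest visible chunk
        last = q_chunk + (1 if look_ahead else 0)      # newest visible chunk
        mask.append([first <= k // chunk_size <= last for k in range(num_frames)])
    return mask


def max_keys_per_query(mask):
    """Largest number of key frames any single query attends to."""
    return max(sum(row) for row in mask)
```

Under this rule each query attends to at most `(left_chunks + 2) * chunk_size` frames regardless of `num_frames`, so the total attention cost grows linearly with the sequence length instead of quadratically.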
-
Context-aware RNNLM Rescoring for Conversational Speech Recognition
Authors:
Kun Wei,
Pengcheng Guo,
Hang Lv,
Zhen Tu,
Lei Xie
Abstract:
Conversational speech recognition is regarded as a challenging task due to its free-style speaking and long-term contextual dependencies. Prior work has explored the modeling of long-range context through RNNLM rescoring with improved performance. To further take advantage of properties that persist during a conversation, such as topics or speaker turns, we extend the rescoring procedure in a new context-aware manner. For RNNLM training, we capture the contextual dependencies by concatenating adjacent sentences with various tag words, such as speaker or intention information. For lattice rescoring, the lattices of adjacent sentences are also connected with the first-pass decoded result by tag words. In addition, we adopt a selective concatenation strategy based on tf-idf, making the best use of contextual similarity to improve transcription performance. Results on four different conversation test sets show that our approach yields up to 13.1% and 6% relative char-error-rate (CER) reduction compared with 1st-pass decoding and common lattice-rescoring, respectively.
Submitted 18 November, 2020;
originally announced November 2020.
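The idea of selecting context by tf-idf similarity can be sketched in a few lines of Python. This is only one plausible reading of "selective concatenation": the function names, the smoothed idf formula, the cosine measure, and the `threshold` value are all assumptions made for illustration, and the abstract does not specify the authors' actual selection rule.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Map each whitespace-tokenized sentence to a sparse tf-idf dict."""
    docs = [Counter(s.split()) for s in sentences]
    n = len(docs)
    df = Counter(word for doc in docs for word in doc)   # document frequency
    # Smoothed idf (an assumed variant; many definitions exist).
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in df}
    return [{w: count * idf[w] for w, count in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_context(history, current, threshold=0.2):
    """Keep only history sentences similar enough to the current one."""
    vectors = tfidf_vectors(history + [current])
    current_vec = vectors[-1]
    return [h for h, v in zip(history, vectors) if cosine(v, current_vec) >= threshold]
```

With a history of "book a flight to beijing", "the weather is nice today", and "which flight to beijing is cheapest", and a current sentence "cancel my flight to beijing", the second history sentence shares no words with the current one and is dropped, while the other two are kept for concatenation.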
-
Empirical Fourier Decomposition: An Accurate Adaptive Signal Decomposition Method
Authors:
Wei Zhou,
Zhongren Feng,
Y. F. Xu,
Xiongjiang Wang,
Hao Lv
Abstract:
Signal decomposition is an effective tool to assist the identification of modal information in time-domain signals. Two signal decomposition methods, including the empirical wavelet transform (EWT) and Fourier decomposition method (FDM), have been developed based on Fourier theory. However, the EWT can suffer from a mode mixing problem for signals with closely-spaced modes and decomposition results by FDM can suffer from an inconsistency problem. An accurate adaptive signal decomposition method, called the empirical Fourier decomposition (EFD), is proposed to solve the problems in this work. The proposed EFD combines the uses of an improved Fourier spectrum segmentation technique and an ideal filter bank. The segmentation technique can solve the inconsistency problem by predefining the number of modes in a signal to be decomposed and filter functions in the ideal filter bank have no transition phases, which can solve the mode mixing problem. Numerical investigations are conducted to study the accuracy of the EFD. It is shown that the EFD can yield accurate and consistent decomposition results for signals with multiple non-stationary modes and those with closely-spaced modes, compared with decomposition results by the EWT, FDM, variational mode decomposition and empirical mode decomposition. It is also shown that the EFD can yield accurate time-frequency representation results and it has the highest computational efficiency among the compared methods.
Submitted 7 October, 2020; v1 submitted 16 September, 2020;
originally announced September 2020.
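To make the "ideal filter bank with no transition phases" concrete, here is a minimal pure-Python sketch that splits a signal into components by keeping disjoint sets of DFT bins. The band boundaries are supplied by hand; the improved spectrum segmentation technique described in the abstract is not reproduced, and the function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Inverse DFT, returning the real part (spectrum is conjugate-symmetric here)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def ideal_filter_bank(x, boundaries):
    """Split x into len(boundaries) - 1 components with brick-wall filters.

    boundaries: increasing bin indices b_0 < b_1 < ... < b_M (assumed given).
    Component m keeps the bins whose absolute frequency index f satisfies
    b_m <= f < b_{m+1}; every other bin is set to zero, so adjacent bands
    share no transition region.
    """
    n = len(x)
    spectrum = dft(x)
    components = []
    for low, high in zip(boundaries[:-1], boundaries[1:]):
        kept = [spectrum[k] if low <= min(k, n - k) < high else 0j
                for k in range(n)]
        components.append(idft_real(kept))
    return components
```

For a 64-sample signal made of tones at bins 5 and 8, boundaries `[0, 7, 33]` place each tone in its own band. Because the bands partition the spectrum, the components sum back to the original signal.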
-
Wake Word Detection with Alignment-Free Lattice-Free MMI
Authors:
Yiming Wang,
Hang Lv,
Daniel Povey,
Lei Xie,
Sanjeev Khudanpur
Abstract:
Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input. We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data, and to use it in on-line applications: (i) we remove the prerequisite of frame-level alignments in the LF-MMI training algorithm, permitting the use of un-transcribed training examples that are annotated only for the presence/absence of the wake word; (ii) we show that the classical keyword/filler model must be supplemented with an explicit non-speech (silence) model for good performance; (iii) we present an FST-based decoder to perform online detection. We evaluate our methods on two real data sets, showing 50% to 90% reduction in false rejection rates at pre-specified false alarm rates over the best previously published figures, and re-validate them on a third (large) data set.
Submitted 28 July, 2020; v1 submitted 17 May, 2020;
originally announced May 2020.
-
RPCA-Based High Resolution Through-the-Wall Human Motion Detection and Classification
Authors:
Qiang An,
Shuoguang Wang,
Wenji Zhang,
Hao Lv,
Jianqi Wang,
Shiyong Li,
Ahmad Hoorfar
Abstract:
Radar-based assisted living has received a great amount of research interest in recent years. By employing the micro-Doppler features of indoor human motions, accurate recognition and classification of different types of movements become possible. However, most existing works focus only on free-space detection, and the literature on detection and recognition of human motions in through-the-wall scenarios is still in its infancy. As can be anticipated, the wall media and indoor static non-human targets cause clutter and significantly corrupt the motion information of human subjects behind the wall, yet no relevant work has been reported that effectively handles this problem. In the present work, we aim to fill this gap and propose to use a low center-frequency ultra-wideband (UWB) radar system to probe the behind-wall scene. Then, a Robust Principal Component Analysis (RPCA) based subspace decomposition technique, in its first reported implementation, is employed not only to remove the stationary clutter in the raw range slow-time map but also to mitigate the multipath effects in the time-frequency map. Onsite experiments detecting human motions behind a single layer of concrete wall are carried out to investigate the performance of the technique. Lastly, a two-dimensional (2D) PCA algorithm-based motion classification is provided to further verify the effectiveness of the proposed technique. Classification results show that an enhanced recognition capability can be achieved using the proposed technique in detection and classification of indoor human motions.
Submitted 29 January, 2020;
originally announced January 2020.
-
Range-Max Enhanced Ultra-Wideband Micro-Doppler Signatures of Behind Wall Indoor Human Activities
Authors:
Qiang An,
Shuoguang Wang,
Ahmad Hoorfar,
Wenji Zhang,
Hao Lv,
Shiyong Li,
Jianqi Wang
Abstract:
Penetrating detection and recognition of behind-wall indoor human activities has drawn great attention from social security and emergency service departments in recent years, since intelligent surveillance beforehand can support proper decision making before operations are carried out. However, due to the influence of wall effects, the obtained micro-Doppler signatures are severely degraded by strong near zero-frequency DC components, which inevitably smear the detailed characteristic features of different behind-wall motions in the time-frequency (TF) map and further hinder motion recognition and classification. In this paper, an ultra-wideband (UWB) radar system is first employed to probe through the opaque wall to detect the behind-wall motions, which often span a certain number of range bin cells. By employing such a system, a high-resolution range map can be obtained, in which the embedded rich range information is expected to be fully exploited to improve the subsequent recognition and classification performance. Secondly, a high-pass filter is applied to remove the effect of the wall in the raw range map. Then, with the aim of enhancing the characteristic features of different behind-wall motions in TF maps, a novel range-max enhancement strategy is proposed to extract the most significant micro-Doppler feature of each TF cell along all range bins for a specific motion. Lastly, the effectiveness of the proposed micro-Doppler signature enhancement strategy is investigated by means of onsite experiments and comparative classification. Both the feature-enhanced TF maps and the classification results show that the proposed approach outperforms other state-of-the-art Short-Time Fourier Transform (STFT) based TF feature extraction methods.
Submitted 29 January, 2020; v1 submitted 28 January, 2020;
originally announced January 2020.
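The range-max step described in the abstract, keeping the most significant value of each TF cell along all range bins, can be read as an element-wise maximum over per-range-bin TF maps. The sketch below assumes that reading and assumes the TF magnitudes have already been computed for each range bin; the function name and data layout are hypothetical.

```python
def range_max_enhance(tf_maps):
    """Collapse per-range-bin TF maps into one map by element-wise maximum.

    tf_maps[r][t][f] is the TF magnitude at range bin r, time index t and
    frequency index f (an assumed layout). All maps must share one shape.
    """
    if not tf_maps:
        raise ValueError("at least one range bin is required")
    n_time = len(tf_maps[0])
    n_freq = len(tf_maps[0][0])
    for tf_map in tf_maps:
        if len(tf_map) != n_time or any(len(row) != n_freq for row in tf_map):
            raise ValueError("all TF maps must have the same shape")
    return [[max(tf_map[t][f] for tf_map in tf_maps) for f in range(n_freq)]
            for t in range(n_time)]
```

For two range bins with maps `[[1, 5], [0, 2]]` and `[[3, 4], [7, 1]]`, each output cell takes the larger of the two values at that time-frequency position.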
-
Empirical Fourier Decomposition
Authors:
Wei Zhou,
Zhongren Feng,
Xiongjiang Wang,
Hao Lv
Abstract:
In this paper, a novel decomposition method for non-stationary and nonlinear signals is proposed. This method is inspired by the adaptive wavelet filter bank of the empirical wavelet transform (EWT) and the Fourier intrinsic band functions (FIBFs) of the Fourier decomposition method (FDM). The proposed approach is therefore called empirical Fourier decomposition (EFD). EFD is defined as an adaptive bandpass filter bank, regarded as adaptive FIBFs based on the segmentation of the Fourier spectrum. Firstly, an enhanced segmentation technique for the Fourier spectrum is presented. Secondly, the framework of EFD is established for both continuous and discrete series. Finally, combined with the Hilbert transform, EFD is extended to a time-frequency representation. To verify the effectiveness of EFD, three non-stationary multimode signals, a simulated free vibration, and one real ECG signal are tested. The results show that EFD is more effective than EWT and FDM, with higher processing precision, computational efficiency, and noise robustness, particularly for closely-spaced frequencies and high-frequency noise.
Submitted 1 December, 2019;
originally announced December 2019.
-
Espresso: A Fast End-to-end Neural Speech Recognition Toolkit
Authors:
Yiming Wang,
Tongfei Chen,
Hainan Xu,
Shuoyang Ding,
Hang Lv,
Yiwen Shao,
Nanyun Peng,
Lei Xie,
Shinji Watanabe,
Sanjeev Khudanpur
Abstract:
We present Espresso, an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented. Espresso achieves state-of-the-art ASR performance on the WSJ, LibriSpeech, and Switchboard data sets among other end-to-end systems without data augmentation, and is 4 to 11 times faster for decoding than similar systems (e.g. ESPnet).
Submitted 14 October, 2019; v1 submitted 18 September, 2019;
originally announced September 2019.