
Dataset-Distillation Generative Model for Speech Emotion Recognition

Authors: Fabian Ritter-Gutierrez, Kuan-Po Huang, Jeremy H. M Wong, Dianwen Ng, Hung-yi Lee, Nancy F. Chen, Eng Siong Chng

Abstract: Deep learning models for speech rely on large datasets, presenting computational challenges. Yet, performance hinges on training data size. Dataset Distillation (DD) aims to learn a smaller dataset without much performance degradation when training with it. DD has been investigated in computer vision but not yet in speech. This paper presents the first approach for DD to speech targeting Speech Emotion Recognition on IEMOCAP. We employ Generative Adversarial Networks (GANs) not to mimic real data but to distil key discriminative information of IEMOCAP that is useful for downstream training. The GAN then replaces the original dataset and can sample custom synthetic dataset sizes. It performs comparably when following the original class imbalance but improves performance by 0.3% absolute UAR with balanced classes. It also reduces dataset storage and accelerates downstream training by 95% in both cases and reduces speaker information which could help for a privacy application.

Submitted 5 June, 2024; originally announced June 2024.

Comments: Accepted at Interspeech 2024

arXiv:2312.12153 [pdf, other]

Noise robust distillation of self-supervised speech models via correlation metrics

Authors: Fabian Ritter-Gutierrez, Kuan-Po Huang, Dianwen Ng, Jeremy H. M. Wong, Hung-yi Lee, Eng Siong Chng, Nancy F. Chen

Abstract: Compared to large speech foundation models, small distilled models exhibit degraded noise robustness. The student's robustness can be improved by introducing noise at the inputs during pre-training. Despite this, using the standard distillation loss still yields a student with degraded performance. Thus, this paper proposes improving student robustness via distillation with correlation metrics. Teacher behavior is learned by maximizing the teacher and student cross-correlation matrix between their representations towards identity. Noise robustness is encouraged via the student's self-correlation minimization. The proposed method is agnostic of the teacher model and consistently outperforms the previous approach. This work also proposes a heuristic to weigh the importance of the two correlation terms automatically. Experiments show consistently better clean and noise generalization on Intent Classification, Keyword Spotting, and Automatic Speech Recognition tasks on SUPERB Challenge.

Submitted 19 December, 2023; originally announced December 2023.

Comments: 6 pages

arXiv:2306.02719 [pdf, ps, other]

Multiple output samples per input in a single-output Gaussian process

Authors: Jeremy H. M. Wong, Huayun Zhang, Nancy F. Chen

Abstract: The standard Gaussian Process (GP) only considers a single output sample per input in the training set. Datasets for subjective tasks, such as spoken language assessment, may be annotated with output labels from multiple human raters per input. This paper proposes to generalise the GP to allow for these multiple output samples in the training set, and thus make use of available output uncertainty information. This differs from a multi-output GP, as all output samples are from the same task here. The output density function is formulated to be the joint likelihood of observing all output samples, and latent variables are not repeated to reduce computation cost. The test set predictions are inferred similarly to a standard GP, with a difference being in the optimised hyper-parameters. This is evaluated on speechocean762, showing that it allows the GP to compute a test set output distribution that is more similar to the collection of reference outputs from the multiple human raters.

Submitted 25 January, 2024; v1 submitted 5 June, 2023; originally announced June 2023.

Comments: This paper is presented in the "Symposium for Celebrating 40 Years of Bayesian Learning in Speech and Language Processing and Beyond", which is a satellite event of the ASRU workshop, on 20 December 2023. https://bayesian40.github.io/

arXiv:2305.18881 [pdf, other]

MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization

Authors: Victoria Y. H. Chua, Hexin Liu, Leibny Paola Garcia Perera, Fei Ting Woon, Jinyi Wong, Xiangyu Zhang, Sanjeev Khudanpur, Andy W. H. Khong, Justin Dauwels, Suzy J. Styles

Abstract: To enhance the reliability and robustness of language identification (LID) and language diarization (LD) systems for heterogeneous populations and scenarios, there is a need for speech processing models to be trained on datasets that feature diverse language registers and speech patterns. We present the MERLIon CCS challenge, featuring a first-of-its-kind Zoom video call dataset of parent-child shared book reading, of over 30 hours with over 300 recordings, annotated by multilingual transcribers using a high-fidelity linguistic transcription protocol. The audio corpus features spontaneous and in-the-wild English-Mandarin code-switching, child-directed speech in non-standard accents with diverse language-mixing patterns recorded in a variety of home environments. This report describes the corpus, as well as LID and LD results for our baseline and several systems submitted to the MERLIon CCS challenge using the corpus.

Submitted 30 May, 2023; originally announced May 2023.

Comments: Accepted by Interspeech 2023, 5 pages, 2 figures, 3 tables

arXiv:2210.11923 [pdf, other]

doi 10.1145/3627827

RollBack: A New Time-Agnostic Replay Attack Against the Automotive Remote Keyless Entry Systems

Authors: Levente Csikor, Hoon Wei Lim, Jun Wen Wong, Soundarya Ramesh, Rohini Poolat Parameswarath, Mun Choon Chan

Abstract: Today's RKE systems implement disposable rolling codes, making every key fob button press unique, effectively preventing simple replay attacks. However, a prior attack called RollJam was proven to break all rolling code-based systems in general. By a careful sequence of signal jamming, capturing, and replaying, an attacker can become aware of the subsequent valid unlock signal that has not been used yet. RollJam, however, requires continuous deployment indefinitely until it is exploited. Otherwise, the captured signals become invalid if the key fob is used again without RollJam in place. We introduce RollBack, a new replay-and-resynchronize attack against most of today's RKE systems. In particular, we show that even though the one-time code becomes invalid in rolling code systems, replaying a few previously captured signals consecutively can trigger a rollback-like mechanism in the RKE system. Put differently, the rolling codes become resynchronized back to a previous code used in the past from where all subsequent yet already used signals work again. Moreover, the victim can still use the key fob without noticing any difference before and after the attack. Unlike RollJam, RollBack does not necessitate jamming at all. Furthermore, it requires signal capturing only once and can be exploited at any time in the future as many times as desired. This time-agnostic property is particularly attractive to attackers, especially in car-sharing/renting scenarios where accessing the key fob is straightforward.
However, while RollJam defeats virtually any rolling code-based system, vehicles might have additional anti-theft measures against malfunctioning key fobs, hence against RollBack. Our ongoing analysis (covering Asian vehicle manufacturers for the time being) against different vehicle makes and models has revealed that ~70% of them are vulnerable to RollBack.

Submitted 14 September, 2022; originally announced October 2022.

Comments: 24 pages, 5 figures. Under submission to a journal

Journal ref: ACM Transactions on Cyber-Physical Systems, 2024

arXiv:2210.01158 [pdf, other]

An Analysis of RF Transfer Learning Behavior Using Synthetic Data

Authors: Lauren J. Wong, Sean McPherson, Alan J. Michaels

Abstract: Transfer learning (TL) techniques, which leverage prior knowledge gained from data with different distributions to achieve higher performance and reduced training time, are often used in computer vision (CV) and natural language processing (NLP), but have yet to be fully utilized in the field of radio frequency machine learning (RFML). This work systematically evaluates radio frequency (RF) TL behavior by examining how the training domain and task, characterized by the transmitter/receiver hardware and channel environment, impact RF TL performance for an example automatic modulation classification (AMC) use-case. Through exhaustive experimentation using carefully curated synthetic datasets with varying signal types, signal-to-noise ratios (SNRs), and frequency offsets (FOs), generalized conclusions are drawn regarding how best to use RF TL techniques for domain adaptation and sequential learning. Consistent with trends identified in other modalities, results show that RF TL performance is highly dependent on the similarity between the source and target domains/tasks. Results also discuss the impacts of channel environment, hardware variations, and domain/task difficulty on RF TL performance, and compare RF TL performance using head re-training and model fine-tuning methods.

Submitted 3 October, 2022; originally announced October 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2206.08329

arXiv:2206.08329 [pdf, other]

Assessing the Value of Transfer Learning Metrics for RF Domain Adaptation

Authors: Lauren J. Wong, Sean McPherson, Alan J. Michaels

Abstract: The use of transfer learning (TL) techniques has become common practice in fields such as computer vision (CV) and natural language processing (NLP). Leveraging prior knowledge gained from data with different distributions, TL offers higher performance and reduced training time, but has yet to be fully utilized in applications of machine learning (ML) and deep learning (DL) techniques to applications related to wireless communications, a field loosely termed radio frequency machine learning (RFML). This work begins this examination by evaluating how radio frequency (RF) domain changes encourage or prevent the transfer of features learned by convolutional neural network (CNN)-based automatic modulation classifiers. Additionally, we examine existing transferability metrics, Log Expected Empirical Prediction (LEEP) and Logarithm of Maximum Evidence (LogME), as a means to both select source models for RF domain adaptation and predict post-transfer accuracy without further training.

Submitted 16 June, 2022; originally announced June 2022.

arXiv:2203.11903 [pdf]

Enabling faster and more reliable sonographic assessment of gestational age through machine learning

Authors: Chace Lee, Angelica Willis, Christina Chen, Marcin Sieniek, Akib Uddin, Jonny Wong, Rory Pilgrim, Katherine Chou, Daniel Tse, Shravya Shetty, Ryan G. Gomes

Abstract: Fetal ultrasounds are an essential part of prenatal care and can be used to estimate gestational age (GA). Accurate GA assessment is important for providing appropriate prenatal care throughout pregnancy and identifying complications such as fetal growth disorders. Since derivation of GA from manual fetal biometry measurements (head, abdomen, femur) is operator-dependent and time-consuming, there have been a number of research efforts focused on using artificial intelligence (AI) models to estimate GA using standard biometry images, but there is still room to improve the accuracy and reliability of these AI systems for widescale adoption. To improve GA estimates, without significant change to provider workflows, we leverage AI to interpret standard plane ultrasound images as well as 'fly-to' ultrasound videos, which are 5-10s videos automatically recorded as part of the standard of care before the still image is captured. We developed and validated three AI models: an image model using standard plane images, a video model using fly-to videos, and an ensemble model (combining both image and video). All three were statistically superior to standard fetal biometry-based GA estimates derived by expert sonographers; the ensemble model has the lowest mean absolute error (MAE) compared to the clinical standard fetal biometry (mean difference: -1.51 $\pm$ 3.96 days, 95% CI [-1.9, -1.1]) on a test set that consisted of 404 participants. We showed that our models outperform standard biometry by a more substantial margin on fetuses that were small for GA.
Our AI models have the potential to empower trained operators to estimate GA with higher accuracy while reducing the amount of time required and user variability in measurement acquisition.

Submitted 22 March, 2022; originally announced March 2022.

arXiv:2203.10139 [pdf]

AI system for fetal ultrasound in low-resource settings

Authors: Ryan G. Gomes, Bellington Vwalika, Chace Lee, Angelica Willis, Marcin Sieniek, Joan T. Price, Christina Chen, Margaret P. Kasaro, James A. Taylor, Elizabeth M. Stringer, Scott Mayer McKinney, Ntazana Sindano, George E. Dahl, William Goodnight III, Justin Gilmer, Benjamin H. Chi, Charles Lau, Terry Spitz, T Saensuksopa, Kris Liu, Jonny Wong, Rory Pilgrim, Akib Uddin, Greg Corrado, Lily Peng, et al. (4 additional authors not shown)

Abstract: Despite considerable progress in maternal healthcare, maternal and perinatal deaths remain high in low-to-middle income countries. Fetal ultrasound is an important component of antenatal care, but shortage of adequately trained healthcare workers has limited its adoption. We developed and validated an artificial intelligence (AI) system that uses novice-acquired "blind sweep" ultrasound videos to estimate gestational age (GA) and fetal malpresentation. We further addressed obstacles that may be encountered in low-resourced settings. Using a simplified sweep protocol with real-time AI feedback on sweep quality, we have demonstrated the generalization of model performance to minimally trained novice ultrasound operators using low cost ultrasound devices with on-device AI integration. The GA model was non-inferior to standard fetal biometry estimates with as few as two sweeps, and the fetal malpresentation model had high AUC-ROCs across operators and devices. Our AI models have the potential to assist in upleveling the capabilities of lightly trained ultrasound operators in low resource settings.

Submitted 18 March, 2022; originally announced March 2022.

arXiv:2109.10598 [pdf, other]

Diarisation using location tracking with agglomerative clustering

Authors: Jeremy H. M. Wong, Igor Abramovski, Xiong Xiao, Yifan Gong

Abstract: Previous works have shown that spatial location information can be complementary to speaker embeddings for a speaker diarisation task. However, the models used often assume that speakers are fairly stationary throughout a meeting. This paper proposes to relax this assumption, by explicitly modelling the movements of speakers within an Agglomerative Hierarchical Clustering (AHC) diarisation framework. Kalman filters, which track the locations of speakers, are used to compute log-likelihood ratios that contribute to the cluster affinity computations for the AHC merging and stopping decisions. Experiments show that the proposed approach is able to yield improvements on a Microsoft rich meeting transcription task, compared to methods that do not use location information or that make stationarity assumptions.

Submitted 23 September, 2021; v1 submitted 22 September, 2021; originally announced September 2021.

arXiv:2105.01644 [pdf, other]

Market Potential for CO$_2$ Removal and Sequestration from Renewable Natural Gas Production in California

Authors: Jun Wong, Jonathan Santoso, Marjorie Went, Daniel Sanchez

Abstract: Bioenergy with Carbon Capture and Sequestration (BECCS) is critical for stringent climate change mitigation, but is commercially and technologically immature and resource-intensive. In California, state and federal fuel and climate policies can drive first-markets for BECCS. We develop a spatially explicit optimization model to assess niche markets for renewable natural gas (RNG) production with carbon capture and sequestration (CCS) from waste biomass in California. Existing biomass residues produce biogas and RNG and enable low-cost CCS through the upgrading process and CO$_2$ truck transport. Under current state and federal policy incentives, we could capture and sequester 2.9 million MT CO$_2$/year (0.7% of California's 2018 CO$_2$ emissions) and produce 93 PJ RNG/year (4% of California's 2018 natural gas demand) with a profit maximizing objective. Existing federal and state policies produce profits of \$11/GJ. Distributed RNG production with CCS potentially catalyzes markets and technologies for CO$_2$ capture, transport, and storage in California.

Submitted 4 May, 2021; originally announced May 2021.

Comments: 25 pages, 6 figures

arXiv:2101.01239 [pdf, other]

Explainable Neural Network-based Modulation Classification via Concept Bottleneck Models

Authors: Lauren J. Wong, Sean McPherson

Abstract: While RFML is expected to be a key enabler of future wireless standards, a significant challenge to the widespread adoption of RFML techniques is the lack of explainability in deep learning models. This work investigates the use of CB models as a means to provide inherent decision explanations in the context of DL-based AMC. Results show that the proposed approach not only meets the performance of single-network DL-based AMC algorithms, but provides the desired model explainability and shows potential for classifying modulation schemes not seen during training (i.e. zero-shot learning).

Submitted 4 January, 2021; originally announced January 2021.

arXiv:2010.00432 [pdf, other]

The RFML Ecosystem: A Look at the Unique Challenges of Applying Deep Learning to Radio Frequency Applications

Authors: Lauren J. Wong, William H. Clark IV, Bryse Flowers, R. Michael Buehrer, Alan J. Michaels, William C. Headley

Abstract: While deep machine learning technologies are now pervasive in state-of-the-art image recognition and natural language processing applications, only in recent years have these technologies started to sufficiently mature in applications related to wireless communications. In particular, recent research has shown deep machine learning to be an enabling technology for cognitive radio applications as well as a useful tool for supplementing expertly defined algorithms for spectrum sensing applications such as signal detection, estimation, and classification (termed here as Radio Frequency Machine Learning, or RFML). A major driver for the usage of deep machine learning in the context of wireless communications is that little, to no, a priori knowledge of the intended spectral environment is required, given that there is an abundance of representative data to facilitate training and evaluation. However, in addition to this fundamental need for sufficient data, there are other key considerations, such as trust, security, and hardware/software issues, that must be taken into account before deploying deep machine learning systems in real-world wireless communication applications. This paper provides an overview and survey of prior work related to these major research considerations. In particular, we present their unique considerations in the RFML application space, which are not generally present in the image, audio, and/or text application spaces.

Submitted 1 October, 2020; originally announced October 2020.

arXiv:2009.08563 [pdf, other]

SCREENet: A Multi-view Deep Convolutional Neural Network for Classification of High-resolution Synthetic Mammographic Screening Scans

Authors: Saeed Seyyedi, Margaret J. Wong, Debra M. Ikeda, Curtis P. Langlotz

Abstract: Purpose: To develop and evaluate the accuracy of a multi-view deep learning approach to the analysis of high-resolution synthetic mammograms from digital breast tomosynthesis screening cases, and to assess the effect on accuracy of image resolution and training set size. Materials and Methods: In a retrospective study, 21,264 screening digital breast tomosynthesis (DBT) exams obtained at our institution were collected along with associated radiology reports. The 2D synthetic mammographic images from these exams, with varying resolutions and data set sizes, were used to train a multi-view deep convolutional neural network (MV-CNN) to classify screening images into BI-RADS classes (BI-RADS 0, 1 and 2) before evaluation on a held-out set of exams. Results: Area under the receiver operating characteristic curve (AUC) for BI-RADS 0 vs non-BI-RADS 0 class was 0.912 for the MV-CNN trained on the full dataset. The model obtained accuracy of 84.8%, recall of 95.9% and precision of 95.0%. This AUC value decreased when the same model was trained with 50% and 25% of images (AUC = 0.877, P=0.010 and 0.834, P=0.009 respectively). Also, the performance dropped when the same model was trained using images that were under-sampled by 1/2 and 1/4 (AUC = 0.870, P=0.011 and 0.813, P=0.009 respectively). Conclusion: This deep learning model classified high-resolution synthetic mammography scans into normal vs needing further workup using tens of thousands of high-resolution images.
Smaller training data sets and lower resolution images both caused significant decrease in performance.

Submitted 25 September, 2020; v1 submitted 17 September, 2020; originally announced September 2020.

arXiv:2008.04874 [pdf, other]

Classification of Radio Signals Using Truncated Gaussian Discriminant Analysis of Convolutional Neural Network-Derived Features

Authors: J. B. Persons, Lauren J. Wong, W. Chris Headley, Michael C. Fowler

Abstract: To improve the utility and scalability of distributed radio frequency (RF) sensor and communication networks, reduce the need for convolutional neural network (CNN) retraining, and efficiently share learned information about signals, we examined a supervised bootstrapping approach for RF modulation classification. We show that CNN-bootstrapped features of new and existing modulation classes can be considered as mixtures of truncated Gaussian distributions, allowing for maximum likelihood-based classification of new classes without retraining the network. In this work, the authors observed classification performance using maximum likelihood estimation of CNN-bootstrapped features to be comparable to that of a CNN trained on all classes, even for those classes on which the bootstrapping CNN was not trained. This performance was achieved while reducing the number of parameters needed for new class definition from over 8 million to only 200. Furthermore, some physical features of interest, not directly labeled during training, e.g. signal-to-noise ratio (SNR), can be learned or estimated from these same CNN-derived features. Finally, we show that SNR estimation accuracy is highest when classification accuracy is lowest and therefore can be used to calibrate a confidence in the classification.

Submitted 11 August, 2020; originally announced August 2020.

Comments: Under peer review as of 11 August 2020. 11 pages, 13 figures

arXiv:2003.07482 [pdf, other]

High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model

Authors: Jinyu Li, Rui Zhao, Eric Sun, Jeremy H. M. Wong, Amit Das, Zhong Meng, Yifan Gong

Abstract: While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be significantly improved. In this paper, we detail our recent efforts to improve conventional hybrid LSTM acoustic models for high-accuracy and low-latency automatic speech recognition. To achieve high accuracy, we use a contextual layer trajectory LSTM (cltLSTM), which decouples the temporal modeling and target classification tasks, and incorporates future context frames to get more information for accurate acoustic modeling. We further improve the training strategy with sequence-level teacher-student learning. To obtain low latency, we design a two-head cltLSTM, in which one head has zero latency and the other head has a small latency, compared to an LSTM. When trained with Microsoft's 65 thousand hours of anonymized training data and evaluated with test sets with 1.8 million words, the proposed two-head cltLSTM model with the proposed training strategy yields a 28.2% relative WER reduction over the conventional LSTM acoustic model, with a similar perceived latency.

Submitted 16 March, 2020; originally announced March 2020.

Comments: Accepted by ICASSP 2020

arXiv:1808.02369 [pdf, other]

Emitter Identification Using CNN IQ Imbalance Estimators

Authors: Lauren J. Wong, William C. Headley, Alan J. Michaels

Abstract: Specific Emitter Identification is the association of a received signal to a unique emitter, and is made possible by the naturally occurring and unintentional characteristics an emitter imparts onto each transmission, known as its radio frequency fingerprint. This work presents an approach for identifying emitters using Convolutional Neural Networks to estimate the IQ imbalance parameters of each emitter, using only raw IQ data as input. Because an emitter's IQ imbalance parameters will not change as it changes modulation schemes, the proposed approach has the ability to track emitters, even as they change modulation scheme. The performance of the developed approach is evaluated using simulated quadrature amplitude modulation and phase-shift keying signals, and the impact of signal-to-noise ratio, imbalance value, and modulation scheme are considered. Further, the developed approach is shown to outperform a comparable feature-based approach, while making fewer assumptions and using less data.

Submitted 7 August, 2018; originally announced August 2018.

arXiv:1802.00254 [pdf, ps, other]

Phonetic and Graphemic Systems for Multi-Genre Broadcast Transcription

Authors: Yu Wang, Xie Chen, Mark Gales, Anton Ragni, Jeremy Wong

Abstract: State-of-the-art English automatic speech recognition systems typically use phonetic rather than graphemic lexicons. Graphemic systems are known to perform less well for English as the mapping from the written form to the spoken form is complicated. However, in recent years the representational power of deep-learning based acoustic models has improved, raising interest in graphemic acoustic models for English, due to the simplicity of generating the lexicon. In this paper, phonetic and graphemic models are compared for an English Multi-Genre Broadcast transcription task. A range of acoustic models based on lattice-free MMI training are constructed using phonetic and graphemic lexicons. For this task, it is found that having a long-span temporal history reduces the difference in performance between the two forms of models. In addition, system combination is examined, using parameter smoothing and hypothesis combination. As the combination approaches become more complicated the difference between the phonetic and graphemic systems further decreases. Finally, for all configurations examined the combination of phonetic and graphemic systems yields consistent gains.

Submitted 1 February, 2018; originally announced February 2018.

Comments: 5 pages, 6 tables, to appear in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018)

Showing 1–18 of 18 results for author: Wong, J