-
Radio Resource Management Design for RSMA: Optimization of Beamforming, User Admission, and Discrete/Continuous Rates with Imperfect SIC
Authors:
L. F. Abanto-Leon,
A. Krishnamoorthy,
A. Garcia-Saavedra,
G. H. Sim,
R. Schober,
M. Hollick
Abstract:
This paper investigates the radio resource management (RRM) design for multiuser rate-splitting multiple access (RSMA), accounting for various characteristics of practical wireless systems, such as the use of discrete rates, the inability to serve all users, and imperfect successive interference cancellation (SIC). Failure to consider these characteristics in RRM design may lead to inefficient use of radio resources. Therefore, we formulate the RRM of RSMA as optimization problems that maximize the weighted sum rate (WSR) and the weighted energy efficiency (WEE), respectively, and jointly optimize the beamforming, user admission, and discrete/continuous rates while accounting for imperfect SIC. The resulting problems are nonconvex mixed-integer nonlinear programs that are challenging to solve. Despite this difficulty, we develop algorithms that can find high-quality solutions. We show via simulations that carefully accounting for the aforementioned characteristics can lead to significant gains. Specifically, by considering that transmission rates are discrete, the transmit power can be utilized more intelligently, allocating just enough power to guarantee a given discrete rate. Additionally, we reveal that user admission plays a crucial role in RSMA, enabling additional gains compared to random admission by facilitating the servicing of selected users with mutually beneficial channel characteristics. Furthermore, provisioning for possibly imperfect SIC makes RSMA more robust and reliable.
Submitted 30 April, 2024;
originally announced April 2024.
-
Hierarchical Recurrent Adapters for Efficient Multi-Task Adaptation of Large Speech Models
Authors:
Tsendsuren Munkhdalai,
Youzheng Chen,
Khe Chai Sim,
Fadi Biadsy,
Tara Sainath,
Pedro Moreno Mengibar
Abstract:
Parameter-efficient adaptation methods have become a key mechanism to train large pre-trained models for downstream tasks. However, their per-task parameter overhead is still considered high when the number of downstream tasks to adapt for is large. We introduce an adapter module that is more efficient in large-scale multi-task adaptation scenarios. Our adapter is hierarchical in terms of how the adapter parameters are allocated. The adapter consists of a single shared controller network and multiple task-level adapter heads to reduce the per-task parameter overhead without performance regression on downstream tasks. The adapter is also recurrent, so that all adapter parameters are reused across different layers of the pre-trained model. Our Hierarchical Recurrent Adapter (HRA) outperforms the previous adapter-based approaches as well as a full model fine-tuning baseline in both single and multi-task adaptation settings when evaluated on automatic speech recognition tasks.
Submitted 25 March, 2024;
originally announced March 2024.
-
Multi-Robot Relative Pose Estimation in SE(2) with Observability Analysis: A Comparison of Extended Kalman Filtering and Robust Pose Graph Optimization
Authors:
Kihoon Shin,
Hyunjae Sim,
Seungwon Nam,
Yonghee Kim,
Jae Hu,
Kwang-Ki K. Kim
Abstract:
In this study, we address multi-robot localization issues, with a specific focus on cooperative localization and observability analysis of relative pose estimation. Cooperative localization involves enhancing each robot's information through a communication network and message passing. If odometry data from a target robot can be transmitted to the ego robot, observability of their relative pose estimation can be achieved through range-only or bearing-only measurements, provided both robots have non-zero linear velocities. In cases where odometry data from a target robot are not directly transmitted but estimated by the ego robot, both range and bearing measurements are necessary to ensure observability of relative pose estimation. For ROS/Gazebo simulations, we explore four sensing and communication structures. We compare extended Kalman filtering (EKF) and pose graph optimization (PGO) estimation using different robust loss functions (filtering and smoothing with varying batch sizes of sliding windows) in terms of estimation accuracy. In hardware experiments, two Turtlebot3 robots equipped with UWB modules are used for real-world inter-robot relative pose estimation, applying both EKF and PGO and comparing their performance.
Submitted 4 February, 2024; v1 submitted 27 January, 2024;
originally announced January 2024.
-
Multivessel Coronary Artery Segmentation and Stenosis Localisation using Ensemble Learning
Authors:
Muhammad Bilal,
Dinis Martinho,
Reiner Sim,
Adnan Qayyum,
Hunaid Vohra,
Massimo Caputo,
Taofeek Akinosho,
Sofiat Abioye,
Zaheer Khan,
Waleed Niaz,
Junaid Qadir
Abstract:
Coronary angiography analysis is a common clinical task performed by cardiologists to diagnose coronary artery disease (CAD) through an assessment of atherosclerotic plaque accumulation. This study introduces an end-to-end machine learning solution developed for the MICCAI 2023 Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs (ARCADE) challenge, which aims to benchmark solutions for multivessel coronary artery segmentation and potential stenotic lesion localisation from X-ray coronary angiograms. We adopted a robust baseline model training strategy to progressively improve performance, comprising five successive stages of binary class pretraining, multivessel segmentation, fine-tuning using class frequency weighted dataloaders, fine-tuning using an F1-based curriculum learning strategy (F1-CLS), and finally multi-target angiogram view classifier-based collective adaptation. Unlike many other medical imaging procedures, this task exhibits a notable degree of interobserver variability. Our ensemble model combines the outputs from six baseline models using the weighted ensembling approach, which our analysis shows doubles the predictive accuracy of the proposed solution. The final prediction was further refined, targeting the correction of misclassified blobs. Our solution achieved a mean F1 score of 37.69% for coronary artery segmentation and 39.41% for stenosis localisation, placing our team 5th on both leaderboards. This work demonstrates the potential of automated tools to aid CAD diagnosis, guide interventions, and improve the accuracy of stent injections in clinical settings.
Submitted 27 October, 2023;
originally announced October 2023.
-
Contextual Biasing with the Knuth-Morris-Pratt Matching Algorithm
Authors:
Weiran Wang,
Zelin Wu,
Diamantino Caseiro,
Tsendsuren Munkhdalai,
Khe Chai Sim,
Pat Rondon,
Golan Pundak,
Gan Song,
Rohit Prabhavalkar,
Zhong Meng,
Ding Zhao,
Tara Sainath,
Pedro Moreno Mengibar
Abstract:
Contextual biasing refers to the problem of biasing automatic speech recognition (ASR) systems towards rare entities that are relevant to specific users or application scenarios. We propose algorithms for contextual biasing based on the Knuth-Morris-Pratt algorithm for pattern matching. During beam search, we boost the score of a token extension if it extends matching into a set of biasing phrases. Our method simulates the classical approaches often implemented in the weighted finite state transducer (WFST) framework, but avoids the FST language altogether, with careful consideration of memory footprint and of efficiency on tensor processing units (TPUs) through vectorization. Without introducing additional model parameters, our method achieves significant word error rate (WER) reductions on biasing test sets by itself, and yields further performance gain when combined with a model-based biasing method.
Submitted 29 September, 2023;
originally announced October 2023.
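The matching idea in the abstract above can be illustrated with a minimal sketch of classical Knuth-Morris-Pratt matching over token sequences. This is not the paper's algorithm: the single-phrase setup, the function names, the `per_token_bonus` parameter, and the rule of revoking bonus when a partial match falls back are all assumptions made for illustration, and the sketch is neither vectorized nor tied to beam search.

```python
def failure_table(pattern):
    """Classical KMP failure function: fail[i] is the length of the longest
    proper prefix of pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def advance(pattern, fail, state, token):
    """Return the matched-prefix length after consuming `token`.
    Assumes a non-empty pattern and state < len(pattern)."""
    while state > 0 and token != pattern[state]:
        state = fail[state - 1]
    if token == pattern[state]:
        state += 1
    return state


def biasing_bonuses(pattern, tokens, per_token_bonus=1.0):
    """Hypothetical scoring rule: each token earns a bonus proportional to
    the change in matched-prefix length, so a failed partial match gives
    back the bonus it had accumulated."""
    fail = failure_table(pattern)
    state, out = 0, []
    for tok in tokens:
        new_state = advance(pattern, fail, state, tok)
        out.append(per_token_bonus * (new_state - state))
        # A completed phrase keeps its bonus; matching then restarts.
        state = 0 if new_state == len(pattern) else new_state
    return out
```

For example, `biasing_bonuses(["a", "b", "a", "c"], ["a", "b", "a", "b", "a", "c"])` gives a negative bonus at the fourth token, where the partial match falls back from length 3 to length 2, and positive bonuses elsewhere.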
-
Massive End-to-end Models for Short Search Queries
Authors:
Weiran Wang,
Rohit Prabhavalkar,
Dongseong Hwang,
Qiujia Li,
Khe Chai Sim,
Bo Li,
James Qin,
Xingyu Cai,
Adam Stooke,
Zhong Meng,
CJ Zheng,
Yanzhang He,
Tara Sainath,
Pedro Moreno Mengibar
Abstract:
In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model parameters. The encoders of our models use the neural architecture of Google's universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We perform extensive studies on vocabulary size, time reduction strategy, and its generalization performance on long-form test sets. Despite the speculation that, as the model size increases, CTC can be as good as RNN-T which builds label dependency into the prediction, we observe that a 900M RNN-T clearly outperforms a 1.8B CTC and is more tolerant to severe time reduction, although the WER gap can be largely removed by LM shallow fusion.
Submitted 22 September, 2023;
originally announced September 2023.
-
Improving Speech Recognition for African American English With Audio Classification
Authors:
Shefali Garg,
Zhouyuan Huo,
Khe Chai Sim,
Suzan Schwartz,
Mason Chua,
Alëna Aksënova,
Tsendsuren Munkhdalai,
Levi King,
Darryl Wright,
Zion Mengesha,
Dongseong Hwang,
Tara Sainath,
Françoise Beaufays,
Pedro Moreno Mengibar
Abstract:
Automatic speech recognition (ASR) systems have been shown to have large quality disparities between the language varieties they are intended or expected to recognize. One way to mitigate this is to train or fine-tune models with more representative datasets. But this approach can be hindered by limited in-domain data for training and evaluation. We propose a new way to improve the robustness of a US English short-form speech recognizer using a small amount of out-of-domain (long-form) African American English (AAE) data. We use CORAAL, YouTube and Mozilla Common Voice to train an audio classifier to approximately output whether an utterance is AAE or some other variety including Mainstream American English (MAE). By combining the classifier output with coarse geographic information, we can select a subset of utterances from a large corpus of untranscribed short-form queries for semi-supervised learning at scale. Fine-tuning on this data results in a 38.5% relative word error rate disparity reduction between AAE and MAE without reducing MAE quality.
Submitted 16 September, 2023;
originally announced September 2023.
-
Edit Distance based RL for RNNT decoding
Authors:
Dongseong Hwang,
Changwan Ryu,
Khe Chai Sim
Abstract:
RNN-T is currently considered the industry standard in ASR due to its exceptional WERs in various benchmark tests and its ability to support seamless streaming and long-form transcription. However, its biggest drawback lies in the significant discrepancy between its training and inference objectives. During training, RNN-T maximizes all alignment probabilities by teacher forcing, while during inference, it uses beam search, which may not necessarily find the most probable alignment. Additionally, because RNN-T never experiences mistakes during teacher-forcing training, a mistake that occurs at inference time is more problematic. To address this issue, this paper proposes a Reinforcement Learning method that minimizes the gap between training and inference time. Our Edit Distance based RL (EDRL) approach computes rewards based on the edit distance, and trains the network at every action level. The proposed approach yielded SoTA WERs on LibriSpeech for the 600M Conformer RNN-T model.
Submitted 14 July, 2023; v1 submitted 31 May, 2023;
originally announced June 2023.
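As a rough illustration of an edit-distance-based reward, the sketch below computes the Levenshtein distance between a reference and a hypothesis token sequence and uses its negative as a sequence-level reward. The abstract does not specify the reward shaping or how it is assigned per action, so the `reward` function here is a hypothetical simplification rather than the EDRL formulation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions, deletions,
    and substitutions all cost 1), using a rolling one-row table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def reward(ref, hyp):
    """Hypothetical sequence-level reward: fewer edits means higher reward."""
    return -edit_distance(ref, hyp)
```

For instance, `reward("the cat sat".split(), "the cat sat down".split())` is `-1`, since one inserted word separates the hypothesis from the reference.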
-
Debiased Automatic Speech Recognition for Dysarthric Speech via Sample Reweighting with Sample Affinity Test
Authors:
Eungbeom Kim,
Yunkee Chae,
Jaeheon Sim,
Kyogu Lee
Abstract:
Automatic speech recognition (ASR) systems based on deep learning are mainly trained under empirical risk minimization (ERM). Since ERM utilizes the averaged performance on the data samples regardless of group, such as healthy or dysarthric speakers, ASR systems are unaware of the performance disparities across the groups. This results in biased ASR systems whose performance differences among groups are severe. In this study, we aim to improve the ASR system in terms of group robustness for dysarthric speakers. To achieve our goal, we present a novel approach, sample reweighting with sample affinity test (Re-SAT). Re-SAT systematically measures the debiasing helpfulness of the given data sample and then mitigates the bias by debiasing helpfulness-based sample reweighting. Experimental results demonstrate that Re-SAT contributes to improved ASR performance on dysarthric speech without performance degradation on healthy speech.
Submitted 27 June, 2023; v1 submitted 22 May, 2023;
originally announced May 2023.
-
Efficient Domain Adaptation for Speech Foundation Models
Authors:
Bo Li,
Dongseong Hwang,
Zhouyuan Huo,
Junwen Bai,
Guru Prakash,
Tara N. Sainath,
Khe Chai Sim,
Yu Zhang,
Wei Han,
Trevor Strohman,
Francoise Beaufays
Abstract:
Foundation models (FMs), which are trained on broad data at scale and are adaptable to a wide range of downstream tasks, have attracted great interest in the research community. Benefiting from diverse data sources such as different modalities, languages and application domains, foundation models have demonstrated strong generalization and knowledge transfer capabilities. In this paper, we present a pioneering study towards building an efficient solution for FM-based speech recognition systems. We adopt the recently developed self-supervised BEST-RQ for pretraining, and propose joint finetuning with both source and unsupervised target domain data using JUST Hydra. The FM encoder adapter and decoder are then finetuned to the target domain with a small amount of supervised in-domain data. On a large-scale YouTube and Voice Search task, our method is shown to be both data and model parameter efficient. It achieves the same quality with only 21.6M supervised in-domain data and 130.8M finetuned parameters, compared to the 731.1M model trained from scratch on an additional 300M supervised in-domain data.
Submitted 2 February, 2023;
originally announced February 2023.
-
CRU: A Novel Neural Architecture for Improving the Predictive Performance of Time-Series Data
Authors:
Sunghyun Sim,
Dohee Kim,
Hyerim Bae
Abstract:
The time-series forecasting (TSF) problem is a traditional problem in the field of artificial intelligence. Models such as the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) have contributed to improving the predictive accuracy of TSF. Furthermore, model structures have been proposed that incorporate time-series decomposition methods, such as seasonal-trend decomposition using Loess (STL), to improve predictive accuracy. However, because this approach learns an independent model for each component, it cannot learn the relationships between time-series components. In this study, we propose a new neural architecture called a correlation recurrent unit (CRU) that can perform time-series decomposition within a neural cell and learn correlations (autocorrelation and correlation) between each decomposition component. The proposed neural architecture was evaluated through comparative experiments with previous studies using five univariate time-series datasets and four multivariate time-series datasets. The results showed that long- and short-term predictive performance was improved by more than 10%. The experimental results show that the proposed CRU is an excellent method for TSF problems compared to other neural architectures.
Submitted 6 February, 2023; v1 submitted 29 November, 2022;
originally announced November 2022.
-
Decomposing 3D Neuroimaging into 2+1D Processing for Schizophrenia Recognition
Authors:
Mengjiao Hu,
Xudong Jiang,
Kang Sim,
Juan Helen Zhou,
Cuntai Guan
Abstract:
Deep learning has been successfully applied to recognizing both natural images and medical images. However, there remains a gap in recognizing 3D neuroimaging data, especially for psychiatric diseases such as schizophrenia and depression that have no visible alteration in specific slices. In this study, we propose to process the 3D data with a 2+1D framework so that we can exploit powerful deep 2D Convolutional Neural Networks (CNNs) pre-trained on the huge ImageNet dataset for 3D neuroimaging recognition. Specifically, 3D volumes of Magnetic Resonance Imaging (MRI) metrics (grey matter, white matter, and cerebrospinal fluid) are decomposed into 2D slices according to neighboring voxel positions and fed to 2D CNN models pre-trained on ImageNet to extract feature maps from three views (axial, coronal, and sagittal). Global pooling is applied to remove redundant information, as the activation patterns are sparsely distributed over feature maps. Channel-wise and slice-wise convolutions are proposed to aggregate the contextual information in the third view dimension, which is left unprocessed by the 2D CNN model. Multi-metric and multi-view information are fused for final prediction. Our approach outperforms handcrafted feature-based machine learning, a deep feature approach with a support vector machine (SVM) classifier, and 3D CNN models trained from scratch, with better cross-validation results on the publicly available Northwestern University Schizophrenia Dataset, and the results are replicated on another independent dataset.
Submitted 21 November, 2022; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Resource-Efficient Transfer Learning From Speech Foundation Model Using Hierarchical Feature Fusion
Authors:
Zhouyuan Huo,
Khe Chai Sim,
Bo Li,
Dongseong Hwang,
Tara N. Sainath,
Trevor Strohman
Abstract:
Self-supervised pre-training of a speech foundation model, followed by supervised fine-tuning, has shown impressive quality improvements on automatic speech recognition (ASR) tasks. Fine-tuning separate foundation models for many downstream tasks is expensive since the foundation model is usually very big. Parameter-efficient fine-tuning methods (e.g. adapter, sparse update methods) offer an alternative paradigm where a small set of parameters is updated to adapt the foundation model to new tasks. However, these methods still suffer from a high computational memory cost and slow training speed because they require backpropagation through the entire neural network at each step. In this paper, we analyze the performance of features at different layers of a foundation model on the speech recognition task and propose a novel hierarchical feature fusion method for resource-efficient transfer learning from speech foundation models. Experimental results show that the proposed method can achieve better performance on the speech recognition task than existing algorithms with fewer trainable parameters, less computational memory cost and faster training speed. After combining with Adapters at all layers, the proposed method can achieve the same performance as fine-tuning the whole model with 97% fewer trainable encoder parameters and 53% faster training speed.
Submitted 4 November, 2022;
originally announced November 2022.
-
Exploring Train and Test-Time Augmentations for Audio-Language Learning
Authors:
Eungbeom Kim,
Jinhee Kim,
Yoori Oh,
Kyungsu Kim,
Minju Park,
Jaeheon Sim,
Jinwoo Lee,
Kyogu Lee
Abstract:
In this paper, we aim to unveil the impact of data augmentation in audio-language multi-modal learning, which has not been explored despite its importance. We explore various augmentation methods at both train time and test time and find that proper data augmentation can lead to substantial improvements. Specifically, applying our proposed audio-language paired augmentation PairMix, which is the first multi-modal audio-language augmentation method, outperforms the baselines for both automated audio captioning and audio-text retrieval tasks. To take full advantage of data augmentation, we also present multi-level test-time augmentation (Multi-TTA) for use at test time. We successfully incorporate the two proposed methods and uni-modal augmentations and achieve 47.5 SPIDEr on audio captioning, which is an 18.2% relative increase over the baseline. In audio-text retrieval, the proposed methods also improve performance.
Submitted 23 May, 2023; v1 submitted 31 October, 2022;
originally announced October 2022.
-
Feature Engineering and Classification Models for Partial Discharge in Power Transformers
Authors:
Jonathan Wang,
Kesheng Wu,
Alex Sim,
Seongwook Hwangbo
Abstract:
To ensure reliability, power transformers are monitored for partial discharge (PD) events, which are symptoms of transformer failure. Since failures can have catastrophic cascading consequences, it is critical to preempt them as early as possible. Our goal is to classify PDs as corona, floating, particle, or void, to gain an understanding of the failure location. Using phase-resolved PD signal data, we create a small set of features, which can be used to classify PDs with high accuracy. This set of features consists of the total magnitude, the maximum magnitude, and the length of the longest empty band. These features represent the entire signal and not just a single phase, so the feature set has a fixed size and is easily comprehensible. With both Random Forest and SVM classification methods, we attain a 99% classification accuracy, which is significantly higher than classification using phase-based feature sets such as phase magnitude. Furthermore, we develop a stacking ensemble to combine several classification models, resulting in a superior model that outperforms existing methods in both accuracy and variance.
Submitted 21 October, 2022;
originally announced October 2022.
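The three whole-signal features named in the abstract above can be sketched as follows. The input layout (one magnitude per phase bin) and the reading of "empty band" as a run of consecutive zero-magnitude bins are assumptions made for illustration; the abstract does not define either, and the function name `pd_features` is hypothetical.

```python
def pd_features(magnitudes):
    """Return (total magnitude, maximum magnitude, longest empty band).

    `magnitudes` is assumed to hold one non-negative value per phase bin.
    An "empty band" is interpreted here as a run of consecutive bins whose
    magnitude is exactly zero; the band is not wrapped around the cycle.
    """
    total = sum(magnitudes)
    peak = max(magnitudes, default=0)
    longest = run = 0
    for m in magnitudes:
        run = run + 1 if m == 0 else 0   # extend or reset the current run
        longest = max(longest, run)
    return total, peak, longest
```

Because the output always has three entries regardless of how many phase bins the signal has, the resulting feature vector has a fixed size, which matches the property highlighted in the abstract.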
-
Enemy Spotted: in-game gun sound dataset for gunshot classification and localization
Authors:
Junwoo Park,
Youngwoo Cho,
Gyuhyeon Sim,
Hojoon Lee,
Jaegul Choo
Abstract:
Recently, deep learning-based methods have drawn huge attention in sound classification and localization tasks because they are simple yet achieve high performance without domain knowledge. However, a lack of gun sounds in existing datasets has been a major obstacle to implementing a support system that spots criminals from their gunshots by leveraging deep learning models. Since gunshots are rare and unpredictable, it is impractical to collect gun sounds in the real world. As an alternative, gun sounds can be obtained from an FPS game that is designed to mimic real-world warfare. Recent FPS games offer a realistic environment where we can safely collect gunshot data while simulating even dangerous situations. By exploiting the advantage of the game environment, we construct a gunshot dataset, namely BGG, for the firearm classification and gunshot localization tasks. The BGG dataset consists of 37 different types of firearms, distances, and directions between the sound source and a receiver. We carefully verify that the in-game gunshot data has sufficient information to identify the location and type of gunshots by training several sound classification and localization baselines on the BGG dataset. Afterward, we demonstrate that the accuracy of real-world firearm classification and localization tasks can be enhanced by utilizing the BGG dataset.
Submitted 16 February, 2023; v1 submitted 12 October, 2022;
originally announced October 2022.
-
Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR
Authors:
Dongseong Hwang,
Khe Chai Sim,
Yu Zhang,
Trevor Strohman
Abstract:
Knowledge distillation is an effective machine learning technique to transfer knowledge from a teacher model to a smaller student model, especially with unlabeled data. In this paper, we focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR). Specifically, we compared using soft and hard target distillation to train large-scale RNN-T models on the LibriSpeech/LibriLight public dataset (60k hours) and our in-house data (600k hours). We found that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student. On the other hand, soft target distillation works better in self-training scenarios like iterative large teacher training. For a large model with 0.6B weights, we achieve a new SoTA word error rate (WER) on LibriSpeech (8% relative improvement on dev-other) using Noisy Student Training with soft target distillation. It also allows our production teacher to adapt to new data domains continuously.
Submitted 28 October, 2022; v1 submitted 11 October, 2022;
originally announced October 2022.
-
Large vocabulary speech recognition for languages of Africa: multilingual modeling and self-supervised learning
Authors:
Sandy Ritchie,
You-Chi Cheng,
Mingqing Chen,
Rajiv Mathews,
Daan van Esch,
Bo Li,
Khe Chai Sim
Abstract:
Almost none of the 2,000+ languages spoken in Africa have widely available automatic speech recognition systems, and the required data is also only available for a few languages. We have experimented with two techniques which may provide pathways to large vocabulary speech recognition for African languages: multilingual modeling and self-supervised learning. We gathered available open source data and collected data for 15 languages, and trained experimental models using these techniques. Our results show that pooling the small amounts of data available in multilingual end-to-end models, and pre-training on unsupervised data can help improve speech recognition quality for many African languages.
Submitted 4 October, 2022; v1 submitted 5 August, 2022;
originally announced August 2022.
-
UserLibri: A Dataset for ASR Personalization Using Only Text
Authors:
Theresa Breiner,
Swaroop Ramaswamy,
Ehsan Variani,
Shefali Garg,
Rajiv Mathews,
Khe Chai Sim,
Kilol Gupta,
Mingqing Chen,
Lara McConnaughey
Abstract:
Personalization of speech models on mobile devices (on-device personalization) is an active area of research, but more often than not, mobile devices have more text-only data than paired audio-text data. We explore training a personalized language model on text-only data, used during inference to improve speech recognition performance for that user. We experiment on a user-clustered LibriSpeech corpus, supplemented with personalized text-only data for each user from Project Gutenberg. We release this User-Specific LibriSpeech (UserLibri) dataset to aid future personalization research. LibriSpeech audio-transcript pairs are grouped into 55 users from the test-clean dataset and 52 users from test-other. We are able to lower the average word error rate per user across both sets in streaming and nonstreaming models, including an improvement of 2.5 for the harder set of test-other users when streaming.
Submitted 1 July, 2022;
originally announced July 2022.
-
Analog Self-Interference Cancellation with Practical RF Components for Full-Duplex Radios
Authors:
Jong Woo Kwak,
Min Soo Sim,
In-Woong Kang,
Jaedon Park,
Kai-Kit Wong,
Chan-Byoung Chae
Abstract:
One of the main obstacles in full-duplex radios is analog-to-digital converter (ADC) saturation on a receiver due to the strong self-interference (SI). To solve this issue, researchers have proposed two different types of analog self-interference cancellation (SIC) methods -- i) passive suppression and ii) regeneration-and-subtraction of SI. For the latter case, the tunable RF component, such as a multi-tap circuit, reproduces and subtracts the SI. The resolutions of such RF components constitute the key factor of the analog SIC. Indeed, they are directly related to how well the SI is imitated. Another major issue in analog SIC is the inaccurate estimation of the SI channel due to the nonlinear distortions, which mainly come from the power amplifier (PA). In this paper, we derive a closed-form expression for the SIC performance of the multi-tap circuit; we consider how the RF components must overcome such practical impairments as digitally-controlled attenuators, phase shifters, and PA. For a realistic performance analysis, we exploit the measured PA characteristics and carry out a 3D ray-tracing-based, system-level throughput analysis. Our results confirm that the non-idealities of the RF components significantly affect the analog SIC performance. We believe our study provides insight into the design of the practical full-duplex system.
Submitted 21 June, 2022;
originally announced June 2022.
-
Myocardial Segmentation of Late Gadolinium Enhanced MR Images by Propagation of Contours from Cine MR Images
Authors:
Dong Wei,
Ying Sun,
** Chai,
Adrian Low,
Sim Heng Ong
Abstract:
Automatic segmentation of myocardium in Late Gadolinium Enhanced (LGE) Cardiac MR (CMR) images is often difficult due to the intensity heterogeneity resulting from accumulation of contrast agent in infarcted areas. In this paper, we propose an automatic segmentation framework that fully utilizes shared information between corresponding cine and LGE images of the same patient. Given myocardial contours in cine CMR images, the proposed framework achieves accurate segmentation of LGE CMR images in a coarse-to-fine manner. Affine registration is first performed between the corresponding cine and LGE image pair, followed by nonrigid registration, and finally local deformation of myocardial contours driven by forces derived from local features of the LGE image. Experimental results on real patient data with expert-outlined ground truth show that the proposed framework can generate accurate and reliable results for myocardial segmentation of LGE CMR images.
Submitted 21 May, 2022;
originally announced May 2022.
-
Extract Dynamic Information To Improve Time Series Modeling: a Case Study with Scientific Workflow
Authors:
Jeeyung Kim,
Mengtian **,
Youkow Homma,
Alex Sim,
Wilko Kroeger,
Kesheng Wu
Abstract:
In modeling time series data, we often need to augment the existing data records to increase the modeling accuracy. In this work, we describe a number of techniques to extract dynamic information about the current state of a large scientific workflow, which could be generalized to other types of applications. The specific task to be modeled is the time needed for transferring a file from an experimental facility to a data center. The key idea of our approach is to find recent past data transfer events that match the current event in some ways. Tests showed that we could identify recent events matching some recorded properties and reduce the prediction error by about 12% compared to similar models with only static features. We additionally explored an application-specific technique to extract information about the data production process, and were able to reduce the average prediction error by 44%.
Submitted 19 May, 2022;
originally announced May 2022.
-
Studying Scientific Data Lifecycle in On-demand Distributed Storage Caches
Authors:
Julian Bellavita,
Alex Sim,
Kesheng Wu,
Inder Monga,
Chin Guok,
Frank Würthwein,
Diego Davila
Abstract:
The XRootD system is used to transfer, store, and cache large datasets from high-energy physics (HEP). In this study we focus on its capability as a distributed on-demand storage cache. Through exploring a large set of daily log files between 2020 and 2021, we seek to understand the data access patterns that might inform future cache design. Our study begins with a set of summary statistics regarding file read operations, file lifetimes, and file transfers. We observe that the number of read operations on each file remains nearly constant, while the average size of a read operation grows over time. Furthermore, files tend to have a consistent length of time during which they remain open and are in use. Based on this comprehensive study of the cache access statistics, we developed a cache simulator to explore the behavior of caches of different sizes. Within a certain size range, we find that increasing the XRootD cache size improves the cache hit rate, yielding faster overall file access. In particular, we find that increasing the cache size from 40TB to 56TB could increase the hit rate from 0.62 to 0.89, which is a significant increase in cache effectiveness for a modest cost.
Submitted 11 May, 2022;
originally announced May 2022.
-
Sequential Parametric Optimization for Rate-Splitting Precoding in Non-Orthogonal Unicast and Multicast Transmissions
Authors:
Luis F. Abanto-Leon,
Matthias Hollick,
Bruno Clerckx,
Gek Hong Sim
Abstract:
This paper investigates rate-splitting (RS) precoding for non-orthogonal unicast and multicast (NOUM) transmissions using fully-digital and hybrid precoders. We study the nonconvex weighted sum-rate (WSR) maximization problem subject to a multicast requirement. We propose FALCON, an approach based on sequential parametric optimization, to solve the aforementioned problem. We show that FALCON converges to a local optimum without requiring judicious selection of an initial feasible point. Besides, we show through simulations that by leveraging RS, hybrid precoders can attain nearly the same performance as their fully-digital counterparts under certain specific settings.
Submitted 25 January, 2022;
originally announced January 2022.
-
RadiOrchestra: Proactive Management of Millimeter-wave Self-backhauled Small Cells via Joint Optimization of Beamforming, User Association, Rate Selection, and Admission Control
Authors:
L. F. Abanto-Leon,
A. Asadi,
G. H. Sim,
A. Garcia-Saavedra,
M. Hollick
Abstract:
Millimeter-wave self-backhauled small cells are a key component of next-generation wireless networks. Their dense deployment will increase data rates, reduce latency, and enable efficient data transport between the access and backhaul networks, providing greater flexibility not previously possible with optical fiber. Despite their high potential, operating dense self-backhauled networks optimally is an open challenge, particularly for radio resource management (RRM). This paper presents RadiOrchestra, a holistic RRM framework that models and optimizes beamforming, rate selection as well as user association and admission control for self-backhauled networks. The framework is designed to account for practical challenges such as hardware limitations of base stations (e.g., computational capacity, discrete rates), the need for adaptability of backhaul links, and the presence of interference. Our framework is formulated as a nonconvex mixed-integer nonlinear program, which is challenging to solve. To approach this problem, we propose three algorithms that provide a trade-off between complexity and optimality. Furthermore, we derive upper and lower bounds to characterize the performance limits of the system. We evaluate the developed strategies in various scenarios, showing the feasibility of deploying practical self-backhauling in future networks.
Submitted 13 July, 2022; v1 submitted 25 January, 2022;
originally announced January 2022.
-
Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems through Stochastic Contraction
Authors:
Hyungjin Chung,
Byeongsu Sim,
Jong Chul Ye
Abstract:
Diffusion models have recently attained significant interest within the community owing to their strong performance as generative models. Furthermore, their application to inverse problems has demonstrated state-of-the-art performance. Unfortunately, diffusion models have a critical downside - they are inherently slow to sample from, needing a few thousand steps of iteration to generate images from pure Gaussian noise. In this work, we show that starting from Gaussian noise is unnecessary. Instead, starting from a single forward diffusion with better initialization significantly reduces the number of sampling steps in the reverse conditional diffusion. This phenomenon is formally explained by the contraction theory of the stochastic difference equations like our conditional diffusion strategy - the alternating applications of reverse diffusion followed by a non-expansive data consistency step. The new sampling strategy, dubbed Come-Closer-Diffuse-Faster (CCDF), also reveals a new insight on how the existing feed-forward neural network approaches for inverse problems can be synergistically combined with the diffusion models. Experimental results with super-resolution, image inpainting, and compressed sensing MRI demonstrate that our method can achieve state-of-the-art reconstruction performance at significantly reduced sampling steps.
Submitted 19 March, 2022; v1 submitted 8 December, 2021;
originally announced December 2021.
-
Joint Unsupervised and Supervised Training for Multilingual ASR
Authors:
Junwen Bai,
Bo Li,
Yu Zhang,
Ankur Bapna,
Nikhil Siddhartha,
Khe Chai Sim,
Tara N. Sainath
Abstract:
Self-supervised training has shown promising gains in pretraining models and facilitating the downstream finetuning for speech recognition, like multilingual ASR. Most existing methods adopt a 2-stage scheme where the self-supervised loss is optimized in the first pretraining stage, and the standard supervised finetuning resumes in the second stage. In this paper, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method to combine the supervised RNN-T loss and the self-supervised contrastive and masked language modeling (MLM) losses. We validate its performance on the public dataset Multilingual LibriSpeech (MLS), which includes 8 languages and is extremely imbalanced. On MLS, we explore (1) JUST trained from scratch, and (2) JUST finetuned from a pretrained checkpoint. Experiments show that JUST can consistently outperform other existing state-of-the-art methods, and beat the monolingual baseline by a significant margin, demonstrating JUST's capability of handling low-resource languages in multilingual ASR. Our average WER of all languages outperforms the average monolingual baseline by 33.3%, and the state-of-the-art 2-stage XLSR by 32%. On low-resource languages like Polish, our WER is less than half of the monolingual baseline and even beats the supervised transfer learning method which uses external supervision.
Submitted 15 November, 2021;
originally announced November 2021.
-
Fast Contextual Adaptation with Neural Associative Memory for On-Device Personalized Speech Recognition
Authors:
Tsendsuren Munkhdalai,
Khe Chai Sim,
Angad Chandorkar,
Fan Gao,
Mason Chua,
Trevor Strohman,
Françoise Beaufays
Abstract:
Fast contextual adaptation has been shown to be effective in improving Automatic Speech Recognition (ASR) of rare words, and when combined with on-device personalized training, it can yield an even better recognition result. However, the traditional re-scoring approaches based on an external language model are prone to diverge during the personalized training. In this work, we introduce a model-based end-to-end contextual adaptation approach that is decoder-agnostic and amenable to on-device personalization. Our on-device simulation experiments demonstrate that the proposed approach outperforms the traditional re-scoring technique by 12% relative WER and 15.7% entity mention specific F1-score in a continuous personalization scenario.
Submitted 6 October, 2021; v1 submitted 4 October, 2021;
originally announced October 2021.
-
Large-scale ASR Domain Adaptation using Self- and Semi-supervised Learning
Authors:
Dongseong Hwang,
Ananya Misra,
Zhouyuan Huo,
Nikhil Siddhartha,
Shefali Garg,
David Qiu,
Khe Chai Sim,
Trevor Strohman,
Françoise Beaufays,
Yanzhang He
Abstract:
Self- and semi-supervised learning methods have been actively investigated to reduce labeled training data or enhance the model performance. However, these approaches mostly focus on in-domain performance for public datasets. In this study, we utilize the combination of self- and semi-supervised learning methods to solve the unseen domain adaptation problem in a large-scale production setting for an online ASR model. This approach demonstrates that using the source domain data with a small fraction of the target domain data (3%) can recover the performance gap compared to a full data baseline: relative 13.5% WER improvement for target domain data.
Submitted 15 February, 2022; v1 submitted 30 September, 2021;
originally announced October 2021.
-
Incremental Layer-wise Self-Supervised Learning for Efficient Speech Domain Adaptation On Device
Authors:
Zhouyuan Huo,
Dongseong Hwang,
Khe Chai Sim,
Shefali Garg,
Ananya Misra,
Nikhil Siddhartha,
Trevor Strohman,
Françoise Beaufays
Abstract:
Streaming end-to-end speech recognition models have been widely applied to mobile devices and show significant improvement in efficiency. These models are typically trained on the server using transcribed speech data. However, the server data distribution can be very different from the data distribution on user devices, which could affect the model performance. There are two main challenges for on-device training: limited reliable labels and limited training memory. While self-supervised learning algorithms can mitigate the mismatch between domains using unlabeled data, they are not applicable on mobile devices directly because of the memory constraint. In this paper, we propose an incremental layer-wise self-supervised learning algorithm for efficient speech domain adaptation on mobile devices, in which only one layer is updated at a time. Extensive experimental results demonstrate that the proposed algorithm obtains a Word Error Rate (WER) on the target domain $24.2\%$ better than the supervised baseline and costs $89.7\%$ less training memory than the end-to-end self-supervised learning algorithm.
Submitted 30 September, 2021;
originally announced October 2021.
-
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
Authors:
Yu Zhang,
Daniel S. Park,
Wei Han,
James Qin,
Anmol Gulati,
Joel Shor,
Aren Jansen,
Yuanzhong Xu,
Yanping Huang,
Shibo Wang,
Zongwei Zhou,
Bo Li,
Min Ma,
William Chan,
Jiahui Yu,
Yongqiang Wang,
Liangliang Cao,
Khe Chai Sim,
Bhuvana Ramabhadran,
Tara N. Sainath,
Françoise Beaufays,
Zhifeng Chen,
Quoc V. Le,
Chung-Cheng Chiu,
Ruoming Pang
, et al. (1 additional authors not shown)
Abstract:
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.
Submitted 21 July, 2022; v1 submitted 27 September, 2021;
originally announced September 2021.
-
BEAMWAVE: Cross-Layer Beamforming and Scheduling for Superimposed Transmissions in Industrial IoT mmWave Networks
Authors:
Luis F. Abanto-Leon,
Matthias Hollick,
Gek Hong Sim
Abstract:
The omnipresence of IoT devices in Industry 4.0 is expected to foster higher reliability, safety, and efficiency. However, interconnecting a large number of wireless devices without jeopardizing the system performance proves challenging. To address the requirements of future industries, we investigate the cross-layer design of beamforming and scheduling for layered-division multiplexing (LDM) systems in millimeter-wave bands. Scheduling is crucial as the devices in industrial settings are expected to proliferate rapidly. Also, highly performant beamforming is necessary to ensure scalability. By adopting LDM, multiple transmissions can be non-orthogonally superimposed. Specifically, we consider a superior-importance control multicast message required to be ubiquitous to all devices and inferior-importance private unicast messages targeting a subset of scheduled devices. Due to NP-hardness, we propose BEAMWAVE, which decomposes the problem into beamforming and scheduling. Through simulations, we show that BEAMWAVE attains near-optimality and outperforms other competing schemes.
Submitted 9 August, 2021;
originally announced August 2021.
-
On-Device Personalization of Automatic Speech Recognition Models for Disordered Speech
Authors:
Katrin Tomanek,
Françoise Beaufays,
Julie Cattiau,
Angad Chandorkar,
Khe Chai Sim
Abstract:
While current state-of-the-art Automatic Speech Recognition (ASR) systems achieve high accuracy on typical speech, they suffer from significant performance degradation on disordered speech and other atypical speech patterns. Personalization of ASR models, a commonly applied solution to this problem, is usually performed in a server-based training environment posing problems around data privacy, delayed model-update times, and communication cost for copying data and models between mobile device and server infrastructure. In this paper, we present an approach to on-device based ASR personalization with very small amounts of speaker-specific data. We test our approach on a diverse set of 100 speakers with disordered speech and find median relative word error rate improvement of 71% with only 50 short utterances required per speaker. When tested on a voice-controlled home automation platform, on-device personalized models show a median task success rate of 81%, compared to only 40% of the unadapted models.
Submitted 18 June, 2021;
originally announced June 2021.
-
An Open-Source Low-Cost Mobile Robot System with an RGB-D Camera and Efficient Real-Time Navigation Algorithm
Authors:
Taekyung Kim,
Seunghyun Lim,
Gwanjun Shin,
Geonhee Sim,
Dongwon Yun
Abstract:
Currently, mobile robots are developing rapidly and are finding numerous applications in the industry. However, several problems remain related to their practical use, such as the need for expensive hardware and high power consumption levels. In this study, we build a low-cost indoor mobile robot platform that does not include a LiDAR or a GPU. Then, we design an autonomous navigation architecture that guarantees real-time performance on our platform with an RGB-D camera and a low-end off-the-shelf single board computer. The overall system includes SLAM, global path planning, ground segmentation, and motion planning. The proposed ground segmentation approach extracts a traversability map from raw depth images for the safe driving of low-body mobile robots. We apply both rule-based and learning-based navigation policies using the traversability map. Running sensor data processing and other autonomous driving components simultaneously, our navigation policies perform rapidly at a refresh rate of 18 Hz for control command, whereas other systems have slower refresh rates. Our methods show better performance than current state-of-the-art navigation approaches within limited computation resources as shown in 3D simulation tests. In addition, we demonstrate the applicability of our mobile robot system through successful autonomous driving in an indoor environment.
Submitted 13 December, 2022; v1 submitted 4 March, 2021;
originally announced March 2021.
-
Unpaired Deep Learning for Accelerated MRI using Optimal Transport Driven CycleGAN
Authors:
Gyutaek Oh,
Byeongsu Sim,
Hyungjin Chung,
Leonard Sunwoo,
Jong Chul Ye
Abstract:
Recently, deep learning approaches for accelerated MRI have been extensively studied thanks to their high performance reconstruction in spite of significantly reduced runtime complexity. These neural networks are usually trained in a supervised manner, so matched pairs of subsampled and fully sampled k-space data are required. Unfortunately, it is often difficult to acquire matched fully sampled k-space data, since the acquisition of fully sampled k-space data requires long scan time and often leads to the change of the acquisition protocol. Therefore, unpaired deep learning without matched label data has become a very important research topic. In this paper, we propose an unpaired deep learning approach using an optimal transport driven cycle-consistent generative adversarial network (OT-cycleGAN) that employs a single pair of generator and discriminator. The proposed OT-cycleGAN architecture is rigorously derived from a dual formulation of the optimal transport formulation using a specially designed penalized least squares cost. The experimental results show that our method can reconstruct high resolution MR images from accelerated k-space data from both single and multiple coil acquisition, without requiring matched reference data.
Submitted 29 August, 2020;
originally announced August 2020.
-
SWAN: Swarm-Based Low-Complexity Scheme for PAPR Reduction
Authors:
Luis F. Abanto-Leon,
Gek Hong Sim,
Matthias Hollick,
Amnart Boonkajay,
Fumiyuki Adachi
Abstract:
Cyclically shifted partial transmit sequences (CS-PTS) has conventionally been used in SISO systems for PAPR reduction of OFDM signals. Compared to other techniques, CS-PTS attains superior performance. Nevertheless, due to the exhaustive search requirement, it demands excessive computational complexity. In this paper, we adapt CS-PTS to operate in a MIMO framework, where singular value decomposition (SVD) precoding is employed. We also propose SWAN, a novel optimization method based on swarm intelligence to circumvent the exhaustive search. SWAN not only provides a significant reduction in computational complexity, but it also attains a fair balance between optimality and complexity. Through simulations, we show that SWAN achieves near-optimal performance at a much lower complexity than other competing approaches.
Submitted 15 September, 2020; v1 submitted 17 August, 2020;
originally announced August 2020.
-
Breaking Moravec's Paradox: Visual-Based Distribution in Smart Fashion Retail
Authors:
Shin Woong Sung,
Hyunsuk Baek,
Hyeonjun Sim,
Eun Hie Kim,
Hyunwoo Hwangbo,
Young Jae Jang
Abstract:
In this paper, we report an industry-academia collaborative study on the distribution method of fashion products using an artificial intelligence (AI) technique combined with an optimization method. To meet the current fashion trend of short product lifetimes and an increasing variety of styles, the company produces limited volumes of a large variety of styles. However, due to the limited volume of each style, some styles may not be distributed to some off-line stores. As a result, this high-variety, low-volume strategy presents another challenge to distribution managers. We collaborated with KOLON F/C, one of the largest fashion business units in South Korea, to develop models and an algorithm to optimally distribute the products to the stores based on the visual images of the products. The team developed a deep learning model that effectively represents the styles of clothes based on their visual image. Moreover, the team created an optimization model that effectively determines the product mix for each store based on the image representation of clothes. In the past, computers were only considered to be useful for conducting logical calculations, and visual perception and cognition were considered to be difficult computational tasks. The proposed approach is significant in that it uses both AI (perception and cognition) and mathematical optimization (logical calculation) to address a practical supply chain problem, which is why the study was called "Breaking Moravec's Paradox."
Submitted 9 July, 2020;
originally announced July 2020.
-
Brain MRI-based 3D Convolutional Neural Networks for Classification of Schizophrenia and Controls
Authors:
Mengjiao Hu,
Kang Sim,
Juan Helen Zhou,
Xudong Jiang,
Cuntai Guan
Abstract:
Convolutional Neural Network (CNN) has been successfully applied to the classification of both natural images and medical images but has not yet been applied to differentiating patients with schizophrenia from healthy controls. Given the subtle, mixed, and sparsely distributed brain atrophy patterns of schizophrenia, the capability of automatic feature learning makes CNN a powerful tool for classifying schizophrenia from controls as it removes the subjectivity in selecting relevant spatial features. To examine the feasibility of applying CNN to classification of schizophrenia and controls based on structural Magnetic Resonance Imaging (MRI), we built 3D CNN models with different architectures and compared their performance with a handcrafted feature-based machine learning approach. Support vector machine (SVM) was used as the classifier and Voxel-based Morphometry (VBM) was used as the feature for handcrafted feature-based machine learning. 3D CNN models with sequential architecture, inception module and residual module were trained from scratch. CNN models achieved higher cross-validation accuracy than handcrafted feature-based machine learning. Moreover, when tested on an independent dataset, 3D CNN models greatly outperformed handcrafted feature-based machine learning. This study underscored the potential of CNN for identifying patients with schizophrenia using 3D brain MR images and paved the way for imaging-based individual-level diagnosis and prognosis in psychiatric disorders.
Submitted 14 March, 2020;
originally announced March 2020.
-
HydraWave: Multi-Group Multicast Hybrid Precoding and Low-Latency Scheduling for Ubiquitous Industry 4.0 mmWave Communication
Authors:
Luis F. Abanto-Leon,
Matthias Hollick,
Gek Hong Sim
Abstract:
Industry 4.0 anticipates massive interconnectivity of industrial devices (e.g., sensors, actuators) to support factory automation and production. Due to the rigidity of wired connections to harmonize with automation, wireless information transfer has attracted substantial attention. However, existing solutions for the manufacturing sector face critical issues in coping with the key performance demands: ultra-low latency, high throughput, and high reliability. Besides, recent advancements in wireless millimeter-wave technology advocate hybrid precoding with affordable hardware and outstanding spatial multiplexing performance. Thus, we present HYDRAWAVE -- a new paradigm that contemplates the joint design of group scheduling and hybrid precoding for multi-group multicasting to support ubiquitous low-latency communications. Our hybrid precoder, based on semidefinite relaxation and Cholesky matrix factorization, facilitates the robust design of the constant-modulus phase shifts rendering formidable performance at a fraction of the power required by fully-digital precoders. Further, our novel group scheduling formulation minimizes the number of scheduling windows while accounting for the channel correlation of the co-scheduled multicast receivers. Compared to exhaustive search, which renders the optimal scheduling at high overhead, HYDRAWAVE incurs only 9.5% more delay. Notably, HYDRAWAVE attains up to 102% gain when compared to the other benchmarked schemes.
Submitted 2 September, 2020; v1 submitted 3 February, 2020;
originally announced February 2020.
-
Fairness-Aware Hybrid Precoding for mmWave NOMA Unicast/Multicast Transmissions in Industrial IoT
Authors:
Luis F. Abanto-Leon,
Gek Hong Sim
Abstract:
This paper investigates dual-layer non-orthogonally superimposed transmissions for industrial internet of things (IoT) millimeter-wave communications. Essentially, the overlayer is a ubiquitous multicast signal devised to serve all the devices in coverage with a common message, i.e., critical control packet. The underlayer is a composite signal that consists of private unicast messages. Due to safety implications, it is critical that all devices can decode the multicast information. To ensure this requirement, we jointly optimize the hybrid precoder, analog combiners, power allocation, and fairness. Specifically, we incorporate a power splitting constraint between the two overlaid signals and enforce supplementary per-device constraints to guarantee multicast fairness. Performance is evaluated in terms of the spectral efficiency, multicast fairness, and bit error rate, thus corroborating the feasibility of our proposed scheme.
Submitted 27 February, 2020; v1 submitted 3 February, 2020;
originally announced February 2020.
-
Learning-based Max-Min Fair Hybrid Precoding for mmWave Multicasting
Authors:
Luis F. Abanto-Leon,
Gek Hong Sim
Abstract:
This paper investigates the joint design of hybrid transmit precoder and analog receive combiners for single-group multicasting in millimeter-wave systems. We propose LB-GDM, a low-complexity learning-based approach that leverages gradient descent with momentum and alternating optimization to design (i) the digital and analog constituents of a hybrid transmitter and (ii) the analog combiners of each receiver. In addition, we extend our proposed approach to design fully-digital precoders. We show through numerical evaluation that implementing LB-GDM in either hybrid or digital precoders attains superlative performance compared to competing designs based on semidefinite relaxation. Specifically, in terms of minimum signal-to-noise ratio, we report a remarkable improvement with gains of up to 105% and 101% for the fully-digital and hybrid precoders, respectively.
Submitted 27 February, 2020; v1 submitted 3 February, 2020;
originally announced February 2020.
-
Low-rank Gradient Approximation For Memory-Efficient On-device Training of Deep Neural Network
Authors:
Mary Gooneratne,
Khe Chai Sim,
Petr Zadrazil,
Andreas Kabel,
Françoise Beaufays,
Giovanni Motta
Abstract:
Training machine learning models on mobile devices has the potential of improving both privacy and accuracy of the models. However, one of the major obstacles to achieving this goal is the memory limitation of mobile devices. Reducing training memory enables models with high-dimensional weight matrices, like automatic speech recognition (ASR) models, to be trained on-device. In this paper, we propose approximating the gradient matrices of deep neural networks using a low-rank parameterization as an avenue to save training memory. The low-rank gradient approximation enables more advanced, memory-intensive optimization techniques to be run on device. Our experimental results show that we can reduce the training memory by about 33.0% for Adam optimization. It uses comparable memory to momentum optimization and achieves a 4.5% relative lower word error rate on an ASR personalization task.
Submitted 24 January, 2020;
originally announced January 2020.
-
Personalization of End-to-end Speech Recognition On Mobile Devices For Named Entities
Authors:
Khe Chai Sim,
Françoise Beaufays,
Arnaud Benard,
Dhruv Guliani,
Andreas Kabel,
Nikhil Khare,
Tamar Lucassen,
Petr Zadrazil,
Harry Zhang,
Leif Johnson,
Giovanni Motta,
Lillian Zhou
Abstract:
We study the effectiveness of several techniques to personalize end-to-end speech models and improve the recognition of proper names relevant to the user. These techniques differ in the amounts of user effort required to provide supervision, and are evaluated on how they impact speech recognition performance. We propose using keyword-dependent precision and recall metrics to measure vocabulary acquisition performance. We evaluate the algorithms on a dataset that we designed to contain names of persons that are difficult to recognize. Therefore, the baseline recall rate for proper names in this dataset is very low: 2.4%. A data synthesis approach we developed brings it to 48.6%, with no need for speech input from the user. With speech input, if the user corrects only the names, the name recall rate improves to 64.4%. If the user corrects all the recognition errors, we achieve the best recall of 73.5%. To eliminate the need to upload user data and store personalized models on a server, we focus on performing the entire personalization workflow on a mobile device.
Submitted 14 December, 2019;
originally announced December 2019.
-
6G Massive Radio Access Networks: Key Issues, Technologies, and Future Challenges
Authors:
Ying Loong Lee,
Donghong Qin,
Li-Chun Wang,
Gek Hong Sim
Abstract:
Driven by the emerging use cases in massive access future networks, there is a need for technological advancements and evolutions for wireless communications beyond the fifth-generation (5G) networks. In particular, we envisage the upcoming sixth-generation (6G) networks to consist of numerous devices demanding extremely high-performance interconnections even under strenuous scenarios such as diverse mobility, extreme density, and dynamic environment. To cater for such a demand, investigation on flexible and sustainable radio access network (RAN) techniques capable of supporting highly diverse requirements and massive connectivity is of utmost importance. To this end, this paper first outlines the key driving applications for 6G, including smart city and factory, which trigger the transformation of existing RAN techniques. We then examine and provide in-depth discussions on several critical performance requirements (i.e., the level of flexibility, the support for massive interconnectivity, and energy efficiency), issues, enabling technologies, and challenges in designing 6G massive RANs. We conclude the article by providing several artificial-intelligence-based approaches to overcome future challenges.
Submitted 23 October, 2019;
originally announced October 2019.
-
Optimal Transport driven CycleGAN for Unsupervised Learning in Inverse Problems
Authors:
Byeongsu Sim,
Gyutaek Oh,
Jeongsol Kim,
Chanyong Jung,
Jong Chul Ye
Abstract:
To improve the performance of the classical generative adversarial network (GAN), the Wasserstein generative adversarial network (W-GAN) was developed as a Kantorovich dual formulation of the optimal transport (OT) problem using Wasserstein-1 distance. However, it was not clear how cycleGAN-type generative models can be derived from the optimal transport theory. Here we show that a novel cycleGAN architecture can be derived as a Kantorovich dual OT formulation if a penalized least square (PLS) cost with deep learning-based inverse path penalty is used as a transportation cost. One of the most important advantages of this formulation is that depending on the knowledge of the forward problem, distinct variations of cycleGAN architecture can be derived: for example, one with two pairs of generators and discriminators, and the other with only a single pair of generator and discriminator. Even for the two-generator case, we show that the structural knowledge of the forward operator can lead to a simpler generator architecture which significantly simplifies the neural network training. The new cycleGAN formulation, which we call the OT-cycleGAN, has been applied to various biomedical imaging problems, such as accelerated magnetic resonance imaging (MRI), super-resolution microscopy, and low-dose x-ray computed tomography (CT). Experimental results confirm the efficacy and flexibility of the theory.
Submitted 30 August, 2020; v1 submitted 25 September, 2019;
originally announced September 2019.
-
An Investigation Into On-device Personalization of End-to-end Automatic Speech Recognition Models
Authors:
Khe Chai Sim,
Petr Zadrazil,
Françoise Beaufays
Abstract:
Speaker-independent speech recognition systems trained with data from many users are generally robust against speaker variability and work well for a large population of speakers. However, these systems do not always generalize well for users with very different speech characteristics. This issue can be addressed by building personalized systems that are designed to work well for each specific user. In this paper, we investigate the idea of securely training personalized end-to-end speech recognition models on mobile devices so that user data and models never leave the device and are never stored on a server. We study how the mobile training environment impacts performance by simulating on-device data consumption. We conduct experiments using data collected from speech-impaired users for personalization. Our results show that personalization achieved 63.7% relative word error rate reduction when trained in a server environment and 58.1% in a mobile environment. Moving to on-device personalization resulted in 18.7% performance degradation, in exchange for improved scalability and data privacy. To train the model on device, we split the gradient computation into two and achieved 45% memory reduction at the expense of 42% increase in training time.
Submitted 14 September, 2019;
originally announced September 2019.
-
Hybrid Precoding for Multi-Group Multicasting in mmWave Systems
Authors:
Luis F. Abanto-Leon,
Matthias Hollick,
Gek Hong Sim
Abstract:
Multicast beamforming is known to improve spectral efficiency. However, its benefits and challenges for hybrid precoder design in millimeter-wave (mmWave) systems remain understudied. To this end, this paper investigates the first joint design of hybrid transmit precoders (with an arbitrary number of finite-resolution phase shifts) and receive combiners for mmWave multi-group multicasting. Our proposed design leverages semidefinite relaxation (SDR), alternating optimization and Cholesky matrix factorization to sequentially optimize the digital/analog precoders at the transmitter and the combiners at each receiver. By considering receivers with multiple-antenna architecture, our design remarkably improves the overall system performance. Specifically, with only two receive antennas the average transmit power per received message improves by 16.8% while the successful information reception is boosted by 60%. We demonstrate by means of extensive simulations that our hybrid precoder design performs very close to its fully-digital counterpart even under challenging scenarios (i.e., when co-located users belong to distinct multicast groups).
Submitted 3 February, 2020; v1 submitted 7 August, 2019;
originally announced August 2019.
-
A Comparative Study of Analog/Digital Self-Interference Cancellation for Full Duplex Radios
Authors:
Jong Woo Kwak,
Min Soo Sim,
In-Woong Kang,
Jong Sung Park,
Jaedon Park,
Chan-Byoung Chae
Abstract:
Self-interference (SI) is the main obstacle to full-duplex radios. To overcome the SI, researchers have proposed several analog and digital domain self-interference cancellation (SIC) techniques. How well the digital cancellation works depends on the results of analog cancellation. Therefore, to analyze overall SIC performance, one should do so in an integrated manner. In this paper, we build a simulator that can analyze the performance of analog and digital SIC techniques. Through this simulator, we can analyze the overall SIC performance within various system parameters such as the resolution of an analog-to-digital converter (ADC) and/or nonlinearity of a power amplifier (PA). With our simulator, we expect that configurations and tuning algorithms of an active analog canceller can be optimized before real hardware implementation.
Submitted 23 May, 2019;
originally announced May 2019.
-
Inter-Patient ECG Classification with Convolutional and Recurrent Neural Networks
Authors:
Li Guo,
Gavin Sim,
Bogdan Matuszewski
Abstract:
The recent advances in ECG sensor devices provide opportunities for user self-managed auto-diagnosis and monitoring services over the internet. This imposes the requirements for generic ECG classification methods that are inter-patient and device independent. In this paper, we present our work on using the densely connected convolutional neural network (DenseNet) and gated recurrent unit network (GRU) for addressing the inter-patient ECG classification problem. A deep learning model architecture is proposed and is evaluated using the MIT-BIH Arrhythmia and Supraventricular Databases. The results obtained show that without applying any complicated data pre-processing or feature engineering methods, both of our models have considerably outperformed the state-of-the-art performance for supraventricular (SVEB) and ventricular (VEB) arrhythmia classifications on the unseen testing dataset (with the F1 score improved from 51.08 to 61.25 for SVEB detection and from 88.59 to 89.75 for VEB detection respectively). As no patient-specific or device-specific information is used at the training stage in this work, it can be considered as a more generic approach for dealing with scenarios in which varieties of ECG signals are collected from different patients using different types of sensor devices.
Submitted 27 September, 2018;
originally announced October 2018.
-
Toward domain-invariant speech recognition via large scale training
Authors:
Arun Narayanan,
Ananya Misra,
Khe Chai Sim,
Golan Pundak,
Anshuman Tripathi,
Mohamed Elfeky,
Parisa Haghani,
Trevor Strohman,
Michiel Bacchiani
Abstract:
Current state-of-the-art automatic speech recognition systems are trained to work in specific `domains', defined based on factors like application, sampling rate and codec. When such recognizers are used in conditions that do not match the training domain, performance significantly drops. This work explores the idea of building a single domain-invariant model for varied use-cases by combining large scale training data from multiple application domains. Our final system is trained using 162,000 hours of speech. Additionally, each utterance is artificially distorted during training to simulate effects like background noise, codec distortion, and sampling rates. Our results show that, even at such a scale, a model thus trained works almost as well as those fine-tuned to specific subsets: A single model can be robust to multiple application domains, and variations like codecs and noise. More importantly, such models generalize better to unseen conditions and allow for rapid adaptation -- we show that by using as little as 10 hours of data from a new domain, an adapted domain-invariant model can match performance of a domain-specific model trained from scratch using 70 times as much data. We also highlight some of the limitations of such models and areas that need addressing in future work.
Submitted 15 August, 2018;
originally announced August 2018.