Search | arXiv e-print repository

TransformerFAM: Feedback attention is working memory

Authors: Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, Khe Chai Sim, Pedro Moreno Mengibar

Abstract: While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models. Our experiments show that TransformerFAM significantly improves Transformer performance on long-context tasks across various model sizes (1B, 8B, and 24B). These results showcase the potential to empower Large Language Models (LLMs) to process sequences of unlimited length.

Submitted 7 May, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

Comments: 26 pages, 12 figures, 14 tables

arXiv:2403.19709 [pdf, other]

Hierarchical Recurrent Adapters for Efficient Multi-Task Adaptation of Large Speech Models

Authors: Tsendsuren Munkhdalai, Youzheng Chen, Khe Chai Sim, Fadi Biadsy, Tara Sainath, Pedro Moreno Mengibar

Abstract: Parameter efficient adaptation methods have become a key mechanism to train large pre-trained models for downstream tasks. However, their per-task parameter overhead is considered still high when the number of downstream tasks to adapt for is large. We introduce an adapter module that has a better efficiency in large scale multi-task adaptation scenario. Our adapter is hierarchical in terms of how the adapter parameters are allocated. The adapter consists of a single shared controller network and multiple task-level adapter heads to reduce the per-task parameter overhead without performance regression on downstream tasks. The adapter is also recurrent so the entire adapter parameters are reused across different layers of the pre-trained model. Our Hierarchical Recurrent Adapter (HRA) outperforms the previous adapter-based approaches as well as full model fine-tuning baseline in both single and multi-task adaptation settings when evaluated on automatic speech recognition tasks.

Submitted 25 March, 2024; originally announced March 2024.

Comments: 5 pages, 3 figures, 5 tables

arXiv:2403.05530 [pdf, other]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2310.04627 [pdf, other]

Profit: Benchmarking Personalization and Robustness Trade-off in Federated Prompt Tuning

Authors: Liam Collins, Shanshan Wu, Sewoong Oh, Khe Chai Sim

Abstract: In many applications of federated learning (FL), clients desire models that are personalized using their local data, yet are also robust in the sense that they retain general global knowledge. However, the presence of data heterogeneity across clients induces a fundamental trade-off between personalization (i.e., adaptation to a local distribution) and robustness (i.e., not forgetting previously learned general knowledge). It is critical to understand how to navigate this personalization vs robustness trade-off when designing federated systems, which are increasingly moving towards a paradigm of fine-tuning large foundation models. Due to limited computational and communication capabilities in most federated settings, this foundation model fine-tuning must be done using parameter-efficient fine-tuning (PEFT) approaches. While some recent work has studied federated approaches to PEFT, the personalization vs robustness trade-off of federated PEFT has been largely unexplored. In this work, we take a step towards bridging this gap by benchmarking fundamental FL algorithms -- FedAvg and FedSGD plus personalization (via client local fine-tuning) -- applied to one of the most ubiquitous PEFT approaches to large language models (LLMs) -- prompt tuning -- in a multitude of hyperparameter settings under varying levels of data heterogeneity. Our results show that federated-trained prompts can be surprisingly robust when using a small learning rate with many local epochs for personalization, especially when using an adaptive optimizer as the client optimizer during federated training. We also demonstrate that simple approaches such as adding regularization and interpolating two prompts are effective in improving the personalization vs robustness trade-off in computation-limited settings with few local updates allowed for personalization.

Submitted 6 October, 2023; originally announced October 2023.

arXiv:2310.00178 [pdf, other]

Contextual Biasing with the Knuth-Morris-Pratt Matching Algorithm

Authors: Weiran Wang, Zelin Wu, Diamantino Caseiro, Tsendsuren Munkhdalai, Khe Chai Sim, Pat Rondon, Golan Pundak, Gan Song, Rohit Prabhavalkar, Zhong Meng, Ding Zhao, Tara Sainath, Pedro Moreno Mengibar

Abstract: Contextual biasing refers to the problem of biasing the automatic speech recognition (ASR) systems towards rare entities that are relevant to the specific user or application scenarios. We propose algorithms for contextual biasing based on the Knuth-Morris-Pratt algorithm for pattern matching. During beam search, we boost the score of a token extension if it extends matching into a set of biasing phrases. Our method simulates the classical approaches often implemented in the weighted finite state transducer (WFST) framework, but avoids the FST language altogether, with careful considerations on memory footprint and efficiency on tensor processing units (TPUs) by vectorization. Without introducing additional model parameters, our method achieves significant word error rate (WER) reductions on biasing test sets by itself, and yields further performance gain when combined with a model-based biasing method.

Submitted 29 September, 2023; originally announced October 2023.
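
Illustrative note (not taken from the paper): the abstract above builds on Knuth-Morris-Pratt matching, so a minimal sketch of the classical algorithm may help. It tracks how many tokens of a single biasing phrase a growing hypothesis currently matches. The names `failure_table` and `advance` are hypothetical, the phrase is assumed to be non-empty, and the paper's vectorized multi-phrase TPU implementation and its score boosting are not reproduced here.

```python
def failure_table(pattern):
    """Classic KMP failure function.

    fail[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it.
    """
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def advance(pattern, fail, matched, token):
    """Return the new match length after appending `token`.

    `matched` is the number of leading pattern tokens that the hypothesis
    suffix matches before the extension. Assumes a non-empty pattern.
    """
    if matched == len(pattern):
        # A full match was already consumed; fall back so overlaps still count.
        matched = fail[matched - 1]
    while matched > 0 and token != pattern[matched]:
        matched = fail[matched - 1]
    if token == pattern[matched]:
        matched += 1
    return matched


# Hypothetical biasing phrase and token stream, for illustration only.
phrase = ["new", "york", "new", "jersey"]
fail = failure_table(phrase)
matched = 0
trace = []
for token in ["new", "york", "new", "york", "new", "jersey"]:
    matched = advance(phrase, fail, matched, token)
    trace.append(matched)
print(trace)
```

A decoder could, for example, add a bonus to a token extension whenever the match length grows; that scoring rule is an assumption of this sketch rather than the paper's exact formulation.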

arXiv:2309.12963 [pdf, ps, other]

Massive End-to-end Models for Short Search Queries

Authors: Weiran Wang, Rohit Prabhavalkar, Dongseong Hwang, Qiujia Li, Khe Chai Sim, Bo Li, James Qin, Xingyu Cai, Adam Stooke, Zhong Meng, CJ Zheng, Yanzhang He, Tara Sainath, Pedro Moreno Mengibar

Abstract: In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model parameters. The encoders of our models use the neural architecture of Google's universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We perform extensive studies on vocabulary size, time reduction strategy, and its generalization performance on long-form test sets. Despite the speculation that, as the model size increases, CTC can be as good as RNN-T which builds label dependency into the prediction, we observe that a 900M RNN-T clearly outperforms a 1.8B CTC and is more tolerant to severe time reduction, although the WER gap can be largely removed by LM shallow fusion.

Submitted 22 September, 2023; originally announced September 2023.

arXiv:2309.09996 [pdf, other]

Improving Speech Recognition for African American English With Audio Classification

Authors: Shefali Garg, Zhouyuan Huo, Khe Chai Sim, Suzan Schwartz, Mason Chua, Alëna Aksënova, Tsendsuren Munkhdalai, Levi King, Darryl Wright, Zion Mengesha, Dongseong Hwang, Tara Sainath, Françoise Beaufays, Pedro Moreno Mengibar

Abstract: Automatic speech recognition (ASR) systems have been shown to have large quality disparities between the language varieties they are intended or expected to recognize. One way to mitigate this is to train or fine-tune models with more representative datasets. But this approach can be hindered by limited in-domain data for training and evaluation. We propose a new way to improve the robustness of a US English short-form speech recognizer using a small amount of out-of-domain (long-form) African American English (AAE) data. We use CORAAL, YouTube and Mozilla Common Voice to train an audio classifier to approximately output whether an utterance is AAE or some other variety including Mainstream American English (MAE). By combining the classifier output with coarse geographic information, we can select a subset of utterances from a large corpus of untranscribed short-form queries for semi-supervised learning at scale. Fine-tuning on this data results in a 38.5% relative word error rate disparity reduction between AAE and MAE without reducing MAE quality.

Submitted 16 September, 2023; originally announced September 2023.

arXiv:2307.11069 [pdf, other]

doi 10.1109/ICNC57223.2023.10074058

Effectiveness and predictability of in-network storage cache for scientific workflows

Authors: Caitlin Sim, Kesheng Wu, Alex Sim, Inder Monga, Chin Guok, Frank Wurthwein, Diego Davila, Harvey Newman, Justas Balcas

Abstract: Large scientific collaborations often have multiple scientists accessing the same set of files while doing different analyses, which create repeated accesses to the large amounts of shared data located far away. These data accesses have long latency due to distance and occupy the limited bandwidth available over the wide-area network. To reduce the wide-area network traffic and the data access latency, regional data storage caches have been installed as a new networking service. To study the effectiveness of such a cache system in scientific applications, we examine the Southern California Petabyte Scale Cache for a high-energy physics experiment. By examining about 3TB of operational logs, we show that this cache removed 67.6% of file requests from the wide-area network and reduced the traffic volume on wide-area network by 12.3TB (or 35.4%) an average day. The reduction in the traffic volume (35.4%) is less than the reduction in file counts (67.6%) because the larger files are less likely to be reused. Due to this difference in data access patterns, the cache system has implemented a policy to avoid evicting smaller files when processing larger files. We also build a machine learning model to study the predictability of the cache behavior. Tests show that this model is able to accurately predict the cache accesses, cache misses, and network throughput, making the model useful for future studies on resource provisioning and planning.

Submitted 20 July, 2023; originally announced July 2023.

arXiv:2306.01789 [pdf, other]

Edit Distance based RL for RNNT decoding

Authors: Dongseong Hwang, Changwan Ryu, Khe Chai Sim

Abstract: RNN-T is currently considered the industry standard in ASR due to its exceptional WERs in various benchmark tests and its ability to support seamless streaming and longform transcription. However, its biggest drawback lies in the significant discrepancy between its training and inference objectives. During training, RNN-T maximizes all alignment probabilities by teacher forcing, while during inference, it uses beam search which may not necessarily find the maximum probable alignment. Additionally, RNN-T's inability to experience mistakes during teacher forcing training makes it more problematic when a mistake occurs in inference. To address this issue, this paper proposes a Reinforcement Learning method that minimizes the gap between training and inference time. Our Edit Distance based RL (EDRL) approach computes rewards based on the edit distance, and trains the network at every action level. The proposed approach yielded SoTA WERs on LibriSpeech for the 600M Conformer RNN-T model.

Submitted 14 July, 2023; v1 submitted 31 May, 2023; originally announced June 2023.

Comments: 5 pages, 2 figures
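
Illustrative note (not taken from the paper): the abstract above computes rewards from the edit distance. A minimal sketch of the standard Levenshtein distance over token sequences follows, together with the word error rate derived from it. The function names are hypothetical, and using the negative distance as a sequence-level reward is only an assumption here; the paper's per-action reward assignment is not reproduced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (or strings).

    Uses the standard dynamic program with one rolling row, counting
    insertions, deletions, and substitutions with unit cost.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete r from the reference
                cur[j - 1] + 1,          # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at zero cost
            ))
        prev = cur
    return prev[-1]


def word_error_rate(ref_words, hyp_words):
    """WER as edit distance divided by reference length (non-empty reference)."""
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Hypothetical reference and hypothesis, for illustration only.
ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
distance = edit_distance(ref, hyp)
reward = -distance  # assumed sequence-level reward, not the paper's definition
print(distance, word_error_rate(ref, hyp), reward)
```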

arXiv:2302.01496 [pdf, ps, other]

Efficient Domain Adaptation for Speech Foundation Models

Authors: Bo Li, Dongseong Hwang, Zhouyuan Huo, Junwen Bai, Guru Prakash, Tara N. Sainath, Khe Chai Sim, Yu Zhang, Wei Han, Trevor Strohman, Francoise Beaufays

Abstract: Foundation models (FMs), that are trained on broad data at scale and are adaptable to a wide range of downstream tasks, have brought large interest in the research community. Benefiting from the diverse data sources such as different modalities, languages and application domains, foundation models have demonstrated strong generalization and knowledge transfer capabilities. In this paper, we present a pioneering study towards building an efficient solution for FM-based speech recognition systems. We adopt the recently developed self-supervised BEST-RQ for pretraining, and propose the joint finetuning with both source and unsupervised target domain data using JUST Hydra. The FM encoder adapter and decoder are then finetuned to the target domain with a small amount of supervised in-domain data. On a large-scale YouTube and Voice Search task, our method is shown to be both data and model parameter efficient. It achieves the same quality with only 21.6M supervised in-domain data and 130.8M finetuned parameters, compared to the 731.1M model trained from scratch on additional 300M supervised in-domain data.

Submitted 2 February, 2023; originally announced February 2023.

arXiv:2211.08215 [pdf, ps, other]

Superlinear Convergence of an Interior Point Algorithm on Linear Semi-definite Feasibility Problems with Application to Linear Matrix Inequalities

Authors: Chee-Khian Sim

Abstract: In the literature, besides the assumption of strict complementarity, superlinear convergence of implementable polynomial-time interior point algorithms using known search directions, namely, the HKM direction, its dual or the NT direction, to solve semi-definite programs (SDPs) is shown by (i) assuming that the given SDP is nondegenerate and making modifications to these algorithms [10], or (ii) considering special classes of SDPs, such as the class of linear semi-definite feasibility problems (LSDFPs) and requiring the initial iterate to the algorithm to satisfy certain conditions [26, 27]. Otherwise, these algorithms are not easy to implement even though they are shown to have polynomial iteration complexities and superlinear convergence [14]. The conditions in [26, 27] that the initial iterate to the algorithm is required to satisfy to have superlinear convergence when solving LSDFPs however are not practical. In this paper, we propose a practical initial iterate to an implementable infeasible interior point algorithm that guarantees superlinear convergence when the algorithm is used to solve the homogeneous feasibility model of an LSDFP.

Submitted 12 January, 2024; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: This replacement is a corrected version of the submission arXiv2211.08215 with different title and nontrivial changes

MSC Class: 90C22; 90C51

arXiv:2211.02712 [pdf, other]

Resource-Efficient Transfer Learning From Speech Foundation Model Using Hierarchical Feature Fusion

Authors: Zhouyuan Huo, Khe Chai Sim, Bo Li, Dongseong Hwang, Tara N. Sainath, Trevor Strohman

Abstract: Self-supervised pre-training of a speech foundation model, followed by supervised fine-tuning, has shown impressive quality improvements on automatic speech recognition (ASR) tasks. Fine-tuning separate foundation models for many downstream tasks are expensive since the foundation model is usually very big. Parameter-efficient fine-tuning methods (e.g. adapter, sparse update methods) offer an alternative paradigm where a small set of parameters are updated to adapt the foundation model to new tasks. However, these methods still suffer from a high computational memory cost and slow training speed because they require backpropagation through the entire neural network at each step. In the paper, we analyze the performance of features at different layers of a foundation model on the speech recognition task and propose a novel hierarchical feature fusion method for resource-efficient transfer learning from speech foundation models. Experimental results show that the proposed method can achieve better performance on speech recognition task than existing algorithms with fewer number of trainable parameters, less computational memory cost and faster training speed. After combining with Adapters at all layers, the proposed method can achieve the same performance as fine-tuning the whole model with $97\%$ fewer trainable encoder parameters and $53\%$ faster training speed.

Submitted 4 November, 2022; originally announced November 2022.

arXiv:2210.05793 [pdf, other]

Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR

Authors: Dongseong Hwang, Khe Chai Sim, Yu Zhang, Trevor Strohman

Abstract: Knowledge distillation is an effective machine learning technique to transfer knowledge from a teacher model to a smaller student model, especially with unlabeled data. In this paper, we focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR). Specifically, we compared using soft and hard target distillation to train large-scale RNN-T models on the LibriSpeech/LibriLight public dataset (60k hours) and our in-house data (600k hours). We found that hard targets are more effective when the teacher and student have different architecture, such as large teacher and small streaming student. On the other hand, soft target distillation works better in self-training scenario like iterative large teacher training. For a large model with 0.6B weights, we achieve a new SoTA word error rate (WER) on LibriSpeech (8% relative improvement on dev-other) using Noisy Student Training with soft target distillation. It also allows our production teacher to adapt new data domain continuously.

Submitted 28 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

Comments: 8 pages, 2 figures

arXiv:2208.03067 [pdf, ps, other]

Large vocabulary speech recognition for languages of Africa: multilingual modeling and self-supervised learning

Authors: Sandy Ritchie, You-Chi Cheng, Mingqing Chen, Rajiv Mathews, Daan van Esch, Bo Li, Khe Chai Sim

Abstract: Almost none of the 2,000+ languages spoken in Africa have widely available automatic speech recognition systems, and the required data is also only available for a few languages. We have experimented with two techniques which may provide pathways to large vocabulary speech recognition for African languages: multilingual modeling and self-supervised learning. We gathered available open source data and collected data for 15 languages, and trained experimental models using these techniques. Our results show that pooling the small amounts of data available in multilingual end-to-end models, and pre-training on unsupervised data can help improve speech recognition quality for many African languages.

Submitted 4 October, 2022; v1 submitted 5 August, 2022; originally announced August 2022.

arXiv:2207.00706 [pdf, other]

UserLibri: A Dataset for ASR Personalization Using Only Text

Authors: Theresa Breiner, Swaroop Ramaswamy, Ehsan Variani, Shefali Garg, Rajiv Mathews, Khe Chai Sim, Kilol Gupta, Mingqing Chen, Lara McConnaughey

Abstract: Personalization of speech models on mobile devices (on-device personalization) is an active area of research, but more often than not, mobile devices have more text-only data than paired audio-text data. We explore training a personalized language model on text-only data, used during inference to improve speech recognition performance for that user. We experiment on a user-clustered LibriSpeech corpus, supplemented with personalized text-only data for each user from Project Gutenberg. We release this User-Specific LibriSpeech (UserLibri) dataset to aid future personalization research. LibriSpeech audio-transcript pairs are grouped into 55 users from the test-clean dataset and 52 users from test-other. We are able to lower the average word error rate per user across both sets in streaming and nonstreaming models, including an improvement of 2.5 for the harder set of test-other users when streaming.

Submitted 1 July, 2022; originally announced July 2022.

Comments: Accepted for publication in Interspeech 2022. 9 total pages with appendix, 9 total tables, 5 total figures

arXiv:2203.12668 [pdf, other]

Pseudo Label Is Better Than Human Label

Authors: Dongseong Hwang, Khe Chai Sim, Zhouyuan Huo, Trevor Strohman

Abstract: State-of-the-art automatic speech recognition (ASR) systems are trained with tens of thousands of hours of labeled speech data. Human transcription is expensive and time consuming. Factors such as the quality and consistency of the transcription can greatly affect the performance of the ASR models trained with these data. In this paper, we show that we can train a strong teacher model to produce high quality pseudo labels by utilizing recent self-supervised and semi-supervised learning techniques. Specifically, we use JUST (Joint Unsupervised/Supervised Training) and iterative noisy student teacher training to train a 600 million parameter bi-directional teacher model. This model achieved 4.0% word error rate (WER) on a voice search task, 11.1% relatively better than a baseline. We further show that by using this strong teacher model to generate high-quality pseudo labels for training, we can achieve 13.6% relative WER reduction (5.9% to 5.1%) for a streaming model compared to using human labels.

Submitted 1 July, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

Comments: 6 pages, 2 figures, 9 tables, Proceedings of INTERSPEECH 2022
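
Illustrative note (not taken from the paper): the "13.6% relative WER reduction (5.9% to 5.1%)" quoted in the abstract above follows the usual definition of relative reduction, (baseline - new) / baseline. A minimal check, with a hypothetical helper name:

```python
def relative_reduction(baseline, new):
    """Relative reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline - new) / baseline


# WER values quoted in the abstract: 5.9% with human labels, 5.1% with pseudo labels.
reduction = relative_reduction(5.9, 5.1)
print(f"{reduction:.1f}% relative WER reduction")
```

The result rounds to the 13.6% stated in the abstract.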

arXiv:2111.08137 [pdf, other]

Joint Unsupervised and Supervised Training for Multilingual ASR

Authors: Junwen Bai, Bo Li, Yu Zhang, Ankur Bapna, Nikhil Siddhartha, Khe Chai Sim, Tara N. Sainath

Abstract: Self-supervised training has shown promising gains in pretraining models and facilitating the downstream finetuning for speech recognition, like multilingual ASR. Most existing methods adopt a 2-stage scheme where the self-supervised loss is optimized in the first pretraining stage, and the standard supervised finetuning resumes in the second stage. In this paper, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method to combine the supervised RNN-T loss and the self-supervised contrastive and masked language modeling (MLM) losses. We validate its performance on the public dataset Multilingual LibriSpeech (MLS), which includes 8 languages and is extremely imbalanced. On MLS, we explore (1) JUST trained from scratch, and (2) JUST finetuned from a pretrained checkpoint. Experiments show that JUST can consistently outperform other existing state-of-the-art methods, and beat the monolingual baseline by a significant margin, demonstrating JUST's capability of handling low-resource languages in multilingual ASR. Our average WER of all languages outperforms average monolingual baseline by 33.3%, and the state-of-the-art 2-stage XLSR by 32%. On low-resource languages like Polish, our WER is less than half of the monolingual baseline and even beats the supervised transfer learning method which uses external supervision.

Submitted 15 November, 2021; originally announced November 2021.

arXiv:2111.04177 [pdf, other]

Solution to a Monotone Inclusion Problem using the Relaxed Peaceman-Rachford Splitting Method: Convergence and its Rates

Authors: Chee Khian Sim

Abstract: We consider the convergence behavior using the relaxed Peaceman-Rachford splitting method to solve the monotone inclusion problem $0 \in (A + B)(u)$, where $A, B: \Re^n \rightrightarrows \Re^n$ are maximal $\beta$-strongly monotone operators, $n \geq 1$ and $\beta > 0$. Under a technical assumption, convergence of iterates using the method on the problem is proved when either $A$ or $B$ is single-valued, and the fixed relaxation parameter $\theta$ lies in the interval $(2 + \beta, 2 + \beta + \min \{ \beta, 1/\beta \})$. With this convergence result, we address an open problem that is not settled in [20] on the convergence of these iterates for $\theta \in (2 + \beta, 2 + \beta + \min \{ \beta, 1/\beta \})$. Pointwise convergence rate results and $R$-linear convergence rate results when $\theta$ lies in the interval $[2 + \beta, 2 + \beta + \min \{ \beta, 1/\beta \})$ are also provided in the paper. Our analysis to achieve these results is atypical and hence novel. Numerical experiments are conducted on the weighted Lasso minimization problem to test the validity of the assumption.

Submitted 13 November, 2022; v1 submitted 7 November, 2021; originally announced November 2021.

Comments: 23 pages, 1 figure, 1 table

MSC Class: 90C25; 90C06

arXiv:2110.02220 [pdf, other]

Fast Contextual Adaptation with Neural Associative Memory for On-Device Personalized Speech Recognition

Authors: Tsendsuren Munkhdalai, Khe Chai Sim, Angad Chandorkar, Fan Gao, Mason Chua, Trevor Strohman, Françoise Beaufays

Abstract: Fast contextual adaptation has been shown to be effective in improving Automatic Speech Recognition (ASR) of rare words, and when combined with on-device personalized training, it can yield an even better recognition result. However, the traditional re-scoring approaches based on an external language model are prone to diverge during the personalized training. In this work, we introduce a model-based end-to-end contextual adaptation approach that is decoder-agnostic and amenable to on-device personalization. Our on-device simulation experiments demonstrate that the proposed approach outperforms the traditional re-scoring technique by 12% relative WER and 15.7% entity mention specific F1-score in a continuous personalization scenario.

Submitted 6 October, 2021; v1 submitted 4 October, 2021; originally announced October 2021.

Comments: 5 pages, 3 figures, 3 tables

arXiv:2110.00165 [pdf, other]

Large-scale ASR Domain Adaptation using Self- and Semi-supervised Learning

Authors: Dongseong Hwang, Ananya Misra, Zhouyuan Huo, Nikhil Siddhartha, Shefali Garg, David Qiu, Khe Chai Sim, Trevor Strohman, Françoise Beaufays, Yanzhang He

Abstract: Self- and semi-supervised learning methods have been actively investigated to reduce labeled training data or enhance the model performance. However, these approaches mostly focus on in-domain performance for public datasets. In this study, we utilize the combination of self- and semi-supervised learning methods to solve the unseen domain adaptation problem in a large-scale production setting for an online ASR model. This approach demonstrates that using the source domain data with a small fraction of the target domain data (3%) can recover the performance gap compared to a full data baseline: relative 13.5% WER improvement for target domain data.

Submitted 15 February, 2022; v1 submitted 30 September, 2021; originally announced October 2021.

Comments: ICASSP 2022 accepted, 5 pages, 2 figures, 5 tables

arXiv:2110.00155 [pdf, other]

Incremental Layer-wise Self-Supervised Learning for Efficient Speech Domain Adaptation On Device

Authors: Zhouyuan Huo, Dongseong Hwang, Khe Chai Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, Françoise Beaufays

Abstract: Streaming end-to-end speech recognition models have been widely applied to mobile devices and show significant improvement in efficiency. These models are typically trained on the server using transcribed speech data. However, the server data distribution can be very different from the data distribution on user devices, which could affect the model performance. There are two main challenges for on-device training: limited reliable labels and limited training memory. While self-supervised learning algorithms can mitigate the mismatch between domains using unlabeled data, they are not applicable on mobile devices directly because of the memory constraint. In this paper, we propose an incremental layer-wise self-supervised learning algorithm for efficient speech domain adaptation on mobile devices, in which only one layer is updated at a time. Extensive experimental results demonstrate that the proposed algorithm obtains a Word Error Rate (WER) on the target domain 24.2% better than supervised baseline and costs 89.7% less training memory than the end-to-end self-supervised learning algorithm.

Submitted 30 September, 2021; originally announced October 2021.

Comments: 5 pages

arXiv:2109.13226 [pdf, other]

doi 10.1109/JSTSP.2022.3182537

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Authors: Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yan** Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, et al. (1 additional author not shown)

Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.

Submitted 21 July, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

Comments: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated

arXiv:2106.10259 [pdf, other]

On-Device Personalization of Automatic Speech Recognition Models for Disordered Speech

Authors: Katrin Tomanek, Françoise Beaufays, Julie Cattiau, Angad Chandorkar, Khe Chai Sim

Abstract: While current state-of-the-art Automatic Speech Recognition (ASR) systems achieve high accuracy on typical speech, they suffer from significant performance degradation on disordered speech and other atypical speech patterns. Personalization of ASR models, a commonly applied solution to this problem, is usually performed in a server-based training environment posing problems around data privacy, delayed model-update times, and communication cost for copying data and models between mobile device and server infrastructure. In this paper, we present an approach to on-device based ASR personalization with very small amounts of speaker-specific data. We test our approach on a diverse set of 100 speakers with disordered speech and find median relative word error rate improvement of 71% with only 50 short utterances required per speaker. When tested on a voice-controlled home automation platform, on-device personalized models show a median task success rate of 81%, compared to only 40% of the unadapted models.

Submitted 18 June, 2021; originally announced June 2021.

arXiv:2008.09911 [pdf, ps, other]

A FISTA-Type First Order Algorithm on Composite Optimization Problems that is Adaptable to the Convex Situation

Authors: Chee-Khian Sim

Abstract: In this note, we propose a FISTA-type first order algorithm, VAR-FISTA, to solve a composite optimization problem. A distinctive feature of VAR-FISTA is its ability to exploit the convexity of the function in the problem, resulting in an improved iteration complexity when the function is convex compared to when it is nonconvex. The iteration complexity results for the convex and nonconvex cases obtained in the note are compatible with the best known in the literature so far.

Submitted 22 August, 2020; originally announced August 2020.

Comments: 13 pages, no figures

MSC Class: 90C26; 90C25

arXiv:2007.07087 [pdf]

The case for a multi-channel polarization sensitive LIDAR for investigation of insolation-driven ices and atmospheres

Authors: Adrian J. Brown, Gorden Videen, Evgenij Zubko, Nicholas Heavens, Nicole-Jeanne Schlegel, Patricio Becerra, Young-Jun Choi, Colin R. Meyer, Tanya N. Harrison, Paul Hayne, Rachel W. Obbard, Tim Michaels, Michael J. Wolff, Scott Guzewich, Yongxiang Hu, Claire Newman, Christian J. Grund, Chae Kyung Sim, Peter B. Buhler, Margaret E. Landis, Timothy J. Stubbs, Aymeric Spiga, Devanshu Jha

Abstract: All LIDAR instruments are not the same, and advancement of LIDAR technology requires an ongoing interest and demand from the community to foster further development of the required components. The purpose of this paper is to make the community aware of the need for further technical development, and the potential payoff of investing experimental time, money and thought into the next generation of LIDARs.

Submitted 11 July, 2020; originally announced July 2020.

Comments: 12 pages, 1 figure. arXiv admin note: substantial text overlap with arXiv:1406.0030

arXiv:2001.08885 [pdf, other]

Low-rank Gradient Approximation For Memory-Efficient On-device Training of Deep Neural Network

Authors: Mary Gooneratne, Khe Chai Sim, Petr Zadrazil, Andreas Kabel, Françoise Beaufays, Giovanni Motta

Abstract: Training machine learning models on mobile devices has the potential of improving both privacy and accuracy of the models. However, one of the major obstacles to achieving this goal is the memory limitation of mobile devices. Reducing training memory enables models with high-dimensional weight matrices, like automatic speech recognition (ASR) models, to be trained on-device. In this paper, we propose approximating the gradient matrices of deep neural networks using a low-rank parameterization as an avenue to save training memory. The low-rank gradient approximation enables more advanced, memory-intensive optimization techniques to be run on device. Our experimental results show that we can reduce the training memory by about 33.0% for Adam optimization. It uses comparable memory to momentum optimization and achieves a 4.5% relative lower word error rate on an ASR personalization task.

Submitted 24 January, 2020; originally announced January 2020.

arXiv:1912.09251 [pdf, other]

Personalization of End-to-end Speech Recognition On Mobile Devices For Named Entities

Authors: Khe Chai Sim, Françoise Beaufays, Arnaud Benard, Dhruv Guliani, Andreas Kabel, Nikhil Khare, Tamar Lucassen, Petr Zadrazil, Harry Zhang, Leif Johnson, Giovanni Motta, Lillian Zhou

Abstract: We study the effectiveness of several techniques to personalize end-to-end speech models and improve the recognition of proper names relevant to the user. These techniques differ in the amounts of user effort required to provide supervision, and are evaluated on how they impact speech recognition performance. We propose using keyword-dependent precision and recall metrics to measure vocabulary acquisition performance. We evaluate the algorithms on a dataset that we designed to contain names of persons that are difficult to recognize. Therefore, the baseline recall rate for proper names in this dataset is very low: 2.4%. A data synthesis approach we developed brings it to 48.6%, with no need for speech input from the user. With speech input, if the user corrects only the names, the name recall rate improves to 64.4%. If the user corrects all the recognition errors, we achieve the best recall of 73.5%. To eliminate the need to upload user data and store personalized models on a server, we focus on performing the entire personalization workflow on a mobile device.

Submitted 14 December, 2019; originally announced December 2019.

arXiv:1909.06678 [pdf, other]

An Investigation Into On-device Personalization of End-to-end Automatic Speech Recognition Models

Authors: Khe Chai Sim, Petr Zadrazil, Françoise Beaufays

Abstract: Speaker-independent speech recognition systems trained with data from many users are generally robust against speaker variability and work well for a large population of speakers. However, these systems do not always generalize well for users with very different speech characteristics. This issue can be addressed by building personalized systems that are designed to work well for each specific user. In this paper, we investigate the idea of securely training personalized end-to-end speech recognition models on mobile devices so that user data and models never leave the device and are never stored on a server. We study how the mobile training environment impacts performance by simulating on-device data consumption. We conduct experiments using data collected from speech impaired users for personalization. Our results show that personalization achieved 63.7% relative word error rate reduction when trained in a server environment and 58.1% in a mobile environment. Moving to on-device personalization resulted in 18.7% performance degradation, in exchange for improved scalability and data privacy. To train the model on device, we split the gradient computation into two and achieved 45% memory reduction at the expense of 42% increase in training time.

Submitted 14 September, 2019; originally announced September 2019.

arXiv:1907.10351 [pdf, other]

Energy-preserving multi-symplectic Runge-Kutta methods for Hamiltonian wave equations

Authors: Chuchu Chen, Jialin Hong, Chol Sim, Kwang Sonwu

Abstract: It is well-known that a numerical method which is at the same time geometric structure-preserving and physical property-preserving cannot exist in general for Hamiltonian partial differential equations. In this paper, we present a novel class of parametric multi-symplectic Runge-Kutta methods for Hamiltonian wave equations, which can also conserve energy simultaneously in a weaker sense with a suitable parameter. The existence of such a parameter, which enforces the energy-preserving property, is proved under certain assumptions on the fixed step sizes and the fixed initial condition. We compare the proposed method with the classical multi-symplectic Runge-Kutta method in numerical experiments, which show the remarkable energy-preserving property of the proposed method and illustrate the validity of theoretical results.

Submitted 24 July, 2019; originally announced July 2019.

Comments: 26 pages, 6 figures

arXiv:1905.07010 [pdf, ps, other]

A FISTA-type accelerated gradient algorithm for solving smooth nonconvex composite optimization problems

Authors: Jiaming Liang, Renato D. C. Monteiro, Chee-Khian Sim

Abstract: In this paper, we describe and establish iteration-complexity of two accelerated composite gradient (ACG) variants to solve a smooth nonconvex composite optimization problem whose objective function is the sum of a nonconvex differentiable function $f$ with a Lipschitz continuous gradient and a simple nonsmooth closed convex function $h$. When $f$ is convex, the first ACG variant reduces to the well-known FISTA for a specific choice of the input, and hence the first one can be viewed as a natural extension of the latter one to the nonconvex setting. The first variant requires an input pair $(M,m)$ such that $f$ is $m$-weakly convex, $\nabla f$ is $M$-Lipschitz continuous, and $m \le M$ (possibly $m<M$), which is usually hard to obtain or poorly estimated. The second variant on the other hand can start from an arbitrary input pair $(M,m)$ of positive scalars and its complexity is shown to be not worse, and better in some cases, than that of the first variant for a large range of the input pairs. Finally, numerical results are provided to illustrate the efficiency of the two ACG variants.

Submitted 5 March, 2021; v1 submitted 16 May, 2019; originally announced May 2019.

Comments: 28 pages

arXiv:1811.06621 [pdf, other]

Streaming End-to-end Speech Recognition For Mobile Devices

Authors: Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, Qiao Liang, Deepti Bhatia, Yuan Shangguan, Bo Li, Golan Pundak, Khe Chai Sim, Tom Bagby, Shuo-yiin Chang, Kanishka Rao, Alexander Gruenstein

Abstract: End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.

Submitted 15 November, 2018; originally announced November 2018.

arXiv:1810.01945 [pdf, ps, other]

Generating Labeled Flow Data from MAWILab Traces for Network Intrusion Detection

Authors: **oh Kim, Caitlin Sim, **hwan Choi

Abstract: A growing issue in the modern cyberspace world is the direct identification of malicious activity over network connections. The boom of the machine learning industry in the past few years has led to the increasing usage of machine learning technologies, which are especially prevalent in the network intrusion detection research community. When utilizing these fairly contemporary techniques, the community has realized that datasets are pivotal for identifying malicious packets and connections, particularly ones associated with information concerning labeling in order to construct learning models. However, there exists a shortage of publicly available, relevant datasets to researchers in the network intrusion detection community. Thus, in this paper, we introduce a method to construct labeled flow data by combining the packet meta-information with IDS logs to infer labels for intrusion detection research. Specifically, we designed a NetFlow-compatible format due to the capability of a large body of network devices, such as routers and switches, to export NetFlow records from raw traffic. In doing so, the introduced method would help researchers access relevant network flow datasets along with label information.

Submitted 3 October, 2018; originally announced October 2018.

Comments: 4 pages

arXiv:1808.05312 [pdf, other]

Toward domain-invariant speech recognition via large scale training

Authors: Arun Narayanan, Ananya Misra, Khe Chai Sim, Golan Pundak, Anshuman Tripathi, Mohamed Elfeky, Parisa Haghani, Trevor Strohman, Michiel Bacchiani

Abstract: Current state-of-the-art automatic speech recognition systems are trained to work in specific 'domains', defined based on factors like application, sampling rate and codec. When such recognizers are used in conditions that do not match the training domain, performance significantly drops. This work explores the idea of building a single domain-invariant model for varied use-cases by combining large scale training data from multiple application domains. Our final system is trained using 162,000 hours of speech. Additionally, each utterance is artificially distorted during training to simulate effects like background noise, codec distortion, and sampling rates. Our results show that, even at such a scale, a model thus trained works almost as well as those fine-tuned to specific subsets: A single model can be robust to multiple application domains, and variations like codecs and noise. More importantly, such models generalize better to unseen conditions and allow for rapid adaptation -- we show that by using as little as 10 hours of data from a new domain, an adapted domain-invariant model can match performance of a domain-specific model trained from scratch using 70 times as much data. We also highlight some of the limitations of such models and areas that need addressing in future work.

Submitted 15 August, 2018; originally announced August 2018.

arXiv:1802.03816 [pdf, other]

Understanding Recurrent Neural State Using Memory Signatures

Authors: Skanda Koppula, Khe Chai Sim, Kean Chin

Abstract: We demonstrate a network visualization technique to analyze the recurrent state inside the LSTMs/GRUs used commonly in language and acoustic models. Interpreting intermediate state and network activations inside end-to-end models remains an open challenge. Our method allows users to understand exactly how much and what history is encoded inside recurrent state in grapheme sequence models. Our procedure trains multiple decoders that predict prior input history. Compiling results from these decoders, a user can obtain a signature of the recurrent kernel that characterizes its memory behavior. We demonstrate this method's usefulness in revealing information divergence in the bases of recurrent factorized kernels, visualizing the character-level differences between the memory of n-gram and recurrent language models, and extracting knowledge of history encoded in the layers of grapheme-based end-to-end ASR networks.

Submitted 11 February, 2018; originally announced February 2018.

Comments: Accepted to 2018 IEEE International Conference on Acoustics, Speech and Signal Processing

arXiv:1712.01541 [pdf, other]

Multi-Dialect Speech Recognition With A Single Sequence-To-Sequence Model

Authors: Bo Li, Tara N. Sainath, Khe Chai Sim, Michiel Bacchiani, Eugene Weinstein, Patrick Nguyen, Zhifeng Chen, Yonghui Wu, Kanishka Rao

Abstract: Sequence-to-sequence models provide a simple and elegant solution for building speech recognition systems by folding separate components of a typical system, namely acoustic (AM), pronunciation (PM) and language (LM) models into a single neural network. In this work, we look at one such sequence-to-sequence model, namely listen, attend and spell (LAS), and explore the possibility of training a single model to serve different English dialects, which simplifies the process of training multi-dialect systems without the need for separate AM, PM and LMs for each dialect. We show that simply pooling the data from all dialects into one LAS model falls behind the performance of a model fine-tuned on each dialect. We then look at incorporating dialect-specific information into the model, both by modifying the training targets by inserting the dialect symbol at the end of the original grapheme sequence and also feeding a 1-hot representation of the dialect information into all layers of the model. Experimental results on seven English dialects show that our proposed system is effective in modeling dialect variations within a single LAS model, outperforming a LAS model trained individually on each of the seven dialects by 3.1 ~ 16.5% relative.

Submitted 5 December, 2017; originally announced December 2017.

Comments: submitted to ICASSP 2018

arXiv:1611.03567 [pdf, other]

Complexity of the relaxed Peaceman-Rachford splitting method for the sum of two maximal strongly monotone operators

Authors: Renato D. C. Monteiro, Chee-Khian Sim

Abstract: This paper considers the relaxed Peaceman-Rachford (PR) splitting method for finding an approximate solution of a monotone inclusion whose underlying operator consists of the sum of two maximal strongly monotone operators. Using general results obtained in the setting of a non-Euclidean hybrid proximal extragradient framework, we extend a previous convergence result on the iterates generated by the relaxed PR splitting method, as well as establish new pointwise and ergodic convergence rate results for the method whenever an associated relaxation parameter is within a certain interval. An example is also discussed to demonstrate that the iterates may not converge when the relaxation parameter is outside this interval.

Submitted 5 November, 2017; v1 submitted 10 November, 2016; originally announced November 2016.

Comments: 26 pages, 2 figures

arXiv:1405.4984 [pdf, ps, other]

doi 10.1016/j.asr.2014.05.023

Medium Resolution Near-Infrared Spectra of the Host Galaxies of Nearby Quasars

Authors: Huynh Anh N. Le, Soojong Pak, Myungshin Im, Minjin Kim, Chae Kyung Sim, Luis C. Ho

Abstract: We present medium resolution near-infrared host galaxy spectra of low redshift quasars, PG 0844 + 349 (z=0.064), PG 1226 + 023 (z=0.158), and PG 1426+015 (z=0.086). The observations were done by using the Infrared Camera and Spectrograph (IRCS) at the Subaru 8.2 m telescope. The full width at half maximum of the point spread function was about 0.3 arcsec by operations of an adaptive optics system, which can effectively resolve the quasar spectra from the host galaxy spectra. We spent up to several hours per target and developed data reduction methods to reduce the systematic noises of the telluric emissions and absorptions. From the obtained spectra, we identified absorption features of Mg I (1.503 um), Si I (1.589 um) and CO (6-3) (1.619 um), and measured the velocity dispersions of PG 0844 + 349 to be 132+/-110 km s-1 and PG 1426 + 015 to be 264+/-215 km s-1. By using an M_BH-sigma relation of elliptical galaxies, we derived the black hole (BH) mass of PG 0844+349, log(M_BH/M_SUN) = 7.7+/-5.5 and PG 1426+015, log(M_BH/M_SUN) = 9.0+/-7.5. These values are consistent with the BH mass values from broad emission lines with an assumption of a virial factor of 5.5.

Submitted 9 June, 2014; v1 submitted 20 May, 2014; originally announced May 2014.

Comments: 16 pages, 5 figures

arXiv:1310.2771 [pdf, ps, other]

doi 10.1063/1.4861459

Asymmetry in effective fields of spin-orbit torques in Pt/Co/Pt stacks

Authors: Cheow Hin Sim, Jian Cheng Huang, Michael Tran, Kwaku Eason

Abstract: Measurements of switching via spin-orbit coupling (SOC) mechanisms are discussed for a pair of inverted Pt/Co/Pt stacks with asymmetrical Pt thicknesses. Taking into account the planar Hall effect contribution, effective fields of spin-orbit torques (SOT) are evaluated using lock-in measurements of the first and second harmonics of the Hall voltage. Reversing the stack structure leads to significant asymmetries in the switching behavior, including clear evidence of a nonlinear current dependence of the transverse effective field. Our results demonstrate potentially complex interplay in devices with all-metallic interfaces utilizing SOT.

Submitted 10 October, 2013; originally announced October 2013.
