
The Gift of Feedback: Improving ASR Model Quality by Learning from User Corrections through Federated Learning

Authors: Lillian Zhou, Yuxin Ding, Mingqing Chen, Harry Zhang, Rohit Prabhavalkar, Dhruv Guliani, Giovanni Motta, Rajiv Mathews

Abstract: Automatic speech recognition (ASR) models are typically trained on large datasets of transcribed speech. As language evolves and new terms come into use, these models can become outdated and stale. In the context of models trained on the server but deployed on edge devices, errors may result from the mismatch between server training data and actual on-device usage. In this work, we seek to continually learn from on-device user corrections through Federated Learning (FL) to address this issue. We explore techniques to target fresh terms that the model has not previously encountered, learn long-tail words, and mitigate catastrophic forgetting. In experimental evaluations, we find that the proposed techniques improve model recognition of fresh terms, while preserving quality on the overall language distribution.

Submitted 30 November, 2023; v1 submitted 29 September, 2023; originally announced October 2023.

Comments: Accepted to IEEE ASRU 2023

arXiv:2209.06359

Federated Pruning: Improving Neural Network Efficiency with Federated Learning

Authors: Rongmei Lin, Yonghui Xiao, Tien-Ju Yang, Ding Zhao, Li Xiong, Giovanni Motta, Françoise Beaufays

Abstract: Automatic Speech Recognition models require large amounts of speech data for training, and the collection of such data often leads to privacy concerns. Federated learning has been widely used and is considered to be an effective decentralized technique by collaboratively learning a shared prediction model while keeping the data local on different clients' devices. However, the limited computation and communication resources on clients' devices present practical difficulties for large models. To overcome such challenges, we propose Federated Pruning to train a reduced model under the federated setting, while maintaining similar performance compared to the full model. Moreover, the vast amount of clients' data can also be leveraged to improve the pruning results compared to centralized training. We explore different pruning schemes and provide empirical evidence of the effectiveness of our methods.

Submitted 13 September, 2022; originally announced September 2022.

Comments: To appear in INTERSPEECH 2022

arXiv:2205.03494

Online Model Compression for Federated Learning with Large Models

Authors: Tien-Ju Yang, Yonghui Xiao, Giovanni Motta, Françoise Beaufays, Rajiv Mathews, Mingqing Chen

Abstract: This paper addresses the challenges of training large neural network models under federated learning settings: high on-device memory usage and communication cost. The proposed Online Model Compression (OMC) provides a framework that stores model parameters in a compressed format and decompresses them only when needed. We use quantization as the compression method in this paper and propose three methods, (1) using per-variable transformation, (2) weight matrices only quantization, and (3) partial parameter quantization, to minimize the impact on model accuracy. According to our experiments on two recent neural networks for speech recognition and two different datasets, OMC can reduce memory usage and communication cost of model parameters by up to 59% while attaining comparable accuracy and training speed when compared with full-precision training.

Submitted 6 May, 2022; originally announced May 2022.

Comments: Submitted to INTERSPEECH 2022

arXiv:2111.10264

doi 10.3847/1538-4357/ac3833

Periodic Variable Stars Modulated by Time-Varying Parameters

Authors: Giovanni Motta, Darlin Soto, Márcio Catelan

Abstract: Many astrophysical phenomena are time-varying, in the sense that their brightness changes over time. In the case of periodic stars, previous approaches assumed that changes in period, amplitude, and phase are well described by either parametric or piecewise-constant functions. With this paper, we introduce a new mathematical model for the description of the so-called modulated light curves, as found in periodic variable stars that exhibit smoothly time-varying parameters such as amplitude, frequency, and/or phase. Our model accounts for a smoothly time-varying trend, and a harmonic sum with smoothly time-varying weights. In this sense, our approach is flexible because it avoids restrictive assumptions (parametric or piecewise-constant) about the functional form of trend and amplitudes. We apply our methodology to the light curve of a pulsating RR Lyrae star characterised by the Blazhko effect. To estimate the time-varying parameters of our model, we develop a semi-parametric method for unequally spaced time series. The estimation of our time-varying curves translates into the estimation of time-invariant parameters that can be performed by ordinary least-squares, with the following two advantages: modeling and forecasting can be implemented in a parametric fashion, and we are able to cope with missing observations. To detect serial correlation in the residuals of our fitted model, we derive the mathematical definition of the spectral density for unequally spaced time series.
The proposed method is designed to estimate smoothly time-varying trend and amplitudes, as well as the spectral density function of the errors. We provide simulation results and applications to real data.

Submitted 19 November, 2021; originally announced November 2021.

Comments: 26 pages, 6 figures, to be published in The Astrophysical Journal

arXiv:2110.05607

Partial Variable Training for Efficient On-Device Federated Learning

Authors: Tien-Ju Yang, Dhruv Guliani, Françoise Beaufays, Giovanni Motta

Abstract: This paper aims to address the major challenges of Federated Learning (FL) on edge devices: limited memory and expensive communication. We propose a novel method, called Partial Variable Training (PVT), that only trains a small subset of variables on edge devices to reduce memory usage and communication cost. With PVT, we show that network accuracy can be maintained by utilizing more local training steps and devices, which is favorable for FL involving a large population of devices. According to our experiments on two state-of-the-art neural networks for speech recognition and two different datasets, PVT can reduce memory usage by up to 1.9$\times$ and communication cost by up to 593$\times$ while attaining comparable accuracy when compared with full network training.

Submitted 11 October, 2021; originally announced October 2021.

arXiv:2110.04267

Exploring Heterogeneous Characteristics of Layers in ASR Models for More Efficient Training

Authors: Lillian Zhou, Dhruv Guliani, Andreas Kabel, Giovanni Motta, Françoise Beaufays

Abstract: Transformer-based architectures have been the subject of research aimed at understanding their overparameterization and the non-uniform importance of their layers. Applying these approaches to Automatic Speech Recognition, we demonstrate that the state-of-the-art Conformer models generally have multiple ambient layers. We study the stability of these layers across runs and model sizes, propose that group normalization may be used without disrupting their formation, and examine their correlation with model weight updates in each layer. Finally, we apply these findings to Federated Learning in order to improve the training procedure, by targeting Federated Dropout to layers by importance. This allows us to reduce the model size optimized by clients without quality degradation, and shows potential for future exploration.

Submitted 4 February, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

Comments: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

MSC Class: 68T10 ACM Class: I.2.7

arXiv:2110.03634

Enabling On-Device Training of Speech Recognition Models with Federated Dropout

Authors: Dhruv Guliani, Lillian Zhou, Changwan Ryu, Tien-Ju Yang, Harry Zhang, Yonghui Xiao, Francoise Beaufays, Giovanni Motta

Abstract: Federated learning can be used to train machine learning models on the edge on local data that never leave devices, providing privacy by default. This presents a challenge pertaining to the communication and computation costs associated with clients' devices. These costs are strongly correlated with the size of the model being trained, and are significant for state-of-the-art automatic speech recognition models. We propose using federated dropout to reduce the size of client models while training a full-size model server-side. We provide empirical evidence of the effectiveness of federated dropout, and propose a novel approach to vary the dropout rate applied at each layer. Furthermore, we find that federated dropout enables a set of smaller sub-models within the larger model to independently have low word error rates, making it easier to dynamically adjust the size of the model deployed for inference.

Submitted 7 October, 2021; originally announced October 2021.

Comments: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

MSC Class: 68T10 ACM Class: I.2.7

arXiv:2104.11358

Joint Mean-Vector and Var-Matrix estimation for Locally Stationary VAR(1) processes

Authors: Giovanni Motta

Abstract: During the last two decades, locally stationary processes have been widely studied in the time series literature. In this paper we consider the locally-stationary vector-auto-regression model of order one, or LS-VAR(1), and estimate its parameters by weighted least squares. The LS-VAR(1) we consider allows for a smoothly time-varying non-diagonal VAR matrix, as well as for a smoothly time-varying non-zero mean. The weighting scheme is based on kernel smoothers. The time-varying mean and the time-varying VAR matrix are estimated jointly, and the definition of the local-linear weighting matrix is provided in closed form. The quality of the estimated curves is illustrated through simulation results.

Submitted 22 April, 2021; originally announced April 2021.

arXiv:2010.15965

doi 10.1109/ICASSP39728.2021.9413397

Training Speech Recognition Models with Federated Learning: A Quality/Cost Framework

Authors: Dhruv Guliani, Francoise Beaufays, Giovanni Motta

Abstract: We propose using federated learning, a decentralized on-device learning paradigm, to train speech recognition models. By performing epochs of training on a per-user basis, federated learning must incur the cost of dealing with non-IID data distributions, which are expected to negatively affect the quality of the trained model. We propose a framework by which the degree of non-IID-ness can be varied, consequently illustrating a trade-off between model quality and the computational cost of federated training, which we capture through a novel metric. Finally, we demonstrate that hyper-parameter optimization and appropriate use of variational noise are sufficient to compensate for the quality impact of non-IID distributions, while decreasing the cost.

Submitted 14 May, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

Comments: Paper published at ICASSP 2021

MSC Class: 68T10 ACM Class: I.2.7

Journal ref: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3080-3084

arXiv:2001.08885

Low-rank Gradient Approximation For Memory-Efficient On-device Training of Deep Neural Network

Authors: Mary Gooneratne, Khe Chai Sim, Petr Zadrazil, Andreas Kabel, Françoise Beaufays, Giovanni Motta

Abstract: Training machine learning models on mobile devices has the potential of improving both privacy and accuracy of the models. However, one of the major obstacles to achieving this goal is the memory limitation of mobile devices. Reducing training memory enables models with high-dimensional weight matrices, like automatic speech recognition (ASR) models, to be trained on-device. In this paper, we propose approximating the gradient matrices of deep neural networks using a low-rank parameterization as an avenue to save training memory. The low-rank gradient approximation enables more advanced, memory-intensive optimization techniques to be run on device. Our experimental results show that we can reduce the training memory by about 33.0% for Adam optimization. It uses comparable memory to momentum optimization and achieves a 4.5% relative lower word error rate on an ASR personalization task.

Submitted 24 January, 2020; originally announced January 2020.

arXiv:1912.09251

Personalization of End-to-end Speech Recognition On Mobile Devices For Named Entities

Authors: Khe Chai Sim, Françoise Beaufays, Arnaud Benard, Dhruv Guliani, Andreas Kabel, Nikhil Khare, Tamar Lucassen, Petr Zadrazil, Harry Zhang, Leif Johnson, Giovanni Motta, Lillian Zhou

Abstract: We study the effectiveness of several techniques to personalize end-to-end speech models and improve the recognition of proper names relevant to the user. These techniques differ in the amounts of user effort required to provide supervision, and are evaluated on how they impact speech recognition performance. We propose using keyword-dependent precision and recall metrics to measure vocabulary acquisition performance. We evaluate the algorithms on a dataset that we designed to contain names of persons that are difficult to recognize. Therefore, the baseline recall rate for proper names in this dataset is very low: 2.4%. A data synthesis approach we developed brings it to 48.6%, with no need for speech input from the user. With speech input, if the user corrects only the names, the name recall rate improves to 64.4%. If the user corrects all the recognition errors, we achieve the best recall of 73.5%. To eliminate the need to upload user data and store personalized models on a server, we focus on performing the entire personalization workflow on a mobile device.

Submitted 14 December, 2019; originally announced December 2019.

arXiv:1711.06506

doi 10.1093/mnras/stx2958

Poincaré surfaces of section around a 3-D irregular body: The case of asteroid 4179 Toutatis

Authors: Gabriel Borderes Motta, Othon Cabo Winter

Abstract: In general, small bodies of the solar system, e.g., asteroids and comets, have a very irregular shape. This feature affects significantly the gravitational potential around these irregular bodies, which hinders dynamical studies. The Poincaré surface of section technique is often used to look for stable and chaotic regions in two-dimensional dynamic cases. In this work, we show that this tool can be useful for exploring the surroundings of irregular bodies such as the asteroid 4179 Toutatis. Considering a rotating system with a particle, under the effect of the gravitational field computed three-dimensionally, we define a plane in the phase space to build the Poincaré surface of sections. Despite the extra dimension, the sections created allow us to find trajectories and classify their stabilities. Thus, we have also been able to map stable and chaotic regions, as well as to find correlations between those regions and the contribution of the third dimension of the system to the trajectory dynamics. As examples, we show details of periodic (resonant or not) and quasi-periodic trajectories.

Submitted 17 November, 2017; originally announced November 2017.

arXiv:1608.08342

A New Paradigm of Software Service Engineering in the Era of Big Data and Big Service

Authors: Xiaofei Xu, Gianmario Motta, Xianzhi Wang, Zhiying Tu, Hanchuan Xu

Abstract: Servitization is one of the most significant trends that reshapes the information world and society in recent years. The requirement of collecting, storing, processing, and sharing of the Big Data has led to massive software resources being developed and made accessible as web-based services to facilitate such process. These services that handle the Big Data come from various domains and heterogeneous networks, and converge into a huge complicated service network (or ecosystem), called the Big Service. The key issue facing the big data and big service ecosystem is how to optimally configure and operate the related service resources to serve the specific requirements of possible applications, i.e., how to reuse the existing service resources effectively and efficiently to develop the new applications or software services, to meet the massive individualized requirements of end-users. Based on analyzing the big service ecosystem, we present in this paper a new paradigm for software service engineering, RE2SEP (Requirement-Engineering Two-Phase of Service Engineering Paradigm), which includes three components: service-oriented requirement engineering, domain-oriented service engineering, and software service development approach. RE2SEP enables the rapid design and implementation of service solutions to match the requirement propositions of massive individualized customers in the Big Service ecosystem.
A case study on people's mobility service in a smart city environment is given to demonstrate the application of RE2SEP. RE2SEP can potentially revolutionize the traditional life-cycle oriented software engineering, leading to a new approach to software service engineering.

Submitted 30 August, 2016; originally announced August 2016.

Comments: 23 pages + 1 page of references. Submitted to Springer Computing

Showing 1–13 of 13 results for author: Motta, G