-
A Pixel-based Reconfigurable Antenna Design for Fluid Antenna Systems
Authors:
Jichen Zhang,
Junhui Rao,
Zhaoyang Ming,
Zan Li,
Chi-Yuk Chiu,
Kai-Kit Wong,
Kin-Fai Tong,
Ross Murch
Abstract:
Fluid Antenna Systems (FASs) have recently been proposed for enhancing the performance of wireless communication. Previous antenna designs to meet the requirements of FAS have been based on mechanically movable or liquid antennas and therefore have limited reconfiguration speeds. In this paper, we propose a design for a pixel-based reconfigurable antenna (PRA) that meets the requirements of FAS and the required switching speed. It can provide 12 FAS ports across 1/2 wavelength and consists of an E-slot patch antenna and an upper reconfigurable pixel layer with 6 RF switches. Simulation and experimental results from a prototype operating at 2.5 GHz demonstrate that the design can meet the requirements of FAS including port correlation with matched impedance.
Submitted 14 June, 2024; v1 submitted 8 June, 2024;
originally announced June 2024.
-
A Shared-Aperture Dual-Band sub-6 GHz and mmWave Reconfigurable Intelligent Surface With Independent Operation
Authors:
Junhui Rao,
Yujie Zhang,
Shiwen Tang,
Zan Li,
Zhaoyang Ming,
Jichen Zhang,
Chi Yuk Chiu,
Ross Murch
Abstract:
A novel dual-band reconfigurable intelligent surface (DBI-RIS) design that combines the functionalities of millimeter-wave (mmWave) and sub-6 GHz bands within a single aperture is proposed. This design aims to bridge the gap between current single-band reconfigurable intelligent surfaces (RISs) and wireless systems utilizing sub-6 GHz and mmWave bands that require RIS with independently reconfigurable dual-band operation. The mmWave element is realized by a double-layer patch antenna loaded with 1-bit phase shifters, providing two reconfigurable states. An 8x8 mmWave element array is selectively interconnected using three RF switches to form a reconfigurable sub-6 GHz element at 3.5 GHz. A suspended electromagnetic band gap (EBG) structure is proposed to suppress surface waves and ensure sufficient geometric space for the phase shifter and control networks in the mmWave element. A low-cost planar spiral inductor (PSI) is carefully optimized to connect mmWave elements, enabling the sub-6 GHz function without affecting mmWave operation. Finally, prototypes of the DBI-RIS are fabricated, and experimental verification is conducted using two separate measurement testbeds. The fabricated sub-6 GHz RIS successfully achieves beam steering within the range of -35 to 35 degrees for DBI-RIS with 4x4 sub-6 GHz elements, while the mmWave RIS demonstrates beam steering between -30 and 30 degrees for DBI-RIS with 8x8 mmWave elements; both results agree well with simulation.
Submitted 5 June, 2024;
originally announced June 2024.
-
Credit vs. Discount-Based Congestion Pricing: A Comparison Study
Authors:
Chih-Yuan Chiu,
Devansh Jalota,
Marco Pavone
Abstract:
Tolling, or congestion pricing, offers a promising traffic management policy for regulating congestion, but has also attracted criticism for placing outsized financial burdens on low-income users. Credit-based congestion pricing (CBCP) and discount-based congestion pricing (DBCP) policies, which respectively provide travel credits and toll discounts to low-income users on tolled roads, have emerged as promising mechanisms for reducing traffic congestion without worsening societal inequities. However, the optimal design of CBCP and DBCP policies, as well as their relative advantages and disadvantages, remain poorly understood. To address this, we study the effects of implementing CBCP and DBCP policies to route users on a network of multi-lane highways with tolled express lanes. We formulate a non-atomic routing game framework in which a subset of eligible users is granted toll relief in the form of a fixed budget or toll discount, while the remaining ineligible users must pay out-of-pocket. We prove the existence of Nash equilibrium traffic flow patterns corresponding to any given CBCP or DBCP policy. Under the additional assumption that eligible users have time-invariant VoTs, we provide a convex program to efficiently compute these equilibria. For networks consisting of a single edge, we identify conditions under which CBCP policies outperform DBCP policies (and vice versa), in the sense of improving eligible users' access to the express lane. Finally, we present empirical results from a CBCP pilot study of the San Mateo 101 Express Lane Project in California. Our empirical results corroborate our theoretical analysis of the impact of deploying credit-based and discount-based policies, and lend insights into the sensitivity of their impact with respect to the travel demand and users' VoTs.
Submitted 9 May, 2024; v1 submitted 20 March, 2024;
originally announced March 2024.
-
SLM: Bridge the thin gap between speech and text foundation models
Authors:
Mingqiu Wang,
Wei Han,
Izhak Shafran,
Zelin Wu,
Chung-Cheng Chiu,
Yuan Cao,
Yongqiang Wang,
Nanxin Chen,
Yu Zhang,
Hagen Soltau,
Paul Rubenstein,
Lukas Zilka,
Dian Yu,
Zhong Meng,
Golan Pundak,
Nikhil Siddhartha,
Johan Schalkwyk,
Yonghui Wu
Abstract:
We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. SLM freezes the pretrained foundation models to maximally preserve their capabilities, and only trains a simple adapter with just 1% (156M) of the foundation models' parameters. This adaptation not only leads SLM to achieve strong performance on conventional tasks such as speech recognition (ASR) and speech translation (AST), but also introduces the novel capability of zero-shot instruction-following for more diverse tasks: given a speech input and a text instruction, SLM is able to perform unseen generation tasks including contextual biasing ASR using real-time context, dialog generation, speech continuation, and question answering. Our approach demonstrates that the representational gap between pretrained speech and language models might be narrower than one would expect, and can be bridged by a simple adaptation mechanism. As a result, SLM is not only efficient to train, but also inherits strong capabilities already acquired in foundation models of different modalities.
Submitted 29 September, 2023;
originally announced October 2023.
-
Local Periodicity-Based Beat Tracking for Expressive Classical Piano Music
Authors:
Ching-Yu Chiu,
Meinard Müller,
Matthew E. P. Davies,
Alvin Wen-Yu Su,
Yi-Hsuan Yang
Abstract:
To model the periodicity of beats, state-of-the-art beat tracking systems use "post-processing trackers" (PPTs) that rely on several empirically determined global assumptions for tempo transition, which work well for music with a steady tempo. For expressive classical music, however, these assumptions can be too rigid. With two large datasets of Western classical piano music, namely the Aligned Scores and Performances (ASAP) dataset and a dataset of Chopin's Mazurkas (Maz-5), we report on experiments showing the failure of existing PPTs to cope with local tempo changes, thus calling for new methods. In this paper, we propose a new local periodicity-based PPT, called predominant local pulse-based dynamic programming (PLPDP) tracking, that allows for more flexible tempo transitions. Specifically, the new PPT incorporates a method called "predominant local pulses" (PLP) in combination with a dynamic programming (DP) component to jointly consider the locally detected periodicity and beat activation strength at each time instant. Accordingly, PLPDP accounts for the local periodicity, rather than relying on a global tempo assumption. Compared to existing PPTs, PLPDP particularly enhances the recall values at the cost of a lower precision, resulting in an overall improvement of F1-score for beat tracking in ASAP (from 0.473 to 0.493) and Maz-5 (from 0.595 to 0.838).
Submitted 20 August, 2023;
originally announced August 2023.
-
SALC: Skeleton-Assisted Learning-Based Clustering for Time-Varying Indoor Localization
Authors:
An-Hung Hsiao,
Li-Hsiang Shen,
Chen-Yi Chang,
Chun-Jie Chiu,
Kai-Ten Feng
Abstract:
Wireless indoor localization has attracted a significant amount of attention in recent years. Using received signal strength (RSS) obtained from WiFi access points (APs) to establish a fingerprinting database is a widely utilized method in indoor localization. However, the time-variant problem for indoor positioning systems is not well investigated in the existing literature. Compared to conventional static fingerprinting, a dynamically reconstructed database can adapt to a highly changing environment, which sustains localization accuracy. To deal with the time-varying issue, we propose a skeleton-assisted learning-based clustering localization (SALC) system, including RSS-oriented map-assisted clustering (ROMAC), cluster-based online database establishment (CODE), and cluster-scaled location estimation (CsLE). The SALC scheme jointly considers similarities from the skeleton-based shortest path (SSP) and the time-varying RSS measurements across the reference points (RPs). ROMAC clusters RPs into different feature sets and therefore selects suitable monitor points (MPs) for enhancing location estimation. Moreover, the CODE algorithm aims to establish an adaptive fingerprint database to alleviate the time-varying problem. Finally, CsLE is adopted to acquire the target position by leveraging the benefits of clustering information and estimated signal variations in order to rescale the weights from the weighted k-nearest neighbors (WkNN) method. Both simulation and experimental results demonstrate that the proposed SALC system can effectively reconstruct the fingerprint database with an enhanced location estimation accuracy, which outperforms the other existing schemes in the open literature.
Submitted 14 July, 2023;
originally announced July 2023.
-
Dynamic Tolling in Arc-based Traffic Assignment Models
Authors:
Chih-Yuan Chiu,
Chinmay Maheshwari,
Pan-Yang Su,
Shankar Sastry
Abstract:
Tolling in traffic networks offers a popular measure to minimize overall congestion. Existing toll designs primarily focus on congestion in route-based traffic assignment models (TAMs), in which travelers make a single route selection from their source to destination. However, these models do not reflect real-world traveler decisions because they preclude deviations from a chosen route, and because the enumeration of all routes is computationally expensive. To address these limitations, our work focuses on arc-based TAMs, in which travelers sequentially select individual arcs (or edges) on the network to reach their destination. We first demonstrate that marginal pricing, a tolling scheme commonly used in route-based TAMs, also achieves socially optimal congestion levels in our arc-based formulation. Then, we use perturbed best response dynamics to model the evolution of travelers' arc selection preferences over time, and a marginal pricing scheme to model the social planner's adaptive toll updates in response. We prove that our adaptive learning and marginal pricing dynamics converge to a neighborhood of the socially optimal loads and tolls. We then present empirical results that verify our theoretical claims.
Submitted 24 October, 2023; v1 submitted 11 July, 2023;
originally announced July 2023.
-
Efficient Adapters for Giant Speech Models
Authors:
Nanxin Chen,
Izhak Shafran,
Yu Zhang,
Chung-Cheng Chiu,
Hagen Soltau,
James Qin,
Yonghui Wu
Abstract:
Large pre-trained speech models are widely used as the de-facto paradigm, especially in scenarios when there is a limited amount of labeled data available. However, finetuning all parameters from the self-supervised learned model can be computationally expensive, and becomes infeasible as the size of the model and the number of downstream tasks scale. In this paper, we propose a novel approach called Two Parallel Adapter (TPA) that is inserted into the conformer-based pre-trained model instead. TPA is based on systematic studies of the residual adapter, a popular approach for finetuning a subset of parameters. We evaluate TPA on various public benchmarks, and experimental results demonstrate its superior performance, which is close to that of full finetuning on different datasets and speech tasks. These results show that TPA is an effective and efficient approach for serving large pre-trained speech models. Ablation studies show that TPA can also be pruned, especially for lower blocks.
Submitted 13 June, 2023;
originally announced June 2023.
-
Arc-based Traffic Assignment: Equilibrium Characterization and Learning
Authors:
Chih-Yuan Chiu,
Chinmay Maheshwari,
Pan-Yang Su,
Shankar Sastry
Abstract:
Arc-based traffic assignment models (TAMs) are a popular framework for modeling traffic network congestion generated by self-interested travelers who sequentially select arcs based on their perceived latency on the network. However, existing arc-based TAMs either assign travelers to cyclic paths, or do not extend to networks with bidirectional arcs (or edges) between nodes. To overcome these difficulties, we propose a new modeling framework for stochastic arc-based TAMs. Given a traffic network with bidirectional arcs, we replicate its arcs and nodes to construct a directed acyclic graph (DAG), which we call the Condensed DAG (CoDAG) representation. Self-interested travelers sequentially select arcs on the CoDAG representation to reach their destination. We show that the associated equilibrium flow, which we call the Condensed DAG equilibrium, exists, is unique, and can be characterized as the solution of a strictly convex optimization problem. Moreover, we propose a discrete-time dynamical system that captures a natural adaptation rule employed by self-interested travelers to learn about the emergent congestion on the network. We show that the arc flows generated by this adaptation rule converge to a neighborhood of the Condensed DAG equilibrium. To our knowledge, our work is the first to study learning and adaptation in an arc-based TAM. Finally, we present numerical results that corroborate our theoretical results.
Submitted 4 May, 2024; v1 submitted 10 April, 2023;
originally announced April 2023.
-
Scenario-Game ADMM: A Parallelized Scenario-Based Solver for Stochastic Noncooperative Games
Authors:
Jingqi Li,
Chih-Yuan Chiu,
Lasse Peters,
Fernando Palafox,
Mustafa Karabag,
Javier Alonso-Mora,
Somayeh Sojoudi,
Claire Tomlin,
David Fridovich-Keil
Abstract:
Decision-making in multi-player games can be extremely challenging, particularly under uncertainty. In this work, we propose a new sample-based approximation to a class of stochastic, general-sum, pure Nash games, where each player has an expected-value objective and a set of chance constraints. This new approximation scheme inherits the accuracy of objective approximation from the established sample average approximation (SAA) method and enjoys a feasibility guarantee derived from the scenario optimization literature. We characterize the sample complexity of this new game-theoretic approximation scheme, and observe that high accuracy usually requires a large number of samples, which results in a large number of sampled constraints. To accommodate this, we decompose the approximated game into a set of smaller games with few constraints for each sampled scenario, and propose a decentralized, consensus-based ADMM algorithm to efficiently compute a generalized Nash equilibrium (GNE) of the approximated game. We prove the convergence of our algorithm to a GNE and empirically demonstrate superior performance relative to a recent baseline algorithm based on ADMM and interior point method.
Submitted 13 September, 2023; v1 submitted 4 April, 2023;
originally announced April 2023.
-
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
Authors:
Yu Zhang,
Wei Han,
James Qin,
Yongqiang Wang,
Ankur Bapna,
Zhehuai Chen,
Nanxin Chen,
Bo Li,
Vera Axelrod,
Gary Wang,
Zhong Meng,
Ke Hu,
Andrew Rosenberg,
Rohit Prabhavalkar,
Daniel S. Park,
Parisa Haghani,
Jason Riesa,
Ginger Perng,
Hagen Soltau,
Trevor Strohman,
Bhuvana Ramabhadran,
Tara Sainath,
Pedro Moreno,
Chung-Cheng Chiu,
Johan Schalkwyk
, et al. (2 additional authors not shown)
Abstract:
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.
Submitted 24 September, 2023; v1 submitted 2 March, 2023;
originally announced March 2023.
-
Cost Inference for Feedback Dynamic Games from Noisy Partial State Observations and Incomplete Trajectories
Authors:
Jingqi Li,
Chih-Yuan Chiu,
Lasse Peters,
Somayeh Sojoudi,
Claire Tomlin,
David Fridovich-Keil
Abstract:
In multi-agent dynamic games, the Nash equilibrium state trajectory of each agent is determined by its cost function and the information pattern of the game. However, the cost and trajectory of each agent may be unavailable to the other agents. Prior work on using partial observations to infer the costs in dynamic games assumes an open-loop information pattern. In this work, we demonstrate that the feedback Nash equilibrium concept is more expressive and encodes more complex behavior. It is desirable to develop specific tools for inferring players' objectives in feedback games. Therefore, we consider the dynamic game cost inference problem under the feedback information pattern, using only partial state observations and incomplete trajectory data. To this end, we first propose an inverse feedback game loss function, whose minimizer yields a feedback Nash equilibrium state trajectory closest to the observation data. We characterize the landscape and differentiability of the loss function. Given the difficulty of obtaining the exact gradient, our main contribution is an efficient gradient approximator, which enables a novel inverse feedback game solver that minimizes the loss using first-order optimization. In thorough empirical evaluations, we demonstrate that our algorithm converges reliably and has better robustness and generalization performance than the open-loop baseline method when the observation data reflects a group of players acting in a feedback Nash game.
Submitted 3 January, 2023;
originally announced January 2023.
-
Towards Dynamic Causal Discovery with Rare Events: A Nonparametric Conditional Independence Test
Authors:
Chih-Yuan Chiu,
Kshitij Kulkarni,
Shankar Sastry
Abstract:
Causal phenomena associated with rare events occur across a wide range of engineering problems, such as risk-sensitive safety analysis, accident analysis and prevention, and extreme value theory. However, current methods for causal discovery are often unable to uncover causal links, between random variables in a dynamic setting, that manifest only when the variables first experience low-probability realizations. To address this issue, we introduce a novel statistical independence test on data collected from time-invariant dynamical systems in which rare but consequential events occur. In particular, we exploit the time-invariance of the underlying data to construct a superimposed dataset of the system state before rare events happen at different timesteps. We then design a conditional independence test on the reorganized data. We provide non-asymptotic sample complexity bounds for the consistency of our method, and validate its performance across various simulated and real-world datasets, including incident data collected from the Caltrans Performance Measurement System (PeMS). Code containing the datasets and experiments is publicly available.
Submitted 17 July, 2023; v1 submitted 29 November, 2022;
originally announced November 2022.
-
Textless Direct Speech-to-Speech Translation with Discrete Speech Representation
Authors:
Xinjian Li,
Ye Jia,
Chung-Cheng Chiu
Abstract:
Research on speech-to-speech translation (S2ST) has progressed rapidly in recent years. Many end-to-end systems have been proposed and show advantages over conventional cascade systems, which are often composed of recognition, translation and synthesis sub-systems. However, most of the end-to-end systems still rely on intermediate textual supervision during training, which makes it infeasible to work for languages without written forms. In this work, we propose a novel model, Textless Translatotron, which is based on Translatotron 2, for training an end-to-end direct S2ST model without any textual supervision. Instead of jointly training with an auxiliary task predicting target phonemes as in Translatotron 2, the proposed model uses an auxiliary task predicting discrete speech representations which are obtained from learned or random speech quantizers. When a speech encoder pre-trained with unsupervised speech data is used for both models, the proposed model obtains translation quality nearly on-par with Translatotron 2 on the multilingual CVSS-C corpus as well as the bilingual Fisher Spanish-English corpus. On the latter, it outperforms the prior state-of-the-art textless model by +18.5 BLEU.
Submitted 31 October, 2022;
originally announced November 2022.
-
An Analysis Method for Metric-Level Switching in Beat Tracking
Authors:
Ching-Yu Chiu,
Meinard Müller,
Matthew E. P. Davies,
Alvin Wen-Yu Su,
Yi-Hsuan Yang
Abstract:
For expressive music, the tempo may change over time, posing challenges to tracking the beats by an automatic model. The model may first tap to the correct tempo, but then may fail to adapt to a tempo change, or switch between several incorrect but perceptually plausible ones (e.g., half- or double-tempo). Existing evaluation metrics for beat tracking do not reflect such behaviors, as they typically assume a fixed relationship between the reference beats and estimated beats. In this paper, we propose a new performance analysis method, called annotation coverage ratio (ACR), that accounts for a variety of possible metric-level switching behaviors of beat trackers. The idea is to derive sequences of modified reference beats of all metrical levels for every two consecutive reference beats, and compare every sequence of modified reference beats to the subsequences of estimated beats. We show via experiments on three datasets of different genres the usefulness of ACR when utilized alongside existing metrics, and discuss the new insights to be gained.
Submitted 13 October, 2022;
originally announced October 2022.
-
JukeDrummer: Conditional Beat-aware Audio-domain Drum Accompaniment Generation via Transformer VQ-VAE
Authors:
Yueh-Kao Wu,
Ching-Yu Chiu,
Yi-Hsuan Yang
Abstract:
This paper proposes a model that generates a drum track in the audio domain to play along to a user-provided drum-free recording. Specifically, using paired data of drumless tracks and the corresponding human-made drum tracks, we train a Transformer model to improvise the drum part of an unseen drumless recording. We combine two approaches to encode the input audio. First, we train a vector-quantized variational autoencoder (VQ-VAE) to represent the input audio with discrete codes, which can then be readily used in a Transformer. Second, using an audio-domain beat tracking model, we compute beat-related features of the input audio and use them as embeddings in the Transformer. Instead of generating the drum track directly as waveforms, we use a separate VQ-VAE to encode the mel-spectrogram of a drum track into another set of discrete codes, and train the Transformer to predict the sequence of drum-related discrete codes. The output codes are then converted to a mel-spectrogram with a decoder, and then to the waveform with a vocoder. We report both objective and subjective evaluations of variants of the proposed model, demonstrating that the model with beat information generates drum accompaniment that is rhythmically and stylistically consistent with the input audio.
Submitted 31 October, 2022; v1 submitted 12 October, 2022;
originally announced October 2022.
-
SLAM Backends with Objects in Motion: A Unifying Framework and Tutorial
Authors:
Chih-Yuan Chiu
Abstract:
Simultaneous Localization and Mapping (SLAM) algorithms are frequently deployed to support a wide range of robotics applications, such as autonomous navigation in unknown environments, and scene mapping in virtual reality. Many of these applications require autonomous agents to perform SLAM in highly dynamic scenes. To this end, this tutorial extends a recently introduced, unifying optimization-based SLAM backend framework to environments with moving objects and features. Using this framework, we consider a rapprochement of recent advances in dynamic SLAM. Moreover, we present dynamic EKF SLAM: a novel, filtering-based dynamic SLAM algorithm generated from our framework, and prove that it is mathematically equivalent to a direct extension of the classical EKF SLAM algorithm to the dynamic environment setting. Empirical results with simulated data indicate that dynamic EKF SLAM can achieve high localization and mobile object pose estimation accuracy, as well as high map precision, with high efficiency.
Submitted 27 February, 2023; v1 submitted 11 July, 2022;
originally announced July 2022.
-
Using Loaded N-port Structures to Achieve the Continuous-Space Electromagnetic Channel Capacity Bound
Authors:
Zixiang Han,
Shanpu Shen,
Yujie Zhang,
Shiwen Tang,
Chi-Yuk Chiu,
Ross Murch
Abstract:
A method for achieving the continuous-space electromagnetic channel capacity bound using loaded N-port structures is described. It is relevant for the design of compact multiple-input multiple-output (MIMO) antennas that can achieve channel capacity bounds when constrained by size. The method is not restricted to a specific antenna configuration, and a closed-form expression for the channel capacity limits is provided with various constraints. Furthermore, using loaded N-port structures to represent arbitrary antenna geometries, an efficient optimization approach is proposed for finding the optimum MIMO antenna design that achieves the channel capacity bounds. Simulation results of the channel capacity bounds achieved using our MIMO antenna design with one square wavelength size are provided. These show that at least 18 ports can be supported in one square wavelength and achieve the continuous-space electromagnetic channel capacity bound. The results demonstrate that our method can link continuous-space electromagnetic channel capacity bounds to MIMO antenna design.
Submitted 25 May, 2022;
originally announced May 2022.
-
Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data
Authors:
Alëna Aksënova,
Zhehuai Chen,
Chung-Cheng Chiu,
Daan van Esch,
Pavel Golik,
Wei Han,
Levi King,
Bhuvana Ramabhadran,
Andrew Rosenberg,
Suzan Schwartz,
Gary Wang
Abstract:
Building inclusive speech recognition systems is a crucial step towards developing technologies that speakers of all language varieties can use. Therefore, ASR systems must work for everybody independently of the way they speak. To accomplish this goal, there should be available data sets representing language varieties, and also an understanding of model configuration that is the most helpful in achieving robust understanding of all types of speech. However, there are not enough data sets for accented speech, and for the ones that are already available, more training approaches need to be explored to improve the quality of accented speech recognition. In this paper, we discuss recent progress towards developing more inclusive ASR systems, namely, the importance of building new data sets representing linguistic diversity, and exploring novel training approaches to improve performance for all users. We address recent directions within benchmarking ASR systems for accented speech, measure the effects of wav2vec 2.0 pre-training on accented speech recognition, and highlight corpora relevant for diverse ASR evaluations.
Submitted 16 May, 2022;
originally announced May 2022.
-
GTP-SLAM: Game-Theoretic Priors for Simultaneous Localization and Mapping in Multi-Agent Scenarios
Authors:
Chih-Yuan Chiu,
David Fridovich-Keil
Abstract:
Robots operating in multi-player settings must simultaneously model the environment and the behavior of human or robotic agents who share that environment. This modeling is often approached using Simultaneous Localization and Mapping (SLAM); however, SLAM algorithms usually neglect multi-player interactions. In contrast, the motion planning literature often uses dynamic game theory to explicitly model noncooperative interactions of multiple agents in a known environment with perfect localization. Here, we present GTP-SLAM, a novel, iterative best response-based SLAM algorithm that accurately performs state localization and map reconstruction, while using game theoretic priors to capture the inherent non-cooperative interactions among multiple agents in an uncharted scene. By formulating the underlying SLAM problem as a potential game, we inherit a strong convergence guarantee. Empirical results indicate that, when deployed in a realistic traffic simulation, our approach performs localization and mapping more accurately than a standard bundle adjustment algorithm across a wide range of noise levels.
Submitted 8 August, 2022; v1 submitted 30 March, 2022;
originally announced March 2022.
-
On Real-time Image Reconstruction with Neural Networks for MRI-guided Radiotherapy
Authors:
David E. J. Waddington,
Nicholas Hindley,
Neha Koonjoo,
Christopher Chiu,
Tess Reynolds,
Paul Z. Y. Liu,
Bo Zhu,
Danyal Bhutto,
Chiara Paganelli,
Paul J. Keall,
Matthew S. Rosen
Abstract:
MRI-guidance techniques that dynamically adapt radiation beams to follow tumor motion in real time will lead to more accurate cancer treatments and reduced collateral healthy tissue damage. The gold standard for reconstruction of undersampled MR data is compressed sensing (CS), which is computationally slow and limits the rate at which images can be available for real-time adaptation. Here, we demonstrate the use of automated transform by manifold approximation (AUTOMAP), a generalized framework that maps raw MR signal to the target image domain, to rapidly reconstruct images from undersampled radial k-space data. The AUTOMAP neural network was trained to reconstruct images from a golden-angle radial acquisition, a benchmark for motion-sensitive imaging, on lung cancer patient data and generic images from ImageNet. Model training was subsequently augmented with motion-encoded k-space data derived from videos in the YouTube-8M dataset to encourage motion-robust reconstruction. We find that AUTOMAP-reconstructed radial k-space has equivalent accuracy to CS but with much shorter processing times after initial fine-tuning on retrospectively acquired lung cancer patient data. Validation of motion-trained models with a virtual dynamic lung tumor phantom showed that the generalized motion properties learned from YouTube lead to improved target tracking accuracy. Our work shows that AUTOMAP can achieve real-time, accurate reconstruction of radial data. These findings imply that neural-network-based reconstruction is potentially superior to existing approaches for real-time image guidance applications.
Submitted 18 May, 2022; v1 submitted 9 February, 2022;
originally announced February 2022.
-
Self-supervised Learning with Random-projection Quantizer for Speech Recognition
Authors:
Chung-Cheng Chiu,
James Qin,
Yu Zhang,
Jiahui Yu,
Yonghui Wu
Abstract:
We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict the masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular, the quantizer projects speech inputs with a randomly initialized matrix, and does a nearest-neighbor lookup in a randomly initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separated from the speech recognition model, the design makes the approach flexible and compatible with universal speech recognition architectures. On LibriSpeech, our approach achieves word error rates similar to those of previous work using self-supervised learning with non-streaming models, and provides lower word error rates and latency than wav2vec 2.0 and w2v-BERT with streaming models. On multilingual tasks, the approach also provides significant improvement over wav2vec 2.0 and w2v-BERT.
Submitted 29 June, 2022; v1 submitted 3 February, 2022;
originally announced February 2022.
-
Simultaneous Localization and Mapping: Through the Lens of Nonlinear Optimization
Authors:
Amay Saxena,
Chih-Yuan Chiu,
Joseph Menke,
Ritika Shrivastava,
Shankar Sastry
Abstract:
Simultaneous Localization and Mapping (SLAM) algorithms perform visual-inertial estimation via filtering or batch optimization methods. Empirical evidence suggests that filtering algorithms are computationally faster, while optimization methods are more accurate. This work presents an optimization-based framework that unifies these approaches, and allows users to flexibly implement different design choices, e.g., the number and types of variables maintained in the algorithm at each time. We prove that filtering methods correspond to specific design choices in our generalized framework. We then reformulate the Multi-State Constrained Kalman Filter (MSCKF), implement the reformulation on challenging image sequence datasets in simulation, and contrast its performance with that of sliding window based filters. Using these results, we explain the relative performance characteristics of these two classes of algorithms in the context of our algorithm. Finally, we illustrate that under different design choices, the empirical performance of our algorithm interpolates between those of state-of-the-art approaches.
Submitted 3 August, 2022; v1 submitted 11 December, 2021;
originally announced December 2021.
-
Cross-attention conformer for context modeling in speech enhancement for ASR
Authors:
Arun Narayanan,
Chung-Cheng Chiu,
Tom O'Malley,
Quan Wang,
Yanzhang He
Abstract:
This work introduces the cross-attention conformer, an attention-based architecture for context modeling in speech enhancement. Given that the context information can often be sequential, and of a different length than the audio that is to be enhanced, we make use of cross-attention to summarize and merge contextual information with input features. Building upon the recently proposed conformer model that uses self-attention layers as building blocks, the proposed cross-attention conformer can be used to build deep contextual models. As a concrete example, we show how noise context, i.e., a short noise-only audio segment preceding an utterance, can be used to build a speech enhancement feature frontend using cross-attention conformer layers for improving noise robustness of automatic speech recognition.
Submitted 29 October, 2021;
originally announced November 2021.
-
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
Authors:
Yu Zhang,
Daniel S. Park,
Wei Han,
James Qin,
Anmol Gulati,
Joel Shor,
Aren Jansen,
Yuanzhong Xu,
Yanping Huang,
Shibo Wang,
Zongwei Zhou,
Bo Li,
Min Ma,
William Chan,
Jiahui Yu,
Yongqiang Wang,
Liangliang Cao,
Khe Chai Sim,
Bhuvana Ramabhadran,
Tara N. Sainath,
Françoise Beaufays,
Zhifeng Chen,
Quoc V. Le,
Chung-Cheng Chiu,
Ruoming Pang
, et al. (1 additional author not shown)
Abstract:
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.
Submitted 21 July, 2022; v1 submitted 27 September, 2021;
originally announced September 2021.
-
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
Authors:
Yu-An Chung,
Yu Zhang,
Wei Han,
Chung-Cheng Chiu,
James Qin,
Ruoming Pang,
Yonghui Wu
Abstract:
Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, we propose w2v-BERT that explores MLM for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM, where the former trains the model to discretize input continuous speech signals into a finite set of discriminative speech tokens, and the latter trains the model to learn contextualized speech representations via solving a masked prediction task consuming the discretized tokens. In contrast to existing MLM-based speech pre-training frameworks such as HuBERT, which relies on an iterative re-clustering and re-training process, or vq-wav2vec, which concatenates two separately trained modules, w2v-BERT can be optimized in an end-to-end fashion by solving the two self-supervised tasks (the contrastive task and MLM) simultaneously. Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models on the LibriSpeech benchmarks when using the Libri-Light 60k corpus as the unsupervised data. In particular, when compared to published models such as conformer-based wav2vec 2.0 and HuBERT, our model shows 5% to 10% relative WER reduction on the test-clean and test-other subsets. When applied to Google's Voice Search traffic dataset, w2v-BERT outperforms our internal conformer-based wav2vec 2.0 by more than 30% relatively.
Submitted 13 September, 2021; v1 submitted 7 August, 2021;
originally announced August 2021.
-
Source Separation-based Data Augmentation for Improved Joint Beat and Downbeat Tracking
Authors:
Ching-Yu Chiu,
Joann Ching,
Wen-Yi Hsiao,
Yu-Hua Chen,
Alvin Wen-Yu Su,
Yi-Hsuan Yang
Abstract:
Due to advances in deep learning, the performance of automatic beat and downbeat tracking in musical audio signals has seen great improvement in recent years. In training such deep learning based models, data augmentation has been found to be an important technique. However, existing data augmentation methods for this task mainly target balancing the distribution of the training data with respect to their tempo. In this paper, we investigate another approach for data augmentation, to account for the composition of the training data in terms of the percussive and non-percussive sound sources. Specifically, we propose to employ a blind drum separation model to segregate the drum and non-drum sounds from each training audio signal, filtering out training signals that are drumless, and then use the obtained drum and non-drum stems to augment the training data. We report experiments on four completely unseen test sets, validating the effectiveness of the proposed method, and accordingly the importance of drum sound composition in the training data for beat and downbeat tracking.
Submitted 16 June, 2021;
originally announced June 2021.
-
Drum-Aware Ensemble Architecture for Improved Joint Musical Beat and Downbeat Tracking
Authors:
Ching-Yu Chiu,
Alvin Wen-Yu Su,
Yi-Hsuan Yang
Abstract:
This paper presents a novel system architecture that integrates blind source separation with joint beat and downbeat tracking in musical audio signals. The source separation module segregates the percussive and non-percussive components of the input signal, over which beat and downbeat tracking are performed separately and then the results are aggregated with a learnable fusion mechanism. This way, the system can adaptively determine how much the tracking result for an input signal should depend on the input's percussive or non-percussive components. Evaluation on four testing sets that feature different levels of presence of drum sounds shows that the new architecture consistently outperforms the widely-adopted baseline architecture that does not employ source separation.
Submitted 16 June, 2021;
originally announced June 2021.
-
Stabilizability of Vector Systems with Uniform Actuation Unpredictability
Authors:
Rahul Arya,
Chih-Yuan Chiu,
Gireeja Ranade
Abstract:
This paper explores the fundamental limits of a simple system, inspired by the intermittent Kalman filtering model, where the actuation direction is drawn uniformly from the unit hypersphere. The model allows us to focus on a fundamental tension in the control of underactuated vector systems -- the need to balance the growth of the system in different dimensions.
We characterize the stabilizability of $d$-dimensional systems with symmetric gain matrices by providing tight necessary and sufficient conditions that depend on the eigenvalues of the system. The proof technique is slightly different from the standard dynamic programming approach and relies on the fact that the second moment stability of the system can also be understood by examining any arbitrary weighted two-norm of the state.
Submitted 17 May, 2021; v1 submitted 10 May, 2021;
originally announced May 2021.
-
Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models
Authors:
Thibault Doutre,
Wei Han,
Chung-Cheng Chiu,
Ruoming Pang,
Olivier Siohan,
Liangliang Cao
Abstract:
Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real-time. Their minimal latency makes them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal with no future context and suffer from higher word error rates (WER). To improve streaming models, a recent study [1] proposed to distill a non-streaming teacher model on unsupervised utterances, and then train a streaming student using the teachers' predictions. However, the performance gap between teacher and student WERs remains high. In this paper, we aim to close this gap by using a diversified set of non-streaming teacher models and combining them using Recognizer Output Voting Error Reduction (ROVER). In particular, we show that, despite being weaker than RNN-T models, CTC models are remarkable teachers. Further, by fusing RNN-T and CTC models together, we build the strongest teachers. The resulting student models drastically improve upon streaming models of previous work [1]: the WER decreases by 41% on Spanish, 27% on Portuguese, and 13% on French.
Submitted 25 April, 2021;
originally announced April 2021.
-
Pushing the Limits of Non-Autoregressive Speech Recognition
Authors:
Edwin G. Ng,
Chung-Cheng Chiu,
Yu Zhang,
William Chan
Abstract:
We combine recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition. We push the limits of non-autoregressive state-of-the-art results for multiple datasets: LibriSpeech, Fisher+Switchboard and Wall Street Journal. Key to our recipe, we leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training. We achieve 1.8%/3.6% WER on LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% on the Wall Street Journal, all without a language model.
Submitted 11 September, 2021; v1 submitted 7 April, 2021;
originally announced April 2021.
-
A Better and Faster End-to-End Model for Streaming ASR
Authors:
Bo Li,
Anmol Gulati,
Jiahui Yu,
Tara N. Sainath,
Chung-Cheng Chiu,
Arun Narayanan,
Shuo-Yiin Chang,
Ruoming Pang,
Yanzhang He,
James Qin,
Wei Han,
Qiao Liang,
Yu Zhang,
Trevor Strohman,
Yonghui Wu
Abstract:
End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the model still tends to delay the predictions towards the end and thus has much higher partial latency compared to a conventional ASR model. To address this issue, we look at encouraging the E2E model to emit words early, through an algorithm called FastEmit [3]. Naturally, improving on latency results in a quality degradation. To address this, we explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which have shown good improvements for ASR. Secondly, we also explore running a 2nd-pass beam search to improve quality. In order to ensure the 2nd-pass completes quickly, we explore non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an algorithm called Cascaded Encoders [5]. Overall, we find that the Conformer RNN-T with Cascaded Encoders offers a better quality and latency tradeoff for streaming ASR.
Submitted 11 February, 2021; v1 submitted 21 November, 2020;
originally announced November 2020.
-
Efficient Knowledge Distillation for RNN-Transducer Models
Authors:
Sankaran Panchapagesan,
Daniel S. Park,
Chung-Cheng Chiu,
Yuan Shangguan,
Qiao Liang,
Alexander Gruenstein
Abstract:
Knowledge Distillation is an effective method of transferring knowledge from a large model to a smaller model. Distillation can be viewed as a type of model compression, and has played an important role for on-device ASR applications. In this paper, we develop a distillation method for RNN-Transducer (RNN-T) models, a popular end-to-end neural network architecture for streaming speech recognition. Our proposed distillation loss is simple and efficient, and uses only the "y" and "blank" posterior probabilities from the RNN-T output probability lattice. We study the effectiveness of the proposed approach in improving the accuracy of sparse RNN-T models obtained by gradually pruning a larger uncompressed model, which also serves as the teacher during distillation. With distillation of 60% and 90% sparse multi-domain RNN-T models, we obtain WER reductions of 4.3% and 12.1% respectively, on a noisy FarField eval set. We also present results of experiments on LibriSpeech, where the introduction of the distillation loss yields a 4.8% relative WER reduction on the test-other dataset for a small Conformer model.
Submitted 11 November, 2020;
originally announced November 2020.
-
Encoding Defensive Driving as a Dynamic Nash Game
Authors:
Chih-Yuan Chiu,
David Fridovich-Keil,
Claire J. Tomlin
Abstract:
Robots deployed in real-world environments should operate safely in a robust manner. In scenarios where an "ego" agent navigates in an environment with multiple other "non-ego" agents, two modes of safety are commonly proposed -- adversarial robustness and probabilistic constraint satisfaction. However, while the former is generally computationally intractable and leads to overconservative solutions, the latter typically relies on strong distributional assumptions and ignores strategic coupling between agents.
To avoid these drawbacks, we present a novel formulation of robustness within the framework of general-sum dynamic game theory, modeled on defensive driving. More precisely, we prepend an adversarial phase to the ego agent's cost function. That is, we prepend a time interval during which other agents are assumed to be temporarily distracted, in order to render the ego agent's equilibrium trajectory robust against other agents' potentially dangerous behavior during this time. We demonstrate the effectiveness of our new formulation in encoding safety via multiple traffic scenarios.
Submitted 30 March, 2021; v1 submitted 9 November, 2020;
originally announced November 2020.
-
Reducing latency and bandwidth for video streaming using keypoint extraction and digital puppetry
Authors:
Roshan Prabhakar,
Shubham Chandak,
Carina Chiu,
Renee Liang,
Huong Nguyen,
Kedar Tatwawadi,
Tsachy Weissman
Abstract:
COVID-19 has made video communication one of the most important modes of information exchange. While extensive research has been conducted on the optimization of the video streaming pipeline, in particular the development of novel video codecs, further improvement in the video quality and latency is required, especially under poor network conditions. This paper proposes an alternative to the conventional codec through the implementation of a keypoint-centric encoder relying on the transmission of keypoint information from within a video feed. The decoder uses the streamed keypoints to generate a reconstruction preserving the semantic features in the input feed. Focusing on video calling applications, we detect and transmit the body pose and face mesh information through the network, which are displayed at the receiver in the form of animated puppets. Using efficient pose and face mesh detection in conjunction with skeleton-based animation, we demonstrate a prototype requiring lower than 35 kbps bandwidth, an order of magnitude reduction over typical video calling systems. The added computational latency due to the mesh extraction and animation is below 120ms on a standard laptop, showcasing the potential of this framework for real-time applications. The code for this work is available at https://github.com/shubhamchandak94/digital-puppetry/.
Submitted 8 January, 2021; v1 submitted 7 November, 2020;
originally announced November 2020.
-
Cascaded encoders for unifying streaming and non-streaming ASR
Authors:
Arun Narayanan,
Tara N. Sainath,
Ruoming Pang,
Jiahui Yu,
Chung-Cheng Chiu,
Rohit Prabhavalkar,
Ehsan Variani,
Trevor Strohman
Abstract:
End-to-end (E2E) automatic speech recognition (ASR) models, by now, have shown competitive performance on several benchmarks. These models are structured to either operate in streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously. The proposed model consists of streaming and non-streaming encoders. Input features are first processed by the streaming encoder; the non-streaming encoder operates exclusively on the output of the streaming encoder. A single decoder then learns to decode either using the output of the streaming or the non-streaming encoder. Results show that this model achieves similar word error rates (WER) as a standalone streaming model when operating in streaming mode, and obtains 10%–27% relative improvement when operating in non-streaming mode. Our results also show that the proposed approach outperforms existing E2E two-pass models, especially on long-form speech.
Submitted 27 October, 2020;
originally announced October 2020.
-
Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data
Authors:
Thibault Doutre,
Wei Han,
Min Ma,
Zhiyun Lu,
Chung-Cheng Chiu,
Ruoming Pang,
Arun Narayanan,
Ananya Misra,
Yu Zhang,
Liangliang Cao
Abstract:
Streaming end-to-end automatic speech recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, compared to their non-streaming counterparts. Consequently, streaming models usually perform worse than non-streaming models. We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher to generate transcripts on an arbitrarily large data set, which is then used to distill knowledge into streaming ASR models. This way, we scale the training of streaming models to up to 3 million hours of YouTube audio. Experiments show that our approach can significantly reduce the word error rate (WER) of RNNT models not only on LibriSpeech but also on YouTube data in four languages. For example, in French, we are able to reduce the WER by 16.4% relatively to a baseline streaming model by leveraging a non-streaming teacher model trained on the same amount of labeled data as the baseline.
Submitted 21 February, 2021; v1 submitted 22 October, 2020;
originally announced October 2020.
-
FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization
Authors:
Jiahui Yu,
Chung-Cheng Chiu,
Bo Li,
Shuo-yiin Chang,
Tara N. Sainath,
Yanzhang He,
Arun Narayanan,
Wei Han,
Anmol Gulati,
Yonghui Wu,
Ruoming Pang
Abstract:
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible. However, emitting fast without degrading quality, as measured by word error rate (WER), is highly challenging. Existing approaches including Early and Late Penalties and Constrained Alignments penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models. While being successful in reducing delay, these approaches suffer from significant accuracy regression and also require additional word alignment information from an existing model. In this work, we propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models, and does not require any alignment. We demonstrate that FastEmit is more suitable to the sequence-level optimization of transducer models for streaming ASR by applying it on various end-to-end streaming ASR networks including RNN-Transducer, Transformer-Transducer, ConvNet-Transducer and Conformer-Transducer. We achieve 150-300 ms latency reduction with significantly better accuracy over previous techniques on a Voice Search test set. FastEmit also improves streaming ASR accuracy from 4.4%/8.9% to 3.1%/7.5% WER, meanwhile reduces 90th percentile latency from 210 ms to only 30 ms on LibriSpeech.
Submitted 3 February, 2021; v1 submitted 21 October, 2020;
originally announced October 2020.
-
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
Authors:
Yu Zhang,
James Qin,
Daniel S. Park,
Wei Han,
Chung-Cheng Chiu,
Ruoming Pang,
Quoc V. Le,
Yonghui Wu
Abstract:
We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech utilizing the unlabeled audio of the Libri-Light dataset. More precisely, we carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training. By doing so, we are able to achieve word-error-rates (WERs) 1.4%/2.6% on the LibriSpeech test/test-other sets against the current state-of-the-art WERs 1.7%/3.3%.
Submitted 20 July, 2022; v1 submitted 20 October, 2020;
originally announced October 2020.
-
Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling
Authors:
Jiahui Yu,
Wei Han,
Anmol Gulati,
Chung-Cheng Chiu,
Bo Li,
Tara N. Sainath,
Yonghui Wu,
Ruoming Pang
Abstract:
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR, especially with inplace knowledge distillation during the training. The Dual-mode ASR framework can be applied to recent state-of-the-art convolution-based and transformer-based ASR networks. We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets, a widely used public dataset LibriSpeech and a large-scale dataset MultiDomain. Experiments and ablation studies demonstrate that Dual-mode ASR not only simplifies the workflow of training and deploying streaming and full-context ASR models, but also significantly improves both emission latency and recognition accuracy of streaming ASR. With Dual-mode ASR, we achieve new state-of-the-art streaming ASR results on both LibriSpeech and MultiDomain in terms of accuracy and latency.
Submitted 27 January, 2021; v1 submitted 12 October, 2020;
originally announced October 2020.
-
Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition
Authors:
Wei Li,
James Qin,
Chung-Cheng Chiu,
Ruoming Pang,
Yanzhang He
Abstract:
Recent advances of end-to-end models have outperformed conventional models through employing a two-pass model. The two-pass model provides better speed-quality trade-offs for on-device speech recognition, where a 1st-pass model generates hypotheses in a streaming fashion, and a 2nd-pass model re-scores the hypotheses with full audio sequence context. The 2nd-pass model plays a key role in the quality improvement of the end-to-end model to surpass the conventional model. One main challenge of the two-pass model is the computation latency introduced by the 2nd-pass model. Specifically, the original design of the two-pass model uses LSTMs for the 2nd-pass model, which are subject to long latency as they are constrained by the recurrent nature and have to run inference sequentially. In this work we explore replacing the LSTM layers in the 2nd-pass rescorer with Transformer layers, which can process the entire hypothesis sequences in parallel and can therefore utilize the on-device computation resources more efficiently. Compared with an LSTM-based baseline, our proposed Transformer rescorer achieves more than 50% latency reduction with quality improvement.
Submitted 2 September, 2020; v1 submitted 30 August, 2020;
originally announced August 2020.
-
Adaptive multi-channel event segmentation and feature extraction for monitoring health outcomes
Authors:
Xichen She,
Yaya Zhai,
Ricardo Henao,
Christopher W. Woods,
Christopher Chiu,
Geoffrey S. Ginsburg,
Peter X. K. Song,
Alfred O. Hero
Abstract:
$\textbf{Objective}$: To develop a multi-channel device event segmentation and feature extraction algorithm that is robust to changes in data distribution. $\textbf{Methods}$: We introduce an adaptive transfer learning algorithm to classify and segment events from non-stationary multi-channel temporal data. Using a multivariate hidden Markov model (HMM) and Fisher's linear discriminant analysis (FLDA) the algorithm adaptively adjusts to shifts in distribution over time. The proposed algorithm is unsupervised and learns to label events without requiring $\textit{a priori}$ information about true event states. The procedure is illustrated on experimental data collected from a cohort in a human viral challenge (HVC) study, where certain subjects have disrupted wake and sleep patterns after exposure to a H1N1 influenza pathogen. $\textbf{Results}$: Simulations establish that the proposed adaptive algorithm significantly outperforms other event classification methods. When applied to early time points in the HVC data the algorithm extracts sleep/wake features that are predictive of both infection and infection onset time. $\textbf{Conclusion}$: The proposed transfer learning event segmentation method is robust to temporal shifts in data distribution and can be used to produce highly discriminative event-labeled features for health monitoring. $\textbf{Significance}$: Our integrated multisensor signal processing and transfer learning method is applicable to many ambulatory monitoring applications.
Submitted 19 November, 2020; v1 submitted 20 August, 2020;
originally announced August 2020.
-
Mixing-Specific Data Augmentation Techniques for Improved Blind Violin/Piano Source Separation
Authors:
Ching-Yu Chiu,
Wen-Yi Hsiao,
Yin-Cheng Yeh,
Yi-Hsuan Yang,
Alvin Wen-Yu Su
Abstract:
Blind music source separation has been a popular and active subject of research in both the music information retrieval and signal processing communities. To counter the lack of available multi-track data for supervised model training, a data augmentation method that creates artificial mixtures by combining tracks from different songs has been shown useful in recent works. Following this light, we examine further in this paper extended data augmentation methods that consider more sophisticated mixing settings employed in the modern music production routine, the relationship between the tracks to be combined, and factors of silence. As a case study, we consider the separation of violin and piano tracks in a violin piano ensemble, evaluating the performance in terms of common metrics, namely SDR, SIR, and SAR. In addition to examining the effectiveness of these new data augmentation methods, we also study the influence of the amount of training data. Our evaluation shows that the proposed mixing-specific data augmentation methods can help improve the performance of a deep learning-based model for source separation, especially in the case of small training data.
Submitted 6 August, 2020;
originally announced August 2020.
-
Improved Noisy Student Training for Automatic Speech Recognition
Authors:
Daniel S. Park,
Yu Zhang,
Ye Jia,
Wei Han,
Chung-Cheng Chiu,
Bo Li,
Yonghui Wu,
Quoc V. Le
Abstract:
Recently, a semi-supervised learning method known as "noisy student training" has been shown to improve image classification performance of deep networks significantly. Noisy student training is an iterative self-training method that leverages augmentation to improve network performance. In this work, we adapt and improve noisy student training for automatic speech recognition, employing (adaptive) SpecAugment as the augmentation method. We find effective methods to filter, balance and augment the data generated in between self-training iterations. By doing so, we are able to obtain word error rates (WERs) 4.2%/8.6% on the clean/noisy LibriSpeech test sets by only using the clean 100h subset of LibriSpeech as the supervised set and the rest (860h) as the unlabeled set. Furthermore, we are able to achieve WERs 1.7%/3.4% on the clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h (4.74%/12.20%) and LibriSpeech (1.9%/4.1%).
Submitted 29 October, 2020; v1 submitted 19 May, 2020;
originally announced May 2020.
-
Conformer: Convolution-augmented Transformer for Speech Recognition
Authors:
Anmol Gulati,
James Qin,
Chung-Cheng Chiu,
Niki Parmar,
Yu Zhang,
Jiahui Yu,
Wei Han,
Shibo Wang,
Zhengdong Zhang,
Yonghui Wu,
Ruoming Pang
Abstract:
Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolution neural networks and transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way. To this regard, we propose the convolution-augmented transformer for speech recognition, named Conformer. Conformer significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies. On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/testother. We also observe competitive performance of 2.7%/6.3% with a small model of only 10M parameters.
Submitted 16 May, 2020;
originally announced May 2020.
-
Eyes-Closed Safety Kernels: Safety for Autonomous Systems Under Loss of Observability
Authors:
Forrest Laine,
Chiu-Yuan Chiu,
Claire Tomlin
Abstract:
A framework is presented for handling a potential loss of observability of a dynamical system in a provably-safe way. Inspired by the fragility of data-driven perception systems used by autonomous vehicles, we formulate the problem that arises when a sensing modality fails or is found to be untrustworthy during autonomous operation. We cast this problem as a differential game played between the dynamical system being controlled and the external system factor(s) for which observations are lost. The game is a zero-sum Stackelberg game in which the controlled system (leader) is trying to find a trajectory which maximizes a function representing the safety of the system, and the unobserved factor (follower) is trying to minimize the same function. The set of winning initial configurations of this game for the controlled system represent the set of all states in which safety can be maintained with respect to the external factor, even if observability of that factor is lost. This is the set we refer to as the Eyes-Closed Safety Kernel. In practical use, the policy defined by the winning strategy of the controlled system is only needed to be executed whenever observability of the external system is lost or the system deviates from the Eyes-Closed Safety Kernel due to other, non-safety oriented control schemes. We present a means for solving this game offline, such that the resulting winning strategy can be used for computationally efficient, provably-safe, online control when needed. The solution approach presented is based on representing the game using the solutions of two Hamilton-Jacobi partial differential equations. We illustrate the applicability of our framework by working through a realistic example in which an autonomous car must avoid a dynamic obstacle despite potentially losing observability.
Submitted 15 May, 2020; v1 submitted 14 May, 2020;
originally announced May 2020.
-
RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions
Authors:
Chung-Cheng Chiu,
Arun Narayanan,
Wei Han,
Rohit Prabhavalkar,
Yu Zhang,
Navdeep Jaitly,
Ruoming Pang,
Tara N. Sainath,
Patrick Nguyen,
Liangliang Cao,
Yonghui Wu
Abstract:
In recent years, all-neural end-to-end approaches have obtained state-of-the-art results on several challenging automatic speech recognition (ASR) tasks. However, most existing works focus on building ASR models where train and test data are drawn from the same domain. This results in poor generalization characteristics on mismatched-domains: e.g., end-to-end models trained on short segments perform poorly when evaluated on longer utterances. In this work, we analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models in order to identify model components that negatively affect generalization performance. We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference. On a long-form YouTube test set, when the nonstreaming RNN-T model is trained with shorter segments of data, the proposed combination improves word error rate (WER) from 22.3% to 14.8%; when the streaming RNN-T model trained on short Search queries, the proposed techniques improve WER on the YouTube set from 67.0% to 25.3%. Finally, when trained on Librispeech, we find that dynamic overlapping inference improves WER on YouTube from 99.8% to 33.0%.
Submitted 23 December, 2020; v1 submitted 7 May, 2020;
originally announced May 2020.
-
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
Authors:
Wei Han,
Zhengdong Zhang,
Yu Zhang,
Jiahui Yu,
Chung-Cheng Chiu,
James Qin,
Anmol Gulati,
Ruoming Pang,
Yonghui Wu
Abstract:
Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet that achieves good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system of 2.0%/4.6% with LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.
Submitted 15 May, 2020; v1 submitted 6 May, 2020;
originally announced May 2020.
-
SpecAugment on Large Scale Datasets
Authors:
Daniel S. Park,
Yu Zhang,
Chung-Cheng Chiu,
Youzheng Chen,
Bo Li,
William Chan,
Quoc V. Le,
Yonghui Wu
Abstract:
Recently, SpecAugment, an augmentation scheme for automatic speech recognition that acts directly on the spectrogram of input utterances, has shown to be highly effective in enhancing the performance of end-to-end networks on public datasets. In this paper, we demonstrate its effectiveness on tasks with large scale datasets by investigating its application to the Google Multidomain Dataset (Narayanan et al., 2018). We achieve improvement across all test domains by mixing raw training data augmented with SpecAugment and noise-perturbed training data when training the acoustic model. We also introduce a modification of SpecAugment that adapts the time mask size and/or multiplicity depending on the length of the utterance, which can potentially benefit large scale tasks. By using adaptive masking, we are able to further improve the performance of the Listen, Attend and Spell model on LibriSpeech to 2.2% WER on test-clean and 5.2% WER on test-other.
Submitted 11 December, 2019;
originally announced December 2019.
-
Speech Sentiment Analysis via Pre-trained Features from End-to-end ASR Models
Authors:
Zhiyun Lu,
Liangliang Cao,
Yu Zhang,
Chung-Cheng Chiu,
James Fan
Abstract:
In this paper, we propose to use pre-trained features from end-to-end ASR models to solve speech sentiment analysis as a down-stream task. We show that end-to-end ASR features, which integrate both acoustic and text information from speech, achieve promising results. We use RNN with self-attention as the sentiment classifier, which also provides an easy visualization through attention weights to help interpret model predictions. We use well benchmarked IEMOCAP dataset and a new large-scale speech sentiment dataset SWBD-sentiment for evaluation. Our approach improves the-state-of-the-art accuracy on IEMOCAP from 66.6% to 71.7%, and achieves an accuracy of 70.10% on SWBD-sentiment with more than 49,500 utterances.
Submitted 4 March, 2020; v1 submitted 21 November, 2019;
originally announced November 2019.