Showing 1–50 of 61 results for author: Chiu, C

Searching in archive eess.
  1. arXiv:2406.05499  [pdf, other]

    eess.SP

    A Pixel-based Reconfigurable Antenna Design for Fluid Antenna Systems

    Authors: Jichen Zhang, Junhui Rao, Zhaoyang Ming, Zan Li, Chi-Yuk Chiu, Kai-Kit Wong, Kin-Fai Tong, Ross Murch

    Abstract: Fluid Antenna Systems (FASs) have recently been proposed for enhancing the performance of wireless communication. Previous antenna designs to meet the requirements of FAS have been based on mechanically movable or liquid antennas and therefore have limited reconfiguration speeds. In this paper, we propose a design for a pixel-based reconfigurable antenna (PRA) that meets the requirements of FAS an…

    Submitted 14 June, 2024; v1 submitted 8 June, 2024; originally announced June 2024.

    Comments: 13 pages, 16 figures, Submitted to IEEE Transactions on Antennas and Propagation

  2. arXiv:2406.02975  [pdf, other]

    eess.SP

    A Shared-Aperture Dual-Band sub-6 GHz and mmWave Reconfigurable Intelligent Surface With Independent Operation

    Authors: Junhui Rao, Yujie Zhang, Shiwen Tang, Zan Li, Zhaoyang Ming, Jichen Zhang, Chi Yuk Chiu, Ross Murch

    Abstract: A novel dual-band reconfigurable intelligent surface (DBI-RIS) design that combines the functionalities of millimeter-wave (mmWave) and sub-6 GHz bands within a single aperture is proposed. This design aims to bridge the gap between current single-band reconfigurable intelligent surfaces (RISs) and wireless systems utilizing sub-6 GHz and mmWave bands that require RIS with independently reconfigur…

    Submitted 5 June, 2024; originally announced June 2024.

  3. arXiv:2403.13923  [pdf, other]

    eess.SY

    Credit vs. Discount-Based Congestion Pricing: A Comparison Study

    Authors: Chih-Yuan Chiu, Devansh Jalota, Marco Pavone

    Abstract: Tolling, or congestion pricing, offers a promising traffic management policy for regulating congestion, but has also attracted criticism for placing outsized financial burdens on low-income users. Credit-based congestion pricing (CBCP) and discount-based congestion pricing (DBCP) policies, which respectively provide travel credits and toll discounts to low-income users on tolled roads, have emerge…

    Submitted 9 May, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

  4. arXiv:2310.00230  [pdf, other]

    cs.CL cs.SD eess.AS

    SLM: Bridge the thin gap between speech and text foundation models

    Authors: Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul Rubenstein, Lukas Zilka, Dian Yu, Zhong Meng, Golan Pundak, Nikhil Siddhartha, Johan Schalkwyk, Yonghui Wu

    Abstract: We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. SLM freezes the pretrained foundation models to maximally preserve their capabilities, and only trains a simple adapter with just 1% (156M) of the foundation models' parameters. This adaptation not only leads SLM to achiev…

    Submitted 29 September, 2023; originally announced October 2023.

  5. arXiv:2308.10355  [pdf, other]

    eess.AS cs.SD

    Local Periodicity-Based Beat Tracking for Expressive Classical Piano Music

    Authors: Ching-Yu Chiu, Meinard Müller, Matthew E. P. Davies, Alvin Wen-Yu Su, Yi-Hsuan Yang

    Abstract: To model the periodicity of beats, state-of-the-art beat tracking systems use "post-processing trackers" (PPTs) that rely on several empirically determined global assumptions for tempo transition, which work well for music with a steady tempo. For expressive classical music, however, these assumptions can be too rigid. With two large datasets of Western classical piano music, namely the Aligned Sc…

    Submitted 20 August, 2023; originally announced August 2023.

    Comments: Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing (July 2023)

  6. arXiv:2307.07650  [pdf, ps, other]

    cs.LG cs.AI eess.SP

    SALC: Skeleton-Assisted Learning-Based Clustering for Time-Varying Indoor Localization

    Authors: An-Hung Hsiao, Li-Hsiang Shen, Chen-Yi Chang, Chun-Jie Chiu, Kai-Ten Feng

    Abstract: Wireless indoor localization has attracted a significant amount of attention in recent years. Using received signal strength (RSS) obtained from WiFi access points (APs) to establish a fingerprinting database is a widely utilized method in indoor localization. However, the time-variant problem for indoor positioning systems is not well investigated in the existing literature. Compared to conventional…

    Submitted 14 July, 2023; originally announced July 2023.

  7. arXiv:2307.05466  [pdf, other]

    eess.SY

    Dynamic Tolling in Arc-based Traffic Assignment Models

    Authors: Chih-Yuan Chiu, Chinmay Maheshwari, Pan-Yang Su, Shankar Sastry

    Abstract: Tolling in traffic networks offers a popular measure to minimize overall congestion. Existing toll designs primarily focus on congestion in route-based traffic assignment models (TAMs), in which travelers make a single route selection from their source to destination. However, these models do not reflect real-world traveler decisions because they preclude deviations from a chosen route, and becaus…

    Submitted 24 October, 2023; v1 submitted 11 July, 2023; originally announced July 2023.

    Comments: 18 pages, 4 figures, 2 tables. arXiv admin note: text overlap with arXiv:2304.04705

  8. arXiv:2306.08131  [pdf, other]

    eess.AS cs.SD

    Efficient Adapters for Giant Speech Models

    Authors: Nanxin Chen, Izhak Shafran, Yu Zhang, Chung-Cheng Chiu, Hagen Soltau, James Qin, Yonghui Wu

    Abstract: Large pre-trained speech models are widely used as the de facto paradigm, especially in scenarios where only a limited amount of labeled data is available. However, finetuning all parameters of the self-supervised learned model can be computationally expensive, and becomes infeasible as the size of the model and the number of downstream tasks scale. In this paper, we propose a novel approach c…

    Submitted 13 June, 2023; originally announced June 2023.

  9. arXiv:2304.04705  [pdf, other]

    eess.SY

    Arc-based Traffic Assignment: Equilibrium Characterization and Learning

    Authors: Chih-Yuan Chiu, Chinmay Maheshwari, Pan-Yang Su, Shankar Sastry

    Abstract: Arc-based traffic assignment models (TAMs) are a popular framework for modeling traffic network congestion generated by self-interested travelers who sequentially select arcs based on their perceived latency on the network. However, existing arc-based TAMs either assign travelers to cyclic paths, or do not extend to networks with bi-directional arcs (or edges) between nodes. To overcome these diff…

    Submitted 4 May, 2024; v1 submitted 10 April, 2023; originally announced April 2023.

    Comments: 17 pages, 3 figures, 2 tables

  10. arXiv:2304.01945  [pdf, other]

    eess.SY

    Scenario-Game ADMM: A Parallelized Scenario-Based Solver for Stochastic Noncooperative Games

    Authors: Jingqi Li, Chih-Yuan Chiu, Lasse Peters, Fernando Palafox, Mustafa Karabag, Javier Alonso-Mora, Somayeh Sojoudi, Claire Tomlin, David Fridovich-Keil

    Abstract: Decision-making in multi-player games can be extremely challenging, particularly under uncertainty. In this work, we propose a new sample-based approximation to a class of stochastic, general-sum, pure Nash games, where each player has an expected-value objective and a set of chance constraints. This new approximation scheme inherits the accuracy of objective approximation from the established sam…

    Submitted 13 September, 2023; v1 submitted 4 April, 2023; originally announced April 2023.

  11. arXiv:2303.01037  [pdf, other]

    cs.CL cs.SD eess.AS

    Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

    Authors: Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, et al. (2 additional authors not shown)

    Abstract: We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quant…

    Submitted 24 September, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: 20 pages, 7 figures, 8 tables

  12. arXiv:2301.01398  [pdf, other]

    cs.MA cs.RO eess.SY

    Cost Inference for Feedback Dynamic Games from Noisy Partial State Observations and Incomplete Trajectories

    Authors: Jingqi Li, Chih-Yuan Chiu, Lasse Peters, Somayeh Sojoudi, Claire Tomlin, David Fridovich-Keil

    Abstract: In multi-agent dynamic games, the Nash equilibrium state trajectory of each agent is determined by its cost function and the information pattern of the game. However, the cost and trajectory of each agent may be unavailable to the other agents. Prior work on using partial observations to infer the costs in dynamic games assumes an open-loop information pattern. In this work, we demonstrate that th…

    Submitted 3 January, 2023; originally announced January 2023.

    Comments: Accepted by AAMAS 2023. This is a preprint version

  13. arXiv:2211.16596  [pdf, other]

    stat.ML cs.LG eess.SY

    Towards Dynamic Causal Discovery with Rare Events: A Nonparametric Conditional Independence Test

    Authors: Chih-Yuan Chiu, Kshitij Kulkarni, Shankar Sastry

    Abstract: Causal phenomena associated with rare events occur across a wide range of engineering problems, such as risk-sensitive safety analysis, accident analysis and prevention, and extreme value theory. However, current methods for causal discovery are often unable to uncover causal links, between random variables in a dynamic setting, that manifest only when the variables first experience low-probabilit…

    Submitted 17 July, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

  14. arXiv:2211.00115  [pdf, other]

    cs.CL cs.SD eess.AS

    Textless Direct Speech-to-Speech Translation with Discrete Speech Representation

    Authors: Xinjian Li, Ye Jia, Chung-Cheng Chiu

    Abstract: Research on speech-to-speech translation (S2ST) has progressed rapidly in recent years. Many end-to-end systems have been proposed and show advantages over conventional cascade systems, which are often composed of recognition, translation and synthesis sub-systems. However, most of the end-to-end systems still rely on intermediate textual supervision during training, which makes it infeasible to w…

    Submitted 31 October, 2022; originally announced November 2022.

  15. An Analysis Method for Metric-Level Switching in Beat Tracking

    Authors: Ching-Yu Chiu, Meinard Müller, Matthew E. P. Davies, Alvin Wen-Yu Su, Yi-Hsuan Yang

    Abstract: For expressive music, the tempo may change over time, posing challenges to tracking the beats by an automatic model. The model may first tap to the correct tempo, but then may fail to adapt to a tempo change, or switch between several incorrect but perceptually plausible ones (e.g., half- or double-tempo). Existing evaluation metrics for beat tracking do not reflect such behaviors, as they typical…

    Submitted 13 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE Signal Processing Letters (Oct. 2022)

  16. arXiv:2210.06007  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    JukeDrummer: Conditional Beat-aware Audio-domain Drum Accompaniment Generation via Transformer VQ-VAE

    Authors: Yueh-Kao Wu, Ching-Yu Chiu, Yi-Hsuan Yang

    Abstract: This paper proposes a model that generates a drum track in the audio domain to play along to a user-provided drum-free recording. Specifically, using paired data of drumless tracks and the corresponding human-made drum tracks, we train a Transformer model to improvise the drum part of an unseen drumless recording. We combine two approaches to encode the input audio. First, we train a vector-quanti…

    Submitted 31 October, 2022; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: Accepted at ISMIR 2022

  17. arXiv:2207.05043  [pdf, other]

    cs.RO eess.SY

    SLAM Backends with Objects in Motion: A Unifying Framework and Tutorial

    Authors: Chih-Yuan Chiu

    Abstract: Simultaneous Localization and Mapping (SLAM) algorithms are frequently deployed to support a wide range of robotics applications, such as autonomous navigation in unknown environments, and scene mapping in virtual reality. Many of these applications require autonomous agents to perform SLAM in highly dynamic scenes. To this end, this tutorial extends a recently introduced, unifying optimization-ba…

    Submitted 27 February, 2023; v1 submitted 11 July, 2022; originally announced July 2022.

  18. arXiv:2205.12501  [pdf, ps, other]

    eess.SP cs.IT

    Using Loaded N-port Structures to Achieve the Continuous-Space Electromagnetic Channel Capacity Bound

    Authors: Zixiang Han, Shanpu Shen, Yujie Zhang, Shiwen Tang, Chi-Yuk Chiu, Ross Murch

    Abstract: A method for achieving the continuous-space electromagnetic channel capacity bound using loaded N-port structures is described. It is relevant for the design of compact multiple-input multiple-output (MIMO) antennas that can achieve channel capacity bounds when constrained by size. The method is not restricted to a specific antenna configuration and a closed-form expression for the channel capacit…

    Submitted 25 May, 2022; originally announced May 2022.

  19. arXiv:2205.08014  [pdf, ps, other]

    eess.AS cs.SD

    Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data

    Authors: Alëna Aksënova, Zhehuai Chen, Chung-Cheng Chiu, Daan van Esch, Pavel Golik, Wei Han, Levi King, Bhuvana Ramabhadran, Andrew Rosenberg, Suzan Schwartz, Gary Wang

    Abstract: Building inclusive speech recognition systems is a crucial step towards developing technologies that speakers of all language varieties can use. Therefore, ASR systems must work for everybody independently of the way they speak. To accomplish this goal, there should be available data sets representing language varieties, and also an understanding of model configuration that is the most helpful in…

    Submitted 16 May, 2022; originally announced May 2022.

    Comments: 5 pages, 3 tables

  20. arXiv:2203.16690  [pdf, other]

    cs.RO eess.SY

    GTP-SLAM: Game-Theoretic Priors for Simultaneous Localization and Mapping in Multi-Agent Scenarios

    Authors: Chih-Yuan Chiu, David Fridovich-Keil

    Abstract: Robots operating in multi-player settings must simultaneously model the environment and the behavior of human or robotic agents who share that environment. This modeling is often approached using Simultaneous Localization and Mapping (SLAM); however, SLAM algorithms usually neglect multi-player interactions. In contrast, the motion planning literature often uses dynamic game theory to explicitly m…

    Submitted 8 August, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: 6 pages, 3 figures

  21. arXiv:2202.05267  [pdf, other]

    physics.med-ph cs.CV eess.IV

    On Real-time Image Reconstruction with Neural Networks for MRI-guided Radiotherapy

    Authors: David E. J. Waddington, Nicholas Hindley, Neha Koonjoo, Christopher Chiu, Tess Reynolds, Paul Z. Y. Liu, Bo Zhu, Danyal Bhutto, Chiara Paganelli, Paul J. Keall, Matthew S. Rosen

    Abstract: MRI-guidance techniques that dynamically adapt radiation beams to follow tumor motion in real-time will lead to more accurate cancer treatments and reduced collateral healthy tissue damage. The gold-standard for reconstruction of undersampled MR data is compressed sensing (CS) which is computationally slow and limits the rate that images can be available for real-time adaptation. Here, we demonstr…

    Submitted 18 May, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

    Comments: 12 pages, 6 figures, 1 table. v2 has a typo in eqn 1 corrected and references added to the discussion

  22. arXiv:2202.01855  [pdf, other]

    cs.CL cs.SD eess.AS

    Self-supervised Learning with Random-projection Quantizer for Speech Recognition

    Authors: Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, Yonghui Wu

    Abstract: We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict the masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular the quantizer projects speech inputs with a randomly initialized matrix, and does a nearest-neighbor lookup in a randomly-initialized codebook. Neither…

    Submitted 29 June, 2022; v1 submitted 3 February, 2022; originally announced February 2022.

    Comments: ICML 2022

  23. arXiv:2112.05921  [pdf, other]

    eess.SY

    Simultaneous Localization and Mapping: Through the Lens of Nonlinear Optimization

    Authors: Amay Saxena, Chih-Yuan Chiu, Joseph Menke, Ritika Shrivastava, Shankar Sastry

    Abstract: Simultaneous Localization and Mapping (SLAM) algorithms perform visual-inertial estimation via filtering or batch optimization methods. Empirical evidence suggests that filtering algorithms are computationally faster, while optimization methods are more accurate. This work presents an optimization-based framework that unifies these approaches, and allows users to flexibly implement different desig…

    Submitted 3 August, 2022; v1 submitted 11 December, 2021; originally announced December 2021.

    Comments: 22 pages

  24. arXiv:2111.00127  [pdf, other]

    eess.AS cs.SD

    Cross-attention conformer for context modeling in speech enhancement for ASR

    Authors: Arun Narayanan, Chung-Cheng Chiu, Tom O'Malley, Quan Wang, Yanzhang He

    Abstract: This work introduces cross-attention conformer, an attention-based architecture for context modeling in speech enhancement. Given that the context information can often be sequential, and of a different length than the audio that is to be enhanced, we make use of cross-attention to summarize and merge contextual information with input features. Building upon the recently proposed conformer mode…

    Submitted 29 October, 2021; originally announced November 2021.

    Comments: Will appear in IEEE-ASRU 2021

  25. arXiv:2109.13226  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

    Authors: Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, et al. (1 additional author not shown)

    Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled da…

    Submitted 21 July, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

    Comments: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated

  26. arXiv:2108.06209  [pdf, other]

    cs.LG cs.SD eess.AS

    W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training

    Authors: Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, Yonghui Wu

    Abstract: Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, we propose w2v-BERT that explores MLM for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM, where the former trains the model to discretize input continuous speech signals into a finite set of discriminative speech tokens,…

    Submitted 13 September, 2021; v1 submitted 7 August, 2021; originally announced August 2021.

  27. arXiv:2106.08703  [pdf, other]

    cs.SD cs.LG eess.AS

    Source Separation-based Data Augmentation for Improved Joint Beat and Downbeat Tracking

    Authors: Ching-Yu Chiu, Joann Ching, Wen-Yi Hsiao, Yu-Hua Chen, Alvin Wen-Yu Su, Yi-Hsuan Yang

    Abstract: Due to advances in deep learning, the performance of automatic beat and downbeat tracking in musical audio signals has seen great improvement in recent years. In training such deep learning based models, data augmentation has been found to be an important technique. However, existing data augmentation methods for this task mainly target balancing the distribution of the training data with respect to…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted to European Signal Processing Conference (EUSIPCO 2021)

  28. arXiv:2106.08685  [pdf, other]

    cs.SD cs.LG eess.AS

    Drum-Aware Ensemble Architecture for Improved Joint Musical Beat and Downbeat Tracking

    Authors: Ching-Yu Chiu, Alvin Wen-Yu Su, Yi-Hsuan Yang

    Abstract: This paper presents a novel system architecture that integrates blind source separation with joint beat and downbeat tracking in musical audio signals. The source separation module segregates the percussive and non-percussive components of the input signal, over which beat and downbeat tracking are performed separately and then the results are aggregated with a learnable fusion mechanism. This way…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted to IEEE Signal Processing Letters (May 2021)

  29. arXiv:2105.04652  [pdf, ps, other]

    math.OC eess.SY

    Stabilizability of Vector Systems with Uniform Actuation Unpredictability

    Authors: Rahul Arya, Chih-Yuan Chiu, Gireeja Ranade

    Abstract: This paper explores the fundamental limits of a simple system, inspired by the intermittent Kalman filtering model, where the actuation direction is drawn uniformly from the unit hypersphere. The model allows us to focus on a fundamental tension in the control of underactuated vector systems -- the need to balance the growth of the system in different dimensions. We characterize the stabilizabil…

    Submitted 17 May, 2021; v1 submitted 10 May, 2021; originally announced May 2021.

  30. arXiv:2104.14346  [pdf, other]

    cs.CL cs.SD eess.AS

    Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models

    Authors: Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao

    Abstract: Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real-time. Their minimal latency makes them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal with no future context and suffer from higher word error rates (WER). To improve streaming mode…

    Submitted 25 April, 2021; originally announced April 2021.

  31. arXiv:2104.03416  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Pushing the Limits of Non-Autoregressive Speech Recognition

    Authors: Edwin G. Ng, Chung-Cheng Chiu, Yu Zhang, William Chan

    Abstract: We combine recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition. We push the limits of non-autoregressive state-of-the-art results for multiple datasets: LibriSpeech, Fisher+Switchboard and Wall Street Journal. Key to our recipe, we leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training. We achieve…

    Submitted 11 September, 2021; v1 submitted 7 April, 2021; originally announced April 2021.

    Comments: Proceedings of INTERSPEECH

  32. arXiv:2011.10798  [pdf, other]

    eess.AS cs.SD

    A Better and Faster End-to-End Model for Streaming ASR

    Authors: Bo Li, Anmol Gulati, Jiahui Yu, Tara N. Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, Wei Han, Qiao Liang, Yu Zhang, Trevor Strohman, Yonghui Wu

    Abstract: End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the model still tends to delay the predictions towards the end and thus has much higher partial latency compared to a conventional ASR model. To address this i…

    Submitted 11 February, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

    Comments: Accepted in ICASSP 2021

  33. arXiv:2011.06110  [pdf, other]

    eess.AS cs.SD

    Efficient Knowledge Distillation for RNN-Transducer Models

    Authors: Sankaran Panchapagesan, Daniel S. Park, Chung-Cheng Chiu, Yuan Shangguan, Qiao Liang, Alexander Gruenstein

    Abstract: Knowledge Distillation is an effective method of transferring knowledge from a large model to a smaller model. Distillation can be viewed as a type of model compression, and has played an important role for on-device ASR applications. In this paper, we develop a distillation method for RNN-Transducer (RNN-T) models, a popular end-to-end neural network architecture for streaming speech recognition.…

    Submitted 11 November, 2020; originally announced November 2020.

    Comments: 5 pages, 1 figure, 2 tables; submitted to ICASSP 2021

  34. arXiv:2011.04815  [pdf, other]

    cs.RO eess.SY

    Encoding Defensive Driving as a Dynamic Nash Game

    Authors: Chih-Yuan Chiu, David Fridovich-Keil, Claire J. Tomlin

    Abstract: Robots deployed in real-world environments should operate safely in a robust manner. In scenarios where an "ego" agent navigates in an environment with multiple other "non-ego" agents, two modes of safety are commonly proposed -- adversarial robustness and probabilistic constraint satisfaction. However, while the former is generally computationally intractable and leads to overconservative solutio…

    Submitted 30 March, 2021; v1 submitted 9 November, 2020; originally announced November 2020.

    Comments: Accepted to ICRA 2021

  35. arXiv:2011.03800  [pdf, other]

    eess.IV

    Reducing latency and bandwidth for video streaming using keypoint extraction and digital puppetry

    Authors: Roshan Prabhakar, Shubham Chandak, Carina Chiu, Renee Liang, Huong Nguyen, Kedar Tatwawadi, Tsachy Weissman

    Abstract: COVID-19 has made video communication one of the most important modes of information exchange. While extensive research has been conducted on the optimization of the video streaming pipeline, in particular the development of novel video codecs, further improvement in the video quality and latency is required, especially under poor network conditions. This paper proposes an alternative to the conve…

    Submitted 8 January, 2021; v1 submitted 7 November, 2020; originally announced November 2020.

    Comments: 10 pages, 5 figures, 1-page summary to be published at DCC 2021. Revision: added references

  36. arXiv:2010.14606  [pdf, other]

    eess.AS cs.CL cs.SD

    Cascaded encoders for unifying streaming and non-streaming ASR

    Authors: Arun Narayanan, Tara N. Sainath, Ruoming Pang, Jiahui Yu, Chung-Cheng Chiu, Rohit Prabhavalkar, Ehsan Variani, Trevor Strohman

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) models, by now, have shown competitive performance on several benchmarks. These models are structured to either operate in streaming or non-streaming mode. This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously. The proposed model consists of streaming and non-streaming encoder…

    Submitted 27 October, 2020; originally announced October 2020.

  37. arXiv:2010.12096  [pdf, other]

    cs.SD cs.CL eess.AS

    Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data

    Authors: Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao

    Abstract: Streaming end-to-end automatic speech recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, compared to their non-streaming counterparts. Consequently, streaming models usually perform worse than non-streaming models. We propose a nov…

    Submitted 21 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

  38. arXiv:2010.11148  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization

    Authors: Jiahui Yu, Chung-Cheng Chiu, Bo Li, Shuo-yiin Chang, Tara N. Sainath, Yanzhang He, Arun Narayanan, Wei Han, Anmol Gulati, Yonghui Wu, Ruoming Pang

    Abstract: Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible. However, emitting fast without degrading quality, as measured by word error rate (WER), is highly challenging. Existing approaches including Early and Late Penalties and Constrained Alignments penalize emission delay by manipulating per-token or per-frame probability prediction i…

    Submitted 3 February, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

    Comments: Accepted in ICASSP 2021

  39. arXiv:2010.10504  [pdf, other]

    eess.AS cs.LG cs.SD

    Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

    Authors: Yu Zhang, James Qin, Daniel S. Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc V. Le, Yonghui Wu

    Abstract: We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech utilizing the unlabeled audio of the Libri-Light dataset. More precisely, we carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training. By doing so, we are able to achieve word-e…

    Submitted 20 July, 2022; v1 submitted 20 October, 2020; originally announced October 2020.

    Comments: 11 pages, 3 figures, 5 tables. Accepted to NeurIPS SAS 2020 Workshop; v2: minor errors corrected

  40. arXiv:2010.06030  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling

    Authors: Jiahui Yu, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N. Sainath, Yonghui Wu, Ruoming Pang

    Abstract: Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech reco…

    Submitted 27 January, 2021; v1 submitted 12 October, 2020; originally announced October 2020.

    Comments: Accepted in ICLR 2021

  41. arXiv:2008.13093  [pdf, other]

    eess.AS cs.CL

    Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition

    Authors: Wei Li, James Qin, Chung-Cheng Chiu, Ruoming Pang, Yanzhang He

    Abstract: Recent advances of end-to-end models have outperformed conventional models through employing a two-pass model. The two-pass model provides better speed-quality trade-offs for on-device speech recognition, where a 1st-pass model generates hypotheses in a streaming fashion, and a 2nd-pass model re-scores the hypotheses with full audio sequence context. The 2nd-pass model plays a key role in the qual…

    Submitted 2 September, 2020; v1 submitted 30 August, 2020; originally announced August 2020.

    Comments: Proceedings of Interspeech, 2020

  42. Adaptive multi-channel event segmentation and feature extraction for monitoring health outcomes

    Authors: Xichen She, Yaya Zhai, Ricardo Henao, Christopher W. Woods, Christopher Chiu, Geoffrey S. Ginsburg, Peter X. K. Song, Alfred O. Hero

    Abstract: $\textbf{Objective}$: To develop a multi-channel device event segmentation and feature extraction algorithm that is robust to changes in data distribution. $\textbf{Methods}…

    Submitted 19 November, 2020; v1 submitted 20 August, 2020; originally announced August 2020.

    Journal ref: IEEE Transactions on Biomedical Engineering, Nov. 17 2020

  43. arXiv:2008.02480  [pdf, other]

    eess.AS cs.LG cs.SD

    Mixing-Specific Data Augmentation Techniques for Improved Blind Violin/Piano Source Separation

    Authors: Ching-Yu Chiu, Wen-Yi Hsiao, Yin-Cheng Yeh, Yi-Hsuan Yang, Alvin Wen-Yu Su

    Abstract: Blind music source separation has been a popular and active subject of research in both the music information retrieval and signal processing communities. To counter the lack of available multi-track data for supervised model training, a data augmentation method that creates artificial mixtures by combining tracks from different songs has been shown useful in recent works. Following this light, we…

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: Accepted to IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP 2020)

  44. Improved Noisy Student Training for Automatic Speech Recognition

    Authors: Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, Quoc V. Le

    Abstract: Recently, a semi-supervised learning method known as "noisy student training" has been shown to improve image classification performance of deep networks significantly. Noisy student training is an iterative self-training method that leverages augmentation to improve network performance. In this work, we adapt and improve noisy student training for automatic speech recognition, employing (adaptive…

    Submitted 29 October, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: 5 pages, 5 figures, 4 tables; v2: minor revisions, reference added

    Journal ref: Proc. Interspeech 2020, 2817-2821

  45. arXiv:2005.08100  [pdf, other]

    eess.AS cs.LG cs.SD

    Conformer: Convolution-augmented Transformer for Speech Recognition

    Authors: Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang

    Abstract: Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolution ne…

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  46. arXiv:2005.07144  [pdf, other]

    eess.SY

    Eyes-Closed Safety Kernels: Safety for Autonomous Systems Under Loss of Observability

    Authors: Forrest Laine, Chiu-Yuan Chiu, Claire Tomlin

    Abstract: A framework is presented for handling a potential loss of observability of a dynamical system in a provably-safe way. Inspired by the fragility of data-driven perception systems used by autonomous vehicles, we formulate the problem that arises when a sensing modality fails or is found to be untrustworthy during autonomous operation. We cast this problem as a differential game played between the dy…

    Submitted 15 May, 2020; v1 submitted 14 May, 2020; originally announced May 2020.

    Comments: Accepted at Robotics: Science and Systems 2020, 9 pages

  47. arXiv:2005.03271  [pdf, other]

    eess.AS cs.CL

    RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions

    Authors: Chung-Cheng Chiu, Arun Narayanan, Wei Han, Rohit Prabhavalkar, Yu Zhang, Navdeep Jaitly, Ruoming Pang, Tara N. Sainath, Patrick Nguyen, Liangliang Cao, Yonghui Wu

    Abstract: In recent years, all-neural end-to-end approaches have obtained state-of-the-art results on several challenging automatic speech recognition (ASR) tasks. However, most existing works focus on building ASR models where train and test data are drawn from the same domain. This results in poor generalization characteristics on mismatched-domains: e.g., end-to-end models trained on short segments perfo…

    Submitted 23 December, 2020; v1 submitted 7 May, 2020; originally announced May 2020.

    Comments: SLT camera-ready version

  48. arXiv:2005.03191  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

    Authors: Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu

    Abstract: Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into…

    Submitted 15 May, 2020; v1 submitted 6 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  49. arXiv:1912.05533  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    SpecAugment on Large Scale Datasets

    Authors: Daniel S. Park, Yu Zhang, Chung-Cheng Chiu, Youzheng Chen, Bo Li, William Chan, Quoc V. Le, Yonghui Wu

    Abstract: Recently, SpecAugment, an augmentation scheme for automatic speech recognition that acts directly on the spectrogram of input utterances, has shown to be highly effective in enhancing the performance of end-to-end networks on public datasets. In this paper, we demonstrate its effectiveness on tasks with large scale datasets by investigating its application to the Google Multidomain Dataset (Naraya…

    Submitted 11 December, 2019; originally announced December 2019.

    Comments: 5 pages, 3 tables; submitted to ICASSP 2020

  50. arXiv:1911.09762  [pdf, other]

    cs.CL cs.LG eess.AS

    Speech Sentiment Analysis via Pre-trained Features from End-to-end ASR Models

    Authors: Zhiyun Lu, Liangliang Cao, Yu Zhang, Chung-Cheng Chiu, James Fan

    Abstract: In this paper, we propose to use pre-trained features from end-to-end ASR models to solve speech sentiment analysis as a down-stream task. We show that end-to-end ASR features, which integrate both acoustic and text information from speech, achieve promising results. We use RNN with self-attention as the sentiment classifier, which also provides an easy visualization through attention weights to h…

    Submitted 4 March, 2020; v1 submitted 21 November, 2019; originally announced November 2019.