Skip to main content

Showing 1–19 of 19 results for author: Chandra, V

Searching in archive eess. Search in all archives.
.
  1. arXiv:2402.13076  [pdf, other

    cs.SD cs.LG eess.AS

    Not All Weights Are Created Equal: Enhancing Energy Efficiency in On-Device Streaming Speech Recognition

    Authors: Yang Li, Yuan Shangguan, Yuhao Wang, Liangzhen Lai, Ernie Chang, Changsheng Zhao, Yangyang Shi, Vikas Chandra

    Abstract: Power consumption plays an important role in on-device streaming speech recognition, as it has a direct impact on the user experience. This study delves into how weight parameters in speech recognition models influence the overall power consumption of these models. We discovered that the impact of weight parameters on power consumption varies, influenced by factors including how often they are inv… ▽ More

    Submitted 20 February, 2024; originally announced February 2024.

  2. arXiv:2311.00897  [pdf, other

    cs.SD cs.CL eess.AS

    On The Open Prompt Challenge In Conditional Audio Generation

    Authors: Ernie Chang, Sidd Srinivasan, Mahi Luthra, Pin-Jie Lin, Varun Nagaraja, Forrest Iandola, Zechun Liu, Zhaoheng Ni, Changsheng Zhao, Yangyang Shi, Vikas Chandra

    Abstract: Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text. However, commercializing audio generation is challenging as user-input prompts are often under-specified when compared to text descriptions used to train TTA models. In this work, we treat TTA models as a ``blackbox'' and address the user prompt challenge with two ke… ▽ More

    Submitted 1 November, 2023; originally announced November 2023.

    Comments: 5 pages, 3 figures, 4 tables

  3. arXiv:2311.00895  [pdf, other

    cs.SD cs.CL eess.AS

    In-Context Prompt Editing For Conditional Audio Generation

    Authors: Ernie Chang, Pin-Jie Lin, Yang Li, Sidd Srinivasan, Gael Le Lan, David Kant, Yangyang Shi, Forrest Iandola, Vikas Chandra

    Abstract: Distributional shift is a central challenge in the deployment of machine learning models as they can be ill-equipped for real-world data. This is particularly evident in text-to-audio generation where the encoded representations are easily undermined by unseen prompts, which leads to the degradation of generated audio -- the limited set of the text-audio pairs remains inadequate for conditional au… ▽ More

    Submitted 1 November, 2023; originally announced November 2023.

    Comments: 5 pages, 3 figures, 2 tables

  4. arXiv:2309.10795  [pdf, other

    eess.AS

    Exploring Speech Enhancement for Low-resource Speech Synthesis

    Authors: Zhaoheng Ni, Sravya Popuri, Ning Dong, Kohei Saijo, Xiaohui Zhang, Gael Le Lan, Yangyang Shi, Vikas Chandra, Changhan Wang

    Abstract: High-quality and intelligible speech is essential to text-to-speech (TTS) model training, however, obtaining high-quality data for low-resource languages is challenging and expensive. Applying speech enhancement on Automatic Speech Recognition (ASR) corpus mitigates the issue by augmenting the training data, while how the nonlinear speech distortion brought by speech enhancement models affects TTS… ▽ More

    Submitted 19 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  5. arXiv:2309.10537  [pdf, other

    eess.AS cs.MM cs.SD

    FoleyGen: Visually-Guided Audio Generation

    Authors: Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra

    Abstract: Recent advancements in audio generation have been spurred by the evolution of large-scale deep learning models and expansive datasets. However, the task of video-to-audio (V2A) generation continues to be a challenge, principally because of the intricate relationship between the high-dimensional visual and auditory data, and the challenges associated with temporal synchronization. In this study, we… ▽ More

    Submitted 19 September, 2023; originally announced September 2023.

  6. arXiv:2309.08804  [pdf, other

    eess.AS cs.SD

    Stack-and-Delay: a new codebook pattern for music generation

    Authors: Gael Le Lan, Varun Nagaraja, Ernie Chang, David Kant, Zhaoheng Ni, Yangyang Shi, Forrest Iandola, Vikas Chandra

    Abstract: In language modeling based music generation, a generated waveform is represented by a sequence of hierarchical token stacks that can be decoded either in an auto-regressive manner or in parallel, depending on the codebook patterns. In particular, flattening the codebooks represents the highest quality decoding strategy, while being notoriously slow. To this end, we propose a novel stack-and-delay… ▽ More

    Submitted 15 September, 2023; originally announced September 2023.

  7. arXiv:2309.08773  [pdf, other

    cs.SD cs.AI cs.LG eess.AS

    Enhance audio generation controllability through representation similarity regularization

    Authors: Yangyang Shi, Gael Le Lan, Varun Nagaraja, Zhaoheng Ni, Xinhao Mei, Ernie Chang, Forrest Iandola, Yang Liu, Vikas Chandra

    Abstract: This paper presents an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training. In the context of language model-based audio generation, the model leverages input from both textual and audio token representations to predict subsequent audio tokens. However, the current configuration lacks explicit regula… ▽ More

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: 5 pages

  8. arXiv:2309.07988  [pdf, other

    cs.LG cs.AR cs.SD eess.AS

    Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition

    Authors: Yang Li, Liangzhen Lai, Yuan Shangguan, Forrest N. Iandola, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra

    Abstract: Transformer-based models excel in speech recognition. Existing efforts to optimize Transformer inference, typically for long-context applications, center on simplifying attention score calculations. However, streaming speech recognition models usually process a limited number of tokens each time, making attention score calculation less of a bottleneck. Instead, the bottleneck lies in the linear pr… ▽ More

    Submitted 18 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

  9. arXiv:2309.01947  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models

    Authors: Yuan Shangguan, Haichuan Yang, Danni Li, Chunyang Wu, Yassir Fathullah, Dilin Wang, Ayushi Dalmia, Raghuraman Krishnamoorthi, Ozlem Kalinli, Junteng Jia, Jay Mahadeokar, Xin Lei, Mike Seltzer, Vikas Chandra

    Abstract: Automatic Speech Recognition (ASR) models need to be optimized for specific hardware before they can be deployed on devices. This can be done by tuning the model's hyperparameters or exploring variations in its architecture. Re-training and re-validating models after making these changes can be a resource-intensive task. This paper presents TODM (Train Once Deploy Many), a new approach to efficien… ▽ More

    Submitted 27 November, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: Meta AI; Submitted to ICASSP 2024

  10. arXiv:2211.04635  [pdf, other

    cs.LG cs.AI eess.AS

    LiCo-Net: Linearized Convolution Network for Hardware-efficient Keyword Spotting

    Authors: Haichuan Yang, Zhaojun Yang, Li Wan, Biqiao Zhang, Yangyang Shi, Yiteng Huang, Ivaylo Enchev, Limin Tang, Raziel Alvarez, Ming Sun, Xin Lei, Raghuraman Krishnamoorthi, Vikas Chandra

    Abstract: This paper proposes a hardware-efficient architecture, Linearized Convolution Network (LiCo-Net) for keyword spotting. It is optimized specifically for low-power processor units like microcontrollers. ML operators exhibit heterogeneous efficiency profiles on power-efficient hardware. Given the exact theoretical computation cost, int8 operators are more computation-effective than float operators, a… ▽ More

    Submitted 8 November, 2022; originally announced November 2022.

  11. arXiv:2203.15773  [pdf, other

    cs.CL cs.SD eess.AS

    Streaming parallel transducer beam search with fast-slow cascaded encoders

    Authors: Jay Mahadeokar, Yangyang Shi, Ke Li, Duc Le, Jiedan Zhu, Vikas Chandra, Ozlem Kalinli, Michael L Seltzer

    Abstract: Streaming ASR with strict latency constraints is required in many speech recognition applications. In order to achieve the required latency, streaming ASR models sacrifice accuracy compared to non-streaming ASR models due to lack of future input context. Previous research has shown that streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders.… ▽ More

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: 5 pages, 2 figures, Interspeech 2022 submission

  12. arXiv:2110.08352  [pdf, other

    cs.SD cs.CL eess.AS

    Omni-sparsity DNN: Fast Sparsity Optimization for On-Device Streaming E2E ASR via Supernet

    Authors: Haichuan Yang, Yuan Shangguan, Dilin Wang, Meng Li, Pierce Chuang, Xiaohui Zhang, Ganesh Venkatesh, Ozlem Kalinli, Vikas Chandra

    Abstract: From wearables to powerful smart devices, modern automatic speech recognition (ASR) models run on a variety of edge devices with different computational budgets. To navigate the Pareto front of model accuracy vs model size, researchers are trapped in a dilemma of optimizing model accuracy by training and fine-tuning models for each individual edge device while kee** the training GPU-hours tracta… ▽ More

    Submitted 20 July, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

  13. arXiv:2106.08960  [pdf, other

    cs.CL cs.SD eess.AS

    Collaborative Training of Acoustic Encoders for Speech Recognition

    Authors: Varun Nagaraja, Yangyang Shi, Ganesh Venkatesh, Ozlem Kalinli, Michael L. Seltzer, Vikas Chandra

    Abstract: On-device speech recognition requires training models of different sizes for deploying on devices with various computational budgets. When building such different models, we can benefit from training them jointly to take advantage of the knowledge shared between them. Joint training is also efficient since it reduces the redundancy in the training procedure's data handling operations. We propose a… ▽ More

    Submitted 13 July, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

    Comments: INTERSPEECH 2021

  14. arXiv:2103.01524  [pdf, other

    eess.IV cs.CV cs.LG

    Feature-Align Network with Knowledge Distillation for Efficient Denoising

    Authors: Lucas D. Young, Fitsum A. Reda, Rakesh Ranjan, Jon Morton, Jun Hu, Yazhu Ling, Xiaoyu Xiang, David Liu, Vikas Chandra

    Abstract: We propose an efficient neural network for RAW image denoising. Although neural network-based denoising has been extensively studied for image restoration, little attention has been given to efficient denoising for compute limited and power sensitive devices, such as smartphones and smartwatches. In this paper, we present a novel architecture and a suite of training techniques for high quality den… ▽ More

    Submitted 17 March, 2021; v1 submitted 2 March, 2021; originally announced March 2021.

    MSC Class: 94A08 (Primary) 68T07; 65D19 (Secondary) ACM Class: I.4.5; I.2.6

  15. arXiv:2102.11531  [pdf, other

    cs.SD cs.CL eess.AS

    Memory-efficient Speech Recognition on Smart Devices

    Authors: Ganesh Venkatesh, Alagappan Valliappan, Jay Mahadeokar, Yuan Shangguan, Christian Fuegen, Michael L. Seltzer, Vikas Chandra

    Abstract: Recurrent transducer models have emerged as a promising solution for speech recognition on the current and next generation smart devices. The transducer models provide competitive accuracy within a reasonable memory footprint alleviating the memory capacity constraints in these devices. However, these models access parameters from off-chip memory for every input time step which adversely effects d… ▽ More

    Submitted 23 February, 2021; originally announced February 2021.

    Journal ref: ICASSP 2021

  16. arXiv:2012.02228  [pdf, other

    cs.CV cs.LG eess.IV

    EVRNet: Efficient Video Restoration on Edge Devices

    Authors: Sachin Mehta, Amit Kumar, Fitsum Reda, Varun Nasery, Vikram Mulukutla, Rakesh Ranjan, Vikas Chandra

    Abstract: Video transmission applications (e.g., conferencing) are gaining momentum, especially in times of global health pandemic. Video signals are transmitted over lossy channels, resulting in low-quality received signals. To restore videos on recipient edge devices in real-time, we introduce an efficient video restoration network, EVRNet. EVRNet efficiently allocates parameters inside the network using… ▽ More

    Submitted 3 December, 2020; originally announced December 2020.

    Comments: Technical report

  17. arXiv:2008.09916  [pdf, other

    cs.LG cs.CV eess.IV

    One Weight Bitwidth to Rule Them All

    Authors: Ting-Wu Chin, Pierce I-Jen Chuang, Vikas Chandra, Diana Marculescu

    Abstract: Weight quantization for deep ConvNets has shown promising results for applications such as image classification and semantic segmentation and is especially important for applications where memory storage is limited. However, when aiming for quantization without accuracy degradation, different tasks may end up with different bitwidths. This creates complexity for software and hardware support and t… ▽ More

    Submitted 28 August, 2020; v1 submitted 22 August, 2020; originally announced August 2020.

    Comments: Accepted at ECCV 2020 Embedded Vision Workshop (Best paper)

  18. arXiv:2002.04116  [pdf, ps, other

    cs.LG eess.SP stat.ML

    Co-Exploration of Neural Architectures and Heterogeneous ASIC Accelerator Designs Targeting Multiple Tasks

    Authors: Lei Yang, Zheyu Yan, Meng Li, Hyoukjun Kwon, Liangzhen Lai, Tushar Krishna, Vikas Chandra, Weiwen Jiang, Yiyu Shi

    Abstract: Neural Architecture Search (NAS) has demonstrated its power on various AI accelerating platforms such as Field Programmable Gate Arrays (FPGAs) and Graphic Processing Units (GPUs). However, it remains an open problem, how to integrate NAS with Application-Specific Integrated Circuits (ASICs), despite them being the most powerful AI accelerating platforms. The major bottleneck comes from the large… ▽ More

    Submitted 10 February, 2020; originally announced February 2020.

    Comments: Accepted by DAC'20

  19. arXiv:1711.07128  [pdf, other

    cs.SD cs.CL cs.LG cs.NE eess.AS

    Hello Edge: Keyword Spotting on Microcontrollers

    Authors: Yundong Zhang, Naveen Suda, Liangzhen Lai, Vikas Chandra

    Abstract: Keyword spotting (KWS) is a critical component for enabling speech based user interactions on smart devices. It requires real-time response and high accuracy for good user experience. Recently, neural networks have become an attractive choice for KWS architecture because of their superior accuracy compared to traditional speech processing algorithms. Due to its always-on nature, KWS application ha… ▽ More

    Submitted 14 February, 2018; v1 submitted 19 November, 2017; originally announced November 2017.

    Comments: Code available in github at https://github.com/ARM-software/ML-KWS-for-MCU