Skip to main content

Showing 1–12 of 12 results for author: Rybakov, O

Searching in archive cs. Search in all archives.
.
  1. arXiv:2406.02887  [pdf, other

    eess.AS cs.SD

    USM RNN-T model weights binarization

    Authors: Oleg Rybakov, Dmitriy Serdyuk, Chengjian Zheng

    Abstract: Large-scale universal speech models (USM) are already used in production. However, as the model size grows, the serving cost grows too. Serving cost of large models is dominated by model size that is why model size reduction is an important research topic. In this work we are focused on model size reduction using weights only quantization. We present the weights binarization of USM Recurrent Neura… ▽ More

    Submitted 5 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

  2. arXiv:2406.02133  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    SimulTron: On-Device Simultaneous Speech to Speech Translation

    Authors: Alex Agranovich, Eliya Nachmani, Oleg Rybakov, Yifan Ding, Ye Jia, Nadav Bar, Heiga Zen, Michelle Tadmor Ramanovich

    Abstract: Simultaneous speech-to-speech translation (S2ST) holds the promise of breaking down communication barriers and enabling fluid conversations across languages. However, achieving accurate, real-time translation through mobile devices remains a major challenge. We introduce SimulTron, a novel S2ST architecture designed to tackle this task. SimulTron is a lightweight direct S2ST model that uses the st… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.

  3. arXiv:2312.08553  [pdf, other

    eess.AS cs.SD

    USM-Lite: Quantization and Sparsity Aware Fine-tuning for Speech Recognition with Universal Speech Models

    Authors: Shao** Ding, David Qiu, David Rim, Yanzhang He, Oleg Rybakov, Bo Li, Rohit Prabhavalkar, Weiran Wang, Tara N. Sainath, Zhonglin Han, Jian Li, Amir Yazdanbakhsh, Shivani Agrawal

    Abstract: End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USM). However, deploying these massive USMs is extremely expensive due to the enormous memory usage and computational cost. Therefore, model compression is an important research topic to fit USM-based ASR under budget in real-world scenarios… ▽ More

    Submitted 16 January, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024. Preprint

  4. arXiv:2305.15536  [pdf, other

    eess.AS cs.LG

    RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models

    Authors: David Qiu, David Rim, Shao** Ding, Oleg Rybakov, Yanzhang He

    Abstract: With the rapid increase in the size of neural networks, model compression has become an important area of research. Quantization is an effective technique at decreasing the model size, memory access, and compute load of large models. Despite recent advances in quantization aware training (QAT) technique, most papers present evaluations that are focused on computer vision tasks, which have differen… ▽ More

    Submitted 24 May, 2023; originally announced May 2023.

  5. arXiv:2302.01172  [pdf, other

    cs.LG

    STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition

    Authors: Yucheng Lu, Shivani Agrawal, Suvinay Subramanian, Oleg Rybakov, Christopher De Sa, Amir Yazdanbakhsh

    Abstract: Recent innovations on hardware (e.g. Nvidia A100) have motivated learning N:M structured sparsity masks from scratch for fast model inference. However, state-of-the-art learning recipes in this regime (e.g. SR-STE) are proposed for non-adaptive optimizers like momentum SGD, while incurring non-trivial accuracy drop for Adam-trained models like attention-based LLMs. In this paper, we first demonstr… ▽ More

    Submitted 2 February, 2023; originally announced February 2023.

  6. arXiv:2210.13761  [pdf, other

    eess.AS cs.SD

    Streaming Parrotron for on-device speech-to-speech conversion

    Authors: Oleg Rybakov, Fadi Biadsy, Xia Zhang, Liyang Jiang, Phoenix Meadowlark, Shivani Agrawal

    Abstract: We present a fully on-device streaming Speech2Speech conversion model that normalizes a given input speech directly to synthesized output speech. Deploying such a model on mobile devices pose significant challenges in terms of memory footprint and computation requirements. We present a streaming-based approach to produce an acceptable delay, with minimal loss in speech conversion quality, when com… ▽ More

    Submitted 24 May, 2023; v1 submitted 25 October, 2022; originally announced October 2022.

  7. arXiv:2203.15952  [pdf, other

    eess.AS cs.LG

    4-bit Conformer with Native Quantization Aware Training for Speech Recognition

    Authors: Shao** Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov

    Abstract: Reducing the latency and model size has always been a significant research problem for live Automatic Speech Recognition (ASR) application scenarios. Along this direction, model quantization has become an increasingly popular approach to compress neural networks and reduce computation cost. Most of the existing practical ASR systems apply post-training 8-bit quantization. To achieve a higher compr… ▽ More

    Submitted 2 March, 2023; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Published at INTERSPEECH 2022

  8. A Scalable Model Specialization Framework for Training and Inference using Submodels and its Application to Speech Model Personalization

    Authors: Fadi Biadsy, Youzheng Chen, Xia Zhang, Oleg Rybakov, Andrew Rosenberg, Pedro J. Moreno

    Abstract: Model fine-tuning and adaptation have become a common approach for model specialization for downstream tasks or domains. Fine-tuning the entire model or a subset of the parameters using light-weight adaptation has shown considerable success across different specialization tasks. Fine-tuning a model for a large number of domains typically requires starting a new training job for every domain posing… ▽ More

    Submitted 13 September, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH

  9. arXiv:2203.00756  [pdf, other

    eess.AS cs.SD

    Real time spectrogram inversion on mobile phone

    Authors: Oleg Rybakov, Marco Tagliasacchi, Yunpeng Li, Liyang Jiang, Xia Zhang, Fadi Biadsy

    Abstract: We present two methods of real time magnitude spectrogram inversion: streaming Griffin Lim(GL) and streaming MelGAN. We demonstrate the impact of looking ahead on perceptual quality of MelGAN. As little as one hop size (12.5ms) of lookahead is able to significantly improve perceptual quality in comparison to its causal version. We compare streaming GL with the streaming MelGAN and show different t… ▽ More

    Submitted 24 May, 2023; v1 submitted 1 March, 2022; originally announced March 2022.

  10. Pareto-Optimal Quantized ResNet Is Mostly 4-bit

    Authors: AmirAli Abdolrashidi, Lisa Wang, Shivani Agrawal, Jonathan Malmaud, Oleg Rybakov, Chas Leichner, Lukasz Lew

    Abstract: Quantization has become a popular technique to compress neural networks and reduce compute cost, but most prior work focuses on studying quantization without changing the network size. Many real-world applications of neural networks have compute cost and memory budgets, which can be traded off with model quality by changing the number of parameters. In this work, we use ResNet as a case study to s… ▽ More

    Submitted 7 May, 2021; originally announced May 2021.

    Comments: 8 pages. Accepted at the Efficient Deep Learning for Computer Vision Workshop at CVPR 2021

  11. arXiv:2010.10677  [pdf, other

    eess.AS cs.SD

    Real-time Speech Frequency Bandwidth Extension

    Authors: Yunpeng Li, Marco Tagliasacchi, Oleg Rybakov, Victor Ungureanu, Dominik Roblek

    Abstract: In this paper we propose a lightweight model for frequency bandwidth extension of speech signals, increasing the sampling frequency from 8kHz to 16kHz while restoring the high frequency content to a level almost indistinguishable from the 16kHz ground truth. The model architecture is based on SEANet (Sound EnhAncement Network), a wave-to-wave fully convolutional model, which uses a combination of… ▽ More

    Submitted 9 February, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

  12. Streaming keyword spotting on mobile devices

    Authors: Oleg Rybakov, Natasha Kononenko, Niranjan Subrahmanya, Mirko Visontai, Stella Laurenzo

    Abstract: In this work we explore the latency and accuracy of keyword spotting (KWS) models in streaming and non-streaming modes on mobile phones. NN model conversion from non-streaming mode (model receives the whole input sequence and then returns the classification result) to streaming mode (model receives portion of the input sequence and classifies it incrementally) may require manual model rewriting. W… ▽ More

    Submitted 28 July, 2020; v1 submitted 14 May, 2020; originally announced May 2020.