
Showing 1–15 of 15 results for author: Gowda, D

Searching in archive eess.
  1. arXiv:2401.10465  [pdf, other]

    cs.CL cs.SD eess.AS

    Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech

    Authors: Abhinav Garg, Jiyeon Kim, Sushil Khyalia, Chanwoo Kim, Dhananjaya Gowda

    Abstract: Grapheme-to-Phoneme (G2P) is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most of the current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. Firstly, the lexicons are generated using a fixed phoneme set, usually, ARPABET or IPA, which might not be the most optimal way to represent phonemes for all languag…

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted at ICASSP 2024

  2. arXiv:2312.09842  [pdf, ps, other]

    cs.SD eess.AS

    On the compression of shallow non-causal ASR models using knowledge distillation and tied-and-reduced decoder for low-latency on-device speech recognition

    Authors: Nagaraj Adiga, Jinhwan Park, Chintigari Shiva Kumar, Shatrughan Singh, Kyungmin Lee, Chanwoo Kim, Dhananjaya Gowda

    Abstract: Recently, the cascaded two-pass architecture has emerged as a strong contender for on-device automatic speech recognition (ASR). A cascade of causal and shallow non-causal encoders coupled with a shared decoder enables operation in both streaming and look-ahead modes. In this paper, we propose shallow cascaded model by combining various model compression techniques such as knowledge distillation,…

    Submitted 15 December, 2023; originally announced December 2023.

  3. arXiv:2308.16540  [pdf, other]

    eess.AS cs.CL cs.SD eess.SP

    Time-Varying Quasi-Closed-Phase Analysis for Accurate Formant Tracking in Speech Signals

    Authors: Dhananjaya Gowda, Sudarsana Reddy Kadiri, Brad Story, Paavo Alku

    Abstract: In this paper, we propose a new method for the accurate estimation and tracking of formants in speech signals using time-varying quasi-closed-phase (TVQCP) analysis. Conventional formant tracking methods typically adopt a two-stage estimate-and-track strategy wherein an initial set of formant candidates are estimated using short-time analysis (e.g., 10--50 ms), followed by a tracking stage based o…

    Submitted 31 August, 2023; originally announced August 2023.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 28, pp. 1901-1914, 2020

  4. arXiv:2308.09051  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Refining a Deep Learning-based Formant Tracker using Linear Prediction Methods

    Authors: Paavo Alku, Sudarsana Reddy Kadiri, Dhananjaya Gowda

    Abstract: In this study, formant tracking is investigated by refining the formants tracked by an existing data-driven tracker, DeepFormants, using the formants estimated in a model-driven manner by linear prediction (LP)-based methods. As LP-based formant estimation methods, conventional covariance analysis (LP-COV) and the recently proposed quasi-closed phase forward-backward (QCP-FB) analysis are used. In…

    Submitted 17 August, 2023; originally announced August 2023.

    Comments: Computer Speech and Language, Vol. 81, Article 101515, June 2023

  5. arXiv:2308.08442  [pdf, other]

    cs.CL cs.SD eess.AS

    Mitigating the Exposure Bias in Sentence-Level Grapheme-to-Phoneme (G2P) Transduction

    Authors: Eunseop Yoon, Hee Suk Yoon, Dhananjaya Gowda, SooHwan Eom, Daehyeok Kim, John Harvill, Heting Gao, Mark Hasegawa-Johnson, Chanwoo Kim, Chang D. Yoo

    Abstract: Text-to-Text Transfer Transformer (T5) has recently been considered for the Grapheme-to-Phoneme (G2P) transduction. As a follow-up, a tokenizer-free byte-level model based on T5 referred to as ByT5, recently gave promising results on word-level G2P conversion by representing each input character with its corresponding UTF-8 encoding. Although it is generally understood that sentence-level or parag…

    Submitted 16 August, 2023; originally announced August 2023.

    Comments: INTERSPEECH 2023

  6. Multi-stage Progressive Compression of Conformer Transducer for On-device Speech Recognition

    Authors: Jash Rathod, Nauman Dawalatabad, Shatrughan Singh, Dhananjaya Gowda

    Abstract: The smaller memory bandwidth in smart devices prompts development of smaller Automatic Speech Recognition (ASR) models. To obtain a smaller model, one can employ the model compression techniques. Knowledge distillation (KD) is a popular model compression approach that has shown to achieve smaller model size with relatively lesser degradation in the model performance. In this approach, knowledge is…

    Submitted 30 September, 2022; originally announced October 2022.

    Comments: Published in INTERSPEECH 2022

    Journal ref: Proc. Interspeech 2022, 1691-1695

  7. arXiv:2201.02741  [pdf, other]

    eess.AS cs.SD

    Two-Pass End-to-End ASR Model Compression

    Authors: Nauman Dawalatabad, Tushar Vatsal, Ashutosh Gupta, Sungsoo Kim, Shatrughan Singh, Dhananjaya Gowda, Chanwoo Kim

    Abstract: Speech recognition on smart devices is challenging owing to the small memory footprint. Hence small size ASR models are desirable. With the use of popular transducer-based models, it has become possible to practically deploy streaming speech recognition models on small devices [1]. Recently, the two-pass model [2] combining RNN-T and LAS modules has shown exceptional performance for streaming on-d…

    Submitted 7 January, 2022; originally announced January 2022.

    Comments: IEEE ASRU 2021

  8. arXiv:2201.01525  [pdf, other]

    eess.AS cs.LG cs.SD

    Formant Tracking Using Quasi-Closed Phase Forward-Backward Linear Prediction Analysis and Deep Neural Networks

    Authors: Dhananjaya Gowda, Bajibabu Bollepalli, Sudarsana Reddy Kadiri, Paavo Alku

    Abstract: Formant tracking is investigated in this study by using trackers based on dynamic programming (DP) and deep neural nets (DNNs). Using the DP approach, six formant estimation methods were first compared. The six methods include linear prediction (LP) algorithms, weighted LP algorithms and the recently developed quasi-closed phase forward-backward (QCP-FB) method. QCP-FB gave the best performance in…

    Submitted 5 January, 2022; originally announced January 2022.

    Journal ref: Published in IEEE ACCESS. Vol. 9, 2021, pp. 151631-151640

  9. arXiv:2111.10047  [pdf, other]

    eess.AS cs.CL cs.SD

    Semi-supervised transfer learning for language expansion of end-to-end speech recognition models to low-resource languages

    Authors: Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, Abhinav Garg, Chanwoo Kim

    Abstract: In this paper, we propose a three-stage training methodology to improve the speech recognition accuracy of low-resource languages. We explore and propose an effective combination of techniques such as transfer learning, encoder freezing, data augmentation using Text-To-Speech (TTS), and Semi-Supervised Learning (SSL). To improve the accuracy of a low-resource Italian ASR, we leverage a well-traine…

    Submitted 19 November, 2021; originally announced November 2021.

    Comments: Accepted as a conference paper at ASRU 2021

  10. arXiv:2111.10043  [pdf, other]

    eess.AS cs.SD

    A comparison of streaming models and data augmentation methods for robust speech recognition

    Authors: Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, Abhinav Garg, Chanwoo Kim

    Abstract: In this paper, we present a comparative study on the robustness of two different online streaming speech recognition models: Monotonic Chunkwise Attention (MoChA) and Recurrent Neural Network-Transducer (RNN-T). We explore three recently proposed data augmentation techniques, namely, multi-conditioned training using an acoustic simulator, Vocal Tract Length Perturbation (VTLP) for speaker variabil…

    Submitted 18 November, 2021; originally announced November 2021.

    Comments: Accepted as a conference paper at ASRU 2021

  11. arXiv:2105.01254  [pdf, other]

    cs.SD cs.LG eess.AS

    Streaming end-to-end speech recognition with jointly trained neural feature enhancement

    Authors: Chanwoo Kim, Abhinav Garg, Dhananjaya Gowda, Seongkyu Mun, Changwoo Han

    Abstract: In this paper, we present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoCha) jointly trained with enhancement layers. Even though the MoCha attention enables streaming speech recognition with recognition accuracy comparable to a full attention-based approach, training this model is sensitive to various factors such as the difficulty of training examples,…

    Submitted 3 May, 2021; originally announced May 2021.

    Comments: Accepted to ICASSP 2021

  12. arXiv:2001.00577  [pdf, other]

    eess.AS cs.LG cs.SD

    Attention based on-device streaming speech recognition with large speech corpus

    Authors: Kwangyoun Kim, Kyungmin Lee, Dhananjaya Gowda, Junmo Park, Sungsoo Kim, Sichen Jin, Young-Yoon Lee, Jinsu Yeo, Daehyun Kim, Seokyeong Jung, Jungin Lee, Myoungji Han, Chanwoo Kim

    Abstract: In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with large (> 10K hours) corpus. We attained around 90% of a word recognition rate for general domain mainly by using joint training of connectionist temporal classifier (CTC) and cross entropy (CE) losses, minimum word error rate (MWER) training, layer…

    Submitted 1 January, 2020; originally announced January 2020.

    Comments: Accepted and presented at the ASRU 2019 conference

  13. arXiv:1912.12384  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP stat.ML

    Improved Multi-Stage Training of Online Attention-based Encoder-Decoder Models

    Authors: Abhinav Garg, Dhananjaya Gowda, Ankur Kumar, Kwangyoun Kim, Mehul Kumar, Chanwoo Kim

    Abstract: In this paper, we propose a refined multi-stage multi-task training strategy to improve the performance of online attention-based encoder-decoder (AED) models. A three-stage training based on three levels of architectural granularity namely, character encoder, byte pair encoding (BPE) based encoder, and attention decoder, is proposed. Also, multi-task learning based on two-levels of linguistic gra…

    Submitted 27 December, 2019; originally announced December 2019.

    Comments: Accepted and presented at the ASRU 2019 conference

  14. arXiv:1912.11041  [pdf, ps, other]

    eess.AS cs.LG cs.SD stat.ML

    Power-law nonlinearity with maximally uniform distribution criterion for improved neural network training in automatic speech recognition

    Authors: Chanwoo Kim, Mehul Kumar, Kwangyoun Kim, Dhananjaya Gowda

    Abstract: In this paper, we describe the Maximum Uniformity of Distribution (MUD) algorithm with the power-law nonlinearity. In this approach, we hypothesize that neural network training will become more stable if feature distribution is not too much skewed. We propose two different types of MUD approaches: power function-based MUD and histogram-based MUD. In these approaches, we first obtain the mel filter…

    Submitted 21 December, 2019; originally announced December 2019.

    Comments: Accepted and presented at the ASRU 2019 conference

  15. arXiv:1912.11040  [pdf, ps, other]

    eess.AS cs.LG cs.SD eess.SP stat.ML

    End-to-end training of a large vocabulary end-to-end speech recognition system

    Authors: Chanwoo Kim, Sungsoo Kim, Kwangyoun Kim, Mehul Kumar, Jiyeon Kim, Kyungmin Lee, Changwoo Han, Abhinav Garg, Eunhyang Kim, Minkyoo Shin, Shatrughan Singh, Larry Heck, Dhananjaya Gowda

    Abstract: In this paper, we present an end-to-end training framework for building state-of-the-art end-to-end speech recognition systems. Our training system utilizes a cluster of Central Processing Units(CPUs) and Graphics Processing Units (GPUs). The entire data reading, large scale data augmentation, neural network parameter updates are all performed "on-the-fly". We use vocal tract length perturbation […

    Submitted 21 December, 2019; originally announced December 2019.

    Comments: Accepted and presented at the ASRU 2019 conference