Showing 1–13 of 13 results for author: Kumatani, K

Searching in archive cs.
  1. arXiv:2205.08598  [pdf, other]

    cs.SD cs.CL eess.AS eess.SP

    Deploying self-supervised learning in the wild for hybrid automatic speech recognition

    Authors: Mostafa Karimi, Changliang Liu, Kenichi Kumatani, Yao Qian, Tianyu Wu, Jian Wu

    Abstract: Self-supervised learning (SSL) methods have proven to be very successful in automatic speech recognition (ASR). These great improvements have been reported mostly based on highly curated datasets such as LibriSpeech for non-streaming End-to-End ASR models. However, the pivotal characteristic of SSL is that it can be utilized for any untranscribed audio data. In this paper, we provide a full exploration on…

    Submitted 17 May, 2022; originally announced May 2022.

  2. arXiv:2112.05826  [pdf, other]

    cs.CL cs.AI cs.LG eess.AS

    Sequence-level self-learning with multiple hypotheses

    Authors: Kenichi Kumatani, Dimitrios Dimitriadis, Yashesh Gaur, Robert Gmyr, Sefik Emre Eskimez, Jinyu Li, Michael Zeng

    Abstract: In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must be used as a label. However, the imperfect ASR result makes unsupervised learning difficult to consistently improve recognition performance especially in the case that multipl…

    Submitted 10 December, 2021; originally announced December 2021.

    Comments: Published in Interspeech 2020: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2020.pdf

    Report number: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2020.pdf

    Journal ref: Proc. Interspeech 2020, pages 3775-3779

  3. arXiv:2112.05820  [pdf, other]

    cs.CL cs.AI cs.LG eess.AS

    Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition

    Authors: Kenichi Kumatani, Robert Gmyr, Felipe Cruz Salinas, Linquan Liu, Wei Zuo, Devang Patel, Eric Sun, Yu Shi

    Abstract: The sparsely-gated Mixture of Experts (MoE) can magnify a network capacity with a little computational complexity. In this work, we investigate how multi-lingual Automatic Speech Recognition (ASR) networks can be scaled up with a simple routing algorithm in order to achieve better accuracy. More specifically, we apply the sparsely-gated MoE technique to two types of networks: Sequence-to-Sequence…

    Submitted 4 January, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

  4. arXiv:2110.09695  [pdf, other]

    cs.LG

    Tackling Dynamics in Federated Incremental Learning with Variational Embedding Rehearsal

    Authors: Tae Jin Park, Kenichi Kumatani, Dimitrios Dimitriadis

    Abstract: Federated Learning is a fast growing area of ML where the training datasets are extremely distributed, all while dynamically changing over time. Models need to be trained on clients' devices without any guarantees for either homogeneity or stationarity of the local private data. The need for continual training has also risen, due to the ever-increasing production of in-task data. However, pursuing…

    Submitted 18 October, 2021; originally announced October 2021.

  5. arXiv:2110.07909  [pdf, other]

    cs.CL eess.AS

    Multilingual Speech Recognition using Knowledge Transfer across Learning Processes

    Authors: Rimita Lahiri, Kenichi Kumatani, Eric Sun, Yao Qian

    Abstract: Multilingual end-to-end (E2E) models have shown great potential for expanding language coverage in automatic speech recognition (ASR). In this paper, we aim to enhance the multilingual ASR performance in two ways: 1) studying the impact of feeding a one-hot vector identifying the language, 2) formulating the task with a meta-learning objective combined with self-supervised lea…

    Submitted 15 October, 2021; originally announced October 2021.

    Comments: 5 pages

  6. arXiv:2106.07578  [pdf, other]

    cs.LG cs.DC

    Dynamic Gradient Aggregation for Federated Domain Adaptation

    Authors: Dimitrios Dimitriadis, Kenichi Kumatani, Robert Gmyr, Yashesh Gaur, Sefik Emre Eskimez

    Abstract: In this paper, a new learning algorithm for Federated Learning (FL) is introduced. The proposed scheme is based on a weighted gradient aggregation using two-step optimization to offer a flexible training pipeline. Herein, two different flavors of the aggregation method are presented, leading to an order of magnitude improvement in convergence speed compared to other distributed or FL training algo…

    Submitted 14 June, 2021; originally announced June 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2008.02452

  7. arXiv:2101.07597  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data

    Authors: Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang

    Abstract: In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve…

    Submitted 10 June, 2021; v1 submitted 19 January, 2021; originally announced January 2021.

    Comments: Accepted by ICML 2021

  8. arXiv:2008.02452  [pdf, other]

    cs.LG cs.DC stat.ML

    Federated Transfer Learning with Dynamic Gradient Aggregation

    Authors: Dimitrios Dimitriadis, Kenichi Kumatani, Robert Gmyr, Yashesh Gaur, Sefik Emre Eskimez

    Abstract: In this paper, a Federated Learning (FL) simulation platform is introduced. The target scenario is Acoustic Model training based on this platform. To our knowledge, this is the first attempt to apply FL techniques to Speech Recognition tasks due to the inherent complexity. The proposed FL platform can support different tasks based on the adopted modular design. As part of the platform, a novel hie…

    Submitted 6 August, 2020; originally announced August 2020.

  9. arXiv:2002.02520  [pdf, other]

    cs.SD cs.CL eess.AS

    Robust Multi-channel Speech Recognition using Frequency Aligned Network

    Authors: Taejin Park, Kenichi Kumatani, Minhua Wu, Shiva Sundaram

    Abstract: Conventional speech enhancement techniques such as beamforming have known benefits for far-field speech recognition. Our own work in frequency-domain multi-channel acoustic modeling has shown additional improvements by training a spatial filtering layer jointly within an acoustic model. In this paper, we further develop this idea and use a frequency aligned network for robust multi-channel automatic s…

    Submitted 6 February, 2020; originally announced February 2020.

  10. Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning

    Authors: Sanna Wager, Aparna Khare, Minhua Wu, Kenichi Kumatani, Shiva Sundaram

    Abstract: In this work, we investigated the teacher-student training paradigm to train a fully learnable multi-channel acoustic model for far-field automatic speech recognition (ASR). Using a large offline teacher model trained on beamformed audio, we trained a simpler multi-channel student acoustic model used in the speech recognition system. For the student, both multi-channel feature extraction layers an…

    Submitted 31 January, 2020; originally announced February 2020.

    Comments: To appear in ICASSP 2020

  11. Multi-Geometry Spatial Acoustic Modeling for Distant Speech Recognition

    Authors: Kenichi Kumatani, Minhua Wu, Shiva Sundaram, Nikko Strom, Bjorn Hoffmeister

    Abstract: The use of spatial information with multiple microphones can improve far-field automatic speech recognition (ASR) accuracy. However, conventional microphone array techniques degrade speech enhancement performance when there is an array geometry mismatch between design and test conditions. Moreover, such speech enhancement techniques do not always yield ASR accuracy improvement due to the differenc…

    Submitted 28 April, 2019; v1 submitted 12 March, 2019; originally announced March 2019.

    Comments: ICASSP 2019, 5 pages. arXiv admin note: substantial text overlap with arXiv:1903.05299

    Report number: https://doi.org/10.1109/ICASSP.2019.8682294

    Journal ref: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019, pages 6635-6639

  12. Frequency Domain Multi-channel Acoustic Modeling for Distant Speech Recognition

    Authors: Minhua Wu, Kenichi Kumatani, Shiva Sundaram, Nikko Strom, Bjorn Hoffmeister

    Abstract: Conventional far-field automatic speech recognition (ASR) systems typically employ microphone array techniques for speech enhancement in order to improve robustness against noise or reverberation. However, such speech enhancement techniques do not always yield ASR accuracy improvement because the optimization criterion for speech enhancement is not directly relevant to the ASR objective. In this w…

    Submitted 28 April, 2019; v1 submitted 12 March, 2019; originally announced March 2019.

    Comments: ICASSP 2019, 5 pages

    Report number: https://doi.org/10.1109/ICASSP.2019.8682977

    Journal ref: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019, pages 6640-6644

  13. arXiv:1901.02348  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD stat.ML

    Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning

    Authors: Ladislav Mošner, Minhua Wu, Anirudh Raju, Sree Hari Krishnan Parthasarathi, Kenichi Kumatani, Shiva Sundaram, Roland Maas, Björn Hoffmeister

    Abstract: For real-world speech recognition applications, noise robustness is still a challenge. In this work, we adopt the teacher-student (T/S) learning technique using a parallel clean and noisy corpus for improving automatic speech recognition (ASR) performance under multimedia noise. On top of that, we apply a logits selection method which only preserves the k highest values to prevent wrong emphasis o…

    Submitted 15 March, 2019; v1 submitted 5 January, 2019; originally announced January 2019.

    Comments: To appear in ICASSP 2019