Showing 1–10 of 10 results for author: Karita, S

  1. arXiv:2305.18802

    eess.AS cs.SD

    LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus

    Authors: Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, Ankur Bapna

    Abstract: This paper introduces a new speech dataset called "LibriTTS-R" designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved.…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  2. arXiv:2303.01664

    cs.SD cs.LG eess.AS

    Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

    Authors: Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Yu Zhang, Wei Han, Ankur Bapna, Michiel Bacchiani

    Abstract: Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the Web to studio-quality. To make our SR model robust against various degradation,…

    Submitted 14 August, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: Accepted to WASPAA 2023

  3. arXiv:2202.07894

    cs.CL cs.SD eess.AS

    Knowledge Transfer from Large-scale Pretrained Language Models to End-to-end Speech Recognizers

    Authors: Yotaro Kubo, Shigeki Karita, Michiel Bacchiani

    Abstract: End-to-end speech recognition is a promising technology for enabling compact automatic speech recognition (ASR) systems since it can unify the acoustic and language model into a single neural network. However, as a drawback, training of end-to-end speech recognizers always requires transcribed utterances. Since end-to-end models are also known to be severely data hungry, this constraint is crucial…

    Submitted 16 February, 2022; originally announced February 2022.

    Comments: To be presented in ICASSP 2022

  4. arXiv:2111.00764

    eess.AS cs.SD

    SNRi Target Training for Joint Speech Enhancement and Recognition

    Authors: Yuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani

    Abstract: Speech enhancement (SE) is used as a frontend in speech applications including automatic speech recognition (ASR) and telecommunication. A difficulty in using the SE frontend is that the appropriate noise reduction level differs depending on applications and/or noise characteristics. In this study, we propose "signal-to-noise ratio improvement (SNRi) target training"; the SE frontend is trained to…

    Submitted 28 March, 2022; v1 submitted 1 November, 2021; originally announced November 2021.

    Comments: Submitted to Interspeech 2022 (v1 has been rejected from ICASSP 2022)

  5. arXiv:2106.15813

    eess.AS cs.SD

    DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement

    Authors: Yuma Koizumi, Shigeki Karita, Scott Wisdom, Hakan Erdogan, John R. Hershey, Llion Jones, Michiel Bacchiani

    Abstract: Single-channel speech enhancement (SE) is an important task in speech processing. A widely used framework combines an analysis/synthesis filterbank with a mask prediction network, such as the Conv-TasNet architecture. In such systems, the denoising performance and computational efficiency are mainly affected by the structure of the mask prediction network. In this study, we aim to improve the sequ…

    Submitted 5 August, 2021; v1 submitted 30 June, 2021; originally announced June 2021.

    Comments: 5 pages, 2 figures. Accepted for WASPAA 2021

  6. arXiv:2106.05111

    cs.CL cs.SD eess.AS

    A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition

    Authors: Shigeki Karita, Yotaro Kubo, Michiel Adriaan Unico Bacchiani, Llion Jones

    Abstract: End-to-end (E2E) modeling is advantageous for automatic speech recognition (ASR) especially for Japanese since word-based tokenization of Japanese is not trivial, and E2E modeling is able to model character sequences directly. This paper focuses on the latest E2E modeling techniques, and investigates their performances on character-based Japanese ASR by conducting comparative experiments. The resu…

    Submitted 9 June, 2021; originally announced June 2021.

    Comments: to be published in INTERSPEECH2021

  7. arXiv:2012.13006

    eess.AS cs.SD

    The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans

    Authors: Shinji Watanabe, Florian Boyer, Xuankai Chang, Pengcheng Guo, Tomoki Hayashi, Yosuke Higuchi, Takaaki Hori, Wen-Chin Huang, Hirofumi Inaguma, Naoyuki Kamo, Shigeki Karita, Chenda Li, Jing Shi, Aswin Shanmugam Subramanian, Wangyou Zhang

    Abstract: This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. Now ESPnet also includes text…

    Submitted 23 December, 2020; originally announced December 2020.

  8. arXiv:2010.12973

    cs.CL cs.SD eess.AS

    Unsupervised Learning of Disentangled Speech Content and Style Representation

    Authors: Andros Tjandra, Ruoming Pang, Yu Zhang, Shigeki Karita

    Abstract: We present an approach for unsupervised learning of speech representation disentangling contents and styles. Our model consists of: (1) a local encoder that captures per-frame information; (2) a global encoder that captures per-utterance information; and (3) a conditional decoder that reconstructs speech given local and global latent variables. Our experiments show that (1) the local latent variab…

    Submitted 20 June, 2021; v1 submitted 24 October, 2020; originally announced October 2020.

    Comments: Submitted to Interspeech 2021

  9. arXiv:2004.10234

    cs.CL cs.SD eess.AS

    ESPnet-ST: All-in-One Speech Translation Toolkit

    Authors: Hirofumi Inaguma, Shun Kiyono, Kevin Duh, Shigeki Karita, Nelson Enrique Yalta Soplin, Tomoki Hayashi, Shinji Watanabe

    Abstract: We present ESPnet-ST, which is designed for the quick development of speech-to-speech translation systems in a single framework. ESPnet-ST is a new project inside end-to-end speech processing toolkit, ESPnet, which integrates or newly implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation. We provide all-in-one recipes including data pre-p…

    Submitted 30 September, 2020; v1 submitted 21 April, 2020; originally announced April 2020.

    Comments: Accepted at ACL 2020 System Demonstration (update Table1, fix typo)

  10. A Comparative Study on Transformer vs RNN in Speech Applications

    Authors: Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang

    Abstract: Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We underto…

    Submitted 28 September, 2019; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: Accepted at ASRU 2019

    Journal ref: IEEE Automatic Speech Recognition and Understanding Workshop 2019