Showing 1–10 of 10 results for author: Mitsui, K

Searching in archive cs.
  1. arXiv:2406.12428  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems

    Authors: Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, Kei Sawada

    Abstract: Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response requires the prior generation of a written response, and (2) speech sequences are significantly longer than text sequences. This study addresses these issues by e…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 8 pages, 4 figures, 4 tables, demo samples: https://rinnakk.github.io/research/publications/PSLM

  2. arXiv:2404.01657  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG eess.AS

    Release of Pre-Trained Models for the Japanese Language

    Authors: Kei Sawada, Tianyu Zhao, Makoto Shing, Kentaro Mitsui, Akio Kaga, Yukiya Hono, Toshiaki Wakatsuki, Koh Mitsuda

    Abstract: AI democratization aims to create a world in which the average person can utilize AI techniques. To achieve this goal, numerous research institutes have attempted to make their results accessible to the public. In particular, large pre-trained models trained on large-scale data have shown unprecedented potential, and their release has had a significant impact. However, most of the released models…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: 9 pages, 1 figure, 5 tables, accepted for LREC-COLING 2024. Models are publicly available at https://huggingface.co/rinna

  3. arXiv:2312.03668  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG

    Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition

    Authors: Yukiya Hono, Koh Mitsuda, Tianyu Zhao, Kentaro Mitsui, Toshiaki Wakatsuki, Kei Sawada

    Abstract: Advances in machine learning have made it possible to perform various text and speech processing tasks, such as automatic speech recognition (ASR), in an end-to-end (E2E) manner. E2E approaches utilizing pre-trained models are gaining attention for conserving training data and resources. However, most of their applications in ASR involve only one of either a pre-trained speech or a language model.…

    Submitted 6 June, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

    Comments: 17 pages, 4 figures, 9 tables, accepted for Findings of ACL 2024. The model is available at https://huggingface.co/rinna/nue-asr

  4. arXiv:2310.01088  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Towards human-like spoken dialogue generation between AI agents from written dialogue

    Authors: Kentaro Mitsui, Yukiya Hono, Kei Sawada

    Abstract: The advent of large language models (LLMs) has made it possible to generate natural written dialogues between two agents. However, generating human-like spoken dialogues from these written dialogues remains challenging. Spoken dialogues have several unique characteristics: they frequently include backchannels and laughter, and the smoothness of turn-taking significantly influences the fluidity of…

    Submitted 2 October, 2023; originally announced October 2023.

    Comments: 18 pages, 8 figures, 9 tables, audio samples: https://rinnakk.github.io/research/publications/CHATS/

  5. arXiv:2302.14337  [pdf, other]

    cs.CV cs.CL cs.SD eess.AS eess.IV

    UniFLG: Unified Facial Landmark Generator from Text or Speech

    Authors: Kentaro Mitsui, Yukiya Hono, Kei Sawada

    Abstract: Talking face generation has been extensively investigated owing to its wide applicability. The two primary frameworks used for talking face generation comprise a text-driven framework, which generates synchronized speech and talking faces from text, and a speech-driven framework, which generates talking faces from speech. To integrate these frameworks, this paper proposes a unified facial landmark…

    Submitted 18 May, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

    Comments: 5 pages, 2 figures, 3 tables, accepted for INTERSPEECH 2023. Audio samples: https://rinnakk.github.io/research/publications/UniFLG

  6. arXiv:2302.06883  [pdf, other]

    cs.CV

    Text-Guided Scene Sketch-to-Photo Synthesis

    Authors: AprilPyone MaungMaung, Makoto Shing, Kentaro Mitsui, Kei Sawada, Fumio Okura

    Abstract: We propose a method for scene-level sketch-to-photo synthesis with text guidance. Although object-level sketch-to-photo synthesis has been widely studied, whole-scene synthesis is still challenging without reference photos that adequately reflect the target style. To this end, we leverage knowledge from recent large-scale pre-trained generative models, resulting in text-guided sketch-to-photo synt…

    Submitted 14 February, 2023; originally announced February 2023.

  7. arXiv:2206.12040  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue

    Authors: Kentaro Mitsui, Tianyu Zhao, Kei Sawada, Yukiya Hono, Yoshihiko Nankaku, Keiichi Tokuda

    Abstract: The recent text-to-speech (TTS) has achieved quality comparable to that of humans; however, its application in spoken dialogue has not been widely studied. This study aims to realize a TTS that closely resembles human dialogue. First, we record and transcribe actual spontaneous dialogues. Then, the proposed dialogue TTS is trained in two stages: first stage, variational autoencoder (VAE)-VITS or G…

    Submitted 23 June, 2022; originally announced June 2022.

    Comments: 5 pages, 3 figures, accepted for INTERSPEECH 2022. Audio samples: https://rinnakk.github.io/research/publications/DialogueTTS/

  8. arXiv:2109.13714  [pdf, other]

    eess.AS cs.LG cs.SD

    MSR-NV: Neural Vocoder Using Multiple Sampling Rates

    Authors: Kentaro Mitsui, Kei Sawada

    Abstract: The development of neural vocoders (NVs) has resulted in the high-quality and fast generation of waveforms. However, conventional NVs target a single sampling rate and require re-training when applied to different sampling rates. A suitable sampling rate varies from application to application due to the trade-off between speech quality and generation speed. In this study, we propose a method to ha…

    Submitted 23 June, 2022; v1 submitted 28 September, 2021; originally announced September 2021.

    Comments: 6 pages including supplement, 3 figures, accepted for INTERSPEECH 2022. Audio samples: https://rinnakk.github.io/research/publications/MSR-NV/

  9. arXiv:2008.02950  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes

    Authors: Kentaro Mitsui, Tomoki Koriyama, Hiroshi Saruwatari

    Abstract: Multi-speaker speech synthesis is a technique for modeling multiple speakers' voices with a single model. Although many approaches using deep neural networks (DNNs) have been proposed, DNNs are prone to overfitting when the amount of training data is limited. We propose a framework for multi-speaker speech synthesis using deep Gaussian processes (DGPs); a DGP is a deep architecture of Bayesian ker…

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: 5 pages, accepted for INTERSPEECH 2020

  10. arXiv:1908.06248  [pdf, other]

    cs.SD eess.AS

    JVS corpus: free Japanese multi-speaker voice corpus

    Authors: Shinnosuke Takamichi, Kentaro Mitsui, Yuki Saito, Tomoki Koriyama, Naoko Tanji, Hiroshi Saruwatari

    Abstract: Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we are developing Japanese voice corpora reasonably accessible from not only academic institutions but also commercial companies. In 2017, we released the JSUT corpus, which contains 10 hours of reading-style speech uttered b…

    Submitted 17 August, 2019; originally announced August 2019.