Showing 1–13 of 13 results for author: Hono, Y

  1. arXiv:2406.12428  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems

    Authors: Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, Kei Sawada

    Abstract: Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response requires the prior generation of a written response, and (2) speech sequences are significantly longer than text sequences. This study addresses these issues by e…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 8 pages, 4 figures, 4 tables, demo samples: https://rinnakk.github.io/research/publications/PSLM

  2. arXiv:2404.01657  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG eess.AS

    Release of Pre-Trained Models for the Japanese Language

    Authors: Kei Sawada, Tianyu Zhao, Makoto Shing, Kentaro Mitsui, Akio Kaga, Yukiya Hono, Toshiaki Wakatsuki, Koh Mitsuda

    Abstract: AI democratization aims to create a world in which the average person can utilize AI techniques. To achieve this goal, numerous research institutes have attempted to make their results accessible to the public. In particular, large pre-trained models trained on large-scale data have shown unprecedented potential, and their release has had a significant impact. However, most of the released models…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: 9 pages, 1 figure, 5 tables, accepted for LREC-COLING 2024. Models are publicly available at https://huggingface.co/rinna

  3. arXiv:2402.14692  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    PeriodGrad: Towards Pitch-Controllable Neural Vocoder Based on a Diffusion Probabilistic Model

    Authors: Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda

    Abstract: This paper presents a neural vocoder based on a denoising diffusion probabilistic model (DDPM) incorporating explicit periodic signals as auxiliary conditioning signals. Recently, DDPM-based neural vocoders have gained prominence as non-autoregressive models that can generate high-quality waveforms. The neural vocoders based on DDPM have the advantage of training with a simple time-domain loss. In…

    Submitted 22 February, 2024; originally announced February 2024.

    Comments: 5 pages, 4 figures, To appear in ICASSP 2024. Audio samples: https://www.sp.nitech.ac.jp/~hono/demos/icassp2024/

  4. arXiv:2312.03668  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG

    Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition

    Authors: Yukiya Hono, Koh Mitsuda, Tianyu Zhao, Kentaro Mitsui, Toshiaki Wakatsuki, Kei Sawada

    Abstract: Advances in machine learning have made it possible to perform various text and speech processing tasks, such as automatic speech recognition (ASR), in an end-to-end (E2E) manner. E2E approaches utilizing pre-trained models are gaining attention for conserving training data and resources. However, most of their applications in ASR involve only one of either a pre-trained speech or a language model.…

    Submitted 6 June, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

    Comments: 17 pages, 4 figures, 9 tables, accepted for Findings of ACL 2024. The model is available at https://huggingface.co/rinna/nue-asr

  5. arXiv:2310.01088  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Towards human-like spoken dialogue generation between AI agents from written dialogue

    Authors: Kentaro Mitsui, Yukiya Hono, Kei Sawada

    Abstract: The advent of large language models (LLMs) has made it possible to generate natural written dialogues between two agents. However, generating human-like spoken dialogues from these written dialogues remains challenging. Spoken dialogues have several unique characteristics: they frequently include backchannels and laughter, and the smoothness of turn-taking significantly influences the fluidity of…

    Submitted 2 October, 2023; originally announced October 2023.

    Comments: 18 pages, 8 figures, 9 tables, audio samples: https://rinnakk.github.io/research/publications/CHATS/

  6. arXiv:2302.14337  [pdf, other]

    cs.CV cs.CL cs.SD eess.AS eess.IV

    UniFLG: Unified Facial Landmark Generator from Text or Speech

    Authors: Kentaro Mitsui, Yukiya Hono, Kei Sawada

    Abstract: Talking face generation has been extensively investigated owing to its wide applicability. The two primary frameworks used for talking face generation comprise a text-driven framework, which generates synchronized speech and talking faces from text, and a speech-driven framework, which generates talking faces from speech. To integrate these frameworks, this paper proposes a unified facial landmark…

    Submitted 18 May, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

    Comments: 5 pages, 2 figures, 3 tables, accepted for INTERSPEECH 2023. Audio samples: https://rinnakk.github.io/research/publications/UniFLG

  7. arXiv:2301.02262  [pdf, other]

    eess.AS cs.LG cs.SD

    Singing voice synthesis based on frame-level sequence-to-sequence models considering vocal timing deviation

    Authors: Miku Nishihara, Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda

    Abstract: This paper proposes singing voice synthesis (SVS) based on frame-level sequence-to-sequence models considering vocal timing deviation. In SVS, it is essential to synchronize the timing of singing with temporal structures represented by scores, taking into account that there are differences between actual vocal timing and note start timing. In many SVS systems including our previous work, phoneme-l…

    Submitted 22 February, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: 5 pages, 4 figures

  8. arXiv:2212.13703  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Singing Voice Synthesis Based on a Musical Note Position-Aware Attention Mechanism

    Authors: Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda

    Abstract: This paper proposes a novel sequence-to-sequence (seq2seq) model with a musical note position-aware attention mechanism for singing voice synthesis (SVS). A seq2seq modeling approach that can simultaneously perform acoustic and temporal modeling is attractive. However, due to the difficulty of the temporal modeling of singing voices, many recent SVS systems with an encoder-decoder-based model stil…

    Submitted 14 March, 2023; v1 submitted 28 December, 2022; originally announced December 2022.

    Comments: 5 pages, 4 figures, 2 tables, accepted to ICASSP 2023

  9. arXiv:2211.11222  [pdf, other]

    eess.AS cs.CL cs.SD

    Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System

    Authors: Takenori Yoshimura, Shinji Takaki, Kazuhiro Nakamura, Keiichiro Oura, Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda

    Abstract: This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system towards end-to-end controllable speech synthesis. Since the mel-cepstral synthesis filter is explicitly embedded in neural waveform models in the proposed system, both voice characteristics and the pitch of synthesized speech are highly controlled via a frequency warping parameter and fundame…

    Submitted 21 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  10. arXiv:2206.12040  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue

    Authors: Kentaro Mitsui, Tianyu Zhao, Kei Sawada, Yukiya Hono, Yoshihiko Nankaku, Keiichi Tokuda

    Abstract: The recent text-to-speech (TTS) has achieved quality comparable to that of humans; however, its application in spoken dialogue has not been widely studied. This study aims to realize a TTS that closely resembles human dialogue. First, we record and transcribe actual spontaneous dialogues. Then, the proposed dialogue TTS is trained in two stages: first stage, variational autoencoder (VAE)-VITS or G…

    Submitted 23 June, 2022; originally announced June 2022.

    Comments: 5 pages, 3 figures, accepted for INTERSPEECH 2022. Audio samples: https://rinnakk.github.io/research/publications/DialogueTTS/

  11. arXiv:2108.02776  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System

    Authors: Yukiya Hono, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

    Abstract: This paper presents Sinsy, a deep neural network (DNN)-based singing voice synthesis (SVS) system. In recent years, DNNs have been utilized in statistical parametric SVS systems, and DNN-based SVS systems have demonstrated better performance than conventional hidden Markov model-based ones. SVS systems are required to synthesize a singing voice with pitch and timing that strictly follow a given mu…

    Submitted 5 August, 2021; originally announced August 2021.

    Comments: 14 pages, 11 figures, 3 tables, Accepted to IEEE/ACM Transactions on Audio, Speech and Language Processing

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2803-2815, 2021

  12. arXiv:2102.07786  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    PeriodNet: A non-autoregressive waveform generation model with a structure separating periodic and aperiodic components

    Authors: Yukiya Hono, Shinji Takaki, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

    Abstract: We propose PeriodNet, a non-autoregressive (non-AR) waveform generation model with a new model structure for modeling periodic and aperiodic components in speech waveforms. The non-AR waveform generation models can generate speech waveforms parallelly and can be used as a speech vocoder by conditioning an acoustic feature. Since a speech waveform contains periodic and aperiodic components, both co…

    Submitted 15 February, 2021; originally announced February 2021.

    Comments: 5 pages, accepted to ICASSP 2021

  13. arXiv:2009.08474  [pdf, other]

    eess.AS cs.LG cs.SD

    Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis

    Authors: Yukiya Hono, Kazuna Tsuboi, Kei Sawada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

    Abstract: This paper proposes a hierarchical generative model with a multi-grained latent variable to synthesize expressive speech. In recent years, fine-grained latent variables are introduced into the text-to-speech synthesis that enable the fine control of the prosody and speaking styles of synthesized speech. However, the naturalness of speech degrades when these latent variables are obtained by samplin…

    Submitted 26 December, 2021; v1 submitted 17 September, 2020; originally announced September 2020.

    Comments: 5 pages, accepted to INTERSPEECH 2020, demo page: https://www.rinna.jp/research/interspeech2020/