Showing 1–12 of 12 results for author: Sawada, K

Searching in archive cs.
  1. arXiv:2406.12428  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems

    Authors: Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, Kei Sawada

    Abstract: Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response requires the prior generation of a written response, and (2) speech sequences are significantly longer than text sequences. This study addresses these issues by e…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 8 pages, 4 figures, 4 tables, demo samples: https://rinnakk.github.io/research/publications/PSLM

  2. arXiv:2404.01657  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG eess.AS

    Release of Pre-Trained Models for the Japanese Language

    Authors: Kei Sawada, Tianyu Zhao, Makoto Shing, Kentaro Mitsui, Akio Kaga, Yukiya Hono, Toshiaki Wakatsuki, Koh Mitsuda

    Abstract: AI democratization aims to create a world in which the average person can utilize AI techniques. To achieve this goal, numerous research institutes have attempted to make their results accessible to the public. In particular, large pre-trained models trained on large-scale data have shown unprecedented potential, and their release has had a significant impact. However, most of the released models…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: 9 pages, 1 figure, 5 tables, accepted for LREC-COLING 2024. Models are publicly available at https://huggingface.co/rinna

  3. arXiv:2312.03668  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG

    Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition

    Authors: Yukiya Hono, Koh Mitsuda, Tianyu Zhao, Kentaro Mitsui, Toshiaki Wakatsuki, Kei Sawada

    Abstract: Advances in machine learning have made it possible to perform various text and speech processing tasks, such as automatic speech recognition (ASR), in an end-to-end (E2E) manner. E2E approaches utilizing pre-trained models are gaining attention for conserving training data and resources. However, most of their applications in ASR involve only one of either a pre-trained speech or a language model.…

    Submitted 6 June, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

    Comments: 17 pages, 4 figures, 9 tables, accepted for Findings of ACL 2024. The model is available at https://huggingface.co/rinna/nue-asr

  4. arXiv:2310.01088  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Towards human-like spoken dialogue generation between AI agents from written dialogue

    Authors: Kentaro Mitsui, Yukiya Hono, Kei Sawada

    Abstract: The advent of large language models (LLMs) has made it possible to generate natural written dialogues between two agents. However, generating human-like spoken dialogues from these written dialogues remains challenging. Spoken dialogues have several unique characteristics: they frequently include backchannels and laughter, and the smoothness of turn-taking significantly influences the fluidity of…

    Submitted 2 October, 2023; originally announced October 2023.

    Comments: 18 pages, 8 figures, 9 tables, audio samples: https://rinnakk.github.io/research/publications/CHATS/

  5. arXiv:2306.00369  [pdf, other]

    cs.CL

    Focused Prefix Tuning for Controllable Text Generation

    Authors: Congda Ma, Tianyu Zhao, Makoto Shing, Kei Sawada, Manabu Okumura

    Abstract: In a controllable text generation dataset, there exist unannotated attributes that could provide irrelevant learning signals to models that use it for training and thus degrade their performance. We propose focused prefix tuning (FPT) to mitigate the problem and to enable the control to focus on the desired attribute. Experimental results show that FPT can achieve better control accuracy and text f…

    Submitted 10 June, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to the ACL 2023

  6. arXiv:2302.14337  [pdf, other]

    cs.CV cs.CL cs.SD eess.AS eess.IV

    UniFLG: Unified Facial Landmark Generator from Text or Speech

    Authors: Kentaro Mitsui, Yukiya Hono, Kei Sawada

    Abstract: Talking face generation has been extensively investigated owing to its wide applicability. The two primary frameworks used for talking face generation comprise a text-driven framework, which generates synchronized speech and talking faces from text, and a speech-driven framework, which generates talking faces from speech. To integrate these frameworks, this paper proposes a unified facial landmark…

    Submitted 18 May, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

    Comments: 5 pages, 2 figures, 3 tables, accepted for INTERSPEECH 2023. Audio samples: https://rinnakk.github.io/research/publications/UniFLG

  7. arXiv:2302.06883  [pdf, other]

    cs.CV

    Text-Guided Scene Sketch-to-Photo Synthesis

    Authors: AprilPyone MaungMaung, Makoto Shing, Kentaro Mitsui, Kei Sawada, Fumio Okura

    Abstract: We propose a method for scene-level sketch-to-photo synthesis with text guidance. Although object-level sketch-to-photo synthesis has been widely studied, whole-scene synthesis is still challenging without reference photos that adequately reflect the target style. To this end, we leverage knowledge from recent large-scale pre-trained generative models, resulting in text-guided sketch-to-photo synt…

    Submitted 14 February, 2023; originally announced February 2023.

  8. Open Multi-Access Network Platform with Dynamic Task Offloading and Intelligent Resource Monitoring

    Authors: Takuji Tachibana, Kazuki Sawada, Hiroyuki Fujii, Ryo Maruyama, Tomonori Yamada, Masaaki Fujii, Toshimichi Fukuda

    Abstract: We constructed an open multi-access network platform using open-source hardware and software. The open multi-access network platform is characterized by the flexible utilization of network functions, integral management and control of wired and wireless access networks, zero-touch provisioning, intelligent resource monitoring, and dynamic task offloading. We also propose an application-driven dyna…

    Submitted 4 November, 2022; originally announced November 2022.

    Journal ref: IEEE Communications Magazine, Vol. 60, Issue 8, pp. 52-58, August 2022

  9. arXiv:2206.12040  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue

    Authors: Kentaro Mitsui, Tianyu Zhao, Kei Sawada, Yukiya Hono, Yoshihiko Nankaku, Keiichi Tokuda

    Abstract: The recent text-to-speech (TTS) has achieved quality comparable to that of humans; however, its application in spoken dialogue has not been widely studied. This study aims to realize a TTS that closely resembles human dialogue. First, we record and transcribe actual spontaneous dialogues. Then, the proposed dialogue TTS is trained in two stages: first stage, variational autoencoder (VAE)-VITS or G…

    Submitted 23 June, 2022; originally announced June 2022.

    Comments: 5 pages, 3 figures, accepted for INTERSPEECH 2022. Audio samples: https://rinnakk.github.io/research/publications/DialogueTTS/

  10. arXiv:2109.13714  [pdf, other]

    eess.AS cs.LG cs.SD

    MSR-NV: Neural Vocoder Using Multiple Sampling Rates

    Authors: Kentaro Mitsui, Kei Sawada

    Abstract: The development of neural vocoders (NVs) has resulted in the high-quality and fast generation of waveforms. However, conventional NVs target a single sampling rate and require re-training when applied to different sampling rates. A suitable sampling rate varies from application to application due to the trade-off between speech quality and generation speed. In this study, we propose a method to ha…

    Submitted 23 June, 2022; v1 submitted 28 September, 2021; originally announced September 2021.

    Comments: 6 pages including supplement, 3 figures, accepted for INTERSPEECH 2022. Audio samples: https://rinnakk.github.io/research/publications/MSR-NV/

  11. arXiv:2009.08474  [pdf, other]

    eess.AS cs.LG cs.SD

    Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis

    Authors: Yukiya Hono, Kazuna Tsuboi, Kei Sawada, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

    Abstract: This paper proposes a hierarchical generative model with a multi-grained latent variable to synthesize expressive speech. In recent years, fine-grained latent variables are introduced into the text-to-speech synthesis that enable the fine control of the prosody and speaking styles of synthesized speech. However, the naturalness of speech degrades when these latent variables are obtained by samplin…

    Submitted 26 December, 2021; v1 submitted 17 September, 2020; originally announced September 2020.

    Comments: 5 pages, accepted to INTERSPEECH 2020, demo page: https://www.rinna.jp/research/interspeech2020/

  12. arXiv:2006.06119

    cs.CV cs.LG cs.SD eess.AS

    Dance Revolution: Long-Term Dance Generation with Music via Curriculum Learning

    Authors: Ruozi Huang, Huang Hu, Wei Wu, Kei Sawada, Mi Zhang, Daxin Jiang

    Abstract: Dancing to music is one of human's innate abilities since ancient times. In machine learning research, however, synthesizing dance movements from music is a challenging problem. Recently, researchers synthesize human motion sequences through autoregressive models like recurrent neural network (RNN). Such an approach often generates short sequences due to an accumulation of prediction errors that a…

    Submitted 9 September, 2023; v1 submitted 10 June, 2020; originally announced June 2020.

    Comments: This paper includes the unrigorous quantitative experimental results and has been withdrawn from the conference