Search | arXiv e-print repository

PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems

Authors: Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, Kei Sawada

Abstract: Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response requires the prior generation of a written response, and (2) speech sequences are significantly longer than text sequences. This study addresses these issues by extending the input and output sequences of the language model to support the parallel generation of text and speech. Our experiments on spoken question answering tasks demonstrate that our approach improves latency while maintaining the quality of response content. Additionally, we show that latency can be further reduced by generating speech in multiple sequences. Demo samples are available at https://rinnakk.github.io/research/publications/PSLM.

Submitted 18 June, 2024; originally announced June 2024.

Comments: 8 pages, 4 figures, 4 tables, demo samples: https://rinnakk.github.io/research/publications/PSLM

arXiv:2404.01657 [pdf, other]

Release of Pre-Trained Models for the Japanese Language

Authors: Kei Sawada, Tianyu Zhao, Makoto Shing, Kentaro Mitsui, Akio Kaga, Yukiya Hono, Toshiaki Wakatsuki, Koh Mitsuda

Abstract: AI democratization aims to create a world in which the average person can utilize AI techniques. To achieve this goal, numerous research institutes have attempted to make their results accessible to the public. In particular, large pre-trained models trained on large-scale data have shown unprecedented potential, and their release has had a significant impact. However, most of the released models specialize in the English language, and thus, AI democratization in non-English-speaking communities is lagging significantly. To reduce this gap in AI access, we released Generative Pre-trained Transformer (GPT), Contrastive Language and Image Pre-training (CLIP), Stable Diffusion, and Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) pre-trained in Japanese. By providing these models, users can freely interface with AI that aligns with Japanese cultural values and ensures the identity of Japanese culture, thus enhancing the democratization of AI. Additionally, experiments showed that pre-trained models specialized for Japanese can efficiently achieve high performance in Japanese tasks.

Submitted 2 April, 2024; originally announced April 2024.

Comments: 9 pages, 1 figure, 5 tables, accepted for LREC-COLING 2024. Models are publicly available at https://huggingface.co/rinna

arXiv:2312.03668 [pdf, other]

Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition

Authors: Yukiya Hono, Koh Mitsuda, Tianyu Zhao, Kentaro Mitsui, Toshiaki Wakatsuki, Kei Sawada

Abstract: Advances in machine learning have made it possible to perform various text and speech processing tasks, such as automatic speech recognition (ASR), in an end-to-end (E2E) manner. E2E approaches utilizing pre-trained models are gaining attention for conserving training data and resources. However, most of their applications in ASR involve only one of either a pre-trained speech or a language model. This paper proposes integrating a pre-trained speech representation model and a large language model (LLM) for E2E ASR. The proposed model enables the optimization of the entire ASR process, including acoustic feature extraction and acoustic and language modeling, by combining pre-trained models with a bridge network and also enables the application of remarkable developments in LLM utilization, such as parameter-efficient domain adaptation and inference optimization. Experimental results demonstrate that the proposed model achieves a performance comparable to that of modern E2E ASR models by utilizing powerful pre-training models with the proposed integrated approach.

Submitted 6 June, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

Comments: 17 pages, 4 figures, 9 tables, accepted for Findings of ACL 2024. The model is available at https://huggingface.co/rinna/nue-asr

arXiv:2010.10173 [pdf]

doi 10.1585/pfr.16.1402030

Non-resonant n = 1 helical core induced by m/n = 2/1 tearing mode in JT-60U

Authors: T. Bando, S. Inoue, K. Shinohara, A. Isayama, T. Wakatsuki, M. Yoshida, M. Honda, G. Matsunaga, M. Takechi, N. Oyama, S. Ide

Abstract: In JT-60U, simultaneous excitation of n = 1 helical cores (HCs) and m/n = 2/1 Tearing Modes (TMs) was observed [T. Bando et al., Plasma Phys. Control. Fusion 61 115014 (2019)]. In this paper, we have investigated the excitation mechanism of n = 1 HCs with m/n = 2/1 TMs based on the experimental observations and a simple quasi-linear MHD model. In the previous study, it was reported that a "coupling" on the phase of the MHD mode is observed between n = 1 HCs and m/n = 2/1 TMs. In this study, it is found that the coupling is observed with the mode frequency from several Hz to 6 kHz. This indicates that the resistive wall and the plasma control system do not induce the coupling because the both time scales are different from the mode frequency. In addition, n = 1 HCs appear to be the non-resonant mode from the two observations: n = 1 HCs do not rotate with the plasma around the q = 1 surface in the core and the coupling is also observed even when qmin > 1. It is also observed that the electron fluctuation due to an n = 1 HC in the core region disappears with the stabilization of an m/n = 2/1 neoclassical tearing mode by electron cyclotron current drive, implying that n = 1 HCs are driven by m/n = 2/1 TMs. This perspective, n = 1 HCs are driven by m/n = 2/1 TMs, is supported by the observation that the saturated amplitude of the m/n = 1/1 component of the radial displacement in the core is smaller than that of the m/n = 2/1 component.
Finally, we revisit a quasi-linear MHD model where the m/n = 1/1 HC is induced directly by the sideband of the current for the m/n = 2/1 TM, which allows to excite the non-resonant m/n = 1/1 mode. The model also describes the characteristic of the coupling, fm/n=1/1(HC) = 2fm/n=2/1(TM).

Submitted 14 December, 2020; v1 submitted 20 October, 2020; originally announced October 2020.

arXiv:1005.1589 [pdf]

doi 10.1088/1478-3975/7/2/026011

Analysis of Diffusion of Ras2 in Saccharomyces cerevisiae Using Fluorescence Recovery after Photobleaching

Authors: Kalyan C. Vinnakota, David A. Mitchell, Robert J. Deschenes, Tetsuro Wakatsuki, Daniel A. Beard

Abstract: Binding, lateral diffusion and exchange are fundamental dynamic processes involved in protein association with cellular membranes. In this study, we developed numerical simulations of lateral diffusion and exchange of fluorophores in membranes with arbitrary bleach geometry and exchange of the membrane localized fluorophore with the cytosol during Fluorescence Recovery after Photobleaching (FRAP) experiments. The model simulations were used to design FRAP experiments with varying bleach region sizes on plasma-membrane localized wild type GFP-Ras2 with a dual lipid anchor and mutant GFP-Ras2C318S with a single lipid anchor in live yeast cells to investigate diffusional mobility and the presence of any exchange processes operating in the time scale of our experiments. Model parameters estimated using data from FRAP experiments with a 1 micron x 1 micron bleach region-of-interest (ROI) and a 0.5 micron x 0.5 micron bleach ROI showed that GFP-Ras2, single or dual lipid modified, diffuses as single species with no evidence of exchange with a cytoplasmic pool. This is the first report of Ras2 mobility in yeast plasma membrane. The methods developed in this study are generally applicable for studying diffusion and exchange of membrane associated fluorophores using FRAP on commercial confocal laser scanning microscopes.

Submitted 10 May, 2010; originally announced May 2010.

Comments: Accepted for publication in Physical Biology (2010). 28 pages, 7 figures, 3 tables

Journal ref: Kalyan C Vinnakota et al 2010 Phys. Biol. 7 026011

arXiv:0812.3255 [pdf, ps, other]

doi 10.1103/PhysRevLett.102.130502

Local transformation of two EPR photon pairs into a three-photon W state

Authors: Toshiyuki Tashima, Tetsuroh Wakatsuki, Sahin Kaya Ozdemir, Takashi Yamamoto, Masato Koashi, Nobuyuki Imoto

Abstract: We propose and experimentally demonstrate a transformation of two EPR photon pairs distributed among three parties into a three-photon W state using local operations and classical communication. We then characterize the final state using quantum state tomography on the three-photon state and on its marginal bipartite states. The fidelity of the final state to the ideal W state is $0.778\pm 0.043$ and the expectation value for its witness operator is $-0.111\pm 0.043$ implying the success of the proposed local transformation.

Submitted 17 December, 2008; originally announced December 2008.

Comments: 5 pages, 5 figures

Journal ref: Phys.Rev.Lett.102:130502,2009

Showing 1–6 of 6 results for author: Wakatsuki, T