-
PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
Authors:
Kentaro Mitsui,
Koh Mitsuda,
Toshiaki Wakatsuki,
Yukiya Hono,
Kei Sawada
Abstract:
Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response requires the prior generation of a written response, and (2) speech sequences are significantly longer than text sequences. This study addresses these issues by extending the input and output sequences of the language model to support the parallel generation of text and speech. Our experiments on spoken question answering tasks demonstrate that our approach improves latency while maintaining the quality of response content. Additionally, we show that latency can be further reduced by generating speech in multiple sequences. Demo samples are available at https://rinnakk.github.io/research/publications/PSLM.
Submitted 18 June, 2024;
originally announced June 2024.
-
Release of Pre-Trained Models for the Japanese Language
Authors:
Kei Sawada,
Tianyu Zhao,
Makoto Shing,
Kentaro Mitsui,
Akio Kaga,
Yukiya Hono,
Toshiaki Wakatsuki,
Koh Mitsuda
Abstract:
AI democratization aims to create a world in which the average person can utilize AI techniques. To achieve this goal, numerous research institutes have attempted to make their results accessible to the public. In particular, large pre-trained models trained on large-scale data have shown unprecedented potential, and their release has had a significant impact. However, most of the released models specialize in the English language, and thus, AI democratization in non-English-speaking communities is lagging significantly. To reduce this gap in AI access, we released Generative Pre-trained Transformer (GPT), Contrastive Language and Image Pre-training (CLIP), Stable Diffusion, and Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) pre-trained in Japanese. By providing these models, users can freely interface with AI that aligns with Japanese cultural values and ensures the identity of Japanese culture, thus enhancing the democratization of AI. Additionally, experiments showed that pre-trained models specialized for Japanese can efficiently achieve high performance in Japanese tasks.
Submitted 2 April, 2024;
originally announced April 2024.
-
Development of a near-infrared wide-field integral field unit by ultra-precision diamond cutting
Authors:
Kosuke Kushibiki,
Shinobu Ozaki,
Masahiro Takeda,
Takuya Hosobata,
Yutaka Yamagata,
Shinya Morita,
Toshihiro Tsuzuki,
Keiichi Nakagawa,
Takao Saiki,
Yutaka Ohtake,
Kenji Mitsui,
Hirofumi Okita,
Yutaro Kitagawa,
Yukihiro Kono,
Kentaro Motohara,
Hidenori Takahashi,
Masahiro Konishi,
Natsuko Kato,
Shuhei Koyama,
Nuo Chen
Abstract:
Integral Field Spectroscopy (IFS) is an observational method to obtain spatially resolved spectra over a specific field of view (FoV) in a single exposure. In recent years, near-infrared IFS has gained importance in observing objects with strong dust attenuation or at high redshift. One limitation of existing near-infrared IFS instruments is their relatively small FoV, less than 100 arcsec$^2$, compared to optical instruments. Therefore, we have developed a near-infrared (0.9-2.5 $\mathrm{\mu}$m) image-slicer type integral field unit (IFU) with a larger FoV of 13.5 $\times$ 10.4 arcsec$^2$ by matching a slice width to a typical seeing size of 0.4 arcsec. The IFU has a compact optical design utilizing off-axis ellipsoidal mirrors to reduce aberrations. Complex optical elements were fabricated using an ultra-precision cutting machine to achieve RMS surface roughness of less than 10 nm and a P-V shape error of less than 300 nm. The ultra-precision machining can also simplify alignment procedures. The on-sky performance evaluation confirmed that the image quality and the throughput of the IFU were as designed. In conclusion, we have successfully developed a compact IFU utilizing an ultra-precision cutting technique, almost fulfilling the requirements.
Submitted 3 March, 2024;
originally announced March 2024.
-
Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition
Authors:
Yukiya Hono,
Koh Mitsuda,
Tianyu Zhao,
Kentaro Mitsui,
Toshiaki Wakatsuki,
Kei Sawada
Abstract:
Advances in machine learning have made it possible to perform various text and speech processing tasks, such as automatic speech recognition (ASR), in an end-to-end (E2E) manner. E2E approaches utilizing pre-trained models are gaining attention for conserving training data and resources. However, most of their applications in ASR involve only one of either a pre-trained speech or a language model. This paper proposes integrating a pre-trained speech representation model and a large language model (LLM) for E2E ASR. The proposed model enables the optimization of the entire ASR process, including acoustic feature extraction and acoustic and language modeling, by combining pre-trained models with a bridge network and also enables the application of remarkable developments in LLM utilization, such as parameter-efficient domain adaptation and inference optimization. Experimental results demonstrate that the proposed model achieves a performance comparable to that of modern E2E ASR models by utilizing powerful pre-training models with the proposed integrated approach.
Submitted 6 June, 2024; v1 submitted 6 December, 2023;
originally announced December 2023.
-
Towards human-like spoken dialogue generation between AI agents from written dialogue
Authors:
Kentaro Mitsui,
Yukiya Hono,
Kei Sawada
Abstract:
The advent of large language models (LLMs) has made it possible to generate natural written dialogues between two agents. However, generating human-like spoken dialogues from these written dialogues remains challenging. Spoken dialogues have several unique characteristics: they frequently include backchannels and laughter, and the smoothness of turn-taking significantly influences the fluidity of conversation. This study proposes CHATS - CHatty Agents Text-to-Speech - a discrete token-based system designed to generate spoken dialogues based on written dialogues. Our system can generate speech for both the speaker side and the listener side simultaneously, using only the transcription from the speaker side, which eliminates the need for transcriptions of backchannels or laughter. Moreover, CHATS facilitates natural turn-taking; it determines the appropriate duration of silence after each utterance in the absence of overlap, and it initiates the generation of overlapping speech based on the phoneme sequence of the next utterance in case of overlap. Experimental evaluations indicate that CHATS outperforms the text-to-speech baseline, producing spoken dialogues that are more interactive and fluid while retaining clarity and intelligibility.
Submitted 2 October, 2023;
originally announced October 2023.
-
UniFLG: Unified Facial Landmark Generator from Text or Speech
Authors:
Kentaro Mitsui,
Yukiya Hono,
Kei Sawada
Abstract:
Talking face generation has been extensively investigated owing to its wide applicability. The two primary frameworks used for talking face generation comprise a text-driven framework, which generates synchronized speech and talking faces from text, and a speech-driven framework, which generates talking faces from speech. To integrate these frameworks, this paper proposes a unified facial landmark generator (UniFLG). The proposed system exploits end-to-end text-to-speech not only for synthesizing speech but also for extracting a series of latent representations that are common to text and speech, and feeds it to a landmark decoder to generate facial landmarks. We demonstrate that our system achieves higher naturalness in both speech synthesis and facial landmark generation compared to the state-of-the-art text-driven method. We further demonstrate that our system can generate facial landmarks from speech of speakers without facial video data or even speech data.
Submitted 18 May, 2023; v1 submitted 28 February, 2023;
originally announced February 2023.
-
Text-Guided Scene Sketch-to-Photo Synthesis
Authors:
AprilPyone MaungMaung,
Makoto Shing,
Kentaro Mitsui,
Kei Sawada,
Fumio Okura
Abstract:
We propose a method for scene-level sketch-to-photo synthesis with text guidance. Although object-level sketch-to-photo synthesis has been widely studied, whole-scene synthesis is still challenging without reference photos that adequately reflect the target style. To this end, we leverage knowledge from recent large-scale pre-trained generative models, resulting in text-guided sketch-to-photo synthesis without the need for reference images. To train our model, we use self-supervised learning from a set of photographs. Specifically, we use a pre-trained edge detector that maps both color and sketch images into a standardized edge domain, which reduces the gap between photograph-based edge images (during training) and hand-drawn sketch images (during inference). We implement our method by fine-tuning a latent diffusion model (i.e., Stable Diffusion) with sketch and text conditions. Experiments show that the proposed method translates original sketch images that are not extracted from color images into photos with compelling visual quality.
Submitted 14 February, 2023;
originally announced February 2023.
-
End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue
Authors:
Kentaro Mitsui,
Tianyu Zhao,
Kei Sawada,
Yukiya Hono,
Yoshihiko Nankaku,
Keiichi Tokuda
Abstract:
Recent text-to-speech (TTS) has achieved quality comparable to that of humans; however, its application in spoken dialogue has not been widely studied. This study aims to realize a TTS that closely resembles human dialogue. First, we record and transcribe actual spontaneous dialogues. Then, the proposed dialogue TTS is trained in two stages. In the first stage, variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS is trained, which introduces an utterance-level latent variable into variational inference with adversarial learning for end-to-end text-to-speech (VITS), a recently proposed end-to-end TTS model. A style encoder that extracts a latent speaking style representation from speech is trained jointly with TTS. In the second stage, a style predictor is trained to predict the speaking style to be synthesized from dialogue history. During inference, by passing the speaking style representation predicted by the style predictor to VAE/GMVAE-VITS, speech can be synthesized in a style appropriate to the context of the dialogue. Subjective evaluation results demonstrate that the proposed method outperforms the original VITS in terms of dialogue-level naturalness.
Submitted 23 June, 2022;
originally announced June 2022.
-
A Super-Earth Orbiting Near the Inner Edge of the Habitable Zone around the M4.5-dwarf Ross 508
Authors:
Hiroki Harakawa,
Takuya Takarada,
Yui Kasagi,
Teruyuki Hirano,
Takayuki Kotani,
Masayuki Kuzuhara,
Masashi Omiya,
Hajime Kawahara,
Akihiko Fukui,
Yasunori Hori,
Hiroyuki Tako Ishikawa,
Masahiro Ogihara,
John Livingston,
Timothy D. Brandt,
Thayne Currie,
Wako Aoki,
Charles A. Beichman,
Thomas Henning,
Klaus Hodapp,
Masato Ishizuka,
Hideyuki Izumiura,
Shane Jacobson,
Markus Janson,
Eiji Kambe,
Takanori Kodama
, et al. (24 additional authors not shown)
Abstract:
We report the near-infrared radial-velocity (RV) discovery of a super-Earth planet on a 10.77-day orbit around the M4.5 dwarf Ross 508 ($J_\mathrm{mag}=9.1$). Using precision RVs from the Subaru Telescope IRD (InfraRed Doppler) instrument, we derive a semi-amplitude of $3.92^{+0.60}_{-0.58}$ ${\rm m\,s}^{-1}$, corresponding to a planet with a minimum mass $m \sin i = 4.00^{+0.53}_{-0.55}\ M_{\oplus}$. We find no evidence of significant signals at the detected period in spectroscopic stellar activity indicators or MEarth photometry. The planet, Ross 508 b, has a semimajor-axis of $0.05366^{+0.00056}_{-0.00049}$ au. This gives an orbit-averaged insolation of $\approx$1.4 times the Earth's value, placing Ross 508 b near the inner edge of its star's habitable zone. We have explored the possibility that the planet has a high eccentricity and its host is accompanied by an additional unconfirmed companion on a wide orbit. Our discovery demonstrates that the near-infrared RV search can play a crucial role to find a low-mass planet around cool M dwarfs like Ross 508.
Submitted 24 May, 2022;
originally announced May 2022.
-
Relative compactifications of semiabelian Néron models, I
Authors:
Kentaro Mitsui,
Iku Nakamura
Abstract:
Let $R$ be a complete discrete valuation ring, $k(η)$ its fraction field, $S:=\Spec R$, $(G_η,\cL_η)$ a polarized abelian variety over $k(η)$ with $\cL_η$ ample cubical and $\cG$ the Néron model of $G_η$ over $S$. Suppose that $\cG$ is totally degenerate semiabelian over $S$. Then there exists a (unique) relative compactification $(P,\cN)$ of $\cG$ such that ($α$) $P$ is Cohen-Macaulay with $\codim_P(P\setminus\cG)=2$ and ($β$) $\cN$ is ample invertible with $\cN_{|\cG}$ cubical and $\cN_η=\cL^{\otimes n}_η$ for some positive integer $n$.
Submitted 15 May, 2024; v1 submitted 20 January, 2022;
originally announced January 2022.
-
MSR-NV: Neural Vocoder Using Multiple Sampling Rates
Authors:
Kentaro Mitsui,
Kei Sawada
Abstract:
The development of neural vocoders (NVs) has resulted in the high-quality and fast generation of waveforms. However, conventional NVs target a single sampling rate and require re-training when applied to different sampling rates. A suitable sampling rate varies from application to application due to the trade-off between speech quality and generation speed. In this study, we propose a method to handle multiple sampling rates in a single NV, called the MSR-NV. By generating waveforms step-by-step starting from a low sampling rate, MSR-NV can efficiently learn the characteristics of each frequency band and synthesize high-quality speech at multiple sampling rates. It can be regarded as an extension of the previously proposed NVs, and in this study, we extend the structure of Parallel WaveGAN (PWG). Experimental evaluation results demonstrate that the proposed method achieves remarkably higher subjective quality than the original PWG trained separately at 16, 24, and 48 kHz, without increasing the inference time. We also show that MSR-NV can leverage speech with lower sampling rates to further improve the quality of the synthetic speech.
Submitted 23 June, 2022; v1 submitted 28 September, 2021;
originally announced September 2021.
-
Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes
Authors:
Kentaro Mitsui,
Tomoki Koriyama,
Hiroshi Saruwatari
Abstract:
Multi-speaker speech synthesis is a technique for modeling multiple speakers' voices with a single model. Although many approaches using deep neural networks (DNNs) have been proposed, DNNs are prone to overfitting when the amount of training data is limited. We propose a framework for multi-speaker speech synthesis using deep Gaussian processes (DGPs); a DGP is a deep architecture of Bayesian kernel regressions and thus robust to overfitting. In this framework, speaker information is fed to duration/acoustic models using speaker codes. We also examine the use of deep Gaussian process latent variable models (DGPLVMs). In this approach, the representation of each speaker is learned simultaneously with other model parameters, and therefore the similarity or dissimilarity of speakers is considered efficiently. We experimentally evaluated two situations to investigate the effectiveness of the proposed methods. In one situation, the amount of data from each speaker is balanced (speaker-balanced), and in the other, the data from certain speakers are limited (speaker-imbalanced). Subjective and objective evaluation results showed that both the DGP and DGPLVM synthesize multi-speaker speech more effectively than a DNN in the speaker-balanced situation. We also found that the DGPLVM outperforms the DGP significantly in the speaker-imbalanced situation.
Submitted 6 August, 2020;
originally announced August 2020.
-
JVS corpus: free Japanese multi-speaker voice corpus
Authors:
Shinnosuke Takamichi,
Kentaro Mitsui,
Yuki Saito,
Tomoki Koriyama,
Naoko Tanji,
Hiroshi Saruwatari
Abstract:
Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we are developing Japanese voice corpora reasonably accessible from not only academic institutions but also commercial companies. In 2017, we released the JSUT corpus, which contains 10 hours of reading-style speech uttered by a single speaker, for end-to-end text-to-speech synthesis. For more general use in speech synthesis research, e.g., voice conversion and multi-speaker modeling, in this paper, we construct the JVS corpus, which contains voice data of 100 speakers in three styles (normal, whisper, and falsetto). The corpus contains 30 hours of voice data including 22 hours of parallel normal voices. This paper describes how we designed the corpus and summarizes the specifications. The corpus is available at our project page.
Submitted 17 August, 2019;
originally announced August 2019.
-
Logarithmic good reduction and the index
Authors:
Kentaro Mitsui,
Arne Smeets
Abstract:
Let $K$ be the fraction field of a complete discrete valuation ring, with algebraically closed residue field of characteristic $p > 0$. This paper studies the index of a smooth, proper $K$-variety $X$ with logarithmic good reduction. We prove that it is prime to $p$ in `most' cases, for example if the Euler number of $X$ does not vanish, but (perhaps surprisingly) not always. We also fully characterise curves of genus $1$ with logarithmic good reduction, thereby completing classical results of T. Saito and Stix valid for curves of genus at least $2$.
Submitted 15 June, 2023; v1 submitted 30 November, 2017;
originally announced November 2017.
-
Growth of critical points in one-dimensional lattice systems
Authors:
Masayuki Asaoka,
Tomohiro Fukaya,
Kentaro Mitsui,
Masaki Tsukamoto
Abstract:
We study the growth of the numbers of critical points in one-dimensional lattice systems by using (real) algebraic geometry and the theory of homoclinic tangency.
Submitted 27 June, 2012;
originally announced June 2012.
-
Neutrinos from extragalactic cosmic ray interactions in the far infrared background
Authors:
E. V. Bugaev,
A. Misaki,
K. Mitsui
Abstract:
The extragalactic background of high-energy neutrinos arising from the interactions of cosmic ray protons with far-infrared extragalactic background radiation is calculated. The main assumption is that the cosmic ray spectrum at energies higher than $10^8$ GeV has an extragalactic origin and, therefore, is proton dominated. All calculations take into account the possible cosmological evolution of extragalactic sources of cosmic ray protons as well as of infrared-luminous galaxies.
Submitted 19 May, 2005; v1 submitted 6 May, 2004;
originally announced May 2004.
-
Identification problems of muon and electron events in the Super-Kamiokande detector
Authors:
K Mitsui,
T Kitamura,
T Wada,
K Okei
Abstract:
In the measurement of atmospheric nu_e and nu_mu fluxes, the calculations of the Super-Kamiokande group for the distinction between muon-like and electron-like events observed in the water Cerenkov detector initially assumed a misidentification probability of less than 1 % and later 2 % for the sub-GeV range. In the multi-GeV range, they compared only the observed behaviors of ring patterns of muon and electron events, and claimed a 3 % misidentification. However, the expressions and the calculation method do not include the fluctuation properties due to the stochastic nature of the processes which determine the expected number of photoelectrons (p.e.) produced by muons and electrons. Our full Monte Carlo (MC) simulations including the fluctuations of photoelectron production show that the total misidentification rate for electrons and muons should be larger than or equal to 20 % for the sub-GeV region. Even in the multi-GeV region we expect a misidentification rate of several % based on our MC simulations taking into account the ring patterns. The misidentified events are mostly of muonic origin.
Submitted 18 September, 2002;
originally announced September 2002.