Showing 1–50 of 86 results for author: Saruwatari, H

Searching in archive cs.
  1. arXiv:2406.17722  [pdf, other]

    cs.SD eess.AS

    Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals

    Authors: Kentaro Seki, Shinnosuke Takamichi, Norihiro Takamune, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari

    Abstract: This paper proposes a new task called spatial voice conversion, which aims to convert a target voice while preserving spatial information and non-target signals. Traditional voice conversion methods focus on single-channel waveforms, ignoring the stereo listening experience inherent in human hearing. Our baseline approach addresses this gap by integrating blind source separation (BSS), voice conve…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  2. arXiv:2406.07280  [pdf, ps, other]

    cs.SD eess.AS

    Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment

    Authors: Takuto Igarashi, Yuki Saito, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We propose noise-robust voice conversion (VC) which takes into account the recording quality and environment of noisy source speech. Conventional denoising training improves the noise robustness of a VC model by learning noisy-to-clean VC process. However, the naturalness of the converted speech is limited when the noise of the source speech is unseen during the training. To this end, our proposed…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 5 pages, accepted for INTERSPEECH 2024, audio samples: http://y-saito.sakura.ne.jp/sython/Corpus/SRC4VC/IS2024_CDT_supplementary/demo_cdt.html

  3. arXiv:2406.07254  [pdf, ps, other]

    cs.SD eess.AS

    SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark

    Authors: Yuki Saito, Takuto Igarashi, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We present SRC4VC, a new corpus containing 11 hours of speech recorded on smartphones by 100 Japanese speakers. Although high-quality multi-speaker corpora can advance voice conversion (VC) technologies, they are not always suitable for testing VC when low-quality speech recording is given as the input. To this end, we first asked 100 crowdworkers to record their voice samples using smartphones. T…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted for INTERSPEECH 2024, corpus project page: https://y-saito.sakura.ne.jp/sython/Corpus/SRC4VC/index.html

  4. arXiv:2404.03204  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

    Authors: Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao

    Abstract: We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. Th…

    Submitted 19 May, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

  5. arXiv:2403.13353  [pdf, other]

    cs.SD eess.AS

    Building speech corpus with diverse voice characteristics for its prompt-based representation

    Authors: Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Wataru Nakata, Detai Xin, Hiroshi Saruwatari

    Abstract: In text-to-speech synthesis, the ability to control voice characteristics is vital for various applications. By leveraging thriving text prompt-based generation techniques, it should be possible to enhance the nuanced control of voice characteristics. While previous research has explored the prompt-based manipulation of voice characteristics, most studies have used pre-recorded speech, which limit…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing. arXiv admin note: text overlap with arXiv:2309.13509

  6. arXiv:2403.12477  [pdf, other]

    cs.SD eess.AS

    Real-time Speech Extraction Using Spatially Regularized Independent Low-rank Matrix Analysis and Rank-constrained Spatial Covariance Matrix Estimation

    Authors: Yuto Ishikawa, Kohei Konaka, Tomohiko Nakamura, Norihiro Takamune, Hiroshi Saruwatari

    Abstract: Real-time speech extraction is an important challenge with various applications such as speech recognition in a human-like avatar/robot. In this paper, we propose the real-time extension of a speech extraction method based on independent low-rank matrix analysis (ILRMA) and rank-constrained spatial covariance matrix estimation (RCSCME). The RCSCME-based method is a multichannel blind speech extrac…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: 5 pages, 3 figures, accepted at HSCMA 2024

  7. arXiv:2401.16812  [pdf, other]

    cs.SD eess.AS

    SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics

    Authors: Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, Hiroshi Saruwatari

    Abstract: While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need for objective metrics that are highly correlated with human subjective judgments due to their cost efficiency. This paper proposes reference-aware automatic evaluation methods for speech generation inspired by evaluation metrics in natural language processing. The proposed SpeechBERTS…

    Submitted 12 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: Accepted by Interspeech 2024. An extended version with Appendix. Code: https://github.com/Takaaki-Saeki/DiscreteSpeechMetrics

  8. arXiv:2401.05809  [pdf, ps, other]

    eess.AS cs.SD

    Localizing Acoustic Energy in Sound Field Synthesis by Directionally Weighted Exterior Radiation Suppression

    Authors: Yoshihide Tomita, Shoichi Koyama, Hiroshi Saruwatari

    Abstract: A method for synthesizing the desired sound field while suppressing the exterior radiation power with directional weighting is proposed. The exterior radiation from the loudspeakers in sound field synthesis systems can be problematic in practical situations. Although several methods to suppress the exterior radiation have been proposed, suppression in all outward directions is generally difficult,…

    Submitted 11 January, 2024; originally announced January 2024.

    Comments: Accepted to International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024

  9. JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions

    Authors: Detai Xin, Junfeng Jiang, Shinnosuke Takamichi, Yuki Saito, Akiko Aizawa, Hiroshi Saruwatari

    Abstract: We present the JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs) that are essential expressions in spoken language to express emotions. We propose an automatic script generation method to…

    Submitted 9 October, 2023; originally announced October 2023.

  10. arXiv:2309.13509  [pdf, other]

    cs.SD eess.AS

    Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control

    Authors: Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Wataru Nakata, Detai Xin, Hiroshi Saruwatari

    Abstract: In text-to-speech, controlling voice characteristics is important in achieving various-purpose speech synthesis. Considering the success of text-conditioned generation, such as text-to-image, free-form text instruction should be useful for intuitive and complicated control of voice characteristics. A sufficiently large corpus of high-quality and diverse voice samples with corresponding free-form d…

    Submitted 23 September, 2023; originally announced September 2023.

    Comments: Submitted to ASRU2023

  11. arXiv:2309.09690  [pdf, other]

    cs.CL cs.SD eess.AS

    Do learned speech symbols follow Zipf's law?

    Authors: Shinnosuke Takamichi, Hiroki Maeda, Joonyong Park, Daisuke Saito, Hiroshi Saruwatari

    Abstract: In this study, we investigate whether speech symbols, learned through deep learning, follow Zipf's law, akin to natural language symbols. Zipf's law is an empirical law that delineates the frequency distribution of words, forming fundamentals for statistical analysis in natural language processing. Natural language symbols, which are invented by humans to symbolize speech content, are recognized t…

    Submitted 18 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  12. arXiv:2309.08127  [pdf, other]

    cs.SD eess.AS

    Diversity-based core-set selection for text-to-speech with linguistic and acoustic features

    Authors: Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari

    Abstract: This paper proposes a method for extracting a lightweight subset from a text-to-speech (TTS) corpus ensuring synthetic speech quality. In recent years, methods have been proposed for constructing large-scale TTS corpora by collecting diverse data from massive sources such as audiobooks and YouTube. Although these methods have gained significant attention for enhancing the expressive capabilities o…

    Submitted 14 September, 2023; originally announced September 2023.

  13. arXiv:2309.05634  [pdf, other]

    cs.SD eess.AS

    Kernel Interpolation of Incident Sound Field in Region Including Scattering Objects

    Authors: Shoichi Koyama, Masaki Nakada, Juliano G. C. Ribeiro, Hiroshi Saruwatari

    Abstract: A method for estimating the incident sound field inside a region containing scattering objects is proposed. The sound field estimation method has various applications, such as spatial audio capturing and spatial active noise control; however, most existing methods do not take into account the presence of scatterers within the target estimation region. Although several techniques exist that employ…

    Submitted 11 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2023

  14. arXiv:2307.13941  [pdf, ps, other]

    eess.AS cs.SD

    Perceptual Quality Enhancement of Sound Field Synthesis Based on Combination of Pressure and Amplitude Matching

    Authors: Keisuke Kimura, Shoichi Koyama, Hiroshi Saruwatari

    Abstract: A sound field synthesis method enhancing perceptual quality is proposed. Sound field synthesis using multiple loudspeakers enables spatial audio reproduction with a broad listening area; however, synthesis errors at high frequencies called spatial aliasing artifacts are unavoidable. To minimize these artifacts, we propose a method based on the combination of pressure and amplitude matching. On the…

    Submitted 25 July, 2023; originally announced July 2023.

    Comments: Accepted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2023

  15. arXiv:2306.12820  [pdf, other]

    cs.SD eess.AS

    NoisyILRMA: Diffuse-Noise-Aware Independent Low-Rank Matrix Analysis for Fast Blind Source Extraction

    Authors: Koki Nishida, Norihiro Takamune, Rintaro Ikeshita, Daichi Kitamura, Hiroshi Saruwatari, Tomohiro Nakatani

    Abstract: In this paper, we address the multichannel blind source extraction (BSE) of a single source in diffuse noise environments. To solve this problem even faster than by fast multichannel nonnegative matrix factorization (FastMNMF) and its variant, we propose a BSE method called NoisyILRMA, which is a modification of independent low-rank matrix analysis (ILRMA) to account for diffuse noise. NoisyILRMA…

    Submitted 22 June, 2023; originally announced June 2023.

    Comments: 5 pages, 3 figures, accepted for European Signal Processing Conference 2023 (EUSIPCO 2023)

  16. arXiv:2306.12169  [pdf, other]

    cs.HC

    HumanDiffusion: diffusion model using perceptual gradients

    Authors: Yota Ueda, Shinnosuke Takamichi, Yuki Saito, Norihiro Takamune, Hiroshi Saruwatari

    Abstract: We propose HumanDiffusion, a diffusion model trained from humans' perceptual gradients to learn an acceptable range of data for humans (i.e., human-acceptable distribution). Conventional HumanGAN aims to model the human-acceptable distribution wider than the real-data distribution by training a neural network-based generator with human-based discriminators. However, HumanGAN training tends t…

    Submitted 21 June, 2023; originally announced June 2023.

    Comments: Proceedings of INTERSPEECH

  17. Algorithms of Sampling-Frequency-Independent Layers for Non-integer Strides

    Authors: Kanami Imamura, Tomohiko Nakamura, Norihiro Takamune, Kohei Yatabe, Hiroshi Saruwatari

    Abstract: In this paper, we propose algorithms for handling non-integer strides in sampling-frequency-independent (SFI) convolutional and transposed convolutional layers. The SFI layers have been developed for handling various sampling frequencies (SFs) by a single neural network. They are replaceable with their non-SFI counterparts and can be introduced into various network architectures. However, they cou…

    Submitted 19 June, 2023; originally announced June 2023.

    Comments: 5 pages, 3 figures, accepted for European Signal Processing Conference 2023 (EUSIPCO 2023)

    Journal ref: European Signal Processing Conference, Sep. 2023, pp. 326--330

  18. arXiv:2306.08855  [pdf, other]

    eess.AS cs.SD

    Multichannel Active Noise Control with Exterior Radiation Suppression Based on Riemannian Optimization

    Authors: Takaaki Kojima, Kazuyuki Arikawa, Shoichi Koyama, Hiroshi Saruwatari

    Abstract: A multichannel active noise control (ANC) method with exterior radiation suppression is proposed. When applying ANC in a three-dimensional space by using multiple microphones and loudspeakers, the loudspeaker output can amplify noise outside a region of target positions because most of current ANC methods do not take into consideration the exterior radiation of secondary loudspeakers. We propose a…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: Accepted to European Signal Processing Conference (EUSIPCO) 2023

  19. arXiv:2306.00697  [pdf, other]

    cs.CL cs.AI eess.AS

    How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics

    Authors: Joonyong Park, Shinnosuke Takamichi, Tomohiko Nakamura, Kentaro Seki, Detai Xin, Hiroshi Saruwatari

    Abstract: We examine the speech modeling potential of generative spoken language modeling (GSLM), which involves using learned symbols derived from data rather than phonemes for speech analysis and synthesis. Since GSLM facilitates textless spoken language processing, exploring its effectiveness is critical for paving the way for novel paradigms in spoken-language processing. This paper presents the finding…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to INTERSPEECH 2023

  20. arXiv:2305.13724  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    ChatGPT-EDSS: Empathetic Dialogue Speech Synthesis Trained from ChatGPT-derived Context Word Embeddings

    Authors: Yuki Saito, Shinnosuke Takamichi, Eiji Iimori, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We propose ChatGPT-EDSS, an empathetic dialogue speech synthesis (EDSS) method using ChatGPT for extracting dialogue context. ChatGPT is a chatbot that can deeply understand the content and purpose of an input prompt and appropriately respond to the user's request. We focus on ChatGPT's reading comprehension and introduce it to EDSS, a task of synthesizing speech that can empathize with the interl…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: 5 pages, accepted for INTERSPEECH 2023

  21. arXiv:2305.13713  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center

    Authors: Yuki Saito, Eiji Iimori, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We present CALLS, a Japanese speech corpus that considers phone calls in a customer center as a new domain of empathetic spoken dialogue. The existing STUDIES corpus covers only empathetic dialogue between a teacher and student in a school. To extend the application range of empathetic dialogue speech synthesis (EDSS), we designed our corpus to include the same female speaker as the STUDIES teache…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: 5 pages, accepted for INTERSPEECH2023

  22. arXiv:2305.12445  [pdf, other]

    cs.SD eess.AS

    JNV Corpus: A Corpus of Japanese Nonverbal Vocalizations with Diverse Phrases and Emotions

    Authors: Detai Xin, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present JNV (Japanese Nonverbal Vocalizations) corpus, a corpus of Japanese nonverbal vocalizations (NVs) with diverse phrases and emotions. Existing Japanese NV corpora lack phrase or emotion diversity, which makes it difficult to analyze NVs and support downstream tasks like emotion recognition. We first propose a corpus-design method that contains two phases: (1) collecting NVs phrases based…

    Submitted 21 May, 2023; originally announced May 2023.

    Comments: 4 pages, 3 figures

  23. arXiv:2305.12442  [pdf, other]

    cs.SD eess.AS

    Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus

    Authors: Detai Xin, Shinnosuke Takamichi, Ai Morimatsu, Hiroshi Saruwatari

    Abstract: We present a large-scale in-the-wild Japanese laughter corpus and a laughter synthesis method. Previous work on laughter synthesis lacks not only data but also proper ways to represent laughter. To solve these problems, we first propose an in-the-wild corpus comprising 3.5 hours of laughter, which is to our best knowledge the largest laughter corpus designed for laughter synthesis. We then propo…

    Submitted 26 May, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted by INTERSPEECH 2023

  24. arXiv:2303.16021  [pdf, ps, other]

    eess.AS cs.SD

    Spatial Active Noise Control Method Based On Sound Field Interpolation From Reference Microphone Signals

    Authors: Kazuyuki Arikawa, Shoichi Koyama, Hiroshi Saruwatari

    Abstract: A spatial active noise control (ANC) method based on the interpolation of a sound field from reference microphone signals is proposed. In most current spatial ANC methods, a sufficient number of error microphones are required to reduce noise over the target region because the sound field is estimated from error microphone signals. However, in practical applications, it is preferable that the numbe…

    Submitted 28 March, 2023; originally announced March 2023.

    Comments: Accepted to International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023

  25. arXiv:2303.03869  [pdf, ps, other]

    eess.AS cs.SD

    Kernel interpolation of acoustic transfer functions with adaptive kernel for directed and residual reverberations

    Authors: Juliano G. C. Ribeiro, Shoichi Koyama, Hiroshi Saruwatari

    Abstract: An interpolation method for region-to-region acoustic transfer functions (ATFs) based on kernel ridge regression with an adaptive kernel is proposed. Most current ATF interpolation methods do not incorporate the acoustic properties for which measurements are performed. Our proposed method is based on a separate adaptation of directional weighting functions to directed and residual reverberations,…

    Submitted 7 March, 2023; originally announced March 2023.

    Comments: To appear in ICASSP 2023

    Journal ref: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  26. arXiv:2302.13652  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech

    Authors: Dong Yang, Tomoki Koriyama, Yuki Saito, Takaaki Saeki, Detai Xin, Hiroshi Saruwatari

    Abstract: Pause insertion, also known as phrase break prediction and phrasing, is an essential part of TTS systems because proper pauses with natural duration significantly enhance the rhythm and intelligibility of synthetic speech. However, conventional phrasing models ignore various speakers' different styles of inserting silent pauses, which can degrade the performance of the model trained on a multi-spe…

    Submitted 27 February, 2023; originally announced February 2023.

    Comments: Accepted by ICASSP2023

  27. arXiv:2301.12596  [pdf, other]

    eess.AS cs.CL

    Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

    Authors: Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource la…

    Submitted 27 May, 2023; v1 submitted 29 January, 2023; originally announced January 2023.

    Comments: To appear in IJCAI 2023

  28. JaCappella Corpus: A Japanese a Cappella Vocal Ensemble Corpus

    Authors: Tomohiko Nakamura, Shinnosuke Takamichi, Naoko Tanji, Satoru Fukayama, Hiroshi Saruwatari

    Abstract: We construct a corpus of Japanese a cappella vocal ensembles (jaCappella corpus) for vocal ensemble separation and synthesis. It consists of 35 copyright-cleared vocal ensemble songs and their audio recordings of individual voice parts. These songs were arranged from out-of-copyright Japanese children's songs and have six voice parts (lead vocal, soprano, alto, tenor, bass, and vocal percussion).…

    Submitted 24 February, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

    Comments: Accepted for ICASSP2023

    Journal ref: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Jun. 2023, 5 pages

  29. arXiv:2211.02336  [pdf, other]

    cs.SD eess.AS

    Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts

    Authors: Detai Xin, Sharath Adavanne, Federico Ang, Ashish Kulkarni, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages multimodal context information of preceding acoustic context and bilateral textual context to improve the prosody of synthetic speech. Previous work either uses unilateral or single-modality context, which does not fully represent the context information. The proposed method uses an acoustic context encoder an…

    Submitted 4 November, 2022; originally announced November 2022.

  30. arXiv:2210.14850  [pdf, other]

    cs.SD eess.AS

    Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection

    Authors: Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari

    Abstract: This paper proposes a method for selecting training data for text-to-speech (TTS) synthesis from dark data. TTS models are typically trained on high-quality speech corpora that cost much time and money for data collection, which makes it very challenging to increase speaker variation. In contrast, there is a large amount of data whose availability is unknown (a.k.a, "dark data"), such as YouTube v…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  31. arXiv:2210.09916  [pdf, other]

    cs.SD eess.AS

    Mid-attribute speaker generation using optimal-transport-based interpolation of Gaussian mixture models

    Authors: Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Detai Xin, Hiroshi Saruwatari

    Abstract: In this paper, we propose a method for intermediating multiple speakers' attributes and diversifying their voice characteristics in "speaker generation," an emerging task that aims to synthesize a nonexistent speaker's naturally sounding voice. The conventional TacoSpawn-based speaker generation method represents the distributions of speaker embeddings by Gaussian mixture models (GMMs) condition…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023. Demo: https://sarulab-speech.github.io/demo_mid-attribute-speaker-generation

  32. arXiv:2210.09815  [pdf, other]

    cs.SD eess.AS

    Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion

    Authors: Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis methods with filled pause (FP) insertion. Spontaneous speech synthesis is aimed at producing speech with human-like disfluencies, such as FPs. Because modeling the complex data distribution of spontaneous speech with a rich FP vocabulary is challenging, the quality of FP-…

    Submitted 19 September, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: Accepted to SSW12

  33. arXiv:2210.09173  [pdf, other]

    cs.SD eess.AS

    Visual onoma-to-wave: environmental sound synthesis from visual onomatopoeias and sound-source images

    Authors: Hien Ohnaka, Shinnosuke Takamichi, Keisuke Imoto, Yuki Okamoto, Kazuki Fujii, Hiroshi Saruwatari

    Abstract: We propose a method for synthesizing environmental sounds from visually represented onomatopoeias and sound sources. An onomatopoeia is a word that imitates a sound structure, i.e., the text representation of sound. From this perspective, onoma-to-wave has been proposed to synthesize environmental sounds from the desired onomatopoeia texts. Onomatopoeias have another representation: visual-text re…

    Submitted 17 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  34. arXiv:2210.07559  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis

    Authors: Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present a comprehensive empirical study for personalized spontaneous speech synthesis on the basis of linguistic knowledge. With the advent of voice cloning for reading-style speech synthesis, a new voice cloning paradigm for human-like and spontaneous speech synthesis is required. We, therefore, focus on personalized spontaneous speech synthesis that can clone both the individual's voice timbr…

    Submitted 19 September, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

    Comments: Accepted to APSIPA ASC 2022

  35. arXiv:2209.12549  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS

    Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech

    Authors: Yusuke Nakai, Yuki Saito, Kenta Udagawa, Hiroshi Saruwatari

    Abstract: We propose a novel training algorithm for a multi-speaker neural text-to-speech (TTS) model based on multi-task adversarial training. A conventional generative adversarial network (GAN)-based training algorithm significantly improves the quality of synthetic speech by reducing the statistical difference between natural and synthetic speech. However, the algorithm does not guarantee the generalizat…

    Submitted 26 September, 2022; originally announced September 2022.

    Comments: 6 pages, 1 figure, Accepted for APSIPA ASC 2022

  36. arXiv:2207.10967  [pdf, other]

    cs.SD eess.AS

    Head-Related Transfer Function Interpolation from Spatially Sparse Measurements Using Autoencoder with Source Position Conditioning

    Authors: Yuki Ito, Tomohiko Nakamura, Shoichi Koyama, Hiroshi Saruwatari

    Abstract: We propose a method of head-related transfer function (HRTF) interpolation from sparsely measured HRTFs using an autoencoder with source position conditioning. The proposed method is drawn from an analogy between an HRTF interpolation method based on regularized linear regression (RLR) and an autoencoder. Through this analogy, we found the key feature of the RLR-based method that HRTFs are decompo…

    Submitted 22 July, 2022; originally announced July 2022.

    Comments: Accepted to International Workshop on Acoustic Signal Enhancement (IWAENC) 2022

  37. arXiv:2207.10937  [pdf, other]

    cs.SD eess.AS

    Physics-informed convolutional neural network with bicubic spline interpolation for sound field estimation

    Authors: Kazuhide Shigemi, Shoichi Koyama, Tomohiko Nakamura, Hiroshi Saruwatari

    Abstract: A sound field estimation method based on a physics-informed convolutional neural network (PICNN) using spline interpolation is proposed. Most of the sound field estimation methods are based on wavefunction expansion, making the estimated function satisfy the Helmholtz equation. However, these methods rely only on physical properties; thus, they suffer from a significant deterioration of accuracy w…

    Submitted 22 July, 2022; originally announced July 2022.

    Comments: Accepted to International Workshop on Acoustic Signal Enhancement (IWAENC) 2022

  38. arXiv:2206.10695  [pdf, other]

    cs.SD eess.AS

    Exploring the Effectiveness of Self-supervised Learning and Classifier Chains in Emotion Recognition of Nonverbal Vocalizations

    Authors: Detai Xin, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present an emotion recognition system for nonverbal vocalizations (NVs) submitted to the ExVo Few-Shot track of the ICML Expressive Vocalizations Competition 2022. The proposed method uses self-supervised learning (SSL) models to extract features from NVs and uses a classifier chain to model the label dependency between emotions. Experimental results demonstrate that the proposed method can sig…

    Submitted 21 June, 2022; originally announced June 2022.

    Comments: Accepted by the ICML Expressive Vocalizations Workshop and Competition 2022

  39. arXiv:2206.10256  [pdf, other]

    cs.SD cs.HC cs.NE eess.AS

    Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS

    Authors: Kenta Udagawa, Yuki Saito, Hiroshi Saruwatari

    Abstract: This paper proposes a human-in-the-loop speaker-adaptation method for multi-speaker text-to-speech. With a conventional speaker-adaptation method, a target speaker's embedding vector is extracted from his/her reference speech using a speaker encoder trained on a speaker-discriminative task. However, this method cannot obtain an embedding vector for the target speaker when the reference speech is u…

    Submitted 21 June, 2022; originally announced June 2022.

    Comments: 5 pages, 3 figures, Accepted for INTERSPEECH2022

  40. arXiv:2206.08039  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History

    Authors: Yuto Nishimura, Yuki Saito, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We propose an end-to-end empathetic dialogue speech synthesis (DSS) model that considers both the linguistic and prosodic contexts of dialogue history. Empathy is the active attempt by humans to get inside the interlocutor in dialogue, and empathetic DSS is a technology to implement this act in spoken dialogue systems. Our model is conditioned by the history of linguistic and prosody features for…

    Submitted 16 June, 2022; originally announced June 2022.

    Comments: 5 pages, 3 figures, Accepted for INTERSPEECH2022

  41. arXiv:2204.10561  [pdf, other]

    cs.SD eess.AS

    Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation

    Authors: Detai Xin, Shinnosuke Takamichi, Takuma Okamoto, Hisashi Kawai, Hiroshi Saruwatari

    Abstract: This paper presents a speaking-rate-controllable HiFi-GAN neural vocoder. Original HiFi-GAN is a high-fidelity, computationally efficient, and tiny-footprint neural vocoder. We attempt to incorporate a speaking rate control function into HiFi-GAN for improving the accessibility of synthetic speech. The proposed method inserts a differentiable interpolation layer into the HiFi-GAN architecture. A s…

    Submitted 22 April, 2022; originally announced April 2022.

    Comments: submitted to INTERSPEECH 2022

  42. arXiv:2204.02152  [pdf, other]

    cs.SD eess.AS

    UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022

    Authors: Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present the UTokyo-SaruLab mean opinion score (MOS) prediction system submitted to VoiceMOS Challenge 2022. The challenge is to predict the MOS values of speech samples collected from previous Blizzard Challenges and Voice Conversion Challenges for two tracks: a main track for in-domain prediction and an out-of-domain (OOD) track for which there is less labeled data from different listening tes…

    Submitted 29 June, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  43. arXiv:2203.14757  [pdf, ps, other]

    cs.SD cs.AI cs.CL cs.HC cs.LG

    STUDIES: Corpus of Japanese Empathetic Dialogue Speech Towards Friendly Voice Agent

    Authors: Yuki Saito, Yuto Nishimura, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We present STUDIES, a new speech corpus for developing a voice agent that can speak in a friendly manner. Humans naturally control their speech prosody to empathize with each other. By incorporating this "empathetic dialogue" behavior into a spoken dialogue system, we can develop a voice agent that can respond to a user more naturally. We designed the STUDIES corpus to include a speaker who speaks…

    Submitted 16 June, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

    Comments: 5 pages, 2 figures, Accepted for INTERSPEECH2022, project page: http://sython.org/Corpus/STUDIES

  44. arXiv:2203.14725  [pdf, other]

    cs.SD

    vTTS: visual-text to speech

    Authors: Yoshifumi Nakano, Takaaki Saeki, Shinnosuke Takamichi, Katsuhito Sudoh, Hiroshi Saruwatari

    Abstract: This paper proposes visual-text to speech (vTTS), a method for synthesizing speech from visual text (i.e., text as an image). Conventional TTS converts phonemes or characters into discrete symbols and synthesizes a speech waveform from them, thus losing the visual features that the characters essentially have. Therefore, our method synthesizes speech not from discrete symbols but from visual text.…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022

  45. arXiv:2203.12937  [pdf, other]

    cs.SD eess.AS

    SelfRemaster: Self-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling

    Authors: Takaaki Saeki, Shinnosuke Takamichi, Tomohiko Nakamura, Naoko Tanji, Hiroshi Saruwatari

    Abstract: We present a self-supervised speech restoration method without paired speech corpora. Because the previous general speech restoration method uses artificial paired data created by applying various distortions to high-quality speech corpora, it cannot sufficiently represent acoustic distortions of real data, limiting the applicability. Our model consists of analysis, synthesis, and channel modules…

    Submitted 27 June, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted to INTERSPEECH 2022

  46. arXiv:2203.09961  [pdf, other]

    cs.SD eess.AS

    Personalized Filled-pause Generation with Group-wise Prediction Models

    Authors: Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: In this paper, we propose a method to generate personalized filled pauses (FPs) with group-wise prediction models. Compared with fluent text generation, disfluent text generation has not been widely explored. To generate more human-like texts, we addressed disfluent text generation. The usage of disfluency, such as FPs, rephrases, and word fragments, differs from speaker to speaker, and thus, the…

    Submitted 22 April, 2022; v1 submitted 18 March, 2022; originally announced March 2022.

    Comments: Accepted to LREC 2022

  47. arXiv:2202.04807  [pdf, ps, other]

    eess.AS cs.SD

    Spatial active noise control based on individual kernel interpolation of primary and secondary sound fields

    Authors: Kazuyuki Arikawa, Shoichi Koyama, Hiroshi Saruwatari

    Abstract: A spatial active noise control (ANC) method based on the individual kernel interpolation of primary and secondary sound fields is proposed. Spatial ANC is aimed at cancelling unwanted primary noise within a continuous region by using multiple secondary sources and microphones. A method based on the kernel interpolation of a sound field makes it possible to attenuate noise over the target region wi…

    Submitted 9 February, 2022; originally announced February 2022.

    Comments: Accepted to International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022

  48. arXiv:2202.00200  [pdf, other]

    cs.SD cs.LG eess.AS

    Differentiable Digital Signal Processing Mixture Model for Synthesis Parameter Extraction from Mixture of Harmonic Sounds

    Authors: Masaya Kawamura, Tomohiko Nakamura, Daichi Kitamura, Hiroshi Saruwatari, Yu Takahashi, Kazunobu Kondo

    Abstract: A differentiable digital signal processing (DDSP) autoencoder is a musical sound synthesizer that combines a deep neural network (DNN) and spectral modeling synthesis. It allows us to flexibly edit sounds by changing the fundamental frequency, timbre feature, and loudness (synthesis parameters) extracted from an input sound. However, it is designed for a monophonic harmonic sound and cannot handle…

    Submitted 31 January, 2022; originally announced February 2022.

    Comments: 5 pages, 2 figures, to appear in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022)

  49. arXiv:2201.10896  [pdf, other]

    cs.SD eess.AS

    J-MAC: Japanese multi-speaker audiobook corpus for speech synthesis

    Authors: Shinnosuke Takamichi, Wataru Nakata, Naoko Tanji, Hiroshi Saruwatari

    Abstract: In this paper, we construct a Japanese audiobook speech corpus called "J-MAC" for speech synthesis research. With the success of reading-style speech synthesis, the research target is shifting to tasks that use complicated contexts. Audiobook speech synthesis is a good example that requires cross-sentence, expressiveness, etc. Unlike reading-style speech, speaker-specific expressiveness in audiobo…

    Submitted 26 January, 2022; originally announced January 2022.

  50. arXiv:2112.06774  [pdf, ps, other]

    cs.SD eess.AS

    Mean-square-error-based secondary source placement in sound field synthesis with prior information on desired field

    Authors: Keisuke Kimura, Shoichi Koyama, Natsuki Ueno, Hiroshi Saruwatari

    Abstract: A method of optimizing secondary source placement in sound field synthesis is proposed. Such an optimization method will be useful when the allowable placement region and available number of loudspeakers are limited. We formulate a mean-square-error-based cost function, incorporating the statistical properties of possible desired sound fields, for general linear-least-squares-based sound field syn…

    Submitted 9 December, 2021; originally announced December 2021.

    Comments: Accepted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2021