
Showing 1–14 of 14 results for author: Takaki, S

  1. arXiv:2211.11222  [pdf, other]

    eess.AS cs.CL cs.SD

    Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System

    Authors: Takenori Yoshimura, Shinji Takaki, Kazuhiro Nakamura, Keiichiro Oura, Yukiya Hono, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda

    Abstract: This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system towards end-to-end controllable speech synthesis. Since the mel-cepstral synthesis filter is explicitly embedded in neural waveform models in the proposed system, both voice characteristics and the pitch of synthesized speech are highly controlled via a frequency warping parameter and fundame…

    Submitted 21 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  2. arXiv:2102.07786  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    PeriodNet: A non-autoregressive waveform generation model with a structure separating periodic and aperiodic components

    Authors: Yukiya Hono, Shinji Takaki, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

    Abstract: We propose PeriodNet, a non-autoregressive (non-AR) waveform generation model with a new model structure for modeling periodic and aperiodic components in speech waveforms. The non-AR waveform generation models can generate speech waveforms parallelly and can be used as a speech vocoder by conditioning an acoustic feature. Since a speech waveform contains periodic and aperiodic components, both co…

    Submitted 15 February, 2021; originally announced February 2021.

    Comments: 5 pages, accepted to ICASSP 2021

  3. arXiv:1911.03952  [pdf, other]

    cs.SD eess.AS

    Transformation of low-quality device-recorded speech to high-quality speech using improved SEGAN model

    Authors: Seyyed Saeed Sarfjoo, Xin Wang, Gustav Eje Henter, Jaime Lorenzo-Trueba, Shinji Takaki, Junichi Yamagishi

    Abstract: Nowadays vast amounts of speech data are recorded from low-quality recorder devices such as smartphones, tablets, laptops, and medium-quality microphones. The objective of this research was to study the automatic generation of high-quality speech from such low-quality device-recorded speech, which could then be applied to many speech-generation tasks. In this paper, we first introduce our new devi…

    Submitted 20 November, 2019; v1 submitted 10 November, 2019; originally announced November 2019.

    Comments: This study was conducted during an internship of the first author at NII, Japan in 2017

  4. arXiv:1910.11690  [pdf, other]

    eess.AS cs.LG cs.SD

    Fast and High-Quality Singing Voice Synthesis System based on Convolutional Neural Networks

    Authors: Kazuhiro Nakamura, Shinji Takaki, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

    Abstract: The present paper describes singing voice synthesis based on convolutional neural networks (CNNs). Singing voice synthesis systems based on deep neural networks (DNNs) are currently being proposed and are improving the naturalness of synthesized singing voices. As singing voices represent a rich form of expression, a powerful technique to model them accurately is required. In the proposed techniqu…

    Submitted 21 April, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

    Comments: Accepted to ICASSP 2020. Singing voice samples (Japanese, English, Chinese): https://www.techno-speech.com/news-20181214a-en. arXiv admin note: substantial text overlap with arXiv:1904.06868

  5. arXiv:1904.12088  [pdf, other]

    eess.AS cs.SD stat.ML

    Neural source-filter waveform models for statistical parametric speech synthesis

    Authors: Xin Wang, Shinji Takaki, Junichi Yamagishi

    Abstract: Neural waveform models such as WaveNet have demonstrated better performance than conventional vocoders for statistical parametric speech synthesis. As an autoregressive (AR) model, WaveNet is limited by a slow sequential waveform generation process. Some new models that use the inverse-autoregressive flow (IAF) can generate a whole waveform in a one-shot manner. However, these IAF-based models req…

    Submitted 17 November, 2019; v1 submitted 26 April, 2019; originally announced April 2019.

    Comments: Accepted to IEEE/ACM TASLP. Note: this paper is a follow-up to our ICASSP paper. Based on the h-NSF introduced in this work, we proposed an h-sinc-NSF model and published a third paper at SSW 10 (https://www.isca-speech.org/archive/SSW_2019/pdfs/SSW10_O_1-1.pdf)

  6. arXiv:1903.12392  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    Training a Neural Speech Waveform Model using Spectral Losses of Short-Time Fourier Transform and Continuous Wavelet Transform

    Authors: Shinji Takaki, Hirokazu Kameoka, Junichi Yamagishi

    Abstract: Recently, we proposed short-time Fourier transform (STFT)-based loss functions for training a neural speech waveform model. In this paper, we generalize the above framework and propose a training scheme for such models based on spectral amplitude and phase losses obtained by either STFT or continuous wavelet transform (CWT), or both of them. Since CWT is capable of having time and frequency resolu…

    Submitted 7 April, 2019; v1 submitted 29 March, 2019; originally announced March 2019.

    Comments: Submitted to Interspeech 2019, Graz, Austria

  7. arXiv:1903.12316  [pdf, other]

    eess.AS cs.SD

    Does the Lombard Effect Improve Emotional Communication in Noise? - Analysis of Emotional Speech Acted in Noise -

    Authors: Yi Zhao, Atsushi Ando, Shinji Takaki, Junichi Yamagishi, Satoshi Kobashikawa

    Abstract: Speakers usually adjust their way of talking in noisy environments involuntarily for effective communication. This adaptation is known as the Lombard effect. Although speech accompanying the Lombard effect can improve the intelligibility of a speaker's voice, the changes in acoustic features (e.g. fundamental frequency, speech intensity, and spectral tilt) caused by the Lombard effect may also aff…

    Submitted 9 April, 2019; v1 submitted 28 March, 2019; originally announced March 2019.

    Comments: Submitted to Interspeech 2019, Graz, Austria

  8. arXiv:1810.11960  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language

    Authors: Yusuke Yasuda, Xin Wang, Shinji Takaki, Junichi Yamagishi

    Abstract: End-to-end speech synthesis is a promising approach that directly converts raw text to speech. Although it was shown that Tacotron2 outperforms classical pipeline systems with regards to naturalness in English, its applicability to other languages is still unknown. Japanese could be one of the most difficult languages for which to achieve end-to-end speech synthesis, largely due to its character d…

    Submitted 14 February, 2019; v1 submitted 29 October, 2018; originally announced October 2018.

    Comments: To appear at ICASSP 2019

  9. arXiv:1810.11946  [pdf, other]

    eess.AS cs.SD stat.ML

    Neural source-filter-based waveform model for statistical parametric speech synthesis

    Authors: Xin Wang, Shinji Takaki, Junichi Yamagishi

    Abstract: Neural waveform models such as the WaveNet are used in many recent text-to-speech systems, but the original WaveNet is quite slow in waveform generation because of its autoregressive (AR) structure. Although faster non-AR models were recently reported, they may be prohibitively complicated due to the use of a distilling training method and the blend of other disparate training criteria. This study…

    Submitted 26 April, 2019; v1 submitted 29 October, 2018; originally announced October 2018.

    Comments: Submitted to ICASSP 2019

  10. arXiv:1810.11945  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    STFT spectral loss for training a neural speech waveform model

    Authors: Shinji Takaki, Toru Nakashika, Xin Wang, Junichi Yamagishi

    Abstract: This paper proposes a new loss using short-time Fourier transform (STFT) spectra for the aim of training a high-performance neural speech waveform model that predicts raw continuous speech waveform samples directly. Not only amplitude spectra but also phase spectra obtained from generated speech waveforms are used to calculate the proposed loss. We also mathematically show that training of the wav…

    Submitted 30 October, 2018; v1 submitted 29 October, 2018; originally announced October 2018.

    Comments: Submitted to the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  11. arXiv:1807.11679  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder

    Authors: Yi Zhao, Shinji Takaki, Hieu-Thi Luong, Junichi Yamagishi, Daisuke Saito, Nobuaki Minematsu

    Abstract: Recent neural networks such as WaveNet and sampleRNN that learn directly from speech waveform samples have achieved very high-quality synthetic speech in terms of both naturalness and speaker similarity even in multi-speaker text-to-speech synthesis systems. Such neural networks are being used as an alternative to vocoders and hence they are often called neural vocoders. The neural vocoder uses ac…

    Submitted 31 July, 2018; originally announced July 2018.

  12. arXiv:1804.02549  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis

    Authors: Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, Junichi Yamagishi

    Abstract: Recent advances in speech synthesis suggest that limitations such as the lossy nature of the amplitude spectrum with minimum phase approximation and the over-smoothing effect in acoustic modeling can be overcome by using advanced machine learning approaches. In this paper, we build a framework in which we can fairly compare new vocoding and acoustic modeling techniques with conventional approaches…

    Submitted 7 April, 2018; originally announced April 2018.

    Comments: To appear in ICASSP 2018

  13. arXiv:1803.09946  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Complex-Valued Restricted Boltzmann Machine for Direct Speech Parameterization from Complex Spectra

    Authors: Toru Nakashika, Shinji Takaki, Junichi Yamagishi

    Abstract: This paper describes a novel energy-based probabilistic distribution that represents complex-valued data and explains how to apply it to direct feature extraction from complex-valued spectra. The proposed model, the complex-valued restricted Boltzmann machine (CRBM), is designed to deal with complex-valued visible units as an extension of the well-known restricted Boltzmann machine (RBM). Like the…

    Submitted 27 March, 2018; originally announced March 2018.

    Comments: Under the IEEE T-ASLP Review

  14. arXiv:1506.05268  [pdf, other]

    cs.SD cs.LG

    Deep Denoising Auto-encoder for Statistical Speech Synthesis

    Authors: Zhenzhou Wu, Shinji Takaki, Junichi Yamagishi

    Abstract: This paper proposes a deep denoising auto-encoder technique to extract better acoustic features for speech synthesis. The technique allows us to automatically extract low-dimensional features from high dimensional spectral features in a non-linear, data-driven, unsupervised way. We compared the new stochastic feature extractor with conventional mel-cepstral analysis in analysis-by-synthesis and te…

    Submitted 17 June, 2015; originally announced June 2015.