
Showing 1–34 of 34 results for author: Kameoka, H

  1. arXiv:2403.16464  [pdf, other]

    cs.SD cs.LG eess.AS

    Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka

    Abstract: A generative adversarial network (GAN)-based vocoder trained with an adversarial discriminator is commonly used for speech synthesis because of its fast, lightweight, and high-quality characteristics. However, this data-driven model requires a large amount of training data incurring high data-collection costs. This fact motivates us to train a GAN-based vocoder on limited data. A promising solutio… ▽ More

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Accepted to ICASSP 2024. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/augcondd/

  2. arXiv:2308.07117  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki

    Abstract: The inverse short-time Fourier transform network (iSTFTNet) has garnered attention owing to its fast, lightweight, and high-fidelity speech synthesis. It obtains these characteristics using a fast and lightweight 1D CNN as the backbone and replacing some neural processes with iSTFT. Owing to the difficulty of a 1D CNN to model high-dimensional spectrograms, the frequency dimension is reduced via t… ▽ More

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted to Interspeech 2023. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet2/

  3. arXiv:2303.13909  [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki

    Abstract: In speech synthesis, a generative adversarial network (GAN), training a generator (speech synthesizer) and a discriminator in a min-max game, is widely used to improve speech quality. An ensemble of discriminators is commonly used in recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS) systems (e.g., VITS) to scrutinize waveforms from multiple perspectives. Such discriminato… ▽ More

    Submitted 24 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/waveunetd/

  4. arXiv:2210.11059  [pdf, other]

    eess.AS cs.SD stat.ML

    DisC-VC: Disentangled and F0-Controllable Neural Voice Conversion

    Authors: Chihiro Watanabe, Hirokazu Kameoka

    Abstract: Voice conversion is a task to convert a non-linguistic feature of a given utterance. Since naturalness of speech strongly depends on its pitch pattern, in some applications, it would be desirable to keep the original rise/fall pitch pattern while changing the speaker identity. Some of the existing methods address this problem by either using a source-filter model or developing a neural network tha… ▽ More

    Submitted 20 October, 2022; originally announced October 2022.

  5. arXiv:2206.04780  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    Speak Like a Dog: Human to Non-human creature Voice Conversion

    Authors: Kohei Suzuki, Shoki Sakamoto, Tadahiro Taniguchi, Hirokazu Kameoka

    Abstract: This paper proposes a new voice conversion (VC) task from human speech to dog-like speech while preserving linguistic information as an example of human to non-human creature voice conversion (H2NH-VC) tasks. Although most VC studies deal with human to human VC, H2NH-VC aims to convert human speech into non-human creature-like speech. Non-parallel VC allows us to develop H2NH-VC, because we cannot… ▽ More

    Submitted 9 June, 2022; originally announced June 2022.

    Comments: 5 pages, 4 figures

    Journal ref: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (pp. 1388-1393)

  6. arXiv:2203.02395  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform

    Authors: Takuhiro Kaneko, Kou Tanaka, Hirokazu Kameoka, Shogo Seki

    Abstract: In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional mel-… ▽ More

    Submitted 4 March, 2022; originally announced March 2022.

    Comments: Accepted to ICASSP 2022. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet/

  7. arXiv:2109.13496  [pdf, other]

    cs.SD eess.AS

    FastMVAE2: On improving and accelerating the fast variational autoencoder-based source separation algorithm for determined mixtures

    Authors: Li Li, Hirokazu Kameoka, Shoji Makino

    Abstract: This paper proposes a new source model and training scheme to improve the accuracy and speed of the multichannel variational autoencoder (MVAE) method. The MVAE method is a recently proposed powerful multichannel source separation method. It consists of pretraining a source model represented by a conditional VAE (CVAE) and then estimating separation matrices along with other unknown parameters so… ▽ More

    Submitted 7 September, 2022; v1 submitted 28 September, 2021; originally announced September 2021.

    Comments: Submitted to IEEE/ACM TASLP, under review

  8. StarGAN-VC+ASR: StarGAN-based Non-Parallel Voice Conversion Regularized by Automatic Speech Recognition

    Authors: Shoki Sakamoto, Akira Taniguchi, Tadahiro Taniguchi, Hirokazu Kameoka

    Abstract: Preserving the linguistic content of input speech is essential during voice conversion (VC). The star generative adversarial network-based VC method (StarGAN-VC) is a recently developed method that allows non-parallel many-to-many VC. Although this method is powerful, it can fail to preserve the linguistic content of input speech when the number of available training samples is extremely small. To… ▽ More

    Submitted 9 August, 2021; originally announced August 2021.

    Comments: 5 pages, 6 figures, Accepted to INTERSPEECH 2021

    Journal ref: INTERSPEECH 2021, 1359--1363

  9. arXiv:2104.06900  [pdf, ps, other]

    cs.SD eess.AS

    FastS2S-VC: Streaming Non-Autoregressive Sequence-to-Sequence Voice Conversion

    Authors: Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko

    Abstract: This paper proposes a non-autoregressive extension of our previously proposed sequence-to-sequence (S2S) model-based voice conversion (VC) methods. S2S model-based VC methods have attracted particular attention in recent years for their flexibility in converting not only the voice identity but also the pitch contour and local duration of input speech, thanks to the ability of the encoder-decoder a… ▽ More

    Submitted 14 April, 2021; originally announced April 2021.

  10. arXiv:2104.01807  [pdf, other]

    cs.SD cs.CL eess.AS

    StarGAN-based Emotional Voice Conversion for Japanese Phrases

    Authors: Asuka Moritani, Ryo Ozaki, Shoki Sakamoto, Hirokazu Kameoka, Tadahiro Taniguchi

    Abstract: This paper shows that StarGAN-VC, a spectral envelope transformation method for non-parallel many-to-many voice conversion (VC), is capable of emotional VC (EVC). Although StarGAN-VC has been shown to enable speaker identity conversion, its capability for EVC for Japanese phrases has not been clarified. In this paper, we describe the direct application of StarGAN-VC to an EVC task with minimal fun… ▽ More

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: Submitted to Interspeech 2021

  11. arXiv:2102.12841  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However, owing to their insufficient ability to grasp time-frequency structures, their application is limited to mel-cepstrum conversion and not mel-spectrogram conversion d… ▽ More

    Submitted 25 February, 2021; originally announced February 2021.

    Comments: Accepted to ICASSP 2021. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/maskcyclegan-vc/index.html

  12. arXiv:2010.11672  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speeches without using a parallel corpus. Recently, cycle-consistent adversarial network (CycleGAN)-VC and CycleGAN-VC2 have shown promising results regarding this problem and have been widely used as benchmark methods. However, owing to the ambiguity of the effectiveness of CycleGAN-VC/VC2 for mel-sp… ▽ More

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted to Interspeech 2020. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc3/index.html

  13. arXiv:2010.02977  [pdf, ps, other]

    cs.SD eess.AS

    VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, Shogo Seki

    Abstract: In this paper, we propose a non-parallel any-to-many voice conversion (VC) method termed VoiceGrad. Inspired by WaveGrad, a recently introduced novel waveform generation method, VoiceGrad is based upon the concepts of score matching and Langevin dynamics. It uses weighted denoising score matching to train a score approximator, a fully convolutional network with a U-Net structure designed to predic… ▽ More

    Submitted 9 March, 2024; v1 submitted 6 October, 2020; originally announced October 2020.

    Comments: For more details on the baseline method used for comparison, please refer to our article in arXiv:2008.12604

  14. arXiv:2008.03088  [pdf, other]

    eess.AS cs.CL cs.SD

    Pretraining Techniques for Sequence-to-Sequence Voice Conversion

    Authors: Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda

    Abstract: Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. Nonetheless, without sufficient data, seq2seq VC models can suffer from unstable training and mispronunciation problems in the converted speech, thus far from practical. To tackle these shortcomings, we propose to transfer knowledge from other speech processing tasks where large-sc… ▽ More

    Submitted 7 August, 2020; originally announced August 2020.

    Comments: Preprint. Under review

  15. arXiv:2005.08445  [pdf, ps, other]

    eess.AS cs.SD stat.ML

    Many-to-Many Voice Transformer Network

    Authors: Hirokazu Kameoka, Wen-Chin Huang, Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, Tomoki Toda

    Abstract: This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework, which enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech. We previously proposed an S2S-based VC method using a transformer network architecture called the voice transformer network (VTN). The original VTN was designed to learn only a m… ▽ More

    Submitted 6 November, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

    Comments: submitted to IEEE/ACM Trans. ASLP. Please also refer to our related article: arXiv:1811.01609

  16. arXiv:1912.06813  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining

    Authors: Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda

    Abstract: We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transforme… ▽ More

    Submitted 14 December, 2019; originally announced December 2019.

    Comments: Preprint. Work in progress

  17. arXiv:1911.01601  [pdf, other]

    eess.AS cs.CR cs.SD eess.SP

    ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

    Authors: Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, et al. (15 additional authors not shown)

    Abstract: Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to imperso… ▽ More

    Submitted 14 July, 2020; v1 submitted 4 November, 2019; originally announced November 2019.

    Comments: Accepted, Computer Speech and Language. This manuscript version is made available under the CC-BY-NC-ND 4.0. For the published version on Elsevier website, please visit https://doi.org/10.1016/j.csl.2020.101114

  18. arXiv:1907.12279  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the requirement of learning multiple mappings and the non-availability of explicit supervision. Recently, StarGAN-VC has garnered attention owing to its ability to solve this problem only using a single generator. H… ▽ More

    Submitted 7 August, 2019; v1 submitted 29 July, 2019; originally announced July 2019.

    Comments: Accepted to Interspeech 2019. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/stargan-vc2/index.html

  19. arXiv:1904.04631  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. This is an important task, but it has been challenging due to the disadvantages of the training conditions. Recently, CycleGAN-VC has provided a breakthrough and performed comparably to a parallel VC method without relying on any extra data, modules, or time ali… ▽ More

    Submitted 9 April, 2019; originally announced April 2019.

    Comments: Accepted to ICASSP 2019. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc2/index.html

  20. arXiv:1904.04540  [pdf, ps, other]

    cs.SD stat.ML

    Crossmodal Voice Conversion

    Authors: Hirokazu Kameoka, Kou Tanaka, Aaron Valero Puche, Yasunori Ohishi, Takuhiro Kaneko

    Abstract: Humans are able to imagine a person's voice from the person's appearance and imagine the person's appearance from his/her voice. In this paper, we make the first attempt to develop a method that can convert speech into a voice that matches an input face image and generate a face image that matches the voice of the input speech by leveraging the correlation between faces and voices. We propose a mo… ▽ More

    Submitted 9 April, 2019; originally announced April 2019.

    Comments: Submitted to Interspeech2019

  21. arXiv:1904.02892  [pdf, ps, other]

    cs.SD cs.LG eess.AS stat.ML

    WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation

    Authors: Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Nobukatsu Hojo

    Abstract: WaveCycleGAN has recently been proposed to bridge the gap between natural and synthesized speech waveforms in statistical parametric speech synthesis and provides fast inference with a moving average model rather than an autoregressive model and high-quality speech synthesis with the adversarial training. However, the human ear can still distinguish the processed speech waveforms from natural ones… ▽ More

    Submitted 8 April, 2019; v1 submitted 5 April, 2019; originally announced April 2019.

    Comments: Submitted to INTERSPEECH2019

  22. arXiv:1903.12392  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    Training a Neural Speech Waveform Model using Spectral Losses of Short-Time Fourier Transform and Continuous Wavelet Transform

    Authors: Shinji Takaki, Hirokazu Kameoka, Junichi Yamagishi

    Abstract: Recently, we proposed short-time Fourier transform (STFT)-based loss functions for training a neural speech waveform model. In this paper, we generalize the above framework and propose a training scheme for such models based on spectral amplitude and phase losses obtained by either STFT or continuous wavelet transform (CWT), or both of them. Since CWT is capable of having time and frequency resolu… ▽ More

    Submitted 7 April, 2019; v1 submitted 29 March, 2019; originally announced March 2019.

    Comments: Submitted to Interspeech 2019, Graz, Austria

  23. arXiv:1812.06391  [pdf, other]

    cs.LG stat.ML

    Fast MVAE: Joint separation and classification of mixed sources based on multichannel variational autoencoder with auxiliary classifier

    Authors: Li Li, Hirokazu Kameoka, Shoji Makino

    Abstract: This paper proposes an alternative algorithm for multichannel variational autoencoder (MVAE), a recently proposed multichannel source separation approach. While MVAE is notable in its impressive source separation performance, the convergence-guaranteed optimization algorithm and that it allows us to estimate source-class labels simultaneously with source separation, there are still two major drawb… ▽ More

    Submitted 13 February, 2019; v1 submitted 16 December, 2018; originally announced December 2018.

  24. arXiv:1811.04076  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

    Authors: Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Nobukatsu Hojo

    Abstract: This paper describes a method based on a sequence-to-sequence learning (Seq2Seq) with attention and context preservation mechanism for voice conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving sequence modeling such as speech synthesis and recognition, machine translation, and image captioning. In contrast to current VC techniques, our method 1) stabilizes and accelerat… ▽ More

    Submitted 9 November, 2018; originally announced November 2018.

    Comments: Submitted to ICASSP2019

  25. arXiv:1811.01609  [pdf, ps, other]

    cs.SD cs.LG eess.AS stat.ML

    ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion

    Authors: Hirokazu Kameoka, Kou Tanaka, Damian Kwasny, Takuhiro Kaneko, Nobukatsu Hojo

    Abstract: This paper proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is sui… ▽ More

    Submitted 6 October, 2020; v1 submitted 5 November, 2018; originally announced November 2018.

    Comments: Published in IEEE/ACM Trans. ASLP https://ieeexplore.ieee.org/document/9113442

  26. arXiv:1810.00223  [pdf, ps, other]

    stat.ML cs.LG cs.SD eess.AS

    Generalized Multichannel Variational Autoencoder for Underdetermined Source Separation

    Authors: Shogo Seki, Hirokazu Kameoka, Li Li, Tomoki Toda, Kazuya Takeda

    Abstract: This paper deals with a multichannel audio source separation problem under underdetermined conditions. Multichannel Non-negative Matrix Factorization (MNMF) is one of powerful approaches, which adopts the NMF concept for source power spectrogram modeling. This concept is also employed in Independent Low-Rank Matrix Analysis (ILRMA), a special class of the MNMF framework formulated under determined… ▽ More

    Submitted 29 September, 2018; originally announced October 2018.

  27. arXiv:1809.10288  [pdf, ps, other]

    eess.AS cs.LG cs.SD stat.ML

    WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks

    Authors: Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, Hirokazu Kameoka

    Abstract: We propose a learning-based filter that allows us to directly modify a synthetic speech waveform into a natural speech waveform. Speech-processing systems using a vocoder framework such as statistical parametric speech synthesis and voice conversion are convenient especially for a limited number of data because it is possible to represent and process interpretable acoustic features over a compact… ▽ More

    Submitted 28 September, 2018; v1 submitted 25 September, 2018; originally announced September 2018.

    Comments: SLT2018

  28. arXiv:1808.05092  [pdf, ps, other]

    stat.ML cs.LG cs.SD eess.AS

    ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo

    Abstract: This paper proposes a non-parallel many-to-many voice conversion (VC) method using a variant of the conditional variational autoencoder (VAE) called an auxiliary classifier VAE (ACVAE). The proposed method has three key features. First, it adopts fully convolutional architectures to construct the encoder and decoder networks so that the networks can learn conversion rules that capture time depende… ▽ More

    Submitted 10 October, 2020; v1 submitted 13 August, 2018; originally announced August 2018.

    Comments: Published in IEEE/ACM Trans. ASLP https://ieeexplore.ieee.org/abstract/document/8718381 Please also refer to our related articles: arXiv:1806.02169, arXiv:2008.12604

  29. arXiv:1808.00892  [pdf, ps, other]

    stat.ML cs.LG

    Semi-blind source separation with multichannel variational autoencoder

    Authors: Hirokazu Kameoka, Li Li, Shota Inoue, Shoji Makino

    Abstract: This paper proposes a multichannel source separation technique called the multichannel variational autoencoder (MVAE) method, which uses a conditional VAE (CVAE) to model and estimate the power spectrograms of the sources in a mixture. By training the CVAE using the spectrograms of training examples with source-class labels, we can use the trained decoder distribution as a universal generative mod… ▽ More

    Submitted 26 August, 2018; v1 submitted 2 August, 2018; originally announced August 2018.

  30. arXiv:1806.02169  [pdf, ps, other]

    cs.SD cs.LG eess.AS stat.ML

    StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo

    Abstract: This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it (1) requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training, (2) simultaneously learns many-to-many mappings across dif… ▽ More

    Submitted 29 June, 2018; v1 submitted 6 June, 2018; originally announced June 2018.

  31. arXiv:1804.02181  [pdf, ps, other]

    eess.SP cs.LG stat.ML

    Generative adversarial network-based approach to signal reconstruction from magnitude spectrograms

    Authors: Keisuke Oyamada, Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, Hiroyasu Ando

    Abstract: In this paper, we address the problem of reconstructing a time-domain signal (or a phase spectrogram) solely from a magnitude spectrogram. Since magnitude spectrograms do not contain phase information, we must restore or infer phase information to reconstruct a time-domain signal. One widely used approach for dealing with the signal reconstruction problem was proposed by Griffin and Lim. This meth… ▽ More

    Submitted 6 April, 2018; originally announced April 2018.

  32. arXiv:1804.00920  [pdf, ps, other]

    eess.AS cs.CL cs.SD stat.ML

    Speech waveform synthesis from MFCC sequences with generative adversarial networks

    Authors: Lauri Juvela, Bajibabu Bollepalli, Xin Wang, Hirokazu Kameoka, Manu Airaksinen, Junichi Yamagishi, Paavo Alku

    Abstract: This paper proposes a method for generating speech from filterbank mel frequency cepstral coefficients (MFCC), which are widely used in speech applications, such as ASR, but are generally considered unusable for speech synthesis. First, we predict fundamental frequency and voicing information from MFCCs with an autoregressive recurrent neural net. Second, the spectral envelope information containe… ▽ More

    Submitted 3 April, 2018; originally announced April 2018.

  33. arXiv:1711.11293  [pdf, ps, other]

    stat.ML cs.SD eess.AS

    Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks

    Authors: Takuhiro Kaneko, Hirokazu Kameoka

    Abstract: We propose a parallel-data-free voice-conversion (VC) method that can learn a mapping from source to target speech without relying on parallel data. The proposed method is general purpose, high quality, and parallel-data free and works without any extra data, modules, or alignment procedure. It also avoids over-smoothing, which occurs in many conventional statistical model-based VC methods. Our me… ▽ More

    Submitted 20 December, 2017; v1 submitted 30 November, 2017; originally announced November 2017.

  34. arXiv:1207.3554  [pdf, ps, other]

    cs.CV math.NA stat.ME stat.ML

    Designing various component analysis at will

    Authors: Akisato Kimura, Masashi Sugiyama, Sakano Hitoshi, Hirokazu Kameoka

    Abstract: This paper provides a generic framework of component analysis (CA) methods introducing a new expression for scatter matrices and Gram matrices, called Generalized Pairwise Expression (GPE). This expression is quite compact but highly powerful: The framework includes not only (1) the standard CA methods but also (2) several regularization techniques, (3) weighted extensions, (4) some clustering met… ▽ More

    Submitted 5 October, 2012; v1 submitted 15 July, 2012; originally announced July 2012.

    Comments: Accepted to IAPR International Conference on Pattern Recognition, submitted to IPSJ Transactions on Mathematical Modeling and its Applications (TOM). Just only one-page abstract for new due to novelty violation for journal submission. The details will be disclosed in late September