
Showing 1–15 of 15 results for author: Hojo, N

  1. arXiv:2306.02273  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    End-to-End Joint Target and Non-Target Speakers ASR

    Authors: Ryo Masumura, Naoki Makishima, Taiga Yamane, Yoshihiko Yamazaki, Saki Mizuno, Mana Ihori, Mihiro Uchida, Keita Suzuki, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando

    Abstract: This paper proposes a novel automatic speech recognition (ASR) system that can transcribe individual speaker's speech while identifying whether they are target or non-target speakers from multi-talker overlapped speech. Target-speaker ASR systems are a promising way to only transcribe a target speaker's speech by enrolling the target speaker's information. However, in conversational ASR applicatio…

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: Accepted at Interspeech 2023

  2. arXiv:2305.14723  [pdf, other]

    eess.AS cs.SD

    Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss

    Authors: Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo

    Abstract: Self-supervised learning (SSL) is the latest breakthrough in speech processing, especially for label-scarce downstream tasks by leveraging massive unlabeled audio data. The noise robustness of the SSL is one of the important challenges to expanding its application. We can use speech enhancement (SE) to tackle this issue. However, the mismatch between the SE model and SSL models potentially limits…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: 4 pages, 2 figures, Accepted to Interspeech 2023

  3. arXiv:2102.12841  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However, owing to their insufficient ability to grasp time-frequency structures, their application is limited to mel-cepstrum conversion and not mel-spectrogram conversion d…

    Submitted 25 February, 2021; originally announced February 2021.

    Comments: Accepted to ICASSP 2021. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/maskcyclegan-vc/index.html

  4. arXiv:2010.11672  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speeches without using a parallel corpus. Recently, cycle-consistent adversarial network (CycleGAN)-VC and CycleGAN-VC2 have shown promising results regarding this problem and have been widely used as benchmark methods. However, owing to the ambiguity of the effectiveness of CycleGAN-VC/VC2 for mel-sp…

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted to Interspeech 2020. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc3/index.html

  5. arXiv:2010.02977  [pdf, ps, other]

    cs.SD eess.AS

    VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, Shogo Seki

    Abstract: In this paper, we propose a non-parallel any-to-many voice conversion (VC) method termed VoiceGrad. Inspired by WaveGrad, a recently introduced novel waveform generation method, VoiceGrad is based upon the concepts of score matching and Langevin dynamics. It uses weighted denoising score matching to train a score approximator, a fully convolutional network with a U-Net structure designed to predic…

    Submitted 9 March, 2024; v1 submitted 6 October, 2020; originally announced October 2020.

    Comments: For more details on the baseline method used for comparison, please refer to our article in arXiv:2008.12604

  6. arXiv:2005.08445  [pdf, ps, other]

    eess.AS cs.SD stat.ML

    Many-to-Many Voice Transformer Network

    Authors: Hirokazu Kameoka, Wen-Chin Huang, Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, Tomoki Toda

    Abstract: This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework, which enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech. We previously proposed an S2S-based VC method using a transformer network architecture called the voice transformer network (VTN). The original VTN was designed to learn only a m…

    Submitted 6 November, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

    Comments: submitted to IEEE/ACM Trans. ASLP. Please also refer to our related article: arXiv:1811.01609

  7. arXiv:1907.12279  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the requirement of learning multiple mappings and the non-availability of explicit supervision. Recently, StarGAN-VC has garnered attention owing to its ability to solve this problem only using a single generator. H…

    Submitted 7 August, 2019; v1 submitted 29 July, 2019; originally announced July 2019.

    Comments: Accepted to Interspeech 2019. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/stargan-vc2/index.html

  8. arXiv:1904.04631  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. This is an important task, but it has been challenging due to the disadvantages of the training conditions. Recently, CycleGAN-VC has provided a breakthrough and performed comparably to a parallel VC method without relying on any extra data, modules, or time ali…

    Submitted 9 April, 2019; originally announced April 2019.

    Comments: Accepted to ICASSP 2019. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc2/index.html

  9. arXiv:1904.02892  [pdf, ps, other]

    cs.SD cs.LG eess.AS stat.ML

    WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation

    Authors: Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Nobukatsu Hojo

    Abstract: WaveCycleGAN has recently been proposed to bridge the gap between natural and synthesized speech waveforms in statistical parametric speech synthesis and provides fast inference with a moving average model rather than an autoregressive model and high-quality speech synthesis with the adversarial training. However, the human ear can still distinguish the processed speech waveforms from natural ones…

    Submitted 8 April, 2019; v1 submitted 5 April, 2019; originally announced April 2019.

    Comments: Submitted to INTERSPEECH2019

  10. arXiv:1811.04076  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

    Authors: Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Nobukatsu Hojo

    Abstract: This paper describes a method based on a sequence-to-sequence learning (Seq2Seq) with attention and context preservation mechanism for voice conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving sequence modeling such as speech synthesis and recognition, machine translation, and image captioning. In contrast to current VC techniques, our method 1) stabilizes and accelerat…

    Submitted 9 November, 2018; originally announced November 2018.

    Comments: Submitted to ICASSP2019

  11. arXiv:1811.01609  [pdf, ps, other]

    cs.SD cs.LG eess.AS stat.ML

    ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion

    Authors: Hirokazu Kameoka, Kou Tanaka, Damian Kwasny, Takuhiro Kaneko, Nobukatsu Hojo

    Abstract: This paper proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is sui…

    Submitted 6 October, 2020; v1 submitted 5 November, 2018; originally announced November 2018.

    Comments: Published in IEEE/ACM Trans. ASLP https://ieeexplore.ieee.org/document/9113442

  12. arXiv:1809.10288  [pdf, ps, other]

    eess.AS cs.LG cs.SD stat.ML

    WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks

    Authors: Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, Hirokazu Kameoka

    Abstract: We propose a learning-based filter that allows us to directly modify a synthetic speech waveform into a natural speech waveform. Speech-processing systems using a vocoder framework such as statistical parametric speech synthesis and voice conversion are convenient especially for a limited number of data because it is possible to represent and process interpretable acoustic features over a compact…

    Submitted 28 September, 2018; v1 submitted 25 September, 2018; originally announced September 2018.

    Comments: SLT2018

  13. arXiv:1808.05092  [pdf, ps, other]

    stat.ML cs.LG cs.SD eess.AS

    ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo

    Abstract: This paper proposes a non-parallel many-to-many voice conversion (VC) method using a variant of the conditional variational autoencoder (VAE) called an auxiliary classifier VAE (ACVAE). The proposed method has three key features. First, it adopts fully convolutional architectures to construct the encoder and decoder networks so that the networks can learn conversion rules that capture time depende…

    Submitted 10 October, 2020; v1 submitted 13 August, 2018; originally announced August 2018.

    Comments: Published in IEEE/ACM Trans. ASLP https://ieeexplore.ieee.org/abstract/document/8718381 Please also refer to our related articles: arXiv:1806.02169, arXiv:2008.12604

  14. arXiv:1806.02169  [pdf, ps, other]

    cs.SD cs.LG eess.AS stat.ML

    StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo

    Abstract: This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it (1) requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training, (2) simultaneously learns many-to-many mappings across dif…

    Submitted 29 June, 2018; v1 submitted 6 June, 2018; originally announced June 2018.

  15. arXiv:1804.02181  [pdf, ps, other]

    eess.SP cs.LG stat.ML

    Generative adversarial network-based approach to signal reconstruction from magnitude spectrograms

    Authors: Keisuke Oyamada, Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, Hiroyasu Ando

    Abstract: In this paper, we address the problem of reconstructing a time-domain signal (or a phase spectrogram) solely from a magnitude spectrogram. Since magnitude spectrograms do not contain phase information, we must restore or infer phase information to reconstruct a time-domain signal. One widely used approach for dealing with the signal reconstruction problem was proposed by Griffin and Lim. This meth…

    Submitted 6 April, 2018; originally announced April 2018.