
Showing 1–7 of 7 results for author: Seki, S

  1. arXiv:2406.13982  [pdf, other]

    cs.SD eess.AS

    Improved Remixing Process for Domain Adaptation-Based Speech Enhancement by Mitigating Data Imbalance in Signal-to-Noise Ratio

    Authors: Li Li, Shogo Seki

    Abstract: RemixIT and Remixed2Remixed are domain adaptation-based speech enhancement (DASE) methods that use a teacher model trained in full supervision to generate pseudo-paired data by remixing the outputs of the teacher model. The student model for enhancing real-world recorded signals is trained using the pseudo-paired data without ground truth. Since the noisy signals are recorded in natural environmen…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech2024

  2. arXiv:2312.16836  [pdf, other]

    cs.SD eess.AS

    Remixed2Remixed: Domain adaptation for speech enhancement by Noise2Noise learning with Remixing

    Authors: Li Li, Shogo Seki

    Abstract: This paper proposes Remixed2Remixed, a domain adaptation method for speech enhancement, which adopts Noise2Noise (N2N) learning to adapt models trained on artificially generated (out-of-domain: OOD) noisy-clean pair data to better separate real-world recorded (in-domain) noisy data. The proposed method uses a teacher model trained on OOD data to acquire pseudo-in-domain speech and noise signals, w…

    Submitted 28 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP2024

  3. arXiv:2308.07117  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki

    Abstract: The inverse short-time Fourier transform network (iSTFTNet) has garnered attention owing to its fast, lightweight, and high-fidelity speech synthesis. It obtains these characteristics using a fast and lightweight 1D CNN as the backbone and replacing some neural processes with iSTFT. Owing to the difficulty of a 1D CNN to model high-dimensional spectrograms, the frequency dimension is reduced via t…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted to Interspeech 2023. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet2/

  4. arXiv:2303.13909  [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki

    Abstract: In speech synthesis, a generative adversarial network (GAN), training a generator (speech synthesizer) and a discriminator in a min-max game, is widely used to improve speech quality. An ensemble of discriminators is commonly used in recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS) systems (e.g., VITS) to scrutinize waveforms from multiple perspectives. Such discriminato…

    Submitted 24 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/waveunetd/

  5. arXiv:2203.02395  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform

    Authors: Takuhiro Kaneko, Kou Tanaka, Hirokazu Kameoka, Shogo Seki

    Abstract: In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional mel-…

    Submitted 4 March, 2022; originally announced March 2022.

    Comments: Accepted to ICASSP 2022. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet/

  6. arXiv:2010.02977  [pdf, ps, other]

    cs.SD eess.AS

    VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, Shogo Seki

    Abstract: In this paper, we propose a non-parallel any-to-many voice conversion (VC) method termed VoiceGrad. Inspired by WaveGrad, a recently introduced novel waveform generation method, VoiceGrad is based upon the concepts of score matching and Langevin dynamics. It uses weighted denoising score matching to train a score approximator, a fully convolutional network with a U-Net structure designed to predic…

    Submitted 9 March, 2024; v1 submitted 6 October, 2020; originally announced October 2020.

    Comments: For more details on the baseline method used for comparison, please refer to our article in arXiv:2008.12604

  7. arXiv:1810.00223  [pdf, ps, other]

    stat.ML cs.LG cs.SD eess.AS

    Generalized Multichannel Variational Autoencoder for Underdetermined Source Separation

    Authors: Shogo Seki, Hirokazu Kameoka, Li Li, Tomoki Toda, Kazuya Takeda

    Abstract: This paper deals with a multichannel audio source separation problem under underdetermined conditions. Multichannel Non-negative Matrix Factorization (MNMF) is one of the powerful approaches, which adopts the NMF concept for source power spectrogram modeling. This concept is also employed in Independent Low-Rank Matrix Analysis (ILRMA), a special class of the MNMF framework formulated under determined…

    Submitted 29 September, 2018; originally announced October 2018.