
Showing 1–12 of 12 results for author: Koishida, K

  1. arXiv:2406.02897

    cs.SD eess.AS

    LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes

    Authors: Trung Dang, David Aponte, Dung Tran, Kazuhito Koishida

    Abstract: Prior works have demonstrated zero-shot text-to-speech by using a generative language model on audio tokens obtained via a neural audio codec. It is still challenging, however, to adapt them to low-latency scenarios. In this paper, we present LiveSpeech - a fully autoregressive language model-based approach for zero-shot text-to-speech, enabling low-latency streaming of the output audio. To allow…

    Submitted 10 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

  2. arXiv:2404.01740

    cs.SD cs.AI eess.AS

    Weakly-supervised Audio Separation via Bi-modal Semantic Similarity

    Authors: Tanvir Mahmud, Saeed Amizadeh, Kazuhito Koishida, Diana Marculescu

    Abstract: Conditional sound separation in multi-source audio mixtures without having access to single source sound data during training is a long standing challenge. Existing mix-and-separate based methods suffer from significant performance drop with multi-source training mixtures due to the lack of supervision signal for single source separation cases during training. However, in the case of language-cond…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: Tech report. Accepted in ICLR-2024

  3. arXiv:2403.09579

    cs.SD cs.LG eess.AS

    uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio Mixtures

    Authors: Afrina Tabassum, Dung Tran, Trung Dang, Ismini Lourentzou, Kazuhito Koishida

    Abstract: Masked Autoencoders (MAEs) learn rich low-level representations from unlabeled data but require substantial labeled data to effectively adapt to downstream tasks. Conversely, Instance Discrimination (ID) emphasizes high-level semantics, offering a potential solution to alleviate annotation requirements in MAEs. Although combining these two approaches can address downstream tasks with limited label…

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: 5 pages, 6 figures, 4 tables. To appear in ICASSP'2024

  4. arXiv:2312.17255

    eess.AS cs.LG cs.SD

    Single-channel speech enhancement using learnable loss mixup

    Authors: Oscar Chang, Dung N. Tran, Kazuhito Koishida

    Abstract: Generalization remains a major problem in supervised learning of single-channel speech enhancement. In this work, we propose learnable loss mixup (LLM), a simple and effortless training diagram, to improve the generalization of deep learning-based speech enhancement models. Loss mixup, of which learnable loss mixup is a special variant, optimizes a mixture of the loss functions of random sample pa…

    Submitted 19 December, 2023; originally announced December 2023.

  5. arXiv:2311.00867

    eess.AS cs.CL

    Automatic Disfluency Detection from Untranscribed Speech

    Authors: Amrit Romana, Kazuhito Koishida, Emily Mower Provost

    Abstract: Speech disfluencies, such as filled pauses or repetitions, are disruptions in the typical flow of speech. Stuttering is a speech disorder characterized by a high rate of disfluencies, but all individuals speak with some disfluencies and the rates of disfluencies may be increased by factors such as cognitive load. Clinically, automatic disfluency detection may help in treatment planning for individ…

    Submitted 1 November, 2023; originally announced November 2023.

  6. arXiv:2309.10740

    cs.SD cs.LG cs.MM eess.AS

    ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

    Authors: Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, Somayeh Sojoudi

    Abstract: Diffusion models are instrumental in text-to-audio (TTA) generation. Unfortunately, they suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation. To address this bottleneck, we introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query, thereby accelerating TTA by hundreds of times. We achieve so by pro…

    Submitted 24 June, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

  7. arXiv:2210.14474

    cs.SD cs.LG eess.AS

    SCP-GAN: Self-Correcting Discriminator Optimization for Training Consistency Preserving Metric GAN on Speech Enhancement Tasks

    Authors: Vasily Zadorozhnyy, Qiang Ye, Kazuhito Koishida

    Abstract: In recent years, Generative Adversarial Networks (GANs) have produced significantly improved results in speech enhancement (SE) tasks. They are difficult to train, however. In this work, we introduce several improvements to the GAN training schemes, which can be applied to most GAN-based SE models. We propose using consistency loss functions, which target the inconsistency in time and time-frequen…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: 5 pages (4 - manuscript and 1 - references), 2 figures, 2 tables

  8. arXiv:2112.10950

    eess.AS cs.LG cs.SD

    Augmented Contrastive Self-Supervised Learning for Audio Invariant Representations

    Authors: Melikasadat Emami, Dung Tran, Kazuhito Koishida

    Abstract: Improving generalization is a major challenge in audio classification due to labeled data scarcity. Self-supervised learning (SSL) methods tackle this by leveraging unlabeled data to learn useful features for downstream classification tasks. In this work, we propose an augmented contrastive SSL framework to learn invariant representations from unlabeled data. Our method applies various perturbatio…

    Submitted 20 December, 2021; originally announced December 2021.

    Comments: 4 pages, 4 figures

  9. arXiv:2112.04939

    eess.AS cs.LG cs.SD

    A Training Framework for Stereo-Aware Speech Enhancement using Deep Neural Networks

    Authors: Bahareh Tolooshams, Kazuhito Koishida

    Abstract: Deep learning-based speech enhancement has shown unprecedented performance in recent years. The most popular mono speech enhancement frameworks are end-to-end networks mapping the noisy mixture into an estimate of the clean speech. With growing computational power and availability of multichannel microphone recordings, prior works have aimed to incorporate spatial statistics along with spectral in…

    Submitted 31 January, 2022; v1 submitted 9 December, 2021; originally announced December 2021.

    Comments: Accepted to the IEEE 47th International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

  10. arXiv:2112.04424

    cs.SD cs.LG eess.AS

    Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features

    Authors: Trung Dang, Dung Tran, Peter Chin, Kazuhito Koishida

    Abstract: Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristic of an utterance to match an unseen target speaker without relying on parallel training data. Recently, self-supervised learning of speech representation has been shown to produce useful linguistic units without using transcripts, which can be directly passed to a VC model. In this paper, we showed that high-qual…

    Submitted 10 February, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

  11. arXiv:2101.01902

    cs.SD cs.LG eess.AS

    Interspeech 2021 Deep Noise Suppression Challenge

    Authors: Chandan K A Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, Sriram Srinivasan

    Abstract: The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. We recently organized a DNS challenge special session at INTERSPEECH and ICASSP 2020. We open-sourced training and test datasets for the wideband scenario. We also open-sourced a subjective evaluation framework based on ITU-T standard P.808, wh…

    Submitted 4 April, 2021; v1 submitted 6 January, 2021; originally announced January 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2009.06122

  12. arXiv:1908.01399

    eess.AS cs.LG cs.SD

    Sound Event Detection in Multichannel Audio using Convolutional Time-Frequency-Channel Squeeze and Excitation

    Authors: Wei Xia, Kazuhito Koishida

    Abstract: In this study, we introduce a convolutional time-frequency-channel "Squeeze and Excitation" (tfc-SE) module to explicitly model inter-dependencies between the time-frequency domain and multiple channels. The tfc-SE module consists of two parts: tf-SE block and c-SE block which are designed to provide attention on time-frequency and channel domain, respectively, for adaptively recalibrating the inp…

    Submitted 4 August, 2019; originally announced August 2019.

    Comments: Accepted by Interspeech 2019