Showing 1–50 of 85 results for author: Toda, T

Searching in archive cs.
  1. arXiv:2406.06208  [pdf, other]

    cs.SD eess.AS

    Quantifying the effect of speech pathology on automatic and human speaker verification

    Authors: Bence Mark Halpern, Thomas Tienkamp, Wen-Chin Huang, Lester Phillip Violeta, Teja Rebernik, Sebastiaan de Visscher, Max Witjes, Martijn Wieling, Defne Abur, Tomoki Toda

    Abstract: This study investigates how surgical intervention for speech pathology (specifically, as a result of oral cancer surgery) impacts the performance of an automatic speaker verification (ASV) system. Using two recently collected Dutch datasets with parallel pre and post-surgery audio from the same speaker, NKI-OC-VC and SPOKE, we assess the extent to which speech pathology influences ASV performance,…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, 2 tables. Accepted to Interspeech 2024

    ACM Class: I.2.7

  2. arXiv:2406.06201  [pdf, other]

    cs.CV cs.AI

    2DP-2MRC: 2-Dimensional Pointer-based Machine Reading Comprehension Method for Multimodal Moment Retrieval

    Authors: Jiajun He, Tomoki Toda

    Abstract: Moment retrieval aims to locate the most relevant moment in an untrimmed video based on a given natural language query. Existing solutions can be roughly categorized into moment-based and clip-based methods. The former often involves heavy computations, while the latter, due to overlooking coarse-grained information, typically underperforms compared to moment-based models. Hence, this paper propos…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

  3. arXiv:2406.02438  [pdf, other]

    eess.AS cs.MM cs.SD

    CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

    Authors: Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

    Abstract: Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesi…

    Submitted 18 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  4. arXiv:2405.11767  [pdf, other]

    eess.AS cs.CR cs.SD

    Multi-speaker Text-to-speech Training with Speaker Anonymized Data

    Authors: Wen-Chin Huang, Yi-Chiao Wu, Tomoki Toda

    Abstract: The trend of scaling up speech generation models poses a threat of biometric information leakage of the identities of the voices in the training data, raising privacy and security concerns. In this paper, we investigate training multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that tends to hide the speaker identity of the input speech while…

    Submitted 19 May, 2024; originally announced May 2024.

    Comments: 5 pages. Submitted to Signal Processing Letters. Audio sample page: https://unilight.github.io/Publication-Demos/publications/sa-tts-spl/index.html

  5. arXiv:2405.05244  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

    Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

    Abstract: The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specializ…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: Evaluation plan of the SVDD Challenge @ SLT 2024

  6. arXiv:2404.06682  [pdf, other]

    cs.SD eess.AS

    Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment

    Authors: Yuka Hashizume, Li Li, Atsushi Miyashita, Tomoki Toda

    Abstract: To achieve a flexible recommendation and retrieval system, it is desirable to calculate music similarity by focusing on multiple partial elements of musical pieces and allowing the users to select the element they want to focus on. A previous study proposed using multiple individual networks for calculating music similarity based on each instrumental sound, but it is impractical to use each signal…

    Submitted 9 April, 2024; originally announced April 2024.

  7. arXiv:2403.06100  [pdf, other]

    cs.HC cs.CL cs.LG eess.AS stat.ML

    Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment

    Authors: Yusuke Yasuda, Tomoki Toda

    Abstract: A preference-based subjective evaluation is a key method for evaluating generative media reliably. However, its huge combinations of pairs prohibit it from being applied to large-scale evaluation using crowdsourcing. To address this issue, we propose an automatic optimization method for preference-based subjective evaluation in terms of pair combination selections and allocation of evaluation volu…

    Submitted 10 March, 2024; originally announced March 2024.

  8. arXiv:2401.13260  [pdf, other]

    cs.CL cs.MM cs.SD eess.AS

    MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction

    Authors: Jiajun He, Xiaohan Shi, Xingfeng Li, Tomoki Toda

    Abstract: The prevalent approach in speech emotion recognition (SER) involves integrating both audio and textual information to comprehensively identify the speaker's emotion, with the text generally obtained through automatic speech recognition (ASR). An essential issue of this approach is that ASR errors from the text modality can worsen the performance of SER. Previous studies have proposed using an auxi…

    Submitted 28 May, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  9. arXiv:2311.07093  [pdf, other]

    cs.SD cs.CL eess.AS

    On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition

    Authors: Xiaohan Shi, Jiajun He, Xingfeng Li, Tomoki Toda

    Abstract: This paper proposes an efficient attempt to noisy speech emotion recognition (NSER). Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but are limited to non-stationary noises in real-world environments due to their complexity and uncertainty. To overcome this limitation, we introduce a new method for NSER by adop…

    Submitted 14 November, 2023; v1 submitted 13 November, 2023; originally announced November 2023.

    Comments: Submitted to ICASSP 2024

  10. arXiv:2310.05203  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD eess.SP

    A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023

    Authors: Ryuichi Yamamoto, Reo Yoneyama, Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda

    Abstract: This paper presents our systems (denoted as T13) for the singing voice conversion challenge (SVCC) 2023. For both in-domain and cross-domain English singing voice conversion (SVC) tasks (Task 1 and Task 2), we adopt a recognition-synthesis approach with self-supervised learning-based representation. To achieve data-efficient SVC with a limited amount of target singer/speaker's data (150 to 160 utt…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Accepted to ASRU 2023

  11. arXiv:2310.05129  [pdf, other]

    cs.AI

    ED-CEC: Improving Rare Word Recognition Using ASR Postprocessing Based on Error Detection and Context-Aware Error Correction

    Authors: Jiajun He, Zekun Yang, Tomoki Toda

    Abstract: Automatic speech recognition (ASR) systems often encounter difficulties in accurately recognizing rare words, leading to errors that can have a negative impact on downstream tasks such as keyword spotting, intent detection, and text summarization. To address this challenge, we present a novel ASR postprocessing method that focuses on improving the recognition of rare words through error detection…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: 6 pages, 5 figures, conference

  12. arXiv:2310.02570  [pdf, other]

    cs.SD eess.AS

    Improving severity preservation of healthy-to-pathological voice conversion with global style tokens

    Authors: Bence Mark Halpern, Wen-Chin Huang, Lester Phillip Violeta, R. J. J. H. van Son, Tomoki Toda

    Abstract: In healthy-to-pathological voice conversion (H2P-VC), healthy speech is converted into pathological while preserving the identity. The paper improves on previous two-stage approach to H2P-VC where (1) speech is created first with the appropriate severity, (2) then the speaker identity of the voice is converted while preserving the severity of the voice. Specifically, we propose improvements to (2)…

    Submitted 4 October, 2023; originally announced October 2023.

    Comments: 7 pages, 3 figures, 5 tables. Accepted to IEEE Automatic Speech Recognition and Understanding Workshop 2023

    ACM Class: I.2.7

  13. arXiv:2309.09627  [pdf, other]

    cs.SD eess.AS

    Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders

    Authors: Lester Phillip Violeta, Wen-Chin Huang, Ding Ma, Ryuichi Yamamoto, Kazuhiro Kobayashi, Tomoki Toda

    Abstract: We propose a novel framework for electrolaryngeal speech intelligibility enhancement through the use of robust linguistic encoders. Pretraining and fine-tuning approaches have proven to work well in this task, but in most cases, various mismatches, such as the speech type mismatch (electrolaryngeal vs. typical) or a speaker mismatch between the datasets used in each stage, can deteriorate the conv…

    Submitted 20 January, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024. Demo page: lesterphillip.github.io/icassp2024_el_sie

  14. arXiv:2309.08141  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD eess.SP

    Audio Difference Learning for Audio Captioning

    Authors: Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda

    Abstract: This study introduces a novel training paradigm, audio difference learning, for improving audio captioning. The fundamental concept of the proposed learning method is to create a feature representation space that preserves the relationship between audio, enabling the generation of captions that detail intricate audio information. This method employs a reference audio along with the input audio, bo…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  15. arXiv:2309.07598  [pdf, other]

    cs.SD eess.AS

    AAS-VC: On the Generalization Ability of Automatic Alignment Search based Non-autoregressive Sequence-to-sequence Voice Conversion

    Authors: Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda

    Abstract: Non-autoregressive (non-AR) sequence-to-sequence (seq2seq) models for voice conversion (VC) is attractive in its ability to effectively model the temporal structure while enjoying boosted intelligibility and fast inference thanks to non-AR modeling. However, the dependency of current non-AR seq2seq VC models on ground truth durations extracted from an external AR model greatly limits its generaliz…

    Submitted 15 September, 2023; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024. Demo: https://unilight.github.io/Publication-Demos/publications/aas-vc/index.html. Code: https://github.com/unilight/seq2seq-vc

  16. arXiv:2309.02133  [pdf, other]

    cs.SD cs.CL eess.AS

    Evaluating Methods for Ground-Truth-Free Foreign Accent Conversion

    Authors: Wen-Chin Huang, Tomoki Toda

    Abstract: Foreign accent conversion (FAC) is a special application of voice conversion (VC) which aims to convert the accented speech of a non-native speaker to a native-sounding speech with the same speaker identity. FAC is difficult since the native speech from the desired non-native speaker to be used as the training target is impossible to collect. In this work, we evaluate three recently proposed metho…

    Submitted 5 September, 2023; originally announced September 2023.

    Comments: Accepted to the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Demo page: https://unilight.github.io/Publication-Demos/publications/fac-evaluate. Code: https://github.com/unilight/seq2seq-vc

  17. arXiv:2306.14422  [pdf, other]

    cs.SD cs.CL eess.AS

    The Singing Voice Conversion Challenge 2023

    Authors: Wen-Chin Huang, Lester Phillip Violeta, Songxiang Liu, Jiatong Shi, Tomoki Toda

    Abstract: We present the latest iteration of the voice conversion challenge (VCC) series, a bi-annual scientific event aiming to compare and understand different voice conversion (VC) systems based on a common dataset. This year we shifted our focus to singing voice conversion (SVC), thus named the challenge the Singing Voice Conversion Challenge (SVCC). A new database was constructed for two tasks, namely…

    Submitted 6 July, 2023; v1 submitted 26 June, 2023; originally announced June 2023.

  18. arXiv:2306.13953  [pdf, other]

    cs.SD eess.AS

    An Analysis of Personalized Speech Recognition System Development for the Deaf and Hard-of-Hearing

    Authors: Lester Phillip Violeta, Tomoki Toda

    Abstract: Deaf or hard-of-hearing (DHH) speakers typically have atypical speech caused by deafness. With the growing support of speech-based devices and software applications, more work needs to be done to make these devices inclusive to everyone. To do so, we analyze the use of openly-available automatic speech recognition (ASR) tools with a DHH Japanese speaker dataset. As these out-of-the-box ASR models…

    Submitted 24 June, 2023; originally announced June 2023.

    Comments: Submitted to APSIPA 2023

  19. arXiv:2212.08329  [pdf, other]

    eess.AS cs.CL stat.ML

    Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder

    Authors: Yusuke Yasuda, Tomoki Toda

    Abstract: Text-to-speech synthesis (TTS) is a task to convert texts into speech. Two of the factors that have been driving TTS are the advancements of probabilistic models and latent representation learning. We propose a TTS method based on latent variable conversion using a diffusion probabilistic model and the variational autoencoder (VAE). In our TTS method, we use a waveform model based on VAE, a diffus…

    Submitted 16 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  20. Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language

    Authors: Yusuke Yasuda, Tomoki Toda

    Abstract: End-to-end text-to-speech synthesis (TTS) can generate highly natural synthetic speech from raw text. However, rendering the correct pitch accents is still a challenging problem for end-to-end TTS. To tackle the challenge of rendering correct pitch accent in Japanese end-to-end TTS, we adopt PnG BERT, a self-supervised pretrained model in the character and phoneme domain for TTS. We investigate th…

    Submitted 16 December, 2022; originally announced December 2022.

    Journal ref: IEEE Journal of Selected Topics in Signal Processing (Volume: 16, Issue: 6, October 2022)

  21. arXiv:2211.07863  [pdf]

    cs.SD eess.AS

    Music Similarity Calculation of Individual Instrumental Sounds Using Metric Learning

    Authors: Yuka Hashizume, Li Li, Tomoki Toda

    Abstract: The criteria for measuring music similarity are important for developing a flexible music recommendation system. Some data-driven methods have been proposed to calculate music similarity from only music signals, such as metric learning based on a triplet loss using tag information on each musical piece. However, the resulting music similarity metric usually captures the entire piece of music, i.e.…

    Submitted 14 November, 2022; originally announced November 2022.

    Comments: APSIPA ASC 2022 (pp.33--38)

    MSC Class: 68T99

  22. arXiv:2211.01198  [pdf, other]

    eess.AS cs.SD

    Analysis of Noisy-target Training for DNN-based speech enhancement

    Authors: Takuya Fujimura, Tomoki Toda

    Abstract: Deep neural network (DNN)-based speech enhancement usually uses a clean speech as a training target. However, it is hard to collect large amounts of clean speech because the recording is very costly. In other words, the performance of current speech enhancement has been limited by the amount of training data. To relax this limitation, Noisy-target Training (NyTT) that utilizes noisy speech as a tr…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  23. arXiv:2211.01079  [pdf, other]

    cs.SD eess.AS

    Intermediate Fine-Tuning Using Imperfect Synthetic Speech for Improving Electrolaryngeal Speech Recognition

    Authors: Lester Phillip Violeta, Ding Ma, Wen-Chin Huang, Tomoki Toda

    Abstract: Research on automatic speech recognition (ASR) systems for electrolaryngeal speakers has been relatively unexplored due to small datasets. When training data is lacking in ASR, a large-scale pretraining and fine tuning framework is often sufficient to achieve high recognition rates; however, in electrolaryngeal speech, the domain shift between the pretraining and fine-tuning data is too large to o…

    Submitted 30 May, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP 2023

  24. arXiv:2210.15987  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    NNSVS: A Neural Network-Based Singing Voice Synthesis Toolkit

    Authors: Ryuichi Yamamoto, Reo Yoneyama, Tomoki Toda

    Abstract: This paper describes the design of NNSVS, an open-source software for neural network-based singing voice synthesis research. NNSVS is inspired by Sinsy, an open-source pioneer in singing voice synthesis research, and provides many additional features such as multi-stream models, autoregressive fundamental frequency models, and neural vocoders. Furthermore, NNSVS provides extensive documentation an…

    Submitted 1 March, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: Accepted to ICASSP 2023

  25. arXiv:2210.15533  [pdf, other]

    cs.SD cs.LG eess.AS

    Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder

    Authors: Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda

    Abstract: Our previous work, the unified source-filter GAN (uSFGAN) vocoder, introduced a novel architecture based on the source-filter theory into the parallel waveform generative adversarial network to achieve high voice quality and pitch controllability. However, the high temporal resolution inputs result in high computation costs. Although the HiFi-GAN vocoder achieves fast high-fidelity voice generatio…

    Submitted 27 February, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: Accepted to ICASSP 2023

  26. arXiv:2210.10314  [pdf, other]

    cs.SD eess.AS

    Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion

    Authors: Ding Ma, Lester Phillip Violeta, Kazuhiro Kobayashi, Tomoki Toda

    Abstract: Sequence-to-sequence (seq2seq) voice conversion (VC) models have greater potential in converting electrolaryngeal (EL) speech to normal speech (EL2SP) compared to conventional VC models. However, EL2SP based on seq2seq VC requires a sufficiently large amount of parallel data for the model training and it suffers from significant performance degradation when the amount of training data is insuffici…

    Submitted 19 October, 2022; originally announced October 2022.

    Comments: Accepted to SLT 2022

  27. arXiv:2207.13959  [pdf, other]

    cs.DS cs.DM

    ZDD-Based Algorithmic Framework for Solving Shortest Reconfiguration Problems

    Authors: Takehiro Ito, Jun Kawahara, Yu Nakahata, Takehide Soh, Akira Suzuki, Junichi Teruyama, Takahisa Toda

    Abstract: This paper proposes an algorithmic framework for various reconfiguration problems using zero-suppressed binary decision diagrams (ZDDs), a data structure for families of sets. In general, a reconfiguration problem checks if there is a step-by-step transformation between two given feasible solutions (e.g., independent sets of an input graph) of a fixed search problem such that all intermediate resu…

    Submitted 16 December, 2022; v1 submitted 28 July, 2022; originally announced July 2022.

  28. A Cyclical Approach to Synthetic and Natural Speech Mismatch Refinement of Neural Post-filter for Low-cost Text-to-speech System

    Authors: Yi-Chiao Wu, Patrick Lumban Tobing, Kazuki Yasuhara, Noriyuki Matsunaga, Yamato Ohtani, Tomoki Toda

    Abstract: Neural-based text-to-speech (TTS) systems achieve very high-fidelity speech generation because of the rapid neural network developments. However, the huge labeled corpus and high computation cost requirements limit the possibility of developing a high-fidelity TTS system by small companies or individuals. On the other hand, a neural vocoder, which has been widely adopted for the speech generation…

    Submitted 12 July, 2022; originally announced July 2022.

    Comments: 15 pages, 7 figures, 10 tables

    Journal ref: APSIPA Transactions on Signal and Information Processing, Vol 11, Issue 1, 2022

  29. arXiv:2207.04356  [pdf, other]

    cs.SD cs.LG eess.AS

    A Comparative Study of Self-supervised Speech Representation Based Voice Conversion

    Authors: Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, Tomoki Toda

    Abstract: We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC). In the context of recognition-synthesis VC, S3Rs are attractive owing to their potential to replace expensive supervised representations such as phonetic posteriorgrams (PPGs), which are commonly adopted by state-of-the-art VC systems. Using S3PRL-VC, an open-source VC software we…

    Submitted 9 July, 2022; originally announced July 2022.

    Comments: Accepted to IEEE Journal of Selected Topics in Signal Processing. arXiv admin note: substantial text overlap with arXiv:2110.06280

  30. arXiv:2206.15155  [pdf, other]

    cs.SD eess.AS

    An Evaluation of Three-Stage Voice Conversion Framework for Noisy and Reverberant Conditions

    Authors: Yeonjong Choi, Chao Xie, Tomoki Toda

    Abstract: This paper presents a new voice conversion (VC) framework capable of dealing with both additive noise and reverberation, and its performance evaluation. There have been studied some VC researches focusing on real-world circumstances where speech data are interfered with background noise and reverberation. To deal with more practical conditions where no clean target dataset is available, one possib…

    Submitted 30 June, 2022; originally announced June 2022.

    Comments: Accepted to INTERSPEECH 2022

  31. arXiv:2206.05929  [pdf, other]

    cs.SD eess.AS

    Improvement of Serial Approach to Anomalous Sound Detection by Incorporating Two Binary Cross-Entropies for Outlier Exposure

    Authors: Ibuki Kuroyanagi, Tomoki Hayashi, Kazuya Takeda, Tomoki Toda

    Abstract: Anomalous sound detection systems must detect unknown, atypical sounds using only normal audio data. Conventional methods use the serial method, a combination of outlier exposure (OE), which classifies normal and pseudo-anomalous data and obtains embedding, and inlier modeling (IM), which models the probability distribution of the embedding. Although the serial method shows high performance due to…

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: 5 pages, 3 figures, 3 tables, EUSIPCO 2022

  32. arXiv:2205.06053  [pdf, other]

    cs.SD cs.LG eess.AS

    Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation Generation

    Authors: Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda

    Abstract: This paper introduces a unified source-filter network with a harmonic-plus-noise source excitation generation mechanism. In our previous work, we proposed unified Source-Filter GAN (uSFGAN) for developing a high-fidelity neural vocoder with flexible voice controllability using a unified source-filter neural network architecture. However, the capability of uSFGAN to model the aperiodic source excit…

    Submitted 30 June, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: Accepted to INTERSPEECH 2022

  33. arXiv:2203.15431  [pdf, other]

    cs.SD eess.AS

    Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition

    Authors: Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda

    Abstract: We investigate the performance of self-supervised pretraining frameworks on pathological speech datasets used for automatic speech recognition (ASR). Modern end-to-end models require thousands of hours of data to train well, but only a small number of pathological speech datasets are publicly available. A proven solution to this problem is by first pretraining the model on a huge number of healthy…

    Submitted 29 June, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted to INTERSPEECH 2022

  34. arXiv:2203.11389  [pdf, other]

    cs.SD eess.AS

    The VoiceMOS Challenge 2022

    Authors: Wen-Chin Huang, Erica Cooper, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi

    Abstract: We present the first edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthetic speech. This challenge drew 22 participating teams from academia and industry who tried a variety of approaches to tackle the problem of predicting human ratings of synthesized speech. The listening test data for the main tra…

    Submitted 3 July, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: Accepted to Interspeech 2022

  35. arXiv:2111.07116  [pdf, other]

    cs.SD eess.AS

    Direct Noisy Speech Modeling for Noisy-to-Noisy Voice Conversion

    Authors: Chao Xie, Yi-Chiao Wu, Patrick Lumban Tobing, Wen-Chin Huang, Tomoki Toda

    Abstract: Beyond the conventional voice conversion (VC) where the speaker information is converted without altering the linguistic content, the background sounds are informative and need to be retained in some real-world scenarios, such as VC in movie/video and VC in music where the voice is entangled with background sounds. As a new VC framework, we have developed a noisy-to-noisy (N2N) VC framework to con…

    Submitted 13 November, 2021; originally announced November 2021.

  36. arXiv:2111.05691  [pdf, other]

    eess.AS cs.AI cs.SD

    HASA-net: A non-intrusive hearing-aid speech assessment network

    Authors: Hsin-Tien Chiang, Yi-Chiao Wu, Cheng Yu, Tomoki Toda, Hsin-Min Wang, Yih-Chun Hu, Yu Tsao

    Abstract: Without the need of a clean reference, non-intrusive speech assessment methods have caught great attention for objective evaluations. Recently, deep neural network (DNN) models have been applied to build non-intrusive speech assessment approaches and confirmed to provide promising performance. However, most DNN-based approaches are designed for normal-hearing listeners without considering hearing-…

    Submitted 10 November, 2021; originally announced November 2021.

  37. arXiv:2110.09103  [pdf, other]

    cs.SD cs.CL eess.AS

    LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech

    Authors: Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda

    Abstract: An effective approach to automatically predict the subjective rating for synthetic speech is to train on a listening test dataset with human-annotated scores. Although each speech sample in the dataset is rated by several listeners, most previous works only used the mean score as the training target. In this work, we present LDNet, a unified framework for mean opinion score (MOS) prediction that p…

    Submitted 18 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022. Code available at: https://github.com/unilight/LDNet

  38. arXiv:2110.08213  [pdf, other]

    cs.SD cs.CL eess.AS q-bio.QM

    Towards Identity Preserving Normal to Dysarthric Voice Conversion

    Authors: Wen-Chin Huang, Bence Mark Halpern, Lester Phillip Violeta, Odette Scharenborg, Tomoki Toda

    Abstract: We present a voice conversion framework that converts normal speech into dysarthric speech while preserving the speaker identity. Such a framework is essential for (1) clinical decision making processes and alleviation of patient stress, (2) data augmentation for dysarthric speech recognition. This is an especially challenging task since the converted samples should capture the severity of dysarth…

    Submitted 15 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  39. arXiv:2110.06280  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations

    Authors: Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji Watanabe, Tomoki Toda

    Abstract: This paper introduces S3PRL-VC, an open-source voice conversion (VC) framework based on the S3PRL toolkit. In the context of recognition-synthesis VC, self-supervised speech representation (S3R) is valuable in its potential to replace the expensive supervised representation adopted by state-of-the-art VC systems. Moreover, we claim that VC is a good probing task for S3R analysis. In this work, we…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022. Code available at: https://github.com/s3prl/s3prl/tree/master/s3prl/downstream/a2o-vc-vcc2020

  40. arXiv:2109.10608  [pdf, ps, other]

    cs.SD eess.AS

    Noisy-to-Noisy Voice Conversion Framework with Denoising Model

    Authors: Chao Xie, Yi-Chiao Wu, Patrick Lumban Tobing, Wen-Chin Huang, Tomoki Toda

    Abstract: In a conventional voice conversion (VC) framework, a VC model is often trained with a clean dataset consisting of speech data carefully recorded and selected by minimizing background interference. However, collecting such a high-quality dataset is expensive and time-consuming. Leveraging crowd-sourced speech data in training is more economical. Moreover, for some real-world VC scenarios such as VC…

    Submitted 22 September, 2021; originally announced September 2021.

  41. arXiv:2109.03551  [pdf, other]

    cs.SD cs.CL cs.CV eess.AS

    Time Alignment using Lip Images for Frame-based Electrolaryngeal Voice Conversion

    Authors: Yi-Syuan Liou, Wen-Chin Huang, Ming-Chi Yen, Shu-Wei Tsai, Yu-Huai Peng, Tomoki Toda, Yu Tsao, Hsin-Min Wang

    Abstract: Voice conversion (VC) is an effective approach to electrolaryngeal (EL) speech enhancement, a task that aims to improve the quality of the artificial voice from an electrolarynx device. In frame-based VC methods, time alignment needs to be performed prior to model training, and the dynamic time warping (DTW) algorithm is widely adopted to compute the best time alignment between each utterance pair…

    Submitted 8 September, 2021; originally announced September 2021.

    Comments: Accepted to APSIPA ASC 2021

  42. arXiv:2107.09477  [pdf, other]

    cs.SD cs.CL eess.AS

    On Prosody Modeling for ASR+TTS based Voice Conversion

    Authors: Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda

    Abstract: In voice conversion (VC), an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents; these are then used as input by a text-to-speech (TTS) system to generate the converted speech. Such a paradigm, referred to as ASR+TTS, overlooks the…

    Submitted 20 July, 2021; originally announced July 2021.

    Comments: Submitted to ASRU2021. Under review

  43. arXiv:2106.06151  [pdf, other]

    cs.SD cs.LG eess.AS

    Anomalous Sound Detection Using a Binary Classification Model and Class Centroids

    Authors: Ibuki Kuroyanagi, Tomoki Hayashi, Kazuya Takeda, Tomoki Toda

    Abstract: An anomalous sound detection system to detect unknown anomalous sounds usually needs to be built using only normal sound data. Moreover, it is desirable to improve the system by effectively using a small amount of anomalous sound data, which will be accumulated through the system's operation. As one of the methods to meet these requirements, we focus on a binary classification model that is develo…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: 6 pages, 2 figures, 2 tables, EUSIPCO2021

  44. arXiv:2106.01415  [pdf, other]

    cs.SD cs.CL eess.AS

    A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion

    Authors: Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Ching-Feng Liu, Yu Tsao, Hsin-Min Wang, Tomoki Toda

    Abstract: We propose a new paradigm for maintaining speaker identity in dysarthric voice conversion (DVC). The poor quality of dysarthric speech can be greatly improved by statistical VC, but as the normal speech utterances of a dysarthria patient are nearly impossible to collect, previous work failed to recover the individuality of the patient. In light of this, we suggest a novel, two-stage approach for D…

    Submitted 2 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021. 5 pages, 3 figures, 1 table

  45. arXiv:2105.09858  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear Prediction

    Authors: Patrick Lumban Tobing, Tomoki Toda

    Abstract: This paper presents a low-latency real-time (LLRT) non-parallel voice conversion (VC) framework based on cyclic variational autoencoder (CycleVAE) and multiband WaveRNN with data-driven linear prediction (MWDLP). CycleVAE is a robust non-parallel multispeaker spectral model, which utilizes a speaker-independent latent space and a speaker-dependent code to generate reconstructed/converted spectral…

    Submitted 4 July, 2021; v1 submitted 20 May, 2021; originally announced May 2021.

    Comments: Accepted for SSW11

  46. arXiv:2105.09856  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    High-Fidelity and Low-Latency Universal Neural Vocoder based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling

    Authors: Patrick Lumban Tobing, Tomoki Toda

    Abstract: This paper presents a novel high-fidelity and low-latency universal neural vocoder framework based on multiband WaveRNN with data-driven linear prediction for discrete waveform modeling (MWDLP). MWDLP employs a coarse-fine bit WaveRNN architecture for 10-bit mu-law waveform modeling. A sparse gated recurrent unit with a relatively large size of hidden units is utilized, while the multiband modelin…

    Submitted 4 July, 2021; v1 submitted 20 May, 2021; originally announced May 2021.

    Comments: Accepted for INTERSPEECH 2021

  47. arXiv:2105.09579  [pdf, other]

    cs.LG stat.ML

    Aggregate Learning for Mixed Frequency Data

    Authors: Takamichi Toda, Daisuke Moriwaki, Kazuhiro Ota

    Abstract: Large and acute economic shocks such as the 2007-2009 financial crisis and the current COVID-19 infections rapidly change the economic environment. In such a situation, the importance of real-time economic analysis using alternative data is emerging. Alternative data such as search query and location data are closer to real-time and richer than official statistics that are typically released once a…

    Submitted 20 May, 2021; originally announced May 2021.

  48. arXiv:2104.06793  [pdf, other]

    cs.SD cs.CL eess.AS

    Non-autoregressive sequence-to-sequence voice conversion

    Authors: Tomoki Hayashi, Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda

    Abstract: This paper proposes a novel voice conversion (VC) method based on non-autoregressive sequence-to-sequence (NAR-S2S) models. Inspired by the great success of NAR-S2S models such as FastSpeech in text-to-speech (TTS), we extend the FastSpeech2 model for the VC problem. We introduce the convolution-augmented Transformer (Conformer) instead of the Transformer, making it possible to capture both local…

    Submitted 14 April, 2021; originally announced April 2021.

    Comments: Accepted to ICASSP2021. Demo HP: https://kan-bayashi.github.io/NonARSeq2SeqVC/

  49. arXiv:2104.04668  [pdf, other]

    cs.SD cs.LG eess.AS

    Unified Source-Filter GAN: Unified Source-filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN

    Authors: Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda

    Abstract: We propose a unified approach to data-driven source-filter modeling using a single neural network for developing a neural vocoder capable of generating high-quality synthetic speech waveforms while retaining flexibility of the source-filter model to control their voice characteristics. Our proposed network called unified source-filter generative adversarial networks (uSFGAN) is developed by factor…

    Submitted 27 June, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

    Comments: Submitted to INTERSPEECH 2021

  50. arXiv:2104.03009  [pdf, other]

    eess.AS cs.LG cs.SD

    The AS-NU System for the M2VoC Challenge

    Authors: Cheng-Hung Hu, Yi-Chiao Wu, Wen-Chin Huang, Yu-Huai Peng, Yu-Wen Chen, Pin-Jui Ku, Tomoki Toda, Yu Tsao, Hsin-Min Wang

    Abstract: This paper describes the AS-NU systems for two tracks in MultiSpeaker Multi-Style Voice Cloning Challenge (M2VoC). The first track focuses on using a small number of 100 target utterances for voice cloning, while the second track focuses on using only 5 target utterances for voice cloning. Due to the serious lack of data in the second track, we selected the speaker most similar to the target speak…

    Submitted 7 April, 2021; originally announced April 2021.