Skip to main content

Showing 1–36 of 36 results for author: Cooper, E

Searching in archive eess. Search in all archives.
.
  1. arXiv:2406.08911  [pdf, other

    cs.CL eess.AS

    An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

    Authors: Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi

    Abstract: Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  2. arXiv:2406.08812  [pdf, other

    cs.SD eess.AS

    Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

    Authors: Zhengyang Chen, Xuechen Liu, Erica Cooper, Junichi Yamagishi, Yanmin Qian

    Abstract: This paper proposes a speech synthesis system that allows users to specify and control the acoustic characteristics of a speaker by means of prompts describing the speaker's traits of synthesized speech. Unlike previous approaches, our method utilizes listener impressions to construct prompts, which are easier to collect and align more naturally with everyday descriptions of speaker traits. We ado… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted for presentation at Interspeech 2024 (with more analysis in the final Appendix part)

  3. arXiv:2406.07816  [pdf, other

    eess.AS cs.CL cs.SD

    Spoof Diarization: "What Spoofed When" in Partially Spoofed Audio

    Authors: Lin Zhang, Xin Wang, Erica Cooper, Mireia Diez, Federico Landini, Nicholas Evans, Junichi Yamagishi

    Abstract: This paper defines Spoof Diarization as a novel task in the Partial Spoof (PS) scenario. It aims to determine what spoofed when, which includes not only locating spoof regions but also clustering them according to different spoofing methods. As a pioneering study in spoof diarization, we focus on defining the task, establishing evaluation metrics, and proposing a benchmark model, namely the Counte… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  4. arXiv:2312.15616  [pdf, other

    cs.SD eess.AS stat.ML

    Uncertainty as a Predictor: Leveraging Self-Supervised Learning for Zero-Shot MOS Prediction

    Authors: Aditya Ravuri, Erica Cooper, Junichi Yamagishi

    Abstract: Predicting audio quality in voice synthesis and conversion systems is a critical yet challenging task, especially when traditional methods like Mean Opinion Scores (MOS) are cumbersome to collect at scale. This paper addresses the gap in efficient audio quality prediction, especially in low-resource settings where extensive MOS data from large-scale listening tests may be unavailable. We demonstra… ▽ More

    Submitted 25 December, 2023; originally announced December 2023.

    Comments: 5 pages, 3 figures, sasb draft

  5. arXiv:2312.14398  [pdf, other

    cs.SD eess.AS

    ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

    Authors: Cheng Gong, Xin Wang, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang, Korin Richmond, Junichi Yamagishi

    Abstract: Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. In most cases, TTS systems are built using a single speaker's voice. However, there is growing interest in develo** systems that can synthesize voices… ▽ More

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: 13 pages, 5 figures

  6. arXiv:2312.06055  [pdf, other

    cs.SD eess.AS

    Speaker-Text Retrieval via Contrastive Learning

    Authors: Xuechen Liu, Xin Wang, Erica Cooper, Xiaoxiao Miao, Junichi Yamagishi

    Abstract: In this study, we introduce a novel cross-modal retrieval task involving speaker descriptions and their corresponding audio samples. Utilizing pre-trained speaker and text encoders, we present a simple learning framework based on contrastive learning. Additionally, we explore the impact of incorporating speaker labels into the training process. Our findings establish the effectiveness of linking s… ▽ More

    Submitted 10 December, 2023; originally announced December 2023.

    Comments: Submitted to IEEE Signal Processing Letters

  7. arXiv:2310.05078  [pdf, other

    eess.AS cs.SD

    Partial Rank Similarity Minimization Method for Quality MOS Prediction of Unseen Speech Synthesis Systems in Zero-Shot and Semi-supervised setting

    Authors: Hemant Yadav, Erica Cooper, Junichi Yamagishi, Sunayana Sitaram, Rajiv Ratn Shah

    Abstract: This paper introduces a novel objective function for quality mean opinion score (MOS) prediction of unseen speech synthesis systems. The proposed function measures the similarity of relative positions of predicted MOS values, in a mini-batch, rather than the actual MOS values. That is the partial rank similarity is measured (PRS) rather than the individual MOS values as with the L1 loss. Our exper… ▽ More

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Accepted to ASRU 2023

  8. arXiv:2310.02640  [pdf, other

    eess.AS

    The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains

    Authors: Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi

    Abstract: We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech. This year, we emphasize real-world and challenging zero-shot out-of-domain MOS prediction with three tracks for three different voice evaluation scenarios. Ten teams from industry and academia in seve… ▽ More

    Submitted 6 October, 2023; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Accepted to ASRU 2023

  9. arXiv:2309.07658  [pdf, other

    cs.SD eess.AS

    DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input

    Authors: Nicolas Jonason, Xin Wang, Erica Cooper, Lauri Juvela, Bob L. T. Sturm, Junichi Yamagishi

    Abstract: We explore the use of neural synthesis for acoustic guitar from string-wise MIDI input. We propose four different systems and compare them with both objective metrics and subjective evaluation against natural audio and a sample-based baseline. We iteratively develop these four systems by making various considerations on the architecture and intermediate tasks, such as predicting pitch and loudness… ▽ More

    Submitted 14 September, 2023; originally announced September 2023.

  10. arXiv:2309.06141  [pdf, other

    cs.SD eess.AS

    SynVox2: Towards a privacy-friendly VoxCeleb2 dataset

    Authors: Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi, Nicholas Evans, Massimiliano Todisco, Jean-François Bonastre, Mickael Rouvier

    Abstract: The success of deep learning in speaker recognition relies heavily on the use of large datasets. However, the data-hungry nature of deep learning methods has already being questioned on account the ethical, privacy, and legal concerns that arise when using large-scale datasets of natural speech collected from real human speakers. For example, the widely-used VoxCeleb2 dataset for speaker recogniti… ▽ More

    Submitted 12 September, 2023; originally announced September 2023.

    Comments: conference

  11. arXiv:2306.08850  [pdf, other

    cs.SD eess.AS

    Exploring Isolated Musical Notes as Pre-training Data for Predominant Instrument Recognition in Polyphonic Music

    Authors: Lifan Zhong, Erica Cooper, Junichi Yamagishi, Nobuaki Minematsu

    Abstract: With the growing amount of musical data available, automatic instrument recognition, one of the essential problems in Music Information Retrieval (MIR), is drawing more and more attention. While automatic recognition of single instruments has been well-studied, it remains challenging for polyphonic, multi-instrument musical recordings. This work presents our efforts toward building a robust end-to… ▽ More

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: Submitted to APSIPA 2023

  12. arXiv:2305.18823  [pdf, other

    cs.SD eess.AS

    Speaker anonymization using orthogonal Householder neural network

    Authors: Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi, Natalia Tomashenko

    Abstract: Speaker anonymization aims to conceal a speaker's identity while preserving content information in speech. Current mainstream neural-network speaker anonymization systems disentangle speech into prosody-related, content, and speaker representations. The speaker representation is then anonymized by a selection-based speaker anonymizer that uses a mean vector over a set of randomly selected speaker… ▽ More

    Submitted 12 September, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing

  13. arXiv:2305.17739  [pdf, other

    cs.SD cs.CL eess.AS

    Range-Based Equal Error Rate for Spoof Localization

    Authors: Lin Zhang, Xin Wang, Erica Cooper, Nicholas Evans, Junichi Yamagishi

    Abstract: Spoof localization, also called segment-level detection, is a crucial task that aims to locate spoofs in partially spoofed audio. The equal error rate (EER) is widely used to measure performance for such biometric scenarios. Although EER is the only threshold-free metric, it is usually calculated in a point-based way that uses scores and references with a pre-defined temporal resolution and counts… ▽ More

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  14. arXiv:2305.10940  [pdf, other

    eess.AS

    Improving Generalization Ability of Countermeasures for New Mismatch Scenario by Combining Multiple Advanced Regularization Terms

    Authors: Chang Zeng, Xin Wang, Xiaoxiao Miao, Erica Cooper, Junichi Yamagishi

    Abstract: The ability of countermeasure models to generalize from seen speech synthesis methods to unseen ones has been investigated in the ASVspoof challenge. However, a new mismatch scenario in which fake audio may be generated from real audio with unseen genres has not been studied thoroughly. To this end, we first use five different vocoders to create a new dataset called CN-Spoof based on the CN-Celeb1… ▽ More

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted by interspeech2023

  15. arXiv:2305.10608  [pdf, other

    eess.AS

    Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech

    Authors: Erica Cooper, Junichi Yamagishi

    Abstract: Mean Opinion Score (MOS) is a popular measure for evaluating synthesized speech. However, the scores obtained in MOS tests are heavily dependent upon many contextual factors. One such factor is the overall range of quality of the samples presented in the test -- listeners tend to try to use the entire range of scoring options available to them regardless of this, a phenomenon which is known as ran… ▽ More

    Submitted 6 October, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

    Comments: Proceedings of Interspeech 2023. DOI: 10.21437/Interspeech.2023-1076

  16. arXiv:2211.13868  [pdf, other

    cs.SD eess.AS

    Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?

    Authors: Xuan Shi, Erica Cooper, Xin Wang, Junichi Yamagishi, Shrikanth Narayanan

    Abstract: With the similarity between music and speech synthesis from symbolic input and the rapid development of text-to-speech (TTS) techniques, it is worthwhile to explore ways to improve the MIDI-to-audio performance by borrowing from TTS techniques. In this study, we analyze the shortcomings of a TTS-based MIDI-to-audio system and improve it in terms of feature computation, model selection, and trainin… ▽ More

    Submitted 20 March, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

    Comments: Accepted by ICASSP 2023

  17. arXiv:2209.00485  [pdf, other

    eess.AS cs.SD

    Joint Speaker Encoder and Neural Back-end Model for Fully End-to-End Automatic Speaker Verification with Multiple Enrollment Utterances

    Authors: Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi

    Abstract: Conventional automatic speaker verification systems can usually be decomposed into a front-end model such as time delay neural network (TDNN) for extracting speaker embeddings and a back-end model such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural network-based neural PLDA (NPLDA) for similarity scoring. However, the sequential optimization of the front-end and ba… ▽ More

    Submitted 1 September, 2022; originally announced September 2022.

    Comments: Submitted to TASLP

  18. arXiv:2204.05177  [pdf, other

    eess.AS cs.CR cs.SD

    The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance

    Authors: Lin Zhang, Xin Wang, Erica Cooper, Nicholas Evans, Junichi Yamagishi

    Abstract: Automatic speaker verification is susceptible to various manipulations and spoofing, such as text-to-speech synthesis, voice conversion, replay, tampering, adversarial attacks, and so on. We consider a new spoofing scenario called "Partial Spoof" (PS) in which synthesized or transformed speech segments are embedded into a bona fide utterance. While existing countermeasures (CMs) can detect fully s… ▽ More

    Submitted 30 January, 2023; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (DOI: 10.1109/TASLP.2022.3233236)

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 813-825, 2023

  19. arXiv:2203.11389  [pdf, other

    cs.SD eess.AS

    The VoiceMOS Challenge 2022

    Authors: Wen-Chin Huang, Erica Cooper, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi

    Abstract: We present the first edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthetic speech. This challenge drew 22 participating teams from academia and industry who tried a variety of approaches to tackle the problem of predicting human ratings of synthesized speech. The listening test data for the main tra… ▽ More

    Submitted 3 July, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: Accepted to Interspeech 2022

  20. arXiv:2202.13097  [pdf, ps, other

    cs.SD eess.AS

    Language-Independent Speaker Anonymization Approach using Self-Supervised Pre-Trained Models

    Authors: Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi, Natalia Tomashenko

    Abstract: Speaker anonymization aims to protect the privacy of speakers while preserving spoken linguistic information from speech. Current mainstream neural network speaker anonymization systems are complicated, containing an F0 extractor, speaker encoder, automatic speech recognition acoustic model (ASR AM), speech synthesis acoustic model and speech waveform generation model. Moreover, as an ASR AM is la… ▽ More

    Submitted 27 April, 2022; v1 submitted 26 February, 2022; originally announced February 2022.

  21. arXiv:2110.09103  [pdf, other

    cs.SD cs.CL eess.AS

    LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech

    Authors: Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda

    Abstract: An effective approach to automatically predict the subjective rating for synthetic speech is to train on a listening test dataset with human-annotated scores. Although each speech sample in the dataset is rated by several listeners, most previous works only used the mean score as the training target. In this work, we present LDNet, a unified framework for mean opinion score (MOS) prediction that p… ▽ More

    Submitted 18 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022. Code available at: https://github.com/unilight/LDNet

  22. arXiv:2110.02635  [pdf, other

    eess.AS

    Generalization Ability of MOS Prediction Networks

    Authors: Erica Cooper, Wen-Chin Huang, Tomoki Toda, Junichi Yamagishi

    Abstract: Automatic methods to predict listener opinions of synthesized speech remain elusive since listeners, systems being evaluated, characteristics of the speech, and even the instructions given and the rating scale all vary from test to test. While automatic predictors for metrics such as mean opinion score (MOS) can achieve high prediction accuracy on samples from the same test, they typically fail to… ▽ More

    Submitted 14 February, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: \c{opyright} 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  23. arXiv:2110.01147  [pdf, other

    cs.SD cs.CL eess.AS

    On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis

    Authors: Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass

    Abstract: Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explored several… ▽ More

    Submitted 27 October, 2021; v1 submitted 3 October, 2021; originally announced October 2021.

  24. arXiv:2107.14132  [pdf, other

    cs.SD eess.AS

    Multi-Task Learning in Utterance-Level and Segmental-Level Spoof Detection

    Authors: Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi

    Abstract: In this paper, we provide a series of multi-tasking benchmarks for simultaneously detecting spoofing at the segmental and utterance levels in the PartialSpoof database. First, we propose the SELCNN network, which inserts squeeze-and-excitation (SE) blocks into a light convolutional neural network (LCNN) to enhance the capacity of hidden feature selection. Then, we implement multi-task learning (MT… ▽ More

    Submitted 31 August, 2021; v1 submitted 29 July, 2021; originally announced July 2021.

    Comments: Submitted to ASVspoof 2021 Workshop

  25. arXiv:2107.11506  [pdf, other

    eess.AS cs.SD

    Use of speaker recognition approaches for learning and evaluating embedding representations of musical instrument sounds

    Authors: Xuan Shi, Erica Cooper, Junichi Yamagishi

    Abstract: Constructing an embedding space for musical instrument sounds that can meaningfully represent new and unseen instruments is important for downstream music generation tasks such as multi-instrument synthesis and timbre transfer. The framework of Automatic Speaker Verification (ASV) provides us with architectures and evaluation methodologies for verifying the identities of unseen speakers, and these… ▽ More

    Submitted 24 December, 2021; v1 submitted 23 July, 2021; originally announced July 2021.

    Comments: Accepted by the IEEE/ACM Transactions on Audio, Speech, and Language Processing

  26. arXiv:2105.02373  [pdf, other

    cs.SD eess.AS

    How do Voices from Past Speech Synthesis Challenges Compare Today?

    Authors: Erica Cooper, Junichi Yamagishi

    Abstract: Shared challenges provide a venue for comparing systems trained on common data using a standardized evaluation, and they also provide an invaluable resource for researchers when the data and evaluation results are publicly released. The Blizzard Challenge and Voice Conversion Challenge are two such challenges for text-to-speech synthesis and for speaker conversion, respectively, and their publicly… ▽ More

    Submitted 30 June, 2021; v1 submitted 5 May, 2021; originally announced May 2021.

    Comments: To appear at ISCA Speech Synthesis Workshop 2021

  27. arXiv:2105.01573  [pdf, other

    eess.AS cs.SD

    Exploring Disentanglement with Multilingual and Monolingual VQ-VAE

    Authors: Jennifer Williams, Jason Fong, Erica Cooper, Junichi Yamagishi

    Abstract: This work examines the content and usefulness of disentangled phone and speaker representations from two separately trained VQ-VAE systems: one trained on multilingual data and another trained on monolingual data. We explore the multi- and monolingual models using four small proof-of-concept tasks: copy-synthesis, voice transformation, linguistic code-switching, and content-based privacy masking.… ▽ More

    Submitted 28 June, 2021; v1 submitted 4 May, 2021; originally announced May 2021.

    Comments: Accepted to Speech Synthesis Workshop 2021 (SSW11)

  28. arXiv:2104.12292  [pdf, other

    cs.SD eess.AS

    Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis

    Authors: Erica Cooper, Xin Wang, Junichi Yamagishi

    Abstract: Speech synthesis and music audio generation from symbolic input differ in many aspects but share some similarities. In this study, we investigate how text-to-speech synthesis techniques can be used for piano MIDI-to-audio synthesis tasks. Our investigation includes Tacotron and neural source-filter waveform models as the basic components, with which we build MIDI-to-audio synthesis systems in simi… ▽ More

    Submitted 24 February, 2022; v1 submitted 25 April, 2021; originally announced April 2021.

    Comments: In the proceedings of ISCA Speech Synthesis Workshop 2021

  29. arXiv:2104.02518  [pdf, other

    eess.AS cs.SD

    An Initial Investigation for Detecting Partially Spoofed Audio

    Authors: Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi, Jose Patino, Nicholas Evans

    Abstract: All existing databases of spoofed speech contain attack data that is spoofed in its entirety. In practice, it is entirely plausible that successful attacks can be mounted with utterances that are only partially spoofed. By definition, partially-spoofed utterances contain a mix of both spoofed and bona fide segments, which will likely degrade the performance of countermeasures trained with entirely… ▽ More

    Submitted 15 June, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

    Comments: INTERSPEECH 2021

  30. Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances

    Authors: Chang Zeng, Xin Wang, Erica Cooper, Xiaoxiao Miao, Junichi Yamagishi

    Abstract: Probabilistic linear discriminant analysis (PLDA) or cosine similarity have been widely used in traditional speaker verification systems as back-end techniques to measure pairwise similarities. To make better use of multiple enrollment utterances, we propose a novel attention back-end model, which can be used for both text-independent (TI) and text-dependent (TD) speaker verification, and employ s… ▽ More

    Submitted 5 October, 2021; v1 submitted 4 April, 2021; originally announced April 2021.

  31. arXiv:2010.11549  [pdf, other

    eess.AS cs.SD

    How Similar or Different Is Rakugo Speech Synthesizer to Professional Performers?

    Authors: Shuhei Kato, Yusuke Yasuda, Xin Wang, Erica Cooper, Junichi Yamagishi

    Abstract: We have been working on speech synthesis for rakugo (a traditional Japanese form of verbal entertainment similar to one-person stand-up comedy) toward speech synthesis that authentically entertains audiences. In this paper, we propose a novel evaluation methodology using synthesized rakugo speech and real rakugo speech uttered by professional performers of three different ranks. The naturalness of… ▽ More

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: Submitted to ICASSP 2021

  32. arXiv:2010.10727  [pdf, other

    eess.AS cs.LG cs.SD

    Learning Disentangled Phone and Speaker Representations in a Semi-Supervised VQ-VAE Paradigm

    Authors: Jennifer Williams, Yi Zhao, Erica Cooper, Junichi Yamagishi

    Abstract: We present a new approach to disentangle speaker voice and phone content by introducing new components to the VQ-VAE architecture for speech synthesis. The original VQ-VAE does not generalize well to unseen speakers or content. To alleviate this problem, we have incorporated a speaker encoder and speaker VQ codebook that learns global speaker characteristics entirely separate from the existing sub… ▽ More

    Submitted 10 February, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP 2021

  33. arXiv:2005.07884  [pdf, other

    eess.AS cs.SD

    Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction

    Authors: Yi Zhao, Haoyu Li, Cheng-I Lai, Jennifer Williams, Erica Cooper, Junichi Yamagishi

    Abstract: Vector Quantized Variational AutoEncoders (VQ-VAE) are a powerful representation learning framework that can discover discrete groups of features from a speech signal without supervision. Until now, the VQ-VAE architecture has previously modeled individual types of speech features, such as only phones or only F0. This paper introduces an important extension to VQ-VAE for learning F0-related supras… ▽ More

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  34. arXiv:2005.01245  [pdf, other

    eess.AS

    Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?

    Authors: Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Junichi Yamagishi

    Abstract: Previous work on speaker adaptation for end-to-end speech synthesis still falls short in speaker similarity. We investigate an orthogonal approach to the current speaker adaptation paradigms, speaker augmentation, by creating artificial speakers and by taking advantage of low-quality data. The base Tacotron2 model is modified to account for the channel and dialect factors inherent in these corpora… ▽ More

    Submitted 7 August, 2020; v1 submitted 3 May, 2020; originally announced May 2020.

    Comments: Accepted to Interspeech 2020

  35. arXiv:1911.00137  [pdf, other

    eess.AS

    Modeling of Rakugo Speech and Its Limitations: Toward Speech Synthesis That Entertains Audiences

    Authors: Shuhei Kato, Yusuke Yasuda, Xin Wang, Erica Cooper, Shinji Takaki, Junichi Yamagishi

    Abstract: We have been investigating rakugo speech synthesis as a challenging example of speech synthesis that entertains audiences. Rakugo is a traditional Japanese form of verbal entertainment similar to a combination of one-person stand-up comedy and comic storytelling and is popular even today. In rakugo, a performer plays multiple characters, and conversations or dialogues between the characters make t… ▽ More

    Submitted 1 June, 2020; v1 submitted 31 October, 2019; originally announced November 2019.

    Comments: Resubmitted to IEEE Access

  36. arXiv:1910.10838  [pdf, other

    eess.AS

    Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings

    Authors: Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, Junichi Yamagishi

    Abstract: While speaker adaptation for end-to-end speech synthesis using speaker embeddings can produce good speaker similarity for speakers seen during training, there remains a gap for zero-shot adaptation to unseen speakers. We investigate multi-speaker modeling for end-to-end text-to-speech synthesis and study the effects of different types of state-of-the-art neural speaker embeddings on speaker simila… ▽ More

    Submitted 4 February, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

    Comments: Accepted to ICASSP 2020