Showing 1–36 of 36 results for author: Cooper, E

Searching in archive cs.
  1. arXiv:2406.08911  [pdf, other]

    cs.CL eess.AS

    An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

    Authors: Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi

    Abstract: Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  2. arXiv:2406.08812  [pdf, other]

    cs.SD eess.AS

    Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

    Authors: Zhengyang Chen, Xuechen Liu, Erica Cooper, Junichi Yamagishi, Yanmin Qian

    Abstract: This paper proposes a speech synthesis system that allows users to specify and control the acoustic characteristics of a speaker by means of prompts describing the speaker's traits of synthesized speech. Unlike previous approaches, our method utilizes listener impressions to construct prompts, which are easier to collect and align more naturally with everyday descriptions of speaker traits. We ado…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted for presentation at Interspeech 2024 (with more analysis in the final Appendix part)

  3. arXiv:2406.07816  [pdf, other]

    eess.AS cs.CL cs.SD

    Spoof Diarization: "What Spoofed When" in Partially Spoofed Audio

    Authors: Lin Zhang, Xin Wang, Erica Cooper, Mireia Diez, Federico Landini, Nicholas Evans, Junichi Yamagishi

    Abstract: This paper defines Spoof Diarization as a novel task in the Partial Spoof (PS) scenario. It aims to determine what spoofed when, which includes not only locating spoof regions but also clustering them according to different spoofing methods. As a pioneering study in spoof diarization, we focus on defining the task, establishing evaluation metrics, and proposing a benchmark model, namely the Counte…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  4. arXiv:2312.15616  [pdf, other]

    cs.SD eess.AS stat.ML

    Uncertainty as a Predictor: Leveraging Self-Supervised Learning for Zero-Shot MOS Prediction

    Authors: Aditya Ravuri, Erica Cooper, Junichi Yamagishi

    Abstract: Predicting audio quality in voice synthesis and conversion systems is a critical yet challenging task, especially when traditional methods like Mean Opinion Scores (MOS) are cumbersome to collect at scale. This paper addresses the gap in efficient audio quality prediction, especially in low-resource settings where extensive MOS data from large-scale listening tests may be unavailable. We demonstra…

    Submitted 25 December, 2023; originally announced December 2023.

    Comments: 5 pages, 3 figures, sasb draft

  5. arXiv:2312.14398  [pdf, other]

    cs.SD eess.AS

    ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

    Authors: Cheng Gong, Xin Wang, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang, Korin Richmond, Junichi Yamagishi

    Abstract: Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. In most cases, TTS systems are built using a single speaker's voice. However, there is growing interest in developing systems that can synthesize voices…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: 13 pages, 5 figures

  6. arXiv:2312.06055  [pdf, other]

    cs.SD eess.AS

    Speaker-Text Retrieval via Contrastive Learning

    Authors: Xuechen Liu, Xin Wang, Erica Cooper, Xiaoxiao Miao, Junichi Yamagishi

    Abstract: In this study, we introduce a novel cross-modal retrieval task involving speaker descriptions and their corresponding audio samples. Utilizing pre-trained speaker and text encoders, we present a simple learning framework based on contrastive learning. Additionally, we explore the impact of incorporating speaker labels into the training process. Our findings establish the effectiveness of linking s…

    Submitted 10 December, 2023; originally announced December 2023.

    Comments: Submitted to IEEE Signal Processing Letters

  7. arXiv:2310.05078  [pdf, other]

    eess.AS cs.SD

    Partial Rank Similarity Minimization Method for Quality MOS Prediction of Unseen Speech Synthesis Systems in Zero-Shot and Semi-supervised setting

    Authors: Hemant Yadav, Erica Cooper, Junichi Yamagishi, Sunayana Sitaram, Rajiv Ratn Shah

    Abstract: This paper introduces a novel objective function for quality mean opinion score (MOS) prediction of unseen speech synthesis systems. The proposed function measures the similarity of relative positions of predicted MOS values, in a mini-batch, rather than the actual MOS values. That is, the partial rank similarity (PRS) is measured rather than the individual MOS values, as with the L1 loss. Our exper…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Accepted to ASRU 2023

  8. arXiv:2309.07658  [pdf, other]

    cs.SD eess.AS

    DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input

    Authors: Nicolas Jonason, Xin Wang, Erica Cooper, Lauri Juvela, Bob L. T. Sturm, Junichi Yamagishi

    Abstract: We explore the use of neural synthesis for acoustic guitar from string-wise MIDI input. We propose four different systems and compare them with both objective metrics and subjective evaluation against natural audio and a sample-based baseline. We iteratively develop these four systems by making various considerations on the architecture and intermediate tasks, such as predicting pitch and loudness…

    Submitted 14 September, 2023; originally announced September 2023.

  9. arXiv:2309.06141  [pdf, other]

    cs.SD eess.AS

    SynVox2: Towards a privacy-friendly VoxCeleb2 dataset

    Authors: Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi, Nicholas Evans, Massimiliano Todisco, Jean-François Bonastre, Mickael Rouvier

    Abstract: The success of deep learning in speaker recognition relies heavily on the use of large datasets. However, the data-hungry nature of deep learning methods has already been questioned on account of the ethical, privacy, and legal concerns that arise when using large-scale datasets of natural speech collected from real human speakers. For example, the widely-used VoxCeleb2 dataset for speaker recogniti…

    Submitted 12 September, 2023; originally announced September 2023.

    Comments: conference

  10. arXiv:2307.16544  [pdf]

    cs.CL

    Utilisation of open intent recognition models for customer support intent detection

    Authors: Rasheed Mohammad, Oliver Favell, Shariq Shah, Emmett Cooper, Edlira Vakaj

    Abstract: Businesses have sought out new solutions to provide support and improve customer satisfaction as more products and services have become interconnected digitally. There is an inherent need for businesses to provide or outsource fast, efficient and knowledgeable support to remain competitive. Support solutions are also advancing with technologies, including use of social media, Artificial Intelligen…

    Submitted 31 July, 2023; originally announced July 2023.

    Comments: 9 pages, 3 figures, conference

  11. arXiv:2306.08850  [pdf, other]

    cs.SD eess.AS

    Exploring Isolated Musical Notes as Pre-training Data for Predominant Instrument Recognition in Polyphonic Music

    Authors: Lifan Zhong, Erica Cooper, Junichi Yamagishi, Nobuaki Minematsu

    Abstract: With the growing amount of musical data available, automatic instrument recognition, one of the essential problems in Music Information Retrieval (MIR), is drawing more and more attention. While automatic recognition of single instruments has been well-studied, it remains challenging for polyphonic, multi-instrument musical recordings. This work presents our efforts toward building a robust end-to…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: Submitted to APSIPA 2023

  12. arXiv:2305.18823  [pdf, other]

    cs.SD eess.AS

    Speaker anonymization using orthogonal Householder neural network

    Authors: Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi, Natalia Tomashenko

    Abstract: Speaker anonymization aims to conceal a speaker's identity while preserving content information in speech. Current mainstream neural-network speaker anonymization systems disentangle speech into prosody-related, content, and speaker representations. The speaker representation is then anonymized by a selection-based speaker anonymizer that uses a mean vector over a set of randomly selected speaker…

    Submitted 12 September, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing

  13. arXiv:2305.17739  [pdf, other]

    cs.SD cs.CL eess.AS

    Range-Based Equal Error Rate for Spoof Localization

    Authors: Lin Zhang, Xin Wang, Erica Cooper, Nicholas Evans, Junichi Yamagishi

    Abstract: Spoof localization, also called segment-level detection, is a crucial task that aims to locate spoofs in partially spoofed audio. The equal error rate (EER) is widely used to measure performance for such biometric scenarios. Although EER is the only threshold-free metric, it is usually calculated in a point-based way that uses scores and references with a pre-defined temporal resolution and counts…

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  14. arXiv:2305.17601  [pdf, other]

    cs.AI

    Incentivizing honest performative predictions with proper scoring rules

    Authors: Caspar Oesterheld, Johannes Treutlein, Emery Cooper, Rubi Hudson

    Abstract: Proper scoring rules incentivize experts to accurately report beliefs, assuming predictions cannot influence outcomes. We relax this assumption and investigate incentives when predictions are performative, i.e., when they can influence the outcome of the prediction, such as when making public predictions about the stock market. We say a prediction is a fixed point if it accurately reflects the exp…

    Submitted 30 May, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

    Comments: Accepted for the 39th Conference on Uncertainty in Artificial Intelligence (UAI 2023)

  15. arXiv:2302.02462  [pdf, other]

    cs.PL

    The Marriage of Effects and Rewrites

    Authors: Ezra e. k. Cooper

    Abstract: In the research on computational effects, defined algebraically, effect symbols are often expected to obey certain equations. If we orient these equations, we get a rewrite system, which may be an effective way of transforming or optimizing the effects in a program. In order to do so, we need to establish strong normalization, or termination, of the rewrite system. Here we define a framework for c…

    Submitted 5 February, 2023; originally announced February 2023.

    Comments: 15 pages, 2 figures. Submitted to FSCD 2023

    ACM Class: F.4.2

  16. arXiv:2211.13868  [pdf, other]

    cs.SD eess.AS

    Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?

    Authors: Xuan Shi, Erica Cooper, Xin Wang, Junichi Yamagishi, Shrikanth Narayanan

    Abstract: With the similarity between music and speech synthesis from symbolic input and the rapid development of text-to-speech (TTS) techniques, it is worthwhile to explore ways to improve the MIDI-to-audio performance by borrowing from TTS techniques. In this study, we analyze the shortcomings of a TTS-based MIDI-to-audio system and improve it in terms of feature computation, model selection, and trainin…

    Submitted 20 March, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

    Comments: Accepted by ICASSP 2023

  17. arXiv:2209.00485  [pdf, other]

    eess.AS cs.SD

    Joint Speaker Encoder and Neural Back-end Model for Fully End-to-End Automatic Speaker Verification with Multiple Enrollment Utterances

    Authors: Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi

    Abstract: Conventional automatic speaker verification systems can usually be decomposed into a front-end model such as time delay neural network (TDNN) for extracting speaker embeddings and a back-end model such as statistics-based probabilistic linear discriminant analysis (PLDA) or neural network-based neural PLDA (NPLDA) for similarity scoring. However, the sequential optimization of the front-end and ba…

    Submitted 1 September, 2022; originally announced September 2022.

    Comments: Submitted to TASLP

  18. arXiv:2204.05177  [pdf, other]

    eess.AS cs.CR cs.SD

    The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance

    Authors: Lin Zhang, Xin Wang, Erica Cooper, Nicholas Evans, Junichi Yamagishi

    Abstract: Automatic speaker verification is susceptible to various manipulations and spoofing, such as text-to-speech synthesis, voice conversion, replay, tampering, adversarial attacks, and so on. We consider a new spoofing scenario called "Partial Spoof" (PS) in which synthesized or transformed speech segments are embedded into a bona fide utterance. While existing countermeasures (CMs) can detect fully s…

    Submitted 30 January, 2023; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (DOI: 10.1109/TASLP.2022.3233236)

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 813-825, 2023

  19. arXiv:2203.14834  [pdf, other]

    cs.SD

    Analyzing Language-Independent Speaker Anonymization Framework under Unseen Conditions

    Authors: Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi, Natalia Tomashenko

    Abstract: In our previous work, we proposed a language-independent speaker anonymization system based on self-supervised learning models. Although the system can anonymize speech data of any language, the anonymization was imperfect, and the speech content of the anonymized speech was distorted. This limitation is more severe when the input speech is from a domain unseen in the training data. This study ana…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to Interspeech 2022

  20. arXiv:2203.11389  [pdf, other]

    cs.SD eess.AS

    The VoiceMOS Challenge 2022

    Authors: Wen-Chin Huang, Erica Cooper, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi

    Abstract: We present the first edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthetic speech. This challenge drew 22 participating teams from academia and industry who tried a variety of approaches to tackle the problem of predicting human ratings of synthesized speech. The listening test data for the main tra…

    Submitted 3 July, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: Accepted to Interspeech 2022

  21. arXiv:2202.13097  [pdf, ps, other]

    cs.SD eess.AS

    Language-Independent Speaker Anonymization Approach using Self-Supervised Pre-Trained Models

    Authors: Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi, Natalia Tomashenko

    Abstract: Speaker anonymization aims to protect the privacy of speakers while preserving spoken linguistic information from speech. Current mainstream neural network speaker anonymization systems are complicated, containing an F0 extractor, speaker encoder, automatic speech recognition acoustic model (ASR AM), speech synthesis acoustic model and speech waveform generation model. Moreover, as an ASR AM is la…

    Submitted 27 April, 2022; v1 submitted 26 February, 2022; originally announced February 2022.

  22. arXiv:2110.09103  [pdf, other]

    cs.SD cs.CL eess.AS

    LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech

    Authors: Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda

    Abstract: An effective approach to automatically predict the subjective rating for synthetic speech is to train on a listening test dataset with human-annotated scores. Although each speech sample in the dataset is rated by several listeners, most previous works only used the mean score as the training target. In this work, we present LDNet, a unified framework for mean opinion score (MOS) prediction that p…

    Submitted 18 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022. Code available at: https://github.com/unilight/LDNet

  23. arXiv:2110.01147  [pdf, other]

    cs.SD cs.CL eess.AS

    On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis

    Authors: Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass

    Abstract: Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explored several…

    Submitted 27 October, 2021; v1 submitted 3 October, 2021; originally announced October 2021.

  24. arXiv:2107.14132  [pdf, other]

    cs.SD eess.AS

    Multi-Task Learning in Utterance-Level and Segmental-Level Spoof Detection

    Authors: Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi

    Abstract: In this paper, we provide a series of multi-tasking benchmarks for simultaneously detecting spoofing at the segmental and utterance levels in the PartialSpoof database. First, we propose the SELCNN network, which inserts squeeze-and-excitation (SE) blocks into a light convolutional neural network (LCNN) to enhance the capacity of hidden feature selection. Then, we implement multi-task learning (MT…

    Submitted 31 August, 2021; v1 submitted 29 July, 2021; originally announced July 2021.

    Comments: Submitted to ASVspoof 2021 Workshop

  25. arXiv:2107.11506  [pdf, other]

    eess.AS cs.SD

    Use of speaker recognition approaches for learning and evaluating embedding representations of musical instrument sounds

    Authors: Xuan Shi, Erica Cooper, Junichi Yamagishi

    Abstract: Constructing an embedding space for musical instrument sounds that can meaningfully represent new and unseen instruments is important for downstream music generation tasks such as multi-instrument synthesis and timbre transfer. The framework of Automatic Speaker Verification (ASV) provides us with architectures and evaluation methodologies for verifying the identities of unseen speakers, and these…

    Submitted 24 December, 2021; v1 submitted 23 July, 2021; originally announced July 2021.

    Comments: Accepted by the IEEE/ACM Transactions on Audio, Speech, and Language Processing

  26. arXiv:2105.02373  [pdf, other]

    cs.SD eess.AS

    How do Voices from Past Speech Synthesis Challenges Compare Today?

    Authors: Erica Cooper, Junichi Yamagishi

    Abstract: Shared challenges provide a venue for comparing systems trained on common data using a standardized evaluation, and they also provide an invaluable resource for researchers when the data and evaluation results are publicly released. The Blizzard Challenge and Voice Conversion Challenge are two such challenges for text-to-speech synthesis and for speaker conversion, respectively, and their publicly…

    Submitted 30 June, 2021; v1 submitted 5 May, 2021; originally announced May 2021.

    Comments: To appear at ISCA Speech Synthesis Workshop 2021

  27. arXiv:2105.01573  [pdf, other]

    eess.AS cs.SD

    Exploring Disentanglement with Multilingual and Monolingual VQ-VAE

    Authors: Jennifer Williams, Jason Fong, Erica Cooper, Junichi Yamagishi

    Abstract: This work examines the content and usefulness of disentangled phone and speaker representations from two separately trained VQ-VAE systems: one trained on multilingual data and another trained on monolingual data. We explore the multi- and monolingual models using four small proof-of-concept tasks: copy-synthesis, voice transformation, linguistic code-switching, and content-based privacy masking.…

    Submitted 28 June, 2021; v1 submitted 4 May, 2021; originally announced May 2021.

    Comments: Accepted to Speech Synthesis Workshop 2021 (SSW11)

  28. arXiv:2104.12292  [pdf, other]

    cs.SD eess.AS

    Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis

    Authors: Erica Cooper, Xin Wang, Junichi Yamagishi

    Abstract: Speech synthesis and music audio generation from symbolic input differ in many aspects but share some similarities. In this study, we investigate how text-to-speech synthesis techniques can be used for piano MIDI-to-audio synthesis tasks. Our investigation includes Tacotron and neural source-filter waveform models as the basic components, with which we build MIDI-to-audio synthesis systems in simi…

    Submitted 24 February, 2022; v1 submitted 25 April, 2021; originally announced April 2021.

    Comments: In the proceedings of ISCA Speech Synthesis Workshop 2021

  29. arXiv:2104.02518  [pdf, other]

    eess.AS cs.SD

    An Initial Investigation for Detecting Partially Spoofed Audio

    Authors: Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi, Jose Patino, Nicholas Evans

    Abstract: All existing databases of spoofed speech contain attack data that is spoofed in its entirety. In practice, it is entirely plausible that successful attacks can be mounted with utterances that are only partially spoofed. By definition, partially-spoofed utterances contain a mix of both spoofed and bona fide segments, which will likely degrade the performance of countermeasures trained with entirely…

    Submitted 15 June, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

    Comments: INTERSPEECH 2021

  30. Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances

    Authors: Chang Zeng, Xin Wang, Erica Cooper, Xiaoxiao Miao, Junichi Yamagishi

    Abstract: Probabilistic linear discriminant analysis (PLDA) or cosine similarity have been widely used in traditional speaker verification systems as back-end techniques to measure pairwise similarities. To make better use of multiple enrollment utterances, we propose a novel attention back-end model, which can be used for both text-independent (TI) and text-dependent (TD) speaker verification, and employ s…

    Submitted 5 October, 2021; v1 submitted 4 April, 2021; originally announced April 2021.

  31. arXiv:2011.04839  [pdf, other]

    cs.SD cs.CL

    Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis

    Authors: Erica Cooper, Xin Wang, Yi Zhao, Yusuke Yasuda, Junichi Yamagishi

    Abstract: We explore pretraining strategies including choice of base corpus with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine choice of neural vocoder for waveform synthesis, as well as acoustic configurations used for mel spectrograms and final audio output. We find that fine-tuning a multi-speaker model from found audiobook data that has passed a…

    Submitted 9 November, 2020; originally announced November 2020.

    Comments: Technical report

  32. arXiv:2010.11549  [pdf, other]

    eess.AS cs.SD

    How Similar or Different Is Rakugo Speech Synthesizer to Professional Performers?

    Authors: Shuhei Kato, Yusuke Yasuda, Xin Wang, Erica Cooper, Junichi Yamagishi

    Abstract: We have been working on speech synthesis for rakugo (a traditional Japanese form of verbal entertainment similar to one-person stand-up comedy) toward speech synthesis that authentically entertains audiences. In this paper, we propose a novel evaluation methodology using synthesized rakugo speech and real rakugo speech uttered by professional performers of three different ranks. The naturalness of…

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: Submitted to ICASSP 2021

  33. arXiv:2010.10727  [pdf, other]

    eess.AS cs.LG cs.SD

    Learning Disentangled Phone and Speaker Representations in a Semi-Supervised VQ-VAE Paradigm

    Authors: Jennifer Williams, Yi Zhao, Erica Cooper, Junichi Yamagishi

    Abstract: We present a new approach to disentangle speaker voice and phone content by introducing new components to the VQ-VAE architecture for speech synthesis. The original VQ-VAE does not generalize well to unseen speakers or content. To alleviate this problem, we have incorporated a speaker encoder and speaker VQ codebook that learns global speaker characteristics entirely separate from the existing sub…

    Submitted 10 February, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP 2021

  34. arXiv:2010.10694  [pdf, other]

    cs.CL

    An Investigation of the Relation Between Grapheme Embeddings and Pronunciation for Tacotron-based Systems

    Authors: Antoine Perquin, Erica Cooper, Junichi Yamagishi

    Abstract: End-to-end models, particularly Tacotron-based ones, are currently a popular solution for text-to-speech synthesis. They allow the production of high-quality synthesized speech with little to no text preprocessing. Indeed, they can be trained using either graphemes or phonemes as input directly. However, in the case of grapheme inputs, little is known concerning the relation between the underlying…

    Submitted 4 April, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

    Comments: Submitted to Interspeech 2021

  35. arXiv:2005.07884  [pdf, other]

    eess.AS cs.SD

    Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction

    Authors: Yi Zhao, Haoyu Li, Cheng-I Lai, Jennifer Williams, Erica Cooper, Junichi Yamagishi

    Abstract: Vector Quantized Variational AutoEncoders (VQ-VAE) are a powerful representation learning framework that can discover discrete groups of features from a speech signal without supervision. Until now, the VQ-VAE architecture has previously modeled individual types of speech features, such as only phones or only F0. This paper introduces an important extension to VQ-VAE for learning F0-related supras…

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  36. arXiv:1305.4319  [pdf, other]

    q-bio.NC cs.HC

    Multi-command Tactile Brain Computer Interface: A Feasibility Study

    Authors: Hiromu Mori, Yoshihiro Matsumoto, Victor Kryssanov, Eric Cooper, Hitoshi Ogawa, Shoji Makino, Zbigniew R. Struzik, Tomasz M. Rutkowski

    Abstract: The study presented explores the extent to which tactile stimuli delivered to the ten digits of a BCI-naive subject can serve as a platform for a brain computer interface (BCI) that could be used in an interactive application such as robotic vehicle operation. The ten fingertips are used to evoke somatosensory brain responses, thus defining a tactile brain computer interface (tBCI). Experimental r…

    Submitted 18 May, 2013; originally announced May 2013.

    Comments: Haptic and Audio Interaction Design 2013, Daejeon, Korea, April 18-19, 2013, 15 pages, 4 figures, The final publication will be available at link.springer.com