Showing 51–100 of 163 results for author: Yamagishi, J

  1. arXiv:2201.03321  [pdf, other]

    eess.AS cs.CR cs.SD

    A Practical Guide to Logical Access Voice Presentation Attack Detection

    Authors: Xin Wang, Junichi Yamagishi

    Abstract: Voice-based human-machine interfaces with an automatic speaker verification (ASV) component are commonly used in the market. However, the threat from presentation attacks is also growing since attackers can use recent speech synthesis technology to produce a natural-sounding voice of a victim. Presentation attack detection (PAD) for ASV, or speech anti-spoofing, is therefore indispensable. Researc…

    Submitted 10 January, 2022; originally announced January 2022.

    Comments: This work will appear as a chapter in a new book, Frontiers in Fake Media Generation and Detection, edited by Mahdi Khosravy, Isao Echizen, Noboru Babaguchi. The code for this chapter is available at https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts

  2. arXiv:2111.12888  [pdf, other]

    cs.CV cs.AI

    Effectiveness of Detection-based and Regression-based Approaches for Estimating Mask-Wearing Ratio

    Authors: Khanh-Duy Nguyen, Huy H. Nguyen, Trung-Nghia Le, Junichi Yamagishi, Isao Echizen

    Abstract: Estimating the mask-wearing ratio in public places is important as it enables health authorities to promptly analyze and implement policies. Methods for estimating the mask-wearing ratio on the basis of image analysis have been reported. However, there is still a lack of comprehensive research on both methodologies and datasets. Most recent reports straightforwardly propose estimating the ratio by…

    Submitted 3 December, 2021; v1 submitted 24 November, 2021; originally announced November 2021.

  3. arXiv:2111.07725  [pdf, other]

    eess.AS cs.SD

    Investigating self-supervised front ends for speech spoofing countermeasures

    Authors: Xin Wang, Junichi Yamagishi

    Abstract: Self-supervised speech modeling is a rapidly progressing research topic, and many pre-trained models have been released and used in various downstream tasks. For speech anti-spoofing, most countermeasures (CMs) use signal processing algorithms to extract acoustic features for classification. In this study, we use pre-trained self-supervised speech models as the front end of spoofing CMs. We investigat…

    Submitted 4 February, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

    Comments: V3: added sub-band analysis, submitted to ISCA Odyssey2022; V2: added min tDCF results on 2019 and 2021 LA. EERs on LA 2021 were slightly updated to fix one glitch in the score file. EERs and min tDCFs on 2021 LA and DF can be computed using the latest official code https://github.com/asvspoof-challenge/2021. Work in progress. Feedback is welcome!

  4. arXiv:2110.09103  [pdf, other]

    cs.SD cs.CL eess.AS

    LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech

    Authors: Wen-Chin Huang, Erica Cooper, Junichi Yamagishi, Tomoki Toda

    Abstract: An effective approach to automatically predict the subjective rating for synthetic speech is to train on a listening test dataset with human-annotated scores. Although each speech sample in the dataset is rated by several listeners, most previous works only used the mean score as the training target. In this work, we present LDNet, a unified framework for mean opinion score (MOS) prediction that p…

    Submitted 18 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022. Code available at: https://github.com/unilight/LDNet

  5. arXiv:2110.06760  [pdf, other]

    eess.AS

    Revisiting Speech Content Privacy

    Authors: Jennifer Williams, Junichi Yamagishi, Paul-Gauthier Noe, Cassia Valentini Botinhao, Jean-Francois Bonastre

    Abstract: In this paper, we discuss an important aspect of speech privacy: protecting spoken content. New capabilities from the field of machine learning provide a unique and timely opportunity to revisit speech content protection. There are many different applications of content privacy, even though this area has been under-explored in speech technology research. This paper presents several scenarios that…

    Submitted 13 October, 2021; originally announced October 2021.

    Comments: Accepted to ISCA Security and Privacy in Speech Communication (1st SPSC Symposium)

  6. arXiv:2110.04946  [pdf, other]

    cs.SD cs.LG eess.AS

    LaughNet: synthesizing laughter utterances from waveform silhouettes and a single laughter example

    Authors: Hieu-Thi Luong, Junichi Yamagishi

    Abstract: Emotional and controllable speech synthesis is a topic that has received much attention. However, most studies focused on improving the expressiveness and controllability in the context of linguistic content, even though natural verbal human communication is inseparable from spontaneous non-speech expressions such as laughter, crying, or grunting. We propose a model called LaughNet for synthesizin…

    Submitted 25 January, 2022; v1 submitted 10 October, 2021; originally announced October 2021.

  7. arXiv:2110.04775  [pdf, other]

    eess.AS cs.CR cs.SD

    Estimating the confidence of speech spoofing countermeasure

    Authors: Xin Wang, Junichi Yamagishi

    Abstract: Conventional speech spoofing countermeasures (CMs) are designed to make a binary decision on an input trial. However, a CM trained on a closed-set database is theoretically not guaranteed to perform well on unknown spoofing attacks. In some scenarios, an alternative strategy is to let the CM defer a decision when it is not confident. The question is then how to estimate a CM's confidence regarding…

    Submitted 1 February, 2022; v1 submitted 10 October, 2021; originally announced October 2021.

    Comments: Work in progress. Comments are welcome. Accepted by ICASSP 2022. Code is available at https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts. Not all the comments from anonymous reviewers could be addressed within 4 pages; we apologize for that

  8. arXiv:2110.02635  [pdf, other]

    eess.AS

    Generalization Ability of MOS Prediction Networks

    Authors: Erica Cooper, Wen-Chin Huang, Tomoki Toda, Junichi Yamagishi

    Abstract: Automatic methods to predict listener opinions of synthesized speech remain elusive since listeners, systems being evaluated, characteristics of the speech, and even the instructions given and the rating scale all vary from test to test. While automatic predictors for metrics such as mean opinion score (MOS) can achieve high prediction accuracy on samples from the same test, they typically fail to…

    Submitted 14 February, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  9. arXiv:2110.01147  [pdf, other]

    cs.SD cs.CL eess.AS

    On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis

    Authors: Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass

    Abstract: Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explored several…

    Submitted 27 October, 2021; v1 submitted 3 October, 2021; originally announced October 2021.

  10. arXiv:2109.07931  [pdf, other]

    eess.AS cs.SD

    DDS: A new device-degraded speech dataset for speech enhancement

    Authors: Haoyu Li, Junichi Yamagishi

    Abstract: A large and growing amount of speech content in real-life scenarios is being recorded on consumer-grade devices in uncontrolled environments, resulting in degraded speech quality. Transforming such low-quality device-degraded speech into high-quality speech is a goal of speech enhancement (SE). This paper introduces a new speech dataset, DDS, to facilitate the research on SE. DDS provides aligned…

    Submitted 22 March, 2022; v1 submitted 16 September, 2021; originally announced September 2021.

    Comments: Submitted to Interspeech 2022

  11. arXiv:2109.03398  [pdf, other]

    cs.CV

    Master Face Attacks on Face Recognition Systems

    Authors: Huy H. Nguyen, Sébastien Marcel, Junichi Yamagishi, Isao Echizen

    Abstract: Face authentication is now widely used, especially on mobile devices, rather than authentication using a personal identification number or an unlock pattern, due to its convenience. It has thus become a tempting target for attackers using a presentation attack. Traditional presentation attacks use facial images or videos of the victim. Previous work has proven the existence of master faces, i.e.,…

    Submitted 7 September, 2021; originally announced September 2021.

    Comments: This paper is an extension of the IJCB paper published in 2019 (Generating Master Faces for Use in Performing Wolf Attacks on Face Recognition Systems) and its first version was initially submitted to T-BIOM journal on Dec 25, 2020

  12. arXiv:2109.00648  [pdf, other]

    cs.CL cs.SD eess.AS

    The VoicePrivacy 2020 Challenge: Results and findings

    Authors: Natalia Tomashenko, Xin Wang, Emmanuel Vincent, Jose Patino, Brij Mohan Lal Srivastava, Paul-Gauthier Noé, Andreas Nautsch, Nicholas Evans, Junichi Yamagishi, Benjamin O'Brien, Anaïs Chanclu, Jean-François Bonastre, Massimiliano Todisco, Mohamed Maouche

    Abstract: This paper presents the results and analyses stemming from the first VoicePrivacy 2020 Challenge which focuses on developing anonymization solutions for speech technology. We provide a systematic overview of the challenge design with an analysis of submitted systems and evaluation results. In particular, we describe the voice anonymization task and datasets used for system development and evaluati…

    Submitted 26 September, 2022; v1 submitted 1 September, 2021; originally announced September 2021.

    Comments: Submitted to the Special Issue on Voice Privacy (Computer Speech and Language Journal - Elsevier); under review

  13. arXiv:2109.00537  [pdf, other]

    eess.AS cs.CR cs.LG cs.SD

    ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection

    Authors: Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, Héctor Delgado

    Abstract: ASVspoof 2021 is the fourth edition in the series of bi-annual challenges which aim to promote the study of spoofing and the design of countermeasures to protect automatic speaker verification systems from manipulation. In addition to a continued focus upon logical and physical access tasks in which there are a number of advances compared to previous editions, ASVspoof 2021 introduces a new task in…

    Submitted 1 September, 2021; originally announced September 2021.

    Comments: Accepted to the ASVspoof 2021 Workshop

  14. arXiv:2109.00535  [pdf, other]

    eess.AS cs.CR cs.LG cs.SD

    ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan

    Authors: Héctor Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Jose Patino, Md Sahidullah, Massimiliano Todisco, Xin Wang, Junichi Yamagishi

    Abstract: The automatic speaker verification spoofing and countermeasures (ASVspoof) challenge series is a community-led initiative which aims to promote the consideration of spoofing and the development of countermeasures. ASVspoof 2021 is the 4th in a series of bi-annual, competitive challenges where the goal is to develop countermeasures capable of discriminating between bona fide and spoofed or deepfake…

    Submitted 1 September, 2021; originally announced September 2021.

    Comments: http://www.asvspoof.org

  15. arXiv:2109.00281  [pdf, other]

    cs.CR cs.SD eess.AS

    Benchmarking and challenges in security and privacy for voice biometrics

    Authors: Jean-Francois Bonastre, Hector Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Paul-Gauthier Noe, Jose Patino, Md Sahidullah, Brij Mohan Lal Srivastava, Massimiliano Todisco, Natalia Tomashenko, Emmanuel Vincent, Xin Wang, Junichi Yamagishi

    Abstract: For many decades, research in speech technologies has focused upon improving reliability. With this now meeting user expectations for a range of diverse applications, speech technology is today omni-present. As a result, a focus on security and privacy has now come to the fore. Here, the research effort is in its relative infancy and progress calls for greater, multidisciplinary collaboration with s…

    Submitted 1 September, 2021; originally announced September 2021.

    Comments: Submitted to the symposium of the ISCA Security & Privacy in Speech Communications (SPSC) special interest group

  16. arXiv:2107.14480  [pdf, other]

    cs.CV

    OpenForensics: Large-Scale Challenging Dataset For Multi-Face Forgery Detection And Segmentation In-The-Wild

    Authors: Trung-Nghia Le, Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

    Abstract: The proliferation of deepfake media is raising concerns among the public and relevant authorities. It has become essential to develop countermeasures against forged faces in social media. This paper presents a comprehensive study on two new countermeasure tasks: multi-face forgery detection and segmentation in-the-wild. Localizing forged faces among multiple human faces in unrestricted natural sce…

    Submitted 30 July, 2021; originally announced July 2021.

    Comments: Accepted to ICCV 2021. Project page: https://sites.google.com/view/ltnghia/research/openforensics

  17. arXiv:2107.14132  [pdf, other]

    cs.SD eess.AS

    Multi-Task Learning in Utterance-Level and Segmental-Level Spoof Detection

    Authors: Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi

    Abstract: In this paper, we provide a series of multi-tasking benchmarks for simultaneously detecting spoofing at the segmental and utterance levels in the PartialSpoof database. First, we propose the SELCNN network, which inserts squeeze-and-excitation (SE) blocks into a light convolutional neural network (LCNN) to enhance the capacity of hidden feature selection. Then, we implement multi-task learning (MT…

    Submitted 31 August, 2021; v1 submitted 29 July, 2021; originally announced July 2021.

    Comments: Submitted to ASVspoof 2021 Workshop

  18. arXiv:2107.11506  [pdf, other]

    eess.AS cs.SD

    Use of speaker recognition approaches for learning and evaluating embedding representations of musical instrument sounds

    Authors: Xuan Shi, Erica Cooper, Junichi Yamagishi

    Abstract: Constructing an embedding space for musical instrument sounds that can meaningfully represent new and unseen instruments is important for downstream music generation tasks such as multi-instrument synthesis and timbre transfer. The framework of Automatic Speaker Verification (ASV) provides us with architectures and evaluation methodologies for verifying the identities of unseen speakers, and these…

    Submitted 24 December, 2021; v1 submitted 23 July, 2021; originally announced July 2021.

    Comments: Accepted by the IEEE/ACM Transactions on Audio, Speech, and Language Processing

  19. arXiv:2107.09392  [pdf, other]

    eess.AS cs.LG cs.SD

    SVSNet: An End-to-end Speaker Voice Similarity Assessment Model

    Authors: Cheng-Hung Hu, Yu-Huai Peng, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang

    Abstract: Neural evaluation metrics derived for numerous speech generation tasks have recently attracted great attention. In this paper, we propose SVSNet, the first end-to-end neural network model to assess the speaker voice similarity between converted speech and natural speech for voice conversion tasks. Unlike most neural evaluation metrics that use hand-crafted features, SVSNet directly takes the raw w…

    Submitted 16 February, 2022; v1 submitted 20 July, 2021; originally announced July 2021.

    Comments: To appear in IEEE Signal Processing Letters (SPL)

  20. arXiv:2106.13479  [pdf, other]

    cs.SD cs.CL eess.AS

    Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance

    Authors: Hieu-Thi Luong, Junichi Yamagishi

    Abstract: Generally speaking, the main objective when training a neural speech synthesis system is to synthesize natural and expressive speech from the output layer of the neural network without much attention given to the hidden layers. However, by learning useful latent representation, the system can be used for many more practical scenarios. In this paper, we investigate the use of quantized vectors to m…

    Submitted 25 June, 2021; originally announced June 2021.

    Comments: to be presented at SSW11

  21. arXiv:2106.06362  [pdf, other]

    cs.SD cs.LG eess.AS stat.AP

    Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing

    Authors: Tomi Kinnunen, Andreas Nautsch, Md Sahidullah, Nicholas Evans, Xin Wang, Massimiliano Todisco, Héctor Delgado, Junichi Yamagishi, Kong Aik Lee

    Abstract: Whether it be for results summarization, or the analysis of classifier fusion, some means to compare different classifiers can often provide illuminating insight into their behaviour, (dis)similarity or complementarity. We propose a simple method to derive 2D representation from detection scores produced by an arbitrary set of binary classifiers in response to a common dataset. Based upon rank cor…

    Submitted 11 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021. Example code available at https://github.com/asvspoof-challenge/classifier-adjacency

  22. arXiv:2106.00950  [pdf, other]

    cs.CL

    A Multi-Level Attention Model for Evidence-Based Fact Checking

    Authors: Canasai Kruengkrai, Junichi Yamagishi, Xin Wang

    Abstract: Evidence-based fact checking aims to verify the truthfulness of a claim against evidence extracted from textual sources. Learning a representation that effectively captures relations between a claim and evidence can be challenging. Recent state-of-the-art approaches have developed increasingly sophisticated models based on graph structures. We present a simple model that can be trained on sequence…

    Submitted 2 June, 2021; originally announced June 2021.

    Comments: Findings of ACL 2021

  23. arXiv:2105.02373  [pdf, other]

    cs.SD eess.AS

    How do Voices from Past Speech Synthesis Challenges Compare Today?

    Authors: Erica Cooper, Junichi Yamagishi

    Abstract: Shared challenges provide a venue for comparing systems trained on common data using a standardized evaluation, and they also provide an invaluable resource for researchers when the data and evaluation results are publicly released. The Blizzard Challenge and Voice Conversion Challenge are two such challenges for text-to-speech synthesis and for speaker conversion, respectively, and their publicly…

    Submitted 30 June, 2021; v1 submitted 5 May, 2021; originally announced May 2021.

    Comments: To appear at ISCA Speech Synthesis Workshop 2021

  24. arXiv:2105.01573  [pdf, other]

    eess.AS cs.SD

    Exploring Disentanglement with Multilingual and Monolingual VQ-VAE

    Authors: Jennifer Williams, Jason Fong, Erica Cooper, Junichi Yamagishi

    Abstract: This work examines the content and usefulness of disentangled phone and speaker representations from two separately trained VQ-VAE systems: one trained on multilingual data and another trained on monolingual data. We explore the multi- and monolingual models using four small proof-of-concept tasks: copy-synthesis, voice transformation, linguistic code-switching, and content-based privacy masking.…

    Submitted 28 June, 2021; v1 submitted 4 May, 2021; originally announced May 2021.

    Comments: Accepted to Speech Synthesis Workshop 2021 (SSW11)

  25. arXiv:2104.12292  [pdf, other]

    cs.SD eess.AS

    Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis

    Authors: Erica Cooper, Xin Wang, Junichi Yamagishi

    Abstract: Speech synthesis and music audio generation from symbolic input differ in many aspects but share some similarities. In this study, we investigate how text-to-speech synthesis techniques can be used for piano MIDI-to-audio synthesis tasks. Our investigation includes Tacotron and neural source-filter waveform models as the basic components, with which we build MIDI-to-audio synthesis systems in simi…

    Submitted 24 February, 2022; v1 submitted 25 April, 2021; originally announced April 2021.

    Comments: In the proceedings of ISCA Speech Synthesis Workshop 2021

  26. arXiv:2104.08499  [pdf, other]

    eess.AS cs.SD

    Multi-Metric Optimization using Generative Adversarial Networks for Near-End Speech Intelligibility Enhancement

    Authors: Haoyu Li, Junichi Yamagishi

    Abstract: The intelligibility of speech severely degrades in the presence of environmental noise and reverberation. In this paper, we propose a novel deep learning based system for modifying the speech signal to increase its intelligibility under the equal-power constraint, i.e., signal power before and after modification must be the same. To achieve this, we use generative adversarial networks (GANs) to ob…

    Submitted 16 September, 2021; v1 submitted 17 April, 2021; originally announced April 2021.

    Comments: Accepted to IEEE/ACM Transactions on Audio Speech and Language Processing

  27. arXiv:2104.08422  [pdf, other]

    cs.CV

    Fashion-Guided Adversarial Attack on Person Segmentation

    Authors: Marc Treu, Trung-Nghia Le, Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

    Abstract: This paper presents the first adversarial example based method for attacking human instance segmentation networks, namely person segmentation networks in short, which are harder to fool than classification networks. We propose a novel Fashion-Guided Adversarial Attack (FashionAdv) framework to automatically identify attackable regions in the target image to minimize the effect on image quality. It…

    Submitted 19 April, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

    Comments: Accepted to Workshop on Media Forensics, CVPR 2021. Project page: https://github.com/nii-yamagishilab/fashion_adv

    Journal ref: CVPR Workshops 2021

  28. arXiv:2104.02518  [pdf, other]

    eess.AS cs.SD

    An Initial Investigation for Detecting Partially Spoofed Audio

    Authors: Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi, Jose Patino, Nicholas Evans

    Abstract: All existing databases of spoofed speech contain attack data that is spoofed in its entirety. In practice, it is entirely plausible that successful attacks can be mounted with utterances that are only partially spoofed. By definition, partially-spoofed utterances contain a mix of both spoofed and bona fide segments, which will likely degrade the performance of countermeasures trained with entirely…

    Submitted 15 June, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

    Comments: INTERSPEECH 2021

  29. Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances

    Authors: Chang Zeng, Xin Wang, Erica Cooper, Xiaoxiao Miao, Junichi Yamagishi

    Abstract: Probabilistic linear discriminant analysis (PLDA) or cosine similarity have been widely used in traditional speaker verification systems as back-end techniques to measure pairwise similarities. To make better use of multiple enrollment utterances, we propose a novel attention back-end model, which can be used for both text-independent (TI) and text-dependent (TD) speaker verification, and employ s…

    Submitted 5 October, 2021; v1 submitted 4 April, 2021; originally announced April 2021.

  30. arXiv:2103.11326  [pdf, other]

    eess.AS

    A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection

    Authors: Xin Wang, Junichi Yamagishi

    Abstract: A great deal of recent research effort on speech spoofing countermeasures has been invested into back-end neural networks and training criteria. We contribute to this effort with a comparative perspective in this study. Our comparison of countermeasure models on the ASVspoof 2019 logical access task takes into account recently proposed margin-based training criteria, widely used front ends, and co…

    Submitted 13 June, 2021; v1 submitted 21 March, 2021; originally announced March 2021.

    Comments: Interspeech 2021

  31. arXiv:2102.05889  [pdf, other]

    eess.AS cs.CR cs.SD

    ASVspoof 2019: spoofing countermeasures for the detection of synthesized, converted and replayed speech

    Authors: Andreas Nautsch, Xin Wang, Nicholas Evans, Tomi Kinnunen, Ville Vestman, Massimiliano Todisco, Héctor Delgado, Md Sahidullah, Junichi Yamagishi, Kong Aik Lee

    Abstract: The ASVspoof initiative was conceived to spearhead research in anti-spoofing for automatic speaker verification (ASV). This paper describes the third in a series of bi-annual challenges: ASVspoof 2019. With the challenge database and protocols being described elsewhere, the focus of this paper is on results and the top performing single and ensemble system submissions from 62 teams, all of which o…

    Submitted 11 February, 2021; originally announced February 2021.

    Journal ref: IEEE Transactions on Biometrics, Behavior, and Identity Science 2021

  32. arXiv:2011.05038  [pdf, other]

    eess.AS cs.SD

    Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model

    Authors: Haoyu Li, Yang Ai, Junichi Yamagishi

    Abstract: High-quality speech corpora are essential foundations for most speech applications. However, such speech data are expensive and limited since they are collected in professional recording environments. In this work, we propose an encoder-decoder neural network to automatically enhance low-quality recordings to professional high-quality recordings. To address channel variability, we first filter out…

    Submitted 10 November, 2020; originally announced November 2020.

    Comments: 8 pages. Accepted to IEEE SLT 2021

  33. arXiv:2011.04839  [pdf, other]

    cs.SD cs.CL

    Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis

    Authors: Erica Cooper, Xin Wang, Yi Zhao, Yusuke Yasuda, Junichi Yamagishi

    Abstract: We explore pretraining strategies including choice of base corpus with the aim of choosing the best strategy for zero-shot multi-speaker end-to-end synthesis. We also examine choice of neural vocoder for waveform synthesis, as well as acoustic configurations used for mel spectrograms and final audio output. We find that fine-tuning a multi-speaker model from found audiobook data that has passed a…

    Submitted 9 November, 2020; originally announced November 2020.

    Comments: Technical report

  34. arXiv:2011.03955  [pdf, other]

    cs.SD eess.AS

    Denoising-and-Dereverberation Hierarchical Neural Vocoder for Robust Waveform Generation

    Authors: Yang Ai, Haoyu Li, Xin Wang, Junichi Yamagishi, Zhenhua Ling

    Abstract: This paper presents a denoising and dereverberation hierarchical neural vocoder (DNR-HiNet) to convert noisy and reverberant acoustic features into a clean speech waveform. We implement it mainly by modifying the amplitude spectrum predictor (ASP) in the original HiNet vocoder. This modified denoising and dereverberation ASP (DNR-ASP) can predict clean log amplitude spectra (LAS) from input degrad…

    Submitted 8 November, 2020; originally announced November 2020.

    Comments: Accepted by SLT 2021

  35. arXiv:2010.11549  [pdf, other]

    eess.AS cs.SD

    How Similar or Different Is Rakugo Speech Synthesizer to Professional Performers?

    Authors: Shuhei Kato, Yusuke Yasuda, Xin Wang, Erica Cooper, Junichi Yamagishi

    Abstract: We have been working on speech synthesis for rakugo (a traditional Japanese form of verbal entertainment similar to one-person stand-up comedy) toward speech synthesis that authentically entertains audiences. In this paper, we propose a novel evaluation methodology using synthesized rakugo speech and real rakugo speech uttered by professional performers of three different ranks. The naturalness of…

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: Submitted to ICASSP 2021

  36. arXiv:2010.10727  [pdf, other]

    eess.AS cs.LG cs.SD

    Learning Disentangled Phone and Speaker Representations in a Semi-Supervised VQ-VAE Paradigm

    Authors: Jennifer Williams, Yi Zhao, Erica Cooper, Junichi Yamagishi

    Abstract: We present a new approach to disentangle speaker voice and phone content by introducing new components to the VQ-VAE architecture for speech synthesis. The original VQ-VAE does not generalize well to unseen speakers or content. To alleviate this problem, we have incorporated a speaker encoder and speaker VQ codebook that learns global speaker characteristics entirely separate from the existing sub…

    Submitted 10 February, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP 2021

  37. arXiv:2010.10694  [pdf, other]

    cs.CL

    An Investigation of the Relation Between Grapheme Embeddings and Pronunciation for Tacotron-based Systems

    Authors: Antoine Perquin, Erica Cooper, Junichi Yamagishi

    Abstract: End-to-end models, particularly Tacotron-based ones, are currently a popular solution for text-to-speech synthesis. They allow the production of high-quality synthesized speech with little to no text preprocessing. Indeed, they can be trained using either graphemes or phonemes as input directly. However, in the case of grapheme inputs, little is known concerning the relation between the underlying…

    Submitted 4 April, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

    Comments: Submitted to Interspeech 2021

  38. arXiv:2010.09602  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Text-to-Speech using Latent Duration based on VQ-VAE

    Authors: Yusuke Yasuda, Xin Wang, Junichi Yamagishi

    Abstract: Explicit duration modeling is a key to achieving robust and efficient alignment in text-to-speech synthesis (TTS). We propose a new TTS framework using explicit duration modeling that incorporates duration as a discrete latent variable to TTS and enables joint optimization of whole modules from scratch. We formulate our method based on conditional VQ-VAE to handle discrete duration in a variationa…

    Submitted 20 October, 2020; v1 submitted 19 October, 2020; originally announced October 2020.

  39. arXiv:2010.03717  [pdf, other]

    eess.AS cs.CL cs.SD

    Latent linguistic embedding for cross-lingual text-to-speech and voice conversion

    Authors: Hieu-Thi Luong, Junichi Yamagishi

    Abstract: As the recently proposed voice cloning system, NAUTILUS, is capable of cloning unseen voices using untranscribed speech, we investigate the feasibility of using it to develop a unified cross-lingual TTS/VC system. Cross-lingual speech generation is the scenario in which speech utterances are generated with the voices of target speakers in a language not spoken by them originally. This type of syst…

    Submitted 7 October, 2020; originally announced October 2020.

    Comments: Accepted to Voice Conversion Challenge 2020 Online Workshop

  40. arXiv:2010.02150  [pdf, other]

    cs.CL cs.CY cs.SI

    Viable Threat on News Reading: Generating Biased News Using Natural Language Models

    Authors: Saurabh Gupta, Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

    Abstract: Recent advancements in natural language generation have raised serious concerns. High-performance language models are widely used for language generation tasks because they are able to produce fluent and meaningful sentences. These models are already being used to create fake news. They can also be exploited to generate biased news, which can then be used to attack news aggregators to change their…

    Submitted 5 October, 2020; originally announced October 2020.

    Comments: 11 pages, 4 figures, 6 tables, Accepted at NLP+CSS Workshop at EMNLP 2020

  41. arXiv:2009.03554  [pdf, other]

    eess.AS cs.SD

    Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions

    Authors: Rohan Kumar Das, Tomi Kinnunen, Wen-Chin Huang, Zhenhua Ling, Junichi Yamagishi, Yi Zhao, Xiaohai Tian, Tomoki Toda

    Abstract: The Voice Conversion Challenge 2020 is the third edition under its flagship that promotes intra-lingual semiparallel and cross-lingual voice conversion (VC). While the primary evaluation of the challenge submissions was done through crowd-sourced listening tests, we also performed an objective assessment of the submitted systems. The aim of the objective assessment is to provide complementary perf…

    Submitted 8 September, 2020; originally announced September 2020.

    Comments: Submitted to ISCA Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020

  42. arXiv:2008.12527  [pdf, other]

    eess.AS cs.SD

    Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion

    Authors: Yi Zhao, Wen-Chin Huang, Xiaohai Tian, Junichi Yamagishi, Rohan Kumar Das, Tomi Kinnunen, Zhenhua Ling, Tomoki Toda

    Abstract: The voice conversion challenge is a bi-annual scientific event held to compare and understand different voice conversion (VC) systems built on a common dataset. In 2020, we organized the third edition of the challenge and constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. After a two-month challenge period, we received 33 submissions, includ…

    Submitted 28 August, 2020; originally announced August 2020.

    Comments: Submitted to ISCA Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020

  43. arXiv:2008.03648  [pdf, other]

    eess.AS cs.SD

    An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning

    Authors: Berrak Sisman, Junichi Yamagishi, Simon King, Haizhou Li

    Abstract: Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. With the recent advances in theory…

    Submitted 16 November, 2020; v1 submitted 9 August, 2020; originally announced August 2020.

    Comments: accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing

  44. arXiv:2007.05979  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals

    Authors: Tomi Kinnunen, Héctor Delgado, Nicholas Evans, Kong Aik Lee, Ville Vestman, Andreas Nautsch, Massimiliano Todisco, Xin Wang, Md Sahidullah, Junichi Yamagishi, Douglas A. Reynolds

    Abstract: Recent years have seen growing efforts to develop spoofing countermeasures (CMs) to protect automatic speaker verification (ASV) systems from being deceived by manipulated or artificial inputs. The reliability of spoofing CMs is typically gauged using the equal error rate (EER) metric. The primitive EER fails to reflect application requirements and the impact of spoofing and CMs upon ASV and its u…

    Submitted 25 August, 2020; v1 submitted 12 July, 2020; originally announced July 2020.

    Comments: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (doi updated)

  45. arXiv:2006.08376  [pdf, other]

    cs.CV cs.LG

    Generating Master Faces for Use in Performing Wolf Attacks on Face Recognition Systems

    Authors: Huy H. Nguyen, Junichi Yamagishi, Isao Echizen, Sébastien Marcel

    Abstract: Due to its convenience, biometric authentication, especially face authentication, has become increasingly mainstream and thus is now a prime target for attackers. Presentation attacks and face morphing are typical types of attack. Previous research has shown that finger-vein- and fingerprint-based authentication methods are susceptible to wolf attacks, in which a wolf sample matches many enrolled us…

    Submitted 15 June, 2020; originally announced June 2020.

    Comments: Accepted to be Published in Proceedings of the 2020 International Joint Conference on Biometrics (IJCB 2020), Houston, USA

  46. arXiv:2005.11004  [pdf, other]

    eess.AS cs.CL cs.SD

    NAUTILUS: a Versatile Voice Cloning System

    Authors: Hieu-Thi Luong, Junichi Yamagishi

    Abstract: We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker. By using a multi-speaker speech corpus to train all requisite encoders and decoders in the initial training stage, our system can clone unseen voices using untranscribed speech of target speakers on the basis o…

    Submitted 6 October, 2020; v1 submitted 22 May, 2020; originally announced May 2020.

    Comments: Submitted to The IEEE/ACM Transactions on Audio, Speech, and Language Processing

  47. arXiv:2005.10390  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis

    Authors: Yusuke Yasuda, Xin Wang, Junichi Yamagishi

    Abstract: Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS does not require manually annotated and complicated linguistic features such as part-of-speech tags and syntactic structures for system training. However, it must be careful…

    Submitted 7 October, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

  48. The Privacy ZEBRA: Zero Evidence Biometric Recognition Assessment

    Authors: Andreas Nautsch, Jose Patino, Natalia Tomashenko, Junichi Yamagishi, Paul-Gauthier Noe, Jean-Francois Bonastre, Massimiliano Todisco, Nicholas Evans

    Abstract: Mounting privacy legislation calls for the preservation of privacy in speech technology, though solutions are gravely lacking. While evaluation campaigns are long-proven tools to drive progress, the need to consider a privacy adversary implies that traditional approaches to evaluation must be adapted to the assessment of privacy and privacy preservation solutions. This paper presents the first ste…

    Submitted 20 May, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: submitted to Interspeech 2020

    Journal ref: Proc Interspeech 2020

  49. arXiv:2005.08601  [pdf, other]

    eess.AS cs.CL

    Design Choices for X-vector Based Speaker Anonymization

    Authors: Brij Mohan Lal Srivastava, Natalia Tomashenko, Xin Wang, Emmanuel Vincent, Junichi Yamagishi, Mohamed Maouche, Aurélien Bellet, Marc Tommasi

    Abstract: The recently proposed x-vector based anonymization scheme converts any input voice into that of a random pseudo-speaker. In this paper, we present a flexible pseudo-speaker selection technique as a baseline for the first VoicePrivacy Challenge. We explore several design choices for the distance metric between speakers, the region of x-vector space where the pseudo-speaker is picked, and gender sel…

    Submitted 18 May, 2020; originally announced May 2020.

  50. arXiv:2005.07884  [pdf, other]

    eess.AS cs.SD

    Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction

    Authors: Yi Zhao, Haoyu Li, Cheng-I Lai, Jennifer Williams, Erica Cooper, Junichi Yamagishi

    Abstract: Vector Quantized Variational AutoEncoders (VQ-VAE) are a powerful representation learning framework that can discover discrete groups of features from a speech signal without supervision. Until now, the VQ-VAE architecture has previously modeled individual types of speech features, such as only phones or only F0. This paper introduces an important extension to VQ-VAE for learning F0-related supras…

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020