Search | arXiv e-print repository

Critical Review for One-class Classification: recent advances and the reality behind them

Authors: Toshitaka Hayashi, Dalibor Cimr, Hamido Fujita, Richard Cimler

Abstract: This paper offers a comprehensive review of one-class classification (OCC), examining the technologies and methodologies employed in its implementation. It delves into various approaches utilized for OCC across diverse data types, such as feature data, image, video, time series, and others. Through a systematic review, this paper synthesizes prominent strategies used in OCC from its inception to its current advancements, with a particular emphasis on the promising application. Moreover, the article criticizes the state-of-the-art (SOTA) image anomaly detection (AD) algorithms dominating one-class experiments. These algorithms include outlier exposure (binary classification) and pretrained model (multi-class classification), conflicting with the fundamental concept of learning from one class. Our investigation reveals that the top nine algorithms for one-class CIFAR10 benchmark are not OCC. We argue that binary/multi-class classification algorithms should not be compared with OCC.

Submitted 27 April, 2024; originally announced April 2024.

arXiv:2304.04596 [pdf, other]

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

Authors: Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Soumi Maiti, Juan Pino, Shinji Watanabe

Abstract: ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST) -- each task is supported with a wide variety of approaches, differentiating ESPnet-ST-v2 from other open source spoken language translation toolkits. This toolkit offers state-of-the-art architectures such as transducers, hybrid CTC/attention, multi-decoders with searchable intermediates, time-synchronous blockwise CTC/attention, Translatotron models, and direct discrete unit models. In this paper, we describe the overall design, example models for each task, and performance benchmarking behind ESPnet-ST-v2, which is publicly available at https://github.com/espnet/espnet.

Submitted 6 July, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

Comments: ACL 2023; System Demonstration

arXiv:2302.09189 [pdf]

Extraction of Constituent Factors of Digestion Efficiency in Information Transfer by Media Composed of Texts and Images

Authors: Koike Hiroaki, Teruaki Hayashi

Abstract: The development and spread of information and communication technologies have increased and diversified information. However, the increase in the volume and the selection of information does not necessarily promote understanding. In addition, conventional evaluations of information transfer have focused only on the arrival of information to the receivers. They need to sufficiently take into account the receivers' understanding of the information after it has been acquired, which is the original purpose of the evaluation. In this study, we propose the concept of "information digestion," which refers to the receivers' correct understanding of the acquired information, its contents, and its purpose. In the experiment, we proposed an evaluation model of information digestibility using hierarchical factor analysis and extracted factors that constitute digestibility by four types of media.

Submitted 17 February, 2023; originally announced February 2023.

Comments: This paper is the revised version of the 29th annual conference of the Natural Language Processing Society in Japan, in Japanese language

arXiv:2301.09099 [pdf, ps, other]

Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study

Authors: Massa Baali, Tomoki Hayashi, Hamdy Mubarak, Soumi Maiti, Shinji Watanabe, Wassim El-Hajj, Ahmed Ali

Abstract: Several high-resource Text to Speech (TTS) systems currently produce natural, well-established human-like speech. In contrast, low-resource languages, including Arabic, have very limited TTS systems due to the lack of resources. We propose a fully unsupervised method for building TTS, including automatic data selection and pre-training/fine-tuning strategies for TTS training, using broadcast news as a case study. We show how careful selection of data, yet smaller amounts, can improve the efficiency of TTS system in generating more natural speech than a system trained on a bigger dataset. We adopt to propose different approaches for the: 1) data: we applied automatic annotations using DNSMOS, automatic vowelization, and automatic speech recognition (ASR) for fixing transcriptions' errors; 2) model: we used transfer learning from high-resource language in TTS model and fine-tuned it with one hour broadcast recording then we used this model to guide a FastSpeech2-based Conformer model for duration. Our objective evaluation shows 3.9% character error rate (CER), while the groundtruth has 1.3% CER. As for the subjective evaluation, where 1 is bad and 5 is excellent, our FastSpeech2-based Conformer model achieved a mean opinion score (MOS) of 4.4 for intelligibility and 4.2 for naturalness, where many annotators recognized the voice of the broadcaster, which proves the effectiveness of our proposed unsupervised method.

Submitted 26 January, 2023; v1 submitted 22 January, 2023; originally announced January 2023.

arXiv:2212.13157 [pdf, other]

Gaussian Process Classification Bandits

Authors: Tatsuya Hayashi, Naoki Ito, Koji Tabata, Atsuyoshi Nakamura, Katsumasa Fujita, Yoshinori Harada, Tamiki Komatsuzaki

Abstract: Classification bandits are multi-armed bandit problems whose task is to classify a given set of arms into either positive or negative class depending on whether the rate of the arms with the expected reward of at least h is not less than w for given thresholds h and w. We study a special classification bandit problem in which arms correspond to points x in d-dimensional real space with expected rewards f(x) which are generated according to a Gaussian process prior. We develop a framework algorithm for the problem using various arm selection policies and propose policies called FCB and FTSV. We show a smaller sample complexity upper bound for FCB than that for the existing algorithm of the level set estimation, in which whether f(x) is at least h or not must be decided for every arm's x. Arm selection policies depending on an estimated rate of arms with rewards of at least h are also proposed and shown to improve empirical sample complexity. According to our experimental results, the rate-estimation versions of FCB and FTSV, together with that of the popular active learning policy that selects the point with the maximum variance, outperform other policies for synthetic functions, and the version of FTSV is also the best performer for our real-world dataset.

Submitted 26 December, 2022; originally announced December 2022.

arXiv:2207.04356 [pdf, other]

doi 10.1109/JSTSP.2022.3193761

A Comparative Study of Self-supervised Speech Representation Based Voice Conversion

Authors: Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, Tomoki Toda

Abstract: We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC). In the context of recognition-synthesis VC, S3Rs are attractive owing to their potential to replace expensive supervised representations such as phonetic posteriorgrams (PPGs), which are commonly adopted by state-of-the-art VC systems. Using S3PRL-VC, an open-source VC software we previously developed, we provide a series of in-depth objective and subjective analyses under three VC settings: intra-/cross-lingual any-to-one (A2O) and any-to-any (A2A) VC, using the voice conversion challenge 2020 (VCC2020) dataset. We investigated S3R-based VC in various aspects, including model type, multilinguality, and supervision. We also studied the effect of a post-discretization process with k-means clustering and showed how it improves in the A2A setting. Finally, the comparison with state-of-the-art VC systems demonstrates the competitiveness of S3R-based VC and also sheds light on the possible improving directions.

Submitted 9 July, 2022; originally announced July 2022.

Comments: Accepted to IEEE Journal of Selected Topics in Signal Processing. arXiv admin note: substantial text overlap with arXiv:2110.06280

arXiv:2206.05929 [pdf, other]

Improvement of Serial Approach to Anomalous Sound Detection by Incorporating Two Binary Cross-Entropies for Outlier Exposure

Authors: Ibuki Kuroyanagi, Tomoki Hayashi, Kazuya Takeda, Tomoki Toda

Abstract: Anomalous sound detection systems must detect unknown, atypical sounds using only normal audio data. Conventional methods use the serial method, a combination of outlier exposure (OE), which classifies normal and pseudo-anomalous data and obtains embedding, and inlier modeling (IM), which models the probability distribution of the embedding. Although the serial method shows high performance due to the powerful feature extraction of OE and the robustness of IM, OE still has a problem that doesn't work well when the normal and pseudo-anomalous data are too similar or too different. To explicitly distinguish these data, the proposed method uses multi-task learning of two binary cross-entropies when training OE. The first is a loss that classifies the sound of the target machine to which product it is emitted from, which deals with the case where the normal data and the pseudo-anomalous data are too similar. The second is a loss that identifies whether the sound is emitted from the target machine or not, which deals with the case where the normal data and the pseudo-anomalous data are too different. We perform our experiments with DCASE 2021 Task 2 dataset. Our proposed single-model method outperforms the top-ranked method, which combines multiple models, by 2.1% in AUC.

Submitted 13 June, 2022; originally announced June 2022.

Comments: 5 pages, 3 figures, 3 tables, EUSIPCO 2022

arXiv:2205.04029 [pdf, other]

Muskits: an End-to-End Music Processing Toolkit for Singing Voice Synthesis

Authors: Jiatong Shi, Shuai Guo, Tao Qian, Nan Huo, Tomoki Hayashi, Yuning Wu, Frank Xu, Xuankai Chang, Huazhe Li, Peter Wu, Shinji Watanabe, Qin Jin

Abstract: This paper introduces a new open-source platform named Muskits for end-to-end music processing, which mainly focuses on end-to-end singing voice synthesis (E2E-SVS). Muskits supports state-of-the-art SVS models, including RNN SVS, transformer SVS, and XiaoiceSing. The design of Muskits follows the style of widely-used speech processing toolkits, ESPnet and Kaldi, for data prepossessing, training, and recipe pipelines. To the best of our knowledge, this toolkit is the first platform that allows a fair and highly-reproducible comparison between several published works in SVS. In addition, we also demonstrate several advanced usages based on the toolkit functionalities, including multilingual training and transfer learning. This paper describes the major framework of Muskits, its functionalities, and experimental results in single-singer, multi-singer, multilingual, and transfer learning scenarios. The toolkit is publicly available at https://github.com/SJTMusicTeam/Muskits.

Submitted 2 July, 2022; v1 submitted 9 May, 2022; originally announced May 2022.

Comments: Accepted by Interspeech

arXiv:2204.13378 [pdf, other]

Learning General Inventory Management Policy for Large Supply Chain Network

Authors: Soh Kumabe, Shinya Shiroshita, Takanori Hayashi, Shirou Maruyama

Abstract: Inventory management in warehouses directly affects profits made by manufacturers. Particularly, large manufacturers produce a very large variety of products that are handled by a significantly large number of retailers. In such a case, the computational complexity of classical inventory management algorithms is inordinately large. In recent years, learning-based approaches have become popular for addressing such problems. However, previous studies have not been managed systems where both the number of products and retailers are large. This study proposes a reinforcement learning-based warehouse inventory management algorithm that can be used for supply chain systems where both the number of products and retailers are large. To solve the computational problem of handling large systems, we provide a means of approximate simulation of the system in the training phase. Our experiments on both real and artificial data demonstrate that our algorithm with approximated simulation can successfully handle large supply chain networks.

Submitted 28 April, 2022; originally announced April 2022.

Comments: 9 pages, OPTLearnMAS 2022

arXiv:2202.08470 [pdf, other]

doi 10.21437/Interspeech.2021-2218

Acoustic Event Detection with Classifier Chains

Authors: Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi

Abstract: This paper proposes acoustic event detection (AED) with classifier chains, a new classifier based on the probabilistic chain rule. The proposed AED with classifier chains consists of a gated recurrent unit and performs iterative binary detection of each event one by one. In each iteration, the event's activity is estimated and used to condition the next output based on the probabilistic chain rule to form classifier chains. Therefore, the proposed method can handle the interdependence among events upon classification, while the conventional AED methods with multiple binary classifiers with a linear layer and sigmoid function have placed an assumption of conditional independence. In the experiments with a real-recording dataset, the proposed method demonstrates its superior AED performance to a relative 14.80% improvement compared to a convolutional recurrent neural network baseline system with the multiple binary classifiers.

Submitted 17 February, 2022; originally announced February 2022.

Comments: 5 pages, presented at Interspeech 2021

arXiv:2112.09382 [pdf, other]

Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem

Authors: Jing Shi, Xuankai Chang, Tomoki Hayashi, Yen-Ju Lu, Shinji Watanabe, Bo Xu

Abstract: Deep learning based models have significantly improved the performance of speech separation with input mixtures like the cocktail party. Prominent methods (e.g., frequency-domain and time-domain speech separation) usually build regression models to predict the ground-truth speech from the mixture, using the masking-based design and the signal-level loss criterion (e.g., MSE or SI-SNR). This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem, with great flexibility and strong potential. Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, and convert the paradigm of the speech separation/enhancement related tasks from regression to classification. By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized. Evaluation results based on the WSJ0-2mix and VCTK-noisy corpora in various settings show that our proposed method can steadily synthesize the separated speech with high speech quality and without any interference, which is difficult to avoid in regression-based methods. In addition, with negligible loss of listening quality, the speaker conversion of enhanced/separated speech could be easily realized through our method.

Submitted 9 January, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

Comments: 5 pages, https://shincling.github.io/discreteSeparation/

arXiv:2111.12460 [pdf, other]

ViCE: Improving Dense Representation Learning by Superpixelization and Contrasting Cluster Assignment

Authors: Robin Karlsson, Tomoki Hayashi, Keisuke Fujii, Alexander Carballo, Kento Ohtani, Kazuya Takeda

Abstract: Recent self-supervised models have demonstrated equal or better performance than supervised methods, opening for AI systems to learn visual representations from practically unlimited data. However, these methods are typically classification-based and thus ineffective for learning high-resolution feature maps that preserve precise spatial information. This work introduces superpixels to improve self-supervised learning of dense semantically rich visual concept embeddings. Decomposing images into a small set of visually coherent regions reduces the computational complexity by $\mathcal{O}(1000)$ while preserving detail. We experimentally show that contrasting over regions improves the effectiveness of contrastive learning methods, extends their applicability to high-resolution images, improves overclustering performance, superpixels are better than grids, and regional masking improves performance. The expressiveness of our dense embeddings is demonstrated by improving the SOTA unsupervised semantic segmentation benchmark on Cityscapes, and for convolutional models on COCO.

Submitted 7 October, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

Comments: Accepted for BMVC 2022

ACM Class: I.2.10; I.2.9

arXiv:2111.04505 [pdf]

Feature Concepts for Data Federative Innovations

Authors: Yukio Ohsawa, Sae Kondo, Teruaki Hayashi

Abstract: A feature concept, the essence of the data-federative innovation process, is presented as a model of the concept to be acquired from data. A feature concept may be a simple feature, such as a single variable, but is more likely to be a conceptual illustration of the abstract information to be obtained from the data. For example, trees and clusters are feature concepts for decision tree learning and clustering, respectively. Useful feature concepts for satisfying the requirements of users of data have been elicited so far via creative communication among stakeholders in the market of data. In this short paper, such a creative communication is reviewed, showing a couple of applications, for example, change explanation in markets and earthquakes, and highlight the feature concepts elicited in these cases.

Submitted 5 November, 2021; originally announced November 2021.

Comments: 13 pages, 7 figures

arXiv:2110.07840 [pdf, other]

ESPnet2-TTS: Extending the Edge of TTS Research

Authors: Tomoki Hayashi, Ryuichi Yamamoto, Takenori Yoshimura, Peter Wu, Jiatong Shi, Takaaki Saeki, Yooncheol Ju, Yusuke Yasuda, Shinnosuke Takamichi, Shinji Watanabe

Abstract: This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance TTS performance. The unified design of our recipes enables users to quickly reproduce state-of-the-art E2E-TTS results. We also provide many pre-trained models in a unified Python interface for inference, offering a quick means for users to generate baseline samples and build demos. Experimental evaluations with English and Japanese corpora demonstrate that our provided models synthesize utterances comparable to ground-truth ones, achieving state-of-the-art TTS performance. The toolkit is available online at https://github.com/espnet/espnet.

Submitted 14 October, 2021; originally announced October 2021.

Comments: Submitted to ICASSP2022. Demo HP: https://espnet.github.io/icassp2022-tts/

arXiv:2110.06280 [pdf, other]

S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations

Authors: Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji Watanabe, Tomoki Toda

Abstract: This paper introduces S3PRL-VC, an open-source voice conversion (VC) framework based on the S3PRL toolkit. In the context of recognition-synthesis VC, self-supervised speech representation (S3R) is valuable in its potential to replace the expensive supervised representation adopted by state-of-the-art VC systems. Moreover, we claim that VC is a good probing task for S3R analysis. In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) VC, as well as an any-to-any (A2A) setting. We also provide comparisons between not only different S3Rs but also top systems in VCC2020 with supervised representations. Systematic objective and subjective evaluation were conducted, and we show that S3R is comparable with VCC2020 top systems in the A2O setting in terms of similarity, and achieves state-of-the-art in S3R-based A2A VC. We believe the extensive analysis, as well as the toolkit itself, contribute to not only the S3R community but also the VC community. The codebase is now open-sourced.

Submitted 12 October, 2021; originally announced October 2021.

Comments: Submitted to ICASSP 2022. Code available at: https://github.com/s3prl/s3prl/tree/master/s3prl/downstream/a2o-vc-vcc2020

arXiv:2107.11929 [pdf]

Collaborative Problem Solving on a Data Platform Kaggle

Authors: Teruaki Hayashi, Takumi Shimizu, Yoshiaki Fukami

Abstract: Data exchange across different domains has gained much attention as a way of creating new businesses and improving the value of existing services. Data exchange ecosystem is developed by platform services that facilitate data and knowledge exchange and offer co-creation environments for organizations to promote their problem-solving. In this study, we investigate Kaggle, a data analysis competition platform, and discuss the characteristics of data and the ecosystem that contributes to collaborative problem-solving by analyzing the datasets, users, and their relationships.

Submitted 25 July, 2021; originally announced July 2021.

Comments: This paper is the English-translated version of "Collaborative Problem Solving on a Data Platform Kaggle" IEICE Tech. Rep., vol.120, no.362, pp.37-40, 2021. (https://www.ieice.org/ken/paper/20210212XCc9/eng/)

arXiv:2107.09477 [pdf, other]

On Prosody Modeling for ASR+TTS based Voice Conversion

Authors: Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda

Abstract: In voice conversion (VC), an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents; these are then used as input by a text-to-speech (TTS) system to generate the converted speech. Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity. Although some researchers have considered transferring prosodic clues from the source speech, there arises a speaker mismatch during training and conversion. To address this issue, in this work, we propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP). We evaluate both methods on the VCC2020 benchmark and consider different linguistic representations. The results demonstrate the effectiveness of TTP in both objective and subjective evaluations.

Submitted 20 July, 2021; originally announced July 2021.

Comments: Submitted to ASRU2021. Under review

arXiv:2106.06151 [pdf, other]

Anomalous Sound Detection Using a Binary Classification Model and Class Centroids

Authors: Ibuki Kuroyanagi, Tomoki Hayashi, Kazuya Takeda, Tomoki Toda

Abstract: An anomalous sound detection system to detect unknown anomalous sounds usually needs to be built using only normal sound data. Moreover, it is desirable to improve the system by effectively using a small amount of anomalous sound data, which will be accumulated through the system's operation. As one of the methods to meet these requirements, we focus on a binary classification model that is developed by using not only normal data but also outlier data in the other domains as pseudo-anomalous sound data, which can be easily updated by using anomalous data. In this paper, we implement a new loss function based on metric learning to learn the distance relationship from each class centroid in feature space for the binary classification model. The proposed multi-task learning of the binary classification and the metric learning makes it possible to build the feature space where the within-class variance is minimized and the between-class variance is maximized while keeping normal and anomalous classes linearly separable. We also investigate the effectiveness of additionally using anomalous sound data for further improving the binary classification model. Our results showed that multi-task learning using binary classification and metric learning to consider the distance from each class centroid in the feature space is effective, and performance can be significantly improved by using even a small amount of anomalous data during training.

Submitted 10 June, 2021; originally announced June 2021.

Comments: 6 pages, 2 figures, 2 tables, EUSIPCO2021

arXiv:2104.09754 [pdf]

Hierarchical entropy and domain interaction to understand the structure in an image

Authors: Nao Uehara, Teruaki Hayashi, Yukio Ohsawa

Abstract: In this study, we devise a model that introduces two hierarchies into information entropy. The two hierarchies are the size of the region for which entropy is calculated and the size of the component that determines whether the structures in the image are integrated or not. And this model uses two indicators, hierarchical entropy and domain interaction. Both indicators increase or decrease due to the integration or fragmentation of the structure in the image. It aims to help people interpret and explain what the structure in an image looks like from two indicators that change with the size of the region and the component. First, we conduct experiments using images and qualitatively evaluate how the two indicators change. Next, we explain the relationship with the hidden structure of Vermeer's girl with a pearl earring using the change of hierarchical entropy. Finally, we clarify the relationship between the change of domain interaction and the appropriate segment result of the image by an experiment using a questionnaire.

Submitted 20 April, 2021; originally announced April 2021.

Comments: 20 pages, 17 figures

arXiv:2104.09719 [pdf]

Effects of Interregional Travels and Vaccination in Infection Spreads Simulated by Lattice of SEIRS Circuits

Authors: Yukio Ohsawa, Teruaki Hayashi, Sae Kondo

Abstract: The SEIRS model, an extension of the SEIR model for analyzing and predicting the spread of virus infection, was further extended to consider the movement of people across regions. In contrast to previous models that consider the risk of travelers from/to other regions, we consider two factors. First, we consider the movements of susceptible (S), exposed (E), and recovered (R) individuals who may get infected and infect others in the destination region, as well as infected (I) individuals. Second, people living in a region and moving from other regions are dealt as separate but interacting groups with respect to their states, S, E, R, or I. This enables us to consider the potential influence of movements before individuals become infected, difficult to detect by testing at the time of immigration, on the spread of infection. In this paper, we show the results of the simulation where individuals travel across regions, which means prefectures here, and the government chooses regions to vaccinate with priority. We found a general law that a quantity of vaccines can be used efficiently by maximizing an index value, the conditional entropy Hc, when we distribute vaccines to regions. The efficiency of this strategy, which maximizes Hc, was found to outperform that of vaccinating regions with a larger effective regeneration number. This law also explains the surprising result that travel activities across regional borders may suppress the spread if vaccination is processed at a sufficiently high pace, introducing the concept of social muddling.

Submitted 30 June, 2021; v1 submitted 19 April, 2021; originally announced April 2021.

Comments: 15 pages, one Table, 6 figures, to be submitted to a journal soon, on the way to choose the suitable journal

ACM Class: I.6.5; I.6.6

arXiv:2104.06793 [pdf, other]

Non-autoregressive sequence-to-sequence voice conversion

Authors: Tomoki Hayashi, Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda

Abstract: This paper proposes a novel voice conversion (VC) method based on non-autoregressive sequence-to-sequence (NAR-S2S) models. Inspired by the great success of NAR-S2S models such as FastSpeech in text-to-speech (TTS), we extend the FastSpeech2 model for the VC problem. We introduce the convolution-augmented Transformer (Conformer) instead of the Transformer, making it possible to capture both local and global context information from the input sequence. Furthermore, we extend variance predictors to variance converters to explicitly convert the source speaker's prosody components such as pitch and energy into the target speaker. The experimental evaluation with the Japanese speaker dataset, which consists of male and female speakers of 1,000 utterances, demonstrates that the proposed model enables us to perform more stable, faster, and better conversion than autoregressive S2S (AR-S2S) models such as Tacotron2 and Transformer.

Submitted 14 April, 2021; originally announced April 2021.

Comments: Accepted to ICASSP2021. Demo HP: https://kan-bayashi.github.io/NonARSeq2SeqVC/

arXiv:2103.02858 [pdf, ps, other]

crank: An Open-Source Software for Nonparallel Voice Conversion Based on Vector-Quantized Variational Autoencoder

Authors: Kazuhiro Kobayashi, Wen-Chin Huang, Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Tomoki Toda

Abstract: In this paper, we present an open-source software for developing a nonparallel voice conversion (VC) system named crank. Although we have released an open-source VC software based on the Gaussian mixture model named sprocket in the last VC Challenge, it is not straightforward to apply any speech corpus because it is necessary to prepare parallel utterances of source and target speakers to model a statistical conversion function. To address this issue, in this study, we developed a new open-source VC software that enables users to model the conversion function by using only a nonparallel speech corpus. For implementing the VC software, we used a vector-quantized variational autoencoder (VQVAE). To rapidly examine the effectiveness of recent technologies developed in this research field, crank also supports several representative works for autoencoder-based VC methods such as the use of hierarchical architectures, cyclic architectures, generative adversarial networks, speaker adversarial training, and neural vocoders. Moreover, it is possible to automatically estimate objective measures such as mel-cepstrum distortion and pseudo mean opinion score based on MOSNet. In this paper, we describe representative functions developed in crank and make brief comparisons by objective evaluations.

Submitted 4 March, 2021; originally announced March 2021.

Comments: Accepted to ICASSP 2021

arXiv:2012.13006 [pdf, other]

The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans

Authors: Shinji Watanabe, Florian Boyer, Xuankai Chang, Pengcheng Guo, Tomoki Hayashi, Yosuke Higuchi, Takaaki Hori, Wen-Chin Huang, Hirofumi Inaguma, Naoyuki Kamo, Shigeki Karita, Chenda Li, Jing Shi, Aswin Shanmugam Subramanian, Wangyou Zhang

Abstract: This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. Now ESPnet also includes text to speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation. All applications are trained in an end-to-end manner, thanks to the generic sequence to sequence modeling properties, and they can be further integrated and jointly optimized. Also, ESPnet provides reproducible all-in-one recipes for these applications with state-of-the-art performance in various benchmarks by incorporating transformer, advanced data augmentation, and conformer. This project aims to provide up-to-date speech processing experience to the community so that researchers in academia and various industry scales can develop their technologies collaboratively.

Submitted 23 December, 2020; originally announced December 2020.

arXiv:2012.11746 [pdf]

Data Combination for Problem-solving: A Case of an Open Data Exchange Platform

Authors: Teruaki Hayashi, Hiroki Sakaji, Hiroyasu Matsushima, Yoshiaki Fukami, Takumi Shimizu, Yukio Ohsawa

Abstract: In recent years, rather than enclosing data within a single organization, exchanging and combining data from different domains has become an emerging practice. Many studies have discussed the economic and utility value of data and data exchange, but the characteristics of data that contribute to problem solving through data combination have not been fully understood. In big data and interdisciplinary data combinations, large-scale data with many variables are expected to be used, and value is expected to be created by combining data as much as possible. In this study, we conduct three experiments to investigate the characteristics of data, focusing on the relationships between data combinations and variables in each dataset, using empirical data shared by the local government. The results indicate that even datasets that have a few variables are frequently used to propose solutions for problem solving. Moreover, we found that even if the datasets in the solution do not have common variables, there are some well-established solutions to the problems. The findings of this study shed light on mechanisms behind data combination for problem-solving involving multiple datasets and variables.

Submitted 21 December, 2020; originally announced December 2020.

arXiv:2011.03706 [pdf, other]

doi 10.1109/SLT48900.2021.9383615

ESPnet-se: end-to-end speech enhancement and separation toolkit designed for asr integration

Authors: Chenda Li, Jing Shi, Wangyou Zhang, Aswin Shanmugam Subramanian, Xuankai Chang, Naoyuki Kamo, Moto Hira, Tomoki Hayashi, Christoph Boeddeker, Zhuo Chen, Shinji Watanabe

Abstract: We present ESPnet-SE, which is designed for the quick development of speech enhancement and speech separation systems in a single framework, along with the optional downstream speech recognition module. ESPnet-SE is a new project which integrates rich automatic speech recognition related models, resources and systems to support and validate the proposed front-end implementation (i.e. speech enhancement and separation). It is capable of processing both single-channel and multi-channel data, with various functionalities including dereverberation, denoising and source separation. We provide all-in-one recipes including data pre-processing, feature extraction, training and evaluation pipelines for a wide range of benchmark datasets. This paper describes the design of the toolkit, several important functionalities, especially the speech recognition integration, which differentiates ESPnet-SE from other open source toolkits, and experimental results with major benchmark datasets.

Submitted 7 November, 2020; originally announced November 2020.

Comments: Accepted by SLT 2021

arXiv:2010.13956 [pdf, other]

Recent Developments on ESPnet Toolkit Boosted by Conformer

Authors: Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, Yuekai Zhang

Abstract: In this study, we present recent developments on ESPnet: End-to-End Speech Processing toolkit, which mainly involves a recently proposed architecture called Conformer, Convolution-augmented Transformer. This paper shows the results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translations (ST), speech separation (SS) and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Conformer on different tasks. These results are competitive or even outperform the current state-of-art Transformer models. We are preparing to release all-in-one recipes using open source and publicly available corpora for all the above tasks with pre-trained models. Our aim for this work is to contribute to our research community by reducing the burden of preparing state-of-the-art research environments usually requiring high resources.

Submitted 29 October, 2020; v1 submitted 26 October, 2020; originally announced October 2020.

arXiv:2010.12231 [pdf, other]

Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

Authors: Wen-Chin Huang, Yi-Chiao Wu, Tomoki Hayashi, Tomoki Toda

Abstract: We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation that was learned from massive unlabeled data, which is assumed to be speaker-independent and well corresponds to underlying linguistic contents. Given a training dataset of the target speaker, we extract VQW2V and acoustic features to estimate a seq2seq mapping function from the former to the latter. With the help of a pretraining method and a newly designed postprocessing technique, our model can be generalized to only 5 min of data, even outperforming the same model trained with parallel data.

Submitted 23 October, 2020; originally announced October 2020.

Comments: Submitted to ICASSP 2021

arXiv:2010.02434 [pdf, other]

The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS

Authors: Wen-Chin Huang, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda

Abstract: This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020. We consider a naive approach for voice conversion (VC), which is to first transcribe the input speech with an automatic speech recognition (ASR) model, followed using the transcriptions to generate the voice of the target with a text-to-speech (TTS) model. We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit, and the many well-configured pretrained models provided by the community. Official evaluation results show that our system comes out top among the participating systems in terms of conversion similarity, demonstrating the promising ability of seq2seq models to convert speaker identity. The implementation is made open-source at: https://github.com/espnet/espnet/tree/master/egs/vcc20.

Submitted 5 October, 2020; originally announced October 2020.

Comments: Accepted to the ISCA Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020

arXiv:2009.04035 [pdf]

Data Requests and Scenarios for Data Design of Unobserved Events in Corona-related Confusion Using TEEDA

Authors: Teruaki Hayashi, Nao Uehara, Daisuke Hase, Yukio Ohsawa

Abstract: Due to the global violence of the novel coronavirus, various industries have been affected and the breakdown between systems has been apparent. To understand and overcome the phenomenon related to this unprecedented crisis caused by the coronavirus infectious disease (COVID-19), the importance of data exchange and sharing across fields has gained social attention. In this study, we use the interactive platform called treasuring every encounter of data affairs (TEEDA) to externalize data requests from data users, which is a tool to exchange not only the information on data that can be provided but also the call for data, what data users want and for what purpose. Further, we analyze the characteristics of missing data in the corona-related confusion stemming from both the data requests and the providable data obtained in the workshop. We also create three scenarios for the data design of unobserved events focusing on variables.

Submitted 8 September, 2020; originally announced September 2020.

arXiv:2008.03088 [pdf, other]

Pretraining Techniques for Sequence-to-Sequence Voice Conversion

Authors: Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda

Abstract: Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. Nonetheless, without sufficient data, seq2seq VC models can suffer from unstable training and mispronunciation problems in the converted speech, thus far from practical. To tackle these shortcomings, we propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR). We argue that VC models initialized with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech. We apply such techniques to recurrent neural network (RNN)-based and Transformer based models, and through systematical experiments, we demonstrate the effectiveness of the pretraining scheme and the superiority of Transformer based models over RNN-based models in terms of intelligibility, naturalness, and similarity.

Submitted 7 August, 2020; originally announced August 2020.

Comments: Preprint. Under review

arXiv:2007.12955 [pdf, other]

doi 10.1109/TASLP.2021.3051765

Quasi-Periodic Parallel WaveGAN: A Non-autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

Authors: Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda

Abstract: In this paper, we propose a quasi-periodic parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a small-footprint GAN-based raw waveform generative model, whose generation time is much faster than real time because of its compact model and non-autoregressive (non-AR) and non-causal mechanisms. Although PWG achieves high-fidelity speech generation, the generic and simple network architecture lacks pitch controllability for an unseen auxiliary fundamental frequency ($F_{0}$) feature such as a scaled $F_{0}$. To improve the pitch controllability and speech modeling capability, we apply a QP structure with PDCNNs to PWG, which introduces pitch information to the network by dynamically changing the network architecture corresponding to the auxiliary $F_{0}$ feature. Both objective and subjective experimental results show that QPPWG outperforms PWG when the auxiliary $F_{0}$ feature is scaled. Moreover, analyses of the intermediate outputs of QPPWG also show better tractability and interpretability of QPPWG, which respectively models spectral and excitation-like signals using the cascaded fixed and adaptive blocks of the QP structure.

Submitted 19 February, 2021; v1 submitted 25 July, 2020; originally announced July 2020.

Comments: 15 pages, 10 figures, 8 tables

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 792-806, 2021

arXiv:2007.05663 [pdf, other]

doi 10.1109/TASLP.2021.3061245

Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

Authors: Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda

Abstract: In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the limited pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achieves high-fidelity audio waveform generation. However, the pure-data-driven nature and the lack of prior knowledge of audio signals degrade the pitch controllability of WN. For instance, it is difficult for WN to precisely generate the periodic components of audio signals when the given auxiliary fundamental frequency ($F_{0}$) features are outside the $F_{0}$ range observed in the training data. To address this problem, QPNet with two novel designs is proposed. First, the PDCNN component is applied to dynamically change the network architecture of WN according to the given auxiliary $F_{0}$ features. Second, a cascaded network structure is utilized to simultaneously model the long- and short-term dependencies of quasi-periodic signals such as speech. The performances of single-tone sinusoid and speech generations are evaluated. The experimental results show the effectiveness of the PDCNNs for unseen auxiliary $F_{0}$ features and the effectiveness of the cascaded structure for speech generation.

Submitted 27 March, 2021; v1 submitted 10 July, 2020; originally announced July 2020.

Comments: 15 pages, 12 figures, 11 tables

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1134-1148, 2021

arXiv:2005.11005 [pdf]

Modeling Stakeholder-centric Value Chain of Data to Understand Data Exchange Ecosystem

Authors: Teruaki Hayashi, Gensei Ishimura, Yukio Ohsawa

Abstract: In recent years, the expectation that new businesses and economic value can be created by combining/exchanging data from different fields has risen. However, value creation by data exchange involves not only data, but also technologies and a variety of stakeholders that are integrated and in competition with one another. This makes the data exchange ecosystem a challenging subject to study. In this paper, we propose a model describing the stakeholder-centric value chain (SVC) of data by focusing on the relationships among stakeholders in data businesses and discussing creative ways to use them. The SVC model enables the analysis and understanding of the structural characteristics of the data exchange ecosystem. We identified stakeholders who carry potential risk, those who play central roles in the ecosystem, and the distribution of profit among them using business models collected by the SVC.

Submitted 22 May, 2020; originally announced May 2020.

arXiv:2005.10603 [pdf]

Detecting and explaining changes in various assets' relationships in financial markets

Authors: Makoto Naraoka, Teruaki Hayashi, Takaaki Yoshino, Toshiaki Sugie, Kota Takano, Yukio Ohsawa

Abstract: We study the method for detecting relationship changes in financial markets and providing human-interpretable network visualization to support the decision-making of fund managers dealing with multi-assets. First, we construct co-occurrence networks with each asset as a node and a pair with a strong relationship in price change as an edge at each time step. Second, we calculate Graph-Based Entropy to represent the variety of price changes based on the network. Third, we apply the Differential Network to finance, which is traditionally used in the field of bioinformatics. By the method described above, we can visualize when and what kind of changes are occurring in the financial market, and which assets play a central role in changes in financial markets. Experiments with multi-asset time-series data showed results that were well fit with actual events while maintaining high interpretability. It is suggested that this approach is useful for fund managers to use as a new option for decision making.

Submitted 16 November, 2020; v1 submitted 21 May, 2020; originally announced May 2020.

Comments: 12 pages, 6 figures

arXiv:2005.08654 [pdf, other]

Quasi-Periodic Parallel WaveGAN Vocoder: A Non-autoregressive Pitch-dependent Dilated Convolution Model for Parametric Speech Generation

Authors: Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda

Abstract: In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a quasi-periodic (QP) architecture to improve the pitch controllability of PWG. PWG is a compact non-autoregressive (non-AR) speech generation model, whose generative speed is much faster than real time. While utilizing PWG as a vocoder to generate speech on the basis of acoustic features such as spectral and prosodic features, PWG generates high-fidelity speech. However, when the input acoustic features include unseen pitches, the pitch accuracy of PWG-generated speech degrades because of the fixed and generic network of PWG without prior knowledge of speech periodicity. The proposed QPPWG adopts a pitch-dependent dilated convolution network (PDCNN) module, which introduces the pitch information into PWG via the dynamically changed network architecture, to improve the pitch controllability and speech modeling capability of vanilla PWG. Both objective and subjective evaluation results show the higher pitch accuracy and comparable speech quality of QPPWG-generated speech when the QPPWG model size is only 70 % of that of vanilla PWG.

Submitted 6 August, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

Comments: 5 pages, 6 figures, 2 tables. Proc. Interspeech, 2020

arXiv:2005.05525 [pdf, other]

DiscreTalk: Text-to-Speech as a Machine Translation Problem

Authors: Tomoki Hayashi, Shinji Watanabe

Abstract: This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT). The proposed model consists of two components; a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model. The VQ-VAE model learns a mapping function from a speech waveform into a sequence of discrete symbols, and then the Transformer-NMT model is trained to estimate this discrete symbol sequence from a given input text. Since the VQ-VAE model can learn such a mapping in a fully-data-driven manner, we do not need to consider hyperparameters of the feature extraction required in the conventional E2E-TTS models. Thanks to the use of discrete symbols, we can use various techniques developed in NMT and automatic speech recognition (ASR) such as beam search, subword units, and fusions with a language model. Furthermore, we can avoid an over smoothing problem of predicted features, which is one of the common issues in TTS. The experimental evaluation with the JSUT corpus shows that the proposed method outperforms the conventional Transformer-TTS model with a non-autoregressive neural vocoder in naturalness, achieving the performance comparable to the reconstruction of the VQ-VAE model.

Submitted 11 May, 2020; originally announced May 2020.

Comments: Submitted to INTERSPEECH 2020. The demo is available on https://kan-bayashi.github.io/DiscreTalk/

arXiv:2005.04954 [pdf, other]

Propagation Graph Estimation from Individual's Time Series of Observed States

Authors: Tatsuya Hayashi, Atsuyoshi Nakamura

Abstract: Various things propagate through the medium of individuals. Some individuals follow the others and take the states similar to their states a small number of time steps later. In this paper, we study the problem of estimating the state propagation order of individuals from the real-valued state sequences of all the individuals. We propose a method to estimate the propagation direction between individuals by the sum of the time delay of one individual's state positions from the other individual's matched state position averaged over the minimum cost alignments and show how to calculate it efficiently. The propagation order estimated by our proposed method is demonstrated to be significantly more accurate than that by a baseline method for our synthetic datasets, and also to be consistent with visually recognizable propagation orders for the dataset of Japanese stock price time series and biological cell firing state sequences.

Submitted 30 July, 2021; v1 submitted 11 May, 2020; originally announced May 2020.

Comments: 22 pages, 10 figures

arXiv:2004.10234 [pdf, ps, other]

ESPnet-ST: All-in-One Speech Translation Toolkit

Authors: Hirofumi Inaguma, Shun Kiyono, Kevin Duh, Shigeki Karita, Nelson Enrique Yalta Soplin, Tomoki Hayashi, Shinji Watanabe

Abstract: We present ESPnet-ST, which is designed for the quick development of speech-to-speech translation systems in a single framework. ESPnet-ST is a new project inside end-to-end speech processing toolkit, ESPnet, which integrates or newly implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation. We provide all-in-one recipes including data pre-processing, feature extraction, training, and decoding pipelines for a wide range of benchmark datasets. Our reproducible results can match or even outperform the current state-of-the-art performances; these pre-trained models are downloadable. The toolkit is publicly available at https://github.com/espnet/espnet.

Submitted 30 September, 2020; v1 submitted 21 April, 2020; originally announced April 2020.

Comments: Accepted at ACL 2020 System Demonstration (update Table1, fix typo)

arXiv:2003.11750 [pdf]

doi 10.1109/ACCESS.2020.2984007

Non-parallel Voice Conversion System with WaveNet Vocoder and Collapsed Speech Suppression

Authors: Yi-Chiao Wu, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Hayashi, Tomoki Toda

Abstract: In this paper, we integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when combining the WN vocoder with a VC system, the distorted acoustic features, acoustic and temporal mismatches, and exposure bias usually lead to significant speech quality degradation, making WN generate some very noisy speech segments called collapsed speech. To tackle the problem, we take conventional-vocoder-generated speech as the reference speech to derive a linear predictive coding distribution constraint (LPCDC) to avoid the collapsed speech problem. Furthermore, to mitigate the negative effects introduced by the LPCDC, we propose a collapsed speech segment detector (CSSD) to ensure that the LPCDC is only applied to the problematic segments to limit the loss of quality to short periods. Objective and subjective evaluations are conducted, and the experimental results confirm the effectiveness of the proposed method, which further improves the speech quality of our previous non-parallel VC system submitted to Voice Conversion Challenge 2018.

Submitted 6 April, 2020; v1 submitted 26 March, 2020; originally announced March 2020.

Comments: 13 pages, 13 figures, 1 table, accepted to publish in IEEE Access

arXiv:2003.05109 [pdf]

Variable-Based Network Analysis of Datasets on Data Exchange Platforms

Authors: Teruaki Hayashi, Yukio Ohsawa

Abstract: Recently, data exchange platforms have emerged in the digital economy to enable better resource allocation in a data-driven society, which requires cross-organizational data collaborations. Understanding the characteristics of the data on these platforms is important for their application; however, the structures of such platforms have not been extensively investigated. In this study, we apply a network approach with a novel variable-based structural analysis to the metadata of datasets on two data platform services. It was noted that the structures of the data networks are locally dense and highly assortative, similar to human-related networks. Even though the data on these platforms are designed and collected differently, depending on the use objectives, the variables of heterogeneous data exhibit a power distribution, and the data networks exhibit multi-scaling behavior. Furthermore, we found that the data collection strategies of the platforms are related to the variety of variables, density of the networks, and their robustness from the viewpoint of sustainability and social acceptability of the data platforms.

Submitted 11 March, 2020; originally announced March 2020.

arXiv:2002.09581 [pdf]

Extracting and Validating Explanatory Word Archipelagoes using Dual Entropy

Authors: Yukio Ohsawa, Teruaki Hayashi

Abstract: The logical connectivity of text is represented by the connectivity of words that form archipelagoes. Here, each archipelago is a sequence of islands of the occurrences of a certain word. An island here means the local sequence of sentences where the word is emphasized, and an archipelago of a length comparable to the target text is extracted using the co-variation of entropy A (the window-based entropy) on the distribution of the word's occurrences with the width of each time window. Then, the logical connectivity of text is evaluated on entropy B (the graph-based entropy) computed on the distribution of sentences to connected word-clusters obtained on the co-occurrence of words. The results show the parts of the target text with words forming archipelagoes extracted on entropy A, without learned or prepared knowledge, form an explanatory part of the text that is of smaller entropy B than the parts extracted by the baseline methods.

Submitted 21 February, 2020; originally announced February 2020.

Comments: 7 pages, 2 figures, 2 columns

MSC Class: 68W32 (Primary) 68T50; 91F20 (Secondary)

arXiv:2002.00551 [pdf, other]

End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection

Authors: Takenori Yoshimura, Tomoki Hayashi, Kazuya Takeda, Shinji Watanabe

Abstract: This paper integrates a voice activity detection (VAD) function with end-to-end automatic speech recognition toward an online speech interface and transcribing very long audio recordings. We focus on connectionist temporal classification (CTC) and its extension of CTC/attention architectures. As opposed to an attention-based architecture, input-synchronous label prediction can be performed based on a greedy search with the CTC (pre-)softmax output. This prediction includes consecutive long blank labels, which can be regarded as a non-speech region. We use the labels as a cue for detecting speech segments with simple thresholding. The threshold value is directly related to the length of a non-speech region, which is more intuitive and easier to control than conventional VAD hyperparameters. Experimental results on unsegmented data show that the proposed method outperformed the baseline methods using the conventional energy-based and neural-network-based VAD methods and achieved an RTF less than 0.2. The proposed method is publicly available.

Submitted 14 February, 2020; v1 submitted 2 February, 2020; originally announced February 2020.

Comments: Submitted to ICASSP 2020
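
Illustrative sketch (not part of the arXiv record): the thresholding idea described in the abstract above, where long runs of CTC blank labels are treated as non-speech, might look roughly like the following. This is not the authors' implementation; the function name `speech_segments`, the blank index `0`, and the frame-count threshold `min_blank_run` are assumptions chosen for the example.

```python
from typing import List, Tuple


def speech_segments(
    labels: List[int], min_blank_run: int, blank: int = 0
) -> List[Tuple[int, int]]:
    """Split frame-level greedy CTC labels into speech segments.

    A run of at least `min_blank_run` consecutive blank labels is treated as
    a non-speech region (assumed rule); shorter blank runs stay inside the
    current segment. Returns half-open (start, end) frame index pairs.
    """
    segments: List[Tuple[int, int]] = []
    start = None           # first frame of the currently open segment
    last_non_blank = None  # most recent non-blank frame in that segment
    blank_run = 0          # length of the current run of blank frames
    for i, label in enumerate(labels):
        if label != blank:
            if start is None:
                start = i
            last_non_blank = i
            blank_run = 0
        else:
            blank_run += 1
            # Close the open segment once the blank run is long enough.
            if start is not None and blank_run >= min_blank_run:
                segments.append((start, last_non_blank + 1))
                start = None
    if start is not None:
        segments.append((start, last_non_blank + 1))
    return segments
```

With a threshold of three frames, a single blank between two labels does not split a segment, while a run of four blanks does. Because the threshold is counted in frames, it maps directly to a minimum pause duration once the frame shift is known.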

arXiv:2001.05624 [pdf]

doi 10.1007/s12652-020-02268-5

Cluster-based Zero-shot learning for multivariate data

Authors: Toshitaka Hayashi, Hamido Fujita

Abstract: Supervised learning requires a sufficient training dataset that includes all labels. However, there are cases in which some classes are not in the training data. Zero-Shot Learning (ZSL) is the task of predicting a class that is not in the training data (the target class). Existing ZSL methods are designed for image data. However, the zero-shot problem can arise for every data type. Hence, considering ZSL for other data types is required. In this paper, we propose the cluster-based ZSL method, which is a baseline method for multivariate binary classification problems. The proposed method is based on the assumption that if data is far from the training data, it is considered to belong to the target class. In training, clustering is done for the training data. In prediction, each data point is determined to belong to a cluster or not. If it does not belong to a cluster, it is predicted as the target class. The proposed method is evaluated and demonstrated using the KEEL dataset. This paper has been published in the Journal of Ambient Intelligence and Humanized Computing. The final version is available at the following URL: https://link.springer.com/article/10.1007/s12652-020-02268-5

Submitted 30 June, 2020; v1 submitted 15 January, 2020; originally announced January 2020.

Comments: J Ambient Intell Human Comput (2020)
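
Illustrative sketch (not part of the arXiv record): the train-then-check-membership procedure summarized in the abstract above could be approximated as below. The abstract does not name a clustering algorithm or a membership rule, so both are assumptions here: a small k-means with deterministic initialization, and "belongs to a cluster" meaning "within that cluster's largest training distance from its center". The names `fit_clusters` and `is_target_class` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def _dist(a: Point, b: Point) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _nearest(p: Point, centers: List[List[float]]) -> int:
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda j: _dist(p, centers[j]))


def fit_clusters(
    train: List[Point], k: int, iters: int = 20
) -> Tuple[List[List[float]], List[float]]:
    """Cluster the training data (assumed: k-means, first k points as init).

    Returns the cluster centers and, per cluster, the largest distance from
    the center to any training point assigned to it (the assumed radius).
    """
    centers = [list(p) for p in train[:k]]
    for _ in range(iters):
        groups: List[List[Point]] = [[] for _ in range(k)]
        for p in train:
            groups[_nearest(p, centers)].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    radii = [0.0] * k
    for p in train:
        j = _nearest(p, centers)
        radii[j] = max(radii[j], _dist(p, centers[j]))
    return centers, radii


def is_target_class(x: Point, centers: List[List[float]], radii: List[float]) -> bool:
    """Predict the unseen (target) class when x lies outside every cluster."""
    return all(_dist(x, c) > r for c, r in zip(centers, radii))
```

A point near either training group falls inside a cluster radius and is treated as a known class; a point far from both is predicted as the target class. Other membership rules (for example, a density-based one) would fit the same outline.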

arXiv:1912.06813 [pdf, other]

Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining

Authors: Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda

Abstract: We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Nonetheless, their data-hungry property and the mispronunciation of converted speech make seq2seq models far from practical. To this end, we propose a simple yet effective pretraining technique to transfer knowledge from learned TTS models, which benefit from large-scale, easily accessible TTS corpora. VC models initialized with such pretrained model parameters are able to generate effective hidden representations for high-fidelity, highly intelligible converted speech. Experimental results show that such a pretraining scheme can facilitate data-efficient training and outperform an RNN-based seq2seq VC model in terms of intelligibility, naturalness, and similarity.

Submitted 14 December, 2019; originally announced December 2019.

Comments: Preprint. Work in progress

arXiv:1910.11033 [pdf, other]

Assisting human experts in the interpretation of their visual process: A case study on assessing copper surface adhesive potency

Authors: Tristan Hascoet, Xuejiao Deng, Kiyoto Tai, Mari Sugiyama, Yuji Adachi, Sachiko Nakamura, Yasuo Ariki, Tomoko Hayashi, Tetusya Takiguchi

Abstract: Deep Neural Networks are often thought to lack interpretability due to the distributed nature of their internal representations. In contrast, humans can generally justify, in natural language, their answer to a visual question with simple common sense reasoning. However, human introspection abilities have their own limits, as one often struggles to justify the recognition process behind our lowest-level feature recognition ability: for instance, it is difficult to precisely explain why a given texture seems more characteristic of the surface of a fingernail than of a plastic bottle. In this paper, we showcase an application in which deep learning models can actually help human experts justify their own low-level visual recognition process: We study the problem of assessing the adhesive potency of copper sheets from microscopic pictures of their surface. Although highly trained material experts are able to qualitatively assess the surface adhesive potency, they are often unable to precisely justify their decision process. We present a model that, under careful design considerations, is able to provide visual clues for human experts to understand and justify their own recognition process. Not only can our model assist human experts in their interpretation of the surface characteristics, we show how this model can be used to test different hypotheses of the copper surface response to different manufacturing processes.

Submitted 24 October, 2019; originally announced October 2019.

arXiv:1910.10909 [pdf, ps, other]

ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit

Authors: Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, Xu Tan

Abstract: This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named ESPnet-TTS, which is an extension of the open-source speech processing toolkit ESPnet. The toolkit supports state-of-the-art E2E-TTS models, including Tacotron 2, Transformer TTS, and FastSpeech, and also provides recipes inspired by the Kaldi automatic speech recognition (ASR) toolkit. The recipes are based on the design unified with the ESPnet ASR recipe, providing high reproducibility. The toolkit also provides pre-trained models and samples of all of the recipes so that users can use it as a baseline. Furthermore, the unified design enables the integration of ASR functions with TTS, e.g., ASR-based objective evaluation and semi-supervised learning with both ASR and TTS models. This paper describes the design of the toolkit and experimental evaluation in comparison with other toolkits. The experimental results show that our models can achieve state-of-the-art performance comparable to the other latest toolkits, resulting in a mean opinion score (MOS) of 4.25 on the LJSpeech dataset. The toolkit is publicly available at https://github.com/espnet/espnet.

Submitted 16 February, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

Comments: Accepted to ICASSP2020. Demo HP: https://espnet.github.io/icassp2020-tts/

arXiv:1909.06317 [pdf, other]

doi 10.1109/ASRU46091.2019.9003750

A Comparative Study on Transformer vs RNN in Speech Applications

Authors: Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang

Abstract: Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed Transformer and conventional recurrent neural networks (RNN) in a total of 15 ASR, one multilingual ASR, one ST, and two TTS benchmarks. Our experiments revealed various training tips and significant performance benefits obtained with Transformer for each task including the surprising superiority of Transformer in 13/15 ASR benchmarks in comparison with RNN. We are preparing to release Kaldi-style reproducible recipes using open source and publicly available datasets for all the ASR, ST, and TTS tasks for the community to succeed our exciting outcomes.

Submitted 28 September, 2019; v1 submitted 13 September, 2019; originally announced September 2019.

Comments: Accepted at ASRU 2019

Journal ref: IEEE Automatic Speech Recognition and Understanding Workshop 2019

arXiv:1907.10185 [pdf, ps, other]

Non-Parallel Voice Conversion with Cyclic Variational Autoencoder

Authors: Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

Abstract: In this paper, we present a novel technique for non-parallel voice conversion (VC) with the use of cyclic variational autoencoder (CycleVAE)-based spectral modeling. In a variational autoencoder (VAE) framework, a latent space, usually with a Gaussian prior, is used to encode a set of input features. In a VAE-based VC, the encoded latent features are fed into a decoder, along with speaker-coding features, to generate estimated spectra with either the original speaker identity (reconstructed) or another speaker identity (converted). Due to the non-parallel modeling condition, the converted spectra cannot be directly optimized, which heavily degrades the performance of a VAE-based VC. In this work, to overcome this problem, we propose to use a CycleVAE-based spectral model that indirectly optimizes the conversion flow by recycling the converted features back into the system to obtain corresponding cyclic reconstructed spectra that can be directly optimized. The cyclic flow can be continued by using the cyclic reconstructed features as input for the next cycle. The experimental results demonstrate the effectiveness of the proposed CycleVAE-based VC, which yields higher accuracy of converted spectra, generates latent features with higher correlation degree, and significantly improves the quality and conversion accuracy of the converted speech.

Submitted 23 July, 2019; originally announced July 2019.

Comments: Accepted to INTERSPEECH 2019

arXiv:1907.08940 [pdf]

Statistical Voice Conversion with Quasi-Periodic WaveNet Vocoder

Authors: Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

Abstract: In this paper, we investigate the effectiveness of a quasi-periodic WaveNet (QPNet) vocoder combined with a statistical spectral conversion technique for a voice conversion task. The WaveNet (WN) vocoder has been applied as the waveform generation module in many different voice conversion frameworks and achieves significant improvement over conventional vocoders. However, because of the fixed dilated convolution and generic network architecture, the WN vocoder lacks robustness against unseen input features and often requires a huge network size to achieve acceptable speech quality. Such limitations usually lead to performance degradation in the voice conversion task. To overcome this problem, the QPNet vocoder is applied, which includes a pitch-dependent dilated convolution component to enhance the pitch controllability and attain a more compact network than the WN vocoder. In the proposed method, input spectral features are first converted using a framewise deep neural network, and then the QPNet vocoder generates converted speech conditioned on the linearly converted prosodic and transformed spectral features. The experimental results confirm that the QPNet vocoder achieves significantly better performance than the same-size WN vocoder while maintaining comparable speech quality to the double-size WN vocoder. Index Terms: WaveNet, vocoder, voice conversion, pitch-dependent dilated convolution, pitch controllability

Submitted 22 March, 2020; v1 submitted 21 July, 2019; originally announced July 2019.

Comments: 6 pages, 7 figures, Proc. SSW10, 2019

arXiv:1907.00797 [pdf]

Quasi-Periodic WaveNet Vocoder: A Pitch Dependent Dilated Convolution Model for Parametric Speech Generation

Authors: Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda

Abstract: In this paper, we propose a quasi-periodic neural network (QPNet) vocoder with a novel network architecture named pitch-dependent dilated convolution (PDCNN) to improve the pitch controllability of WaveNet (WN) vocoder. The effectiveness of the WN vocoder to generate high-fidelity speech samples from given acoustic features has been proved recently. However, because of the fixed dilated convolution and generic network architecture, the WN vocoder hardly generates speech with given F0 values which are outside the range observed in training data. Consequently, the WN vocoder lacks the pitch controllability which is one of the essential capabilities of conventional vocoders. To address this limitation, we propose the PDCNN component which has the time-variant adaptive dilation size related to the given F0 values and a cascade network structure of the QPNet vocoder to generate quasi-periodic signals such as speech. Both objective and subjective tests are conducted, and the experimental results demonstrate the better pitch controllability of the QPNet vocoder compared to the same and double sized WN vocoders while attaining comparable speech qualities. Index Terms: WaveNet, vocoder, quasi-periodic signal, pitch-dependent dilated convolution, pitch controllability

Submitted 22 March, 2020; v1 submitted 1 July, 2019; originally announced July 2019.

Comments: 5 pages, 4 figures, Proc. Interspeech, 2019
