Showing 1–50 of 59 results for author: Hayashi, T

Searching in archive cs.
  1. arXiv:2404.17931  [pdf]

    cs.LG cs.CV

    Critical Review for One-class Classification: recent advances and the reality behind them

    Authors: Toshitaka Hayashi, Dalibor Cimr, Hamido Fujita, Richard Cimler

    Abstract: This paper offers a comprehensive review of one-class classification (OCC), examining the technologies and methodologies employed in its implementation. It delves into various approaches utilized for OCC across diverse data types, such as feature data, image, video, time series, and others. Through a systematic review, this paper synthesizes prominent strategies used in OCC from its inception to…

    Submitted 27 April, 2024; originally announced April 2024.

  2. arXiv:2304.04596  [pdf, other]

    cs.SD cs.CL eess.AS

    ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

    Authors: Brian Yan, Jiatong Shi, Yun Tang, Hirofumi Inaguma, Yifan Peng, Siddharth Dalmia, Peter Polák, Patrick Fernandes, Dan Berrebbi, Tomoki Hayashi, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Soumi Maiti, Juan Pino, Shinji Watanabe

    Abstract: ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST) -- each task is supported with a wide variety of approaches, differentiating ESPnet-…

    Submitted 6 July, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

    Comments: ACL 2023; System Demonstration

  3. arXiv:2302.09189  [pdf]

    cs.CL

    Extraction of Constituent Factors of Digestion Efficiency in Information Transfer by Media Composed of Texts and Images

    Authors: Koike Hiroaki, Teruaki Hayashi

    Abstract: The development and spread of information and communication technologies have increased and diversified information. However, the increase in the volume and the selection of information does not necessarily promote understanding. In addition, conventional evaluations of information transfer have focused only on the arrival of information to the receivers. They need to sufficiently take into accoun…

    Submitted 17 February, 2023; originally announced February 2023.

    Comments: This paper is the revised version of the 29th annual conference of the Natural Language Processing Society in Japan, in Japanese language

  4. arXiv:2301.09099  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study

    Authors: Massa Baali, Tomoki Hayashi, Hamdy Mubarak, Soumi Maiti, Shinji Watanabe, Wassim El-Hajj, Ahmed Ali

    Abstract: Several high-resource Text to Speech (TTS) systems currently produce natural, well-established human-like speech. In contrast, low-resource languages, including Arabic, have very limited TTS systems due to the lack of resources. We propose a fully unsupervised method for building TTS, including automatic data selection and pre-training/fine-tuning strategies for TTS training, using broadcast news…

    Submitted 26 January, 2023; v1 submitted 22 January, 2023; originally announced January 2023.

  5. arXiv:2212.13157  [pdf, other]

    cs.LG

    Gaussian Process Classification Bandits

    Authors: Tatsuya Hayashi, Naoki Ito, Koji Tabata, Atsuyoshi Nakamura, Katsumasa Fujita, Yoshinori Harada, Tamiki Komatsuzaki

    Abstract: Classification bandits are multi-armed bandit problems whose task is to classify a given set of arms into either positive or negative class depending on whether the rate of the arms with the expected reward of at least h is not less than w for given thresholds h and w. We study a special classification bandit problem in which arms correspond to points x in d-dimensional real space with expected re…

    Submitted 26 December, 2022; originally announced December 2022.

  6. arXiv:2207.04356  [pdf, other]

    cs.SD cs.LG eess.AS

    A Comparative Study of Self-supervised Speech Representation Based Voice Conversion

    Authors: Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, Tomoki Toda

    Abstract: We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC). In the context of recognition-synthesis VC, S3Rs are attractive owing to their potential to replace expensive supervised representations such as phonetic posteriorgrams (PPGs), which are commonly adopted by state-of-the-art VC systems. Using S3PRL-VC, an open-source VC software we…

    Submitted 9 July, 2022; originally announced July 2022.

    Comments: Accepted to IEEE Journal of Selected Topics in Signal Processing. arXiv admin note: substantial text overlap with arXiv:2110.06280

  7. arXiv:2206.05929  [pdf, other]

    cs.SD eess.AS

    Improvement of Serial Approach to Anomalous Sound Detection by Incorporating Two Binary Cross-Entropies for Outlier Exposure

    Authors: Ibuki Kuroyanagi, Tomoki Hayashi, Kazuya Takeda, Tomoki Toda

    Abstract: Anomalous sound detection systems must detect unknown, atypical sounds using only normal audio data. Conventional methods use the serial method, a combination of outlier exposure (OE), which classifies normal and pseudo-anomalous data and obtains embedding, and inlier modeling (IM), which models the probability distribution of the embedding. Although the serial method shows high performance due to…

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: 5 pages, 3 figures, 3 tables, EUSIPCO 2022

  8. arXiv:2205.04029  [pdf, other]

    cs.SD cs.MM eess.AS

    Muskits: an End-to-End Music Processing Toolkit for Singing Voice Synthesis

    Authors: Jiatong Shi, Shuai Guo, Tao Qian, Nan Huo, Tomoki Hayashi, Yuning Wu, Frank Xu, Xuankai Chang, Huazhe Li, Peter Wu, Shinji Watanabe, Qin Jin

    Abstract: This paper introduces a new open-source platform named Muskits for end-to-end music processing, which mainly focuses on end-to-end singing voice synthesis (E2E-SVS). Muskits supports state-of-the-art SVS models, including RNN SVS, transformer SVS, and XiaoiceSing. The design of Muskits follows the style of widely-used speech processing toolkits, ESPnet and Kaldi, for data preprocessing, training,…

    Submitted 2 July, 2022; v1 submitted 9 May, 2022; originally announced May 2022.

    Comments: Accepted by Interspeech

  9. arXiv:2204.13378  [pdf, other]

    cs.AI cs.LG

    Learning General Inventory Management Policy for Large Supply Chain Network

    Authors: Soh Kumabe, Shinya Shiroshita, Takanori Hayashi, Shirou Maruyama

    Abstract: Inventory management in warehouses directly affects profits made by manufacturers. Particularly, large manufacturers produce a very large variety of products that are handled by a significantly large number of retailers. In such a case, the computational complexity of classical inventory management algorithms is inordinately large. In recent years, learning-based approaches have become popular for…

    Submitted 28 April, 2022; originally announced April 2022.

    Comments: 9 pages, OPTLearnMAS 2022

  10. Acoustic Event Detection with Classifier Chains

    Authors: Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi

    Abstract: This paper proposes acoustic event detection (AED) with classifier chains, a new classifier based on the probabilistic chain rule. The proposed AED with classifier chains consists of a gated recurrent unit and performs iterative binary detection of each event one by one. In each iteration, the event's activity is estimated and used to condition the next output based on the probabilistic chain rule…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: 5pages, presented at Interspeech2021

  11. arXiv:2112.09382  [pdf, other]

    cs.SD cs.LG eess.AS

    Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem

    Authors: Jing Shi, Xuankai Chang, Tomoki Hayashi, Yen-Ju Lu, Shinji Watanabe, Bo Xu

    Abstract: Deep learning based models have significantly improved the performance of speech separation with input mixtures like the cocktail party. Prominent methods (e.g., frequency-domain and time-domain speech separation) usually build regression models to predict the ground-truth speech from the mixture, using the masking-based design and the signal-level loss criterion (e.g., MSE or SI-SNR). This study…

    Submitted 9 January, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

    Comments: 5 pages, https://shincling.github.io/discreteSeparation/

  12. arXiv:2111.12460  [pdf, other]

    cs.CV cs.LG stat.ML

    ViCE: Improving Dense Representation Learning by Superpixelization and Contrasting Cluster Assignment

    Authors: Robin Karlsson, Tomoki Hayashi, Keisuke Fujii, Alexander Carballo, Kento Ohtani, Kazuya Takeda

    Abstract: Recent self-supervised models have demonstrated equal or better performance than supervised methods, opening for AI systems to learn visual representations from practically unlimited data. However, these methods are typically classification-based and thus ineffective for learning high-resolution feature maps that preserve precise spatial information. This work introduces superpixels to improve sel…

    Submitted 7 October, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

    Comments: Accepted for BMVC 2022

    ACM Class: I.2.10; I.2.9

  13. arXiv:2111.04505  [pdf]

    cs.LG

    Feature Concepts for Data Federative Innovations

    Authors: Yukio Ohsawa, Sae Kondo, Teruaki Hayashi

    Abstract: A feature concept, the essence of the data-federative innovation process, is presented as a model of the concept to be acquired from data. A feature concept may be a simple feature, such as a single variable, but is more likely to be a conceptual illustration of the abstract information to be obtained from the data. For example, trees and clusters are feature concepts for decision tree learning an…

    Submitted 5 November, 2021; originally announced November 2021.

    Comments: 13 pages, 7 figures

  14. arXiv:2110.07840  [pdf, other]

    cs.CL cs.SD eess.AS

    ESPnet2-TTS: Extending the Edge of TTS Research

    Authors: Tomoki Hayashi, Ryuichi Yamamoto, Takenori Yoshimura, Peter Wu, Jiatong Shi, Takaaki Saeki, Yooncheol Ju, Yusuke Yasuda, Shinnosuke Takamichi, Shinji Watanabe

    Abstract: This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance T…

    Submitted 14 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP2022. Demo HP: https://espnet.github.io/icassp2022-tts/

  15. arXiv:2110.06280  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    S3PRL-VC: Open-source Voice Conversion Framework with Self-supervised Speech Representations

    Authors: Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, Hung-Yi Lee, Shinji Watanabe, Tomoki Toda

    Abstract: This paper introduces S3PRL-VC, an open-source voice conversion (VC) framework based on the S3PRL toolkit. In the context of recognition-synthesis VC, self-supervised speech representation (S3R) is valuable in its potential to replace the expensive supervised representation adopted by state-of-the-art VC systems. Moreover, we claim that VC is a good probing task for S3R analysis. In this work, we…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022. Code available at: https://github.com/s3prl/s3prl/tree/master/s3prl/downstream/a2o-vc-vcc2020

  16. arXiv:2107.11929  [pdf]

    cs.CY

    Collaborative Problem Solving on a Data Platform Kaggle

    Authors: Teruaki Hayashi, Takumi Shimizu, Yoshiaki Fukami

    Abstract: Data exchange across different domains has gained much attention as a way of creating new businesses and improving the value of existing services. Data exchange ecosystem is developed by platform services that facilitate data and knowledge exchange and offer co-creation environments for organizations to promote their problem-solving. In this study, we investigate Kaggle, a data analysis competitio…

    Submitted 25 July, 2021; originally announced July 2021.

    Comments: This paper is the English-translated version of "Collaborative Problem Solving on a Data Platform Kaggle" IEICE Tech. Rep., vol.120, no.362, pp.37-40, 2021. (https://www.ieice.org/ken/paper/20210212XCc9/eng/)

  17. arXiv:2107.09477  [pdf, other]

    cs.SD cs.CL eess.AS

    On Prosody Modeling for ASR+TTS based Voice Conversion

    Authors: Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda

    Abstract: In voice conversion (VC), an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents; these are then used as input by a text-to-speech (TTS) system to generate the converted speech. Such a paradigm, referred to as ASR+TTS, overlooks the…

    Submitted 20 July, 2021; originally announced July 2021.

    Comments: Submitted to ASRU2021. Under review

  18. arXiv:2106.06151  [pdf, other]

    cs.SD cs.LG eess.AS

    Anomalous Sound Detection Using a Binary Classification Model and Class Centroids

    Authors: Ibuki Kuroyanagi, Tomoki Hayashi, Kazuya Takeda, Tomoki Toda

    Abstract: An anomalous sound detection system to detect unknown anomalous sounds usually needs to be built using only normal sound data. Moreover, it is desirable to improve the system by effectively using a small amount of anomalous sound data, which will be accumulated through the system's operation. As one of the methods to meet these requirements, we focus on a binary classification model that is develo…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: 6 pages, 2 figures, 2 tables, EUSIPCO2021

  19. arXiv:2104.09754  [pdf]

    cs.CV

    Hierarchical entropy and domain interaction to understand the structure in an image

    Authors: Nao Uehara, Teruaki Hayashi, Yukio Ohsawa

    Abstract: In this study, we devise a model that introduces two hierarchies into information entropy. The two hierarchies are the size of the region for which entropy is calculated and the size of the component that determines whether the structures in the image are integrated or not. And this model uses two indicators, hierarchical entropy and domain interaction. Both indicators increase or decrease due to…

    Submitted 20 April, 2021; originally announced April 2021.

    Comments: 20pages,17figures

  20. arXiv:2104.09719  [pdf]

    physics.med-ph cs.SI physics.soc-ph

    Effects of Interregional Travels and Vaccination in Infection Spreads Simulated by Lattice of SEIRS Circuits

    Authors: Yukio Ohsawa, Teruaki Hayashi, Sae Kondo

    Abstract: The SEIRS model, an extension of the SEIR model for analyzing and predicting the spread of virus infection, was further extended to consider the movement of people across regions. In contrast to previous models that consider the risk of travelers from/to other regions, we consider two factors. First, we consider the movements of susceptible (S), exposed (E), and recovered (R) individuals who may…

    Submitted 30 June, 2021; v1 submitted 19 April, 2021; originally announced April 2021.

    Comments: 15 pages, one Table, 6 figures, to be submitted to a journal soon, on the way to choose the suitable journal

    ACM Class: I.6.5; I.6.6

  21. arXiv:2104.06793  [pdf, other]

    cs.SD cs.CL eess.AS

    Non-autoregressive sequence-to-sequence voice conversion

    Authors: Tomoki Hayashi, Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda

    Abstract: This paper proposes a novel voice conversion (VC) method based on non-autoregressive sequence-to-sequence (NAR-S2S) models. Inspired by the great success of NAR-S2S models such as FastSpeech in text-to-speech (TTS), we extend the FastSpeech2 model for the VC problem. We introduce the convolution-augmented Transformer (Conformer) instead of the Transformer, making it possible to capture both local…

    Submitted 14 April, 2021; originally announced April 2021.

    Comments: Accepted to ICASSP2021. Demo HP: https://kan-bayashi.github.io/NonARSeq2SeqVC/

  22. arXiv:2103.02858  [pdf, ps, other]

    eess.AS cs.SD

    crank: An Open-Source Software for Nonparallel Voice Conversion Based on Vector-Quantized Variational Autoencoder

    Authors: Kazuhiro Kobayashi, Wen-Chin Huang, Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Tomoki Toda

    Abstract: In this paper, we present an open-source software for developing a nonparallel voice conversion (VC) system named crank. Although we have released an open-source VC software based on the Gaussian mixture model named sprocket in the last VC Challenge, it is not straightforward to apply any speech corpus because it is necessary to prepare parallel utterances of source and target speakers to model a…

    Submitted 4 March, 2021; originally announced March 2021.

    Comments: Accepted to ICASSP 2021

  23. arXiv:2012.13006  [pdf, other]

    eess.AS cs.SD

    The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans

    Authors: Shinji Watanabe, Florian Boyer, Xuankai Chang, Pengcheng Guo, Tomoki Hayashi, Yosuke Higuchi, Takaaki Hori, Wen-Chin Huang, Hirofumi Inaguma, Naoyuki Kamo, Shigeki Karita, Chenda Li, Jing Shi, Aswin Shanmugam Subramanian, Wangyou Zhang

    Abstract: This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. Now ESPnet also includes text…

    Submitted 23 December, 2020; originally announced December 2020.

  24. arXiv:2012.11746  [pdf]

    cs.CY

    Data Combination for Problem-solving: A Case of an Open Data Exchange Platform

    Authors: Teruaki Hayashi, Hiroki Sakaji, Hiroyasu Matsushima, Yoshiaki Fukami, Takumi Shimizu, Yukio Ohsawa

    Abstract: In recent years, rather than enclosing data within a single organization, exchanging and combining data from different domains has become an emerging practice. Many studies have discussed the economic and utility value of data and data exchange, but the characteristics of data that contribute to problem solving through data combination have not been fully understood. In big data and interdisciplin…

    Submitted 21 December, 2020; originally announced December 2020.

  25. ESPnet-se: end-to-end speech enhancement and separation toolkit designed for asr integration

    Authors: Chenda Li, Jing Shi, Wangyou Zhang, Aswin Shanmugam Subramanian, Xuankai Chang, Naoyuki Kamo, Moto Hira, Tomoki Hayashi, Christoph Boeddeker, Zhuo Chen, Shinji Watanabe

    Abstract: We present ESPnet-SE, which is designed for the quick development of speech enhancement and speech separation systems in a single framework, along with the optional downstream speech recognition module. ESPnet-SE is a new project which integrates rich automatic speech recognition related models, resources and systems to support and validate the proposed front-end implementation (i.e. speech enhanc…

    Submitted 7 November, 2020; originally announced November 2020.

    Comments: Accepted by SLT 2021

  26. arXiv:2010.13956  [pdf, other]

    eess.AS cs.SD

    Recent Developments on ESPnet Toolkit Boosted by Conformer

    Authors: Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, Yuekai Zhang

    Abstract: In this study, we present recent developments on ESPnet: End-to-End Speech Processing toolkit, which mainly involves a recently proposed architecture called Conformer, Convolution-augmented Transformer. This paper shows the results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translations (ST), speech separation (SS) and text-to-…

    Submitted 29 October, 2020; v1 submitted 26 October, 2020; originally announced October 2020.

  27. arXiv:2010.12231  [pdf, other]

    eess.AS cs.CL cs.SD

    Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

    Authors: Wen-Chin Huang, Yi-Chiao Wu, Tomoki Hayashi, Tomoki Toda

    Abstract: We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation that was learned from massive unlabeled data, which is assumed to be speaker-independent and well…

    Submitted 23 October, 2020; originally announced October 2020.

    Comments: Submitted to ICASSP 2021

  28. arXiv:2010.02434  [pdf, other]

    eess.AS cs.CL cs.SD

    The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS

    Authors: Wen-Chin Huang, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda

    Abstract: This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020. We consider a naive approach for voice conversion (VC), which is to first transcribe the input speech with an automatic speech recognition (ASR) model, followed using the transcriptions to generate the voice of the target with a text-to-speech (TTS) model. We revisit this method un…

    Submitted 5 October, 2020; originally announced October 2020.

    Comments: Accepted to the ISCA Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020

  29. arXiv:2009.04035  [pdf]

    cs.HC cs.CY

    Data Requests and Scenarios for Data Design of Unobserved Events in Corona-related Confusion Using TEEDA

    Authors: Teruaki Hayashi, Nao Uehara, Daisuke Hase, Yukio Ohsawa

    Abstract: Due to the global violence of the novel coronavirus, various industries have been affected and the breakdown between systems has been apparent. To understand and overcome the phenomenon related to this unprecedented crisis caused by the coronavirus infectious disease (COVID-19), the importance of data exchange and sharing across fields has gained social attention. In this study, we use the interac…

    Submitted 8 September, 2020; originally announced September 2020.

  30. arXiv:2008.03088  [pdf, other]

    eess.AS cs.CL cs.SD

    Pretraining Techniques for Sequence-to-Sequence Voice Conversion

    Authors: Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda

    Abstract: Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. Nonetheless, without sufficient data, seq2seq VC models can suffer from unstable training and mispronunciation problems in the converted speech, thus far from practical. To tackle these shortcomings, we propose to transfer knowledge from other speech processing tasks where large-sc…

    Submitted 7 August, 2020; originally announced August 2020.

    Comments: Preprint. Under review

  31. Quasi-Periodic Parallel WaveGAN: A Non-autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

    Authors: Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda

    Abstract: In this paper, we propose a quasi-periodic parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a small-footprint GAN-based raw waveform generative model, whose generation time is much faster than real time because of its compact model and non-autoregre…

    Submitted 19 February, 2021; v1 submitted 25 July, 2020; originally announced July 2020.

    Comments: 15 pages, 10 figures, 8 tables

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 792-806, 2021

  32. Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

    Authors: Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda

    Abstract: In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the limited pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achieves high-fidelity audio waveform generatio…

    Submitted 27 March, 2021; v1 submitted 10 July, 2020; originally announced July 2020.

    Comments: 15 pages, 12 figures, 11 tables

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1134-1148, 2021

  33. arXiv:2005.11005  [pdf]

    cs.CY

    Modeling Stakeholder-centric Value Chain of Data to Understand Data Exchange Ecosystem

    Authors: Teruaki Hayashi, Gensei Ishimura, Yukio Ohsawa

    Abstract: In recent years, the expectation that new businesses and economic value can be created by combining/exchanging data from different fields has risen. However, value creation by data exchange involves not only data, but also technologies and a variety of stakeholders that are integrated and in competition with one another. This makes the data exchange ecosystem a challenging subject to study. In thi…

    Submitted 22 May, 2020; originally announced May 2020.

  34. arXiv:2005.10603  [pdf]

    q-fin.GN cs.CE

    Detecting and explaining changes in various assets' relationships in financial markets

    Authors: Makoto Naraoka, Teruaki Hayashi, Takaaki Yoshino, Toshiaki Sugie, Kota Takano, Yukio Ohsawa

    Abstract: We study the method for detecting relationship changes in financial markets and providing human-interpretable network visualization to support the decision-making of fund managers dealing with multi-assets. First, we construct co-occurrence networks with each asset as a node and a pair with a strong relationship in price change as an edge at each time step. Second, we calculate Graph-Based Entropy…

    Submitted 16 November, 2020; v1 submitted 21 May, 2020; originally announced May 2020.

    Comments: 12 pages, 6 figures

  35. arXiv:2005.08654  [pdf, other]

    eess.AS cs.SD

    Quasi-Periodic Parallel WaveGAN Vocoder: A Non-autoregressive Pitch-dependent Dilated Convolution Model for Parametric Speech Generation

    Authors: Yi-Chiao Wu, Tomoki Hayashi, Takuma Okamoto, Hisashi Kawai, Tomoki Toda

    Abstract: In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a quasi-periodic (QP) architecture to improve the pitch controllability of PWG. PWG is a compact non-autoregressive (non-AR) speech generation model, whose generative speed is much faster than real time. While utilizing PWG as a vocoder to generate speech on the basis of acoustic features such as spectral and prosodic feat…

    Submitted 6 August, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

    Comments: 5 page, 6 figures, 2 tables. Proc. Interspeech, 2020

  36. arXiv:2005.05525  [pdf, other]

    cs.CL cs.SD eess.AS

    DiscreTalk: Text-to-Speech as a Machine Translation Problem

    Authors: Tomoki Hayashi, Shinji Watanabe

    Abstract: This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT). The proposed model consists of two components; a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model. The VQ-VAE model learns a mapping function from a speech waveform into a sequence of discrete symbols, and then the Tran…

    Submitted 11 May, 2020; originally announced May 2020.

    Comments: Submitted to INTERSPEECH 2020. The demo is available on https://kan-bayashi.github.io/DiscreTalk/

  37. arXiv:2005.04954  [pdf, other]

    cs.AI cs.LG q-bio.QM

    Propagation Graph Estimation from Individual's Time Series of Observed States

    Authors: Tatsuya Hayashi, Atsuyoshi Nakamura

    Abstract: Various things propagate through the medium of individuals. Some individuals follow the others and take the states similar to their states a small number of time steps later. In this paper, we study the problem of estimating the state propagation order of individuals from the real-valued state sequences of all the individuals. We propose a method to estimate the propagation direction between indiv…

    Submitted 30 July, 2021; v1 submitted 11 May, 2020; originally announced May 2020.

    Comments: 22 pages, 10 figures

  38. arXiv:2004.10234  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    ESPnet-ST: All-in-One Speech Translation Toolkit

    Authors: Hirofumi Inaguma, Shun Kiyono, Kevin Duh, Shigeki Karita, Nelson Enrique Yalta Soplin, Tomoki Hayashi, Shinji Watanabe

    Abstract: We present ESPnet-ST, which is designed for the quick development of speech-to-speech translation systems in a single framework. ESPnet-ST is a new project inside end-to-end speech processing toolkit, ESPnet, which integrates or newly implements automatic speech recognition, machine translation, and text-to-speech functions for speech translation. We provide all-in-one recipes including data pre-p…

    Submitted 30 September, 2020; v1 submitted 21 April, 2020; originally announced April 2020.

    Comments: Accepted at ACL 2020 System Demonstration (update Table1, fix typo)

  39. Non-parallel Voice Conversion System with WaveNet Vocoder and Collapsed Speech Suppression

    Authors: Yi-Chiao Wu, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Hayashi, Tomoki Toda

    Abstract: In this paper, we integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms on the basis of acoustic features has been confirmed in recent works. However, when combining the WN vocoder with a VC system, the distorted acoustic featu…

    Submitted 6 April, 2020; v1 submitted 26 March, 2020; originally announced March 2020.

    Comments: 13 pages, 13 figures, 1 table, accepted to publish in IEEE Access

  40. arXiv:2003.05109  [pdf]

    cs.SI cs.CY

    Variable-Based Network Analysis of Datasets on Data Exchange Platforms

    Authors: Teruaki Hayashi, Yukio Ohsawa

    Abstract: Recently, data exchange platforms have emerged in the digital economy to enable better resource allocation in a data-driven society, which requires cross-organizational data collaborations. Understanding the characteristics of the data on these platforms is important for their application; however, the structures of such platforms have not been extensively investigated. In this study, we apply a n…

    Submitted 11 March, 2020; originally announced March 2020.

  41. arXiv:2002.09581  [pdf

    cs.CL

    Extracting and Validating Explanatory Word Archipelagoes using Dual Entropy

    Authors: Yukio Ohsawa, Teruaki Hayashi

    Abstract: The logical connectivity of text is represented by the connectivity of words that form archipelagoes. Here, each archipelago is a sequence of islands of the occurrences of a certain word. An island here means the local sequence of sentences where the word is emphasized, and an archipelago of a length comparable to the target text is extracted using the co-variation of entropy A (the window-based e…

    Submitted 21 February, 2020; originally announced February 2020.

    Comments: 7 pages, 2 figures, 2 columns

    MSC Class: 68W32 (Primary) 68T50; 91F20 (Secondary)

  42. arXiv:2002.00551  [pdf, other

    eess.AS cs.CL cs.SD

    End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection

    Authors: Takenori Yoshimura, Tomoki Hayashi, Kazuya Takeda, Shinji Watanabe

    Abstract: This paper integrates a voice activity detection (VAD) function with end-to-end automatic speech recognition toward an online speech interface and transcribing very long audio recordings. We focus on connectionist temporal classification (CTC) and its extension of CTC/attention architectures. As opposed to an attention-based architecture, input-synchronous label prediction can be performed based o…

    Submitted 14 February, 2020; v1 submitted 2 February, 2020; originally announced February 2020.

    Comments: Submitted to ICASSP 2020

  43. Cluster-based Zero-shot learning for multivariate data

    Authors: Toshitaka Hayashi, Hamido Fujita

    Abstract: Supervised learning requires a sufficient training dataset that includes all labels. However, there are cases in which some classes are not in the training data. Zero-Shot Learning (ZSL) is the task of predicting a class that is not in the training data (the target class). Existing ZSL methods are designed for image data. However, the zero-shot problem can arise for every data type. Hence, considering ZSL for…

    Submitted 30 June, 2020; v1 submitted 15 January, 2020; originally announced January 2020.

    Comments: J Ambient Intell Human Comput (2020)

  44. arXiv:1912.06813  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining

    Authors: Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda

    Abstract: We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transforme…

    Submitted 14 December, 2019; originally announced December 2019.

    Comments: Preprint. Work in progress

  45. arXiv:1910.11033  [pdf, other

    cs.CV cs.LG

    Assisting human experts in the interpretation of their visual process: A case study on assessing copper surface adhesive potency

    Authors: Tristan Hascoet, Xuejiao Deng, Kiyoto Tai, Mari Sugiyama, Yuji Adachi, Sachiko Nakamura, Yasuo Ariki, Tomoko Hayashi, Tetusya Takiguchi

    Abstract: Deep Neural Networks are often thought to lack interpretability due to the distributed nature of their internal representations. In contrast, humans can generally justify, in natural language, their answer to a visual question with simple common sense reasoning. However, human introspection abilities have their own limits as one often struggles to justify the recognition process behind our…

    Submitted 24 October, 2019; originally announced October 2019.

  46. arXiv:1910.10909  [pdf, ps, other

    cs.CL eess.AS

    ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit

    Authors: Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, Xu Tan

    Abstract: This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named ESPnet-TTS, which is an extension of the open-source speech processing toolkit ESPnet. The toolkit supports state-of-the-art E2E-TTS models, including Tacotron 2, Transformer TTS, and FastSpeech, and also provides recipes inspired by the Kaldi automatic speech recognition (ASR) toolkit. The recipes are based on the desig…

    Submitted 16 February, 2020; v1 submitted 24 October, 2019; originally announced October 2019.

    Comments: Accepted to ICASSP2020. Demo HP: https://espnet.github.io/icassp2020-tts/

  47. A Comparative Study on Transformer vs RNN in Speech Applications

    Authors: Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang

    Abstract: Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We underto…

    Submitted 28 September, 2019; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: Accepted at ASRU 2019

    Journal ref: IEEE Automatic Speech Recognition and Understanding Workshop 2019

  48. arXiv:1907.10185  [pdf, ps, other

    eess.AS cs.CL cs.SD

    Non-Parallel Voice Conversion with Cyclic Variational Autoencoder

    Authors: Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

    Abstract: In this paper, we present a novel technique for non-parallel voice conversion (VC) with the use of cyclic variational autoencoder (CycleVAE)-based spectral modeling. In a variational autoencoder (VAE) framework, a latent space, usually with a Gaussian prior, is used to encode a set of input features. In a VAE-based VC, the encoded latent features are fed into a decoder, along with speaker-coding…

    Submitted 23 July, 2019; originally announced July 2019.

    Comments: Accepted to INTERSPEECH 2019

  49. arXiv:1907.08940  [pdf

    eess.AS cs.SD

    Statistical Voice Conversion with Quasi-Periodic WaveNet Vocoder

    Authors: Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

    Abstract: In this paper, we investigate the effectiveness of a quasi-periodic WaveNet (QPNet) vocoder combined with a statistical spectral conversion technique for a voice conversion task. The WaveNet (WN) vocoder has been applied as the waveform generation module in many different voice conversion frameworks and achieves significant improvement over conventional vocoders. However, because of the fixed dila…

    Submitted 22 March, 2020; v1 submitted 21 July, 2019; originally announced July 2019.

    Comments: 6 pages, 7 figures, Proc. SSW10, 2019

  50. arXiv:1907.00797  [pdf

    eess.AS cs.SD

    Quasi-Periodic WaveNet Vocoder: A Pitch Dependent Dilated Convolution Model for Parametric Speech Generation

    Authors: Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro Kobayashi, Tomoki Toda

    Abstract: In this paper, we propose a quasi-periodic neural network (QPNet) vocoder with a novel network architecture named pitch-dependent dilated convolution (PDCNN) to improve the pitch controllability of WaveNet (WN) vocoder. The effectiveness of the WN vocoder to generate high-fidelity speech samples from given acoustic features has been proved recently. However, because of the fixed dilated convolutio…

    Submitted 22 March, 2020; v1 submitted 1 July, 2019; originally announced July 2019.

    Comments: 5 pages, 4 figures, Proc. Interspeech, 2019