Showing 1–26 of 26 results for author: Komatsu, T

  1. arXiv:2406.13139  [pdf, other]

    eess.AS cs.SD

    Audio Fingerprinting with Holographic Reduced Representations

    Authors: Yusuke Fujita, Tatsuya Komatsu

    Abstract: This paper proposes an audio fingerprinting model with holographic reduced representation (HRR). The proposed method reduces the number of stored fingerprints, whereas conventional neural audio fingerprinting requires many fingerprints for each audio track to achieve high accuracy and time resolution. We utilize HRR to aggregate multiple fingerprints into a composite fingerprint via circular convo…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: accepted at Interspeech 2024

  2. arXiv:2406.12194  [pdf, other]

    eess.AS cs.SD

    Universal Score-based Speech Enhancement with High Content Preservation

    Authors: Robin Scheibler, Yusuke Fujita, Yuma Shirahata, Tatsuya Komatsu

    Abstract: We propose UNIVERSE++, a universal speech enhancement method based on score-based diffusion and adversarial training. Specifically, we improve the existing UNIVERSE model that decouples clean speech feature extraction and diffusion. Our contributions are three-fold. First, we make several modifications to the network architecture, improving training stability and final performance. Second, we intr…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 5 pages, 5 figures, accepted at Interspeech 2024

  3. arXiv:2401.11700  [pdf, other]

    cs.CL cs.SD eess.AS

    Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers

    Authors: Michael Hentschel, Yuta Nishikawa, Tatsuya Komatsu, Yusuke Fujita

    Abstract: This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers. To distil the teacher's knowledge, we use an attention decoder that learns from BERT's token probabilities. Our method shows that language model (LM) information can be more effectively distilled into an ASR model using both the in…

    Submitted 22 January, 2024; originally announced January 2024.

    Comments: Accepted at ICASSP 2024

  4. Estimation of articulated angle in six-wheeled dump trucks using multiple GNSS receivers for autonomous driving

    Authors: Taro Suzuki, Kazunori Ohno, Syotaro Kojima, Naoto Miyamoto, Takahiro Suzuki, Tomohiro Komatsu, Yukinori Shibata, Kimitaka Asano, Keiji Nagatani

    Abstract: Due to the declining birthrate and aging population, the shortage of labor in the construction industry has become a serious problem, and increasing attention has been paid to automation of construction equipment. We focus on the automatic operation of articulated six-wheel dump trucks at construction sites. For the automatic operation of the dump trucks, it is important to estimate the position a…

    Submitted 5 December, 2023; originally announced December 2023.

    Comments: This is an electronic version of an article published in ADVANCED ROBOTICS, 35:23, 1376-1387, 2021. ADVANCED ROBOTICS is available online at: www.tandfonline.com/Article DOI; 10.1080/01691864.2019.1619622

    Journal ref: Advanced Robotics, 35:23, 1376-1387, 2021

  5. arXiv:2310.03273  [pdf, other]

    cs.CV cs.LG eess.IV

    Ablation Study to Clarify the Mechanism of Object Segmentation in Multi-Object Representation Learning

    Authors: Takayuki Komatsu, Yoshiyuki Ohmura, Yasuo Kuniyoshi

    Abstract: Multi-object representation learning aims to represent complex real-world visual input using the composition of multiple objects. Representation learning methods have often used unsupervised learning to segment an input image into individual objects and encode these objects into each latent vector. However, it is not clear how previous methods have achieved the appropriate segmentation of individu…

    Submitted 4 October, 2023; originally announced October 2023.

  6. arXiv:2309.08141  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD eess.SP

    Audio Difference Learning for Audio Captioning

    Authors: Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda

    Abstract: This study introduces a novel training paradigm, audio difference learning, for improving audio captioning. The fundamental concept of the proposed learning method is to create a feature representation space that preserves the relationship between audio, enabling the generation of captions that detail intricate audio information. This method employs a reference audio along with the input audio, bo…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: submitted to ICASSP2024

  7. arXiv:2309.08140  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions

    Authors: Reo Shimizu, Ryuichi Yamamoto, Masaya Kawamura, Yuma Shirahata, Hironori Doi, Tatsuya Komatsu, Kentaro Tachibana

    Abstract: We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) designed to be approximately independent of spe…

    Submitted 27 December, 2023; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024

  8. arXiv:2303.06806  [pdf, other]

    eess.AS cs.CL cs.SD

    Neural Diarization with Non-autoregressive Intermediate Attractors

    Authors: Yusuke Fujita, Tatsuya Komatsu, Robin Scheibler, Yusuke Kida, Tetsuji Ogawa

    Abstract: End-to-end neural diarization (EEND) with encoder-decoder-based attractors (EDA) is a promising method to handle the whole speaker diarization problem simultaneously with a single neural network. While the EEND model can produce all frame-level speaker labels simultaneously, it disregards output label dependency. In this work, we propose a novel EEND model that introduces the label dependency betw…

    Submitted 12 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  9. arXiv:2204.02279  [pdf, ps, other]

    cs.SD eess.AS

    How Information on Acoustic Scenes and Sound Events Mutually Benefits Event Detection and Scene Classification Tasks

    Authors: Keisuke Imoto, Yuka Komatsu, Shunsuke Tsubaki, Tatsuya Komatsu

    Abstract: Acoustic scene classification (ASC) and sound event detection (SED) are fundamental tasks in environmental sound analysis, and many methods based on deep learning have been proposed. Considering that information on acoustic scenes and sound events helps SED and ASC mutually, some researchers have proposed a joint analysis of acoustic scenes and sound events by multitask learning (MTL). However, co…

    Submitted 5 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  10. arXiv:2204.00176  [pdf, other]

    cs.CL cs.SD eess.AS

    Better Intermediates Improve CTC Inference

    Authors: Tatsuya Komatsu, Yusuke Fujita, Jaesong Lee, Lukas Lee, Shinji Watanabe, Yusuke Kida

    Abstract: This paper proposes a method for improved CTC inference with searched intermediates and multi-pass conditioning. The paper first formulates self-conditioned CTC as a probabilistic model with an intermediate prediction as a latent representation and provides a tractable conditioning framework. We then propose two new conditioning methods based on the new formulation: (1) Searched intermediate condi…

    Submitted 31 March, 2022; originally announced April 2022.

    Comments: 5 pages, submitted INTERSPEECH2022

  11. Alternate Intermediate Conditioning with Syllable-level and Character-level Targets for Japanese ASR

    Authors: Yusuke Fujita, Tatsuya Komatsu, Yusuke Kida

    Abstract: End-to-end automatic speech recognition directly maps input speech to characters. However, the mapping can be problematic when several different pronunciations should be mapped into one character or when one pronunciation is shared among many different characters. Japanese ASR suffers the most from such many-to-one and one-to-many mapping problems due to Japanese kanji characters. To alleviate the…

    Submitted 12 March, 2023; v1 submitted 31 March, 2022; originally announced April 2022.

    Comments: SLT 2022

  12. arXiv:2204.00174  [pdf, other]

    cs.CL cs.SD eess.AS

    InterAug: Augmenting Noisy Intermediate Predictions for CTC-based ASR

    Authors: Yu Nakagome, Tatsuya Komatsu, Yusuke Fujita, Shuta Ichimura, Yusuke Kida

    Abstract: This paper proposes InterAug: a novel training method for CTC-based ASR using augmented intermediate representations for conditioning. The proposed method exploits the conditioning framework of self-conditioned CTC to train robust models by conditioning with "noisy" intermediate predictions. During the training, intermediate predictions are changed to incorrect intermediate predictions, and fed in…

    Submitted 31 March, 2022; originally announced April 2022.

    Comments: This paper was submitted to INTERSPEECH2022

  13. arXiv:2202.08474  [pdf, other]

    eess.AS cs.SD

    Non-Autoregressive ASR with Self-Conditioned Folded Encoders

    Authors: Tatsuya Komatsu

    Abstract: This paper proposes CTC-based non-autoregressive ASR with self-conditioned folded encoders. The proposed method realizes non-autoregressive ASR with fewer parameters by folding the conventional stack of encoders into only two blocks; base encoders and folded encoders. The base encoders convert the input audio features into a neural representation suitable for recognition. This is followed by the f…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: 5 pages, accepted at ICASSP2022

  14. Acoustic Event Detection with Classifier Chains

    Authors: Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi

    Abstract: This paper proposes acoustic event detection (AED) with classifier chains, a new classifier based on the probabilistic chain rule. The proposed AED with classifier chains consists of a gated recurrent unit and performs iterative binary detection of each event one by one. In each iteration, the event's activity is estimated and used to condition the next output based on the probabilistic chain rule…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: 5pages, presented at Interspeech2021

  15. arXiv:2202.08456  [pdf, other]

    eess.AS cs.LG cs.SD

    MLP-ASR: Sequence-length agnostic all-MLP architectures for speech recognition

    Authors: Jin Sakuma, Tatsuya Komatsu, Robin Scheibler

    Abstract: We propose multi-layer perceptron (MLP)-based architectures suitable for variable length input. MLP-based architectures, recently proposed for image classification, can only be used for inputs of a fixed, pre-defined size. However, many types of data are naturally variable in length, for example, acoustic signals. We propose three approaches to extend MLP-based architectures for use with sequences…

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: 8 pages, 4 figures

  16. arXiv:2110.05249  [pdf, other]

    eess.AS cs.CL cs.SD

    A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation

    Authors: Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, Shinji Watanabe

    Abstract: Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly reduces the inference speed at the cost of accuracy drop compared to autoregressive baselines. Showing great potential for real-time applications, an increasing number of NAR models have been explored in different fields to mitigate the performance gap against AR models. In this work, we con…

    Submitted 11 October, 2021; originally announced October 2021.

    Comments: Accepted to ASRU2021

  17. arXiv:2104.10328  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers

    Authors: Yusuke Kida, Tatsuya Komatsu, Masahito Togami

    Abstract: This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR). The speech-to-text alignment is a problem of splitting long audio recordings with un-aligned transcripts into utterance-wise pairs of speech and text. Unlike conventional methods based on frame-synchronous prediction, the proposed method re-defines the speech-to-text alignment a…

    Submitted 20 April, 2021; originally announced April 2021.

    Comments: Submitted to INTERSPEECH 2021

  18. arXiv:2104.02724  [pdf, other]

    eess.AS cs.CL cs.SD

    Relaxing the Conditional Independence Assumption of CTC-based ASR by Conditioning on Intermediate Predictions

    Authors: Jumon Nozaki, Tatsuya Komatsu

    Abstract: This paper proposes a method to relax the conditional independence assumption of connectionist temporal classification (CTC)-based automatic speech recognition (ASR) models. We train a CTC-based ASR model with auxiliary CTC losses in intermediate layers in addition to the original CTC loss in the last layer. During both training and inference, each generated prediction in the intermediate layers i…

    Submitted 8 October, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

    Comments: Accepted to INTERSPEECH2021

  19. arXiv:2006.11204  [pdf, other]

    cs.LG cs.CR stat.ML

    Differentially Private Variational Autoencoders with Term-wise Gradient Aggregation

    Authors: Tsubasa Takahashi, Shun Takagi, Hajime Ono, Tatsuya Komatsu

    Abstract: This paper studies how to learn variational autoencoders with a variety of divergences under differential privacy constraints. We often build a VAE with an appropriate prior distribution to describe the desired properties of the learned representations and introduce a divergence as a regularization term to close the representations to the prior. Using differentially private SGD (DP-SGD), which ran…

    Submitted 19 June, 2020; originally announced June 2020.

    Comments: 10 pages

  20. arXiv:2002.05831  [pdf, other]

    eess.AS cs.SD

    Consistency-aware multi-channel speech enhancement using deep neural networks

    Authors: Yoshiki Masuyama, Masahito Togami, Tatsuya Komatsu

    Abstract: This paper proposes a deep neural network (DNN)-based multi-channel speech enhancement system in which a DNN is trained to maximize the quality of the enhanced time-domain signal. DNN-based multi-channel speech enhancement is often conducted in the time-frequency (T-F) domain because spatial filtering can be efficiently implemented in the T-F domain. In such a case, ordinary objective functions ar…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: To appear at the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

  21. arXiv:1911.04228  [pdf, ps, other]

    eess.AS cs.SD

    Unsupervised Training for Deep Speech Source Separation with Kullback-Leibler Divergence Based Probabilistic Loss Function

    Authors: Masahito Togami, Yoshiki Masuyama, Tatsuya Komatsu, Yu Nakagome

    Abstract: In this paper, we propose a multi-channel speech source separation with a deep neural network (DNN) which is trained under the condition that no clean signal is available. As an alternative to a clean signal, the proposed method adopts an estimated speech signal by an unsupervised speech source separation with a statistical model. As a statistical model of microphone input signal, we adopt a time…

    Submitted 11 November, 2019; originally announced November 2019.

  22. arXiv:1908.10055  [pdf, ps, other]

    cs.SD eess.AS

    Overview of Tasks and Investigation of Subjective Evaluation Methods in Environmental Sound Synthesis and Conversion

    Authors: Yuki Okamoto, Keisuke Imoto, Tatsuya Komatsu, Shinnosuke Takamichi, Takumi Yagyu, Ryosuke Yamanishi, Yoichi Yamashita

    Abstract: Synthesizing and converting environmental sounds have the potential for many applications such as supporting movie and game production, data augmentation for sound event detection and scene classification. Conventional works on synthesizing and converting environmental sounds are based on a physical modeling or concatenative approach. However, there are a limited number of works that have addresse…

    Submitted 27 August, 2019; originally announced August 2019.

  23. arXiv:1907.04984  [pdf, other]

    cs.SD eess.AS

    Multichannel Loss Function for Supervised Speech Source Separation by Mask-based Beamforming

    Authors: Yoshiki Masuyama, Masahito Togami, Tatsuya Komatsu

    Abstract: In this paper, we propose two mask-based beamforming methods using a deep neural network (DNN) trained by multichannel loss functions. Beamforming technique using time-frequency (TF)-masks estimated by a DNN have been applied to many applications where TF-masks are used for estimating spatial covariance matrices. To train a DNN for mask-based beamforming, loss functions designed for monaural speec…

    Submitted 10 July, 2019; originally announced July 2019.

    Comments: 5 pages, Accepted at INTERSPEECH 2019

  24. arXiv:1904.03787  [pdf, other]

    cs.SD cs.LG eess.AS

    Bayesian Non-Parametric Multi-Source Modelling Based Determined Blind Source Separation

    Authors: Chaitanya Narisetty, Tatsuya Komatsu, Reishi Kondo

    Abstract: This paper proposes a determined blind source separation method using Bayesian non-parametric modelling of sources. Conventionally source signals are separated from a given set of mixture signals by modelling them using non-negative matrix factorization (NMF). However in NMF, a latent variable signifying model complexity must be appropriately specified to avoid over-fitting or under-fitting. As re…

    Submitted 7 April, 2019; originally announced April 2019.

    Comments: 5 pages, 2 figures. Accepted at ICASSP 2019

  25. arXiv:1904.02852  [pdf, other]

    eess.AS cs.SD

    Modelling of Sound Events with Hidden Imbalances Based on Clustering and Separate Sub-Dictionary Learning

    Authors: Chaitanya Narisetty, Tatsuya Komatsu, Reishi Kondo

    Abstract: This paper proposes an effective modelling of sound event spectra with a hidden data-size-imbalance, for improved Acoustic Event Detection (AED). The proposed method models each event as an aggregated representation of a few latent factors, while conventional approaches try to find acoustic elements directly from the event spectra. In the method, all the latent factors across all events are assign…

    Submitted 4 April, 2019; originally announced April 2019.

  26. arXiv:1807.01985  [pdf, other]

    cs.LG stat.ML

    BayesGrad: Explaining Predictions of Graph Convolutional Networks

    Authors: Hirotaka Akita, Kosuke Nakago, Tomoki Komatsu, Yohei Sugawara, Shin-ichi Maeda, Yukino Baba, Hisashi Kashima

    Abstract: Recent advances in graph convolutional networks have significantly improved the performance of chemical predictions, raising a new research question: "how do we explain the predictions of graph convolutional networks?" A possible approach to answer this question is to visualize evidence substructures responsible for the predictions. For chemical property prediction tasks, the sample size of the tr…

    Submitted 4 July, 2018; originally announced July 2018.