Showing 1–24 of 24 results for author: Takeuchi, D

  1. arXiv:2406.02032  [pdf, other]

    eess.AS cs.MM cs.SD

    M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Masahiro Yasuda, Shunsuke Tsubaki, Keisuke Imoto

    Abstract: Contrastive language-audio pre-training (CLAP) enables zero-shot (ZS) inference of audio and exhibits promising performance in several classification tasks. However, conventional audio representations are still crucial for many tasks where ZS is not applicable (e.g., regression problems). Here, we explore a new representation, a general-purpose audio-language representation, that performs well in…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 5 pages, 1 figure, 5 tables. Accepted by Interspeech 2024

    MSC Class: 68T07

  2. arXiv:2404.17107  [pdf, other]

    eess.AS cs.SD

    Exploring Pre-trained General-purpose Audio Representations for Heart Murmur Detection

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: To reduce the need for skilled clinicians in heart sound interpretation, recent studies on automating cardiac auscultation have explored deep learning approaches. However, despite the demands for large data for deep learning, the size of the heart sound datasets is limited, and no pre-trained model is available. On the contrary, many pre-trained models for general audio tasks are available as gene…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: 4 pages, 1 figure, and 4 tables. Accepted by IEEE EMBC 2024

    MSC Class: 68T07

  3. arXiv:2404.06095  [pdf, other]

    eess.AS cs.SD

    Masked Modeling Duo: Towards a Universal Audio Pre-training Framework

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Self-supervised learning (SSL) using masked prediction has made great strides in general-purpose audio representation. This study proposes Masked Modeling Duo (M2D), an improved masked prediction SSL, which learns by predicting representations of masked input signals that serve as training signals. Unlike conventional methods, M2D obtains a training signal by encoding only the masked part, encoura…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: 15 pages, 6 figures, 15 tables. Accepted by TASLP

    MSC Class: 68T07

  4. arXiv:2403.10756  [pdf, other]

    eess.AS cs.SD

    Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval

    Authors: Shunsuke Tsubaki, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Keisuke Imoto

    Abstract: The aim of this research is to refine knowledge transfer on audio-image temporal agreement for audio-text cross retrieval. To address the limited availability of paired non-speech audio-text data, learning methods for transferring the knowledge acquired from a large amount of paired audio-image data to shared audio-text representation have been investigated, suggesting the importance of how audio-…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Submitted to EUSIPCO2024

  5. arXiv:2402.08252  [pdf, other]

    eess.AS cs.SD

    Unrestricted Global Phase Bias-Aware Single-channel Speech Enhancement with Conformer-based Metric GAN

    Authors: Shiqi Zhang, Zheng Qiu, Daiki Takeuchi, Noboru Harada, Shoji Makino

    Abstract: With the rapid development of neural networks in recent years, the ability of various networks to enhance the magnitude spectrum of noisy speech in the single-channel speech enhancement domain has become exceptionally outstanding. However, enhancing the phase spectrum using neural networks is often ineffective, which remains a challenging problem. In this paper, we found that the human ear cannot…

    Submitted 4 June, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

    Comments: Accepted by ICASSP 2024. Updated on 2024/06/04 to add one more citation in appendix

  6. arXiv:2308.11923  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement

    Authors: Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino

    Abstract: We proposed Audio Difference Captioning (ADC) as a new extension task of audio captioning for describing the semantic differences between input pairs of similar but slightly different audio clips. The ADC solves the problem that conventional audio captioning sometimes generates similar captions for similar audio clips, failing to describe the difference in content. We also propose a cross-attentio…

    Submitted 23 August, 2023; originally announced August 2023.

    Comments: Accepted to DCASE2023 Workshop

  7. arXiv:2305.14079  [pdf, other]

    eess.AS cs.SD

    Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Self-supervised learning general-purpose audio representations have demonstrated high performance in a variety of tasks. Although they can be optimized for application by fine-tuning, even higher performance can be expected if they can be specialized to pre-train for an application. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specifi…

    Submitted 3 August, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023; 5+2 pages, 2 figures, 6+6 tables, Code: https://github.com/nttcslab/m2d/tree/master/speech

    MSC Class: 68T07

  8. arXiv:2304.14923  [pdf, ps, other]

    eess.SP cs.SD eess.AS eess.IV physics.optics

    Deep sound-field denoiser: optically-measured sound-field denoising using deep neural network

    Authors: Kenji Ishikawa, Daiki Takeuchi, Noboru Harada, Takehiro Moriya

    Abstract: This paper proposes a deep sound-field denoiser, a deep neural network (DNN) based denoising of optically measured sound-field images. Sound-field imaging using optical methods has gained considerable attention due to its ability to achieve high-spatial-resolution imaging of acoustic phenomena that conventional acoustic sensors cannot accomplish. However, the optically measured sound-field images…

    Submitted 21 September, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

    Comments: 16 pages, 10 figures, 2 tables

  9. arXiv:2303.00455  [pdf, other]

    eess.AS cs.SD

    First-shot anomaly sound detection for machine condition monitoring: A domain generalization baseline

    Authors: Noboru Harada, Daisuke Niizumi, Yasunori Ohishi, Daiki Takeuchi, Masahiro Yasuda

    Abstract: This paper provides a baseline system for First-shot-compliant unsupervised anomaly detection (ASD) for machine condition monitoring. First-shot ASD does not allow systems to do machine-type dependent hyperparameter tuning or tool ensembling based on the performance metric calculated with the ground truth. To show benchmark performance for First-shot ASD, this paper proposes an anomaly sound detect…

    Submitted 1 March, 2023; originally announced March 2023.

    Comments: 5 pages, 2 figures

  10. arXiv:2210.14648  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD

    Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Masked Autoencoders is a simple yet powerful self-supervised learning method. However, it learns representations indirectly by reconstructing masked input patches. Several methods learn representations directly by predicting representations of masked patches; however, we think using all patches to encode training signal representations is suboptimal. We propose a new method, Masked Modeling Duo (M…

    Submitted 2 March, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

    Comments: 6 pages, 3 figures, and 6 tables. To appear at ICASSP2023

    MSC Class: 68T07

  11. arXiv:2209.10275  [pdf, other]

    cs.IT

    Tight Exponential Strong Converse for Source Coding Problem with Encoded Side Information

    Authors: Daisuke Takeuchi, Shun Watanabe

    Abstract: The source coding problem with encoded side information is considered. A lower bound on the strong converse exponent has been derived by Oohama, but its tightness has not been clarified. In this paper, we derive a tight strong converse exponent. For the special case such that the side-information does not exist, we demonstrate that our tight exponent of the WAK problem reduces to the known tight…

    Submitted 3 April, 2024; v1 submitted 21 September, 2022; originally announced September 2022.

    Comments: 15 pages, 5 figures; v2 adds an analysis of full-side information case; v3 adds numerical experiment and an application to the privacy amplification

  12. arXiv:2207.11964  [pdf, other]

    eess.AS cs.LG cs.MM cs.SD

    ConceptBeam: Concept Driven Target Speech Extraction

    Authors: Yasunori Ohishi, Marc Delcroix, Tsubasa Ochiai, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Akisato Kimura, Noboru Harada, Kunio Kashino

    Abstract: We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker in a mixture. Typical approaches have been exploiting properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we…

    Submitted 25 July, 2022; originally announced July 2022.

    Comments: Accepted to ACM Multimedia 2022

  13. arXiv:2207.09732  [pdf, other]

    eess.AS cs.CL cs.IR cs.LG cs.SD

    Introducing Auxiliary Text Query-modifier to Content-based Audio Retrieval

    Authors: Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino

    Abstract: The amount of audio data available on public websites is growing rapidly, and an efficient mechanism for accessing the desired data is necessary. We propose a content-based audio retrieval method that can retrieve a target audio that is similar to but slightly different from the query audio by introducing auxiliary textual information which describes the difference between the query and target aud…

    Submitted 20 July, 2022; originally announced July 2022.

    Comments: Accepted to Interspeech 2022

  14. arXiv:2205.08138  [pdf, ps, other]

    eess.AS cs.SD

    Composing General Audio Representation by Fusing Multilayer Features of a Pre-trained Model

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Many application studies rely on audio DNN models pre-trained on a large-scale dataset as essential feature extractors, and they extract features from the last layers. In this study, we focus on our finding that the middle layer features of existing supervised pre-trained models are more effective than the late layer features for some tasks. We propose a simple approach to compose features effecti…

    Submitted 17 May, 2022; originally announced May 2022.

    Comments: 5 pages, 4 figures and 4 tables. Accepted by EUSIPCO 2022

    MSC Class: 68T07

  15. arXiv:2204.12260  [pdf, other]

    eess.AS cs.SD

    Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Recent general-purpose audio representations show state-of-the-art performance on various audio tasks. These representations are pre-trained by self-supervised learning methods that create training signals from the input. For example, typical audio contrastive learning uses temporal relationships among input sounds to create training signals, whereas some methods use a difference among input views…

    Submitted 26 April, 2022; originally announced April 2022.

    Comments: 22 pages, 8 figures. Under the review process

    MSC Class: 68T07

    Journal ref: HEAR: Holistic Evaluation of Audio Representations (NeurIPS 2021 Competition) PMLR 166 (2022) 1-24

  16. BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Pre-trained models are essential as feature extractors in modern machine learning systems in various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. For recognizing sounds regardless of perturbations such as varying pitch or timbre, features should be robust to these perturbations.…

    Submitted 16 June, 2022; v1 submitted 15 April, 2022; originally announced April 2022.

    Comments: 15 pages, 6 figures, and 15 tables. Under the review process

    MSC Class: 68T07

    Journal ref: IEEE/ACM Trans. Audio, Speech, Language Process. 31 (2023) 137-151

  17. arXiv:2103.06695  [pdf, other]

    eess.AS cs.LG cs.SD

    BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Inspired by the recent progress in self-supervised learning for computer vision that generates supervision using data augmentations, we explore a new general-purpose audio representation learning approach. We propose learning general-purpose audio representation from a single audio segment without expecting relationships between different time segments of audio samples. To implement this principle…

    Submitted 20 April, 2021; v1 submitted 11 March, 2021; originally announced March 2021.

    Comments: IJCNN 2021, 8 pages, 4 figures

    MSC Class: 68T07

  18. arXiv:2012.07331  [pdf, other]

    eess.AS cs.CL cs.SD

    Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

    Authors: Yuma Koizumi, Yasunori Ohishi, Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda

    Abstract: The goal of audio captioning is to translate input audio into its description using natural language. One of the problems in audio captioning is the lack of training data due to the difficulty in collecting audio-caption pairs by crawling the web. In this study, to overcome this problem, we propose to use a pre-trained large-scale language model. Since an audio input cannot be directly inputted in…

    Submitted 14 December, 2020; originally announced December 2020.

    Comments: Submitted to ICASSP 2021

  19. arXiv:2009.11436  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning

    Authors: Daiki Takeuchi, Yuma Koizumi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely, data augmentation, multi-task learning, and post-processing, for audio captioning. The system received the highest evaluation scores, but which of the individual elements most fully contributed to its performance has not ye…

    Submitted 23 September, 2020; originally announced September 2020.

    Comments: Accepted to DCASE2020 Workshop

  20. arXiv:2007.00225  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation

    Authors: Yuma Koizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: automated audio captioning. Our submission focuses on solving two indeterminacy problems in automated audio captioning: word selection indeterminacy and sentence length indeterminacy. We simultaneously solve the main caption generation and sub i…

    Submitted 1 July, 2020; originally announced July 2020.

    Comments: Technical Report of DCASE2020 Challenge Task 6

  21. arXiv:2002.05873  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention

    Authors: Yuma Koizumi, Kohei Yatabe, Marc Delcroix, Yoshiki Masuyama, Daiki Takeuchi

    Abstract: This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features; we extract a speaker representation used for adaptation directly from the test utterance. Conventional studies of deep neural network (DNN)-based speech enhancement mainly focus on building a speaker independent model. Meanwhile, in speech applications including speech recognition and s…

    Submitted 14 February, 2020; originally announced February 2020.

    Comments: 5 pages, to appear in IEEE ICASSP 2020

  22. arXiv:2002.05843  [pdf, other]

    eess.AS cs.SD

    Real-time speech enhancement using equilibriated RNN

    Authors: Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: We propose a speech enhancement method using a causal deep neural network (DNN) for real-time applications. DNN has been widely used for estimating a time-frequency (T-F) mask which enhances a speech signal. One popular DNN structure for that is a recurrent neural network (RNN) owing to its capability of effectively modelling time-sequential data like speech. In particular, the long short-term mem…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: To appear in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)

  23. arXiv:1911.10764  [pdf, other]

    eess.AS cs.SD

    Invertible DNN-based nonlinear time-frequency transform for speech enhancement

    Authors: Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: We propose an end-to-end speech enhancement method with trainable time-frequency (T-F) transform based on invertible deep neural network (DNN). The recent development of speech enhancement is brought by using DNN. The ordinary DNN-based speech enhancement employs T-F transform, typically the short-time Fourier transform (STFT), and estimates a T-F mask using DNN. On the other hand, some methods ha…

    Submitted 13 February, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

    Comments: To appear in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)

  24. arXiv:1903.08876  [pdf, other]

    eess.AS cs.SD

    Data-driven design of perfect reconstruction filterbank for DNN-based sound source enhancement

    Authors: Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: We propose a data-driven design method of perfect-reconstruction filterbank (PRFB) for sound-source enhancement (SSE) based on deep neural network (DNN). DNNs have been used to estimate a time-frequency (T-F) mask in the short-time Fourier transform (STFT) domain. Their training is more stable when a simple cost function such as mean-squared error (MSE) is utilized compared to some advanced cost such…

    Submitted 21 March, 2019; originally announced March 2019.

    Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-P8.8, Session: Spatial Audio, Audio Enhancement and Bandwidth Extension)