Showing 1–22 of 22 results for author: Ohishi, Y

Searching in archive eess.
  1. arXiv:2406.02032  [pdf, other]

    eess.AS cs.MM cs.SD

    M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Masahiro Yasuda, Shunsuke Tsubaki, Keisuke Imoto

    Abstract: Contrastive language-audio pre-training (CLAP) enables zero-shot (ZS) inference of audio and exhibits promising performance in several classification tasks. However, conventional audio representations are still crucial for many tasks where ZS is not applicable (e.g., regression problems). Here, we explore a new representation, a general-purpose audio-language representation, that performs well in…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 5 pages, 1 figure, 5 tables. Accepted by Interspeech 2024

    MSC Class: 68T07

  2. arXiv:2404.17107  [pdf, other]

    eess.AS cs.SD

    Exploring Pre-trained General-purpose Audio Representations for Heart Murmur Detection

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: To reduce the need for skilled clinicians in heart sound interpretation, recent studies on automating cardiac auscultation have explored deep learning approaches. However, despite the demands for large data for deep learning, the size of the heart sound datasets is limited, and no pre-trained model is available. On the contrary, many pre-trained models for general audio tasks are available as gene…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: 4 pages, 1 figure, and 4 tables. Accepted by IEEE EMBC 2024

    MSC Class: 68T07

  3. arXiv:2404.08264  [pdf, other]

    cs.MM cs.CV eess.AS

    Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis

    Authors: Masahiro Yasuda, Noboru Harada, Yasunori Ohishi, Shoichiro Saito, Akira Nakayama, Nobutaka Ono

    Abstract: Observations with distributed sensors are essential in analyzing a series of human and machine activities (referred to as 'events' in this paper) in complex and extensive real-world environments. This is because the information obtained from a single sensor is often missing or fragmented in such an environment; observations from multiple locations and modalities should be integrated to analyze eve…

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: 13 pages, 7 figures, under review

  4. arXiv:2404.06095  [pdf, other]

    eess.AS cs.SD

    Masked Modeling Duo: Towards a Universal Audio Pre-training Framework

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Self-supervised learning (SSL) using masked prediction has made great strides in general-purpose audio representation. This study proposes Masked Modeling Duo (M2D), an improved masked prediction SSL, which learns by predicting representations of masked input signals that serve as training signals. Unlike conventional methods, M2D obtains a training signal by encoding only the masked part, encoura…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: 15 pages, 6 figures, 15 tables. Accepted by TASLP

    MSC Class: 68T07

  5. arXiv:2403.10756  [pdf, other]

    eess.AS cs.SD

    Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval

    Authors: Shunsuke Tsubaki, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Keisuke Imoto

    Abstract: The aim of this research is to refine knowledge transfer on audio-image temporal agreement for audio-text cross retrieval. To address the limited availability of paired non-speech audio-text data, learning methods for transferring the knowledge acquired from a large amount of paired audio-image data to shared audio-text representation have been investigated, suggesting the importance of how audio-…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Submitted to EUSIPCO2024

  6. arXiv:2308.11923  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement

    Authors: Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino

    Abstract: We proposed Audio Difference Captioning (ADC) as a new extension task of audio captioning for describing the semantic differences between input pairs of similar but slightly different audio clips. The ADC solves the problem that conventional audio captioning sometimes generates similar captions for similar audio clips, failing to describe the difference in content. We also propose a cross-attentio…

    Submitted 23 August, 2023; originally announced August 2023.

    Comments: Accepted to DCASE2023 Workshop

  7. arXiv:2305.14079  [pdf, other]

    eess.AS cs.SD

    Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Self-supervised learning general-purpose audio representations have demonstrated high performance in a variety of tasks. Although they can be optimized for application by fine-tuning, even higher performance can be expected if they can be specialized to pre-train for an application. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specifi…

    Submitted 3 August, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023; 5+2 pages, 2 figures, 6+6 tables, Code: https://github.com/nttcslab/m2d/tree/master/speech

    MSC Class: 68T07

  8. arXiv:2303.00455  [pdf, other]

    eess.AS cs.SD

    First-shot anomaly sound detection for machine condition monitoring: A domain generalization baseline

    Authors: Noboru Harada, Daisuke Niizumi, Yasunori Ohishi, Daiki Takeuchi, Masahiro Yasuda

    Abstract: This paper provides a baseline system for First-shot-compliant unsupervised anomaly detection (ASD) for machine condition monitoring. First-shot ASD does not allow systems to do machine-type dependent hyperparameter tuning or tool ensembling based on the performance metric calculated with the ground truth. To show benchmark performance for First-shot ASD, this paper proposes an anomaly sound detect…

    Submitted 1 March, 2023; originally announced March 2023.

    Comments: 5 pages, 2 figures

  9. arXiv:2210.14648  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD

    Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Masked Autoencoders is a simple yet powerful self-supervised learning method. However, it learns representations indirectly by reconstructing masked input patches. Several methods learn representations directly by predicting representations of masked patches; however, we think using all patches to encode training signal representations is suboptimal. We propose a new method, Masked Modeling Duo (M…

    Submitted 2 March, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

    Comments: 6 pages, 3 figures, and 6 tables. To appear at ICASSP2023

    MSC Class: 68T07

  10. arXiv:2207.11964  [pdf, other]

    eess.AS cs.LG cs.MM cs.SD

    ConceptBeam: Concept Driven Target Speech Extraction

    Authors: Yasunori Ohishi, Marc Delcroix, Tsubasa Ochiai, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Akisato Kimura, Noboru Harada, Kunio Kashino

    Abstract: We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker in a mixture. Typical approaches have been exploiting properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we…

    Submitted 25 July, 2022; originally announced July 2022.

    Comments: Accepted to ACM Multimedia 2022

  11. arXiv:2207.09732  [pdf, other]

    eess.AS cs.CL cs.IR cs.LG cs.SD

    Introducing Auxiliary Text Query-modifier to Content-based Audio Retrieval

    Authors: Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino

    Abstract: The amount of audio data available on public websites is growing rapidly, and an efficient mechanism for accessing the desired data is necessary. We propose a content-based audio retrieval method that can retrieve a target audio that is similar to but slightly different from the query audio by introducing auxiliary textual information which describes the difference between the query and target aud…

    Submitted 20 July, 2022; originally announced July 2022.

    Comments: Accepted to Interspeech 2022

  12. arXiv:2205.08138  [pdf, ps, other]

    eess.AS cs.SD

    Composing General Audio Representation by Fusing Multilayer Features of a Pre-trained Model

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Many application studies rely on audio DNN models pre-trained on a large-scale dataset as essential feature extractors, and they extract features from the last layers. In this study, we focus on our finding that the middle layer features of existing supervised pre-trained models are more effective than the late layer features for some tasks. We propose a simple approach to compose features effecti…

    Submitted 17 May, 2022; originally announced May 2022.

    Comments: 5 pages, 4 figures and 4 tables. Accepted by EUSIPCO 2022

    MSC Class: 68T07

  13. arXiv:2204.12260  [pdf, other]

    eess.AS cs.SD

    Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Recent general-purpose audio representations show state-of-the-art performance on various audio tasks. These representations are pre-trained by self-supervised learning methods that create training signals from the input. For example, typical audio contrastive learning uses temporal relationships among input sounds to create training signals, whereas some methods use a difference among input views…

    Submitted 26 April, 2022; originally announced April 2022.

    Comments: 22 pages, 8 figures. Under the review process

    MSC Class: 68T07

    Journal ref: HEAR: Holistic Evaluation of Audio Representations (NeurIPS 2021 Competition) PMLR 166 (2022) 1-24

  14. BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Pre-trained models are essential as feature extractors in modern machine learning systems in various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. For recognizing sounds regardless of perturbations such as varying pitch or timbre, features should be robust to these perturbations.…

    Submitted 16 June, 2022; v1 submitted 15 April, 2022; originally announced April 2022.

    Comments: 15 pages, 6 figures, and 15 tables. Under the review process

    MSC Class: 68T07

    Journal ref: IEEE/ACM Trans. Audio, Speech, Language Process. 31 (2023) 137-151

  15. arXiv:2204.03895  [pdf, other]

    eess.AS cs.SD

    SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning

    Authors: Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Yasunori Ohishi, Shoko Araki

    Abstract: In many situations, we would like to hear desired sound events (SEs) while being able to ignore interference. Target sound extraction (TSE) tackles this problem by estimating the audio signal of the sounds of target SE classes in a mixture of sounds while suppressing all other sounds. We can achieve this with a neural network that extracts the target SEs by conditioning it on clues representing th…

    Submitted 2 November, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on Feb. 10th, 2022, and accepted on Oct. 20, 2022

  16. arXiv:2202.09124  [pdf, other]

    eess.AS cs.SD

    Multi-view and Multi-modal Event Detection Utilizing Transformer-based Multi-sensor fusion

    Authors: Masahiro Yasuda, Yasunori Ohishi, Shoichiro Saito, Noboru Harada

    Abstract: We tackle a challenging task: multi-view and multi-modal event detection that detects events in a wide-range real environment by utilizing data from distributed cameras and microphones and their weak labels. In this task, distributed sensors are utilized complementarily to capture events that are difficult to capture with a single sensor, such as a series of actions of people moving in an intricat…

    Submitted 18 February, 2022; originally announced February 2022.

    Comments: 5 pages, 5 figures, to appear in IEEE ICASSP 2022

  17. arXiv:2202.09121  [pdf, other]

    eess.AS cs.SD

    Echo-aware Adaptation of Sound Event Localization and Detection in Unknown Environments

    Authors: Masahiro Yasuda, Yasunori Ohishi, Shoichiro Saito

    Abstract: Our goal is to develop a sound event localization and detection (SELD) system that works robustly in unknown environments. A SELD system trained on known environment data is degraded in an unknown environment due to environmental effects such as reverberation and noise not contained in the training data. Previous studies on related tasks have shown that domain adaptation methods are effective when…

    Submitted 18 February, 2022; originally announced February 2022.

    Comments: 5 pages, 3 figures, to appear in IEEE ICASSP 2022

  18. arXiv:2106.02369  [pdf, other]

    eess.AS

    ToyADMOS2: Another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions

    Authors: Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, Shoichiro Saito

    Abstract: This paper proposes a new large-scale dataset called "ToyADMOS2" for anomaly detection in machine operating sounds (ADMOS). As we did for our previous ToyADMOS dataset, we collected a large number of operating sounds of miniature machines (toys) under normal and anomaly conditions by deliberately damaging them, but extended this by providing a controlled depth of damage in anomaly samples. Since typical a…

    Submitted 4 June, 2021; originally announced June 2021.

    Comments: 5 pages, 4 figures

  19. arXiv:2103.06695  [pdf, other]

    eess.AS cs.LG cs.SD

    BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

    Authors: Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: Inspired by the recent progress in self-supervised learning for computer vision that generates supervision using data augmentations, we explore a new general-purpose audio representation learning approach. We propose learning general-purpose audio representation from a single audio segment without expecting relationships between different time segments of audio samples. To implement this principle…

    Submitted 20 April, 2021; v1 submitted 11 March, 2021; originally announced March 2021.

    Comments: IJCNN 2021, 8 pages, 4 figures

    MSC Class: 68T07

  20. arXiv:2012.07331  [pdf, other]

    eess.AS cs.CL cs.SD

    Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

    Authors: Yuma Koizumi, Yasunori Ohishi, Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda

    Abstract: The goal of audio captioning is to translate input audio into its description using natural language. One of the problems in audio captioning is the lack of training data due to the difficulty in collecting audio-caption pairs by crawling the web. In this study, to overcome this problem, we propose to use a pre-trained large-scale language model. Since an audio input cannot be directly inputted in…

    Submitted 14 December, 2020; originally announced December 2020.

    Comments: Submitted to ICASSP 2021

  21. arXiv:2009.11436  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning

    Authors: Daiki Takeuchi, Yuma Koizumi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely, data augmentation, multi-task learning, and post-processing, for audio captioning. The system received the highest evaluation scores, but which of the individual elements most fully contributed to its performance has not ye…

    Submitted 23 September, 2020; originally announced September 2020.

    Comments: Accepted to DCASE2020 Workshop

  22. arXiv:2007.00225  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    The NTT DCASE2020 Challenge Task 6 system: Automated Audio Captioning with Keywords and Sentence Length Estimation

    Authors: Yuma Koizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

    Abstract: This technical report describes the system participating in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, Task 6: automated audio captioning. Our submission focuses on solving two indeterminacy problems in automated audio captioning: word selection indeterminacy and sentence length indeterminacy. We simultaneously solve the main caption generation and sub i…

    Submitted 1 July, 2020; originally announced July 2020.

    Comments: Technical Report of DCASE2020 Challenge Task 6