Showing 1–22 of 22 results for author: Masuyama, Y

Searching in archive eess.
  1. arXiv:2406.16808  [pdf, other]

    cs.SD eess.AS

    Exploring the Capability of Mamba in Speech Applications

    Authors: Koichi Miyazaki, Yoshiki Masuyama, Masato Murata

    Abstract: This paper explores the capability of Mamba, a recently proposed architecture based on state space models (SSMs), as a competitive alternative to Transformer-based models. In the speech domain, well-designed Transformer-based models, such as the Conformer and E-Branchformer, have become the de facto standards. Extensive evaluations have demonstrated the effectiveness of these Transformer-based mod…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  2. arXiv:2402.17907  [pdf, other]

    eess.AS cs.SD

    NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization

    Authors: Yoshiki Masuyama, Gordon Wichern, François G. Germain, Zexu Pan, Sameer Khurana, Chiori Hori, Jonathan Le Roux

    Abstract: Head-related transfer functions (HRTFs) are important for immersive audio, and their spatial interpolation has been studied to upsample finite measurements. Recently, neural fields (NFs) which map from sound source direction to HRTF have gained attention. Existing NF-based methods focused on estimating the magnitude of the HRTF from a given sound source direction, and the magnitude is converted to…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024

  3. arXiv:2310.19644  [pdf, other]

    eess.AS cs.MM

    Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction

    Authors: Zexu Pan, Gordon Wichern, Yoshiki Masuyama, Francois G. Germain, Sameer Khurana, Chiori Hori, Jonathan Le Roux

    Abstract: Target speech extraction aims to extract, based on a given conditioning cue, a target speech signal that is corrupted by interfering sources, such as noise or competing speakers. Building upon the achievements of the state-of-the-art (SOTA) time-frequency speaker separation model TF-GridNet, we propose AV-GridNet, a visual-grounded variant that incorporates the face recording of a target speaker a…

    Submitted 30 October, 2023; originally announced October 2023.

    Comments: Accepted by ASRU 2023

  4. arXiv:2307.12232  [pdf, other]

    cs.SD eess.AS eess.SP

    Signal Reconstruction from Mel-spectrogram Based on Bi-level Consistency of Full-band Magnitude and Phase

    Authors: Yoshiki Masuyama, Natsuki Ueno, Nobutaka Ono

    Abstract: We propose an optimization-based method for reconstructing a time-domain signal from a low-dimensional spectral representation such as a mel-spectrogram. Phase reconstruction has been studied to reconstruct a time-domain signal from the full-band short-time Fourier transform (STFT) magnitude. The Griffin-Lim algorithm (GLA) has been widely used because it relies only on the redundancy of STFT and…

    Submitted 23 July, 2023; originally announced July 2023.

    Comments: Accepted to IEEE WASPAA 2023

  5. arXiv:2307.12231  [pdf, other]

    cs.SD cs.CL eess.AS

    Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

    Authors: Yoshiki Masuyama, Xuankai Chang, Wangyou Zhang, Samuele Cornell, Zhong-Qiu Wang, Nobutaka Ono, Yanmin Qian, Shinji Watanabe

    Abstract: Neural speech separation has made remarkable progress and its integration with automatic speech recognition (ASR) is an important direction towards realizing multi-speaker ASR. This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end. In detail, we explore multi-channel separation methods, mask-based beamforming and comp…

    Submitted 23 July, 2023; originally announced July 2023.

    Comments: Accepted to IEEE WASPAA 2023

  6. arXiv:2306.13734  [pdf, other]

    eess.AS cs.CL cs.SD

    The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios

    Authors: Samuele Cornell, Matthew Wiesner, Shinji Watanabe, Desh Raj, Xuankai Chang, Paola Garcia, Matthew Maciejewski, Yoshiki Masuyama, Zhong-Qiu Wang, Stefano Squartini, Sanjeev Khudanpur

    Abstract: The CHiME challenges have played a significant role in the development and evaluation of robust automatic speech recognition (ASR) systems. We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge. This task comprises joint ASR and diarization in far-field settings with multiple, and possibly heterogeneous, recording devices. Different from previous challenges, we evaluate…

    Submitted 14 July, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

  7. arXiv:2306.10240  [pdf, other]

    cs.SD cs.LG eess.AS

    Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation

    Authors: Yoshiaki Bando, Yoshiki Masuyama, Aditya Arie Nugraha, Kazuyoshi Yoshii

    Abstract: This paper describes an efficient unsupervised learning method for a neural source separation model that utilizes a probabilistic generative model of observed multichannel mixtures proposed for blind source separation (BSS). For this purpose, amortized variational inference (AVI) has been used for directly solving the inverse problem of BSS with full-rank spatial covariance analysis (FCA). Althoug…

    Submitted 16 June, 2023; originally announced June 2023.

    Comments: 5 pages, 2 figures, accepted to EUSIPCO 2023

  8. arXiv:2302.07928  [pdf, other]

    eess.AS cs.SD eess.SP

    Multi-Channel Target Speaker Extraction with Refinement: The WavLab Submission to the Second Clarity Enhancement Challenge

    Authors: Samuele Cornell, Zhong-Qiu Wang, Yoshiki Masuyama, Shinji Watanabe, Manuel Pariente, Nobutaka Ono

    Abstract: This paper describes our submission to the Second Clarity Enhancement Challenge (CEC2), which consists of target speech enhancement for hearing-aid (HA) devices in noisy-reverberant environments with multiple interferers such as music and competing speakers. Our approach builds upon the powerful iterative neural/beamforming enhancement (iNeuBe) framework introduced in our recent work, and this p…

    Submitted 15 February, 2023; originally announced February 2023.

  9. arXiv:2211.08246  [pdf, other]

    cs.SD eess.AS eess.SP

    Online Phase Reconstruction via DNN-based Phase Differences Estimation

    Authors: Yoshiki Masuyama, Kohei Yatabe, Kento Nagatomo, Yasuhiro Oikawa

    Abstract: This paper presents a two-stage online phase reconstruction framework using causal deep neural networks (DNNs). Phase reconstruction is a task of recovering phase of the short-time Fourier transform (STFT) coefficients only from the corresponding magnitude. However, phase is sensitive to waveform shifts and not easy to estimate from the magnitude even with a DNN. To overcome this problem, we propo…

    Submitted 12 November, 2022; originally announced November 2022.

    Comments: Accepted to IEEE/ACM Trans. Audio, Speech, and Language Processing

  10. arXiv:2210.10742  [pdf, other]

    cs.SD eess.AS

    End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation

    Authors: Yoshiki Masuyama, Xuankai Chang, Samuele Cornell, Shinji Watanabe, Nobutaka Ono

    Abstract: Self-supervised learning representation (SSLR) has demonstrated its significant effectiveness in automatic speech recognition (ASR), mainly with clean speech. Recent work pointed out the strength of integrating SSLR with single-channel speech enhancement for ASR in noisy environments. This paper further advances this integration by dealing with multi-channel input. We propose a novel end-to-end ar…

    Submitted 19 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  11. arXiv:2207.09514  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

    Authors: Yen-Ju Lu, Xuankai Chang, Chenda Li, Wangyou Zhang, Samuele Cornell, Zhaoheng Ni, Yoshiki Masuyama, Brian Yan, Robin Scheibler, Zhong-Qiu Wang, Yu Tsao, Yanmin Qian, Shinji Watanabe

    Abstract: This paper presents recent progress on integrating speech separation and enhancement (SSE) into the ESPnet toolkit. Compared with the previous ESPnet-SE work, numerous features have been added, including recent state-of-the-art speech enhancement models with their respective training and evaluation recipes. Importantly, a new interface has been designed to flexibly combine speech enhancement front…

    Submitted 19 July, 2022; originally announced July 2022.

    Comments: To appear in Interspeech 2022

  12. arXiv:2206.13014  [pdf, other]

    eess.AS cs.SD eess.SP

    Joint Optimization of Sampling Rate Offsets Based on Entire Signal Relationship Among Distributed Microphones

    Authors: Yoshiki Masuyama, Kouei Yamaoka, Nobutaka Ono

    Abstract: In this paper, we propose to simultaneously estimate all the sampling rate offsets (SROs) of multiple devices. In a distributed microphone array, the SRO is inevitable, which deteriorates the performance of array signal processing. Most of the existing SRO estimation methods focused on synchronizing two microphones. When synchronizing more than two microphones, we select one reference microphone a…

    Submitted 26 June, 2022; originally announced June 2022.

    Comments: 5 pages, 2 figures, accepted by Interspeech 2022

  13. arXiv:2007.13976  [pdf, other]

    cs.SD cs.CV eess.AS

    Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling

    Authors: Yoshiki Masuyama, Yoshiaki Bando, Kohei Yatabe, Yoko Sasaki, Masaki Onishi, Yasuhiro Oikawa

    Abstract: Detecting sound source objects within visual observation is important for autonomous robots to comprehend surrounding environments. Since sounding objects have a large variety with different appearances in our living environments, labeling all sounding objects is impossible in practice. This calls for self-supervised learning which does not require manual labeling. Most conventional self-superv…

    Submitted 27 July, 2020; originally announced July 2020.

    Comments: Accepted for publication in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

  14. arXiv:2002.05873  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention

    Authors: Yuma Koizumi, Kohei Yatabe, Marc Delcroix, Yoshiki Masuyama, Daiki Takeuchi

    Abstract: This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features; we extract a speaker representation used for adaptation directly from the test utterance. Conventional studies of deep neural network (DNN)-based speech enhancement mainly focus on building a speaker independent model. Meanwhile, in speech applications including speech recognition and s…

    Submitted 14 February, 2020; originally announced February 2020.

    Comments: 5 pages, to appear in IEEE ICASSP 2020

  15. arXiv:2002.05832  [pdf, other]

    eess.AS cs.SD

    Phase reconstruction based on recurrent phase unwrapping with deep neural networks

    Authors: Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: Phase reconstruction, which estimates phase from a given amplitude spectrogram, is an active research field in acoustical signal processing with many applications including audio synthesis. To take advantage of rich knowledge from data, several studies presented deep neural network (DNN)-based phase reconstruction methods. However, the training of a DNN for phase reconstruction is not an easy tas…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: To appear at the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

  16. arXiv:2002.05831  [pdf, other]

    eess.AS cs.SD

    Consistency-aware multi-channel speech enhancement using deep neural networks

    Authors: Yoshiki Masuyama, Masahito Togami, Tatsuya Komatsu

    Abstract: This paper proposes a deep neural network (DNN)-based multi-channel speech enhancement system in which a DNN is trained to maximize the quality of the enhanced time-domain signal. DNN-based multi-channel speech enhancement is often conducted in the time-frequency (T-F) domain because spatial filtering can be efficiently implemented in the T-F domain. In such a case, ordinary objective functions ar…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: To appear at the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

  17. arXiv:1911.04228  [pdf, ps, other]

    eess.AS cs.SD

    Unsupervised Training for Deep Speech Source Separation with Kullback-Leibler Divergence Based Probabilistic Loss Function

    Authors: Masahito Togami, Yoshiki Masuyama, Tatsuya Komatsu, Yu Nakagome

    Abstract: In this paper, we propose a multi-channel speech source separation with a deep neural network (DNN) which is trained under the condition that no clean signal is available. As an alternative to a clean signal, the proposed method adopts an estimated speech signal by an unsupervised speech source separation with a statistical model. As a statistical model of microphone input signal, we adopt a time…

    Submitted 11 November, 2019; originally announced November 2019.

  18. arXiv:1907.04984  [pdf, other]

    cs.SD eess.AS

    Multichannel Loss Function for Supervised Speech Source Separation by Mask-based Beamforming

    Authors: Yoshiki Masuyama, Masahito Togami, Tatsuya Komatsu

    Abstract: In this paper, we propose two mask-based beamforming methods using a deep neural network (DNN) trained by multichannel loss functions. Beamforming techniques using time-frequency (TF)-masks estimated by a DNN have been applied to many applications where TF-masks are used for estimating spatial covariance matrices. To train a DNN for mask-based beamforming, loss functions designed for monaural speec…

    Submitted 10 July, 2019; originally announced July 2019.

    Comments: 5 pages, Accepted at INTERSPEECH 2019

  19. arXiv:1903.05603  [pdf, other]

    eess.AS cs.SD eess.SP

    Low-rankness of Complex-valued Spectrogram and Its Application to Phase-aware Audio Processing

    Authors: Yoshiki Masuyama, Kohei Yatabe, Yasuhiro Oikawa

    Abstract: Low-rankness of amplitude spectrograms has been effectively utilized in audio signal processing methods including non-negative matrix factorization. However, such methods have a fundamental limitation owing to their amplitude-only treatment where the phase of the observed signal is utilized for resynthesizing the estimated signal. In order to address this limitation, we directly treat a complex-va…

    Submitted 13 March, 2019; originally announced March 2019.

    Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-P13.9, Session: Acoustic Scene Classification and Music Signal Analysis)

  20. arXiv:1903.05600  [pdf, other]

    eess.AS cs.SD eess.SP

    Phase-aware Harmonic/Percussive Source Separation via Convex Optimization

    Authors: Yoshiki Masuyama, Kohei Yatabe, Yasuhiro Oikawa

    Abstract: Decomposition of an audio mixture into harmonic and percussive components, namely harmonic/percussive source separation (HPSS), is a useful pre-processing tool for many audio applications. Popular approaches to HPSS exploit the distinctive source-specific structures of power spectrograms. However, such approaches consider only power spectrograms, and the phase remains intact for resynthesizing the…

    Submitted 13 March, 2019; originally announced March 2019.

    Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-P16.5, Session: Music Signal Analysis, Feedback and Echo Cancellation and Equalization)

  21. arXiv:1903.03971  [pdf, other]

    cs.SD cs.LG eess.AS

    Deep Griffin-Lim Iteration

    Authors: Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

    Abstract: This paper presents a novel phase reconstruction method (only from a given amplitude spectrogram) by combining a signal-processing-based approach and a deep neural network (DNN). To retrieve a time-domain signal from its amplitude spectrogram, the corresponding phase is required. One of the popular phase reconstruction methods is the Griffin-Lim algorithm (GLA), which is based on the redundancy of…

    Submitted 10 March, 2019; originally announced March 2019.

    Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-L3.1, Session: Source Separation and Speech Enhancement I)

  22. arXiv:1811.08783  [pdf, other]

    eess.SP cs.SD eess.AS

    Designing nearly tight window for improving time-frequency masking

    Authors: Tsubasa Kusano, Yoshiki Masuyama, Kohei Yatabe, Yasuhiro Oikawa

    Abstract: Many audio signal processing methods are formulated in the time-frequency (T-F) domain which is obtained by the short-time Fourier transform (STFT). The properties of the STFT are fully characterized by window function, number of frequency channels, and time-shift. Thus, designing a better window is important for improving the performance of the processing especially when a less redundant T-F repr…

    Submitted 4 February, 2019; v1 submitted 17 November, 2018; originally announced November 2018.