
Showing 1–44 of 44 results for author: Nakatani, T

  1. arXiv:2402.03058  [pdf, other]

    eess.AS cs.SD

    Array Geometry-Robust Attention-Based Neural Beamformer for Moving Speakers

    Authors: Marvin Tammen, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Shoko Araki, Simon Doclo

    Abstract: Although mask-based beamforming is a powerful speech enhancement approach, it often requires manual parameter tuning to handle moving speakers. Recently, this approach was augmented with an attention-based spatial covariance matrix aggregator (ASA) module, enabling accurate tracking of moving speakers without manual tuning. However, the deep neural network model used in this module is limited to s…

    Submitted 17 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: accepted at Interspeech 2024

  2. arXiv:2311.11595  [pdf, ps, other]

    eess.AS

    Neural network-based virtual microphone estimation with virtual microphone and beamformer-level multi-task loss

    Authors: Hanako Segawa, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Rintaro Ikeshita, Shoko Araki, Takeshi Yamada, Shoji Makino

    Abstract: Array processing performance depends on the number of microphones available. Virtual microphone estimation (VME) has been proposed to increase the number of microphone signals artificially. Neural network-based VME (NN-VME) trains an NN with a VM-level loss to predict a signal at a microphone location that is available during training but not at inference. However, this training objective may not…

    Submitted 20 November, 2023; originally announced November 2023.

    Comments: 5 pages, 2 figures, 1 table

  3. arXiv:2308.03987  [pdf, other]

    eess.AS cs.SD

    Target Speech Extraction with Conditional Diffusion Model

    Authors: Naoyuki Kamo, Marc Delcroix, Tomohiro Nakatani

    Abstract: Diffusion model-based speech enhancement has received increased attention since it can generate very natural enhanced signals and generalizes well to unseen conditions. Diffusion models have been explored for several sub-tasks of speech enhancement, such as speech denoising, dereverberation, and source separation. In this paper, we investigate their use for target speech extraction (TSE), which co…

    Submitted 16 August, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

    Comments: 5 pages, 4 figures, Interspeech2023

  4. arXiv:2306.17317  [pdf, ps, other]

    eess.AS cs.SD

    Modified Parametric Multichannel Wiener Filter for Low-latency Enhancement of Speech Mixtures with Unknown Number of Speakers

    Authors: Ning Guo, Tomohiro Nakatani, Shoko Araki, Takehiro Moriya

    Abstract: This paper introduces a novel low-latency online beamforming (BF) algorithm, named Modified Parametric Multichannel Wiener Filter (Mod-PMWF), for enhancing speech mixtures with unknown and varying number of speakers. Although conventional BFs such as linearly constrained minimum variance BF (LCMV BF) can enhance a speech mixture, they typically require such attributes of the speech mixture as the…

    Submitted 29 June, 2023; originally announced June 2023.

  5. arXiv:2306.12820  [pdf, other]

    cs.SD eess.AS

    NoisyILRMA: Diffuse-Noise-Aware Independent Low-Rank Matrix Analysis for Fast Blind Source Extraction

    Authors: Koki Nishida, Norihiro Takamune, Rintaro Ikeshita, Daichi Kitamura, Hiroshi Saruwatari, Tomohiro Nakatani

    Abstract: In this paper, we address the multichannel blind source extraction (BSE) of a single source in diffuse noise environments. To solve this problem even faster than by fast multichannel nonnegative matrix factorization (FastMNMF) and its variant, we propose a BSE method called NoisyILRMA, which is a modification of independent low-rank matrix analysis (ILRMA) to account for diffuse noise. NoisyILRMA…

    Submitted 22 June, 2023; originally announced June 2023.

    Comments: 5 pages, 3 figures, accepted for European Signal Processing Conference 2023 (EUSIPCO 2023)

  6. arXiv:2305.13580  [pdf, other]

    eess.AS cs.SD

    Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization

    Authors: Marc Delcroix, Naohiro Tawara, Mireia Diez, Federico Landini, Anna Silnova, Atsunori Ogawa, Tomohiro Nakatani, Lukas Burget, Shoko Araki

    Abstract: Combining end-to-end neural speaker diarization (EEND) with vector clustering (VC), known as EEND-VC, has gained interest for leveraging the strengths of both methods. EEND-VC estimates activities and speaker embeddings for all speakers within an audio chunk and uses VC to associate these activities with speaker identities across different chunks. EEND-VC generates thus multiple streams of embeddi…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  7. arXiv:2205.03568  [pdf, other]

    eess.AS cs.SD

    Mask-based Neural Beamforming for Moving Speakers with Self-Attention-based Tracking

    Authors: Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Shoko Araki

    Abstract: Beamforming is a powerful tool designed to enhance speech signals from the direction of a target source. Computing the beamforming filter requires estimating spatial covariance matrices (SCMs) of the source and noise signals. Time-frequency masks are often used to compute these SCMs. Most studies of mask-based beamforming have assumed that the sources do not move. However, sources often move in pr…

    Submitted 7 May, 2022; originally announced May 2022.

    Comments: 11 pages, 7 figures, Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing

  8. arXiv:2204.04811  [pdf, other]

    eess.AS cs.SD

    Listen only to me! How well can target speech extraction handle false alarms?

    Authors: Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolikova, Hiroshi Sato, Tomohiro Nakatani

    Abstract: Target speech extraction (TSE) extracts the speech of a target speaker in a mixture given auxiliary clues characterizing the speaker, such as an enrollment utterance. TSE addresses thus the challenging problem of simultaneously performing separation and speaker identification. There has been much progress in extraction performance following the recent development of neural networks for speech enha…

    Submitted 14 July, 2022; v1 submitted 10 April, 2022; originally announced April 2022.

    Comments: Accepted to Interspeech 2022

  9. Effective data screening technique for crowdsourced speech intelligibility experiments: Evaluation with IRM-based speech enhancement

    Authors: Ayako Yamamoto, Toshio Irino, Shoko Araki, Kenichi Arai, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani

    Abstract: It is essential to perform speech intelligibility (SI) experiments with human listeners in order to evaluate objective intelligibility measures for developing effective speech enhancement and noise reduction algorithms. Recently, crowdsourced remote testing has become a popular means for collecting a massive amount and variety of data at a relatively small cost and in a short time. However, carefu…

    Submitted 19 August, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: This paper was submitted to APSIPA ASC 2022 (https://www.apsipa2022.org). The original title [v1] was "Subjective intelligibility of speech sounds enhanced by ideal ratio mask via crowdsourced remote experiments with effective data screening."

    Journal ref: Proc. APSIPA ASC 2022

  10. arXiv:2202.00875  [pdf, other]

    eess.SP eess.AS

    ISS2: An Extension of Iterative Source Steering Algorithm for Majorization-Minimization-Based Independent Vector Analysis

    Authors: Rintaro Ikeshita, Tomohiro Nakatani

    Abstract: A majorization-minimization (MM) algorithm for independent vector analysis optimizes a separation matrix $W = [w_1, \ldots, w_m]^h \in \mathbb{C}^{m \times m}$ by minimizing a surrogate function of the form $\mathcal{L}(W) = \sum_{i = 1}^m w_i^h V_i w_i - \log | \det W |^2$, where $m \in \mathbb{N}$ is the number of sensors and positive definite matrices…

    Submitted 16 June, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

    Comments: Accepted for publication in the 30th European Signal Processing Conference (EUSIPCO 2022)

  11. arXiv:2111.10574  [pdf, ps, other]

    eess.AS cs.SD eess.SP

    Switching Independent Vector Analysis and Its Extension to Blind and Spatially Guided Convolutional Beamforming Algorithms

    Authors: Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Hiroshi Sawada, Naoyuki Kamo, Shoko Araki

    Abstract: This paper develops a framework that can perform denoising, dereverberation, and source separation accurately by using a relatively small number of microphones. It has been empirically confirmed that Independent Vector Analysis (IVA) can blindly separate N sources from their sound mixture even with diffuse noise when a sufficiently large number (=M) of microphones are available (i.e., M>>N). Howev…

    Submitted 24 February, 2022; v1 submitted 20 November, 2021; originally announced November 2021.

    Comments: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on 27 July 2021, accepted on 22 Feb. 2022

  12. Blind and neural network-guided convolutional beamformer for joint denoising, dereverberation, and source separation

    Authors: Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Hiroshi Sawada, Shoko Araki

    Abstract: This paper proposes an approach for optimizing a Convolutional BeamFormer (CBF) that can jointly perform denoising (DN), dereverberation (DR), and source separation (SS). First, we develop a blind CBF optimization algorithm that requires no prior information on the sources or the room acoustics, by extending a conventional joint DR and SS method. For making the optimization computationally tractab…

    Submitted 4 August, 2021; originally announced August 2021.

    Comments: Accepted by IEEE ICASSP 2021

  13. arXiv:2106.05529  [pdf, other]

    cs.SD eess.AS

    Independent Deeply Learned Tensor Analysis for Determined Audio Source Separation

    Authors: Naoki Narisawa, Rintaro Ikeshita, Norihiro Takamune, Daichi Kitamura, Tomohiko Nakamura, Hiroshi Saruwatari, Tomohiro Nakatani

    Abstract: We address the determined audio source separation problem in the time-frequency domain. In independent deeply learned matrix analysis (IDLMA), it is assumed that the inter-frequency correlation of each source spectrum is zero, which is inappropriate for modeling nonstationary signals such as music signals. To account for the correlation between frequencies, independent positive semidefinite tensor…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: 5 pages, 2 figures, accepted for European Signal Processing Conference 2021 (EUSIPCO 2021)

  14. arXiv:2106.03903  [pdf, other]

    cs.SD cs.LG eess.AS

    PILOT: Introducing Transformers for Probabilistic Sound Event Localization

    Authors: Christopher Schymura, Benedikt Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

    Abstract: Sound event localization aims at estimating the positions of sound sources in the environment with respect to an acoustic receiver (e.g. a microphone array). Recent advances in this domain most prominently focused on utilizing deep recurrent neural networks. Inspired by the success of transformer architectures as a suitable alternative to classical recurrent neural networks, this paper introduces…

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted at INTERSPEECH 2021

  15. Comparison of remote experiments using crowdsourcing and laboratory experiments on speech intelligibility

    Authors: Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani

    Abstract: Many subjective experiments have been performed to develop objective speech intelligibility measures, but the novel coronavirus outbreak has made it very difficult to conduct experiments in a laboratory. One solution is to perform remote testing using crowdsourcing; however, because we cannot control the listening conditions, it is unclear whether the results are entirely reliable. In this study,…

    Submitted 16 April, 2021; originally announced April 2021.

    Comments: This paper was submitted to Interspeech2021

    Journal ref: Proc. Interspeech 2021

  16. arXiv:2103.00417  [pdf, other]

    cs.SD cs.LG eess.AS

    Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization

    Authors: Christopher Schymura, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

    Abstract: Sound event localization frameworks based on deep neural networks have shown increased robustness with respect to reverberation and noise in comparison to classical parametric approaches. In particular, recurrent architectures that incorporate temporal context into the estimation process seem to be well-suited for this task. This paper proposes a novel approach to sound event localization by utili…

    Submitted 28 February, 2021; originally announced March 2021.

    Comments: Published in Proceedings of the 28th European Signal Processing Conference (EUSIPCO), 2020

  17. arXiv:2102.11588  [pdf, other]

    cs.SD cs.AI cs.CL cs.CV cs.LG eess.AS eess.IV

    Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

    Authors: Julio Wissing, Benedikt Boenninghoff, Dorothea Kolossa, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Christopher Schymura

    Abstract: Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. Both applications benefit from a known speaker position when, for instance, applying beamforming or assigning unique speaker identities. Recently, several approaches utilizing acoustic signals augmented with visual data have been proposed for this task. However, both the…

    Submitted 24 February, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

    Comments: 4 pages, 6 figures, ICASSP 2021

  18. End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

    Authors: Wangyou Zhang, Christoph Boeddeker, Shinji Watanabe, Tomohiro Nakatani, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Naoyuki Kamo, Reinhold Haeb-Umbach, Yanmin Qian

    Abstract: Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions. However, severe performance degradation is still observed in the reverberant and noisy scenarios, and there is still a large performance gap between anechoic and reverberant conditions. In this work, we focus on the multichannel mult…

    Submitted 23 February, 2021; originally announced February 2021.

    Comments: 5 pages, 1 figure, accepted by ICASSP 2021

  19. arXiv:2102.04696  [pdf, other]

    eess.AS cs.SD eess.SP

    Independent Vector Extraction for Fast Joint Blind Source Separation and Dereverberation

    Authors: Rintaro Ikeshita, Tomohiro Nakatani

    Abstract: We address a blind source separation (BSS) problem in a noisy reverberant environment in which the number of microphones $M$ is greater than the number of sources of interest, and the other noise components can be approximated as stationary and Gaussian distributed. Conventional BSS algorithms for the optimization of a multi-input multi-output convolutional beamformer have suffered from a huge com…

    Submitted 21 April, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

    Comments: Accepted to IEEE Signal Processing Letters

  20. arXiv:2102.01326  [pdf, other]

    eess.AS cs.LG cs.SD

    Multimodal Attention Fusion for Target Speaker Extraction

    Authors: Hiroshi Sato, Tsubasa Ochiai, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Shoko Araki

    Abstract: Target speaker extraction, which aims at extracting a target speaker's voice from a mixture of voices using audio, visual or locational clues, has received much interest. Recently an audio-visual target speaker extraction has been proposed that extracts target speech by using complementary audio and visual clues. Although audio-visual target speaker extraction offers a more stable performance than…

    Submitted 2 February, 2021; originally announced February 2021.

    Comments: 7 pages, 5 figures

    Journal ref: in IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 778-784

  21. arXiv:2101.08563  [pdf, ps, other]

    cs.SD eess.AS

    A Joint Diagonalization Based Efficient Approach to Underdetermined Blind Audio Source Separation Using the Multichannel Wiener Filter

    Authors: Nobutaka Ito, Rintaro Ikeshita, Hiroshi Sawada, Tomohiro Nakatani

    Abstract: This paper presents a computationally efficient approach to blind source separation (BSS) of audio signals, applicable even when there are more sources than microphones (i.e., the underdetermined case). When there are as many sources as microphones (i.e., the determined case), BSS can be performed computationally efficiently by independent component analysis (ICA). Unfortunately, however, ICA is b…

    Submitted 21 January, 2021; originally announced January 2021.

    Comments: submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  22. arXiv:2101.05516  [pdf, other]

    eess.AS cs.SD

    Speaker activity driven neural speech extraction

    Authors: Marc Delcroix, Katerina Zmolikova, Tsubasa Ochiai, Keisuke Kinoshita, Tomohiro Nakatani

    Abstract: Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel…

    Submitted 9 February, 2021; v1 submitted 14 January, 2021; originally announced January 2021.

    Comments: To appear in ICASSP 2021

  23. arXiv:2101.04315  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Neural Network-based Virtual Microphone Estimator

    Authors: Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Shoko Araki

    Abstract: Developing microphone array technologies for a small number of microphones is important due to the constraints of many devices. One direction to address this situation consists of virtually augmenting the number of microphone signals, e.g., based on several physical model assumptions. However, such assumptions are not necessarily met in realistic conditions. In this paper, as an alternative approa…

    Submitted 12 January, 2021; originally announced January 2021.

    Comments: 5 pages, 2 figures, submitted to ICASSP 2021

  24. arXiv:2011.15003  [pdf, other]

    cs.SD eess.AS

    Convolutive Transfer Function Invariant SDR training criteria for Multi-Channel Reverberant Speech Separation

    Authors: Christoph Boeddeker, Wangyou Zhang, Tomohiro Nakatani, Keisuke Kinoshita, Tsubasa Ochiai, Marc Delcroix, Naoyuki Kamo, Yanmin Qian, Reinhold Haeb-Umbach

    Abstract: Time-domain training criteria have proven to be very effective for the separation of single-channel non-reverberant speech mixtures. Likewise, mask-based beamforming has shown impressive performance in multi-channel reverberant speech enhancement and source separation. Here, we propose to combine neural network supported multi-channel source separation with a time-domain training objective functio…

    Submitted 8 June, 2021; v1 submitted 30 November, 2020; originally announced November 2020.

    Comments: Accepted by ICASSP 2021

  25. arXiv:2011.11984  [pdf, other]

    eess.AS

    Integration of variational autoencoder and spatial clustering for adaptive multi-channel neural speech separation

    Authors: Katerina Zmolikova, Marc Delcroix, Lukáš Burget, Tomohiro Nakatani, Jan "Honza" Černocký

    Abstract: In this paper, we propose a method combining variational autoencoder model of speech with a spatial clustering approach for multi-channel speech separation. The advantage of integrating spatial clustering with a spectral model was shown in several works. As the spectral model, previous works used either factorial generative models of the mixed speech or discriminative neural networks. In our work,…

    Submitted 24 November, 2020; originally announced November 2020.

    Comments: 8 pages, 3 figures, to be published in SLT2021

  26. Block Coordinate Descent Algorithms for Auxiliary-Function-Based Independent Vector Extraction

    Authors: Rintaro Ikeshita, Tomohiro Nakatani, Shoko Araki

    Abstract: In this paper, we address the problem of extracting all super-Gaussian source signals from a linear mixture in which (i) the number of super-Gaussian sources $K$ is less than that of sensors $M$, and (ii) there are up to $M - K$ stationary Gaussian noises that do not need to be extracted. To solve this problem, independent vector extraction (IVE) using a majorization minimization and block coordin…

    Submitted 3 May, 2021; v1 submitted 18 October, 2020; originally announced October 2020.

    Comments: Accepted by IEEE Transactions on Signal Processing

  27. arXiv:2009.09395  [pdf, other]

    eess.AS

    Far-Field Automatic Speech Recognition

    Authors: Reinhold Haeb-Umbach, Jahn Heymann, Lukas Drude, Shinji Watanabe, Marc Delcroix, Tomohiro Nakatani

    Abstract: The machine recognition of speech spoken at a distance from the microphones, known as far-field automatic speech recognition (ASR), has received a significant increase of attention in science and industry, which caused or was caused by an equally significant improvement in recognition accuracy. Meanwhile it has entered the consumer market with digital home assistants with a spoken language interfa…

    Submitted 20 September, 2020; originally announced September 2020.

    Comments: accepted for Proceedings of the IEEE

  28. arXiv:2006.13579  [pdf, other]

    eess.AS cs.SD

    Multi-path RNN for hierarchical modeling of long sequential data and its application to speaker stream separation

    Authors: Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: Recently, source separation performance was greatly improved by time-domain audio source separation based on the dual-path recurrent neural network (DPRNN). DPRNN is a simple but effective model for long sequential data. While DPRNN is quite efficient in modeling sequential data of the length of an utterance, i.e., about 5 to 10 seconds of data, it is harder to apply it to longer sequences such as…

    Submitted 24 June, 2020; originally announced June 2020.

    Comments: 5 pages, 4 figures

  29. Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR

    Authors: Thilo von Neumann, Christoph Boeddeker, Lukas Drude, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations, it is typically unknown. To cope with this, we extend an iterative speech extraction system with mechanisms to count the number of sources and combine it with a single-talker speech recognizer to form the first end-to-end multi…

    Submitted 21 December, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

    Comments: 5 pages, INTERSPEECH 2020

  30. Jointly optimal denoising, dereverberation, and source separation

    Authors: Tomohiro Nakatani, Christoph Boeddeker, Keisuke Kinoshita, Rintaro Ikeshita, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: This paper proposes methods that can optimize a Convolutional BeamFormer (CBF) for jointly performing denoising, dereverberation, and source separation (DN+DR+SS) in a computationally efficient way. Conventionally, cascade configuration composed of a Weighted Prediction Error minimization (WPE) dereverberation filter followed by a Minimum Variance Distortionless Response beamformer has been used as…

    Submitted 2 August, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

    Comments: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on 12 Feb 2020, Accepted to IEEE/ACM Trans. Audio, Speech, and Language Processing on 14 July 2020

  31. arXiv:2005.04669  [pdf, other]

    cs.SD eess.AS eess.SP

    Cognitive-driven convolutional beamforming using EEG-based auditory attention decoding

    Authors: Ali Aroudi, Marc Delcroix, Tomohiro Nakatani, Keisuke Kinoshita, Shoko Araki, Simon Doclo

    Abstract: The performance of speech enhancement algorithms in a multi-speaker scenario depends on correctly identifying the target speaker to be enhanced. Auditory attention decoding (AAD) methods allow to identify the target speaker which the listener is attending to from single-trial EEG recordings. Aiming at enhancing the target speaker and suppressing interfering speakers, reverberation and ambient nois…

    Submitted 10 May, 2020; originally announced May 2020.

  32. arXiv:2003.03998  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Improving noise robust automatic speech recognition with single-channel time-domain enhancement network

    Authors: Keisuke Kinoshita, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani

    Abstract: With the advent of deep learning, research on noise-robust automatic speech recognition (ASR) has progressed rapidly. However, ASR performance in noisy conditions of single-channel systems remains unsatisfactory. Indeed, most single-channel speech enhancement (SE) methods (denoising) have brought only limited performance gains over state-of-the-art ASR back-end trained on multi-condition training…

    Submitted 9 March, 2020; originally announced March 2020.

    Comments: 5 pages, to appear in ICASSP2020

  33. arXiv:2003.03987  [pdf, other]

    eess.AS cs.LG cs.SD

    Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system

    Authors: Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani

    Abstract: Automatic meeting analysis is an essential fundamental technology required to let, e.g. smart devices follow and respond to our conversations. To achieve an optimal automatic meeting analysis, we previously proposed an all-neural approach that jointly solves source separation, speaker diarization and source counting problems in an optimal way (in a sense that all the 3 tasks can be jointly optimiz…

    Submitted 9 March, 2020; originally announced March 2020.

    Comments: 8 pages, to appear in ICASSP2020

  34. Overdetermined independent vector analysis

    Authors: Rintaro Ikeshita, Tomohiro Nakatani, Shoko Araki

    Abstract: We address the convolutive blind source separation problem for the (over-)determined case where (i) the number of nonstationary target-sources $K$ is less than that of microphones $M$, and (ii) there are up to $M - K$ stationary Gaussian noises that need not be extracted. Independent vector analysis (IVA) can solve the problem by separating into $M$ sources and selecting the top $K$ highly nons…

    Submitted 5 March, 2020; originally announced March 2020.

    Comments: To appear at the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

  35. arXiv:2002.08582  [pdf, ps, other]

    cs.SD eess.AS eess.SP

    Convergence-guaranteed Independent Positive Semidefinite Tensor Analysis Based on Student's t Distribution

    Authors: Tatsuki Kondo, Kanta Fukushige, Norihiro Takamune, Daichi Kitamura, Hiroshi Saruwatari, Rintaro Ikeshita, Tomohiro Nakatani

    Abstract: In this paper, we address a blind source separation (BSS) problem and propose a new extended framework of independent positive semidefinite tensor analysis (IPSDTA). IPSDTA is a state-of-the-art BSS method that enables us to take interfrequency correlations into account, but the generative model is limited within the multivariate Gaussian distribution and its parameter optimization algorithm does…

    Submitted 20 February, 2020; originally announced February 2020.

    Comments: 5 pages, 3 figures, to appear in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2020

  36. arXiv:2001.08378  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam

    Authors: Marc Delcroix, Tsubasa Ochiai, Katerina Zmolikova, Keisuke Kinoshita, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki

    Abstract: Target speech extraction, which extracts a single target source in a mixture given clues about the target speaker, has attracted increasing attention. We have recently proposed SpeakerBeam, which exploits an adaptation utterance of the target speaker to extract his/her voice characteristics that are then used to guide a neural network towards extracting speech of that speaker. SpeakerBeam presents…

    Submitted 23 January, 2020; originally announced January 2020.

    Comments: 5 pages, 3 figures. Submitted to ICASSP 2020

  37. End-to-end training of time domain audio separation and recognition

    Authors: Thilo von Neumann, Keisuke Kinoshita, Lukas Drude, Christoph Boeddeker, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: The rising interest in single-channel multi-speaker speech separation sparked development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, up until now, state-of-the-art neural network-based time domain source separation has not yet been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time domain Audi…

    Submitted 13 April, 2020; v1 submitted 18 December, 2019; originally announced December 2019.

    Comments: 5 pages, 1 figure, to appear in ICASSP 2020

  38. arXiv:1910.13707  [pdf, other]

    cs.SD cs.CL eess.AS

    Jointly optimal dereverberation and beamforming

    Authors: Christoph Boeddeker, Tomohiro Nakatani, Keisuke Kinoshita, Reinhold Haeb-Umbach

    Abstract: We previously proposed an optimal (in the maximum likelihood sense) convolutional beamformer that can perform simultaneous denoising and dereverberation, and showed its superiority over the widely used cascade of a WPE dereverberation filter and a conventional MPDR beamformer. However, it has not been fully investigated which components in the convolutional beamformer yield such superiority. To th…

    Submitted 30 October, 2019; originally announced October 2019.

    Comments: Submitted to ICASSP 2020

  39. arXiv:1908.02710  [pdf, ps, other]

    eess.AS cs.SD

    Maximum likelihood convolutional beamformer for simultaneous denoising and dereverberation

    Authors: Tomohiro Nakatani, Keisuke Kinoshita

    Abstract: This article describes a probabilistic formulation of a Weighted Power minimization Distortionless response convolutional beamformer (WPD). The WPD unifies a weighted prediction error based dereverberation method (WPE) and a minimum power distortionless response beamformer (MPDR) into a single convolutional beamformer, and achieves simultaneous dereverberation and denoising in an optimal way. Howe…

    Submitted 6 August, 2019; originally announced August 2019.

    Comments: Accepted for EUSIPCO 2019. arXiv admin note: text overlap with arXiv:1812.08400

  40. GEDI: Gammachirp Envelope Distortion Index for Predicting Intelligibility of Enhanced Speech

    Authors: Katsuhiko Yamamoto, Toshio Irino, Shoko Araki, Keisuke Kinoshita, Tomohiro Nakatani

    Abstract: In this study, we propose a new concept, the gammachirp envelope distortion index (GEDI), based on the signal-to-distortion ratio in the auditory envelope, SDRenv to predict the intelligibility of speech enhanced by nonlinear algorithms. The objective of GEDI is to calculate the distortion between enhanced and clean-speech representations in the domain of a temporal envelope extracted by the gamma…

    Submitted 19 July, 2020; v1 submitted 3 April, 2019; originally announced April 2019.

    Comments: Preprint, 37 pages, 6 tables, 9 figures

    Journal ref: Speech Communication, Vol. 123, pp. 43-58, 2020

  41. arXiv:1902.07881  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    All-neural online source separation, counting, and diarization for meeting analysis

    Authors: Thilo von Neumann, Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: Automatic meeting analysis comprises the tasks of speaker counting, speaker diarization, and the separation of overlapped speech, followed by automatic speech recognition. This all has to be carried out on arbitrarily long sessions and, ideally, in an online or block-online manner. While significant progress has been made on individual tasks, this paper presents for the first time an all-neural ap…

    Submitted 21 February, 2019; originally announced February 2019.

    Comments: 5 pages, to appear in ICASSP2019

  42. A unified convolutional beamformer for simultaneous denoising and dereverberation

    Authors: Tomohiro Nakatani, Keisuke Kinoshita

    Abstract: This paper proposes a method for estimating a convolutional beamformer that can perform denoising and dereverberation simultaneously in an optimal way. The application of dereverberation based on a weighted prediction error (WPE) method followed by denoising based on a minimum variance distortionless response (MVDR) beamformer has conventionally been considered a promising approach, however, the o…

    Submitted 5 May, 2019; v1 submitted 20 December, 2018; originally announced December 2018.

    Comments: Published in IEEE Signal Processing Letters

    Journal ref: IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019

  43. arXiv:1805.09498  [pdf, ps, other]

    cs.SD eess.AS

    FastFCA-AS: Joint Diagonalization Based Acceleration of Full-Rank Spatial Covariance Analysis for Separating Any Number of Sources

    Authors: Nobutaka Ito, Tomohiro Nakatani

    Abstract: Here we propose FastFCA-AS, an accelerated algorithm for Full-rank spatial Covariance Analysis (FCA), which is a robust audio source separation method proposed by Duong et al. ["Under-determined reverberant audio source separation using a full-rank spatial covariance model," IEEE Trans. ASLP, vol. 18, no. 7, pp. 1830-1840, Sept. 2010]. In the conventional FCA, matrix inversion and matrix multiplic…

    Submitted 23 May, 2018; originally announced May 2018.

    Comments: Submitted to IWAENC2018

  44. arXiv:1805.06572  [pdf, ps, other]

    cs.SD eess.AS

    FastFCA: A Joint Diagonalization Based Fast Algorithm for Audio Source Separation Using A Full-Rank Spatial Covariance Model

    Authors: Nobutaka Ito, Shoko Araki, Tomohiro Nakatani

    Abstract: A source separation method using a full-rank spatial covariance model has been proposed by Duong et al. ["Under-determined Reverberant Audio Source Separation Using a Full-rank Spatial Covariance Model," IEEE Trans. ASLP, vol. 18, no. 7, pp. 1830-1840, Sep. 2010], which is referred to as full-rank spatial covariance analysis (FCA) in this paper. Here we propose a fast algorithm for estimating the…

    Submitted 16 May, 2018; originally announced May 2018.