
Showing 1–44 of 44 results for author: Kinoshita, K

  1. Neural Target Speech Extraction: An Overview

    Authors: Katerina Zmolikova, Marc Delcroix, Tsubasa Ochiai, Keisuke Kinoshita, Jan Černocký, Dong Yu

    Abstract: Humans can listen to a target speaker even in challenging acoustic conditions that have noise, reverberation, and interfering speakers. This phenomenon is known as the cocktail-party effect. For decades, researchers have focused on approaching the listening ability of humans. One critical issue is handling interfering speakers because the target and non-target speech signals share similar characte…

    Submitted 30 January, 2023; originally announced January 2023.

    Comments: Submitted to IEEE Signal Processing Magazine on Apr. 25, 2022, and accepted on Jan. 12, 2023

  2. arXiv:2301.10984  [pdf]

    cs.CY

    Design aesthetics recommender system based on customer profile and wanted affect

    Authors: Brahim Benaissa, Masakazu Kobayashi, Keita Kinoshita

    Abstract: Product recommendation systems have been instrumental in online commerce since the early days. Their development is expanded further with the help of big data and advanced deep learning methods, where consumer profiling is central. The interest of the consumer can now be predicted based on the personal past choices and the choices of similar consumers. However, what is currently defined as a choic…

    Submitted 26 January, 2023; originally announced January 2023.

    Comments: 11 pages, 10 figures, peer-reviewed at the KEER 2022 conference

  3. On Word Error Rate Definitions and their Efficient Computation for Multi-Speaker Speech Recognition Systems

    Authors: Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: We propose a general framework to compute the word error rate (WER) of ASR systems that process recordings containing multiple speakers at their input and that produce multiple output word sequences (MIMO). Such ASR systems are typically required, e.g., for meeting transcription. We provide an efficient implementation based on a dynamic programming search in a multi-dimensional Levenshtein distanc…

    Submitted 21 July, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

    Comments: Presented at ICASSP 2023

  4. arXiv:2207.13888  [pdf, other]

    eess.AS cs.SD

    Utterance-by-utterance overlap-aware neural diarization with Graph-PIT

    Authors: Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Boeddeker, Reinhold Haeb-Umbach

    Abstract: Recent speaker diarization studies showed that integration of end-to-end neural diarization (EEND) and clustering-based diarization is a promising approach for achieving state-of-the-art performance on various tasks. Such an approach first divides an observed signal into fixed-length segments, then performs segment-level local diarization based on an EEND module, and merges the segment-level…

    Submitted 28 July, 2022; originally announced July 2022.

    Comments: Accepted to Interspeech 2022 (5 pages, 1 figure)

  5. arXiv:2206.08174  [pdf, other]

    eess.AS cs.SD eess.SP

    Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations

    Authors: Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura

    Abstract: Target speech extraction is a technique to extract the target speaker's voice from mixture signals using a pre-recorded enrollment utterance that characterizes the voice characteristics of the target speaker. One major difficulty of target speech extraction lies in handling variability in "intra-speaker" characteristics, i.e., characteristics mismatch between target speech and an enrollment utter…

    Submitted 16 June, 2022; originally announced June 2022.

    Comments: 5 pages, 2 figures, 3 tables. Submitted to Interspeech 2022

  6. arXiv:2204.04811  [pdf, other]

    eess.AS cs.SD

    Listen only to me! How well can target speech extraction handle false alarms?

    Authors: Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Katerina Zmolikova, Hiroshi Sato, Tomohiro Nakatani

    Abstract: Target speech extraction (TSE) extracts the speech of a target speaker in a mixture given auxiliary clues characterizing the speaker, such as an enrollment utterance. TSE thus addresses the challenging problem of simultaneously performing separation and speaker identification. There has been much progress in extraction performance following the recent development of neural networks for speech enha…

    Submitted 14 July, 2022; v1 submitted 10 April, 2022; originally announced April 2022.

    Comments: Accepted to Interspeech 2022

  7. arXiv:2204.03895  [pdf, other]

    eess.AS cs.SD

    SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning

    Authors: Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Yasunori Ohishi, Shoko Araki

    Abstract: In many situations, we would like to hear desired sound events (SEs) while being able to ignore interference. Target sound extraction (TSE) tackles this problem by estimating the audio signal of the sounds of target SE classes in a mixture of sounds while suppressing all other sounds. We can achieve this with a neural network that extracts the target SEs by conditioning it on clues representing th…

    Submitted 2 November, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on Feb. 10th, 2022, and accepted on Oct. 20, 2022

  8. Effective data screening technique for crowdsourced speech intelligibility experiments: Evaluation with IRM-based speech enhancement

    Authors: Ayako Yamamoto, Toshio Irino, Shoko Araki, Kenichi Arai, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani

    Abstract: It is essential to perform speech intelligibility (SI) experiments with human listeners in order to evaluate objective intelligibility measures for developing effective speech enhancement and noise reduction algorithms. Recently, crowdsourced remote testing has become a popular means for collecting a massive amount and variety of data at a relatively small cost and in a short time. However, carefu…

    Submitted 19 August, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: This paper was submitted to APSIPA ASC 2022 (https://www.apsipa2022.org). The original title [v1] was "Subjective intelligibility of speech sounds enhanced by ideal ratio mask via crowdsourced remote experiments with effective data screening."

    Journal ref: Proc. APSIPA ASC 2022

  9. arXiv:2202.06524  [pdf, other]

    eess.AS cs.LG cs.SD

    Tight integration of neural- and clustering-based diarization through deep unfolding of infinite Gaussian mixture model

    Authors: Keisuke Kinoshita, Marc Delcroix, Tomoharu Iwata

    Abstract: Speaker diarization has been investigated extensively as an important central task for meeting analysis. A recent trend shows that integration of end-to-end neural (EEND) and clustering-based diarization is a promising approach for handling realistic conversational data containing overlapped speech with an arbitrarily large number of speakers, and that it has achieved state-of-the-art results on various tasks. How…

    Submitted 14 February, 2022; originally announced February 2022.

    Comments: Accepted to IEEE ICASSP-2022, 5 pages, 2 figures

  10. Learning to Enhance or Not: Neural Network-Based Switching of Enhanced and Observed Signals for Overlapping Speech Recognition

    Authors: Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Naoyuki Kamo, Takafumi Moriya

    Abstract: The combination of a deep neural network (DNN)-based speech enhancement (SE) front-end and an automatic speech recognition (ASR) back-end is a widely used approach to implement overlapping speech recognition. However, the SE front-end generates processing artifacts that can degrade the ASR performance. We previously found that such performance degradation can occur even under fully overlapping co…

    Submitted 11 January, 2022; originally announced January 2022.

    Comments: 5 pages, 2 figures

    Journal ref: In 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6287-6291

  11. arXiv:2111.10574  [pdf, ps, other]

    eess.AS cs.SD eess.SP

    Switching Independent Vector Analysis and Its Extension to Blind and Spatially Guided Convolutional Beamforming Algorithms

    Authors: Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Hiroshi Sawada, Naoyuki Kamo, Shoko Araki

    Abstract: This paper develops a framework that can perform denoising, dereverberation, and source separation accurately by using a relatively small number of microphones. It has been empirically confirmed that Independent Vector Analysis (IVA) can blindly separate N sources from their sound mixture even with diffuse noise when a sufficiently large number (=M) of microphones are available (i.e., M>>N). Howev…

    Submitted 24 February, 2022; v1 submitted 20 November, 2021; originally announced November 2021.

    Comments: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on 27 July 2021, accepted on 22 Feb. 2022

  12. arXiv:2110.15581  [pdf, other]

    eess.AS cs.SD

    SA-SDR: A novel loss function for separation of meeting style data

    Authors: Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: Many state-of-the-art neural network-based source separation systems use the averaged Signal-to-Distortion Ratio (SDR) as a training objective function. The basic SDR is, however, undefined if the network reconstructs the reference signal perfectly or if the reference signal contains silence, e.g., when a two-output separator processes a single-speaker recording. Many modifications to the plain SD…

    Submitted 21 April, 2022; v1 submitted 29 October, 2021; originally announced October 2021.

    Comments: accepted at ICASSP 2022

  13. Blind and neural network-guided convolutional beamformer for joint denoising, dereverberation, and source separation

    Authors: Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Hiroshi Sawada, Shoko Araki

    Abstract: This paper proposes an approach for optimizing a Convolutional BeamFormer (CBF) that can jointly perform denoising (DN), dereverberation (DR), and source separation (SS). First, we develop a blind CBF optimization algorithm that requires no prior information on the sources or the room acoustics, by extending a conventional joint DR and SS method. For making the optimization computationally tractab…

    Submitted 4 August, 2021; originally announced August 2021.

    Comments: Accepted by IEEE ICASSP 2021

  14. Graph-PIT: Generalized permutation invariant training for continuous separation of arbitrary numbers of speakers

    Authors: Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: Automatic transcription of meetings requires handling of overlapped speech, which calls for continuous speech separation (CSS) systems. The uPIT criterion was proposed for utterance-level separation with neural networks and introduces the constraint that the total number of speakers must not exceed the number of output channels. When processing meeting-like data in a segment-wise manner, i.e., by…

    Submitted 20 September, 2021; v1 submitted 30 July, 2021; originally announced July 2021.

    Comments: Accepted at INTERSPEECH 2021

  15. arXiv:2107.14445  [pdf, other]

    eess.AS cs.SD

    Speeding Up Permutation Invariant Training for Source Separation

    Authors: Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: Permutation invariant training (PIT) is a widely used training criterion for neural network-based source separation, used for both utterance-level separation with utterance-level PIT (uPIT) and separation of long recordings with the recently proposed Graph-PIT. When implemented naively, both suffer from an exponential complexity in the number of utterances to separate, rendering them unusable for…

    Submitted 30 July, 2021; originally announced July 2021.

    Comments: Accepted at 14th ITG Conference on Speech Communication

  16. arXiv:2106.07144  [pdf, other]

    eess.AS cs.SD

    Few-shot learning of new sound classes for target sound extraction

    Authors: Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Shoko Araki

    Abstract: Target sound extraction consists of extracting the sound of a target acoustic event (AE) class from a mixture of AE sounds. It can be realized using a neural network that extracts the target sound conditioned on a 1-hot vector that represents the desired AE class. With this approach, embedding vectors associated with the AE classes are directly optimized for the extraction of sound classes seen du…

    Submitted 13 June, 2021; originally announced June 2021.

    Comments: To appear in Interspeech 2021

  17. arXiv:2106.03903  [pdf, other]

    cs.SD cs.LG eess.AS

    PILOT: Introducing Transformers for Probabilistic Sound Event Localization

    Authors: Christopher Schymura, Benedikt Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

    Abstract: Sound event localization aims at estimating the positions of sound sources in the environment with respect to an acoustic receiver (e.g. a microphone array). Recent advances in this domain most prominently focused on utilizing deep recurrent neural networks. Inspired by the success of transformer architectures as a suitable alternative to classical recurrent neural networks, this paper introduces…

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted at INTERSPEECH 2021

  18. Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition

    Authors: Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo

    Abstract: Although recent advances in deep learning technology improved automatic speech recognition (ASR), it remains difficult to recognize speech when it overlaps other people's voices. Speech separation or extraction is often used as a front-end to ASR to handle such overlapping speech. However, deep neural network-based speech enhancement can generate 'processing artifacts' as a side effect of the enha…

    Submitted 2 June, 2021; originally announced June 2021.

    Comments: 5 pages, 1 figure

    Journal ref: in Proc. Interspeech 2021, 1149-1153

  19. arXiv:2105.09040  [pdf, other]

    eess.AS cs.SD

    Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech

    Authors: Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara

    Abstract: Recently, we proposed a novel speaker diarization method called End-to-End-Neural-Diarization-vector clustering (EEND-vector clustering) that integrates clustering-based and end-to-end neural network-based diarization approaches into one framework. The proposed method combines advantages of both frameworks, i.e. high diarization performance and handling of overlapped speech based on EEND, and robu…

    Submitted 31 August, 2021; v1 submitted 19 May, 2021; originally announced May 2021.

    Comments: 5 pages, 1 figure, Interspeech2021. (Update to include a reference to the code)

  20. Comparison of remote experiments using crowdsourcing and laboratory experiments on speech intelligibility

    Authors: Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani

    Abstract: Many subjective experiments have been performed to develop objective speech intelligibility measures, but the novel coronavirus outbreak has made it very difficult to conduct experiments in a laboratory. One solution is to perform remote testing using crowdsourcing; however, because we cannot control the listening conditions, it is unclear whether the results are entirely reliable. In this study,…

    Submitted 16 April, 2021; originally announced April 2021.

    Comments: This paper was submitted to Interspeech2021

    Journal ref: Proc. Interspeech 2021

  21. arXiv:2103.00417  [pdf, other]

    cs.SD cs.LG eess.AS

    Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization

    Authors: Christopher Schymura, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

    Abstract: Sound event localization frameworks based on deep neural networks have shown increased robustness with respect to reverberation and noise in comparison to classical parametric approaches. In particular, recurrent architectures that incorporate temporal context into the estimation process seem to be well-suited for this task. This paper proposes a novel approach to sound event localization by utili…

    Submitted 28 February, 2021; originally announced March 2021.

    Comments: Published in Proceedings of the 28th European Signal Processing Conference (EUSIPCO), 2020

  22. arXiv:2102.11634  [pdf, other]

    eess.AS cs.SD

    Dual-Path Modeling for Long Recording Speech Separation in Meetings

    Authors: Chenda Li, Zhuo Chen, Yi Luo, Cong Han, Tianyan Zhou, Keisuke Kinoshita, Marc Delcroix, Shinji Watanabe, Yanmin Qian

    Abstract: Continuous speech separation (CSS) is the task of separating the speech sources from a long, partially overlapped recording, which involves a varying number of speakers. A straightforward extension of conventional utterance-level speech separation to the CSS task is to segment the long recording with a fixed-size window and process each window separately. Though effective, this extension fails to…

    Submitted 23 February, 2021; originally announced February 2021.

    Comments: Accepted by ICASSP 2021

  23. arXiv:2102.11588  [pdf, other]

    cs.SD cs.AI cs.CL cs.CV cs.LG eess.AS eess.IV

    Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

    Authors: Julio Wissing, Benedikt Boenninghoff, Dorothea Kolossa, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Christopher Schymura

    Abstract: Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. Both applications benefit from a known speaker position when, for instance, applying beamforming or assigning unique speaker identities. Recently, several approaches utilizing acoustic signals augmented with visual data have been proposed for this task. However, both the…

    Submitted 24 February, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

    Comments: 4 pages, 6 figures, ICASSP 2021

  24. End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

    Authors: Wangyou Zhang, Christoph Boeddeker, Shinji Watanabe, Tomohiro Nakatani, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Naoyuki Kamo, Reinhold Haeb-Umbach, Yanmin Qian

    Abstract: Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions. However, severe performance degradation is still observed in the reverberant and noisy scenarios, and there is still a large performance gap between anechoic and reverberant conditions. In this work, we focus on the multichannel mult…

    Submitted 23 February, 2021; originally announced February 2021.

    Comments: 5 pages, 1 figure, accepted by ICASSP 2021

  25. arXiv:2102.01326  [pdf, other]

    eess.AS cs.LG cs.SD

    Multimodal Attention Fusion for Target Speaker Extraction

    Authors: Hiroshi Sato, Tsubasa Ochiai, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Shoko Araki

    Abstract: Target speaker extraction, which aims at extracting a target speaker's voice from a mixture of voices using audio, visual or locational clues, has received much interest. Recently, an audio-visual approach to target speaker extraction has been proposed that extracts target speech by using complementary audio and visual clues. Although audio-visual target speaker extraction offers a more stable performance than…

    Submitted 2 February, 2021; originally announced February 2021.

    Comments: 7 pages, 5 figures

    Journal ref: in IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 778-784

  26. arXiv:2101.05516  [pdf, other]

    eess.AS cs.SD

    Speaker activity driven neural speech extraction

    Authors: Marc Delcroix, Katerina Zmolikova, Tsubasa Ochiai, Keisuke Kinoshita, Tomohiro Nakatani

    Abstract: Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel…

    Submitted 9 February, 2021; v1 submitted 14 January, 2021; originally announced January 2021.

    Comments: To appear in ICASSP 2021

  27. arXiv:2101.04315  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Neural Network-based Virtual Microphone Estimator

    Authors: Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Shoko Araki

    Abstract: Developing microphone array technologies for a small number of microphones is important due to the constraints of many devices. One direction to address this situation consists of virtually augmenting the number of microphone signals, e.g., based on several physical model assumptions. However, such assumptions are not necessarily met in realistic conditions. In this paper, as an alternative approa…

    Submitted 12 January, 2021; originally announced January 2021.

    Comments: 5 pages, 2 figures, submitted to ICASSP 2021

  28. arXiv:2012.09727  [pdf, other]

    eess.AS cs.SD eess.SP

    Continuous Speech Separation Using Speaker Inventory for Long Multi-talker Recording

    Authors: Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen

    Abstract: Leveraging additional speaker information to facilitate speech separation has received increasing attention in recent years. Recent research includes extracting target speech by using the target speaker's voice snippet and jointly separating all participating speakers by using a pool of additional speaker signals, which is known as speech separation using speaker inventory (SSUSI). However, all th…

    Submitted 18 December, 2020; v1 submitted 17 December, 2020; originally announced December 2020.

  29. arXiv:2011.15003  [pdf, other]

    cs.SD eess.AS

    Convolutive Transfer Function Invariant SDR training criteria for Multi-Channel Reverberant Speech Separation

    Authors: Christoph Boeddeker, Wangyou Zhang, Tomohiro Nakatani, Keisuke Kinoshita, Tsubasa Ochiai, Marc Delcroix, Naoyuki Kamo, Yanmin Qian, Reinhold Haeb-Umbach

    Abstract: Time-domain training criteria have proven to be very effective for the separation of single-channel non-reverberant speech mixtures. Likewise, mask-based beamforming has shown impressive performance in multi-channel reverberant speech enhancement and source separation. Here, we propose to combine neural network supported multi-channel source separation with a time-domain training objective functio…

    Submitted 8 June, 2021; v1 submitted 30 November, 2020; originally announced November 2020.

    Comments: Accepted by ICASSP 2021

  30. arXiv:2010.13366  [pdf, other]

    eess.AS cs.SD stat.ML

    Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds

    Authors: Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara

    Abstract: Recent diarization technologies can be categorized into two approaches, i.e., clustering and end-to-end neural approaches, which have different pros and cons. The clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors. While it can be seen as a current state-of-the-art approach that works for various challenging data with reasonable r…

    Submitted 4 February, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

    Comments: To appear in ICASSP 2021

  31. arXiv:2006.13579  [pdf, other]

    eess.AS cs.SD

    Multi-path RNN for hierarchical modeling of long sequential data and its application to speaker stream separation

    Authors: Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: Recently, source separation performance was greatly improved by time-domain audio source separation based on the dual-path recurrent neural network (DPRNN). DPRNN is a simple but effective model for long sequential data. While DPRNN is quite efficient in modeling sequential data of the length of an utterance, i.e., about 5 to 10 seconds of data, it is harder to apply it to longer sequences such as…

    Submitted 24 June, 2020; originally announced June 2020.

    Comments: 5 pages, 4 figures

  32. arXiv:2006.05712  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Listen to What You Want: Neural Network-based Universal Sound Selector

    Authors: Tsubasa Ochiai, Marc Delcroix, Yuma Koizumi, Hiroaki Ito, Keisuke Kinoshita, Shoko Araki

    Abstract: Being able to control the acoustic events (AEs) to which we want to listen would allow the development of more controllable hearable devices. This paper addresses the AE sound selection (or removal) problems, which we define as the extraction (or suppression) of all the sounds that belong to one or multiple desired AE classes. Although this problem could be addressed with a combination of source se…

    Submitted 10 June, 2020; originally announced June 2020.

    Comments: 5 pages, 2 figures, submitted to INTERSPEECH 2020

  33. Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR

    Authors: Thilo von Neumann, Christoph Boeddeker, Lukas Drude, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations, it is typically unknown. To cope with this, we extend an iterative speech extraction system with mechanisms to count the number of sources and combine it with a single-talker speech recognizer to form the first end-to-end multi…

    Submitted 21 December, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

    Comments: 5 pages, INTERSPEECH 2020

  34. Jointly optimal denoising, dereverberation, and source separation

    Authors: Tomohiro Nakatani, Christoph Boeddeker, Keisuke Kinoshita, Rintaro Ikeshita, Marc Delcroix, Reinhold Haeb-Umbach

    Abstract: This paper proposes methods that can optimize a Convolutional BeamFormer (CBF) for jointly performing denoising, dereverberation, and source separation (DN+DR+SS) in a computationally efficient way. Conventionally, a cascade configuration composed of a Weighted Prediction Error minimization (WPE) dereverberation filter followed by a Minimum Variance Distortionless Response beamformer has been used as…

    Submitted 2 August, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

    Comments: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on 12 Feb 2020, Accepted to IEEE/ACM Trans. Audio, Speech, and Language Processing on 14 July 2020

  35. arXiv:2005.04669  [pdf, other]

    cs.SD eess.AS eess.SP

    Cognitive-driven convolutional beamforming using EEG-based auditory attention decoding

    Authors: Ali Aroudi, Marc Delcroix, Tomohiro Nakatani, Keisuke Kinoshita, Shoko Araki, Simon Doclo

    Abstract: The performance of speech enhancement algorithms in a multi-speaker scenario depends on correctly identifying the target speaker to be enhanced. Auditory attention decoding (AAD) methods make it possible to identify the target speaker to whom the listener is attending from single-trial EEG recordings. Aiming at enhancing the target speaker and suppressing interfering speakers, reverberation and ambient nois…

    Submitted 10 May, 2020; originally announced May 2020.

  36. arXiv:2003.03998  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Improving noise robust automatic speech recognition with single-channel time-domain enhancement network

    Authors: Keisuke Kinoshita, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani

    Abstract: With the advent of deep learning, research on noise-robust automatic speech recognition (ASR) has progressed rapidly. However, ASR performance in noisy conditions of single-channel systems remains unsatisfactory. Indeed, most single-channel speech enhancement (SE) methods (denoising) have brought only limited performance gains over state-of-the-art ASR back-end trained on multi-condition training…

    Submitted 9 March, 2020; originally announced March 2020.

    Comments: 5 pages, to appear in ICASSP2020

  37. arXiv:2003.03987  [pdf, other]

    eess.AS cs.LG cs.SD

    Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system

    Authors: Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani

    Abstract: Automatic meeting analysis is an essential fundamental technology required to let, e.g. smart devices follow and respond to our conversations. To achieve an optimal automatic meeting analysis, we previously proposed an all-neural approach that jointly solves source separation, speaker diarization and source counting problems in an optimal way (in a sense that all the 3 tasks can be jointly optimiz…

    Submitted 9 March, 2020; originally announced March 2020.

    Comments: 8 pages, to appear in ICASSP2020

  38. arXiv:2001.08378  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam

    Authors: Marc Delcroix, Tsubasa Ochiai, Katerina Zmolikova, Keisuke Kinoshita, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki

    Abstract: Target speech extraction, which extracts a single target source in a mixture given clues about the target speaker, has attracted increasing attention. We have recently proposed SpeakerBeam, which exploits an adaptation utterance of the target speaker to extract his/her voice characteristics that are then used to guide a neural network towards extracting speech of that speaker. SpeakerBeam presents…

    Submitted 23 January, 2020; originally announced January 2020.

    Comments: 5 pages, 3 figures. Submitted to ICASSP 2020

  39. End-to-end training of time domain audio separation and recognition

    Authors: Thilo von Neumann, Keisuke Kinoshita, Lukas Drude, Christoph Boeddeker, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: The rising interest in single-channel multi-speaker speech separation sparked development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, up until now, state-of-the-art neural network-based time domain source separation has not yet been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time domain Audi…

    Submitted 13 April, 2020; v1 submitted 18 December, 2019; originally announced December 2019.

    Comments: 5 pages, 1 figure, to appear in ICASSP 2020

  40. arXiv:1910.13707  [pdf, other]

    cs.SD cs.CL eess.AS

    Jointly optimal dereverberation and beamforming

    Authors: Christoph Boeddeker, Tomohiro Nakatani, Keisuke Kinoshita, Reinhold Haeb-Umbach

    Abstract: We previously proposed an optimal (in the maximum likelihood sense) convolutional beamformer that can perform simultaneous denoising and dereverberation, and showed its superiority over the widely used cascade of a WPE dereverberation filter and a conventional MPDR beamformer. However, it has not been fully investigated which components in the convolutional beamformer yield such superiority. To th…

    Submitted 30 October, 2019; originally announced October 2019.

    Comments: Submitted to ICASSP 2020

  41. arXiv:1908.02710  [pdf, ps, other]

    eess.AS cs.SD

    Maximum likelihood convolutional beamformer for simultaneous denoising and dereverberation

    Authors: Tomohiro Nakatani, Keisuke Kinoshita

    Abstract: This article describes a probabilistic formulation of a Weighted Power minimization Distortionless response convolutional beamformer (WPD). The WPD unifies a weighted prediction error based dereverberation method (WPE) and a minimum power distortionless response beamformer (MPDR) into a single convolutional beamformer, and achieves simultaneous dereverberation and denoising in an optimal way. Howe…

    Submitted 6 August, 2019; originally announced August 2019.

    Comments: Accepted for EUSIPCO 2019. arXiv admin note: text overlap with arXiv:1812.08400

  42. GEDI: Gammachirp Envelope Distortion Index for Predicting Intelligibility of Enhanced Speech

    Authors: Katsuhiko Yamamoto, Toshio Irino, Shoko Araki, Keisuke Kinoshita, Tomohiro Nakatani

    Abstract: In this study, we propose a new concept, the gammachirp envelope distortion index (GEDI), based on the signal-to-distortion ratio in the auditory envelope, SDRenv, to predict the intelligibility of speech enhanced by nonlinear algorithms. The objective of GEDI is to calculate the distortion between enhanced and clean-speech representations in the domain of a temporal envelope extracted by the gamma…

    Submitted 19 July, 2020; v1 submitted 3 April, 2019; originally announced April 2019.

    Comments: Preprint, 37 pages, 6 tables, 9 figures

    Journal ref: Speech Communication, Vol. 123, pp. 43-58, 2020

  43. arXiv:1902.07881  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    All-neural online source separation, counting, and diarization for meeting analysis

    Authors: Thilo von Neumann, Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: Automatic meeting analysis comprises the tasks of speaker counting, speaker diarization, and the separation of overlapped speech, followed by automatic speech recognition. This all has to be carried out on arbitrarily long sessions and, ideally, in an online or block-online manner. While significant progress has been made on individual tasks, this paper presents for the first time an all-neural ap…

    Submitted 21 February, 2019; originally announced February 2019.

    Comments: 5 pages, to appear in ICASSP2019

  44. A unified convolutional beamformer for simultaneous denoising and dereverberation

    Authors: Tomohiro Nakatani, Keisuke Kinoshita

    Abstract: This paper proposes a method for estimating a convolutional beamformer that can perform denoising and dereverberation simultaneously in an optimal way. The application of dereverberation based on a weighted prediction error (WPE) method followed by denoising based on a minimum variance distortionless response (MVDR) beamformer has conventionally been considered a promising approach, however, the o…

    Submitted 5 May, 2019; v1 submitted 20 December, 2018; originally announced December 2018.

    Comments: Published in IEEE Signal Processing Letters

    Journal ref: IEEE Signal Processing Letters, vol. 26, no. 6, pp. 903-907, June 2019