Skip to main content

Showing 1–30 of 30 results for author: Araki, S

Searching in archive cs. Search in all archives.
.
  1. arXiv:2404.14860  [pdf, other

    eess.AS cs.SD

    Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance

    Authors: Tsubasa Ochiai, Kazuma Iwamoto, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri

    Abstract: It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE f… ▽ More

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: 13 pages, 6 figures, Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing

  2. arXiv:2402.13200  [pdf, other

    eess.AS cs.SD

    Probing Self-supervised Learning Models with Target Speech Extraction

    Authors: Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Takanori Ashihara, Shoko Araki, Jan Cernocky

    Abstract: Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable advancements in speech-related tasks. However, the utilization of these models in complex multi-talker scenarios, such as extracting a target speaker in a mixture, is yet to be fully evaluated. In this paper, we introduce target speech extraction (TSE) as a novel downstream task to evaluate the feature extraction c… ▽ More

    Submitted 17 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop

  3. arXiv:2402.13199  [pdf, other

    eess.AS cs.SD

    Target Speech Extraction with Pre-trained Self-supervised Learning Models

    Authors: Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Cernocky

    Abstract: Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker in a mixture guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework, i.e., to process the input mixt… ▽ More

    Submitted 17 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024

  4. arXiv:2402.03058  [pdf, other

    eess.AS cs.SD

    Array Geometry-Robust Attention-Based Neural Beamformer for Moving Speakers

    Authors: Marvin Tammen, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Shoko Araki, Simon Doclo

    Abstract: Although mask-based beamforming is a powerful speech enhancement approach, it often requires manual parameter tuning to handle moving speakers. Recently, this approach was augmented with an attention-based spatial covariance matrix aggregator (ASA) module, enabling accurate tracking of moving speakers without manual tuning. However, the deep neural network model used in this module is limited to s… ▽ More

    Submitted 17 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: accepted at Interspeech 2024

  5. arXiv:2312.12764  [pdf, ps, other

    eess.AS cs.CL cs.SD

    Lattice Rescoring Based on Large Ensemble of Complementary Neural Language Models

    Authors: Atsunori Ogawa, Naohiro Tawara, Marc Delcroix, Shoko Araki

    Abstract: We investigate the effectiveness of using a large ensemble of advanced neural language models (NLMs) for lattice rescoring on automatic speech recognition (ASR) hypotheses. Previous studies have reported the effectiveness of combining a small number of NLMs. In contrast, in this study, we combine up to eight NLMs, i.e., forward/backward long short-term memory/Transformer-LMs that are trained with… ▽ More

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2022

  6. arXiv:2306.17317  [pdf, ps, other

    eess.AS cs.SD

    Modified Parametric Multichannel Wiener Filter \\for Low-latency Enhancement of Speech Mixtures with Unknown Number of Speakers

    Authors: Ning Guo, Tomohiro Nakatani, Shoko Araki, Takehiro Moriya

    Abstract: This paper introduces a novel low-latency online beamforming (BF) algorithm, named Modified Parametric Multichannel Wiener Filter (Mod-PMWF), for enhancing speech mixtures with unknown and varying number of speakers. Although conventional BFs such as linearly constrained minimum variance BF (LCMV BF) can enhance a speech mixture, they typically require such attributes of the speech mixture as the… ▽ More

    Submitted 29 June, 2023; originally announced June 2023.

  7. arXiv:2305.13580  [pdf, other

    eess.AS cs.SD

    Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization

    Authors: Marc Delcroix, Naohiro Tawara, Mireia Diez, Federico Landini, Anna Silnova, Atsunori Ogawa, Tomohiro Nakatani, Lukas Burget, Shoko Araki

    Abstract: Combining end-to-end neural speaker diarization (EEND) with vector clustering (VC), known as EEND-VC, has gained interest for leveraging the strengths of both methods. EEND-VC estimates activities and speaker embeddings for all speakers within an audio chunk and uses VC to associate these activities with speaker identities across different chunks. EEND-VC generates thus multiple streams of embeddi… ▽ More

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  8. arXiv:2207.11964  [pdf, other

    eess.AS cs.LG cs.MM cs.SD

    ConceptBeam: Concept Driven Target Speech Extraction

    Authors: Yasunori Ohishi, Marc Delcroix, Tsubasa Ochiai, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Akisato Kimura, Noboru Harada, Kunio Kashino

    Abstract: We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker in a mixture. Typical approaches have been exploiting properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we… ▽ More

    Submitted 25 July, 2022; originally announced July 2022.

    Comments: Accepted to ACM Multimedia 2022

  9. arXiv:2205.03568  [pdf, other

    eess.AS cs.SD

    Mask-based Neural Beamforming for Moving Speakers with Self-Attention-based Tracking

    Authors: Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Shoko Araki

    Abstract: Beamforming is a powerful tool designed to enhance speech signals from the direction of a target source. Computing the beamforming filter requires estimating spatial covariance matrices (SCMs) of the source and noise signals. Time-frequency masks are often used to compute these SCMs. Most studies of mask-based beamforming have assumed that the sources do not move. However, sources often move in pr… ▽ More

    Submitted 7 May, 2022; originally announced May 2022.

    Comments: 11 pages, 7 figures, Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing

  10. arXiv:2204.03895  [pdf, other

    eess.AS cs.SD

    SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning

    Authors: Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Yasunori Ohishi, Shoko Araki

    Abstract: In many situations, we would like to hear desired sound events (SEs) while being able to ignore interference. Target sound extraction (TSE) tackles this problem by estimating the audio signal of the sounds of target SE classes in a mixture of sounds while suppressing all other sounds. We can achieve this with a neural network that extracts the target SEs by conditioning it on clues representing th… ▽ More

    Submitted 2 November, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on Feb. 10th, 2022, and accepted on Oct. 20, 2022

  11. Effective data screening technique for crowdsourced speech intelligibility experiments: Evaluation with IRM-based speech enhancement

    Authors: Ayako Yamamoto, Toshio Irino, Shoko Araki, Kenichi Arai, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani

    Abstract: It is essential to perform speech intelligibility (SI) experiments with human listeners in order to evaluate objective intelligibility measures for develo** effective speech enhancement and noise reduction algorithms. Recently, crowdsourced remote testing has become a popular means for collecting a massive amount and variety of data at a relatively small cost and in a short time. However, carefu… ▽ More

    Submitted 19 August, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: This paper was submitted to APSIPA ASC 2022 (https://www.apsipa2022.org). The original title [v1] was "Subjective intelligibility of speech sounds enhanced by ideal ratio mask via crowdsourced remote experiments with effective data screening."

    Journal ref: Proc. APSIPA ASC 2022

  12. arXiv:2201.06685  [pdf, other

    eess.AS cs.SD

    How Bad Are Artifacts?: Analyzing the Impact of Speech Enhancement Errors on ASR

    Authors: Kazuma Iwamoto, Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Hiroshi Sato, Shoko Araki, Shigeru Katagiri

    Abstract: It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with single-channel speech enhancement (SE). In this paper, we investigate the causes of ASR performance degradation by decomposing the SE errors using orthogonal projection-based decomposition (OPD). OPD decomposes the SE errors into noise and artifact components. The artifact component is defined as t… ▽ More

    Submitted 30 March, 2022; v1 submitted 17 January, 2022; originally announced January 2022.

    Comments: 5 pages, 5 figures, submitted to Interspeech 2022

  13. arXiv:2111.10574  [pdf, ps, other

    eess.AS cs.SD eess.SP

    Switching Independent Vector Analysis and Its Extension to Blind and Spatially Guided Convolutional Beamforming Algorithms

    Authors: Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Hiroshi Sawada, Naoyuki Kamo, Shoko Araki

    Abstract: This paper develops a framework that can perform denoising, dereverberation, and source separation accurately by using a relatively small number of microphones. It has been empirically confirmed that Independent Vector Analysis (IVA) can blindly separate N sources from their sound mixture even with diffuse noise when a sufficiently large number (=M) of microphones are available (i.e., M>>N). Howev… ▽ More

    Submitted 24 February, 2022; v1 submitted 20 November, 2021; originally announced November 2021.

    Comments: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on 27 July 2021, accepted on 22 Feb. 2022

  14. Blind and neural network-guided convolutional beamformer for joint denoising, dereverberation, and source separation

    Authors: Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Hiroshi Sawada, Shoko Araki

    Abstract: This paper proposes an approach for optimizing a Convolutional BeamFormer (CBF) that can jointly perform denoising (DN), dereverberation (DR), and source separation (SS). First, we develop a blind CBF optimization algorithm that requires no prior information on the sources or the room acoustics, by extending a conventional joint DR and SS method. For making the optimization computationally tractab… ▽ More

    Submitted 4 August, 2021; originally announced August 2021.

    Comments: Accepted by IEEE ICASSP 2021

  15. arXiv:2106.07144  [pdf, other

    eess.AS cs.SD

    Few-shot learning of new sound classes for target sound extraction

    Authors: Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Shoko Araki

    Abstract: Target sound extraction consists of extracting the sound of a target acoustic event (AE) class from a mixture of AE sounds. It can be realized using a neural network that extracts the target sound conditioned on a 1-hot vector that represents the desired AE class. With this approach, embedding vectors associated with the AE classes are directly optimized for the extraction of sound classes seen du… ▽ More

    Submitted 13 June, 2021; originally announced June 2021.

    Comments: To appear in Interspeech 2021

  16. arXiv:2106.03903  [pdf, other

    cs.SD cs.LG eess.AS

    PILOT: Introducing Transformers for Probabilistic Sound Event Localization

    Authors: Christopher Schymura, Benedikt Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

    Abstract: Sound event localization aims at estimating the positions of sound sources in the environment with respect to an acoustic receiver (e.g. a microphone array). Recent advances in this domain most prominently focused on utilizing deep recurrent neural networks. Inspired by the success of transformer architectures as a suitable alternative to classical recurrent neural networks, this paper introduces… ▽ More

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted at INTERSPEECH 2021

  17. Comparison of remote experiments using crowdsourcing and laboratory experiments on speech intelligibility

    Authors: Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani

    Abstract: Many subjective experiments have been performed to develop objective speech intelligibility measures, but the novel coronavirus outbreak has made it very difficult to conduct experiments in a laboratory. One solution is to perform remote testing using crowdsourcing; however, because we cannot control the listening conditions, it is unclear whether the results are entirely reliable. In this study,… ▽ More

    Submitted 16 April, 2021; originally announced April 2021.

    Comments: This paper was submitted to Interspeech2021

    Journal ref: Proc. Interspeech 2021

  18. arXiv:2103.00417  [pdf, other

    cs.SD cs.LG eess.AS

    Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization

    Authors: Christopher Schymura, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

    Abstract: Sound event localization frameworks based on deep neural networks have shown increased robustness with respect to reverberation and noise in comparison to classical parametric approaches. In particular, recurrent architectures that incorporate temporal context into the estimation process seem to be well-suited for this task. This paper proposes a novel approach to sound event localization by utili… ▽ More

    Submitted 28 February, 2021; originally announced March 2021.

    Comments: Published in Proceedings of the 28th European Signal Processing Conference (EUSIPCO), 2020

  19. arXiv:2102.11588  [pdf, other

    cs.SD cs.AI cs.CL cs.CV cs.LG eess.AS eess.IV

    Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

    Authors: Julio Wissing, Benedikt Boenninghoff, Dorothea Kolossa, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Christopher Schymura

    Abstract: Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. Both applications benefit from a known speaker position when, for instance, applying beamforming or assigning unique speaker identities. Recently, several approaches utilizing acoustic signals augmented with visual data have been proposed for this task. However, both the… ▽ More

    Submitted 24 February, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

    Comments: 4 pages, 6 figures, ICASSP 2021

  20. arXiv:2102.01326  [pdf, other

    eess.AS cs.LG cs.SD

    Multimodal Attention Fusion for Target Speaker Extraction

    Authors: Hiroshi Sato, Tsubasa Ochiai, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Shoko Araki

    Abstract: Target speaker extraction, which aims at extracting a target speaker's voice from a mixture of voices using audio, visual or locational clues, has received much interest. Recently an audio-visual target speaker extraction has been proposed that extracts target speech by using complementary audio and visual clues. Although audio-visual target speaker extraction offers a more stable performance than… ▽ More

    Submitted 2 February, 2021; originally announced February 2021.

    Comments: 7 pages, 5 figures

    Journal ref: in IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 778-784

  21. arXiv:2101.04315  [pdf, ps, other

    eess.AS cs.LG cs.SD

    Neural Network-based Virtual Microphone Estimator

    Authors: Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Rintaro Ikeshita, Keisuke Kinoshita, Shoko Araki

    Abstract: Develo** microphone array technologies for a small number of microphones is important due to the constraints of many devices. One direction to address this situation consists of virtually augmenting the number of microphone signals, e.g., based on several physical model assumptions. However, such assumptions are not necessarily met in realistic conditions. In this paper, as an alternative approa… ▽ More

    Submitted 12 January, 2021; originally announced January 2021.

    Comments: 5 pages, 2 figures, submitted to ICASSP 2021

  22. arXiv:2006.05712  [pdf, ps, other

    eess.AS cs.LG cs.SD

    Listen to What You Want: Neural Network-based Universal Sound Selector

    Authors: Tsubasa Ochiai, Marc Delcroix, Yuma Koizumi, Hiroaki Ito, Keisuke Kinoshita, Shoko Araki

    Abstract: Being able to control the acoustic events (AEs) to which we want to listen would allow the development of more controllable hearable devices. This paper addresses the AE sound selection (or removal) problems, that we define as the extraction (or suppression) of all the sounds that belong to one or multiple desired AE classes. Although this problem could be addressed with a combination of source se… ▽ More

    Submitted 10 June, 2020; originally announced June 2020.

    Comments: 5 pages, 2 figures, submitted to INTERSPEECH 2020

  23. arXiv:2005.08170  [pdf, other

    cs.CV cs.LG

    Neural Networks for Fashion Image Classification and Visual Search

    Authors: Fengzi Li, Shashi Kant, Shunichi Araki, Sumer Bangera, Swapna Samir Shukla

    Abstract: We discuss two potentially challenging problems faced by the ecommerce industry. One relates to the problem faced by sellers while uploading pictures of products on the platform for sale and the consequent manual tagging involved. It gives rise to misclassifications leading to its absence from search results. The other problem concerns with the potential bottleneck in placing orders when a custome… ▽ More

    Submitted 17 May, 2020; originally announced May 2020.

  24. arXiv:2005.04669  [pdf, other

    cs.SD eess.AS eess.SP

    Cognitive-driven convolutional beamforming using EEG-based auditory attention decoding

    Authors: Ali Aroudi, Marc Delcroix, Tomohiro Nakatani, Keisuke Kinoshita, Shoko Araki, Simon Doclo

    Abstract: The performance of speech enhancement algorithms in a multi-speaker scenario depends on correctly identifying the target speaker to be enhanced. Auditory attention decoding (AAD) methods allow to identify the target speaker which the listener is attending to from single-trial EEG recordings. Aiming at enhancing the target speaker and suppressing interfering speakers, reverberation and ambient nois… ▽ More

    Submitted 10 May, 2020; originally announced May 2020.

  25. arXiv:2003.03987  [pdf, other

    eess.AS cs.LG cs.SD

    Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system

    Authors: Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani

    Abstract: Automatic meeting analysis is an essential fundamental technology required to let, e.g. smart devices follow and respond to our conversations. To achieve an optimal automatic meeting analysis, we previously proposed an all-neural approach that jointly solves source separation, speaker diarization and source counting problems in an optimal way (in a sense that all the 3 tasks can be jointly optimiz… ▽ More

    Submitted 9 March, 2020; originally announced March 2020.

    Comments: 8 pages, to appear in ICASSP2020

  26. Overdetermined independent vector analysis

    Authors: Rintaro Ikeshita, Tomohiro Nakatani, Shoko Araki

    Abstract: We address the convolutive blind source separation problem for the (over-)determined case where (i) the number of nonstationary target-sources $K$ is less than that of microphones $M$, and (ii) there are up to $M - K$ stationary Gaussian noises that need not to be extracted. Independent vector analysis (IVA) can solve the problem by separating into $M$ sources and selecting the top $K$ highly nons… ▽ More

    Submitted 5 March, 2020; originally announced March 2020.

    Comments: To appear at the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)

  27. arXiv:2001.08378  [pdf, ps, other

    eess.AS cs.CL cs.SD

    Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam

    Authors: Marc Delcroix, Tsubasa Ochiai, Katerina Zmolikova, Keisuke Kinoshita, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki

    Abstract: Target speech extraction, which extracts a single target source in a mixture given clues about the target speaker, has attracted increasing attention. We have recently proposed SpeakerBeam, which exploits an adaptation utterance of the target speaker to extract his/her voice characteristics that are then used to guide a neural network towards extracting speech of that speaker. SpeakerBeam presents… ▽ More

    Submitted 23 January, 2020; originally announced January 2020.

    Comments: 5 pages, 3 figures. Submitted to ICASSP 2020

  28. GEDI: Gammachirp Envelope Distortion Index for Predicting Intelligibility of Enhanced Speech

    Authors: Katsuhiko Yamamoto, Toshio Irino, Shoko Araki, Keisuke Kinoshita, Tomohiro Nakatani

    Abstract: In this study, we propose a new concept, the gammachirp envelope distortion index (GEDI), based on the signal-to-distortion ratio in the auditory envelope, SDRenv to predict the intelligibility of speech enhanced by nonlinear algorithms. The objective of GEDI is to calculate the distortion between enhanced and clean-speech representations in the domain of a temporal envelope extracted by the gamma… ▽ More

    Submitted 19 July, 2020; v1 submitted 3 April, 2019; originally announced April 2019.

    Comments: Preprint, 37 pages, 6 tables, 9 figures

    Journal ref: Speech Communication, Vol. 123, pp. 43-58, 2020

  29. arXiv:1902.07881  [pdf, ps, other

    eess.AS cs.LG cs.SD

    All-neural online source separation, counting, and diarization for meeting analysis

    Authors: Thilo von Neumann, Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani, Reinhold Haeb-Umbach

    Abstract: Automatic meeting analysis comprises the tasks of speaker counting, speaker diarization, and the separation of overlapped speech, followed by automatic speech recognition. This all has to be carried out on arbitrarily long sessions and, ideally, in an online or block-online manner. While significant progress has been made on individual tasks, this paper presents for the first time an all-neural ap… ▽ More

    Submitted 21 February, 2019; originally announced February 2019.

    Comments: 5 pages, to appear in ICASSP2019

  30. arXiv:1805.06572  [pdf, ps, other

    cs.SD eess.AS

    FastFCA: A Joint Diagonalization Based Fast Algorithm for Audio Source Separation Using A Full-Rank Spatial Covariance Model

    Authors: Nobutaka Ito, Shoko Araki, Tomohiro Nakatani

    Abstract: A source separation method using a full-rank spatial covariance model has been proposed by Duong et al. ["Under-determined Reverberant Audio Source Separation Using a Full-rank Spatial Covariance Model," IEEE Trans. ASLP, vol. 18, no. 7, pp. 1830-1840, Sep. 2010], which is referred to as full-rank spatial covariance analysis (FCA) in this paper. Here we propose a fast algorithm for estimating the… ▽ More

    Submitted 16 May, 2018; originally announced May 2018.