Showing 1–12 of 12 results for author: Ogawa, A

Searching in archive eess.
  1. arXiv:2406.18972  [pdf, ps, other]

    eess.AS cs.CL

    Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over

    Authors: Atsunori Ogawa, Naoyuki Kamo, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Takatomo Kano, Naohiro Tawara, Marc Delcroix

    Abstract: Large language models (LLMs) have been successfully applied for rescoring automatic speech recognition (ASR) hypotheses. However, their ability to rescore ASR hypotheses of casual conversations has not been sufficiently explored. In this study, we reveal it by performing N-best ASR hypotheses rescoring using Llama2 on the CHiME-7 distant ASR (DASR) task. Llama2 is one of the most representative LL…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 5 pages

  2. arXiv:2312.14609  [pdf, ps, other]

    eess.AS cs.CL

    BLSTM-Based Confidence Estimation for End-to-End Speech Recognition

    Authors: Atsunori Ogawa, Naohiro Tawara, Takatomo Kano, Marc Delcroix

    Abstract: Confidence estimation, in which we estimate the reliability of each recognized token (e.g., word, sub-word, and character) in automatic speech recognition (ASR) hypotheses and detect incorrectly recognized tokens, is an important function for developing ASR applications. In this study, we perform confidence estimation for end-to-end (E2E) ASR hypotheses. Recent E2E ASR systems show high performanc…

    Submitted 22 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2021

  3. arXiv:2312.12764  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Lattice Rescoring Based on Large Ensemble of Complementary Neural Language Models

    Authors: Atsunori Ogawa, Naohiro Tawara, Marc Delcroix, Shoko Araki

    Abstract: We investigate the effectiveness of using a large ensemble of advanced neural language models (NLMs) for lattice rescoring on automatic speech recognition (ASR) hypotheses. Previous studies have reported the effectiveness of combining a small number of NLMs. In contrast, in this study, we combine up to eight NLMs, i.e., forward/backward long short-term memory/Transformer-LMs that are trained with…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2022

  4. arXiv:2310.11010  [pdf, ps, other]

    eess.AS cs.CL

    Iterative Shallow Fusion of Backward Language Model for End-to-End Speech Recognition

    Authors: Atsunori Ogawa, Takafumi Moriya, Naoyuki Kamo, Naohiro Tawara, Marc Delcroix

    Abstract: We propose a new shallow fusion (SF) method to exploit an external backward language model (BLM) for end-to-end automatic speech recognition (ASR). The BLM has complementary characteristics with a forward language model (FLM), and the effectiveness of their combination has been confirmed by rescoring ASR hypotheses as post-processing. In the proposed SF, we iteratively apply the BLM to partial ASR…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: Accepted to ICASSP 2023

  5. arXiv:2309.12656  [pdf, other]

    eess.AS cs.SD

    NTT speaker diarization system for CHiME-7: multi-domain, multi-microphone End-to-end and vector clustering diarization

    Authors: Naohiro Tawara, Marc Delcroix, Atsushi Ando, Atsunori Ogawa

    Abstract: This paper details our speaker diarization system designed for multi-domain, multi-microphone casual conversations. The proposed diarization pipeline uses weighted prediction error (WPE)-based dereverberation as a front end, then applies end-to-end neural diarization with vector clustering (EEND-VC) to each channel separately. It integrates the diarization result obtained from each channel using d…

    Submitted 22 September, 2023; originally announced September 2023.

    Comments: 5 pages, 5 figures, Submitted to ICASSP 2024

  6. arXiv:2306.04233  [pdf, other]

    cs.CL cs.SD eess.AS

    Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization

    Authors: Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Takatomo Kano, Atsunori Ogawa, Marc Delcroix

    Abstract: End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model. This approach is promising because, in contrast to the conventional cascade approach, it can utilize full acoustical information and mitigate the propagation of transcription errors. However, due to the high cost of collecting speech-summary pairs, an E2E SSum model…

    Submitted 7 June, 2023; originally announced June 2023.

    Comments: Accepted by Interspeech 2023

  7. arXiv:2305.15971  [pdf, other]

    eess.AS

    Knowledge Distillation for Neural Transducer-based Target-Speaker ASR: Exploiting Parallel Mixture/Single-Talker Speech Data

    Authors: Takafumi Moriya, Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takanori Ashihara, Kohei Matsuura, Tomohiro Tanaka, Ryo Masumura, Atsunori Ogawa, Taichi Asami

    Abstract: Neural transducer (RNNT)-based target-speaker speech recognition (TS-RNNT) directly transcribes a target speaker's voice from a multi-talker mixture. It is a promising approach for streaming applications because it does not incur the extra computation costs of a target speech extraction frontend, which is a critical barrier to quick response. TS-RNNT is trained end-to-end given the input speech (i…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  8. arXiv:2305.13580  [pdf, other]

    eess.AS cs.SD

    Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization

    Authors: Marc Delcroix, Naohiro Tawara, Mireia Diez, Federico Landini, Anna Silnova, Atsunori Ogawa, Tomohiro Nakatani, Lukas Burget, Shoko Araki

    Abstract: Combining end-to-end neural speaker diarization (EEND) with vector clustering (VC), known as EEND-VC, has gained interest for leveraging the strengths of both methods. EEND-VC estimates activities and speaker embeddings for all speakers within an audio chunk and uses VC to associate these activities with speaker identities across different chunks. EEND-VC thus generates multiple streams of embeddi…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  9. arXiv:2303.00978  [pdf, other]

    cs.CL eess.AS

    Leveraging Large Text Corpora for End-to-End Speech Summarization

    Authors: Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura

    Abstract: End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech. Compared with the cascade approach, which combines automatic speech recognition (ASR) and text summarization models, the E2E approach is more promising because it mitigates ASR errors, incorporates nonverbal information, and simplifies the overall system. However, since collecting a large…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023

  10. Effective data screening technique for crowdsourced speech intelligibility experiments: Evaluation with IRM-based speech enhancement

    Authors: Ayako Yamamoto, Toshio Irino, Shoko Araki, Kenichi Arai, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani

    Abstract: It is essential to perform speech intelligibility (SI) experiments with human listeners in order to evaluate objective intelligibility measures for developing effective speech enhancement and noise reduction algorithms. Recently, crowdsourced remote testing has become a popular means for collecting a massive amount and variety of data at a relatively small cost and in a short time. However, carefu…

    Submitted 19 August, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: This paper was submitted to APSIPA ASC 2022 (https://www.apsipa2022.org). The original title [v1] was "Subjective intelligibility of speech sounds enhanced by ideal ratio mask via crowdsourced remote experiments with effective data screening."

    Journal ref: Proc. APSIPA ASC 2022

  11. arXiv:2111.08201  [pdf, other]

    eess.AS cs.CL

    Attention-based Multi-hypothesis Fusion for Speech Summarization

    Authors: Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe

    Abstract: Speech summarization, which generates a text summary from speech, can be achieved by combining automatic speech recognition (ASR) and text summarization (TS). With this cascade approach, we can exploit state-of-the-art models and large training datasets for both subtasks, i.e., Transformer for ASR and Bidirectional Encoder Representations from Transformers (BERT) for TS. However, ASR errors direct…

    Submitted 15 November, 2021; originally announced November 2021.

  12. Comparison of remote experiments using crowdsourcing and laboratory experiments on speech intelligibility

    Authors: Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani

    Abstract: Many subjective experiments have been performed to develop objective speech intelligibility measures, but the novel coronavirus outbreak has made it very difficult to conduct experiments in a laboratory. One solution is to perform remote testing using crowdsourcing; however, because we cannot control the listening conditions, it is unclear whether the results are entirely reliable. In this study,…

    Submitted 16 April, 2021; originally announced April 2021.

    Comments: This paper was submitted to Interspeech 2021

    Journal ref: Proc. Interspeech 2021