Showing 1–23 of 23 results for author: Masumura, R

  1. arXiv:2406.18910  [pdf, other]

    cs.CL cs.SD eess.AS

    Factor-Conditioned Speaking-Style Captioning

    Authors: Atsushi Ando, Takafumi Moriya, Shota Horiguchi, Ryo Masumura

    Abstract: This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria directly use original captions that contain not only speaking-style factor terms but also syntax words, which disturbs learning speaking-style information. To solve this problem, we introduce factor-conditioned capti…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  2. arXiv:2308.16454  [pdf, other]

    cs.CV cs.LG

    Adversarial Finetuning with Latent Representation Constraint to Mitigate Accuracy-Robustness Tradeoff

    Authors: Satoshi Suzuki, Shin'ya Yamaguchi, Shoichiro Takeda, Sekitoshi Kanai, Naoki Makishima, Atsushi Ando, Ryo Masumura

    Abstract: This paper addresses the tradeoff between standard accuracy on clean examples and robustness against adversarial examples in deep neural networks (DNNs). Although adversarial training (AT) improves robustness, it degrades the standard accuracy, thus yielding the tradeoff. To mitigate this tradeoff, we propose a novel AT method called ARREST, which comprises three components: (i) adversarial finetu…

    Submitted 31 August, 2023; originally announced August 2023.

    Comments: Accepted by International Conference on Computer Vision (ICCV) 2023

  3. arXiv:2306.02273  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    End-to-End Joint Target and Non-Target Speakers ASR

    Authors: Ryo Masumura, Naoki Makishima, Taiga Yamane, Yoshihiko Yamazaki, Saki Mizuno, Mana Ihori, Mihiro Uchida, Keita Suzuki, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando

    Abstract: This paper proposes a novel automatic speech recognition (ASR) system that can transcribe individual speakers' speech while identifying whether they are target or non-target speakers from multi-talker overlapped speech. Target-speaker ASR systems are a promising way to only transcribe a target speaker's speech by enrolling the target speaker's information. However, in conversational ASR applicatio…

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: Accepted at Interspeech 2023

  4. arXiv:2305.14723  [pdf, other]

    eess.AS cs.SD

    Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss

    Authors: Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo

    Abstract: Self-supervised learning (SSL) is the latest breakthrough in speech processing, especially for label-scarce downstream tasks by leveraging massive unlabeled audio data. The noise robustness of the SSL is one of the important challenges to expanding its application. We can use speech enhancement (SE) to tackle this issue. However, the mismatch between the SE model and SSL models potentially limits…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: 4 pages, 2 figures, Accepted to Interspeech 2023

  5. arXiv:2303.00978  [pdf, other]

    cs.CL eess.AS

    Leveraging Large Text Corpora for End-to-End Speech Summarization

    Authors: Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura

    Abstract: End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech. Compared with the cascade approach, which combines automatic speech recognition (ASR) and text summarization models, the E2E approach is more promising because it mitigates ASR errors, incorporates nonverbal information, and simplifies the overall system. However, since collecting a large…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023

  6. arXiv:2210.15937  [pdf, other]

    cs.CL cs.SD eess.AS

    On the Use of Modality-Specific Large-Scale Pre-Trained Encoders for Multimodal Sentiment Analysis

    Authors: Atsushi Ando, Ryo Masumura, Akihiko Takashima, Satoshi Suzuki, Naoki Makishima, Keita Suzuki, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato

    Abstract: This paper investigates the effectiveness and implementation of modality-specific large-scale pre-trained encoders for multimodal sentiment analysis (MSA). Although the effectiveness of pre-trained encoders in various fields has been reported, conventional MSA methods employ them for only linguistic modality, and their application has not been investigated. This paper compares the features yielded…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: Accepted to SLT 2022

  7. arXiv:2207.04659  [pdf, other]

    cs.SD eess.AS

    Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data

    Authors: Naoki Makishima, Satoshi Suzuki, Atsushi Ando, Ryo Masumura

    Abstract: In this paper, we investigate the semi-supervised joint training of text to speech (TTS) and automatic speech recognition (ASR), where a small amount of paired data and a large amount of unpaired text data are available. Conventional studies form a cycle called the TTS-ASR pipeline, where the multispeaker TTS model synthesizes speech from text with a reference speech and the ASR model reconstructs…

    Submitted 11 July, 2022; originally announced July 2022.

    Comments: Accepted to INTERSPEECH 2022

  8. arXiv:2206.08174  [pdf, other]

    eess.AS cs.SD eess.SP

    Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations

    Authors: Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura

    Abstract: Target speech extraction is a technique to extract the target speaker's voice from mixture signals using a pre-recorded enrollment utterance that characterizes the voice characteristics of the target speaker. One major difficulty of target speech extraction lies in handling variability in "intra-speaker" characteristics, i.e., characteristics mismatch between target speech and an enrollment utter…

    Submitted 16 June, 2022; originally announced June 2022.

    Comments: 5 pages, 2 figures, 3 tables. Submitted to Interspeech 2022

  9. arXiv:2202.09979  [pdf, other]

    cs.CL cs.CV

    Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations

    Authors: Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida, Akihiko Takashima

    Abstract: There have been many attempts to build multimodal dialog systems that can respond to a question about given audio-visual information, and the representative task for such systems is the Audio Visual Scene-Aware Dialog (AVSD). Most conventional AVSD models adopt the Convolutional Neural Network (CNN)-based video feature extractor to understand visual information. While a CNN tends to obtain both te…

    Submitted 20 February, 2022; originally announced February 2022.

    Comments: Accepted at DSTC10 Workshop at AAAI 2022

  10. Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages

    Authors: Shota Orihashi, Yoshihiro Yamazaki, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Ryo Masumura

    Abstract: This paper presents a novel training method for end-to-end scene text recognition. End-to-end scene text recognition offers high recognition accuracy, especially when using the encoder-decoder model based on Transformer. To train a highly accurate end-to-end model, we need to prepare a large image-to-text paired dataset for the target language. However, it is difficult to collect this data, especi…

    Submitted 24 November, 2021; originally announced November 2021.

    Comments: Accepted as a short paper at ACM MMAsia 2021

  11. arXiv:2111.10957  [pdf, ps, other]

    cs.CL cs.LG

    Hierarchical Knowledge Distillation for Dialogue Sequence Labeling

    Authors: Shota Orihashi, Yoshihiro Yamazaki, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Ryo Masumura

    Abstract: This paper presents a novel knowledge distillation method for dialogue sequence labeling. Dialogue sequence labeling is a supervised learning task that estimates labels for each utterance in the target dialogue document, and is useful for many applications such as dialogue act estimation. Accurate labeling is often realized by a hierarchically-structured large model consisting of utterance-level a…

    Submitted 21 November, 2021; originally announced November 2021.

    Comments: Accepted at ASRU 2021

  12. arXiv:2107.05382  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    End-to-End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning

    Authors: Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota Orihashi, Naoki Makishima

    Abstract: We propose a semi-supervised learning method for building end-to-end rich transcription-style automatic speech recognition (RT-ASR) systems from small-scale rich transcription-style and large-scale common transcription-style datasets. In spontaneous speech tasks, various speech phenomena such as fillers, word fragments, laughter and coughs, etc. are often included. While common transcriptions do n…

    Submitted 7 July, 2021; originally announced July 2021.

    Comments: Accepted at Interspeech 2021

  13. arXiv:2107.01569  [pdf, other]

    cs.CL cs.LG

    Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition

    Authors: Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Takanori Ashihara, Shota Orihashi, Naoki Makishima

    Abstract: We propose a cross-modal transformer-based neural correction model that refines the output of an automatic speech recognition (ASR) system so as to exclude ASR errors. Generally, neural correction models are composed of encoder-decoder networks, which can directly model sequence-to-sequence mapping problems. The most successful method is to use both input speech and its ASR output text as the inp…

    Submitted 4 July, 2021; originally announced July 2021.

    Comments: Accepted to Interspeech 2021

  14. arXiv:2107.01549  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation

    Authors: Ryo Masumura, Daiki Okamura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi

    Abstract: In this paper, we present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems. Fully neural network based end-to-end models have dramatically improved the performance of multi-talker overlapped ASR tasks. One promising approach for end-to-end modeling is autoregressive modeling with serialized output training in which transcriptions of multi…

    Submitted 4 July, 2021; originally announced July 2021.

    Comments: Accepted at Interspeech 2021

  15. arXiv:2106.12132  [pdf, other]

    cs.SD eess.AS

    Enrollment-less training for personalized voice activity detection

    Authors: Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura

    Abstract: We present a novel personalized voice activity detection (PVAD) learning method that does not require enrollment data during training. PVAD is a task to detect the speech segments of a specific target speaker at the frame level using enrollment speech of the target speaker. Since PVAD must learn speakers' speech variations to clarify the boundary between speakers, studies on PVAD used large-scale…

    Submitted 22 June, 2021; originally announced June 2021.

    Comments: Accepted to INTERSPEECH 2021

  16. arXiv:2106.12131  [pdf, other]

    cs.CL

    Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks using Switching Tokens

    Authors: Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura

    Abstract: In this paper, we propose a novel spoken-text-style conversion method that can simultaneously execute multiple style conversion modules such as punctuation restoration and disfluency deletion without preparing matched datasets. In practice, transcriptions generated by automatic speech recognition systems are not highly readable because they often include many disfluencies and do not include punctu…

    Submitted 22 June, 2021; originally announced June 2021.

    Comments: Accepted at INTERSPEECH 2021

  17. arXiv:2103.01463  [pdf, other]

    cs.SD cs.LG eess.AS

    Audio-Visual Speech Separation Using Cross-Modal Correspondence Loss

    Authors: Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi, Ryo Masumura

    Abstract: We present an audio-visual speech separation learning method that considers the correspondence between the separated signals and the visual signals to reflect the speech characteristics during training. Audio-visual speech separation is a technique to estimate the individual speech signals from a mixture using the visual signals of the speakers. Conventional studies on audio-visual speech separati…

    Submitted 1 March, 2021; originally announced March 2021.

    Comments: Accepted to ICASSP 2021

  18. arXiv:2102.08154  [pdf, ps, other]

    cs.CL cs.LG

    End-to-End Automatic Speech Recognition with Deep Mutual Learning

    Authors: Ryo Masumura, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Takanori Ashihara

    Abstract: This paper is the first study to apply deep mutual learning (DML) to end-to-end ASR models. In DML, multiple models are trained simultaneously and collaboratively by mimicking each other throughout the training process, which helps to attain the global optimum and prevent models from making over-confident predictions. While previous studies applied DML to simple multi-class classification problems…

    Submitted 16 February, 2021; originally announced February 2021.

    Comments: Accepted at Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2020, pp.632-637

  19. arXiv:2102.08147  [pdf, ps, other]

    cs.CL cs.LG

    Large-Context Conversational Representation Learning: Self-Supervised Learning for Conversational Documents

    Authors: Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi

    Abstract: This paper presents a novel self-supervised learning method for handling conversational documents consisting of transcribed text of human-to-human conversations. One of the key technologies for understanding conversational documents is utterance-level sequential labeling, where labels are estimated from the documents in an utterance-by-utterance manner. The main issue with utterance-level sequenti…

    Submitted 16 February, 2021; originally announced February 2021.

    Comments: Accepted at IEEE Spoken Language Technology Workshop (SLT), 2021, pp.1012-1019

  20. arXiv:2102.07935  [pdf, ps, other]

    cs.CL cs.LG

    Hierarchical Transformer-based Large-Context End-to-end ASR with Large-Context Knowledge Distillation

    Authors: Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi

    Abstract: We present a novel large-context end-to-end automatic speech recognition (E2E-ASR) model and its effective training method based on knowledge distillation. Common E2E-ASR models have mainly focused on utterance-level processing in which each utterance is independently transcribed. On the other hand, large-context E2E-ASR models, which take into account long-range sequential contexts beyond utteran…

    Submitted 15 February, 2021; originally announced February 2021.

    Comments: Accepted at ICASSP 2021

  21. arXiv:2102.07380  [pdf, other]

    cs.CL

    MAPGN: MAsked Pointer-Generator Network for sequence-to-sequence pre-training

    Authors: Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura

    Abstract: This paper presents a self-supervised learning method for pointer-generator networks to improve spoken-text normalization. Spoken-text normalization that converts spoken-style text into style normalized text is becoming an important technology for improving subsequent processing such as machine translation and summarization. The most successful spoken-text normalization method to date is sequence-…

    Submitted 15 February, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

    Comments: Accepted at ICASSP 2021

  22. arXiv:2010.15437  [pdf, other]

    cs.CL

    Memory Attentive Fusion: External Language Model Integration for Transformer-based Sequence-to-Sequence Model

    Authors: Mana Ihori, Ryo Masumura, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi

    Abstract: This paper presents a novel fusion method for integrating an external language model (LM) into the Transformer based sequence-to-sequence (seq2seq) model. While paired data are basically required to train the seq2seq model, the external LM can be trained with only unpaired data. Thus, it is important to leverage memorized knowledge in the external LM for building the seq2seq model, since it is har…

    Submitted 29 October, 2020; originally announced October 2020.

    Comments: Accepted as a short paper at INLG 2020

  23. arXiv:2007.00222  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    A Transformer-based Audio Captioning Model with Keyword Estimation

    Authors: Yuma Koizumi, Ryo Masumura, Kyosuke Nishida, Masahiro Yasuda, Shoichiro Saito

    Abstract: One of the problems with automated audio captioning (AAC) is the indeterminacy in word selection corresponding to the audio event/scene. Since one acoustic event/scene can be described with several words, it results in a combinatorial explosion of possible captions and difficulty in training. To solve this problem, we propose a Transformer-based audio-captioning model with keyword estimation calle…

    Submitted 8 August, 2020; v1 submitted 1 July, 2020; originally announced July 2020.

    Comments: Accepted to Interspeech 2020