
Showing 1–16 of 16 results for author: Takashima, A

  1. arXiv:2306.02273  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    End-to-End Joint Target and Non-Target Speakers ASR

    Authors: Ryo Masumura, Naoki Makishima, Taiga Yamane, Yoshihiko Yamazaki, Saki Mizuno, Mana Ihori, Mihiro Uchida, Keita Suzuki, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando

    Abstract: This paper proposes a novel automatic speech recognition (ASR) system that can transcribe individual speaker's speech while identifying whether they are target or non-target speakers from multi-talker overlapped speech. Target-speaker ASR systems are a promising way to only transcribe a target speaker's speech by enrolling the target speaker's information. However, in conversational ASR applicatio…

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: Accepted at Interspeech 2023

  2. arXiv:2210.15937  [pdf, other]

    cs.CL cs.SD eess.AS

    On the Use of Modality-Specific Large-Scale Pre-Trained Encoders for Multimodal Sentiment Analysis

    Authors: Atsushi Ando, Ryo Masumura, Akihiko Takashima, Satoshi Suzuki, Naoki Makishima, Keita Suzuki, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato

    Abstract: This paper investigates the effectiveness and implementation of modality-specific large-scale pre-trained encoders for multimodal sentiment analysis (MSA). Although the effectiveness of pre-trained encoders in various fields has been reported, conventional MSA methods employ them for only linguistic modality, and their application has not been investigated. This paper compares the features yielded…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: Accepted to SLT 2022

  3. arXiv:2202.09979  [pdf, other]

    cs.CL cs.CV

    Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations

    Authors: Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida, Akihiko Takashima

    Abstract: There have been many attempts to build multimodal dialog systems that can respond to a question about given audio-visual information, and the representative task for such systems is the Audio Visual Scene-Aware Dialog (AVSD). Most conventional AVSD models adopt the Convolutional Neural Network (CNN)-based video feature extractor to understand visual information. While a CNN tends to obtain both te…

    Submitted 20 February, 2022; originally announced February 2022.

    Comments: Accepted at DSTC10 Workshop at AAAI 2022

  4. Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages

    Authors: Shota Orihashi, Yoshihiro Yamazaki, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Ryo Masumura

    Abstract: This paper presents a novel training method for end-to-end scene text recognition. End-to-end scene text recognition offers high recognition accuracy, especially when using the encoder-decoder model based on Transformer. To train a highly accurate end-to-end model, we need to prepare a large image-to-text paired dataset for the target language. However, it is difficult to collect this data, especi…

    Submitted 24 November, 2021; originally announced November 2021.

    Comments: Accepted as a short paper at ACM MMAsia 2021

  5. arXiv:2111.10957  [pdf, ps, other]

    cs.CL cs.LG

    Hierarchical Knowledge Distillation for Dialogue Sequence Labeling

    Authors: Shota Orihashi, Yoshihiro Yamazaki, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Ryo Masumura

    Abstract: This paper presents a novel knowledge distillation method for dialogue sequence labeling. Dialogue sequence labeling is a supervised learning task that estimates labels for each utterance in the target dialogue document, and is useful for many applications such as dialogue act estimation. Accurate labeling is often realized by a hierarchically-structured large model consisting of utterance-level a…

    Submitted 21 November, 2021; originally announced November 2021.

    Comments: Accepted at ASRU 2021

  6. arXiv:2107.05382  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    End-to-End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning

    Authors: Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota Orihashi, Naoki Makishima

    Abstract: We propose a semi-supervised learning method for building end-to-end rich transcription-style automatic speech recognition (RT-ASR) systems from small-scale rich transcription-style and large-scale common transcription-style datasets. In spontaneous speech tasks, various speech phenomena such as fillers, word fragments, laughter and coughs, etc. are often included. While common transcriptions do n…

    Submitted 7 July, 2021; originally announced July 2021.

    Comments: Accepted at Interspeech 2021

  7. arXiv:2107.01569  [pdf, other]

    cs.CL cs.LG

    Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition

    Authors: Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Takanori Ashihara, Shota Orihashi, Naoki Makishima

    Abstract: We propose cross-modal transformer-based neural correction models that refine the output of an automatic speech recognition (ASR) system so as to exclude ASR errors. Generally, neural correction models are composed of encoder-decoder networks, which can directly model sequence-to-sequence mapping problems. The most successful method is to use both input speech and its ASR output text as the inp…

    Submitted 4 July, 2021; originally announced July 2021.

    Comments: Accepted to Interspeech 2021

  8. arXiv:2107.01549  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation

    Authors: Ryo Masumura, Daiki Okamura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi

    Abstract: In this paper, we present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems. Fully neural network based end-to-end models have dramatically improved the performance of multi-talker overlapped ASR tasks. One promising approach for end-to-end modeling is autoregressive modeling with serialized output training in which transcriptions of multi…

    Submitted 4 July, 2021; originally announced July 2021.

    Comments: Accepted at Interspeech 2021

  9. arXiv:2106.12132  [pdf, other]

    cs.SD eess.AS

    Enrollment-less training for personalized voice activity detection

    Authors: Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura

    Abstract: We present a novel personalized voice activity detection (PVAD) learning method that does not require enrollment data during training. PVAD is a task to detect the speech segments of a specific target speaker at the frame level using enrollment speech of the target speaker. Since PVAD must learn speakers' speech variations to clarify the boundary between speakers, studies on PVAD used large-scale…

    Submitted 22 June, 2021; originally announced June 2021.

    Comments: Accepted to INTERSPEECH 2021

  10. arXiv:2106.12131  [pdf, other]

    cs.CL

    Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks using Switching Tokens

    Authors: Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura

    Abstract: In this paper, we propose a novel spoken-text-style conversion method that can simultaneously execute multiple style conversion modules such as punctuation restoration and disfluency deletion without preparing matched datasets. In practice, transcriptions generated by automatic speech recognition systems are not highly readable because they often include many disfluencies and do not include punctu…

    Submitted 22 June, 2021; originally announced June 2021.

    Comments: Accepted at INTERSPEECH 2021

  11. arXiv:2103.01463  [pdf, other]

    cs.SD cs.LG eess.AS

    Audio-Visual Speech Separation Using Cross-Modal Correspondence Loss

    Authors: Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi, Ryo Masumura

    Abstract: We present an audio-visual speech separation learning method that considers the correspondence between the separated signals and the visual signals to reflect the speech characteristics during training. Audio-visual speech separation is a technique to estimate the individual speech signals from a mixture using the visual signals of the speakers. Conventional studies on audio-visual speech separati…

    Submitted 1 March, 2021; originally announced March 2021.

    Comments: Accepted to ICASSP 2021

  12. arXiv:2102.08154  [pdf, ps, other]

    cs.CL cs.LG

    End-to-End Automatic Speech Recognition with Deep Mutual Learning

    Authors: Ryo Masumura, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Takanori Ashihara

    Abstract: This paper is the first study to apply deep mutual learning (DML) to end-to-end ASR models. In DML, multiple models are trained simultaneously and collaboratively by mimicking each other throughout the training process, which helps to attain the global optimum and prevent models from making over-confident predictions. While previous studies applied DML to simple multi-class classification problems…

    Submitted 16 February, 2021; originally announced February 2021.

    Comments: Accepted at Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2020, pp.632-637

  13. arXiv:2102.08147  [pdf, ps, other]

    cs.CL cs.LG

    Large-Context Conversational Representation Learning: Self-Supervised Learning for Conversational Documents

    Authors: Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi

    Abstract: This paper presents a novel self-supervised learning method for handling conversational documents consisting of transcribed text of human-to-human conversations. One of the key technologies for understanding conversational documents is utterance-level sequential labeling, where labels are estimated from the documents in an utterance-by-utterance manner. The main issue with utterance-level sequenti…

    Submitted 16 February, 2021; originally announced February 2021.

    Comments: Accepted at IEEE Spoken Language Technology Workshop (SLT), 2021, pp.1012-1019

  14. arXiv:2102.07935  [pdf, ps, other]

    cs.CL cs.LG

    Hierarchical Transformer-based Large-Context End-to-end ASR with Large-Context Knowledge Distillation

    Authors: Ryo Masumura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi

    Abstract: We present a novel large-context end-to-end automatic speech recognition (E2E-ASR) model and its effective training method based on knowledge distillation. Common E2E-ASR models have mainly focused on utterance-level processing in which each utterance is independently transcribed. On the other hand, large-context E2E-ASR models, which take into account long-range sequential contexts beyond utteran…

    Submitted 15 February, 2021; originally announced February 2021.

    Comments: Accepted at ICASSP 2021

  15. arXiv:2102.07380  [pdf, other]

    cs.CL

    MAPGN: MAsked Pointer-Generator Network for sequence-to-sequence pre-training

    Authors: Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura

    Abstract: This paper presents a self-supervised learning method for pointer-generator networks to improve spoken-text normalization. Spoken-text normalization that converts spoken-style text into style normalized text is becoming an important technology for improving subsequent processing such as machine translation and summarization. The most successful spoken-text normalization method to date is sequence-…

    Submitted 15 February, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

    Comments: Accepted at ICASSP 2021

  16. arXiv:2010.15437  [pdf, other]

    cs.CL

    Memory Attentive Fusion: External Language Model Integration for Transformer-based Sequence-to-Sequence Model

    Authors: Mana Ihori, Ryo Masumura, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi

    Abstract: This paper presents a novel fusion method for integrating an external language model (LM) into the Transformer based sequence-to-sequence (seq2seq) model. While paired data are basically required to train the seq2seq model, the external LM can be trained with only unpaired data. Thus, it is important to leverage memorized knowledge in the external LM for building the seq2seq model, since it is har…

    Submitted 29 October, 2020; originally announced October 2020.

    Comments: Accepted as a short paper at INLG 2020