
Showing 1–13 of 13 results for author: Siohan, O

  1. arXiv:2312.10088  [pdf, ps, other]

    eess.AS cs.CV cs.LG cs.SD

    On Robustness to Missing Video for Audiovisual Speech Recognition

    Authors: Oscar Chang, Otavio Braga, Hank Liao, Dmitriy Serdyuk, Olivier Siohan

    Abstract: It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missing, e.g. the speaker might move off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovi…

    Submitted 18 December, 2023; v1 submitted 13 December, 2023; originally announced December 2023.

  2. arXiv:2312.10087  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Revisiting the Entropy Semiring for Neural Speech Recognition

    Authors: Oscar Chang, Dongseong Hwang, Olivier Siohan

    Abstract: In streaming settings, speech recognition models have to map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need to learn it in a completely self-supervised way. In practice, the exponential number of possible alignments makes this extremely challenging, with mo…

    Submitted 18 December, 2023; v1 submitted 13 December, 2023; originally announced December 2023.

  3. arXiv:2312.09369  [pdf, other]

    cs.SD cs.AI eess.AS

    Audio-visual fine-tuning of audio-only ASR models

    Authors: Avner May, Dmitriy Serdyuk, Ankit Parag Shah, Otavio Braga, Olivier Siohan

    Abstract: Audio-visual automatic speech recognition (AV-ASR) models are very effective at reducing word error rates on noisy speech, but require large amounts of transcribed AV training data. Recently, audio-visual self-supervised learning (SSL) approaches have been developed to reduce this dependence on transcribed AV data, but these methods are quite complex and computationally expensive. In this work, we…

    Submitted 14 December, 2023; originally announced December 2023.

  4. arXiv:2306.16398  [pdf, other]

    cs.SD eess.AS

    Cascaded encoders for fine-tuning ASR models on overlapped speech

    Authors: Richard Rose, Oscar Chang, Olivier Siohan

    Abstract: Multi-talker speech recognition (MT-ASR) has been shown to improve ASR performance on speech containing overlapping utterances from more than one speaker. Multi-talker models have typically been trained from scratch using simulated or actual overlapping speech datasets. On the other hand, the trend in ASR has been to train foundation models using massive datasets collected from a wide variety of t…

    Submitted 28 June, 2023; originally announced June 2023.

  5. arXiv:2302.10915  [pdf, other]

    cs.LG cs.CL cs.CV cs.SD eess.AS

    Conformers are All You Need for Visual Speech Recognition

    Authors: Oscar Chang, Hank Liao, Dmitriy Serdyuk, Ankit Shah, Olivier Siohan

    Abstract: Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, there is a visual front-end with a limited temporal receptive field that processes the raw pixels depicting the lips or faces. At the higher level, there is an encoder that attends to the embeddings produced by the front-end over a large temporal receptive field. Previous work has focused on impr…

    Submitted 12 December, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

  6. arXiv:2205.05684  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD

    A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection

    Authors: Otavio Braga, Olivier Siohan

    Abstract: Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, up until recently it had been traditionally studied in isolation assuming the video of a single speaking face matches the audio, and selecting the active speaker at inference time when multiple people are on screen was put aside as a separate problem. As an alternative, recent work has…

    Submitted 11 May, 2022; originally announced May 2022.

    Comments: arXiv admin note: text overlap with arXiv:2205.05586

  7. arXiv:2205.05586  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD

    End-to-End Multi-Person Audio/Visual Automatic Speech Recognition

    Authors: Otavio Braga, Takaki Makino, Olivier Siohan, Hank Liao

    Abstract: Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face on the visual signal is the face matching the audio. However, in a more realistic setting, when multiple faces are potentially on screen one needs to decide which face to feed to the A/V ASR system. The present work takes the recent progress of A/V ASR one step further and consider…

    Submitted 11 May, 2022; originally announced May 2022.

  8. arXiv:2205.05206  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD

    Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection

    Authors: Otavio Braga, Olivier Siohan

    Abstract: Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible this traditionally requires solving a separate problem, namely active speaker detection (ASD), which entails selecting at each moment in time which of the visible faces corresponds to the…

    Submitted 10 May, 2022; originally announced May 2022.

  9. arXiv:2204.00652  [pdf, other]

    cs.SD cs.CL eess.AS

    End-to-end multi-talker audio-visual ASR using an active speaker attention module

    Authors: Richard Rose, Olivier Siohan

    Abstract: This paper presents a new approach for end-to-end audio-visual multi-talker speech recognition. The approach, referred to here as the visual context attention model (VCAM), is important because it uses the available video information to assign decoded text to one of multiple visible faces. This essentially resolves the label ambiguity issue associated with most multi-talker modeling approaches whi…

    Submitted 1 April, 2022; originally announced April 2022.

    Comments: 5 pages, 3 figures, 3 tables, 28 citations

  10. arXiv:2201.10439  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video

    Authors: Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

    Abstract: Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g. 3D version of VGG). Recently, image trans…

    Submitted 31 October, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

    Comments: 5 pages, 3 figures, published at Interspeech 2022

  11. arXiv:2109.09536  [pdf, other]

    cs.CV cs.LG

    Audio-Visual Speech Recognition is Worth 32$\times$32$\times$8 Voxels

    Authors: Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

    Abstract: Audio-visual automatic speech recognition (AV-ASR) introduces the video modality into the speech recognition process, often by relying on information conveyed by the motion of the speaker's mouth. The use of the video signal requires extracting visual features, which are then combined with the acoustic features to build an AV-ASR system [1]. This is traditionally done with some form of 3D convolut…

    Submitted 20 September, 2021; originally announced September 2021.

    Comments: 7 pages, 2 figures, 4 tables. A draft for a paper accepted to ASRU workshop

  12. arXiv:2104.14346  [pdf, other]

    cs.CL cs.SD eess.AS

    Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models

    Authors: Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao

    Abstract: Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real-time. Their minimal latency makes them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal with no future context and suffer from higher word error rates (WER). To improve streaming mode…

    Submitted 25 April, 2021; originally announced April 2021.

  13. arXiv:1911.04890  [pdf, other]

    eess.AS cs.CL cs.CV cs.LG cs.SD

    Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

    Authors: Takaki Makino, Hank Liao, Yannis Assael, Brendan Shillingford, Basilio Garcia, Otavio Braga, Olivier Siohan

    Abstract: This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from YouTube public videos, leading to 31k hours of audio-visual training content. The performance of an audio-only, visual-only, and au…

    Submitted 8 November, 2019; originally announced November 2019.

    Comments: Will be presented in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)