Search | arXiv e-print repository

On Robustness to Missing Video for Audiovisual Speech Recognition

Authors: Oscar Chang, Otavio Braga, Hank Liao, Dmitriy Serdyuk, Olivier Siohan

Abstract: It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missing, e.g. the speaker might move off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovisual model to be worse than that of a single-modality audio-only model. While there have been many attempts at building robust models, there is little consensus on how robustness should be evaluated. To address this, we introduce a framework that allows claims about robustness to be evaluated in a precise and testable way. We also conduct a systematic empirical study of the robustness of common audiovisual speech recognition architectures on a range of acoustic noise conditions and test suites. Finally, we show that an architecture-agnostic solution based on cascades can consistently achieve robustness to missing video, even in settings where existing techniques for robustness like dropout fall short.

Submitted 18 December, 2023; v1 submitted 13 December, 2023; originally announced December 2023.

arXiv:2312.09369 [pdf, other]

Audio-visual fine-tuning of audio-only ASR models

Authors: Avner May, Dmitriy Serdyuk, Ankit Parag Shah, Otavio Braga, Olivier Siohan

Abstract: Audio-visual automatic speech recognition (AV-ASR) models are very effective at reducing word error rates on noisy speech, but require large amounts of transcribed AV training data. Recently, audio-visual self-supervised learning (SSL) approaches have been developed to reduce this dependence on transcribed AV data, but these methods are quite complex and computationally expensive. In this work, we propose replacing these expensive AV-SSL methods with a simple and fast audio-only SSL method, and then performing AV supervised fine-tuning. We show that this approach is competitive with state-of-the-art (SOTA) AV-SSL methods on the LRS3-TED benchmark task (within 0.5% absolute WER), while being dramatically simpler and more efficient (12-30x faster to pre-train). Furthermore, we show we can extend this approach to convert a SOTA audio-only ASR model into an AV model. By doing so, we match SOTA AV-SSL results, even though no AV data was used during pre-training.

Submitted 14 December, 2023; originally announced December 2023.

arXiv:2302.00835 [pdf, other]

Diminimal families of arbitrary diameter

Authors: Luiz Emilio Allem, Rodrigo Orsini Braga, Carlos Hoppen, Elismar da Rosa Oliveira, Lucas Siviero Sibemberg, Vilmar Trevisan

Abstract: Given a tree $T$, let $q(T)$ be the minimum number of distinct eigenvalues in a symmetric matrix whose underlying graph is $T$. It is well known that $q(T)\geq d(T)+1$, where $d(T)$ is the diameter of $T$, and a tree $T$ is said to be diminimal if $q(T)=d(T)+1$. In this paper, we present families of diminimal trees of any fixed diameter. Our proof is constructive, allowing us to compute, for any diminimal tree $T$ of diameter $d$ in these families, a symmetric matrix $M$ with underlying graph $T$ whose spectrum has exactly $d+1$ distinct eigenvalues.

Submitted 1 February, 2023; originally announced February 2023.

Comments: 29 pages, 10 figures

arXiv:2205.05684 [pdf, other]

A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection

Authors: Otavio Braga, Olivier Siohan

Abstract: Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, up until recently it had been traditionally studied in isolation assuming the video of a single speaking face matches the audio, and selecting the active speaker at inference time when multiple people are on screen was put aside as a separate problem. As an alternative, recent work has proposed to address the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model. One interesting finding was that the attention indirectly learns the association between the audio and the speaking face even though this correspondence is never explicitly provided at training time. In the present work we further investigate this connection and examine the interplay between the two problems. With experiments involving over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task. Secondly, we show under closer scrutiny that an end-to-end model performs at least as well as a considerably larger two-step system that utilizes a hard decision boundary under various noise conditions and number of parallel face tracks.

Submitted 11 May, 2022; originally announced May 2022.

Comments: arXiv admin note: text overlap with arXiv:2205.05586

arXiv:2205.05586 [pdf, other]

End-to-End Multi-Person Audio/Visual Automatic Speech Recognition

Authors: Otavio Braga, Takaki Makino, Olivier Siohan, Hank Liao

Abstract: Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face on the visual signal is the face matching the audio. However, in a more realistic setting, when multiple faces are potentially on screen one needs to decide which face to feed to the A/V ASR system. The present work takes the recent progress of A/V ASR one step further and considers the scenario where multiple people are simultaneously on screen (multi-person A/V ASR). We propose a fully differentiable A/V ASR model that is able to handle multiple face tracks in a video. Instead of relying on two separate models for speaker face selection and audio-visual ASR on a single face track, we introduce an attention layer to the ASR encoder that is able to soft-select the appropriate face video track. Experiments carried out on an A/V system trained on over 30k hours of YouTube videos illustrate that the proposed approach can automatically select the proper face tracks with minor WER degradation compared to an oracle selection of the speaking face while still showing benefits of employing the visual signal instead of the audio alone.

Submitted 11 May, 2022; originally announced May 2022.

arXiv:2205.05206 [pdf, other]

Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection

Authors: Otavio Braga, Olivier Siohan

Abstract: Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible this traditionally requires solving a separate problem, namely active speaker detection (ASD), which entails selecting at each moment in time which of the visible faces corresponds to the audio. Recent work has shown that we can solve both problems simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces, at the cost of sacrificing some accuracy on active speaker detection. This work closes this gap in active speaker detection accuracy by presenting a single model that can be jointly trained with a multi-task loss. By combining the two tasks during training we reduce the ASD classification error by approximately 25%, while simultaneously improving the ASR performance when compared to the multi-person baseline trained exclusively for ASR.

Submitted 10 May, 2022; originally announced May 2022.

arXiv:2201.10439 [pdf, other]

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video

Authors: Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

Abstract: Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g. a 3D version of VGG). Recently, image transformer networks (arXiv:2010.11929) demonstrated the ability to extract rich visual features for image classification tasks. Here, we propose to replace the 3D convolution with a video transformer to extract visual features. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. The performance of our approach is evaluated on a labeled subset of YouTube videos as well as on the LRS3-TED public corpus. Our best video-only model obtains 31.4% WER on YTDEV18 and 17.0% on LRS3-TED, 10% and 15% relative improvements, respectively, over our convolutional baseline. We achieve state-of-the-art audio-visual recognition performance on LRS3-TED after fine-tuning our model (1.6% WER). In addition, in a series of experiments on multi-person AV-ASR, we obtained an average relative reduction of 2% over our convolutional video front-end.

Submitted 31 October, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

Comments: 5 pages, 3 figures, published at Interspeech 2022

arXiv:2109.09536 [pdf, other]

Audio-Visual Speech Recognition is Worth 32$\times$32$\times$8 Voxels

Authors: Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

Abstract: Audio-visual automatic speech recognition (AV-ASR) introduces the video modality into the speech recognition process, often by relying on information conveyed by the motion of the speaker's mouth. The use of the video signal requires extracting visual features, which are then combined with the acoustic features to build an AV-ASR system [1]. This is traditionally done with some form of 3D convolutional network (e.g. VGG) as widely used in the computer vision community. Recently, image transformers [2] have been introduced to extract visual features useful for image classification tasks. In this work, we propose to replace the 3D convolutional visual front-end with a video transformer front-end. We train our systems on a large-scale dataset composed of YouTube videos and evaluate performance on the publicly available LRS3-TED set, as well as on a large set of YouTube videos. On a lip-reading task, the transformer-based front-end shows superior performance compared to a strong convolutional baseline. On an AV-ASR task, the transformer front-end performs as well as (or better than) the convolutional baseline. Fine-tuning our model on the LRS3-TED training set matches previous state of the art. Thus, we experimentally show the viability of the convolution-free model for AV-ASR.

Submitted 20 September, 2021; originally announced September 2021.

Comments: 7 pages, 2 figures, 4 tables. A draft for a paper accepted to ASRU workshop

arXiv:2010.01468 [pdf, ps, other]

Graphs with at most two nonzero distinct absolute eigenvalues

Authors: N. E. Arévalo, R. O. Braga, V. M. Rodrigues

Abstract: In his survey "Beyond graph energy: Norms of graphs and matrices" (2016), Nikiforov proposed two problems concerning characterizing the graphs that attain equality in a lower bound and in an upper bound for the energy of a graph, respectively. We show that these graphs have at most two nonzero distinct absolute eigenvalues and investigate the proposed problems, organizing our study according to the type of spectrum they can have. In most cases all graphs are characterized. Infinite families of graphs are given otherwise. We also show that all graphs satisfying the properties required in the problems are integral, except for complete bipartite graphs $K_{p,q}$ and disconnected graphs with a connected component $K_{p,q}$, where $pq$ is not a perfect square.

Submitted 3 October, 2020; originally announced October 2020.

Comments: 17 pages, 5 figures, submitted

MSC Class: 05C50 (Primary) 15A18; 15A60 (Secondary)

arXiv:1911.04890 [pdf, other]

Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

Authors: Takaki Makino, Hank Liao, Yannis Assael, Brendan Shillingford, Basilio Garcia, Otavio Braga, Olivier Siohan

Abstract: This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from YouTube public videos, leading to 31k hours of audio-visual training content. The performance of an audio-only, visual-only, and audio-visual system are compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state-of-the-art on the LRS3-TED set.

Submitted 8 November, 2019; originally announced November 2019.

Comments: Will be presented in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)

Showing 1–10 of 10 results for author: Braga, O