Showing 1–10 of 10 results for author: Braga, O

  1. arXiv:2312.10088  [pdf, ps, other]

    eess.AS cs.CV cs.LG cs.SD

    On Robustness to Missing Video for Audiovisual Speech Recognition

    Authors: Oscar Chang, Otavio Braga, Hank Liao, Dmitriy Serdyuk, Olivier Siohan

    Abstract: It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missing, e.g. the speaker might move off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovi…

    Submitted 18 December, 2023; v1 submitted 13 December, 2023; originally announced December 2023.

  2. arXiv:2312.09369  [pdf, other]

    cs.SD cs.AI eess.AS

    Audio-visual fine-tuning of audio-only ASR models

    Authors: Avner May, Dmitriy Serdyuk, Ankit Parag Shah, Otavio Braga, Olivier Siohan

    Abstract: Audio-visual automatic speech recognition (AV-ASR) models are very effective at reducing word error rates on noisy speech, but require large amounts of transcribed AV training data. Recently, audio-visual self-supervised learning (SSL) approaches have been developed to reduce this dependence on transcribed AV data, but these methods are quite complex and computationally expensive. In this work, we…

    Submitted 14 December, 2023; originally announced December 2023.

  3. arXiv:2302.00835  [pdf, other]

    math.CO

    Diminimal families of arbitrary diameter

    Authors: Luiz Emilio Allem, Rodrigo Orsini Braga, Carlos Hoppen, Elismar da Rosa Oliveira, Lucas Siviero Sibemberg, Vilmar Trevisan

    Abstract: Given a tree $T$, let $q(T)$ be the minimum number of distinct eigenvalues in a symmetric matrix whose underlying graph is $T$. It is well known that $q(T)\geq d(T)+1$, where $d(T)$ is the diameter of $T$, and a tree $T$ is said to be diminimal if $q(T)=d(T)+1$. In this paper, we present families of diminimal trees of any fixed diameter. Our proof is constructive, allowing us to compute, for any d…

    Submitted 1 February, 2023; originally announced February 2023.

    Comments: 29 pages, 10 figures

  4. arXiv:2205.05684  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD

    A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection

    Authors: Otavio Braga, Olivier Siohan

    Abstract: Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, up until recently it had been traditionally studied in isolation assuming the video of a single speaking face matches the audio, and selecting the active speaker at inference time when multiple people are on screen was put aside as a separate problem. As an alternative, recent work has…

    Submitted 11 May, 2022; originally announced May 2022.

    Comments: arXiv admin note: text overlap with arXiv:2205.05586

  5. arXiv:2205.05586  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD

    End-to-End Multi-Person Audio/Visual Automatic Speech Recognition

    Authors: Otavio Braga, Takaki Makino, Olivier Siohan, Hank Liao

    Abstract: Traditionally, audio-visual automatic speech recognition has been studied under the assumption that the speaking face on the visual signal is the face matching the audio. However, in a more realistic setting, when multiple faces are potentially on screen one needs to decide which face to feed to the A/V ASR system. The present work takes the recent progress of A/V ASR one step further and consider…

    Submitted 11 May, 2022; originally announced May 2022.

  6. arXiv:2205.05206  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD

    Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection

    Authors: Otavio Braga, Olivier Siohan

    Abstract: Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible this traditionally requires solving a separate problem, namely active speaker detection (ASD), which entails selecting at each moment in time which of the visible faces corresponds to the…

    Submitted 10 May, 2022; originally announced May 2022.

  7. arXiv:2201.10439  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video

    Authors: Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

    Abstract: Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g. 3D version of VGG). Recently, image trans…

    Submitted 31 October, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

    Comments: 5 pages, 3 figures, published at Interspeech 2022

  8. arXiv:2109.09536  [pdf, other]

    cs.CV cs.LG

    Audio-Visual Speech Recognition is Worth 32$\times$32$\times$8 Voxels

    Authors: Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

    Abstract: Audio-visual automatic speech recognition (AV-ASR) introduces the video modality into the speech recognition process, often by relying on information conveyed by the motion of the speaker's mouth. The use of the video signal requires extracting visual features, which are then combined with the acoustic features to build an AV-ASR system [1]. This is traditionally done with some form of 3D convolut…

    Submitted 20 September, 2021; originally announced September 2021.

    Comments: 7 pages, 2 figures, 4 tables. A draft for a paper accepted to ASRU workshop

  9. arXiv:2010.01468  [pdf, ps, other]

    math.CO math.SP

    Graphs with at most two nonzero distinct absolute eigenvalues

    Authors: N. E. Arévalo, R. O. Braga, V. M. Rodrigues

    Abstract: In his survey "Beyond graph energy: Norms of graphs and matrices" (2016), Nikiforov proposed two problems concerning characterizing the graphs that attain equality in a lower bound and in an upper bound for the energy of a graph, respectively. We show that these graphs have at most two nonzero distinct absolute eigenvalues and investigate the proposed problems organizing our study according to the…

    Submitted 3 October, 2020; originally announced October 2020.

    Comments: 17 pages, 5 figures, submitted

    MSC Class: 05C50 (Primary) 15A18; 15A60 (Secondary)

  10. arXiv:1911.04890  [pdf, other]

    eess.AS cs.CL cs.CV cs.LG cs.SD

    Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

    Authors: Takaki Makino, Hank Liao, Yannis Assael, Brendan Shillingford, Basilio Garcia, Otavio Braga, Olivier Siohan

    Abstract: This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from YouTube public videos, leading to 31k hours of audio-visual training content. The performance of an audio-only, visual-only, and au…

    Submitted 8 November, 2019; originally announced November 2019.

    Comments: Will be presented in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)