Showing 1–24 of 24 results for author: Petridis, S

Searching in archive eess.

  1. arXiv:2406.18373  [pdf, other]

    cs.CL cs.SD eess.AS

    Dynamic Data Pruning for Automatic Speech Recognition

    Authors: Qiao Xiao, Pingchuan Ma, Adriana Fernandez-Lopez, Boqian Wu, Lu Yin, Stavros Petridis, Mykola Pechenizkiy, Maja Pantic, Decebal Constantin Mocanu, Shiwei Liu

    Abstract: The recent success of Automatic Speech Recognition (ASR) is largely attributed to the ever-growing amount of training data. However, this trend has made model training prohibitively costly and imposed computational demands. While data pruning has been proposed to mitigate this issue by identifying a small subset of relevant data, its application in ASR has been barely explored, and existing works…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  2. arXiv:2310.17864  [pdf, other]

    eess.AS cs.SD

    TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

    Authors: Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

    Abstract: TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's devel…

    Submitted 26 October, 2023; originally announced October 2023.

  3. arXiv:2303.17200  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

    Authors: Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen

    Abstract: Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first time, we study the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems wit…

    Submitted 3 April, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

    Comments: IEEE/CVF CVPR 2023

  4. Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

    Authors: Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic

    Abstract: Audio-visual speech recognition has received a lot of attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets. However, accurate labelling of datasets is time-consuming and expensive. Hence…

    Submitted 28 June, 2023; v1 submitted 24 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023

  5. arXiv:2303.09455  [pdf, other]

    cs.CL cs.CV cs.LG cs.SD eess.AS

    Learning Cross-lingual Visual Speech Representations

    Authors: Andreas Zinonos, Alexandros Haliassos, Pingchuan Ma, Stavros Petridis, Maja Pantic

    Abstract: Cross-lingual self-supervised learning has been a growing research topic in the last few years. However, current works only explored the use of audio signals to create representations. In this work, we study cross-lingual self-supervised visual representation learning. We use the recently-proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled…

    Submitted 14 March, 2023; originally announced March 2023.

  6. arXiv:2211.10999  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

    Authors: Rodrigo Mira, Buye Xu, Jacob Donley, Anurag Kumar, Stavros Petridis, Vamsi Krishna Ithapu, Maja Pantic

    Abstract: Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use…

    Submitted 13 March, 2023; v1 submitted 20 November, 2022; originally announced November 2022.

    Comments: accepted to ICASSP 2023

  7. arXiv:2211.02133  [pdf, other]

    eess.AS cs.CV cs.SD

    Streaming Audio-Visual Speech Recognition with Alignment Regularization

    Authors: Pingchuan Ma, Niko Moritz, Stavros Petridis, Christian Fuegen, Maja Pantic

    Abstract: In this work, we propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture. The audio and the visual encoder neural networks are both based on the conformer architecture, which is made streamable using chunk-wise self-attention (CSA) and causal convolution. Streaming recognition with a decoder neural network is realized by…

    Submitted 1 July, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

    Comments: Accepted to Interspeech 2023

  8. arXiv:2205.02058  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    SVTS: Scalable Video-to-Speech Synthesis

    Authors: Rodrigo Mira, Alexandros Haliassos, Stavros Petridis, Björn W. Schuller, Maja Pantic

    Abstract: Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contempora…

    Submitted 15 August, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

    Comments: accepted to INTERSPEECH 2022 (Oral Presentation)

  9. arXiv:2202.13084  [pdf, other]

    cs.CV cs.SD eess.AS

    Visual Speech Recognition for Multiple Languages in the Wild

    Authors: Pingchuan Ma, Stavros Petridis, Maja Pantic

    Abstract: Visual speech recognition (VSR) aims to recognize the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development of much more accurate and robust VSR models than ever before. However, these advances are usually due to the larger training sets rather than the model design. H…

    Submitted 30 October, 2022; v1 submitted 26 February, 2022; originally announced February 2022.

    Comments: Published in Nature Machine Intelligence

  10. arXiv:2106.09171  [pdf, other]

    cs.LG cs.CV cs.SD eess.AS

    LiRA: Learning Visual Speech Representations from Audio through Self-supervision

    Authors: Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Björn W. Schuller, Maja Pantic

    Abstract: The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted for publication at Interspeech 2021

  11. arXiv:2104.13332  [pdf, other]

    cs.LG cs.CV cs.SD eess.AS

    End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks

    Authors: Rodrigo Mira, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Björn W. Schuller, Maja Pantic

    Abstract: Video-to-speech is the process of reconstructing the audio speech from a video of a spoken utterance. Previous approaches to this task have relied on a two-step process where an intermediate representation is inferred from the video, and is then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based…

    Submitted 15 August, 2022; v1 submitted 27 April, 2021; originally announced April 2021.

    Comments: Published in IEEE Transactions on Cybernetics (April 2022)

  12. arXiv:2102.09281  [pdf, other]

    cs.LG cs.CV cs.SD eess.AS

    DINO: A Conditional Energy-Based GAN for Domain Translation

    Authors: Konstantinos Vougioukas, Stavros Petridis, Maja Pantic

    Abstract: Domain translation is the process of transforming data from one domain to another while preserving the common semantics. Some of the most popular domain translation systems are based on conditional generative adversarial networks, which use source domain data to drive the generator and as an input to the discriminator. However, this approach does not enforce the preservation of shared semantics si…

    Submitted 18 February, 2021; originally announced February 2021.

    Comments: Accepted to ICLR 2021

  13. arXiv:2102.06657  [pdf, other]

    cs.CV eess.AS

    End-to-end Audio-visual Speech Recognition with Conformers

    Authors: Pingchuan Ma, Stavros Petridis, Maja Pantic

    Abstract: In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer), that can be trained in an end-to-end manner. In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms, respectively, which are then fed to conformers and then fusion takes place via a Multi-Layer Perceptron (MLP). T…

    Submitted 12 February, 2021; originally announced February 2021.

    Comments: Accepted to ICASSP 2021

  14. arXiv:2010.03623  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Domain Adversarial Neural Networks for Dysarthric Speech Recognition

    Authors: Dominika Woszczyk, Stavros Petridis, David Millard

    Abstract: Speech recognition systems have improved dramatically over the last few years, however, their performance is significantly degraded for the cases of accented or impaired speech. This work explores domain adversarial neural networks (DANN) for speaker-independent speech recognition on the UAS dataset of dysarthric speech. The classification task on 10 spoken digits is performed using an end-to-end…

    Submitted 7 October, 2020; originally announced October 2020.

    Comments: 5 pages, to be published in Interspeech 2020

  15. arXiv:2007.04134  [pdf, other]

    eess.AS cs.CL cs.CV cs.LG cs.MM cs.SD

    Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision

    Authors: Abhinav Shukla, Stavros Petridis, Maja Pantic

    Abstract: The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the…

    Submitted 8 July, 2020; originally announced July 2020.

    Comments: Accepted at the Workshop on Self-supervision in Audio and Speech at ICML 2020

  16. arXiv:2005.01400  [pdf, other]

    eess.AS cs.CL cs.CV cs.LG

    Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition?

    Authors: Abhinav Shukla, Stavros Petridis, Maja Pantic

    Abstract: Self-supervised learning has attracted plenty of recent research interest. However, most works for self-supervision in speech are typically unimodal and there has been limited work that studies the interaction between audio and visual modalities for cross-modal self-supervision. This work (1) investigates visual self-supervision via face reconstruction to guide the learning of audio representation…

    Submitted 18 March, 2021; v1 submitted 4 May, 2020; originally announced May 2020.

    Comments: Accepted for publication in IEEE Transactions on Affective Computing; v3: Publication-ready version including additional experiments and discussion

  17. arXiv:2001.08702  [pdf, other]

    cs.CV cs.SD eess.AS

    Lipreading using Temporal Convolutional Networks

    Authors: Brais Martinez, Pingchuan Ma, Stavros Petridis, Maja Pantic

    Abstract: Lip-reading has attracted a lot of research attention lately thanks to advances in deep learning. The current state-of-the-art model for recognition of isolated words in-the-wild consists of a residual network and Bidirectional Gated Recurrent Unit (BGRU) layers. In this work, we address the limitations of this model and we propose changes which further improve its performance. Firstly, the BGRU l…

    Submitted 23 January, 2020; originally announced January 2020.

  18. arXiv:2001.04316  [pdf, other]

    eess.AS cs.CV cs.MM

    Visually Guided Self Supervised Learning of Speech Representations

    Authors: Abhinav Shukla, Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic

    Abstract: Self supervised representation learning has recently attracted a lot of research interest for both the audio and visual modalities. However, most works typically focus on a particular modality or feature alone and there has been very limited work that studies the interaction between the two modalities for learning self supervised representations. We propose a framework for learning audio represent…

    Submitted 20 February, 2020; v1 submitted 13 January, 2020; originally announced January 2020.

    Comments: Accepted at ICASSP 2020; v2: Updated to the ICASSP 2020 camera ready version

  19. arXiv:1912.08639  [pdf, other]

    cs.CV cs.SD eess.AS

    Detecting Adversarial Attacks On Audiovisual Speech Recognition

    Authors: Pingchuan Ma, Stavros Petridis, Maja Pantic

    Abstract: Adversarial attacks pose a threat to deep learning models. However, research on adversarial detection methods, especially in the multi-modal domain, is very limited. In this work, we propose an efficient and straightforward detection method based on the temporal correlation between audio and video streams. The main idea is that the correlation between audio and video in adversarial examples will b…

    Submitted 12 February, 2021; v1 submitted 18 December, 2019; originally announced December 2019.

    Comments: Accepted to ICASSP 2021

  20. arXiv:1912.05833  [pdf, other]

    cs.LG eess.AS stat.ML

    Speech-driven facial animation using polynomial fusion of features

    Authors: Triantafyllos Kefalas, Konstantinos Vougioukas, Yannis Panagakis, Stavros Petridis, Jean Kossaifi, Maja Pantic

    Abstract: Speech-driven facial animation involves using a speech signal to generate realistic videos of talking faces. Recent deep learning approaches to facial synthesis rely on extracting low-dimensional representations and concatenating them, followed by a decoding step of the concatenated vector. This accounts for only first-order interactions of the features and ignores higher-order interactions. In th…

    Submitted 19 February, 2020; v1 submitted 12 December, 2019; originally announced December 2019.

  21. arXiv:1906.06337  [pdf, other]

    cs.CV cs.LG eess.AS

    Realistic Speech-Driven Facial Animation with GANs

    Authors: Konstantinos Vougioukas, Stavros Petridis, Maja Pantic

    Abstract: Speech-driven facial animation is the process that automatically synthesizes talking characters based on speech signals. The majority of work in this domain creates a mapping from audio features to visual features. This approach often requires post-processing using computer graphics techniques to produce realistic albeit subject dependent results. We present an end-to-end system that generates vid…

    Submitted 14 June, 2019; originally announced June 2019.

    Comments: arXiv admin note: text overlap with arXiv:1805.09313

  22. arXiv:1906.06301  [pdf, other]

    eess.AS cs.CV cs.SD

    Video-Driven Speech Reconstruction using Generative Adversarial Networks

    Authors: Konstantinos Vougioukas, Pingchuan Ma, Stavros Petridis, Maja Pantic

    Abstract: Speech is a means of communication which relies on both audio and visual information. The absence of one modality can often lead to confusion or misinterpretation of information. In this paper we present an end-to-end temporal model capable of directly synthesising audio from silent video, without needing to transform to-and-from intermediate features. Our proposed approach, based on GANs is capab…

    Submitted 14 June, 2019; originally announced June 2019.

  23. arXiv:1906.02112  [pdf, other]

    eess.AS cs.CV eess.IV

    Investigating the Lombard Effect Influence on End-to-End Audio-Visual Speech Recognition

    Authors: Pingchuan Ma, Stavros Petridis, Maja Pantic

    Abstract: Several audio-visual speech recognition models have been recently proposed which aim to improve the robustness over audio-only models in the presence of noise. However, almost all of them ignore the impact of the Lombard effect, i.e., the change in speaking style in noisy environments which aims to make speech more intelligible and affects both the acoustic characteristics of speech and the lip mo…

    Submitted 9 July, 2019; v1 submitted 5 June, 2019; originally announced June 2019.

    Comments: Accepted for publication at Interspeech 2019

  24. arXiv:1805.09313  [pdf, other]

    eess.AS cs.CV cs.SD eess.IV

    End-to-End Speech-Driven Facial Animation with Temporal GANs

    Authors: Konstantinos Vougioukas, Stavros Petridis, Maja Pantic

    Abstract: Speech-driven facial animation is the process which uses speech signals to automatically synthesize a talking character. The majority of work in this domain creates a mapping from audio features to visual features. This often requires post-processing using computer graphics techniques to produce realistic albeit subject dependent results. We present a system for generating videos of a talking head…

    Submitted 19 July, 2018; v1 submitted 23 May, 2018; originally announced May 2018.