Showing 1–10 of 10 results for author: Doulaty, M

  1. arXiv:2406.09569

    cs.CL cs.AI cs.SD eess.AS

    Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time

    Authors: Frank Seide, Morrie Doulaty, Yangyang Shi, Yashesh Gaur, Junteng Jia, Chunyang Wu

    Abstract: We introduce Speech ReaLLM, a new ASR architecture that marries "decoder-only" ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the first "decoder-only" ASR architecture designed to handle continuous audio without explicit end-pointing. Speech ReaLLM is a special case of the more general ReaLLM ("real-time LLM") approach, also introduced here for the…

    Submitted 13 June, 2024; originally announced June 2024.

  2. arXiv:2303.17200

    cs.CV cs.AI cs.SD eess.AS

    SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

    Authors: Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen

    Abstract: Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first time, we study the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems wit…

    Submitted 3 April, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

    Comments: IEEE/CVF CVPR 2023

  3. arXiv:2110.07058

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons…

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  4. arXiv:1606.03333

    cs.MM cs.CL cs.IR

    Automatic Genre and Show Identification of Broadcast Media

    Authors: Mortaza Doulaty, Oscar Saz, Raymond W. M. Ng, Thomas Hain

    Abstract: Huge amounts of digital videos are being produced and broadcast every day, leading to giant media archives. Effective techniques are needed to make such data accessible further. Automatic meta-data labelling of broadcast media is an essential task for multimedia indexing, where it is standard to use multi-modal input for such purposes. This paper describes a novel method for automatic detection of…

    Submitted 10 June, 2016; originally announced June 2016.

    Comments: Proc. of 17th Interspeech (2016), San Francisco, California, USA

  5. The 2015 Sheffield System for Transcription of Multi-Genre Broadcast Media

    Authors: Oscar Saz, Mortaza Doulaty, Salil Deena, Rosanna Milner, Raymond W. M. Ng, Madina Hasan, Yulan Liu, Thomas Hain

    Abstract: We describe the University of Sheffield system for participation in the 2015 Multi-Genre Broadcast (MGB) challenge task of transcribing multi-genre broadcast shows. Transcription was one of four tasks proposed in the MGB challenge, with the aim of advancing the state of the art of automatic speech recognition, speaker diarisation and automatic alignment of subtitles for broadcast media. Four topic…

    Submitted 21 December, 2015; originally announced December 2015.

    Comments: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015), 13-17 Dec 2015, Scottsdale, Arizona, USA

  6. Latent Dirichlet Allocation Based Organisation of Broadcast Media Archives for Deep Neural Network Adaptation

    Authors: Mortaza Doulaty, Oscar Saz, Raymond W. M. Ng, Thomas Hain

    Abstract: This paper presents a new method for the discovery of latent domains in diverse speech data, for the use of adaptation of Deep Neural Networks (DNNs) for Automatic Speech Recognition. Our work focuses on transcription of multi-genre broadcast media, which is often only categorised broadly in terms of high level genres such as sports, news, documentary, etc. However, in terms of acoustic modelling…

    Submitted 16 November, 2015; originally announced November 2015.

    Comments: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015), 13-17 Dec 2015, Scottsdale, Arizona, USA

  7. Background-tracking Acoustic Features for Genre Identification of Broadcast Shows

    Authors: Oscar Saz, Mortaza Doulaty, Thomas Hain

    Abstract: This paper presents a novel method for extracting acoustic features that characterise the background environment in audio recordings. These features are based on the output of an alignment that fits multiple parallel background-based Constrained Maximum Likelihood Linear Regression transformations asynchronously to the input audio signal. With this setup, the resulting features can track changes…

    Submitted 16 September, 2015; originally announced September 2015.

    Journal ref: IEEE Spoken Language Technology Workshop (SLT 2014), pp. 118-123, 7-10 Dec 2014, Lake Tahoe, NV, USA

  8. arXiv:1509.03870

    cs.CL

    The USFD Spoken Language Translation System for IWSLT 2014

    Authors: Raymond W. M. Ng, Mortaza Doulaty, Rama Doddipatla, Wilker Aziz, Kashif Shah, Oscar Saz, Madina Hasan, Ghada AlHarbi, Lucia Specia, Thomas Hain

    Abstract: The University of Sheffield (USFD) participated in the International Workshop for Spoken Language Translation (IWSLT) in 2014. In this paper, we will introduce the USFD SLT system for IWSLT. Automatic speech recognition (ASR) is achieved by two multi-pass deep neural network systems with adaptation and rescoring techniques. Machine translation (MT) is achieved by a phrase-based system. The USFD pr…

    Submitted 13 September, 2015; originally announced September 2015.

    Journal ref: Proc. of 11th International Workshop on Spoken Language Translation (IWSLT 2014), pp. 86-91, Lake Tahoe, USA, December 4th and 5th, 2014

  9. arXiv:1509.02412

    cs.CL

    Unsupervised Domain Discovery using Latent Dirichlet Allocation for Acoustic Modelling in Speech Recognition

    Authors: Mortaza Doulaty, Oscar Saz, Thomas Hain

    Abstract: Speech recognition systems are often highly domain dependent, a fact widely reported in the literature. However, the concept of domain is complex and not bound to clear criteria. Hence it is often not evident if data should be considered to be out-of-domain. While both acoustic and language models can be domain specific, work in this paper concentrates on acoustic modelling. We present a novel meth…

    Submitted 8 September, 2015; originally announced September 2015.

    Journal ref: Proc. of 16th Interspeech (2015), pp. 3640-3644, Dresden, Germany

  10. arXiv:1509.02409

    cs.LG cs.CL cs.SD

    Data-selective Transfer Learning for Multi-Domain Speech Recognition

    Authors: Mortaza Doulaty, Oscar Saz, Thomas Hain

    Abstract: Negative transfer in training of acoustic models for automatic speech recognition has been reported in several contexts such as domain change or speaker characteristics. This paper proposes a novel technique to overcome negative transfer by efficient selection of speech data for acoustic model training. Here data is chosen on relevance for a specific target. A submodular function based on likeliho…

    Submitted 8 September, 2015; originally announced September 2015.

    Journal ref: Proc. of 16th Interspeech (2015), pp. 2897-2901