Showing 1–10 of 10 results for author: Harte, N

Searching in archive cs.

  1. arXiv:2012.07467  [pdf, other]

    eess.AS cs.LG

    AV Taris: Online Audio-Visual Speech Recognition

    Authors: George Sterpu, Naomi Harte

    Abstract: In recent years, Automatic Speech Recognition (ASR) technology has approached human-level performance on conversational speech under relatively clean listening conditions. In more demanding situations involving distant microphones, overlapped speech, background noise, or natural dialogue structures, the ASR error rate is at least an order of magnitude higher. The visual modality of speech carries…

    Submitted 14 December, 2020; originally announced December 2020.

  2. arXiv:2009.11939  [pdf, other]

    cs.CV cs.LG cs.MM

    Deep Multi-Scale Feature Learning for Defocus Blur Estimation

    Authors: Ali Karaali, Naomi Harte, Claudio Rosito Jung

    Abstract: This paper presents an edge-based defocus blur estimation method from a single defocused image. We first distinguish edges that lie at depth discontinuities (called depth edges, for which the blur estimate is ambiguous) from edges that lie at approximately constant depth regions (called pattern edges, for which the blur estimate is well-defined). Then, we estimate the defocus blur amount at patter…

    Submitted 7 November, 2021; v1 submitted 24 September, 2020; originally announced September 2020.

    Comments: under review

  3. arXiv:2006.04928  [pdf, other]

    eess.AS cs.LG cs.SD

    Learning to Count Words in Fluent Speech enables Online Speech Recognition

    Authors: George Sterpu, Christian Saam, Naomi Harte

    Abstract: Sequence to Sequence models, in particular the Transformer, achieve state of the art results in Automatic Speech Recognition. Practical usage is however limited to cases where full utterance latency is acceptable. In this work we introduce Taris, a Transformer-based online speech recognition system aided by an auxiliary task of incremental word counting. We use the cumulative word sum to dynamical…

    Submitted 24 November, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

    Comments: Accepted at the 8th IEEE Spoken Language Technology Workshop (SLT 2021)

  4. arXiv:2005.09297  [pdf, other]

    eess.AS cs.LG

    Should we hard-code the recurrence concept or learn it instead ? Exploring the Transformer architecture for Audio-Visual Speech Recognition

    Authors: George Sterpu, Christian Saam, Naomi Harte

    Abstract: The audio-visual speech fusion strategy AV Align has shown significant performance improvements in audio-visual speech recognition (AVSR) on the challenging LRS2 dataset. Performance improvements range between 7% and 30% depending on the noise level when leveraging the visual modality of speech in addition to the auditory one. This work presents a variant of AV Align where the recurrent Long Short…

    Submitted 19 May, 2020; originally announced May 2020.

    Comments: Submitted to INTERSPEECH 2020

  5. arXiv:2005.09128  [pdf, other]

    cs.CL

    Neural Generation of Dialogue Response Timings

    Authors: Matthew Roddy, Naomi Harte

    Abstract: The timings of spoken response offsets in human dialogue have been shown to vary based on contextual elements of the dialogue. We propose neural models that simulate the distributions of these response offsets, taking into account the response turn as well as the preceding turn. The models are designed to be integrated into the pipeline of an incremental spoken dialogue system (SDS). We evaluate o…

    Submitted 18 May, 2020; originally announced May 2020.

    Comments: Accepted to ACL 2020

  6. How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

    Authors: George Sterpu, Christian Saam, Naomi Harte

    Abstract: Audio-Visual Speech Recognition (AVSR) seeks to model, and thereby exploit, the dynamic relationship between a human voice and the corresponding mouth movements. A recently proposed multimodal fusion strategy, AV Align, based on state-of-the-art sequence to sequence neural networks, attempts to model this relationship by explicitly aligning the acoustic and visual representations of speech. This s…

    Submitted 17 April, 2020; originally announced April 2020.

    Comments: in IEEE/ACM Transactions on Audio, Speech, and Language Processing (to appear)

  7. arXiv:1809.01728  [pdf, other]

    eess.AS cs.LG cs.SD eess.IV stat.ML

    Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition

    Authors: George Sterpu, Christian Saam, Naomi Harte

    Abstract: Automatic speech recognition can potentially benefit from the lip motion patterns, complementing acoustic speech to improve the overall recognition performance, particularly in noise. In this paper we propose an audio-visual fusion strategy that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to enhanced representations which increase the reco…

    Submitted 1 May, 2019; v1 submitted 5 September, 2018; originally announced September 2018.

    Comments: In ICMI'18, October 16-20, 2018, Boulder, CO, USA. Equation (2) corrected in this version

  8. Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs

    Authors: Matthew Roddy, Gabriel Skantze, Naomi Harte

    Abstract: In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which modalities should be modeled. We design a multiscale RNN arc…

    Submitted 31 August, 2018; originally announced August 2018.

    Comments: Accepted for ICMI18

  9. arXiv:1806.11461  [pdf, other]

    cs.CL

    Investigating Speech Features for Continuous Turn-Taking Prediction Using LSTMs

    Authors: Matthew Roddy, Gabriel Skantze, Naomi Harte

    Abstract: For spoken dialog systems to conduct fluid conversational interactions with users, the systems must be sensitive to turn-taking cues produced by a user. Models should be designed so that effective decisions can be made as to when it is appropriate, or not, for the system to speak. Traditional end-of-turn models, where decisions are made at utterance end-points, are limited in their ability to mode…

    Submitted 29 June, 2018; originally announced June 2018.

    Comments: Accepted for Interspeech 2018

  10. arXiv:1805.11685  [pdf, other]

    eess.IV cs.CV eess.AS

    Can DNNs Learn to Lipread Full Sentences?

    Authors: George Sterpu, Christian Saam, Naomi Harte

    Abstract: Finding visual features and suitable models for lipreading tasks that are more complex than a well-constrained vocabulary has proven challenging. This paper explores state-of-the-art Deep Neural Network architectures for lipreading based on a Sequence to Sequence Recurrent Neural Network. We report results for both hand-crafted and 2D/3D Convolutional Neural Network visual front-ends, online monot…

    Submitted 29 May, 2018; originally announced May 2018.

    Comments: Accepted at the 2018 IEEE International Conference on Image Processing (ICIP 2018)