Showing 1–11 of 11 results for author: Theobald, B

  1. arXiv:2402.00340

    cs.SD eess.AS

    Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?

    Authors: Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Skyler Seto, Tatiana Likhomanenko, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Barry-John Theobald

    Abstract: Self-supervised features are typically used in place of filter-bank features in speaker verification models. However, these models were originally designed to ingest filter-bank features as inputs, and thus, training them on top of self-supervised features assumes that both feature types require the same amount of learning for the task. In this work, we observe that pre-trained self-supervised spe…

    Submitted 13 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  2. arXiv:2401.17230

    cs.SD cs.AI eess.AS

    ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

    Authors: Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe

    Abstract: This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also…

    Submitted 13 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: 5 pages, 3 figures, 7 tables, Interspeech 2024

  3. arXiv:2308.09514

    cs.SD cs.AI cs.LG eess.AS

    Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning

    Authors: Miguel Sarabia, Elena Menyaylenko, Alessandro Toso, Skyler Seto, Zakaria Aldeneh, Shadi Pirhosseinloo, Luca Zappella, Barry-John Theobald, Nicholas Apostoloff, Jonathan Sheaffer

    Abstract: We present Spatial LibriSpeech, a spatial audio dataset with over 650 hours of 19-channel audio, first-order ambisonics, and optional distractor noise. Spatial LibriSpeech is designed for machine learning model training, and it includes labels for source position, speaking direction, room acoustics and geometry. Spatial LibriSpeech is generated by augmenting LibriSpeech samples with 200k+ simulate…

    Submitted 18 August, 2023; originally announced August 2023.

    Journal ref: Proceedings of INTERSPEECH (2023), pp. 3724-3728

  4. arXiv:2210.14800

    eess.AS cs.HC cs.SD

    Naturalistic Head Motion Generation from Speech

    Authors: Trisha Mittal, Zakaria Aldeneh, Masha Fedzechkina, Anurag Ranjan, Barry-John Theobald

    Abstract: Synthesizing natural head motion to accompany speech for an embodied conversational agent is necessary for providing a rich interactive experience. Most prior works assess the quality of generated head motion by comparing them against a single ground-truth using an objective metric. Yet there are many plausible head motion sequences to accompany a speech utterance. In this work, we study the varia…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  5. arXiv:2203.10117

    cs.SD cs.CV cs.GR eess.AS

    On the role of Lip Articulation in Visual Speech Perception

    Authors: Zakaria Aldeneh, Masha Fedzechkina, Skyler Seto, Katherine Metcalf, Miguel Sarabia, Nicholas Apostoloff, Barry-John Theobald

    Abstract: Generating realistic lip motion from audio to simulate speech production is critical for driving natural character animation. Previous research has shown that traditional metrics used to optimize and assess models for generating lip motion from speech are not a good indicator of subjective opinion of animation quality. Devising metrics that align with subjective opinion first requires understandin…

    Submitted 10 November, 2022; v1 submitted 18 March, 2022; originally announced March 2022.

    Comments: Submitted to ICASSP 2023

  6. arXiv:2005.13616

    eess.AS cs.LG cs.SD stat.ML

    Modality Dropout for Improved Performance-driven Talking Faces

    Authors: Ahmed Hussen Abdelaziz, Barry-John Theobald, Paul Dixon, Reinhard Knothe, Nicholas Apostoloff, Sachin Kajarekar

    Abstract: We describe our novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-speech facial movements are generated using only visual information. To ensure that our model exploits both modalities during training, batches are generated that contain audio-only, v…

    Submitted 27 May, 2020; originally announced May 2020.

    Comments: Pre-print

  7. arXiv:2004.12031

    cs.LG cs.CL cs.CV cs.SD eess.AS

    On the Role of Visual Cues in Audiovisual Speech Enhancement

    Authors: Zakaria Aldeneh, Anushree Prasanna Kumar, Barry-John Theobald, Erik Marchi, Sachin Kajarekar, Devang Naik, Ahmed Hussen Abdelaziz

    Abstract: We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual cues provide not only high-level information about speech activity, i.e., speech/silence, but also fine-grained visual information about the place of…

    Submitted 25 February, 2021; v1 submitted 24 April, 2020; originally announced April 2020.

    Comments: ICASSP 2021

  8. arXiv:1905.06860

    eess.AS cs.LG cs.SD stat.ML

    Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models

    Authors: Ahmed Hussen Abdelaziz, Barry-John Theobald, Justin Binder, Gabriele Fanelli, Paul Dixon, Nicholas Apostoloff, Thibaut Weise, Sachin Kajarekar

    Abstract: Speech-driven visual speech synthesis involves mapping features extracted from acoustic speech to the corresponding lip animation controls for a face model. This mapping can take many forms, but a powerful approach is to use deep neural networks (DNNs). However, a limitation is the lack of synchronized audio, video, and depth data required to reliably train the DNNs, especially for speaker-indepen…

    Submitted 14 May, 2019; originally announced May 2019.

    Comments: 9 pages, 2 figures, 2 tables

    ACM Class: I.2.m; I.3.8

  9. arXiv:1710.01093

    cs.CV cs.CL eess.AS

    Which phoneme-to-viseme maps best improve visual-only computer lip-reading?

    Authors: Helen L. Bear, Richard W. Harvey, Barry-John Theobald, Yuxuan Lan

    Abstract: A critical assumption of all current visual speech recognition systems is that there are visual speech units called visemes which can be mapped to units of acoustic speech, the phonemes. Despite there being a number of published maps it is infrequent to see the effectiveness of these tested, particularly on visual-only lip-reading (many works use audio-visual speech). Here we examine 120 mappings…

    Submitted 3 October, 2017; originally announced October 2017.

    Journal ref: Helen L. Bear, Richard W. Harvey, Barry-John Theobald, and Yuxuan Lan. Which phoneme-to-viseme maps best improve visual-only computer lip-reading? Advances in Visual Computing 2014. p230-239

  10. arXiv:1710.01084

    cs.CV eess.IV

    Some observations on computer lip-reading: moving from the dream to the reality

    Authors: Helen L. Bear, Gari Owen, Richard Harvey, Barry-John Theobald

    Abstract: In the quest for greater computer lip-reading performance there are a number of tacit assumptions which are either present in the datasets (high resolution for example) or in the methods (recognition of spoken visual units called visemes for example). Here we review these and other assumptions and show the surprising result that computer lip-reading is not heavily constrained by video resolution,…

    Submitted 3 October, 2017; originally announced October 2017.

    Journal ref: Helen L. Bear, Gari Owen, Richard Harvey, and Barry-John Theobald. Some observations on computer lip-reading: moving from the dream to the reality. International Society for Optics and Photonics- Security and defence. 2014. p92530G--92530G

  11. arXiv:1710.01073

    cs.CV eess.IV

    Resolution limits on visual speech recognition

    Authors: Helen L. Bear, Richard Harvey, Barry-John Theobald, Yuxuan Lan

    Abstract: Visual-only speech recognition is dependent upon a number of factors that can be difficult to control, such as: lighting; identity; motion; emotion and expression. But some factors, such as video resolution are controllable, so it is surprising that there is not yet a systematic study of the effect of resolution on lip-reading. Here we use a new data set, the Rosetta Raven data, to train and test…

    Submitted 3 October, 2017; originally announced October 2017.

    Journal ref: Helen L. Bear, Richard Harvey, Barry-John Theobald, Yuxuan Lan. Resolution limits on visual speech recognition. International Conference on Image Processing (ICIP). 2014. p1371-1375