Showing 1–45 of 45 results for author: Harwath, D

  1. arXiv:2406.12209  [pdf, other]

    cs.SD cs.CL eess.AS

    Interface Design for Self-Supervised Speech Models

    Authors: Yi-Jen Shih, David Harwath

    Abstract: Self-supervised speech (SSL) models have recently become widely adopted for many downstream speech processing tasks. The general usage pattern is to employ SSL models as feature extractors, and then train a downstream prediction head to solve a specific task. However, different layers of SSL models have been shown to capture different types of information, and the methods of combining them are not… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech2024

  2. arXiv:2406.09272  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

    Authors: Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman

    Abstract: Generating realistic audio for human interactions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinat… ▽ More

    Submitted 20 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Project page: https://vision.cs.utexas.edu/projects/action2sound

  3. arXiv:2406.06438  [pdf, other]

    cs.CL cs.CV cs.HC cs.LG cs.SD eess.AS

    Multimodal Contextualized Semantic Parsing from Speech

    Authors: Jordan Voas, Raymond Mooney, David Harwath

    Abstract: We introduce Semantic Parsing in Contextual Environments (SPICE), a task designed to enhance artificial agents' contextual awareness by integrating multimodal inputs with prior contexts. SPICE goes beyond traditional semantic parsing by offering a structured, interpretable framework for dynamically updating an agent's knowledge with new information, mirroring the complexity of human communication.… ▽ More

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 10 Pages, 3 figures, ACL 2024 Main

  4. arXiv:2404.05206  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

    Authors: Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

    Abstract: We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when… ▽ More

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: Accepted at CVPR 2024. Project page: https://vision.cs.utexas.edu/projects/soundingactions

  5. arXiv:2403.16973  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

    Authors: Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath

    Abstract: We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts. VoiceCraft employs a Transformer decoder architecture and introduces a token rearrangement procedure that combines causal masking and delayed stacking to enable generation within an… ▽ More

    Submitted 13 June, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: ACL 2024. Data, code, and model weights are available at https://github.com/jasonppy/VoiceCraft

  6. arXiv:2402.06959  [pdf, other]

    cs.CL cs.SD eess.AS

    SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data

    Authors: Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-yi Lee, Hsin-Min Wang, David Harwath

    Abstract: The recently proposed visually grounded speech model SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription. On this basis, this paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture. Second, we propos… ▽ More

    Submitted 10 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop

  7. arXiv:2402.05819  [pdf, other]

    eess.AS cs.CL cs.LG

    Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model

    Authors: Hung-Chieh Fang, Nai-Xuan Ye, Yi-Jen Shih, Puyuan Peng, Hsuan-Fu Wang, Layne Berry, Hung-yi Lee, David Harwath

    Abstract: Recent advances in self-supervised speech models have shown significant improvement in many downstream tasks. However, these models predominantly centered on frame-level training objectives, which can fall short in spoken language understanding tasks that require semantic comprehension. Existing works often rely on additional speech-text data as intermediate targets, which is costly in the real-wo… ▽ More

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024 workshop on Self-supervision in Audio, Speech, and Beyond (SASB)

  8. arXiv:2402.01591  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    BAT: Learning to Reason about Spatial Sounds with Large Language Models

    Authors: Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath

    Abstract: Spatial sound reasoning is a fundamental human skill, enabling us to navigate and interpret our surroundings based on sound. In this paper we present BAT, which combines the spatial sound perception ability of a binaural acoustic scene analysis model with the natural language reasoning capabilities of a large language model (LLM) to replicate this innate ability. To address the lack of existing da… ▽ More

    Submitted 25 May, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: Accepted to ICML 2024. Our demo, dataset, code and model weights are available at: https://zhishengzheng.com/BAT

  9. arXiv:2310.07654  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Audio-Visual Neural Syntax Acquisition

    Authors: Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass

    Abstract: We study phrase structure induction from visually-grounded speech. The core idea is to first segment the speech waveform into sequences of word segments, and subsequently induce phrase structure using the inferred segment-level continuous representations. We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without eve… ▽ More

    Submitted 11 October, 2023; originally announced October 2023.

  10. arXiv:2309.10787  [pdf, other]

    eess.AS cs.CV cs.MM cs.SD

    AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

    Authors: Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee

    Abstract: Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual a… ▽ More

    Submitted 19 March, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024; Evaluation Code: https://github.com/roger-tseng/av-superb Submission Platform: https://av.superbbenchmark.org

  11. arXiv:2306.15644  [pdf, other]

    cs.CL

    Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos

    Authors: Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux

    Abstract: To realize human-robot collaboration, robots need to execute actions for new tasks according to human instructions given finite prior knowledge. Human experts can share their knowledge of how to perform a task with a robot through multi-modal instructions in their demonstrations, showing a sequence of short-horizon steps to achieve a long-horizon goal. This paper introduces a method for robot acti… ▽ More

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: Accepted to Interspeech2023

  12. arXiv:2306.08667  [pdf, other]

    cs.CL cs.SD eess.AS

    When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants

    Authors: Anuj Diwan, Eunsol Choi, David Harwath

    Abstract: We present the first unified study of the efficiency of self-attention-based Transformer variants spanning text, speech and vision. We identify input length thresholds (tipping points) at which efficient Transformer variants become more efficient than vanilla models, using a variety of efficiency metrics (latency, throughput, and memory). To conduct this analysis for speech, we introduce L-HuBERT,… ▽ More

    Submitted 14 June, 2023; originally announced June 2023.

    Comments: 10 pages, 6 pages. Accepted to ACL 2023

  13. arXiv:2305.15405  [pdf, other]

    cs.CL eess.AS

    Textless Low-Resource Speech-to-Speech Translation With Unit Language Models

    Authors: Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi

    Abstract: Existing speech-to-speech translation models fall into two camps: textless models trained with hundreds of hours of parallel speech data or unsupervised models that leverage text as an intermediate step. Both approaches limit building speech-to-speech translation models for a wide range of languages, as they exclude languages that are primarily spoken and language pairs that lack large-scale paral… ▽ More

    Submitted 20 February, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: 20 pages, 4 figures

  14. arXiv:2305.12606  [pdf, other]

    cs.CL cs.SD eess.AS

    Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

    Authors: Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass

    Abstract: Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each. However, there are thousands of spoken languages worldwide, and adapting to new languages is an important problem. In this work, we aim to understand which model adapts better to languages unseen during pre-training. We fine-tune both mo… ▽ More

    Submitted 30 May, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  15. arXiv:2305.11435  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

    Authors: Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath

    Abstract: In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of thi… ▽ More

    Submitted 23 July, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023. Code & Model: https://github.com/jasonppy/syllable-discovery

  16. arXiv:2305.11095  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

    Authors: Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath

    Abstract: We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering. We selected three tasks: audio-visual speech recognition (AVSR), code-switched speech recognition (CS-ASR), and speech translation (ST) on unseen language pairs. We design task-specific prompts, by either leveraging another large-scale model, or sim… ▽ More

    Submitted 15 August, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023

  17. arXiv:2212.01661  [pdf, other]

    eess.AS cs.CL cs.SD

    Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models

    Authors: Reem Gody, David Harwath

    Abstract: Self-supervised learning (SSL) has been able to leverage unlabeled data to boost the performance of automatic speech recognition (ASR) models when we have access to only a small amount of transcribed speech data. However, this raises the question of which subset of the available unlabeled data should be selected for transcription. Our work investigates different unsupervised data selection techniq… ▽ More

    Submitted 3 December, 2022; originally announced December 2022.

  18. arXiv:2212.01393  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Continual Learning for On-Device Speech Recognition using Disentangled Conformers

    Authors: Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, Abdelrahman Mohamed

    Abstract: Automatic speech recognition research focuses on training and evaluating on static datasets. Yet, as speech models are increasingly deployed on personal devices, such models encounter user-specific distributional shifts. To simulate this real-world scenario, we introduce LibriContinual, a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks, with dat… ▽ More

    Submitted 13 December, 2022; v1 submitted 2 December, 2022; originally announced December 2022.

    Comments: 8 pages, 2 figures. Submitted to ICASSP 2023

  19. arXiv:2211.01461  [pdf, other]

    eess.AS cs.CL cs.SD

    Phoneme Segmentation Using Self-Supervised Speech Models

    Authors: Luke Strgar, David Harwath

    Abstract: We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of representations learned in self-supervised pre-training for the task. Our model extends transformer-style encoders with strategically placed convolutions that manipulate features learned in pre-training. Using the TIMIT and Buckeye corpora we train and test the model in the supervised and unsupervised set… ▽ More

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted to SLT 2022

  20. arXiv:2211.01180  [pdf, other]

    cs.CL cs.SD eess.AS

    M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

    Authors: Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-yi Lee, David Harwath

    Abstract: This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages. We identify key differenc… ▽ More

    Submitted 10 April, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP 2023

  21. arXiv:2211.00768  [pdf, other]

    cs.CL cs.CV

    Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

    Authors: Anuj Diwan, Layne Berry, Eunsol Choi, David Harwath, Kyle Mahowald

    Abstract: Recent visuolinguistic pre-trained models show promising progress on various end tasks such as image retrieval and video captioning. Yet, they fail miserably on the recently proposed Winoground dataset, which challenges models to match paired images and English captions, with items constructed to overlap lexically but differ in meaning (e.g., "there is a mug in some grass" vs. "there is some grass… ▽ More

    Submitted 3 December, 2022; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: Accepted at EMNLP 2022. We release our annotation and code at https://github.com/ajd12342/why-winoground-hard . 15 pages, 3 figures

  22. arXiv:2210.07839  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    Contrastive Audio-Visual Masked Autoencoder

    Authors: Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

    Abstract: In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments… ▽ More

    Submitted 11 April, 2023; v1 submitted 2 October, 2022; originally announced October 2022.

    Comments: Accepted at ICLR 2023 as a notable top 25% paper. Code and pretrained models are at https://github.com/yuangongnd/cav-mae

  23. arXiv:2210.03625  [pdf, other]

    cs.CL cs.CV cs.MM

    C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

    Authors: Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass

    Abstract: Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for other languages lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in differen… ▽ More

    Submitted 9 May, 2023; v1 submitted 7 October, 2022; originally announced October 2022.

    Comments: Accepted at ICASSP 2023. The code, models, and dataset are available at https://github.com/roudimit/c2kd

  24. arXiv:2210.00705  [pdf, other]

    cs.CL cs.SD eess.AS

    SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

    Authors: Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath

    Abstract: Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly. Therefore, we propose SpeechCLIP, a novel framework bridging speech and text through images to enhance speech models without transcriptions. We leverage state-of-the-art pre-trained HuBERT and CLIP, aligning them via paired images and spoken captions… ▽ More

    Submitted 25 October, 2022; v1 submitted 3 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  25. arXiv:2203.16691  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

    Authors: Alan Baade, Puyuan Peng, David Harwath

    Abstract: In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens. We address this by integrating… ▽ More

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH. 5 pages, 2 figures, 5 tables

  26. arXiv:2203.15081  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Word Discovery in Visually Grounded, Self-Supervised Speech Models

    Authors: Puyuan Peng, David Harwath

    Abstract: We present a method for visually-grounded spoken term discovery. After training either a HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show that powerful word segmentation and clustering capability emerges within the model's self-attention heads. Our experiments reveal that this ability is not present to nearly the same extent in the base HuBERT and wav2vec2.0 mod… ▽ More

    Submitted 19 June, 2023; v1 submitted 28 March, 2022; originally announced March 2022.

    Comments: Interspeech 2022 Oral. Update Table 5

  27. arXiv:2203.11294  [pdf, other]

    cs.SD

    Automated detection of foreground speech with wearable sensing in everyday home environments: A transfer learning approach

    Authors: Dawei Liang, Zifan Xu, Yinuo Chen, Rebecca Adaimi, David Harwath, Edison Thomaz

    Abstract: Acoustic sensing has proved effective as a foundation for numerous applications in health and human behavior analysis. In this work, we focus on the problem of detecting in-person social interactions in naturalistic settings from audio captured by a smartwatch. As a first step towards detecting social interactions, it is critical to distinguish the speech of the individual wearing the watch from a… ▽ More

    Submitted 21 March, 2022; originally announced March 2022.

    Comments: Submitted to Interspeech 2022

  28. arXiv:2202.03543  [pdf, other]

    eess.AS cs.CL cs.CV cs.LG cs.SD

    Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling

    Authors: Puyuan Peng, David Harwath

    Abstract: In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge and SUPERB benchmark. Our submissions are based on the recently proposed FaST-VGS model, which is a Transformer-based model that learns to associate raw speech waveforms with semantically related images, all without the use of any transcriptions of the speech. Additionally, we introduce a novel extension of this model, FaS… ▽ More

    Submitted 2 March, 2022; v1 submitted 7 February, 2022; originally announced February 2022.

    Comments: SAS workshop at AAAI2022, code and model weights available at https://github.com/jasonppy/FaST-VGS-Family

  29. arXiv:2112.04446  [pdf, other]

    cs.CV cs.CL cs.SD eess.AS

    Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

    Authors: Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

    Abstract: Multi-modal learning from video data has seen increased attention recently as it allows to train semantically meaningful embeddings without human annotation enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text,… ▽ More

    Submitted 18 August, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

    Comments: CVPR2022. The final published version of the proceedings will be available on IEEE Xplore

  30. arXiv:2112.00775  [pdf, other]

    cs.CV

    Routing with Self-Attention for Multimodal Capsule Networks

    Authors: Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah

    Abstract: The task of multimodal learning has seen a growing interest recently as it allows for training neural architectures based on different modalities such as vision, text, and audio. One challenge in training such models is that they need to jointly learn semantic concepts and their relationships across different input representations. Capsule networks have been shown to perform well in context of cap… ▽ More

    Submitted 1 December, 2021; originally announced December 2021.

  31. arXiv:2111.04823  [pdf, other]

    cs.CL cs.CV cs.MM cs.SD eess.AS eess.IV

    Cascaded Multilingual Audio-Visual Learning from Videos

    Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass

    Abstract: In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that levera… ▽ More

    Submitted 8 November, 2021; originally announced November 2021.

    Comments: Presented at Interspeech 2021. This version contains updated results using the YouCook-Japanese dataset

  32. arXiv:2109.08186  [pdf, other]

    eess.AS cs.CL cs.IR

    Fast-Slow Transformer for Visually Grounding Speech

    Authors: Puyuan Peng, David Harwath

    Abstract: We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images. The model unifies dual-encoder and cross-attention architectures into a single model, reaping the superior retrieval speed of the former along with the accuracy of the latter. FaST-VGS achieves state-of-the-… ▽ More

    Submitted 2 March, 2022; v1 submitted 16 September, 2021; originally announced September 2021.

    Comments: ICASSP 2022, code and model weights are available at https://github.com/jasonppy/FaST-VGS-Family

  33. arXiv:2106.07732  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    Learning Audio-Visual Dereverberation

    Authors: Changan Chen, Wei Sun, David Harwath, Kristen Grauman

    Abstract: Reverberation not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a human speaker reveals important cues about the room geometry… ▽ More

    Submitted 13 March, 2023; v1 submitted 14 June, 2021; originally announced June 2021.

    Comments: Accepted at ICASSP 2023. This is the longer version of the five-page camera-ready paper. Project page: https://vision.cs.utexas.edu/projects/learning-audio-visual-dereverberation

  34. arXiv:2105.04489  [pdf, other]

    cs.CV cs.CL cs.LG cs.SD eess.AS

    Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

    Authors: Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva

    Abstract: When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who and how) of the observed event and exclude background information that is deemed unimportant to the observer. With this in mind, the descriptions people gener… ▽ More

    Submitted 10 May, 2021; originally announced May 2021.

    Comments: To appear at CVPR 2021

  35. arXiv:2104.12671  [pdf, other]

    cs.CV

    Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

    Authors: Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang

    Abstract: Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalitie… ▽ More

    Submitted 3 September, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

    Comments: To be presented at ICCV 2021

    Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8012-8021

  36. arXiv:2012.15454  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG eess.AS

    Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

    Authors: Wei-Ning Hsu, David Harwath, Christopher Song, James Glass

    Abstract: In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised vi… ▽ More

    Submitted 31 December, 2020; originally announced December 2020.

  37. arXiv:2006.09199  [pdf, other]

    cs.CV cs.CL cs.MM cs.SD eess.AS

    AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

    Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass

    Abstract: Current methods for learning visually grounded language from videos often rely on text annotation, such as human generated captions or machine generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. To circumvent the nee… ▽ More

    Submitted 29 June, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

    Comments: A version of this work has been accepted to Interspeech 2021

  38. arXiv:1911.09602  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

    Authors: David Harwath, Wei-Ning Hsu, James Glass

    Abstract: In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. What differentiates this paper from prior work on speech unit learning is the choice of training objective. Rather… ▽ More

    Submitted 14 February, 2020; v1 submitted 21 November, 2019; originally announced November 2019.

    Comments: Camera-ready version for ICLR

  39. arXiv:1907.04355  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Transfer Learning from Audio-Visual Grounding to Speech Recognition

    Authors: Wei-Ning Hsu, David Harwath, James Glass

    Abstract: Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks. This paper proposes a novel transfer learning scenario, which distills robust phonetic features from grounding models that are trained to tell whether a pair of image and speech are semantically correlated, without using any textual transcripts.…

    Submitted 9 July, 2019; originally announced July 2019.

    Comments: Accepted to Interspeech 2019. 4 pages, 2 figures

  40. arXiv:1902.08213  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Towards Visually Grounded Sub-Word Speech Unit Discovery

    Authors: David Harwath, James Glass

    Abstract: In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes. We show how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging t…

    Submitted 21 February, 2019; originally announced February 2019.

    Comments: Accepted to ICASSP 2019

  41. arXiv:1804.03052  [pdf, other]

    cs.CL cs.SD eess.AS

    Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech

    Authors: David Harwath, Galen Chuang, James Glass

    Abstract: In this paper, we explore the learning of neural network embeddings for natural images and speech waveforms describing the content of those images. These embeddings are learned directly from the waveforms without the use of linguistic transcriptions or conventional speech recognition technology. While prior work has investigated this setting in the monolingual case using English speech data, this…

    Submitted 9 April, 2018; originally announced April 2018.

    Comments: To appear at ICASSP 2018

  42. arXiv:1804.01452  [pdf, other]

    cs.CV cs.CL cs.SD

    Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

    Authors: David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, James Glass

    Abstract: In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly…

    Submitted 4 April, 2018; originally announced April 2018.

  43. arXiv:1712.03897  [pdf, other]

    cs.LG cs.CL cs.CV

    Learning Modality-Invariant Representations for Speech and Images

    Authors: Kenneth Leidal, David Harwath, James Glass

    Abstract: In this paper, we explore the unsupervised learning of a semantic embedding space for co-occurring sensory inputs. Specifically, we focus on the task of learning a semantic vector space for both spoken and handwritten digits using the TIDIGITs and MNIST datasets. Current techniques encode image and audio/textual inputs directly to semantic embeddings. In contrast, our technique maps an input to th…

    Submitted 11 December, 2017; originally announced December 2017.

  44. arXiv:1701.07481  [pdf, other]

    cs.CL cs.CV

    Learning Word-Like Units from Joint Audio-Visual Analysis

    Authors: David Harwath, James R. Glass

    Abstract: Given a collection of images and spoken audio captions, we present a method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions. For example, our model is able to detect spoken instances of the word 'lighthouse' within an utterance and associate them with image regions containing lighthouses. We do not use any form of c…

    Submitted 24 May, 2017; v1 submitted 25 January, 2017; originally announced January 2017.

  45. arXiv:1511.03690  [pdf, other]

    cs.CV cs.AI cs.CL

    Deep Multimodal Semantic Embeddings for Speech and Images

    Authors: David Harwath, James Glass

    Abstract: In this paper, we present a model which takes as input a corpus of images with relevant spoken captions and finds a correspondence between the two modalities. We employ a pair of convolutional neural networks to model visual objects and speech signals at the word level, and tie the networks together with an embedding and alignment model which learns a joint semantic space over both modalities. We…

    Submitted 11 November, 2015; originally announced November 2015.