Showing 1–50 of 70 results for author: Glass, J

Searching in archive eess.
  1. arXiv:2406.18625

    cs.SD cs.AI eess.AS

    Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer

    Authors: Liming Wang, Yuan Gong, Nauman Dawalatabad, Marco Vilela, Katerina Placek, Brian Tracey, Yishu Gong, Alan Premasiri, Fernando Vieira, James Glass

    Abstract: Automatic prediction of amyotrophic lateral sclerosis (ALS) disease progression provides a more efficient and objective alternative to manual approaches. We propose ALS longitudinal speech transformer (ALST), a neural network-based automatic predictor of ALS disease progression from longitudinal speech recordings of ALS patients. By taking advantage of high-quality pretrained speech features and…

    Submitted 26 June, 2024; originally announced June 2024.

  2. arXiv:2406.10082

    eess.AS cs.CV cs.SD

    Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

    Authors: Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass

    Abstract: Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, speech models such as Whisper are trained with hundreds of thousands of hours of data, and thus learn a better speech-to-text decoder. The huge training data differe…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024. Code https://github.com/roudimit/whisper-flamingo

  3. arXiv:2401.08833

    eess.AS cs.CL cs.SD

    Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective

    Authors: Alexander H. Liu, Sung-Lin Yeh, James Glass

    Abstract: Existing studies on self-supervised speech representation learning have focused on developing new training methods and applying pre-trained models for different applications. However, the quality of these models is often measured by the performance of different downstream tasks. How well the representations access the information of interest is less studied. In this work, we take a closer look int…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: ICASSP 2024

  4. arXiv:2311.09117

    cs.CL cs.SD eess.AS

    R-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces

    Authors: Heng-Jui Chang, James Glass

    Abstract: This paper introduces Robust Spin (R-Spin), a data-efficient domain-specific self-supervision method for speaker and noise-invariant speech representations by learning discrete acoustic units with speaker-invariant clustering (Spin). R-Spin resolves Spin's issues and enhances content representations by learning to predict acoustic pieces. R-Spin offers a 12X reduction in computational resources co…

    Submitted 1 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: Accepted to NAACL 2024

  5. arXiv:2310.07654

    cs.CL cs.LG cs.SD eess.AS

    Audio-Visual Neural Syntax Acquisition

    Authors: Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass

    Abstract: We study phrase structure induction from visually-grounded speech. The core idea is to first segment the speech waveform into sequences of word segments, and subsequently induce phrase structure using the inferred segment-level continuous representations. We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without eve…

    Submitted 11 October, 2023; originally announced October 2023.

  6. arXiv:2309.14405

    cs.SD cs.AI eess.AS

    Joint Audio and Speech Understanding

    Authors: Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James Glass

    Abstract: Humans are surrounded by audio signals that include both speech and non-speech sounds. The recognition and understanding of speech and non-speech audio events, along with a profound comprehension of the relationship between them, constitute fundamental cognitive capabilities. For the first time, we build a machine learning model, called LTU-AS, that has a conceptually similar universal audio perce…

    Submitted 10 December, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted at ASRU 2023. Code, dataset, and pretrained models are at https://github.com/yuangongnd/ltu. Interactive demo at https://huggingface.co/spaces/yuangongfdu/ltu-2

  7. Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

    Authors: Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass

    Abstract: In this paper, we focus on Whisper, a recent automatic speech recognition model trained with a massive 680k hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding that while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated to non-speech sound…

    Submitted 6 July, 2023; originally announced July 2023.

    Comments: Accepted at Interspeech 2023. Code at https://github.com/yuangongnd/whisper-at

    Journal ref: Proceedings of Interspeech 2023

  8. arXiv:2306.00789

    cs.CL cs.AI eess.AS eess.SP

    Improved Cross-Lingual Transfer Learning For Automatic Speech Translation

    Authors: Sameer Khurana, Nauman Dawalatabad, Antoine Laurent, Luis Vicente, Pablo Gimeno, Victoria Mingote, James Glass

    Abstract: Research in multilingual speech-to-text translation is topical. Having a single model that supports multiple translation tasks is desirable. The goal of this work is to improve cross-lingual transfer learning in multilingual speech-to-text translation via semantic knowledge distillation. We show that by initializing the encoder of the encoder-decoder sequence-to-sequence translation model with SAM…

    Submitted 25 January, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

  9. arXiv:2305.12606

    cs.CL cs.SD eess.AS

    Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

    Authors: Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass

    Abstract: Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each. However, there are thousands of spoken languages worldwide, and adapting to new languages is an important problem. In this work, we aim to understand which model adapts better to languages unseen during pre-training. We fine-tune both mo…

    Submitted 30 May, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  10. arXiv:2305.11072

    cs.CL eess.AS

    Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering

    Authors: Heng-Jui Chang, Alexander H. Liu, James Glass

    Abstract: Self-supervised speech representation models have succeeded in various tasks, but improving them for content-related problems using unlabeled data is challenging. We propose speaker-invariant clustering (Spin), a novel self-supervised learning method that clusters speech representations and performs swapped prediction between the original and speaker-perturbed utterances. Spin disentangles speaker…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  11. arXiv:2305.10790

    eess.AS cs.SD

    Listen, Think, and Understand

    Authors: Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, James Glass

    Abstract: The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is crucial for many applications. Although significant progress has been made in this area since the development of AudioSet, most existing models are designed to map audio inputs to pre-defined, discrete sound label sets. In contrast, humans possess the ability to not only classify sounds into general cat…

    Submitted 19 February, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted at ICLR 2024. Code, dataset, and models are available at https://github.com/YuanGongND/ltu. The interactive demo is at https://huggingface.co/spaces/yuangongfdu/ltu

  12. arXiv:2211.07795

    eess.AS cs.AI cs.LG

    On Unsupervised Uncertainty-Driven Speech Pseudo-Label Filtering and Model Calibration

    Authors: Nauman Dawalatabad, Sameer Khurana, Antoine Laurent, James Glass

    Abstract: Pseudo-label (PL) filtering forms a crucial part of Self-Training (ST) methods for unsupervised domain adaptation. Dropout-based Uncertainty-driven Self-Training (DUST) proceeds by first training a teacher model on source domain labeled data. Then, the teacher model is used to provide PLs for the unlabeled target domain data. Finally, we train a student on augmented labeled and pseudo-labeled data…

    Submitted 14 November, 2022; originally announced November 2022.

  13. arXiv:2210.07839

    cs.MM cs.CV cs.SD eess.AS

    Contrastive Audio-Visual Masked Autoencoder

    Authors: Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

    Abstract: In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments…

    Submitted 11 April, 2023; v1 submitted 2 October, 2022; originally announced October 2022.

    Comments: Accepted at ICLR 2023 as a notable top 25% paper. Code and pretrained models are at https://github.com/yuangongnd/cav-mae

  14. arXiv:2208.00061

    cs.CV cs.MM cs.SD eess.AS

    UAVM: Towards Unifying Audio and Visual Models

    Authors: Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass

    Abstract: Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of UAVM that the modality-independent counterparts do…

    Submitted 15 February, 2023; v1 submitted 29 July, 2022; originally announced August 2022.

    Comments: Published in Signal Processing Letters. Code at https://github.com/YuanGongND/uavm

    Journal ref: IEEE Signal Processing Letters, vol. 29, pp. 2437-2441, 2022

  15. arXiv:2205.08180

    cs.CL cs.LG cs.SD eess.AS

    SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation

    Authors: Sameer Khurana, Antoine Laurent, James Glass

    Abstract: We propose the SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous works on speech representation learning, which learn multilingual contextual speech embedding at the resolution of an acoustic frame (10-20ms), this work focuses on learning multimodal (speech-text) multilingual speech embedding at the resolution of a s…

    Submitted 17 May, 2022; originally announced May 2022.

  16. Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition

    Authors: Yuan Gong, Jin Yu, James Glass

    Abstract: Recognizing human non-speech vocalizations is an important task and has broad applications such as automatic sound transcription and health condition monitoring. However, existing datasets have a relatively small number of vocal sound samples or noisy labels. As a consequence, state-of-the-art audio event classification models may not perform well in detecting human vocal sounds. To support resear…

    Submitted 17 June, 2022; v1 submitted 6 May, 2022; originally announced May 2022.

    Comments: Accepted at ICASSP 2022. Dataset and code at https://github.com/YuanGongND/vocalsound. Interactive Colab demo at https://colab.research.google.com/github/YuanGongND/vocalsound/blob/main/colab/VocalSound.ipynb

  17. Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment

    Authors: Yuan Gong, Ziyi Chen, Iek-Heng Chu, Peng Chang, James Glass

    Abstract: Automatic pronunciation assessment is an important technology to help self-directed language learners. While pronunciation quality has multiple aspects including accuracy, fluency, completeness, and prosody, previous efforts typically only model one aspect (e.g., accuracy) at one granularity (e.g., at the phoneme-level). In this work, we explore modeling multi-aspect pronunciation assessment at mu…

    Submitted 6 May, 2022; originally announced May 2022.

    Comments: Accepted at ICASSP 2022. Code at https://github.com/YuanGongND/gopt. Interactive Colab demo at https://colab.research.google.com/github/YuanGongND/gopt/blob/master/colab/GOPT_GPU.ipynb

  18. arXiv:2204.02524

    cs.SD cs.CL eess.AS

    Simple and Effective Unsupervised Speech Synthesis

    Authors: Alexander H. Liu, Cheng-I Jeff Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James Glass

    Abstract: We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe. The framework leverages recent work in unsupervised speech recognition as well as existing neural-based speech synthesis. Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus. Experiments demonstra…

    Submitted 20 April, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: preprint, equal contribution from first two authors

  19. arXiv:2203.06760

    cs.SD cs.AI eess.AS

    CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification

    Authors: Yuan Gong, Sameer Khurana, Andrew Rouditchenko, James Glass

    Abstract: Audio classification is an active research area with a wide range of applications. Over the past decade, convolutional neural networks (CNNs) have been the de-facto standard building block for end-to-end audio classification models. Recently, neural networks based solely on self-attention mechanisms such as the Audio Spectrogram Transformer (AST) have been shown to outperform CNNs. In this paper,…

    Submitted 13 March, 2022; originally announced March 2022.

  20. arXiv:2112.04446

    cs.CV cs.CL cs.SD eess.AS

    Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

    Authors: Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

    Abstract: Multi-modal learning from video data has seen increased attention recently as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text,…

    Submitted 18 August, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

    Comments: CVPR2022. The final published version of the proceedings will be available on IEEE Xplore

  21. arXiv:2111.04823

    cs.CL cs.CV cs.MM cs.SD eess.AS eess.IV

    Cascaded Multilingual Audio-Visual Learning from Videos

    Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass

    Abstract: In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that levera…

    Submitted 8 November, 2021; originally announced November 2021.

    Comments: Presented at Interspeech 2021. This version contains updated results using the YouCook-Japanese dataset

  22. arXiv:2110.09784

    cs.SD cs.AI eess.AS

    SSAST: Self-Supervised Audio Spectrogram Transformer

    Authors: Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass

    Abstract: Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology ca…

    Submitted 10 February, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

    Comments: Accepted at AAAI2022. Code at https://github.com/YuanGongND/ssast

  23. arXiv:2110.07575

    cs.CL cs.CV cs.MM eess.AS

    Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

    Authors: Ian Palmer, Andrew Rouditchenko, Andrei Barbu, Boris Katz, James Glass

    Abstract: Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate how effectively models will pe…

    Submitted 14 October, 2021; originally announced October 2021.

    Comments: Presented at Interspeech 2021. This version contains additional experiments on the Spoken ObjectNet test set

  24. Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0

    Authors: Sameer Khurana, Antoine Laurent, James Glass

    Abstract: We propose a simple and effective cross-lingual transfer learning method to adapt monolingual wav2vec-2.0 models for Automatic Speech Recognition (ASR) in resource-scarce languages. We show that a monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages. We improve its performance further via several iterations of Dropout Uncertainty-Driven Self-Training (DUST) by using a modera…

    Submitted 7 October, 2021; originally announced October 2021.

  25. arXiv:2110.01147

    cs.SD cs.CL eess.AS

    On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis

    Authors: Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass

    Abstract: Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explored several…

    Submitted 27 October, 2021; v1 submitted 3 October, 2021; originally announced October 2021.

  26. arXiv:2106.05933

    cs.CL cs.LG cs.SD eess.AS

    PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition

    Authors: Cheng-I Jeff Lai, Yang Zhang, Alexander H. Liu, Shiyu Chang, Yi-Lun Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer Khurana, David Cox, James Glass

    Abstract: Self-supervised speech representation learning (speech SSL) has demonstrated the benefit of scale in learning rich representations for Automatic Speech Recognition (ASR) with limited paired data, such as wav2vec 2.0. We investigate the existence of sparse subnetworks in pre-trained speech SSL models that achieve even better low-resource ASR results. However, directly applying widely adopted prunin…

    Submitted 26 October, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

  27. arXiv:2105.04489

    cs.CV cs.CL cs.LG cs.SD eess.AS

    Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

    Authors: Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, Aude Oliva

    Abstract: When people observe events, they are able to abstract key information and build concise summaries of what is happening. These summaries include contextual and semantic information describing the important high-level details (what, where, who and how) of the observed event and exclude background information that is deemed unimportant to the observer. With this in mind, the descriptions people gener…

    Submitted 10 May, 2021; originally announced May 2021.

    Comments: To appear at CVPR 2021

  28. arXiv:2102.01243

    cs.SD cs.LG eess.AS

    PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation

    Authors: Yuan Gong, Yu-An Chung, James Glass

    Abstract: Audio tagging is an active research area and has a wide range of applications. Since the release of AudioSet, great progress has been made in advancing model performance, which mostly comes from the development of novel model architectures and attention modules. However, we find that appropriate training techniques are equally important for building audio tagging models with AudioSet, but have not…

    Submitted 17 November, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

    Comments: Published in IEEE/ACM Transactions on Audio Speech and Language Processing. Code at https://github.com/YuanGongND/psla

    Journal ref: in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3292-3306, 2021

  29. arXiv:2012.15454

    cs.CL cs.AI cs.CV cs.LG eess.AS

    Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

    Authors: Wei-Ning Hsu, David Harwath, Christopher Song, James Glass

    Abstract: In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised vi…

    Submitted 31 December, 2020; originally announced December 2020.

  30. arXiv:2010.11481

    eess.AS cs.CL cs.LG cs.SD

    Similarity Analysis of Self-Supervised Speech Representations

    Authors: Yu-An Chung, Yonatan Belinkov, James Glass

    Abstract: Self-supervised speech representation learning has recently been a prosperous research topic. Many algorithms have been proposed for learning useful representations from large-scale unlabeled data, and their applications to a wide range of speech tasks have also been investigated. However, there has been little research focusing on understanding the properties of existing approaches. In this work,…

    Submitted 2 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP 2021. Supplementary materials available at https://github.com/iamyuanchung/ICASSP21-Similarity-Supplementary

  31. arXiv:2006.09199

    cs.CV cs.CL cs.MM cs.SD eess.AS

    AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

    Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass

    Abstract: Current methods for learning visually grounded language from videos often rely on text annotation, such as human generated captions or machine generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. To circumvent the nee…

    Submitted 29 June, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

    Comments: A version of this work has been accepted to Interspeech 2021

  32. arXiv:2006.02814

    eess.AS cs.CL cs.LG cs.SD

    CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning

    Authors: Sameer Khurana, Antoine Laurent, James Glass

    Abstract: More than half of the 7,000 languages in the world are in imminent danger of going extinct. Traditional methods of documenting language proceed by collecting audio data followed by manual annotation by trained linguists at different levels of granularity. This time-consuming and painstaking process could benefit from machine learning. Many endangered languages do not have any orthographic form but…

    Submitted 5 August, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

  33. arXiv:2006.02547

    eess.AS cs.CL cs.LG cs.SD

    A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning

    Authors: Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer, James Glass

    Abstract: Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. LVMs admit an intuitive probabilistic interpretation where the latent structure shapes the information extracted from the signal. Even though LVMs have recently seen a renewed interest due to the introduction of Variational Autoencoders (VAEs…

    Submitted 8 September, 2020; v1 submitted 3 June, 2020; originally announced June 2020.

    Comments: Proceedings of Interspeech, 2020

  34. arXiv:2005.08392

    eess.AS cs.CL cs.LG cs.SD

    Vector-Quantized Autoregressive Predictive Coding

    Authors: Yu-An Chung, Hao Tang, James Glass

    Abstract: Autoregressive Predictive Coding (APC), as a self-supervised objective, has enjoyed success in learning representations from large amounts of unlabeled data, and the learned representations are rich for many downstream tasks. However, the connection between low self-supervised loss and strong performance in downstream tasks remains unclear. In this work, we propose Vector-Quantized Autoregressive…

    Submitted 17 May, 2020; originally announced May 2020.

  35. arXiv:2004.05274

    eess.AS cs.CL cs.LG cs.SD

    Improved Speech Representations with Multi-Target Autoregressive Predictive Coding

    Authors: Yu-An Chung, James Glass

    Abstract: Training objectives based on predictive coding have recently been shown to be very effective at learning meaningful representations from unlabeled speech. One example is Autoregressive Predictive Coding (Chung et al., 2019), which trains an autoregressive RNN to generate an unseen future frame given a context such as recent past frames. The basic hypothesis of these approaches is that hidden state…

    Submitted 10 April, 2020; originally announced April 2020.

    Comments: Accepted to ACL 2020

  36. arXiv:2002.01440

    eess.AS cs.SD eess.SP

    Audio-Visual Calibration with Polynomial Regression for 2-D Projection Using SVD-PHAT

    Authors: Francois Grondin, Hao Tang, James Glass

    Abstract: This paper proposes a straightforward 2-D method to spatially calibrate the visual field of a camera with the auditory field of an array microphone by generating and overlaying an acoustic image over an optical image. Using a low-cost microphone array and an off-the-shelf camera, we show that polynomial regression can deal efficiently with non-linear camera distortion, and that a recently proposed…

    Submitted 12 February, 2020; v1 submitted 4 February, 2020; originally announced February 2020.

  37. arXiv:1911.09602

    cs.CL cs.LG cs.SD eess.AS

    Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

    Authors: David Harwath, Wei-Ning Hsu, James Glass

    Abstract: In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. What differentiates this paper from prior work on speech unit learning is the choice of training objective. Rather…

    Submitted 14 February, 2020; v1 submitted 21 November, 2019; originally announced November 2019.

    Comments: Camera-ready version for ICLR

  38. arXiv:1910.12607

    eess.AS cs.CL cs.LG cs.SD

    Generative Pre-Training for Speech with Autoregressive Predictive Coding

    Authors: Yu-An Chung, James Glass

    Abstract: Learning meaningful and general representations from unannotated speech that are applicable to a wide range of tasks remains challenging. In this paper we propose to use autoregressive predictive coding (APC), a recently proposed self-supervised objective, as a generative pre-training approach for learning meaningful, non-specific, and transferable speech representations. We pre-train APC on large…

    Submitted 26 January, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

    Comments: Accepted to ICASSP 2020. Code and pre-trained models are available at https://github.com/iamyuanchung/Autoregressive-Predictive-Coding

  39. arXiv:1910.10049

    eess.AS cs.SD

    Sound Event Localization and Detection Using CRNN on Pairs of Microphones

    Authors: Francois Grondin, James Glass, Iwona Sobieraj, Mark D. Plumbley

    Abstract: This paper proposes sound event localization and detection methods from multichannel recording. The proposed system is based on two Convolutional Recurrent Neural Networks (CRNNs) to perform sound event detection (SED) and time difference of arrival (TDOA) estimation on each pair of microphones in a microphone array. In this paper, the system is evaluated with a four-microphone array, and thus com…

    Submitted 22 October, 2019; originally announced October 2019.

  40. arXiv:1907.12621

    eess.AS cs.SD

    Fast and Robust 3-D Sound Source Localization with DSVD-PHAT

    Authors: Francois Grondin, James Glass

    Abstract: This paper introduces a variant of the Singular Value Decomposition with Phase Transform (SVD-PHAT), named Difference SVD-PHAT (DSVD-PHAT), to achieve robust Sound Source Localization (SSL) in noisy conditions. Experiments are performed on a Baxter robot with a four-microphone planar array mounted on its head. Results show that this method offers similar robustness to noise as the state-of-the-art…

    Submitted 29 July, 2019; originally announced July 2019.

  41. arXiv:1907.04355  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Transfer Learning from Audio-Visual Grounding to Speech Recognition

    Authors: Wei-Ning Hsu, David Harwath, James Glass

    Abstract: Transfer learning aims to reduce the amount of data required to excel at a new task by re-using the knowledge acquired from learning other related tasks. This paper proposes a novel transfer learning scenario, which distills robust phonetic features from grounding models that are trained to tell whether a pair of image and speech are semantically correlated, without using any textual transcripts.… ▽ More

    Submitted 9 July, 2019; originally announced July 2019.

    Comments: Accepted to Interspeech 2019. 4 pages, 2 figures

  42. arXiv:1907.04224  [pdf, other]

    cs.CL cs.SD eess.AS

    Analyzing Phonetic and Graphemic Representations in End-to-End Automatic Speech Recognition

    Authors: Yonatan Belinkov, Ahmed Ali, James Glass

    Abstract: End-to-end neural network systems for automatic speech recognition (ASR) are trained from acoustic features to text transcriptions. In contrast to modular ASR systems, which contain separately-trained components for acoustic modeling, pronunciation lexicon, and language modeling, the end-to-end paradigm is both conceptually simpler and has the potential benefit of training the entire system on the… ▽ More

    Submitted 19 April, 2020; v1 submitted 9 July, 2019; originally announced July 2019.

    Comments: Corrected dataset statistics

    ACM Class: I.2.7

  43. arXiv:1906.11913  [pdf, ps, other]

    eess.AS eess.SP

    Multiple Sound Source Localization with SVD-PHAT

    Authors: Francois Grondin, James Glass

    Abstract: This paper introduces a modification of phase transform on singular value decomposition (SVD-PHAT) to localize multiple sound sources. This work aims to improve localization accuracy and keeps the algorithm complexity low for real-time applications. This method relies on multiple scans of the search space, with projection of each low-dimensional observation onto orthogonal subspaces. We show that… ▽ More

    Submitted 27 June, 2019; originally announced June 2019.

  44. arXiv:1906.07307  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

    Authors: Wei Fang, Yu-An Chung, James Glass

    Abstract: Modern text-to-speech (TTS) systems are able to generate audio that sounds almost as natural as human speech. However, the bar of developing high-quality TTS systems remains high since a sizable set of studio-quality <text, audio> pairs is usually required. Compared to commercial data used to develop state-of-the-art systems, publicly available data are usually worse in terms of both quality and s… ▽ More

    Submitted 17 June, 2019; originally announced June 2019.

  45. arXiv:1905.04554  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification

    Authors: Achintya kr. Sarkar, Zheng-Hua Tan, Hao Tang, Suwon Shon, James Glass

    Abstract: There are a number of studies about extraction of bottleneck (BN) features from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases and triphone states for improving the performance of text-dependent speaker verification (TD-SV). However, a moderate success has been achieved. A recent study [1] presented a time contrastive learning (TCL) concept to explore the non-stationarit… ▽ More

    Submitted 11 May, 2019; originally announced May 2019.

    Comments: Copyright (c) 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019

  46. arXiv:1904.04240  [pdf, other]

    eess.AS cs.SD

    MCE 2018: The 1st Multi-target Speaker Detection and Identification Challenge Evaluation

    Authors: Suwon Shon, Najim Dehak, Douglas Reynolds, James Glass

    Abstract: The Multi-target Challenge aims to assess how well current speech technology is able to determine whether or not a recorded utterance was spoken by one of a large number of blacklisted speakers. It is a form of multi-target speaker detection based on real-world telephone conversations. Data recordings are generated from call center customer-agent conversations. The task is to measure how accuratel… ▽ More

    Submitted 7 April, 2019; originally announced April 2019.

    Comments: http://mce.csail.mit.edu . arXiv admin note: text overlap with arXiv:1807.06663

  47. arXiv:1904.03601  [pdf, other]

    eess.AS cs.SD

    VoiceID Loss: Speech Enhancement for Speaker Verification

    Authors: Suwon Shon, Hao Tang, James Glass

    Abstract: In this paper, we propose VoiceID loss, a novel loss function for training a speech enhancement model to improve the robustness of speaker verification. In contrast to the commonly used loss functions for speech enhancement such as the L2 loss, the VoiceID loss is based on the feedback from a speaker verification model to generate a ratio mask. The generated ratio mask is multiplied pointwise with… ▽ More

    Submitted 5 July, 2019; v1 submitted 7 April, 2019; originally announced April 2019.

    Comments: Interspeech 2019; demo link: https://people.csail.mit.edu/swshon/supplement/voiceid-loss

  48. arXiv:1904.03240  [pdf, ps, other]

    cs.CL cs.LG cs.SD eess.AS

    An Unsupervised Autoregressive Model for Speech Representation Learning

    Authors: Yu-An Chung, Wei-Ning Hsu, Hao Tang, James Glass

    Abstract: This paper proposes a novel unsupervised autoregressive neural model for learning generic speech representations. In contrast to other speech representation learning methods that aim to remove noise or speaker variabilities, ours is designed to preserve information for a wide range of downstream tasks. In addition, the proposed model does not require any phonetic or word boundary labels, allowing… ▽ More

    Submitted 18 June, 2019; v1 submitted 5 April, 2019; originally announced April 2019.

    Comments: Accepted to Interspeech 2019. Code available at: https://github.com/iamyuanchung/Autoregressive-Predictive-Coding

  49. arXiv:1902.08213  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Towards Visually Grounded Sub-Word Speech Unit Discovery

    Authors: David Harwath, James Glass

    Abstract: In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes. We show how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging t… ▽ More

    Submitted 21 February, 2019; originally announced February 2019.

    Comments: Accepted to ICASSP 2019

  50. arXiv:1812.01501  [pdf, other]

    eess.AS cs.LG cs.SD

    Domain Attentive Fusion for End-to-end Dialect Identification with Unknown Target Domain

    Authors: Suwon Shon, Ahmed Ali, James Glass

    Abstract: End-to-end deep learning language or dialect identification systems operate on the spectrogram or other acoustic feature and directly generate identification scores for each class. An important issue for end-to-end systems is to have some knowledge of the application domain, because the system can be vulnerable to use cases that were not seen in the training phase; such a scenario is often referre… ▽ More

    Submitted 6 May, 2019; v1 submitted 4 December, 2018; originally announced December 2018.

    Comments: ICASSP 2019, revised typos