Showing 1–50 of 157 results for author: Glass, J

  1. arXiv:2406.18625

    cs.SD cs.AI eess.AS

    Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer

    Authors: Liming Wang, Yuan Gong, Nauman Dawalatabad, Marco Vilela, Katerina Placek, Brian Tracey, Yishu Gong, Alan Premasiri, Fernando Vieira, James Glass

    Abstract: Automatic prediction of amyotrophic lateral sclerosis (ALS) disease progression provides a more efficient and objective alternative to manual approaches. We propose ALS longitudinal speech transformer (ALST), a neural network-based automatic predictor of ALS disease progression from longitudinal speech recordings of ALS patients. By taking advantage of high-quality pretrained speech features and…

    Submitted 26 June, 2024; originally announced June 2024.

  2. arXiv:2406.16008

    cs.CL cs.AI cs.LG

    Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization

    Authors: Cheng-Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long T. Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, Tomas Pfister

    Abstract: Large language models (LLMs), even when specifically trained to process long input contexts, struggle to capture relevant information located in the middle of their input. This phenomenon has been known as the lost-in-the-middle problem. In this work, we make three contributions. First, we set out to understand the factors that cause this phenomenon. In doing so, we establish a connection between…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: ACL Findings 2024

  3. arXiv:2406.12034

    cs.CL cs.LG

    Self-MoE: Towards Compositional Large Language Models with Self-Specialized Experts

    Authors: Junmo Kang, Leonid Karlinsky, Hongyin Luo, Zhen Wang, Jacob Hansen, James Glass, David Cox, Rameswar Panda, Rogerio Feris, Alan Ritter

    Abstract: We present Self-MoE, an approach that transforms a monolithic LLM into a compositional, modular system of self-specialized experts, named MiXSE (MiXture of Self-specialized Experts). Our approach leverages self-specialization, which constructs expert modules using self-generated synthetic data, each equipped with a shared base LLM and incorporating self-optimized routing. This allows for dynamic a…

    Submitted 17 June, 2024; originally announced June 2024.

  4. arXiv:2406.10991

    cs.CL

    Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers

    Authors: Tianhua Zhang, Kun Li, Hongyin Luo, Xixin Wu, James Glass, Helen Meng

    Abstract: Query rewriting is a crucial technique for passage retrieval in open-domain conversational question answering (CQA). It decontextualizes conversational queries into self-contained questions suitable for off-the-shelf retrievers. Existing methods attempt to incorporate retriever's preference during the training of rewriting models. However, these approaches typically rely on extensive annotations su…

    Submitted 16 June, 2024; originally announced June 2024.

  5. arXiv:2406.10082

    eess.AS cs.CV cs.SD

    Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

    Authors: Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass

    Abstract: Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, speech models such as Whisper are trained with hundreds of thousands of hours of data, and thus learn a better speech-to-text decoder. The huge training data differe…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024. Code https://github.com/roudimit/whisper-flamingo

  6. arXiv:2405.17402

    cs.CL

    THREAD: Thinking Deeper with Recursive Spawning

    Authors: Philip Schroeder, Nathaniel Morgan, Hongyin Luo, James Glass

    Abstract: Large language models (LLMs) have shown impressive capabilities across diverse settings, but still struggle as the length and complexity of the context increase. To address this challenge, we propose Thinking Recursively and Dynamically (ThReaD). THREAD frames model generation as a thread of execution that, based on the context, can run to completion or dynamically spawn new threads. By spawning,…

    Submitted 27 May, 2024; originally announced May 2024.

  7. arXiv:2402.19464

    cs.LG cs.AI cs.CL

    Curiosity-driven Red-teaming for Large Language Models

    Authors: Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, Pulkit Agrawal

    Abstract: Large language models (LLMs) hold great potential for many natural language applications but risk generating incorrect or toxic content. To probe when an LLM generates unwanted content, the current paradigm is to recruit a red team of human testers to design input prompts (i.e., test cases) that elicit undesirable responses from LLMs. However, relying solely on human testers is expensive…

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: Published at ICLR 2024

  8. arXiv:2401.08833

    eess.AS cs.CL cs.SD

    Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective

    Authors: Alexander H. Liu, Sung-Lin Yeh, James Glass

    Abstract: Existing studies on self-supervised speech representation learning have focused on developing new training methods and applying pre-trained models for different applications. However, the quality of these models is often measured by the performance of different downstream tasks. How well the representations access the information of interest is less studied. In this work, we take a closer look int…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: ICASSP 2024

  9. arXiv:2311.09117

    cs.CL cs.SD eess.AS

    R-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces

    Authors: Heng-Jui Chang, James Glass

    Abstract: This paper introduces Robust Spin (R-Spin), a data-efficient domain-specific self-supervision method for speaker and noise-invariant speech representations by learning discrete acoustic units with speaker-invariant clustering (Spin). R-Spin resolves Spin's issues and enhances content representations by learning to predict acoustic pieces. R-Spin offers a 12X reduction in computational resources co…

    Submitted 1 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: Accepted to NAACL 2024

  10. arXiv:2310.07654

    cs.CL cs.LG cs.SD eess.AS

    Audio-Visual Neural Syntax Acquisition

    Authors: Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass

    Abstract: We study phrase structure induction from visually-grounded speech. The core idea is to first segment the speech waveform into sequences of word segments, and subsequently induce phrase structure using the inferred segment-level continuous representations. We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without eve…

    Submitted 11 October, 2023; originally announced October 2023.

  11. arXiv:2310.00160

    cs.CL cs.AI

    Self-Specialization: Uncovering Latent Expertise within Large Language Models

    Authors: Junmo Kang, Hongyin Luo, Yada Zhu, Jacob Hansen, James Glass, David Cox, Alan Ritter, Rogerio Feris, Leonid Karlinsky

    Abstract: Recent works have demonstrated the effectiveness of self-alignment in which a large language model is aligned to follow general instructions using instructional data generated from the model itself starting from a handful of human-written seeds. Instead of general alignment, in this work, we focus on self-alignment for expert domain specialization (e.g., biomedicine, finance). As a preliminary, we…

    Submitted 5 June, 2024; v1 submitted 29 September, 2023; originally announced October 2023.

    Comments: ACL 2024 (Findings; Long Paper)

  12. arXiv:2309.14405

    cs.SD cs.AI eess.AS

    Joint Audio and Speech Understanding

    Authors: Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James Glass

    Abstract: Humans are surrounded by audio signals that include both speech and non-speech sounds. The recognition and understanding of speech and non-speech audio events, along with a profound comprehension of the relationship between them, constitute fundamental cognitive capabilities. For the first time, we build a machine learning model, called LTU-AS, that has a conceptually similar universal audio perce…

    Submitted 10 December, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted at ASRU 2023. Code, dataset, and pretrained models are at https://github.com/yuangongnd/ltu. Interactive demo at https://huggingface.co/spaces/yuangongfdu/ltu-2

  13. arXiv:2309.10814

    cs.CL

    Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning

    Authors: Tianhua Zhang, Jiaxin Ge, Hongyin Luo, Yung-Sung Chuang, Mingye Gao, Yuan Gong, Xixin Wu, Yoon Kim, Helen Meng, James Glass

    Abstract: How can we perform computations over natural language representations to solve tasks that require symbolic and numeric reasoning? We propose natural language embedded programs (NLEP) as a unifying framework for addressing math/symbolic reasoning, natural language understanding, and instruction following tasks. Our approach prompts a language model to generate full Python programs that define funct…

    Submitted 28 March, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: NAACL 2024

  14. arXiv:2309.03883

    cs.CL cs.AI cs.LG

    DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

    Authors: Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, Pengcheng He

    Abstract: Despite their impressive capabilities, large language models (LLMs) are prone to hallucinations, i.e., generating content that deviates from facts seen during pretraining. We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs that does not require conditioning on retrieved external knowledge nor additional fine-tuning. Our approach obtains the next-token distributi…

    Submitted 10 March, 2024; v1 submitted 7 September, 2023; originally announced September 2023.

    Comments: ICLR 2024 main conference paper. The source code is available at https://github.com/voidism/DoLa

  15. Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

    Authors: Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass

    Abstract: In this paper, we focus on Whisper, a recent automatic speech recognition model trained with a massive 680k hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding that while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated to non-speech sound…

    Submitted 6 July, 2023; originally announced July 2023.

    Comments: Accepted at Interspeech 2023. Code at https://github.com/yuangongnd/whisper-at

    Journal ref: Proceedings of Interspeech 2023

  16. arXiv:2306.05083

    cs.CL

    Revealing the Blind Spot of Sentence Encoder Evaluation by HEROS

    Authors: Cheng-Han Chiang, Yung-Sung Chuang, James Glass, Hung-yi Lee

    Abstract: Existing sentence textual similarity benchmark datasets only use a single number to summarize how similar the sentence encoder's decision is to humans'. However, it is unclear what kind of sentence pairs a sentence encoder (SE) would consider similar. Moreover, existing SE benchmarks mainly consider sentence pairs with low lexical overlap, so it is unclear how the SEs behave when two sentences hav…

    Submitted 13 June, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: ACL 2023 repl4nlp (representation learning for NLP) workshop poster paper. Dataset at https://huggingface.co/datasets/dcml0714/Heros

  17. arXiv:2306.00789

    cs.CL cs.AI eess.AS eess.SP

    Improved Cross-Lingual Transfer Learning For Automatic Speech Translation

    Authors: Sameer Khurana, Nauman Dawalatabad, Antoine Laurent, Luis Vicente, Pablo Gimeno, Victoria Mingote, James Glass

    Abstract: Research in multilingual speech-to-text translation is topical. Having a single model that supports multiple translation tasks is desirable. The goal of this work is to improve cross-lingual transfer learning in multilingual speech-to-text translation via semantic knowledge distillation. We show that by initializing the encoder of the encoder-decoder sequence-to-sequence translation model with SAM…

    Submitted 25 January, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

  18. arXiv:2305.17197

    cs.CL

    Entailment as Robust Self-Learner

    Authors: Jiaxin Ge, Hongyin Luo, Yoon Kim, James Glass

    Abstract: Entailment has been recognized as an important metric for evaluating natural language understanding (NLU) models, and recent studies have found that entailment pretraining benefits weakly supervised fine-tuning. In this work, we design a prompting strategy that formulates a number of different NLU tasks as contextual entailment. This approach improves the zero-shot adaptation of pretrained entailm…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted by ACL 2023 main conference

  19. arXiv:2305.17080

    cs.CL

    Expand, Rerank, and Retrieve: Query Reranking for Open-Domain Question Answering

    Authors: Yung-Sung Chuang, Wei Fang, Shang-Wen Li, Wen-tau Yih, James Glass

    Abstract: We propose EAR, a query Expansion And Reranking approach for improving passage retrieval, with the application to open-domain question answering. EAR first applies a query expansion model to generate a diverse set of queries, and then uses a query reranker to select the ones that could lead to better retrieval results. Motivated by the observation that the best query expansion often is not picked…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: ACL 2023 long paper (Findings)

  20. arXiv:2305.15225

    cs.CL

    SAIL: Search-Augmented Instruction Learning

    Authors: Hongyin Luo, Yung-Sung Chuang, Yuan Gong, Tianhua Zhang, Yoon Kim, Xixin Wu, Danny Fox, Helen Meng, James Glass

    Abstract: Large language models (LLMs) have been significantly improved by instruction fine-tuning, but still lack transparency and the ability to utilize up-to-date knowledge and information. In this work, we propose search-augmented instruction learning (SAIL), which grounds the language generation and instruction following abilities on complex search results generated by in-house and external search engi…

    Submitted 25 June, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

  21. arXiv:2305.12606

    cs.CL cs.SD eess.AS

    Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

    Authors: Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass

    Abstract: Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each. However, there are thousands of spoken languages worldwide, and adapting to new languages is an important problem. In this work, we aim to understand which model adapts better to languages unseen during pre-training. We fine-tune both mo…

    Submitted 30 May, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  22. arXiv:2305.11072

    cs.CL eess.AS

    Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering

    Authors: Heng-Jui Chang, Alexander H. Liu, James Glass

    Abstract: Self-supervised speech representation models have succeeded in various tasks, but improving them for content-related problems using unlabeled data is challenging. We propose speaker-invariant clustering (Spin), a novel self-supervised learning method that clusters speech representations and performs swapped prediction between the original and speaker-perturbed utterances. Spin disentangles speaker…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  23. arXiv:2305.10790

    eess.AS cs.SD

    Listen, Think, and Understand

    Authors: Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, James Glass

    Abstract: The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is crucial for many applications. Although significant progress has been made in this area since the development of AudioSet, most existing models are designed to map audio inputs to pre-defined, discrete sound label sets. In contrast, humans possess the ability to not only classify sounds into general cat…

    Submitted 19 February, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted at ICLR 2024. Code, dataset, and models are available at https://github.com/YuanGongND/ltu. The interactive demo is at https://huggingface.co/spaces/yuangongfdu/ltu

  24. arXiv:2305.10005

    cs.CL

    DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

    Authors: Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass

    Abstract: In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR) which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with…

    Submitted 16 January, 2024; v1 submitted 17 May, 2023; originally announced May 2023.

  25. arXiv:2304.03728

    cs.CL

    Interpretable Unified Language Checking

    Authors: Tianhua Zhang, Hongyin Luo, Yung-Sung Chuang, Wei Fang, Luc Gaitskell, Thomas Hartvigsen, Xixin Wu, Danny Fox, Helen Meng, James Glass

    Abstract: Despite recent concerns about undesirable behaviors generated by large language models (LLMs), including non-factual, biased, and hateful language, we find LLMs are inherent multi-task language checkers based on their latent representations of natural and social knowledge. We present an interpretable, unified, language checking (UniLC) method for both human and machine-generated language that aims…

    Submitted 7 April, 2023; originally announced April 2023.

    Comments: 10 + 5 pages

  26. arXiv:2303.16990

    cs.CV

    What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

    Authors: Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne

    Abstract: Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained on loose video an…

    Submitted 28 May, 2024; v1 submitted 29 March, 2023; originally announced March 2023.

    Comments: To be presented at CVPR 2024. Project page: https://brian7685.github.io/STG/

  27. arXiv:2303.05670

    cs.CL cs.AI cs.CY

    Logic Against Bias: Textual Entailment Mitigates Stereotypical Sentence Reasoning

    Authors: Hongyin Luo, James Glass

    Abstract: Due to their similarity-based learning objectives, pretrained sentence encoders often internalize stereotypical assumptions that reflect the social biases that exist within their training corpora. In this paper, we describe several kinds of stereotypes concerning different communities that are present in popular sentence representation models, including pretrained next sentence prediction and cont…

    Submitted 9 March, 2023; originally announced March 2023.

    Comments: Accepted by EACL 2023

  28. arXiv:2212.10020

    cs.CL

    On the Blind Spots of Model-Based Evaluation Metrics for Text Generation

    Authors: Tianxing He, Jingyu Zhang, Tianle Wang, Sachin Kumar, Kyunghyun Cho, James Glass, Yulia Tsvetkov

    Abstract: In this work, we explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics: stress tests with synthetic data. Basically, we design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores. We examine a range of recently proposed evaluation metrics based on pretrained language model…

    Submitted 18 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: ACL 2023

  29. arXiv:2211.07795

    eess.AS cs.AI cs.LG

    On Unsupervised Uncertainty-Driven Speech Pseudo-Label Filtering and Model Calibration

    Authors: Nauman Dawalatabad, Sameer Khurana, Antoine Laurent, James Glass

    Abstract: Pseudo-label (PL) filtering forms a crucial part of Self-Training (ST) methods for unsupervised domain adaptation. Dropout-based Uncertainty-driven Self-Training (DUST) proceeds by first training a teacher model on source domain labeled data. Then, the teacher model is used to provide PLs for the unlabeled target domain data. Finally, we train a student on augmented labeled and pseudo-labeled data…

    Submitted 14 November, 2022; originally announced November 2022.

  30. arXiv:2210.07839

    cs.MM cs.CV cs.SD eess.AS

    Contrastive Audio-Visual Masked Autoencoder

    Authors: Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

    Abstract: In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments…

    Submitted 11 April, 2023; v1 submitted 2 October, 2022; originally announced October 2022.

    Comments: Accepted at ICLR 2023 as a notable top 25% paper. Code and pretrained models are at https://github.com/yuangongnd/cav-mae

  31. arXiv:2210.07431

    cs.CL cs.LG

    PCFG-based Natural Language Interface Improves Generalization for Controlled Text Generation

    Authors: Jingyu Zhang, James Glass, Tianxing He

    Abstract: Existing work on controlled text generation (CTG) assumes a control interface of categorical attributes. In this work, we propose a natural language (NL) interface, where we craft a PCFG to embed the control attributes into natural language commands, and propose variants of existing CTG models that take commands as input. In our experiments, we design tailored setups to test model's generalization…

    Submitted 13 October, 2022; originally announced October 2022.

  32. arXiv:2210.03625

    cs.CL cs.CV cs.MM

    C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

    Authors: Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass

    Abstract: Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for other languages lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in differen…

    Submitted 9 May, 2023; v1 submitted 7 October, 2022; originally announced October 2022.

    Comments: Accepted at ICASSP 2023. The code, models, and dataset are available at https://github.com/roudimit/c2kd

  33. arXiv:2208.00061

    cs.CV cs.MM cs.SD eess.AS

    UAVM: Towards Unifying Audio and Visual Models

    Authors: Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass

    Abstract: Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of UAVM that the modality-independent counterparts do…

    Submitted 15 February, 2023; v1 submitted 29 July, 2022; originally announced August 2022.

    Comments: Published in Signal Processing Letters. Code at https://github.com/YuanGongND/uavm

    Journal ref: IEEE Signal Processing Letters, vol. 29, pp. 2437-2441, 2022

  34. arXiv:2207.07033

    cs.AI cs.CY

    Developing a Series of AI Challenges for the United States Department of the Air Force

    Authors: Vijay Gadepally, Gregory Angelides, Andrei Barbu, Andrew Bowne, Laura J. Brattain, Tamara Broderick, Armando Cabrera, Glenn Carl, Ronisha Carter, Miriam Cha, Emilie Cowen, Jesse Cummings, Bill Freeman, James Glass, Sam Goldberg, Mark Hamilton, Thomas Heldt, Kuan Wei Huang, Phillip Isola, Boris Katz, Jamie Koerner, Yen-Chen Lin, David Mayo, Kyle McAlpin, Taylor Perron, et al. (17 additional authors not shown)

    Abstract: Through a series of federal initiatives and orders, the U.S. Government has been making a concerted effort to ensure American leadership in AI. These broad strategy documents have influenced organizations such as the United States Department of the Air Force (DAF). The DAF-MIT AI Accelerator is an initiative between the DAF and MIT to bridge the gap between AI researchers and DAF mission requireme…

    Submitted 14 July, 2022; originally announced July 2022.

  35. arXiv:2205.08180

    cs.CL cs.LG cs.SD eess.AS

    SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation

    Authors: Sameer Khurana, Antoine Laurent, James Glass

    Abstract: We propose the SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous works on speech representation learning, which learn multilingual contextual speech embedding at the resolution of an acoustic frame (10-20ms), this work focuses on learning multimodal (speech-text) multilingual speech embedding at the resolution of a s…

    Submitted 17 May, 2022; originally announced May 2022.

  36. Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition

    Authors: Yuan Gong, Jin Yu, James Glass

    Abstract: Recognizing human non-speech vocalizations is an important task and has broad applications such as automatic sound transcription and health condition monitoring. However, existing datasets have a relatively small number of vocal sound samples or noisy labels. As a consequence, state-of-the-art audio event classification models may not perform well in detecting human vocal sounds. To support resear…

    Submitted 17 June, 2022; v1 submitted 6 May, 2022; originally announced May 2022.

    Comments: Accepted at ICASSP 2022. Dataset and code at https://github.com/YuanGongND/vocalsound Interactive Colab demo at https://colab.research.google.com/github/YuanGongND/vocalsound/blob/main/colab/VocalSound.ipynb

  37. Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment

    Authors: Yuan Gong, Ziyi Chen, Iek-Heng Chu, Peng Chang, James Glass

    Abstract: Automatic pronunciation assessment is an important technology to help self-directed language learners. While pronunciation quality has multiple aspects including accuracy, fluency, completeness, and prosody, previous efforts typically only model one aspect (e.g., accuracy) at one granularity (e.g., at the phoneme-level). In this work, we explore modeling multi-aspect pronunciation assessment at mu…

    Submitted 6 May, 2022; originally announced May 2022.

    Comments: Accepted at ICASSP 2022. Code at https://github.com/YuanGongND/gopt Interactive Colab demo at https://colab.research.google.com/github/YuanGongND/gopt/blob/master/colab/GOPT_GPU.ipynb

  38. arXiv:2204.10298

    cs.CL

    DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings

    Authors: Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljačić, Shang-Wen Li, Wen-tau Yih, Yoon Kim, James Glass

    Abstract: We propose DiffCSE, an unsupervised contrastive learning framework for learning sentence embeddings. DiffCSE learns sentence embeddings that are sensitive to the difference between the original sentence and an edited sentence, where the edited sentence is obtained by stochastically masking out the original sentence and then sampling from a masked language model. We show that DiffCSE is an instance…

    Submitted 21 April, 2022; originally announced April 2022.

    Comments: NAACL 2022 main conference (Long paper). Pretrained models and code are available at https://github.com/voidism/DiffCSE

  39. arXiv:2204.02524

    cs.SD cs.CL eess.AS

    Simple and Effective Unsupervised Speech Synthesis

    Authors: Alexander H. Liu, Cheng-I Jeff Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James Glass

    Abstract: We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe. The framework leverages recent work in unsupervised speech recognition as well as existing neural-based speech synthesis. Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus. Experiments demonstra…

    Submitted 20 April, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: preprint, equal contribution from first two authors

  40. arXiv:2203.06760

    cs.SD cs.AI eess.AS

    CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification

    Authors: Yuan Gong, Sameer Khurana, Andrew Rouditchenko, James Glass

    Abstract: Audio classification is an active research area with a wide range of applications. Over the past decade, convolutional neural networks (CNNs) have been the de-facto standard building block for end-to-end audio classification models. Recently, neural networks based solely on self-attention mechanisms such as the Audio Spectrogram Transformer (AST) have been shown to outperform CNNs. In this paper,… ▽ More

    Submitted 13 March, 2022; originally announced March 2022.

  41. arXiv:2203.01146  [pdf, other

    cs.AI

    Controlling the Focus of Pretrained Language Generation Models

    Authors: Jiabao Ji, Yoon Kim, James Glass, Tianxing He

    Abstract: The finetuning of pretrained transformer-based language generation models is typically conducted in an end-to-end manner, where the model learns to attend to relevant parts of the input by itself. However, there does not exist a mechanism to directly control the model's focus. This work aims to develop a control mechanism by which a user can select spans of context as "highlights" for the model t… ▽ More

    Submitted 2 March, 2022; originally announced March 2022.

    Journal ref: ACL Findings 2022

  42. arXiv:2112.04446  [pdf, other

    cs.CV cs.CL cs.SD eess.AS

    Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

    Authors: Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

    Abstract: Multi-modal learning from video data has seen increased attention recently as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text,… ▽ More

    Submitted 18 August, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

    Comments: CVPR2022. The final published version of the proceedings will be available on IEEE Xplore

  43. arXiv:2112.00775  [pdf, other

    cs.CV

    Routing with Self-Attention for Multimodal Capsule Networks

    Authors: Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah

    Abstract: The task of multimodal learning has seen a growing interest recently as it allows for training neural architectures based on different modalities such as vision, text, and audio. One challenge in training such models is that they need to jointly learn semantic concepts and their relationships across different input representations. Capsule networks have been shown to perform well in the context of cap… ▽ More

    Submitted 1 December, 2021; originally announced December 2021.

  44. arXiv:2111.04823  [pdf, other

    cs.CL cs.CV cs.MM cs.SD eess.AS eess.IV

    Cascaded Multilingual Audio-Visual Learning from Videos

    Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass

    Abstract: In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that levera… ▽ More

    Submitted 8 November, 2021; originally announced November 2021.

    Comments: Presented at Interspeech 2021. This version contains updated results using the YouCook-Japanese dataset

  45. arXiv:2110.09784  [pdf, other

    cs.SD cs.AI eess.AS

    SSAST: Self-Supervised Audio Spectrogram Transformer

    Authors: Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass

    Abstract: Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology ca… ▽ More

    Submitted 10 February, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

    Comments: Accepted at AAAI2022. Code at https://github.com/YuanGongND/ssast

  46. arXiv:2110.07575  [pdf, other

    cs.CL cs.CV cs.MM eess.AS

    Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

    Authors: Ian Palmer, Andrew Rouditchenko, Andrei Barbu, Boris Katz, James Glass

    Abstract: Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate how effectively models will pe… ▽ More

    Submitted 14 October, 2021; originally announced October 2021.

    Comments: Presented at Interspeech 2021. This version contains additional experiments on the Spoken ObjectNet test set

  47. Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0

    Authors: Sameer Khurana, Antoine Laurent, James Glass

    Abstract: We propose a simple and effective cross-lingual transfer learning method to adapt monolingual wav2vec-2.0 models for Automatic Speech Recognition (ASR) in resource-scarce languages. We show that a monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages. We improve its performance further via several iterations of Dropout Uncertainty-Driven Self-Training (DUST) by using a modera… ▽ More

    Submitted 7 October, 2021; originally announced October 2021.

  48. arXiv:2110.01147  [pdf, other

    cs.SD cs.CL eess.AS

    On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis

    Authors: Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass

    Abstract: Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explored several… ▽ More

    Submitted 27 October, 2021; v1 submitted 3 October, 2021; originally announced October 2021.

  49. arXiv:2109.02772  [pdf, other

    cs.AI

    An Empirical Study on Few-shot Knowledge Probing for Pretrained Language Models

    Authors: Tianxing He, Kyunghyun Cho, James Glass

    Abstract: Prompt-based knowledge probing for 1-hop relations has been used to measure how much world knowledge is stored in pretrained language models. Existing work uses considerable amounts of data to tune the prompts for better performance. In this work, we compare a variety of approaches under a few-shot knowledge probing setting, where only a small number (e.g., 10 or 20) of example triples are availab… ▽ More

    Submitted 11 September, 2021; v1 submitted 6 September, 2021; originally announced September 2021.

  50. arXiv:2108.12802  [pdf, other

    cs.CL cs.AI cs.LG

    Interpretable Propaganda Detection in News Articles

    Authors: Seunghak Yu, Giovanni Da San Martino, Mitra Mohtarami, James Glass, Preslav Nakov

    Abstract: Online users today are exposed to misleading and propagandistic news articles and media posts on a daily basis. To counter this, a number of approaches have been designed aiming to achieve healthier and safer online news and media consumption. Automatic systems are able to support humans in detecting such content; yet, a major impediment to their broad adoption is that besides being accurate, th… ▽ More

    Submitted 29 August, 2021; originally announced August 2021.

    Comments: propaganda, propaganda techniques, disinformation, misinformation, fake news, explainability, interpretability

    MSC Class: 68T50 ACM Class: F.2.2; I.2.7

    Journal ref: RANLP-2021