Showing 1–50 of 82 results for author: Livescu, K

  1. arXiv:2407.00837

    cs.CL cs.AI cs.SD eess.AS

    Towards Robust Speech Representation Learning for Thousands of Languages

    Authors: William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe

    Abstract: Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 millio…

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: 20 pages

  2. arXiv:2406.10083

    cs.CL cs.SD eess.AS

    On the Evaluation of Speech Foundation Models for Spoken Language Understanding

    Authors: Siddhant Arora, Ankita Pasad, Chung-Ming Chien, Jionghao Han, Roshan Sharma, Jee-weon Jung, Hira Dhamyal, William Chen, Suwon Shon, Hung-yi Lee, Karen Livescu, Shinji Watanabe

    Abstract: The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for th…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted at ACL Findings 2024

  3. arXiv:2406.09345

    cs.CL cs.SD eess.AS

    DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding

    Authors: Suwon Shon, Kwangyoun Kim, Yi-Te Hsu, Prashant Sridhar, Shinji Watanabe, Karen Livescu

    Abstract: The integration of pre-trained text-based large language models (LLM) with speech input has enabled instruction-following capabilities for diverse speech tasks. This integration requires the use of a speech encoder, a speech adapter, and an LLM, trained on diverse tasks. We propose the use of discrete speech units (DSU), rather than continuous-valued speech encoder outputs, that are converted to t…

    Submitted 13 June, 2024; originally announced June 2024.

  4. arXiv:2406.09282

    cs.CL cs.SD eess.AS

    On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models

    Authors: Jinchuan Tian, Yifan Peng, William Chen, Kwanghee Choi, Karen Livescu, Shinji Watanabe

    Abstract: The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models. To this end, OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways. In this study, we advance the OWSM series by introducing OWSM v3.2, which improves on prior models by investigating and addressing the i…

    Submitted 13 June, 2024; originally announced June 2024.

  5. arXiv:2406.08641

    cs.SD cs.CL eess.AS

    ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets

    Authors: Jiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, Jinchuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-yi Lee, Shinji Watanabe

    Abstract: ML-SUPERB evaluates self-supervised learning (SSL) models on the tasks of language identification and automatic speech recognition (ASR). This benchmark treats the models as feature extractors and uses a single shallow downstream model, which can be fine-tuned for a downstream task. However, real-world use cases may require different configurations. This paper presents ML-SUPERB 2.0, which is a ne…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  6. arXiv:2406.08619

    cs.CL cs.LG eess.AS

    Self-Supervised Speech Representations are More Phonetic than Semantic

    Authors: Kwanghee Choi, Ankita Pasad, Tomohiko Nakamura, Satoru Fukayama, Karen Livescu, Shinji Watanabe

    Abstract: Self-supervised speech models (S3Ms) have become an effective backbone for speech applications. Various analyses suggest that S3Ms encode linguistic properties. In this work, we seek a more fine-grained analysis of the word-level linguistic properties encoded in S3Ms. Specifically, we curate a novel dataset of near homophone (phonetically similar) and synonym (semantically similar) word pairs and…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024. Source code at https://github.com/juice500ml/phonetic_semantic_probing

  7. arXiv:2406.06907

    cs.CL cs.AI cs.CV cs.LG

    SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale

    Authors: Shester Gueuwou, Xiaodan Du, Greg Shakhnarovich, Karen Livescu

    Abstract: A persistent challenge in sign language video processing, including the task of sign language to written language translation, is how we learn representations of sign language in an effective and efficient way that can preserve the important attributes of these languages, while remaining invariant to irrelevant visual differences. Informed by the nature and linguistics of signed languages, our pro…

    Submitted 10 June, 2024; originally announced June 2024.

  8. arXiv:2402.13433

    cs.CL cs.DS

    Structured Tree Alignment for Evaluation of (Speech) Constituency Parsing

    Authors: Freda Shi, Kevin Gimpel, Karen Livescu

    Abstract: We present the structured average intersection-over-union ratio (STRUCT-IOU), a similarity metric between constituency parse trees motivated by the problem of evaluating speech parsers. STRUCT-IOU enables comparison between a constituency parse tree (over automatically recognized spoken word boundaries) with the ground-truth parse (over written words). To compute the metric, we project the ground-…

    Submitted 19 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: ACL 2024 camera-ready

  9. arXiv:2312.09895

    cs.CL cs.SD eess.AS

    Generative Context-aware Fine-tuning of Self-supervised Speech Models

    Authors: Suwon Shon, Kwangyoun Kim, Prashant Sridhar, Yi-Te Hsu, Shinji Watanabe, Karen Livescu

    Abstract: When performing tasks like automatic speech recognition or spoken language understanding for a given utterance, access to preceding text or audio provides contextual information that can improve performance. Considering the recent advances in generative large language models (LLM), we hypothesize that an LLM could generate useful context information using the preceding text. With appropriate prompts, L…

    Submitted 15 December, 2023; originally announced December 2023.

  10. arXiv:2310.08715

    cs.CL cs.AI cs.SD eess.AS

    Toward Joint Language Modeling for Speech Units and Text

    Authors: Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, Michael Auli

    Abstract: Speech and text are two major forms of human language. The research community has been focusing on mapping speech to text or vice versa for many years. However, in the field of language modeling, very little effort has been made to model them jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers to transform co…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: EMNLP findings 2023

  11. arXiv:2310.07654

    cs.CL cs.LG cs.SD eess.AS

    Audio-Visual Neural Syntax Acquisition

    Authors: Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass

    Abstract: We study phrase structure induction from visually-grounded speech. The core idea is to first segment the speech waveform into sequences of word segments, and subsequently induce phrase structure using the inferred segment-level continuous representations. We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without eve…

    Submitted 11 October, 2023; originally announced October 2023.

  12. arXiv:2310.05919

    cs.CL eess.AS

    Few-Shot Spoken Language Understanding via Joint Speech-Text Models

    Authors: Chung-Ming Chien, Mingjiamei Zhang, Ju-Chieh Chou, Karen Livescu

    Abstract: Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations by encoding speech and text in a shared space. In this paper, we leverage such shared representations to address the persistent challenge of limited data availability in spoken language understanding tasks. By employing a pre-trained speech-text model, we fin…

    Submitted 9 October, 2023; originally announced October 2023.

  13. arXiv:2310.02973

    cs.CL cs.SD eess.AS

    UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions

    Authors: Siddhant Arora, Hayato Futami, Jee-weon Jung, Yifan Peng, Roshan Sharma, Yosuke Kashiwagi, Emiru Tsunoo, Karen Livescu, Shinji Watanabe

    Abstract: Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly performs various spoken language understanding (SLU) tasks? We start by adapting a pre-trained automatic speech recognition model to additio…

    Submitted 3 April, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Accepted at NAACL 2024

  14. arXiv:2309.08030

    eess.AS cs.CL cs.SD

    AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement

    Authors: Ju-Chieh Chou, Chung-Ming Chien, Karen Livescu

    Abstract: Speech enhancement systems are typically trained using pairs of clean and noisy speech. In audio-visual speech enhancement (AVSE), there is not as much ground-truth clean data available; most audio-visual datasets are collected in real-world environments with background noise and reverberation, hampering the development of AVSE. In this work, we introduce AV2Wav, a resynthesis-based audio-visual s…

    Submitted 8 April, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: extended version for the accepted paper at ICASSP 2024

  15. arXiv:2309.02450

    cs.CV

    Self-Supervised Video Transformers for Isolated Sign Language Recognition

    Authors: Marcelo Sandoval-Castaneda, Yanhong Li, Diane Brentari, Karen Livescu, Gregory Shakhnarovich

    Abstract: This paper presents an in-depth analysis of various self-supervision methods for isolated sign language recognition (ISLR). We consider four recently introduced transformer-based approaches to self-supervised learning from videos, and four pre-training data regimes, and study all the combinations on the WLASL2000 dataset. Our findings reveal that MaskFeat achieves performance superior to pose-base…

    Submitted 1 September, 2023; originally announced September 2023.

    Comments: 14 pages. Submitted to WACV 2024

  16. arXiv:2307.00162

    cs.CL cs.LG eess.AS

    What Do Self-Supervised Speech Models Know About Words?

    Authors: Ankita Pasad, Chung-Ming Chien, Shane Settle, Karen Livescu

    Abstract: Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks. However, these empirical successes alone do not give a complete picture of what is learned during pre-training. Recent work has begun analyzing how S3Ms encode certain properties, such as phonetic and speaker information, but we still lack a pro…

    Submitted 31 January, 2024; v1 submitted 30 June, 2023; originally announced July 2023.

    Comments: Pre-MIT Press publication version

  17. arXiv:2212.10525

    cs.CL eess.AS

    SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

    Authors: Suwon Shon, Siddhant Arora, Chyi-Jiunn Lin, Ankita Pasad, Felix Wu, Roshan Sharma, Wei-Lun Wu, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

    Abstract: Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce suc…

    Submitted 15 June, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: accepted in ACL 2023 (long paper)

  18. arXiv:2212.08542

    eess.AS cs.CL

    Context-aware Fine-tuning of Self-supervised Speech Models

    Authors: Suwon Shon, Felix Wu, Kwangyoun Kim, Prashant Sridhar, Karen Livescu, Shinji Watanabe

    Abstract: Self-supervised pre-trained transformers have improved the state of the art on a variety of speech tasks. Due to the quadratic time and space complexity of self-attention, they usually operate at the level of relatively short (e.g., utterance) segments. In this paper, we study the use of context, i.e., surrounding segments, during fine-tuning and propose a new approach called context-aware fine-tu…

    Submitted 28 March, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

  19. arXiv:2211.03929

    cs.CL cs.LG cs.SD eess.AS

    Comparative layer-wise analysis of self-supervised speech models

    Authors: Ankita Pasad, Bowen Shi, Karen Livescu

    Abstract: Many self-supervised speech models, varying in their pre-training objective, input modality, and pre-training data, have been proposed in the last few years. Despite impressive successes on downstream tasks, we still have a limited understanding of the properties encoded by the models and the differences across models. In this work, we examine the intermediate representations for a variety of rece…

    Submitted 16 March, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP 2023. Code: https://github.com/ankitapasad/layerwise-analysis

  20. arXiv:2206.04615

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza , et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  21. arXiv:2205.12870

    cs.CV cs.CL

    Open-Domain Sign Language Translation Learned from Online Video

    Authors: Bowen Shi, Diane Brentari, Greg Shakhnarovich, Karen Livescu

    Abstract: Existing work on sign language translation - that is, translation from sign language videos into sentences in a written language - has focused mainly on (1) data collected in a controlled environment or (2) data in a specific domain, which limits the applicability to real-world settings. In this paper, we introduce OpenASL, a large-scale American Sign Language (ASL) - English dataset collected fro…

    Submitted 19 November, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

    Comments: EMNLP 2022

  22. arXiv:2205.10643

    cs.CL cs.SD eess.AS

    Self-Supervised Speech Representation Learning: A Review

    Authors: Abdelrahman Mohamed, Hung-yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe

    Abstract: Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply this to dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a…

    Submitted 27 October, 2022; v1 submitted 21 May, 2022; originally announced May 2022.

  23. arXiv:2203.13291

    cs.CV cs.CL

    Searching for fingerspelled content in American Sign Language

    Authors: Bowen Shi, Diane Brentari, Greg Shakhnarovich, Karen Livescu

    Abstract: Natural language processing for sign language video - including tasks like recognition, translation, and search - is crucial for making artificial intelligence technologies accessible to deaf individuals, and is gaining research interest in recent years. In this paper, we address the problem of searching for fingerspelled key-words or key phrases in raw sign language videos. This is an important t…

    Submitted 24 March, 2022; originally announced March 2022.

    Comments: ACL 2022

  24. arXiv:2112.07648

    cs.CL cs.LG cs.SD eess.AS

    On the Use of External Data for Spoken Named Entity Recognition

    Authors: Ankita Pasad, Felix Wu, Suwon Shon, Karen Livescu, Kyu J. Han

    Abstract: Spoken language understanding (SLU) tasks involve mapping from speech audio signals to semantic labels. Given the complexity of such tasks, good performance might be expected to require large labeled datasets, which are difficult to collect for each new task and domain. However, recent advances in self-supervised speech representations have made it feasible to consider learning SLU models with lim…

    Submitted 8 July, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

    Comments: Accepted at NAACL 2022. Codebase available at https://github.com/asappresearch/spoken-ner

  25. arXiv:2111.10367

    cs.CL cs.LG cs.SD eess.AS

    SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech

    Authors: Suwon Shon, Ankita Pasad, Felix Wu, Pablo Brusco, Yoav Artzi, Karen Livescu, Kyu J. Han

    Abstract: Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in higher-level spoken language understanding tasks, including using end-to-end models, but there are fewer annotated datasets for such tasks. At the same time, rece…

    Submitted 29 July, 2022; v1 submitted 19 November, 2021; originally announced November 2021.

    Comments: Updated preprint for SLUE Benchmark v0.2; Toolkit link https://github.com/asappresearch/slue-toolkit

  26. arXiv:2110.08538

    cs.CL

    Substructure Distribution Projection for Zero-Shot Cross-Lingual Dependency Parsing

    Authors: Haoyue Shi, Kevin Gimpel, Karen Livescu

    Abstract: We present substructure distribution projection (SubDP), a technique that projects a distribution over structures in one domain to another, by projecting substructure distributions separately. Models for the target domains can be then trained, using the projected distributions as soft silver labels. We evaluate SubDP on zero-shot cross-lingual dependency parsing, taking dependency arcs as substruc…

    Submitted 16 October, 2021; originally announced October 2021.

  27. arXiv:2109.09667

    cs.CL

    On Generalization in Coreference Resolution

    Authors: Shubham Toshniwal, Patrick Xia, Sam Wiseman, Karen Livescu, Kevin Gimpel

    Abstract: While coreference resolution is defined independently of dataset domain, most models for performing coreference resolution do not transfer well to unseen domains. We consolidate a set of 8 coreference resolution datasets targeting different domains to evaluate the off-the-shelf performance of models. We then mix three datasets for training; even though their domain, annotation guidelines, and meta…

    Submitted 20 September, 2021; originally announced September 2021.

    Comments: CRAC 2021

  28. arXiv:2107.04734

    cs.CL cs.LG eess.AS

    Layer-wise Analysis of a Self-supervised Speech Representation Model

    Authors: Ankita Pasad, Ju-Chieh Chou, Karen Livescu

    Abstract: Recently proposed self-supervised learning approaches have been successful for pre-training speech representation models. The utility of these learned representations has been observed empirically, but not much has been studied about the type or extent of information encoded in the pre-trained representations themselves. Developing such insights can help understand the capabilities and limits of t…

    Submitted 3 December, 2022; v1 submitted 9 July, 2021; originally announced July 2021.

    Comments: Accepted to ASRU 2021. Code: https://github.com/ankitapasad/layerwise-analysis

  29. arXiv:2104.01291

    cs.CV cs.CL

    Fingerspelling Detection in American Sign Language

    Authors: Bowen Shi, Diane Brentari, Greg Shakhnarovich, Karen Livescu

    Abstract: Fingerspelling, in which words are signed letter by letter, is an important component of American Sign Language. Most previous work on automatic fingerspelling recognition has assumed that the boundaries of fingerspelling regions in signing videos are known beforehand. In this paper, we consider the task of fingerspelling detection in raw, untrimmed sign language videos. This is an important step…

    Submitted 2 April, 2021; originally announced April 2021.

    Comments: CVPR 2021

  30. arXiv:2102.13249

    cs.CL cs.AI

    Chess as a Testbed for Language Model State Tracking

    Authors: Shubham Toshniwal, Sam Wiseman, Karen Livescu, Kevin Gimpel

    Abstract: Transformer language models have made tremendous strides in natural language understanding tasks. However, the complexity of natural language makes it challenging to ascertain how accurately these models are tracking the world state underlying the text. Motivated by this issue, we consider the task of language modeling for the game of chess. Unlike natural language, chess notations describe a simp…

    Submitted 13 May, 2022; v1 submitted 25 February, 2021; originally announced February 2021.

    Comments: AAAI 2022 extended version with supplementary material

  31. arXiv:2101.00411

    cs.CL

    Substructure Substitution: Structured Data Augmentation for NLP

    Authors: Haoyue Shi, Karen Livescu, Kevin Gimpel

    Abstract: We study a family of data augmentation methods, substructure substitution (SUB2), for natural language processing (NLP) tasks. SUB2 generates new examples by substituting substructures (e.g., subtrees or subsequences) with ones with the same label, which can be applied to many structured NLP tasks such as part-of-speech tagging and parsing. For more general tasks (e.g., text classification) which…

    Submitted 2 January, 2021; originally announced January 2021.

  32. arXiv:2012.02221

    eess.AS cs.CL cs.SD

    A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings

    Authors: Puyuan Peng, Herman Kamper, Karen Livescu

    Abstract: We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation. The resulting acoustic word embeddings can form the basis of search, discovery, and indexing systems for low- and zero-resource languages. Our model, which we refer to as a maximal sampling correspondence variational autoencoder (MCVAE), is a recurrent neural network (RNN) trai…

    Submitted 3 December, 2020; originally announced December 2020.

    Comments: 10 pages, 6 figures, NeurIPS 2020 Workshop Self-Supervised Learning for Speech and Audio Processing

  33. arXiv:2011.11807

    cs.CL

    Acoustic span embeddings for multilingual query-by-example search

    Authors: Yushi Hu, Shane Settle, Karen Livescu

    Abstract: Query-by-example (QbE) speech search is the task of matching spoken queries to utterances within a search collection. In low- or zero-resource settings, QbE search is often addressed with approaches based on dynamic time warping (DTW). Recent work has found that methods based on acoustic word embeddings (AWEs) can improve both performance and search speed. However, prior work on AWE-based QbE has…

    Submitted 23 November, 2020; originally announced November 2020.

  34. arXiv:2010.02807

    cs.CL cs.LG

    Learning to Ignore: Long Document Coreference with Bounded Memory Neural Networks

    Authors: Shubham Toshniwal, Sam Wiseman, Allyson Ettinger, Karen Livescu, Kevin Gimpel

    Abstract: Long document coreference resolution remains a challenging task due to the large memory and runtime requirements of current models. Recent work doing incremental coreference resolution using just the global representation of entities shows practical benefits but requires keeping all entities in memory, which can be impractical for long documents. We argue that keeping all entities in memory is unn…

    Submitted 16 November, 2020; v1 submitted 6 October, 2020; originally announced October 2020.

    Comments: Post EMNLP 2020 camera ready updates

  35. arXiv:2010.02423

    cs.CL

    On the Role of Supervision in Unsupervised Constituency Parsing

    Authors: Haoyue Shi, Karen Livescu, Kevin Gimpel

    Abstract: We analyze several recent unsupervised constituency parsing models, which are tuned with respect to the parsing $F_1$ score on the Wall Street Journal (WSJ) development set (1,700 sentences). We introduce strong baselines for them, by training an existing supervised parsing model (Kitaev and Klein, 2018) on the same labeled examples they access. When training on the 1,700 examples, or even when us…

    Submitted 6 October, 2020; v1 submitted 5 October, 2020; originally announced October 2020.

    Comments: EMNLP 2020. Project page: https://ttic.uchicago.edu/~freda/project/rsucp/

  36. arXiv:2007.00183

    eess.AS cs.CL cs.SD

    Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings

    Authors: Bowen Shi, Shane Settle, Karen Livescu

    Abstract: Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the feature vectors defined using vector embeddings of segments. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, whic…

    Submitted 24 November, 2020; v1 submitted 30 June, 2020; originally announced July 2020.

    Comments: SLT 2021

  37. arXiv:2006.14007

    cs.CL cs.SD eess.AS

    Multilingual Jointly Trained Acoustic and Written Word Embeddings

    Authors: Yushi Hu, Shane Settle, Karen Livescu

    Abstract: Acoustic word embeddings (AWEs) are vector representations of spoken word segments. AWEs can be learned jointly with embeddings of character sequences, to generate phonetically meaningful embeddings of written words, or acoustically grounded word embeddings (AGWEs). Such embeddings have been used to improve speech retrieval, recognition, and spoken term discovery. In this work, we extend this idea…

    Submitted 24 June, 2020; originally announced June 2020.

  38. arXiv:2006.06226

    cs.CL

    Discrete Latent Variable Representations for Low-Resource Text Classification

    Authors: Shuning Jin, Sam Wiseman, Karl Stratos, Karen Livescu

    Abstract: While much work on deep latent variable models of text uses continuous latent variables, discrete latent variables are interesting because they are more interpretable and typically more space efficient. We consider several approaches to learning discrete latent variable models for text in the case where exact marginalization over these variables is intractable. We compare the performance of the le…

    Submitted 11 June, 2020; originally announced June 2020.

    Comments: ACL 2020

  39. arXiv:2006.03866

    cs.CL

    A Cross-Task Analysis of Text Span Representations

    Authors: Shubham Toshniwal, Haoyue Shi, Bowen Shi, Lingyu Gao, Karen Livescu, Kevin Gimpel

    Abstract: Many natural language processing (NLP) tasks involve reasoning with textual spans, including question answering, entity recognition, and coreference resolution. While extensive research has focused on functional architectures for representing words and sentences, there is less work on representing arbitrary spans of text within sentences. In this paper, we conduct a comprehensive empirical evaluat…

    Submitted 6 June, 2020; originally announced June 2020.

    Comments: RepL4NLP 2020

  40. arXiv:2005.02990

    cs.CL cs.LG

    PeTra: A Sparsely Supervised Memory Model for People Tracking

    Authors: Shubham Toshniwal, Allyson Ettinger, Kevin Gimpel, Karen Livescu

    Abstract: We propose PeTra, a memory-augmented neural network designed to track entities in its memory slots. PeTra is trained using sparse annotation from the GAP pronoun resolution dataset and outperforms a prior memory model on the task while using a simpler architecture. We empirically compare key modeling choices, finding that we can simplify several aspects of the design of the memory module while ret…

    Submitted 6 May, 2020; originally announced May 2020.

    Comments: ACL 2020

  41. arXiv:2001.10603

    eess.AS cs.SD

    Unsupervised Pre-training of Bidirectional Speech Encoders via Masked Reconstruction

    Authors: Weiran Wang, Qingming Tang, Karen Livescu

    Abstract: We propose an approach for pre-training speech representations via a masked reconstruction loss. Our pre-trained encoder networks are bidirectional and can therefore be used directly in typical bidirectional speech recognition models. The pre-trained networks can then be fine-tuned on a smaller amount of supervised data for speech recognition. Experiments with this approach on the LibriSpeech and…

    Submitted 5 May, 2020; v1 submitted 28 January, 2020; originally announced January 2020.

    Comments: Final version for ICASSP 2020

  42. arXiv:1908.10546  [pdf, other]

    cs.CV cs.CL

    Fingerspelling recognition in the wild with iterative visual attention

    Authors: Bowen Shi, Aurora Martinez Del Rio, Jonathan Keane, Diane Brentari, Greg Shakhnarovich, Karen Livescu

    Abstract: Sign language recognition is a challenging gesture sequence recognition problem, characterized by quick and highly coarticulated motion. In this paper we focus on recognition of fingerspelling sequences in American Sign Language (ASL) videos collected in the wild, mainly from YouTube and Deaf social media. Most previous work on sign language recognition has focused on controlled settings where the…

    Submitted 28 August, 2019; originally announced August 2019.

    Comments: ICCV 2019

  43. arXiv:1906.09535  [pdf, other]

    cs.CL

    Variational Sequential Labelers for Semi-Supervised Learning

    Authors: Mingda Chen, Qingming Tang, Karen Livescu, Kevin Gimpel

    Abstract: We introduce a family of multitask variational methods for semi-supervised sequence labeling. Our model family consists of a latent-variable generative model and a discriminative labeler. The generative models use latent variables to define the conditional probability of a word given its context, drawing inspiration from word prediction objectives commonly used in learning word embeddings. The lab…

    Submitted 22 June, 2019; originally announced June 2019.

    Comments: Appeared in EMNLP 2018 Long

  44. Visually Grounded Neural Syntax Acquisition

    Authors: Haoyue Shi, Jiayuan Mao, Kevin Gimpel, Karen Livescu

    Abstract: We present the Visually Grounded Neural Syntax Learner (VG-NSL), an approach for learning syntactic representations and structures without any explicit supervision. The model learns by looking at natural images and reading paired captions. VG-NSL generates constituency parse trees of texts, recursively composes representations for constituents, and matches them with images. We define concreteness…

    Submitted 24 September, 2019; v1 submitted 7 June, 2019; originally announced June 2019.

    Comments: ACL 2019. Project page: https://ttic.uchicago.edu/~freda/project/vgnsl/

  45. arXiv:1904.10947  [pdf, other]

    cs.CL cs.SD eess.AS

    On the Contributions of Visual and Textual Supervision in Low-Resource Semantic Speech Retrieval

    Authors: Ankita Pasad, Bowen Shi, Herman Kamper, Karen Livescu

    Abstract: Recent work has shown that speech paired with images can be used to learn semantically meaningful speech representations even without any textual supervision. In real-world low-resource settings, however, we often have access to some transcribed speech. We study whether and how visual grounding is useful in the presence of varying amounts of textual supervision. In particular, we consider the task…

    Submitted 30 August, 2019; v1 submitted 24 April, 2019; originally announced April 2019.

  46. arXiv:1904.07078  [pdf, other]

    cs.CL cs.SD eess.AS

    Semantic query-by-example speech search using visual grounding

    Authors: Herman Kamper, Aristotelis Anastassiou, Karen Livescu

    Abstract: A number of recent studies have started to investigate how speech systems can be trained on untranscribed speech by leveraging accompanying images at training time. Examples of tasks include keyword prediction and within- and across-mode retrieval. Here we consider how such models can be used for query-by-example (QbE) search, the task of retrieving utterances relevant to a given spoken query. We…

    Submitted 15 April, 2019; originally announced April 2019.

    Comments: Accepted to ICASSP 2019

  47. arXiv:1903.12306  [pdf, other]

    cs.CL

    Acoustically Grounded Word Embeddings for Improved Acoustics-to-Word Speech Recognition

    Authors: Shane Settle, Kartik Audhkhasi, Karen Livescu, Michael Picheny

    Abstract: Direct acoustics-to-word (A2W) systems for end-to-end automatic speech recognition are simpler to train, and more efficient to decode with, than sub-word systems. However, A2W systems can have difficulties at training time when data is limited, and at decoding time when recognizing words outside the training vocabulary. To address these shortcomings, we investigate the use of recently proposed aco…

    Submitted 28 March, 2019; originally announced March 2019.

    Comments: To appear at ICASSP 2019

  48. arXiv:1810.11438  [pdf, other]

    cs.CV cs.CL

    American Sign Language fingerspelling recognition in the wild

    Authors: Bowen Shi, Aurora Martinez Del Rio, Jonathan Keane, Jonathan Michaux, Diane Brentari, Greg Shakhnarovich, Karen Livescu

    Abstract: We address the problem of American Sign Language fingerspelling recognition in the wild, using videos collected from websites. We introduce the largest data set available so far for the problem of fingerspelling recognition, and the first using naturally occurring video data. Using this data set, we present the first attempt to recognize fingerspelling sequences in this challenging setting. Unlike…

    Submitted 17 February, 2019; v1 submitted 26 October, 2018; originally announced October 2018.

    Comments: accepted in SLT 2018

  49. arXiv:1809.01431  [pdf, other]

    cs.CL

    Pre-training on high-resource speech recognition improves low-resource speech-to-text translation

    Authors: Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, Sharon Goldwater

    Abstract: We present a simple approach to improve direct speech-to-text translation (ST) when the source language is low-resource: we pre-train the model on a high-resource automatic speech recognition (ASR) task, and then fine-tune its parameters for ST. We demonstrate that our approach is effective by pre-training on 300 hours of English ASR data to improve Spanish-English ST from 10.8 to 20.2 BLEU when o…

    Submitted 27 February, 2019; v1 submitted 5 September, 2018; originally announced September 2018.

    Comments: Accepted for publication in NAACL 2019

  50. arXiv:1807.10857  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition

    Authors: Shubham Toshniwal, Anjuli Kannan, Chung-Cheng Chiu, Yonghui Wu, Tara N Sainath, Karen Livescu

    Abstract: Attention-based recurrent neural encoder-decoder models present an elegant solution to the automatic speech recognition problem. This approach folds the acoustic model, pronunciation model, and language model into a single network and requires only a parallel corpus of speech and text for training. However, unlike in conventional approaches that combine separate acoustic and language models, it is…

    Submitted 6 November, 2018; v1 submitted 27 July, 2018; originally announced July 2018.

    Comments: Accepted in SLT 2018