Showing 1–14 of 14 results for author: Sarı, L

  1. arXiv:2309.13018  [pdf, other]

    eess.AS cs.CL cs.SD

    Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

    Authors: Jiamin Xie, Ke Li, Jinxi Guo, Andros Tjandra, Yuan Shangguan, Leda Sari, Chunyang Wu, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli

    Abstract: Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training needed to be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in…

    Submitted 11 January, 2024; v1 submitted 22 September, 2023; originally announced September 2023.

  2. arXiv:2309.09390  [pdf, other]

    cs.CL cs.SD eess.AS

    Augmenting text for spoken language understanding with Large Language Models

    Authors: Roshan Sharma, Suyoun Kim, Daniel Lazar, Trang Le, Akshat Shrivastava, Kwanghoon Ahn, Piyush Kansal, Leda Sari, Ozlem Kalinli, Michael Seltzer

    Abstract: Spoken semantic parsing (SSP) involves generating machine-comprehensible parses from input speech. Training robust models for existing application domains represented in training data or extending to new domains requires corresponding triplets of speech-transcript-semantic parse data, which is expensive to obtain. In this paper, we address this challenge by examining methods that can use transcrip…

    Submitted 17 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  3. arXiv:2306.15687  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

    Authors: Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu

    Abstract: Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative…

    Submitted 19 October, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

    Comments: Accepted to NeurIPS 2023

  4. arXiv:2306.00998  [pdf, other]

    eess.AS cs.CL cs.SD

    Towards Selection of Text-to-speech Data to Augment ASR Training

    Authors: Shuo Liu, Leda Sarı, Chunyang Wu, Gil Keren, Yuan Shangguan, Jay Mahadeokar, Ozlem Kalinli

    Abstract: This paper presents a method for selecting appropriate synthetic speech samples from a given large text-to-speech (TTS) dataset as supplementary training data for an automatic speech recognition (ASR) model. We trained a neural network, which can be optimised using cross-entropy loss or Arcface loss, to measure the similarity of a synthetic data to real speech. We found that incorporating syntheti…

    Submitted 30 May, 2023; originally announced June 2023.

  5. arXiv:2303.12197  [pdf, other]

    eess.AS

    Self-Supervised Representations for Singing Voice Conversion

    Authors: Tejas Jayashankar, Jilong Wu, Leda Sari, David Kant, Vimal Manohar, Qing He

    Abstract: A singing voice conversion model converts a song in the voice of an arbitrary source singer to the voice of a target singer. Recently, methods that leverage self-supervised audio representations such as HuBERT and Wav2Vec 2.0 have helped further the state-of-the-art. Though these methods produce more natural and melodic singing outputs, they often rely on confusion and disentanglement losses to re…

    Submitted 21 March, 2023; originally announced March 2023.

  6. arXiv:2303.00802  [pdf, other]

    cs.CL cs.SD eess.AS

    Synthetic Cross-accent Data Augmentation for Automatic Speech Recognition

    Authors: Philipp Klumpp, Pooja Chitkara, Leda Sarı, Prashant Serai, Jilong Wu, Irina-Elena Veliche, Rongqing Huang, Qing He

    Abstract: The awareness for biased ASR datasets or models has increased notably in recent years. Even for English, despite a vast amount of available training data, systems perform worse for non-native speakers. In this work, we improve an accent-conversion model (ACM) which transforms native US-English speech into accented pronunciation. We include phonetic knowledge in the ACM training to provide accurate…

    Submitted 1 March, 2023; originally announced March 2023.

  7. arXiv:2211.02536  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    Biased Self-supervised learning for ASR

    Authors: Florian L. Kreyssig, Yangyang Shi, Jinxi Guo, Leda Sari, Abdelrahman Mohamed, Philip C. Woodland

    Abstract: Self-supervised learning via masked prediction pre-training (MPPT) has shown impressive performance on a range of speech-processing tasks. This paper proposes a method to bias self-supervised learning towards a specific task. The core idea is to slightly finetune the model that is used to obtain the target sequence. This leads to better performance and a substantial increase in training speed. Fur…

    Submitted 4 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  8. arXiv:2111.09983  [pdf, other]

    eess.AS cs.SD

    Towards Measuring Fairness in Speech Recognition: Casual Conversations Dataset Transcriptions

    Authors: Chunxi Liu, Michael Picheny, Leda Sarı, Pooja Chitkara, Alex Xiao, Xiaohui Zhang, Mark Chou, Andres Alvarado, Caner Hazirbas, Yatharth Saraf

    Abstract: It is well known that many machine learning systems demonstrate bias towards specific groups of individuals. This problem has been studied extensively in the Facial Recognition area, but much less so in Automatic Speech Recognition (ASR). This paper presents initial Speech Recognition results on "Casual Conversations" -- a publicly released 846 hour corpus designed to help researchers evaluate the…

    Submitted 18 November, 2021; originally announced November 2021.

    Comments: Submitted to ICASSP 2022. Our dataset will be publicly available at (https://ai.facebook.com/datasets/casual-conversations-downloads) for general use. We also would like to note that considering the limitations of our dataset, we limit the use of it for only evaluation purposes (see license agreement)

  9. arXiv:2110.07058  [pdf, other]

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons…

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  10. arXiv:2102.06291  [pdf, other]

    cs.SD cs.LG eess.AS eess.IV

    A Multi-View Approach To Audio-Visual Speaker Verification

    Authors: Leda Sarı, Kritika Singh, Jiatong Zhou, Lorenzo Torresani, Nayan Singhal, Yatharth Saraf

    Abstract: Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input. In these cases, the visual stream provides complementary information and can often be leveraged in conjunction with the acoustics of speech to improve verification performance. In this study, we explore audio-visual approaches to speaker verification…

    Submitted 11 February, 2021; originally announced February 2021.

  11. arXiv:2008.03425  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Deep F-measure Maximization for End-to-End Speech Understanding

    Authors: Leda Sarı, Mark Hasegawa-Johnson

    Abstract: Spoken language understanding (SLU) datasets, like many other machine learning datasets, usually suffer from the label imbalance problem. Label imbalance usually causes the learned model to replicate similar biases at the output which raises the issue of unfairness to the minority classes in the dataset. In this work, we approach the fairness problem by maximizing the F-measure instead of accuracy…

    Submitted 7 August, 2020; originally announced August 2020.

    Comments: Interspeech 2020 submission (Accepted)

  12. arXiv:2005.11408  [pdf, other]

    eess.AS cs.LG

    Identify Speakers in Cocktail Parties with End-to-End Attention

    Authors: Junzhe Zhu, Mark Hasegawa-Johnson, Leda Sari

    Abstract: In scenarios where multiple speakers talk at the same time, it is important to be able to identify the talkers accurately. This paper presents an end-to-end system that integrates speech source extraction and speaker identification, and proposes a new way to jointly optimize these two parts by max-pooling the speaker predictions along the channel dimension. Residual attention permits us to learn s…

    Submitted 9 August, 2020; v1 submitted 22 May, 2020; originally announced May 2020.

    Comments: Accepted by Interspeech 2020 for presentation; https://github.com/JunzheJosephZhu/Identify-Speakers-in-Cocktail-Parties-with-E2E-Attention

  13. arXiv:2002.06165  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR

    Authors: Leda Sarı, Niko Moritz, Takaaki Hori, Jonathan Le Roux

    Abstract: We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR). The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism. The resulting memory vector (M-vector) is concatenated to the acoustic…

    Submitted 14 February, 2020; originally announced February 2020.

    Comments: To appear in Proc. ICASSP 2020

  14. arXiv:1006.0834  [pdf]

    cs.NI

    Performance of RCPC-Encoded V-BLAST MIMO In Nakagami-m Fading Channel

    Authors: L. Sari, G. Wibisono, D. Gunawan

    Abstract: Multiple Input Multiple Output (MIMO) wireless communication link has been theoretically proven to be reliable and capable of achieving high capacity. However, these two advantageous characteristics tend to be addressed separately in many major researches. Researches on various approaches to attain both characteristics in a single MIMO system are still on-going and an established approach is yet t…

    Submitted 4 June, 2010; originally announced June 2010.

    Comments: Submitted to Journal of Telecommunications, see http://sites.google.com/site/journaloftelecommunications/volume-2-issue-2-may-2010

    Journal ref: Journal of Telecommunications, Volume 2, Issue 2, pp. 49-57, May 2010