Showing 1–8 of 8 results for author: Lin, Y Y

Searching in archive eess.
  1. arXiv:2404.09385  [pdf, other]

    eess.AS cs.CL eess.SP

    A Large-Scale Evaluation of Speech Foundation Models

    Authors: Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi Lee

    Abstract: The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work,…

    Submitted 29 May, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

    Comments: The extended journal version for SUPERB and SUPERB-SG. Published in IEEE/ACM TASLP. The Arxiv version is preferred

  2. arXiv:2306.07949  [pdf, other]

    eess.AS cs.AI cs.LG

    Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition

    Authors: Xianzhao Chen, Yist Y. Lin, Kang Wang, Yi He, Zejun Ma

    Abstract: End-to-end (E2E) systems have shown comparable performance to hybrid systems for automatic speech recognition (ASR). Word timings, as a by-product of ASR, are essential in many applications, especially for subtitling and computer-aided pronunciation training. In this paper, we improve the frame-level classifier for word timings in E2E system by introducing label priors in connectionist temporal cl…

    Submitted 8 June, 2023; originally announced June 2023.

    Comments: To appear in the proceedings of INTERSPEECH 2023

  3. arXiv:2210.15876  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

    Authors: Yist Y. Lin, Tao Han, Haihua Xu, Van Tung Pham, Yerbolat Khassanov, Tze Yuang Chong, Yi He, Lu Lu, Zejun Ma

    Abstract: One of limitations in end-to-end automatic speech recognition (ASR) framework is its performance would be compromised if train-test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate train-test utterance length mismatch issue for short-video ASR task. Specifically, we are motivated by observatio…

    Submitted 25 May, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: 5 pages, 3 figures, 4 tables

  4. arXiv:2105.01051  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    SUPERB: Speech processing Universal PERformance Benchmark

    Authors: Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee

    Abstract: Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge…

    Submitted 15 October, 2021; v1 submitted 3 May, 2021; originally announced May 2021.

    Comments: To appear in Interspeech 2021

  5. arXiv:2104.03017  [pdf, other]

    eess.AS cs.LG cs.SD

    Utilizing Self-supervised Representations for MOS Prediction

    Authors: Wei-Cheng Tseng, Chien-yu Huang, Wei-Tsung Kao, Yist Y. Lin, Hung-yi Lee

    Abstract: Speech quality assessment has been a critical issue in speech processing for decades. Existing automatic evaluations usually require clean references or parallel ground truth data, which is infeasible when the amount of data soars. Subjective tests, on the other hand, do not need any additional clean or parallel data and correlates better to human perception. However, such a test is expensive and…

    Submitted 20 September, 2021; v1 submitted 7 April, 2021; originally announced April 2021.

    Comments: In Proceedings of Interspeech 2021. We acknowledge the support of AWS Machine Learning Research Awards program. Source code available at https://github.com/s3prl/s3prl/tree/master/s3prl/downstream/mos_prediction

  6. arXiv:2104.02901  [pdf, other]

    eess.AS cs.SD

    S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations

    Authors: Jheng-hao Lin, Yist Y. Lin, Chung-Ming Chien, Hung-yi Lee

    Abstract: Any-to-any voice conversion (VC) aims to convert the timbre of utterances from and to any speakers seen or unseen during training. Various any-to-any VC approaches have been proposed like AUTOVC, AdaINVC, and FragmentVC. AUTOVC, and AdaINVC utilize source and target encoders to disentangle the content and speaker information of the features. FragmentVC utilizes two encoders to encode source and ta…

    Submitted 14 June, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

    Comments: Accepted by INTERSPEECH 2021

  7. arXiv:2010.14150  [pdf, other]

    eess.AS cs.LG

    FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention

    Authors: Yist Y. Lin, Chung-Ming Chien, Jheng-Hao Lin, Hung-yi Lee, Lin-shan Lee

    Abstract: Any-to-any voice conversion aims to convert the voice from and to any speakers even unseen during training, which is much more challenging compared to one-to-one or many-to-many tasks, but much more attractive in real-world scenarios. In this paper we proposed FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0, while the spectra…

    Submitted 3 May, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: To appear in the proceedings of ICASSP 2021, equal contribution from first two authors

  8. arXiv:2005.08781  [pdf, other]

    eess.AS cs.LG cs.SD

    Defending Your Voice: Adversarial Attack on Voice Conversion

    Authors: Chien-yu Huang, Yist Y. Lin, Hung-yi Lee, Lin-shan Lee

    Abstract: Substantial improvements have been achieved in recent years in voice conversion, which converts the speaker characteristics of an utterance into those of another speaker without changing the linguistic content of the utterance. Nonetheless, the improved conversion technologies also led to concerns about privacy and authentication. It thus becomes highly desired to be able to prevent one's voice fr…

    Submitted 4 May, 2021; v1 submitted 18 May, 2020; originally announced May 2020.

    Comments: Accepted by SLT 2021