Showing 1–13 of 13 results for author: Chien, C

Searching in archive eess.
  1. arXiv:2406.10083  [pdf, other]

    cs.CL cs.SD eess.AS

    On the Evaluation of Speech Foundation Models for Spoken Language Understanding

    Authors: Siddhant Arora, Ankita Pasad, Chung-Ming Chien, Jionghao Han, Roshan Sharma, Jee-weon Jung, Hira Dhamyal, William Chen, Suwon Shon, Hung-yi Lee, Karen Livescu, Shinji Watanabe

    Abstract: The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for th…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted at ACL Findings 2024

  2. arXiv:2406.06251  [pdf, other]

    eess.AS cs.CL

    Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning

    Authors: Chung-Ming Chien, Andros Tjandra, Apoorv Vyas, Matt Le, Bowen Shi, Wei-Ning Hsu

    Abstract: As the scale of generative models continues to grow, efficient reuse and adaptation of pre-trained models have become crucial considerations. In this work, we propose Voicebox Adapter, a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model using a cross-attention module. To ensure a smooth integration of newly added modules with pre-trained one…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted by InterSpeech 2024

  3. arXiv:2403.11249  [pdf, other]

    eess.IV cs.CV

    YOLOv9 for Fracture Detection in Pediatric Wrist Trauma X-ray Images

    Authors: Chun-Tse Chien, Rui-Yang Ju, Kuang-Yi Chou, Jen-Shiun Chiang

    Abstract: The introduction of YOLOv9, the latest version of the You Only Look Once (YOLO) series, has led to its widespread adoption across various scenarios. This paper is the first to apply the YOLOv9 algorithm model to the fracture detection task as computer-assisted diagnosis (CAD) to help radiologists and surgeons to interpret X-ray images. Specifically, this paper trained the model on the GRAZPEDWRI-D…

    Submitted 27 May, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

    Comments: Accepted by Electronics Letters

  4. arXiv:2310.08715  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Toward Joint Language Modeling for Speech Units and Text

    Authors: Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, Michael Auli

    Abstract: Speech and text are two major forms of human language. The research community has been focusing on mapping speech to text or vice versa for many years. However, in the field of language modeling, very little effort has been made to model them jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers to transform co…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: EMNLP findings 2023

  5. arXiv:2310.05919  [pdf, other]

    cs.CL eess.AS

    Few-Shot Spoken Language Understanding via Joint Speech-Text Models

    Authors: Chung-Ming Chien, Mingjiamei Zhang, Ju-Chieh Chou, Karen Livescu

    Abstract: Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations by encoding speech and text in a shared space. In this paper, we leverage such shared representations to address the persistent challenge of limited data availability in spoken language understanding tasks. By employing a pre-trained speech-text model, we fin…

    Submitted 9 October, 2023; originally announced October 2023.

  6. arXiv:2309.08030  [pdf, other]

    eess.AS cs.CL cs.SD

    AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement

    Authors: Ju-Chieh Chou, Chung-Ming Chien, Karen Livescu

    Abstract: Speech enhancement systems are typically trained using pairs of clean and noisy speech. In audio-visual speech enhancement (AVSE), there is not as much ground-truth clean data available; most audio-visual datasets are collected in real-world environments with background noise and reverberation, hampering the development of AVSE. In this work, we introduce AV2Wav, a resynthesis-based audio-visual s…

    Submitted 8 April, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Extended version of the paper accepted at ICASSP 2024

  7. arXiv:2307.00162  [pdf, other]

    cs.CL cs.LG eess.AS

    What Do Self-Supervised Speech Models Know About Words?

    Authors: Ankita Pasad, Chung-Ming Chien, Shane Settle, Karen Livescu

    Abstract: Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks. However, these empirical successes alone do not give a complete picture of what is learned during pre-training. Recent work has begun analyzing how S3Ms encode certain properties, such as phonetic and speaker information, but we still lack a pro…

    Submitted 31 January, 2024; v1 submitted 30 June, 2023; originally announced July 2023.

    Comments: Pre-MIT Press publication version

  8. arXiv:2306.05085  [pdf, other]

    eess.IV

    TransTIC: Transferring Transformer-based Image Compression from Human Perception to Machine Perception

    Authors: Yi-Hsin Chen, Ying-Chieh Weng, Chia-Hao Kao, Cheng Chien, Wei-Chen Chiu, Wen-Hsiao Peng

    Abstract: This work aims for transferring a Transformer-based image compression codec from human perception to machine perception without fine-tuning the codec. We propose a transferable Transformer-based image compression framework, termed TransTIC. Inspired by visual prompt tuning, TransTIC adopts an instance-specific prompt generator to inject instance-specific prompts to the encoder and task-specific pr…

    Submitted 18 August, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: Accepted to ICCV 2023

  9. arXiv:2202.08164  [pdf, other]

    eess.AS cs.CL cs.LG

    Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module

    Authors: Adam Gabryś, Goeric Huybrechts, Manuel Sam Ribeiro, Chung-Ming Chien, Julian Roth, Giulia Comini, Roberto Barra-Chicote, Bartek Perz, Jaime Lorenzo-Trueba

    Abstract: State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations, making training low-resource TTS systems problematic. In this paper, we propose a novel extremely low-resource TTS method called Voice Filt…

    Submitted 16 February, 2022; originally announced February 2022.

    Comments: Accepted at ICASSP 2022

  10. arXiv:2104.02901  [pdf, other]

    eess.AS cs.SD

    S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations

    Authors: Jheng-hao Lin, Yist Y. Lin, Chung-Ming Chien, Hung-yi Lee

    Abstract: Any-to-any voice conversion (VC) aims to convert the timbre of utterances from and to any speakers seen or unseen during training. Various any-to-any VC approaches have been proposed like AUTOVC, AdaINVC, and FragmentVC. AUTOVC, and AdaINVC utilize source and target encoders to disentangle the content and speaker information of the features. FragmentVC utilizes two encoders to encode source and ta…

    Submitted 14 June, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

    Comments: Accepted by INTERSPEECH 2021

  11. arXiv:2103.04088  [pdf, other]

    eess.AS cs.LG cs.SD

    Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech

    Authors: Chung-Ming Chien, Jheng-Hao Lin, Chien-yu Huang, Po-chun Hsu, Hung-yi Lee

    Abstract: The few-shot multi-speaker multi-style voice cloning task is to synthesize utterances with voice and speaking style similar to a reference speaker given only a few reference samples. In this work, we investigate different speaker representations and proposed to integrate pretrained and learnable speaker representations. Among different types of embeddings, the embedding pretrained by voice convers…

    Submitted 1 May, 2021; v1 submitted 6 March, 2021; originally announced March 2021.

    Comments: Accepted by ICASSP 2021, in the special session of ICASSP 2021 M2VoC Challenge

  12. arXiv:2011.06465  [pdf, other]

    eess.AS cs.LG cs.SD

    Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis

    Authors: Chung-Ming Chien, Hung-yi Lee

    Abstract: Prosody modeling is an essential component in modern text-to-speech (TTS) frameworks. By explicitly providing prosody features to the TTS model, the style of synthesized utterances can thus be controlled. However, predicting natural and reasonable prosody at inference time is challenging. In this work, we analyzed the behavior of non-autoregressive TTS models under different prosody-modeling setti…

    Submitted 1 May, 2021; v1 submitted 12 November, 2020; originally announced November 2020.

    Comments: Accepted by SLT 2021

  13. arXiv:2010.14150  [pdf, other]

    eess.AS cs.LG

    FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention

    Authors: Yist Y. Lin, Chung-Ming Chien, Jheng-Hao Lin, Hung-yi Lee, Lin-shan Lee

    Abstract: Any-to-any voice conversion aims to convert the voice from and to any speakers even unseen during training, which is much more challenging compared to one-to-one or many-to-many tasks, but much more attractive in real-world scenarios. In this paper we proposed FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0, while the spectra…

    Submitted 3 May, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: To appear in the proceedings of ICASSP 2021, equal contribution from first two authors