Showing 1–50 of 62 results for author: Lai, C

Searching in archive eess.
  1. arXiv:2406.18556  [pdf]

    eess.IV cs.CV cs.LG

    Renal digital pathology visual knowledge search platform based on language large model and book knowledge

    Authors: Xiaomin Lv, Chong Lai, Liya Ding, Maode Lai, Qingrong Sun

    Abstract: Large models have become mainstream, yet their applications in digital pathology still require exploration. Meanwhile renal pathology images play an important role in the diagnosis of renal diseases. We conducted image segmentation and paired corresponding text descriptions based on 60 books for renal pathology, clustering analysis for all image and text description features based on large models,…

    Submitted 26 May, 2024; originally announced June 2024.

    Comments: 9 pages, 6 figures

  2. arXiv:2406.08353  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD

    Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques

    Authors: Yuanchao Li, Peter Bell, Catherine Lai

    Abstract: Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SE…

    Submitted 12 June, 2024; originally announced June 2024.

  3. arXiv:2405.20064  [pdf, other]

    eess.AS cs.SD

    1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem

    Authors: Mingjie Chen, Hezhao Zhang, Yuanchao Li, Jiachen Luo, Wen Wu, Ziyang Ma, Peter Bell, Catherine Lai, Joshua Reiss, Lin Wang, Philip C. Woodland, Xie Chen, Huy Phan, Thomas Hain

    Abstract: Speech emotion recognition is a challenging classification task with natural emotional speech, especially when the distribution of emotion types is imbalanced in the training and test data. In this case, it is more difficult for a model to learn to separate minority classes, resulting in those sometimes being ignored or frequently misclassified. Previous work has utilised class weighted loss for t…

    Submitted 30 May, 2024; originally announced May 2024.

  4. arXiv:2405.18503  [pdf, other]

    cs.SD cs.LG eess.AS

    SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

    Authors: Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji

    Abstract: Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for the creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error…

    Submitted 10 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Audio samples: https://koichi-saito-sony.github.io/soundctm/. Codes: https://github.com/sony/soundctm. Checkpoints: https://huggingface.co/Sony/soundctm

  5. arXiv:2405.16677  [pdf, other]

    eess.AS cs.CL cs.SD

    Crossmodal ASR Error Correction with Discrete Speech Units

    Authors: Yuanchao Li, Pinzhen Chen, Peter Bell, Catherine Lai

    Abstract: ASR remains unsatisfactory in scenarios where the speaking style diverges from that used to train ASR systems, resulting in erroneous transcripts. To address this, ASR Error Correction (AEC), a post-ASR processing approach, is required. In this work, we tackle an understudied issue: the Low-Resource Out-of-Domain (LROOD) problem, by investigating crossmodal AEC on very limited downstream data with…

    Submitted 26 May, 2024; originally announced May 2024.

  6. arXiv:2404.09385  [pdf, other]

    eess.AS cs.CL eess.SP

    A Large-Scale Evaluation of Speech Foundation Models

    Authors: Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi Lee

    Abstract: The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work,…

    Submitted 29 May, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

    Comments: The extended journal version for SUPERB and SUPERB-SG. Published in IEEE/ACM TASLP. The arXiv version is preferred

  7. arXiv:2402.02617  [pdf, other]

    cs.CL cs.SD eess.AS

    Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition

    Authors: Alexandra Saliba, Yuanchao Li, Ramon Sanabria, Catherine Lai

    Abstract: The efficacy of self-supervised speech models has been validated, yet the optimal utilization of their representations remains challenging across diverse tasks. In this study, we delve into Acoustic Word Embeddings (AWEs), a fixed-length feature derived from continuous representations, to explore their advantages in specific tasks. AWEs have previously shown utility in capturing acoustic discrimin…

    Submitted 4 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP2024 Self-supervision in Audio, Speech and Beyond (SASB) workshop. First two authors contributed equally

  8. arXiv:2401.01329  [pdf, other]

    eess.SP cs.NI

    Self-Supervised Millimeter Wave Indoor Localization using Tiny Neural Networks

    Authors: Anish Shastri, Steve Blandino, Camillo Gentile, Chiehping Lai, Paolo Casari

    Abstract: The quasi-optical propagation of millimeter-wave signals enables high-accuracy localization algorithms that employ geometric approaches or machine learning models. However, most algorithms require information on the indoor environment, may entail the collection of large training datasets, or bear an infeasible computational burden for commercial off-the-shelf (COTS) devices. In this work, we propo…

    Submitted 2 January, 2024; originally announced January 2024.

    Comments: 13 pages, 11 figures

  9. arXiv:2310.13267  [pdf, other]

    cs.CL cs.CV cs.LG cs.SD eess.AS

    On the Language Encoder of Contrastive Cross-modal Models

    Authors: Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding…

    Submitted 20 October, 2023; originally announced October 2023.

  10. arXiv:2310.07654  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Audio-Visual Neural Syntax Acquisition

    Authors: Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass

    Abstract: We study phrase structure induction from visually-grounded speech. The core idea is to first segment the speech waveform into sequences of word segments, and subsequently induce phrase structure using the inferred segment-level continuous representations. We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without eve…

    Submitted 11 October, 2023; originally announced October 2023.

  11. arXiv:2309.10787  [pdf, other]

    eess.AS cs.CV cs.MM cs.SD

    AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

    Authors: Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee

    Abstract: Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual a…

    Submitted 19 March, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024; Evaluation Code: https://github.com/roger-tseng/av-superb Submission Platform: https://av.superbbenchmark.org

  12. arXiv:2309.09843  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Instruction-Following Speech Recognition

    Authors: Cheng-I Jeff Lai, Zhiyun Lu, Liangliang Cao, Ruoming Pang

    Abstract: Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models' speech understanding and "reasoning" capabilities remai…

    Submitted 18 September, 2023; originally announced September 2023.

  13. arXiv:2309.06934  [pdf, other]

    eess.AS cs.SD

    VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

    Authors: Carlos Hernandez-Olivan, Koichi Saito, Naoki Murata, Chieh-Hsin Lai, Marco A. Martínez-Ramirez, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: Restoring degraded music signals is essential to enhance audio quality for downstream music manipulation. Recent diffusion-based music restoration methods have demonstrated impressive performance, and among them, diffusion posterior sampling (DPS) stands out given its intrinsic properties, making it versatile across various restoration tasks. In this paper, we identify that there are potential iss…

    Submitted 13 September, 2023; originally announced September 2023.

  14. arXiv:2308.06979  [pdf, other]

    eess.AS cs.SD

    The Sound Demixing Challenge 2023 – Music Demixing Track

    Authors: Giorgio Fabbro, Stefan Uhlich, Chieh-Hsin Lai, Woosung Choi, Marco Martínez-Ramírez, Weihsiang Liao, Igor Gadelha, Geraldo Ramos, Eddie Hsu, Hugo Rodrigues, Fabian-Robert Stöter, Alexandre Défossez, Yi Luo, Jianwei Yu, Dipam Chakraborty, Sharada Mohanty, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Nabarun Goswami, Tatsuya Harada, Minseok Kim, Jun Hyung Lee, Yuanliang Dong, Xinran Zhang, et al. (2 additional authors not shown)

    Abstract: This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge (SDX'23). We provide a summary of the challenge setup and introduce the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data. We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce t…

    Submitted 19 April, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

    Comments: Published in Transactions of the International Society for Music Information Retrieval (https://transactions.ismir.net/articles/10.5334/tismir.171)

    Journal ref: Transactions of the International Society for Music Information Retrieval, 7(1), pp.63-84, 2024

  15. arXiv:2307.15374  [pdf]

    eess.SY

    Leveraging Optical Communication Fiber and AI for Distributed Water Pipe Leak Detection

    Authors: Huan Wu, Huan-Feng Duan, Wallace W. L. Lai, Kun Zhu, Xin Cheng, Hao Yin, Bin Zhou, Chun-Cheung Lai, Chao Lu, Xiaoli Ding

    Abstract: Detecting leaks in water networks is a costly challenge. This article introduces a practical solution: the integration of optical networks with water networks for efficient leak detection. Our approach uses a fiber-optic cable to measure vibrations, enabling accurate leak identification and localization by an intelligent algorithm. We also propose a method to assess leak severity for prioritized re…

    Submitted 28 July, 2023; originally announced July 2023.

    Comments: Accepted

    Journal ref: IEEE Communications Magazine, 2023

  16. MOV-Modified-FxLMS algorithm with Variable Penalty Factor in a Practical Power Output Constrained Active Control System

    Authors: Chung Kwan Lai, Dongyuan Shi, Bhan Lam, Woon-Seng Gan

    Abstract: Practical Active Noise Control (ANC) systems typically require a restriction in their maximum output power, to prevent overdriving the loudspeaker and causing system instability. Recently, the minimum output variance filtered-reference least mean square (MOV-FxLMS) algorithm was shown to have optimal control under output constraint with an analytically formulated penalty factor, but it needs offli…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: Accepted article in IEEE Signal Processing Letters

    Journal ref: IEEE Signal Process. Lett., vol. 30, pp. 723-727, 2023

  17. arXiv:2305.16076  [pdf, other]

    eess.AS cs.SD

    Transfer Learning for Personality Perception via Speech Emotion Recognition

    Authors: Yuanchao Li, Peter Bell, Catherine Lai

    Abstract: Holistic perception of affective attributes is an important human perceptual ability. However, this ability is far from being realized in current affective computing, as not all of the attributes are well studied and their interrelationships are poorly understood. In this work, we investigate the relationship between two affective attributes: personality and emotion, from a transfer learning persp…

    Submitted 28 May, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted to INTERSPEECH 2023

  18. arXiv:2305.16065  [pdf, other]

    eess.AS cs.CL cs.SD

    ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition

    Authors: Yuanchao Li, Zeyu Zhao, Ondrej Klejch, Peter Bell, Catherine Lai

    Abstract: In Speech Emotion Recognition (SER), textual data is often used alongside audio signals to address their inherent variability. However, the reliance on human annotated text in most research hinders the development of practical SER systems. To overcome this challenge, we investigate how Automatic Speech Recognition (ASR) performs on emotional speech by analyzing the ASR performance on emotion corpo…

    Submitted 28 May, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted to INTERSPEECH 2023

  19. arXiv:2305.13583  [pdf, other]

    cs.CL cs.MM eess.AS eess.IV

    Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition

    Authors: Yaoting Wang, Yuanchao Li, Paul Pu Liang, Louis-Philippe Morency, Peter Bell, Catherine Lai

    Abstract: Fusing multiple modalities has proven effective for multimodal information processing. However, the incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition. In this study, we first analyze how the salient affective information in one modality can be affected by the other, and demonstrate that inter-modal incongruity exists latently in crossmodal att…

    Submitted 12 November, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: *First two authors contributed equally

  20. Interference-Aware Deployment for Maximizing User Satisfaction in Multi-UAV Wireless Networks

    Authors: Chuan-Chi Lai, Ang-Hsun Tsai, Chia-Wei Ting, Ko-Han Lin, Jing-Chi Ling, Chia-En Tsai

    Abstract: In this letter, we study the deployment of Unmanned Aerial Vehicle mounted Base Stations (UAV-BSs) in multi-UAV cellular networks. We model the multi-UAV deployment problem as a user satisfaction maximization problem, that is, maximizing the proportion of served ground users (GUs) that meet a given minimum data rate requirement. We propose an interference-aware deployment (IAD) algorithm for servi…

    Submitted 6 April, 2023; originally announced April 2023.

    Comments: 5 pages, 3 figures, to appear in IEEE Wireless Communications Letters

  21. Real-time modelling of observation filter in the Remote Microphone Technique for an Active Noise Control application

    Authors: Chung Kwan Lai, Bhan Lam, Dongyuan Shi, Woon-Seng Gan

    Abstract: The remote microphone technique (RMT) is often used in active noise control (ANC) applications to overcome design constraints in microphone placements by estimating the acoustic pressure at inconvenient locations using a pre-calibrated observation filter (OF), albeit limited to stationary primary acoustic fields. While the OF estimation in varying primary fields can be significantly improved throu…

    Submitted 21 March, 2023; originally announced March 2023.

    Comments: 5 pages, 5 figures. Submitted to 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2023)

    Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2023, pp. 1-5

  22. arXiv:2303.08809  [pdf, other]

    cs.CL eess.AS

    Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences

    Authors: Yuan Tseng, Cheng-I Lai, Hung-yi Lee

    Abstract: Past work on unsupervised parsing is constrained to written form. In this paper, we present the first study on unsupervised spoken constituency parsing given unlabeled spoken sentences and unpaired textual data. The goal is to determine the spoken sentences' hierarchical syntactic structure in the form of constituency parse trees, such that each node is a span of audio that corresponds to a consti…

    Submitted 9 May, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023; updated compute resource acknowledgements

  23. arXiv:2303.00146  [pdf, other]

    cs.HC cs.RO cs.SD eess.AS

    I Know Your Feelings Before You Do: Predicting Future Affective Reactions in Human-Computer Dialogue

    Authors: Yuanchao Li, Koji Inoue, Leimin Tian, Changzeng Fu, Carlos Ishi, Hiroshi Ishiguro, Tatsuya Kawahara, Catherine Lai

    Abstract: Current Spoken Dialogue Systems (SDSs) often serve as passive listeners that respond only after receiving user speech. To achieve human-like dialogue, we propose a novel future prediction architecture that allows an SDS to anticipate future affective reactions based on its current behaviors before the user speaks. In this work, we investigate two scenarios: speech and laughter. In speech, we propo…

    Submitted 17 March, 2023; v1 submitted 28 February, 2023; originally announced March 2023.

    Comments: Accepted to CHI2023 Late-Breaking Work

  24. arXiv:2301.12686  [pdf, other]

    cs.LG cs.AI cs.CV cs.SD eess.AS

    GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration

    Authors: Naoki Murata, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon

    Abstract: Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements. However, existing approaches require knowledge of the linear operator. In this paper, we propose GibbsDDRM, an extension of Denoising Diffusion Restoration Models (DDRM) to a blind setting in which the linear measureme…

    Submitted 27 June, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

  25. arXiv:2211.05163  [pdf, other]

    cs.MM cs.SD eess.AS

    Multimodal Dyadic Impression Recognition via Listener Adaptive Cross-Domain Fusion

    Authors: Yuanchao Li, Peter Bell, Catherine Lai

    Abstract: As a sub-branch of affective computing, impression recognition, e.g., perception of speaker characteristics such as warmth or competence, is potentially a critical part of both human-human conversations and spoken dialogue systems. Most research has studied impressions only from the behaviors expressed by the speaker or the response from the listener, yet ignored their latent connection. In this p…

    Submitted 16 February, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP2023. arXiv admin note: substantial text overlap with arXiv:2203.13932

  26. arXiv:2211.04124  [pdf, other]

    eess.AS cs.LG cs.SD

    Unsupervised vocal dereverberation with diffusion-based generative models

    Authors: Koichi Saito, Naoki Murata, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuhta Takida, Takao Fukui, Yuki Mitsufuji

    Abstract: Removing reverb from reverberant music is a necessary technique to clean up audio for downstream music manipulations. Reverberation of music contains two categories, natural reverb, and artificial reverb. Artificial reverb has a wider diversity than natural reverb due to its various parameter setups and reverberation types. However, recent supervised dereverberation methods may fail because they r…

    Submitted 8 November, 2022; originally announced November 2022.

    Comments: 6 pages, 2 figures, submitted to ICASSP 2023

  27. arXiv:2211.01522  [pdf, other]

    cs.LG cs.SD eess.AS

    Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing

    Authors: Yonggan Fu, Yang Zhang, Kaizhi Qian, Zhifan Ye, Zhongzhi Yu, Cheng-I Lai, Yingyan Lin

    Abstract: Self-supervised learning (SSL) for rich speech representations has achieved empirical success in low-resource Automatic Speech Recognition (ASR) and other speech processing tasks, which can mitigate the necessity of a large amount of transcribed speech and thus has driven a growing demand for on-device ASR and other speech processing. However, advanced speech SSL models have become increasingly la…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted at NeurIPS 2022

  28. arXiv:2210.16797  [pdf, ps, other]

    cs.NI eess.SP eess.SY

    Adaptive and Fair Deployment Approach to Balance Offload Traffic in Multi-UAV Cellular Networks

    Authors: Chuan-Chi Lai, Bhola, Ang-Hsun Tsai, Li-Chun Wang

    Abstract: Unmanned aerial vehicle-aided communication (UAV-BS) is a promising solution to establish rapid wireless connectivity in sudden/temporary crowded events because it offers more flexibility and mobility than a conventional ground base station (GBS). Because of these benefits, UAV-BSs can easily be deployed at high altitudes to provide more line of sight (LoS) links than GBS. Therefore, users on…

    Submitted 12 November, 2022; v1 submitted 30 October, 2022; originally announced October 2022.

    Comments: 15 pages, 9 figures, to appear in IEEE Transactions on Vehicular Technology

  29. arXiv:2210.02595  [pdf, other]

    eess.AS cs.CL cs.SD

    Exploration of A Self-Supervised Speech Model: A Study on Emotional Corpora

    Authors: Yuanchao Li, Yumnah Mohamied, Peter Bell, Catherine Lai

    Abstract: Self-supervised speech models have grown fast during the past few years and have proven feasible for use in various downstream tasks. Some recent work has started to look at the characteristics of these models, yet many concerns have not been fully addressed. In this work, we conduct a study on emotional corpora to explore a popular self-supervised model -- wav2vec 2.0. Via a set of quantitative a…

    Submitted 12 December, 2022; v1 submitted 5 October, 2022; originally announced October 2022.

    Comments: Accepted to SLT 2022

  30. arXiv:2204.09224  [pdf, other]

    cs.SD cs.AI eess.AS

    ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers

    Authors: Kaizhi Qian, Yang Zhang, Heting Gao, Junrui Ni, Cheng-I Lai, David Cox, Mark Hasegawa-Johnson, Shiyu Chang

    Abstract: Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks. Since the majority of the downstream tasks of SSL learning in speech largely focus on the content information in speech, the most desirable speech representations should be able to disentangle unwanted va…

    Submitted 23 June, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

  31. arXiv:2204.02524  [pdf, other]

    cs.SD cs.CL eess.AS

    Simple and Effective Unsupervised Speech Synthesis

    Authors: Alexander H. Liu, Cheng-I Jeff Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James Glass

    Abstract: We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe. The framework leverages recent work in unsupervised speech recognition as well as existing neural-based speech synthesis. Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus. Experiments demonstra…

    Submitted 20 April, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: preprint, equal contribution from first two authors

  32. arXiv:2203.14640  [pdf, other]

    eess.AS

    Analysis of Voice Conversion and Code-Switching Synthesis Using VQ-VAE

    Authors: Shuvayanti Das, Jennifer Williams, Catherine Lai

    Abstract: This paper presents an analysis of speech synthesis quality achieved by simultaneously performing voice conversion and language code-switching using multilingual VQ-VAE speech synthesis in German, French, English and Italian. In this paper, we utilize VQ code indices representing phone information from VQ-VAE to perform code-switching and a VQ speaker code to perform voice conversion in a single s…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to Interspeech 2022

  33. arXiv:2203.13932  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    A Cross-Domain Approach for Continuous Impression Recognition from Dyadic Audio-Visual-Physio Signals

    Authors: Yuanchao Li, Catherine Lai

    Abstract: The impression we make on others depends not only on what we say, but also, to a large extent, on how we say it. As a sub-branch of affective computing and social signal processing, impression recognition has proven critical in both human-human conversations and spoken dialogue systems. However, most research has studied impressions only from the signals expressed by the emitter, ignoring the resp…

    Submitted 25 March, 2022; originally announced March 2022.

    Comments: 5 pages, 2 figures, submitted to INTERSPEECH 2022

  34. arXiv:2203.09599  [pdf, ps, other]

    cs.RO cs.CL cs.HC eess.AS

    Robotic Speech Synthesis: Perspectives on Interactions, Scenarios, and Ethics

    Authors: Yuanchao Li, Catherine Lai

    Abstract: In recent years, many works have investigated the feasibility of conversational robots for performing specific tasks, such as healthcare and interview. Along with this development comes a practical issue: how should we synthesize robotic voices to meet the needs of different situations? In this paper, we discuss this issue from three perspectives: 1) the difficulties of synthesizing non-verbal and…

    Submitted 17 March, 2022; originally announced March 2022.

    Comments: Accepted for the HRI 2022 Workshop "Robo-Identity: Exploring Artificial Identity and Emotion via Speech Interactions" at HRI 2022, 7 March 2022

  35. arXiv:2203.06849  [pdf, other]

    cs.CL cs.SD eess.AS

    SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities

    Authors: Hsiang-Sheng Tsai, Heng-Jui Chang, Wen-Chin Huang, Zili Huang, Kushal Lakhotia, Shu-wen Yang, Shuyan Dong, Andy T. Liu, Cheng-I Jeff Lai, Jiatong Shi, Xuankai Chang, Phil Hall, Hsuan-Jui Chen, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee

    Abstract: Transfer learning has proven to be crucial in advancing the state of speech and natural language processing research in recent years. In speech, a model pre-trained by self-supervised learning transfers remarkably well on multiple tasks. However, the lack of a consistent evaluation methodology is limiting towards a holistic understanding of the efficacy of such models. SUPERB was a step towards in…

    Submitted 14 March, 2022; originally announced March 2022.

    Comments: ACL 2022 main conference

  36. arXiv:2202.10777  [pdf, other]

    eess.AS cs.AI cs.SD q-bio.QM

    Continuous Speech for Improved Learning Pathological Voice Disorders

    Authors: Syu-Siang Wang, Chi-Te Wang, Chih-Chung Lai, Yu Tsao, Shih-Hau Fang

    Abstract: Goal: Numerous studies had successfully differentiated normal and abnormal voice samples. Nevertheless, further classification had rarely been attempted. This study proposes a novel approach, using continuous Mandarin speech instead of a single vowel, to classify four common voice disorders (i.e. functional dysphonia, neoplasm, phonotrauma, and vocal palsy). Methods: In the proposed framework, aco…

    Submitted 22 February, 2022; originally announced February 2022.

  37. arXiv:2110.15684  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD

    Fusing ASR Outputs in Joint Training for Speech Emotion Recognition

    Authors: Yuanchao Li, Peter Bell, Catherine Lai

    Abstract: Alongside acoustic information, linguistic features based on speech transcripts have been proven useful in Speech Emotion Recognition (SER). However, due to the scarcity of emotion labelled data and the difficulty of recognizing emotional speech, it is hard to obtain reliable linguistic features and models in this research area. In this paper, we propose to fuse Automatic Speech Recognition (ASR)…

    Submitted 17 March, 2022; v1 submitted 29 October, 2021; originally announced October 2021.

    Comments: Accepted for ICASSP 2022

  38. arXiv:2110.09784  [pdf, other]

    cs.SD cs.AI eess.AS

    SSAST: Self-Supervised Audio Spectrogram Transformer

    Authors: Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, James Glass

    Abstract: Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology ca…

    Submitted 10 February, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

    Comments: Accepted at AAAI2022. Code at https://github.com/YuanGongND/ssast

  39. arXiv:2110.01147  [pdf, other]

    cs.SD cs.CL eess.AS

    On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis

    Authors: Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox, James Glass

    Abstract: Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explored several…

    Submitted 27 October, 2021; v1 submitted 3 October, 2021; originally announced October 2021.

  40. arXiv:2107.02527  [pdf, other]

    eess.AS cs.CL cs.SD

    Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm

    Authors: Elijah Gutierrez, Pilar Oplustil-Gallegos, Catherine Lai

    Abstract: Text-to-Speech synthesis systems are generally evaluated using Mean Opinion Score (MOS) tests, where listeners score samples of synthetic speech on a Likert scale. A major drawback of MOS tests is that they only offer a general measure of overall quality (i.e., the naturalness of an utterance) and so cannot tell us where exactly synthesis errors occur. This can make evaluation of the appropriateness…

    Submitted 6 July, 2021; originally announced July 2021.

    Comments: Accepted to Speech Synthesis Workshop 2019: https://ssw11.hte.hu/en/

  41. arXiv:2106.05933  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition

    Authors: Cheng-I Jeff Lai, Yang Zhang, Alexander H. Liu, Shiyu Chang, Yi-Lun Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer Khurana, David Cox, James Glass

    Abstract: Self-supervised speech representation learning (speech SSL) has demonstrated the benefit of scale in learning rich representations for Automatic Speech Recognition (ASR) with limited paired data, such as wav2vec 2.0. We investigate the existence of sparse subnetworks in pre-trained speech SSL models that achieve even better low-resource ASR results. However, directly applying widely adopted prunin…

    Submitted 26 October, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

  42. arXiv:2105.01051  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    SUPERB: Speech processing Universal PERformance Benchmark

    Authors: Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee

    Abstract: Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge…

    Submitted 15 October, 2021; v1 submitted 3 May, 2021; originally announced May 2021.

    Comments: To appear in Interspeech 2021

  43. arXiv:2012.00250  [pdf, other]

    cs.SD cs.HC eess.AS

    Strike on Stage: a percussion and media performance

    Authors: Charles Martin, Chi-Hsia Lai

    Abstract: This paper describes Strike on Stage, an interface and corresponding audio-visual performance work developed and performed in 2010 by percussionists and media artists Chi-Hsia Lai and Charles Martin. The concept of Strike on Stage is to integrate computer visuals and sound into an improvised percussion performance. A large projection surface is positioned directly behind the performers, while a co…

    Submitted 30 November, 2020; originally announced December 2020.

    Journal ref: Proceedings of the International Conference on New Interfaces for Musical Expression (2011) pp. 142-143

  44. arXiv:2010.11081  [pdf, other]

    eess.IV cs.CV

    Anatomically-Informed Deep Learning on Contrast-Enhanced Cardiac MRI for Scar Segmentation and Clinical Feature Extraction

    Authors: Haley G. Abramson, Dan M. Popescu, Rebecca Yu, Changxin Lai, Julie K. Shade, Katherine C. Wu, Mauro Maggioni, Natalia A. Trayanova

    Abstract: Visualizing disease-induced scarring and fibrosis in the heart on cardiac magnetic resonance (CMR) imaging with contrast enhancement (LGE) is paramount in characterizing disease progression and quantifying pathophysiological substrates of arrhythmias. However, segmentation and scar/fibrosis identification from LGE-CMR is an intensive manual process prone to large inter-observer variability. Here,…

    Submitted 8 January, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

    Comments: Haley G. Abramson and Dan M. Popescu contributed equally to this work

  45. arXiv:2008.09519  [pdf, ps, other]

    cs.NI cs.DC eess.SY

    The Coverage Overlapping Problem of Serving Arbitrary Crowds in 3D Drone Cellular Networks

    Authors: Chuan-Chi Lai, Li-Chun Wang, Zhu Han

    Abstract: Providing coverage for flash crowds is an important application for drone base stations (DBSs). However, any arbitrary crowd is likely to be distributed at a high density. Under the condition for each DBS to serve the same number of ground users, multiple DBSs may be placed at the same horizontal location but different altitudes and will cause severe co-channel interference, to which we refer as t…

    Submitted 20 August, 2020; originally announced August 2020.

    Comments: 18 pages, 10 figures, to appear in IEEE Transactions on Mobile Computing. arXiv admin note: text overlap with arXiv:1909.11554

  46. Quasi-Deterministic Channel Model for mmWaves: Mathematical Formalization and Validation

    Authors: Mattia Lecci, Michele Polese, Chiehping Lai, Jian Wang, Camillo Gentile, Nada Golmie, Michele Zorzi

    Abstract: 5G and beyond networks will use, for the first time ever, the millimeter wave (mmWave) spectrum for mobile communications. Accurate performance evaluation is fundamental to the design of reliable mmWave networks, with accuracy rooted in the fidelity of the channel models. At mmWaves, the model must account for the spatial characteristics of propagation since networks will employ highly directional…

    Submitted 9 February, 2021; v1 submitted 1 June, 2020; originally announced June 2020.

    Comments: 6 pages, 5 figures, 1 table, presented at IEEE GLOBECOM 2020. Please cite it as: M. Lecci, M. Polese, C. Lai, J. Wang, C. Gentile, N. Golmie, M. Zorzi, "Quasi-Deterministic Channel Model for mmWaves: Mathematical Formalization and Validation," IEEE Global Communications Conference (GLOBECOM), Dec. 2020, Taipei, Taiwan

  47. arXiv:2005.07884  [pdf, other]

    eess.AS cs.SD

    Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction

    Authors: Yi Zhao, Haoyu Li, Cheng-I Lai, Jennifer Williams, Erica Cooper, Junichi Yamagishi

    Abstract: Vector Quantized Variational AutoEncoders (VQ-VAE) are a powerful representation learning framework that can discover discrete groups of features from a speech signal without supervision. Until now, the VQ-VAE architecture has previously modeled individual types of speech features, such as only phones or only F0. This paper introduces an important extension to VQ-VAE for learning F0-related supras…

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  48. arXiv:2005.01245  [pdf, other]

    eess.AS

    Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?

    Authors: Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Junichi Yamagishi

    Abstract: Previous work on speaker adaptation for end-to-end speech synthesis still falls short in speaker similarity. We investigate an orthogonal approach to the current speaker adaptation paradigms, speaker augmentation, by creating artificial speakers and by taking advantage of low-quality data. The base Tacotron2 model is modified to account for the channel and dialect factors inherent in these corpora…

    Submitted 7 August, 2020; v1 submitted 3 May, 2020; originally announced May 2020.

    Comments: Accepted to Interspeech 2020

  49. arXiv:2003.09077  [pdf, other]

    cs.LG eess.SP math.NA math.OC stat.ML

    Inverse Problems, Deep Learning, and Symmetry Breaking

    Authors: Kshitij Tayal, Chieh-Hsin Lai, Vipin Kumar, Ju Sun

    Abstract: In many physical systems, inputs related by intrinsic system symmetries are mapped to the same output. When inverting such systems, i.e., solving the associated inverse problems, there is no unique solution. This causes fundamental difficulties for deploying the emerging end-to-end deep learning approach. Using the generalized phase retrieval problem as an illustrative example, we show that carefu…

    Submitted 19 March, 2020; originally announced March 2020.

  50. arXiv:2003.06686  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD stat.ML

    Perception of prosodic variation for speech synthesis using an unsupervised discrete representation of F0

    Authors: Zack Hodari, Catherine Lai, Simon King

    Abstract: In English, prosody adds a broad range of information to segment sequences, from information structure (e.g. contrast) to stylistic variation (e.g. expression of emotion). However, when learning to control prosody in text-to-speech voices, it is not clear what exactly the control is modifying. Existing research on discrete representation learning for prosody has demonstrated high naturalness, but…

    Submitted 14 March, 2020; originally announced March 2020.

    Comments: Published to the 10th ISCA International Conference on Speech Prosody (SP2020)