-
Renal digital pathology visual knowledge search platform based on language large model and book knowledge
Authors:
Xiaomin Lv,
Chong Lai,
Liya Ding,
Maode Lai,
Qingrong Sun
Abstract:
Large models have become mainstream, yet their applications in digital pathology still require exploration. Meanwhile, renal pathology images play an important role in the diagnosis of renal diseases. We segmented images from 60 renal pathology books and paired them with their corresponding text descriptions, performed clustering analysis on all image and text description features extracted by large models, and ultimately built a retrieval system based on the semantic features of large models. Based on the above analysis, we established a knowledge base of 10,317 renal pathology images with paired text descriptions, and then evaluated the semantic feature capabilities of 4 large models (GPT-2, Gemma, Llama, and Qwen) and the image-based feature capabilities of the DINOv2 large model. Furthermore, we built a semantic retrieval system that retrieves pathological images from text descriptions, and named it RppD (aidp.zjsru.edu.cn).
Submitted 26 May, 2024;
originally announced June 2024.
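As a rough illustration of text-to-image retrieval over paired embeddings, the sketch below ranks stored images by cosine similarity between a query vector and the embedding of each image's paired description. It is not the RppD implementation: the toy vectors, the `retrieve` helper, and the image identifiers are hypothetical, and a real system would obtain the vectors from a language model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve(query_vec, index, top_k=3):
    """Return the ids of the top_k entries most similar to the query.

    `index` maps a (hypothetical) image id to the embedding of its
    paired text description.
    """
    scored = [(cosine_similarity(query_vec, vec), image_id)
              for image_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored[:top_k]]


# Toy 3-dimensional embeddings standing in for model features.
toy_index = {
    "img_001": [0.9, 0.1, 0.0],
    "img_002": [0.1, 0.9, 0.0],
    "img_003": [0.7, 0.3, 0.1],
}
results = retrieve([1.0, 0.0, 0.0], toy_index, top_k=2)
```

With these toy vectors the query is closest to `img_001`, followed by `img_003`.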
-
Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques
Authors:
Yuanchao Li,
Peter Bell,
Catherine Lai
Abstract:
Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SER performance using ASR transcripts with varying Word Error Rates (WERs) on well-known corpora: IEMOCAP, CMU-MOSI, and MSP-Podcast. Our evaluation includes text-only and bimodal SER with diverse fusion techniques, aiming for a comprehensive analysis that uncovers novel findings and challenges faced by current SER research. Additionally, we propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript. This research is expected to provide insights into SER with ASR assistance, especially for real-world applications.
Submitted 12 June, 2024;
originally announced June 2024.
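For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference and a hypothesis, divided by the number of reference words. The sketch below is a minimal version that assumes whitespace tokenization and no text normalization; the function name is illustrative and not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
example_wer = word_error_rate("the cat sat", "the bat sat")
```

Because insertions count as errors, this value can exceed 1.0 when the hypothesis is much longer than the reference.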
-
1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem
Authors:
Mingjie Chen,
Hezhao Zhang,
Yuanchao Li,
Jiachen Luo,
Wen Wu,
Ziyang Ma,
Peter Bell,
Catherine Lai,
Joshua Reiss,
Lin Wang,
Philip C. Woodland,
Xie Chen,
Huy Phan,
Thomas Hain
Abstract:
Speech emotion recognition is a challenging classification task with natural emotional speech, especially when the distribution of emotion types is imbalanced in the training and test data. In this case, it is more difficult for a model to learn to separate minority classes, resulting in those sometimes being ignored or frequently misclassified. Previous work has utilised class weighted loss for training, but problems remain as it sometimes causes over-fitting for minor classes or under-fitting for major classes. This paper presents the system developed by a multi-site team for the participation in the Odyssey 2024 Emotion Recognition Challenge Track-1. The challenge data has the aforementioned properties and therefore the presented systems aimed to tackle these issues, by introducing focal loss in optimisation when applying class weighted loss. Specifically, the focal loss is further weighted by prior-based class weights. Experimental results show that combining these two approaches brings better overall performance, by sacrificing performance on major classes. The system further employs a majority voting strategy to combine the outputs of an ensemble of 7 models. The models are trained independently, using different acoustic features and loss functions - with the aim to have different properties for different data. Hence these models show different performance preferences on major classes and minor classes. The ensemble system output obtained the best performance in the challenge, ranking top-1 among 68 submissions. It also outperformed all single models in our set. On the Odyssey 2024 Emotion Recognition Challenge Task-1 data the system obtained a Macro-F1 score of 35.69% and an accuracy of 37.32%.
Submitted 30 May, 2024;
originally announced May 2024.
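The two ideas named in the abstract, a focal loss scaled by class weights and majority voting over an ensemble, can be sketched for a single example as follows. This is a minimal illustration, not the authors' system: the abstract does not give the formula for the prior-based weights, so the inverse-frequency weighting below is an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter


def inverse_prior_weights(class_counts):
    """Assumed weighting: inversely proportional to class frequency."""
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * count) for count in class_counts]


def weighted_focal_loss(probs, target, class_weights, gamma=2.0):
    """Focal loss for one example, scaled by the target's class weight.

    probs: predicted class probabilities (summing to 1).
    gamma: focusing parameter; gamma = 0 gives weighted cross-entropy.
    """
    p_t = max(probs[target], 1e-12)  # guard against log(0)
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


def majority_vote(predictions):
    """Most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


weights = inverse_prior_weights([90, 10])  # majority class, minority class
easy = weighted_focal_loss([0.1, 0.9], 1, weights)  # confident and correct
hard = weighted_focal_loss([0.9, 0.1], 1, weights)  # confident and wrong
ensemble_label = majority_vote(["sad", "happy", "sad"])
```

The `(1 - p_t) ** gamma` factor shrinks the loss on examples the model already classifies confidently, while the class weight raises the contribution of minority classes.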
-
SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation
Authors:
Koichi Saito,
Dongjun Kim,
Takashi Shibuya,
Chieh-Hsin Lai,
Zhi Zhong,
Yuhta Takida,
Yuki Mitsufuji
Abstract:
Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for the creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error to align them with their artistic intentions. To address this issue, we introduce Sound Consistency Trajectory Models (SoundCTM). Our model enables flexible transitioning between high-quality 1-step sound generation and superior sound quality through multi-step generation. This allows creators to initially control sounds with 1-step samples before refining them through multi-step generation. While CTM fundamentally achieves flexible 1-step and multi-step generation, its impressive performance heavily depends on an additional pretrained feature extractor and an adversarial loss, which are expensive to train and not always available in other domains. Thus, we reframe CTM's training framework and introduce a novel feature distance by utilizing the teacher's network for a distillation loss. Additionally, while distilling classifier-free guided trajectories, we train conditional and unconditional student models simultaneously and interpolate between these models during inference. We also propose training-free controllable frameworks for SoundCTM, leveraging its flexible sampling capability. SoundCTM achieves both promising 1-step and multi-step real-time sound generation without using any extra off-the-shelf networks. Furthermore, we demonstrate SoundCTM's capability of controllable sound generation in a training-free manner. Our codes, pretrained models, and audio samples are available at https://github.com/sony/soundctm.
Submitted 10 June, 2024; v1 submitted 28 May, 2024;
originally announced May 2024.
-
Crossmodal ASR Error Correction with Discrete Speech Units
Authors:
Yuanchao Li,
Pinzhen Chen,
Peter Bell,
Catherine Lai
Abstract:
ASR remains unsatisfactory in scenarios where the speaking style diverges from that used to train ASR systems, resulting in erroneous transcripts. To address this, ASR Error Correction (AEC), a post-ASR processing approach, is required. In this work, we tackle an understudied issue: the Low-Resource Out-of-Domain (LROOD) problem, by investigating crossmodal AEC on very limited downstream data with 1-best hypothesis transcription. We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon, shedding light on appropriate training schemes for LROOD data. Moreover, we propose the incorporation of discrete speech units to align with and enhance the word embeddings for improving AEC quality. Results from multiple corpora and several evaluation metrics demonstrate the feasibility and efficacy of our proposed AEC approach on LROOD data, as well as its generalizability and superiority on large-scale data. Finally, a study on speech emotion recognition confirms that our model produces ASR error-robust transcripts suitable for downstream applications.
Submitted 26 May, 2024;
originally announced May 2024.
-
A Large-Scale Evaluation of Speech Foundation Models
Authors:
Shu-wen Yang,
Heng-Jui Chang,
Zili Huang,
Andy T. Liu,
Cheng-I Lai,
Haibin Wu,
Jiatong Shi,
Xuankai Chang,
Hsiang-Sheng Tsai,
Wen-Chin Huang,
Tzu-hsun Feng,
Po-Han Chi,
Yist Y. Lin,
Yung-Sung Chuang,
Tzu-Hsien Huang,
Wei-Cheng Tseng,
Kushal Lakhotia,
Shang-Wen Li,
Abdelrahman Mohamed,
Shinji Watanabe,
Hung-yi Lee
Abstract:
The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech. We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads. Combining our results with community submissions, we verify that the foundation model paradigm is promising for speech, and our multi-tasking framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. For reproducibility and extensibility, we have developed a long-term maintained platform that enables deterministic benchmarking, allows for result sharing via an online leaderboard, and promotes collaboration through a community-driven benchmark database to support new development cycles. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models, the correctness of the weighted-sum benchmarking protocol and the statistical significance and robustness of the benchmark.
Submitted 29 May, 2024; v1 submitted 14 April, 2024;
originally announced April 2024.
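The weighted-sum protocol mentioned in the abstract is commonly understood as combining the hidden states of all layers of the frozen model with softmax-normalized scalar weights that are learned together with the lightweight task head. The sketch below shows only the combination step for one frame; the function names and toy values are illustrative, and the learning of the layer logits is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_features, layer_logits):
    """Combine per-layer feature vectors into one vector.

    layer_features: one feature vector per layer of the frozen model.
    layer_logits: one scalar per layer (learned in practice, fixed here).
    """
    weights = softmax(layer_logits)
    dim = len(layer_features[0])
    return [
        sum(w * layer[d] for w, layer in zip(weights, layer_features))
        for d in range(dim)
    ]


# Two toy layers with 2-dimensional features; equal logits give the mean.
combined = weighted_layer_sum([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

A large logit for one layer pushes its weight toward 1, so the head effectively reads from that layer alone.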
-
Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition
Authors:
Alexandra Saliba,
Yuanchao Li,
Ramon Sanabria,
Catherine Lai
Abstract:
The efficacy of self-supervised speech models has been validated, yet the optimal utilization of their representations remains challenging across diverse tasks. In this study, we delve into Acoustic Word Embeddings (AWEs), a fixed-length feature derived from continuous representations, to explore their advantages in specific tasks. AWEs have previously shown utility in capturing acoustic discriminability. In light of this, we propose measuring layer-wise similarity between AWEs and word embeddings, aiming to further investigate the inherent context within AWEs. Moreover, we evaluate the contribution of AWEs, in comparison to other types of speech features, in the context of Speech Emotion Recognition (SER). Through a comparative experiment and a layer-wise accuracy analysis on two distinct corpora, IEMOCAP and ESD, we explore differences between AWEs and raw self-supervised representations, as well as the proper utilization of AWEs alone and in combination with word embeddings. Our findings underscore the acoustic context conveyed by AWEs and showcase the highly competitive SER accuracies by appropriately employing AWEs.
Submitted 4 February, 2024;
originally announced February 2024.
-
Self-Supervised Millimeter Wave Indoor Localization using Tiny Neural Networks
Authors:
Anish Shastri,
Steve Blandino,
Camillo Gentile,
Chiehping Lai,
Paolo Casari
Abstract:
The quasi-optical propagation of millimeter-wave signals enables high-accuracy localization algorithms that employ geometric approaches or machine learning models. However, most algorithms require information on the indoor environment, may entail the collection of large training datasets, or bear an infeasible computational burden for commercial off-the-shelf (COTS) devices. In this work, we propose to use tiny neural networks (NNs) to learn the relationship between angle difference-of-arrival (ADoA) measurements and locations of a receiver in an indoor environment. To relieve training data collection efforts, we resort to a self-supervised approach by bootstrapping the training of our neural network through location estimates obtained from a state-of-the-art localization algorithm. We evaluate our scheme via mmWave measurements from indoor 60-GHz double-directional channel sounding. We process the measurements to yield dominant multipath components, use the corresponding angles to compute ADoA values, and finally obtain location fixes. Results show that the tiny NN achieves sub-meter errors in 74% of the cases, thus performing as well as or even better than the state-of-the-art algorithm, with significantly lower computational complexity.
Submitted 2 January, 2024;
originally announced January 2024.
-
On the Language Encoder of Contrastive Cross-modal Models
Authors:
Mengjie Zhao,
Junya Ono,
Zhi Zhong,
Chieh-Hsin Lai,
Yuhta Takida,
Naoki Murata,
Wei-Hsiang Liao,
Takashi Shibuya,
Hiromi Wakaki,
Yuki Mitsufuji
Abstract:
Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding training affect language encoder quality and cross-modal task performance. In VL pretraining, we found that sentence embedding training improves language encoder quality and aids in cross-modal tasks, improving contrastive VL models such as CyCLIP. In contrast, AL pretraining benefits less from sentence embedding training, which may result from the limited amount of pretraining data. We analyze the representation spaces to understand the strengths of sentence embedding training, and find that it improves text-space uniformity, at the cost of decreased cross-modal alignment.
Submitted 20 October, 2023;
originally announced October 2023.
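The abstract does not say how uniformity and alignment are measured. One widely used pair of definitions, assumed here purely for illustration, takes alignment as the mean squared distance between L2-normalized paired embeddings and uniformity as the log of the mean Gaussian potential over all pairs within one embedding space. Lower is better for both quantities under these definitions; the function names and the temperature `t = 2` are assumptions.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(text_vecs, other_vecs):
    """Mean squared distance between paired, normalized embeddings."""
    pairs = list(zip(text_vecs, other_vecs))
    return sum(
        squared_distance(l2_normalize(t), l2_normalize(o)) for t, o in pairs
    ) / len(pairs)


def uniformity(vecs, t=2.0):
    """Log of the mean pairwise Gaussian potential in one space."""
    normed = [l2_normalize(v) for v in vecs]
    potentials = [
        math.exp(-t * squared_distance(a, b))
        for a, b in combinations(normed, 2)
    ]
    return math.log(sum(potentials) / len(potentials))


# Opposite unit vectors are maximally spread; identical ones are not.
spread = uniformity([[1.0, 0.0], [-1.0, 0.0]])
clustered = uniformity([[1.0, 0.0], [1.0, 0.0]])
```

Under these assumed definitions, the trade-off described in the abstract would appear as a lower uniformity value for text embeddings together with a higher alignment value for text-image or text-audio pairs.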
-
Audio-Visual Neural Syntax Acquisition
Authors:
Cheng-I Jeff Lai,
Freda Shi,
Puyuan Peng,
Yoon Kim,
Kevin Gimpel,
Shiyu Chang,
Yung-Sung Chuang,
Saurabhchand Bhati,
David Cox,
David Harwath,
Yang Zhang,
Karen Livescu,
James Glass
Abstract:
We study phrase structure induction from visually-grounded speech. The core idea is to first segment the speech waveform into sequences of word segments, and subsequently induce phrase structure using the inferred segment-level continuous representations. We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without ever being exposed to text. By training on paired images and spoken captions, AV-NSL exhibits the capability to infer meaningful phrase structures that are comparable to those derived by naturally-supervised text parsers, for both English and German. Our findings extend prior work in unsupervised language acquisition from speech and grounded grammar induction, and present one approach to bridge the gap between the two topics.
Submitted 11 October, 2023;
originally announced October 2023.
-
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
Authors:
Yuan Tseng,
Layne Berry,
Yi-Ting Chen,
I-Hsiang Chiu,
Hsuan-Hao Lin,
Max Liu,
Puyuan Peng,
Yi-Jen Shih,
Hung-Yu Wang,
Haibin Wu,
Po-Yao Huang,
Chun-Mao Lai,
Shang-Wen Li,
David Harwath,
Yu Tsao,
Shinji Watanabe,
Abdelrahman Mohamed,
Chi-Luen Feng,
Hung-yi Lee
Abstract:
Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks, emphasizing the need for future study on improving universal model performance. In addition, we show that representations may be improved with intermediate-task fine-tuning and audio event classification with AudioSet serves as a strong intermediate task. We release our benchmark with evaluation code and a model submission platform to encourage further research in audio-visual learning.
Submitted 19 March, 2024; v1 submitted 19 September, 2023;
originally announced September 2023.
-
Instruction-Following Speech Recognition
Authors:
Cheng-I Jeff Lai,
Zhiyun Lu,
Liangliang Cao,
Ruoming Pang
Abstract:
Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models' speech understanding and "reasoning" capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. This enables a multitude of speech recognition tasks -- ranging from transcript manipulation to summarization -- without relying on predefined command sets. Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring LLMs or pre-trained speech modules. It also offers selective transcription options based on instructions like "transcribe first half and then turn off listening," providing an additional layer of privacy and safety compared to existing LLMs. Our findings highlight the significant potential of instruction-following training to advance speech foundation models.
Submitted 18 September, 2023;
originally announced September 2023.
-
VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance
Authors:
Carlos Hernandez-Olivan,
Koichi Saito,
Naoki Murata,
Chieh-Hsin Lai,
Marco A. Martínez-Ramirez,
Wei-Hsiang Liao,
Yuki Mitsufuji
Abstract:
Restoring degraded music signals is essential to enhance audio quality for downstream music manipulation. Recent diffusion-based music restoration methods have demonstrated impressive performance, and among them, diffusion posterior sampling (DPS) stands out given its intrinsic properties, making it versatile across various restoration tasks. In this paper, we identify potential issues that degrade the performance of current DPS-based methods and introduce ways to mitigate them, inspired by diverse diffusion guidance techniques including the RePaint (RP) strategy and the Pseudoinverse-Guided Diffusion Models ($\Pi$GDM). We demonstrate our methods for the vocal declipping and bandwidth extension tasks under various levels of distortion and cutoff frequency, respectively. In both tasks, our methods outperform the current DPS-based music restoration benchmarks. Examples of the restored audio samples are available at http://carlosholivan.github.io/demos/audio-restoration-2023.html.
Submitted 13 September, 2023;
originally announced September 2023.
-
The Sound Demixing Challenge 2023 – Music Demixing Track
Authors:
Giorgio Fabbro,
Stefan Uhlich,
Chieh-Hsin Lai,
Woosung Choi,
Marco Martínez-Ramírez,
Weihsiang Liao,
Igor Gadelha,
Geraldo Ramos,
Eddie Hsu,
Hugo Rodrigues,
Fabian-Robert Stöter,
Alexandre Défossez,
Yi Luo,
Jianwei Yu,
Dipam Chakraborty,
Sharada Mohanty,
Roman Solovyev,
Alexander Stempkovskiy,
Tatiana Habruseva,
Nabarun Goswami,
Tatsuya Harada,
Minseok Kim,
Jun Hyung Lee,
Yuanliang Dong,
Xinran Zhang
, et al. (2 additional authors not shown)
Abstract:
This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge (SDX'23). We provide a summary of the challenge setup and introduce the task of robust music source separation (MSS), i.e., training MSS models in the presence of errors in the training data. We propose a formalization of the errors that can occur in the design of a training dataset for MSS systems and introduce two new datasets that simulate such errors: SDXDB23_LabelNoise and SDXDB23_Bleeding. We describe the methods that achieved the highest scores in the competition. Moreover, we present a direct comparison with the previous edition of the challenge (the Music Demixing Challenge 2021): the best performing system achieved an improvement of over 1.6dB in signal-to-distortion ratio over the winner of the previous competition, when evaluated on MDXDB21. Besides relying on the signal-to-distortion ratio as an objective metric, we also performed a listening test with renowned producers and musicians to study the perceptual quality of the systems, and we report the results here. Finally, we provide our insights into the organization of the competition and our prospects for future editions.
Submitted 19 April, 2024; v1 submitted 14 August, 2023;
originally announced August 2023.
-
Leveraging Optical Communication Fiber and AI for Distributed Water Pipe Leak Detection
Authors:
Huan Wu,
Huan-Feng Duan,
Wallace W. L. Lai,
Kun Zhu,
Xin Cheng,
Hao Yin,
Bin Zhou,
Chun-Cheung Lai,
Chao Lu,
Xiaoli Ding
Abstract:
Detecting leaks in water networks is a costly challenge. This article introduces a practical solution: the integration of optical networks with water networks for efficient leak detection. Our approach uses a fiber-optic cable to measure vibrations, enabling accurate leak identification and localization by an intelligent algorithm. We also propose a method to assess leak severity for prioritized repairs. Our solution detects even small leaks with flow rates as low as 0.027 L/s. It offers a cost-effective way to improve leak detection, enhance water management, and increase operational efficiency.
Submitted 28 July, 2023;
originally announced July 2023.
-
MOV-Modified-FxLMS algorithm with Variable Penalty Factor in a Practical Power Output Constrained Active Control System
Authors:
Chung Kwan Lai,
Dongyuan Shi,
Bhan Lam,
Woon-Seng Gan
Abstract:
Practical Active Noise Control (ANC) systems typically require a restriction in their maximum output power, to prevent overdriving the loudspeaker and causing system instability. Recently, the minimum output variance filtered-reference least mean square (MOV-FxLMS) algorithm was shown to have optimal control under output constraint with an analytically formulated penalty factor, but it needs offline knowledge of disturbance power and secondary path gain. The constant penalty factor in MOV-FxLMS is also susceptible to variations in disturbance power that could cause output power constraint violations. This paper presents a new variable penalty factor that utilizes the estimated disturbance in the established Modified-FxLMS (MFxLMS) algorithm, resulting in a computationally efficient MOV-MFxLMS algorithm that can adapt to changes in disturbance levels in real-time. Numerical simulation with real noise and plant response showed that the variable penalty factor always manages to meet its maximum power output constraint despite sudden changes in disturbance power, whereas the fixed penalty factor has suffered from a constraint mismatch.
Submitted 15 June, 2023;
originally announced June 2023.
-
Transfer Learning for Personality Perception via Speech Emotion Recognition
Authors:
Yuanchao Li,
Peter Bell,
Catherine Lai
Abstract:
Holistic perception of affective attributes is an important human perceptual ability. However, this ability is far from being realized in current affective computing, as not all of the attributes are well studied and their interrelationships are poorly understood. In this work, we investigate the relationship between two affective attributes: personality and emotion, from a transfer learning perspective. Specifically, we transfer Transformer-based and wav2vec2-based emotion recognition models to perceive personality from speech across corpora. Compared with previous studies, our results show that transferring emotion recognition is effective for personality perception. Moreover, this allows for better use and exploration of small personality corpora. We also provide novel findings on the relationship between personality and emotion that will aid future research on holistic affect recognition.
Submitted 28 May, 2023; v1 submitted 25 May, 2023;
originally announced May 2023.
-
ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition
Authors:
Yuanchao Li,
Zeyu Zhao,
Ondrej Klejch,
Peter Bell,
Catherine Lai
Abstract:
In Speech Emotion Recognition (SER), textual data is often used alongside audio signals to address their inherent variability. However, the reliance on human-annotated text in most research hinders the development of practical SER systems. To overcome this challenge, we investigate how Automatic Speech Recognition (ASR) performs on emotional speech by analyzing the ASR performance on emotion corpora and examining the distribution of word errors and confidence scores in ASR transcripts to gain insight into how emotion affects ASR. We utilize four ASR systems, namely Kaldi ASR, wav2vec2, Conformer, and Whisper, and three corpora: IEMOCAP, MOSI, and MELD to ensure generalizability. Additionally, we conduct text-based SER on ASR transcripts with increasing word error rates to investigate how ASR affects SER. The objective of this study is to uncover the relationship and mutual impact of ASR and SER, in order to facilitate ASR adaptation to emotional speech and the use of SER in the real world.
Submitted 28 May, 2023; v1 submitted 25 May, 2023;
originally announced May 2023.
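The word error rate mentioned in the abstract above is conventionally defined as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are obtained by plain whitespace splitting, with no
    text normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.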
-
Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition
Authors:
Yaoting Wang,
Yuanchao Li,
Paul Pu Liang,
Louis-Philippe Morency,
Peter Bell,
Catherine Lai
Abstract:
Fusing multiple modalities has proven effective for multimodal information processing. However, the incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition. In this study, we first analyze how the salient affective information in one modality can be affected by the other, and demonstrate that inter-modal incongruity exists latently in crossmodal attention. Based on this finding, we propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model, which dynamically chooses the primary modality in each training batch and reduces fusion times by leveraging the learned hierarchy in the latent space to alleviate incongruity. The experimental evaluation on five benchmark datasets: CMU-MOSI, CMU-MOSEI, and IEMOCAP (sentiment and emotion), where incongruity implicitly lies in hard samples, as well as UR-FUNNY (humour) and MUStaRD (sarcasm), where incongruity is common, verifies the efficacy of our approach, showing that HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
Submitted 12 November, 2023; v1 submitted 22 May, 2023;
originally announced May 2023.
-
Interference-Aware Deployment for Maximizing User Satisfaction in Multi-UAV Wireless Networks
Authors:
Chuan-Chi Lai,
Ang-Hsun Tsai,
Chia-Wei Ting,
Ko-Han Lin,
Jing-Chi Ling,
Chia-En Tsai
Abstract:
In this letter, we study the deployment of Unmanned Aerial Vehicle mounted Base Stations (UAV-BSs) in multi-UAV cellular networks. We model the multi-UAV deployment problem as a user satisfaction maximization problem, that is, maximizing the proportion of served ground users (GUs) that meet a given minimum data rate requirement. We propose an interference-aware deployment (IAD) algorithm for serving arbitrarily distributed outdoor GUs. The proposed algorithm can alleviate the problem of overlapping coverage between adjacent UAV-BSs to minimize inter-cell interference. Therefore, reducing co-channel interference between UAV-BSs will improve user satisfaction and ensure that most GUs can achieve the minimum data rate requirement. Simulation results show that our proposed IAD outperforms comparative methods by more than 10% in user satisfaction in high-density environments.
Submitted 6 April, 2023;
originally announced April 2023.
-
Real-time modelling of observation filter in the Remote Microphone Technique for an Active Noise Control application
Authors:
Chung Kwan Lai,
Bhan Lam,
Dongyuan Shi,
Woon-Seng Gan
Abstract:
The remote microphone technique (RMT) is often used in active noise control (ANC) applications to overcome design constraints in microphone placements by estimating the acoustic pressure at inconvenient locations using a pre-calibrated observation filter (OF), albeit limited to stationary primary acoustic fields. While the OF estimation in varying primary fields can be significantly improved through the recently proposed source decomposition technique, it requires knowledge of the relative source strengths between incoherent primary noise sources. This paper proposes a method for combining the RMT with a new source-localization technique to estimate the source ratio parameter. Unlike traditional source-localization techniques, the proposed method is capable of being implemented in a real-time RMT application. Simulations with measured responses from an open-aperture ANC application showed a good estimation of the source ratio parameter, which allows the observation filter to be modelled in real-time.
Submitted 21 March, 2023;
originally announced March 2023.
-
Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences
Authors:
Yuan Tseng,
Cheng-I Lai,
Hung-yi Lee
Abstract:
Past work on unsupervised parsing is constrained to written form. In this paper, we present the first study on unsupervised spoken constituency parsing given unlabeled spoken sentences and unpaired textual data. The goal is to determine the spoken sentences' hierarchical syntactic structure in the form of constituency parse trees, such that each node is a span of audio that corresponds to a constituent. We compare two approaches: (1) cascading an unsupervised automatic speech recognition (ASR) model and an unsupervised parser to obtain parse trees on ASR transcripts, and (2) directly training an unsupervised parser on continuous word-level speech representations. The direct approach first splits utterances into sequences of word-level segments and then aggregates self-supervised speech representations within segments to obtain segment embeddings. We find that separately training a parser on the unpaired text and directly applying it on ASR transcripts for inference produces better results for unsupervised parsing. Additionally, our results suggest that accurate segmentation alone may be sufficient to parse spoken sentences accurately. Finally, we show the direct approach may learn head-directionality correctly for both head-initial and head-final languages without any explicit inductive bias.
Submitted 9 May, 2023; v1 submitted 15 March, 2023;
originally announced March 2023.
-
I Know Your Feelings Before You Do: Predicting Future Affective Reactions in Human-Computer Dialogue
Authors:
Yuanchao Li,
Koji Inoue,
Leimin Tian,
Changzeng Fu,
Carlos Ishi,
Hiroshi Ishiguro,
Tatsuya Kawahara,
Catherine Lai
Abstract:
Current Spoken Dialogue Systems (SDSs) often serve as passive listeners that respond only after receiving user speech. To achieve human-like dialogue, we propose a novel future prediction architecture that allows an SDS to anticipate future affective reactions based on its current behaviors before the user speaks. In this work, we investigate two scenarios: speech and laughter. In speech, we propose to predict the user's future emotion based on its temporal relationship with the system's current emotion and its causal relationship with the system's current Dialogue Act (DA). In laughter, we propose to predict the occurrence and type of the user's laughter using the system's laughter behaviors in the current turn. Preliminary analysis of human-robot dialogue demonstrated synchronicity in the emotions and laughter displayed by the human and robot, as well as DA-emotion causality in their dialogue. This verifies that our architecture can contribute to the development of an anticipatory SDS.
Submitted 17 March, 2023; v1 submitted 28 February, 2023;
originally announced March 2023.
-
GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration
Authors:
Naoki Murata,
Koichi Saito,
Chieh-Hsin Lai,
Yuhta Takida,
Toshimitsu Uesaka,
Yuki Mitsufuji,
Stefano Ermon
Abstract:
Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements. However, existing approaches require knowledge of the linear operator. In this paper, we propose GibbsDDRM, an extension of Denoising Diffusion Restoration Models (DDRM) to a blind setting in which the linear measurement operator is unknown. GibbsDDRM constructs a joint distribution of the data, measurements, and linear operator by using a pre-trained diffusion model for the data prior, and it solves the problem by posterior sampling with an efficient variant of a Gibbs sampler. The proposed method is problem-agnostic, meaning that a pre-trained diffusion model can be applied to various inverse problems without fine-tuning. In experiments, it achieved high performance on both blind image deblurring and vocal dereverberation tasks, despite the use of simple generic priors for the underlying linear operators.
Submitted 27 June, 2023; v1 submitted 30 January, 2023;
originally announced January 2023.
-
Multimodal Dyadic Impression Recognition via Listener Adaptive Cross-Domain Fusion
Authors:
Yuanchao Li,
Peter Bell,
Catherine Lai
Abstract:
As a sub-branch of affective computing, impression recognition, e.g., perception of speaker characteristics such as warmth or competence, is potentially a critical part of both human-human conversations and spoken dialogue systems. Most research has studied impressions only from the behaviors expressed by the speaker or the response from the listener, yet ignored their latent connection. In this paper, we perform impression recognition using a proposed listener adaptive cross-domain architecture, which consists of a listener adaptation function to model the causality between speaker and listener behaviors and a cross-domain fusion function to strengthen their connection. The experimental evaluation on the dyadic IMPRESSION dataset verified the efficacy of our method, producing concordance correlation coefficients of 78.8% and 77.5% in the competence and warmth dimensions, outperforming previous studies. The proposed method is expected to be generalized to similar dyadic interaction scenarios.
Submitted 16 February, 2023; v1 submitted 9 November, 2022;
originally announced November 2022.
-
Unsupervised vocal dereverberation with diffusion-based generative models
Authors:
Koichi Saito,
Naoki Murata,
Toshimitsu Uesaka,
Chieh-Hsin Lai,
Yuhta Takida,
Takao Fukui,
Yuki Mitsufuji
Abstract:
Removing reverb from reverberant music is a necessary technique to clean up audio for downstream music manipulations. Reverberation in music falls into two categories: natural reverb and artificial reverb. Artificial reverb has a wider diversity than natural reverb due to its various parameter setups and reverberation types. However, recent supervised dereverberation methods may fail because they rely on sufficiently diverse and numerous pairs of reverberant observations and retrieved data for training in order to be generalizable to unseen observations during inference. To resolve these problems, we propose an unsupervised method that can remove a general kind of artificial reverb for music without requiring pairs of data for training. The proposed method is based on diffusion models, where it initializes the unknown reverberation operator with a conventional signal processing technique and simultaneously refines the estimate with the help of diffusion models. We show through objective and perceptual evaluations that our method outperforms the current leading vocal dereverberation benchmarks.
Submitted 8 November, 2022;
originally announced November 2022.
-
Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing
Authors:
Yonggan Fu,
Yang Zhang,
Kaizhi Qian,
Zhifan Ye,
Zhongzhi Yu,
Cheng-I Lai,
Yingyan Lin
Abstract:
Self-supervised learning (SSL) for rich speech representations has achieved empirical success in low-resource Automatic Speech Recognition (ASR) and other speech processing tasks, which can mitigate the necessity of a large amount of transcribed speech and thus has driven a growing demand for on-device ASR and other speech processing. However, advanced speech SSL models have become increasingly large, which contradicts the limited on-device resources. This gap could be more severe in multilingual/multitask scenarios requiring simultaneously recognizing multiple languages or executing multiple speech processing tasks. Additionally, strongly overparameterized speech SSL models tend to suffer from overfitting when being finetuned on low-resource speech corpus. This work aims to enhance the practical usage of speech SSL models towards a win-win in both enhanced efficiency and alleviated overfitting via our proposed S$^3$-Router framework, which for the first time discovers that simply discarding no more than 10\% of model weights via only finetuning model connections of speech SSL models can achieve better accuracy over standard weight finetuning on downstream speech processing tasks. More importantly, S$^3$-Router can serve as an all-in-one technique to enable (1) a new finetuning scheme, (2) an efficient multilingual/multitask solution, (3) a state-of-the-art ASR pruning technique, and (4) a new tool to quantitatively analyze the learned speech representation. We believe S$^3$-Router has provided a new perspective for practical deployment of speech SSL models. Our codes are available at: https://github.com/GATECH-EIC/S3-Router.
Submitted 2 November, 2022;
originally announced November 2022.
-
Adaptive and Fair Deployment Approach to Balance Offload Traffic in Multi-UAV Cellular Networks
Authors:
Chuan-Chi Lai,
Bhola,
Ang-Hsun Tsai,
Li-Chun Wang
Abstract:
Unmanned aerial vehicle-aided communication (UAV-BS) is a promising solution to establish rapid wireless connectivity in sudden or temporary crowded events because it offers more flexibility and mobility than a conventional ground base station (GBS). Because of these benefits, UAV-BSs can easily be deployed at high altitudes to provide more line of sight (LoS) links than a GBS. Therefore, users on the ground can obtain more reliable wireless channels. In practice, the mobile nature of ground users can create uneven user density at different times and places. This phenomenon leads to unbalanced user associations among UAV-BSs and may cause frequent UAV-BS overload. We propose a three-dimensional adaptive and fair deployment approach to solve this problem. The proposed approach can jointly optimize the altitude and transmission power of UAV-BSs to offload the traffic from overloaded UAV-BSs. The simulation results show that the network performance improves by 37.71% in total capacity, 37.48% in total energy efficiency and 16.12% in the Jain fairness index compared to the straightforward greedy approach.
Submitted 12 November, 2022; v1 submitted 30 October, 2022;
originally announced October 2022.
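The Jain fairness index reported in the abstract above has a standard closed form, J = (sum of x_i)^2 / (n * sum of x_i^2), which ranges from 1/n (one station carries everything) to 1 (perfectly even allocation). The sketch below illustrates that textbook formula only; the function name `jain_fairness_index` and the choice to return 1.0 for an all-zero allocation are assumptions, and the per-station load values are hypothetical rather than taken from the paper.

```python
def jain_fairness_index(loads):
    """Jain's fairness index for a non-empty sequence of non-negative loads."""
    values = [float(v) for v in loads]
    if not values:
        raise ValueError("loads must not be empty")

    sum_of_squares = sum(v * v for v in values)
    if sum_of_squares == 0.0:
        # Assumption: an all-zero allocation is treated as perfectly fair.
        return 1.0

    total = sum(values)
    return (total * total) / (len(values) * sum_of_squares)


# Hypothetical per-UAV-BS user counts, for illustration only.
balanced = jain_fairness_index([5, 5, 5, 5])     # evenly shared load
overloaded = jain_fairness_index([10, 0, 0, 0])  # a single overloaded station
```

Here `balanced` evaluates to 1.0 and `overloaded` to 0.25, the lower bound 1/n for four stations.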
-
Exploration of A Self-Supervised Speech Model: A Study on Emotional Corpora
Authors:
Yuanchao Li,
Yumnah Mohamied,
Peter Bell,
Catherine Lai
Abstract:
Self-supervised speech models have grown fast during the past few years and have proven feasible for use in various downstream tasks. Some recent work has started to look at the characteristics of these models, yet many concerns have not been fully addressed. In this work, we conduct a study on emotional corpora to explore a popular self-supervised model -- wav2vec 2.0. Via a set of quantitative analyses, we mainly demonstrate that: 1) wav2vec 2.0 appears to discard paralinguistic information that is less useful for word recognition purposes; 2) for emotion recognition, representations from the middle layer alone perform as well as those derived from layer averaging, while the final layer results in the worst performance in some cases; 3) current self-supervised models may not be the optimal solution for downstream tasks that make use of non-lexical features. Our work provides novel findings that will aid future research in this area and a theoretical basis for the use of existing models.
Submitted 12 December, 2022; v1 submitted 5 October, 2022;
originally announced October 2022.
-
ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers
Authors:
Kaizhi Qian,
Yang Zhang,
Heting Gao,
Junrui Ni,
Cheng-I Lai,
David Cox,
Mark Hasegawa-Johnson,
Shiyu Chang
Abstract:
Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks. Since the majority of the downstream tasks of SSL learning in speech largely focus on the content information in speech, the most desirable speech representations should be able to disentangle unwanted variations, such as speaker variations, from the content. However, disentangling speakers is very challenging, because removing the speaker information could easily result in a loss of content as well, and the damage of the latter usually far outweighs the benefit of the former. In this paper, we propose a new SSL method that can achieve speaker disentanglement without severe loss of content. Our approach is adapted from the HuBERT framework, and incorporates disentangling mechanisms to regularize both the teacher labels and the learned representations. We evaluate the benefit of speaker disentanglement on a set of content-related downstream tasks, and observe a consistent and notable performance advantage of our speaker-disentangled representations.
Submitted 23 June, 2022; v1 submitted 20 April, 2022;
originally announced April 2022.
-
Simple and Effective Unsupervised Speech Synthesis
Authors:
Alexander H. Liu,
Cheng-I Jeff Lai,
Wei-Ning Hsu,
Michael Auli,
Alexei Baevski,
James Glass
Abstract:
We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe. The framework leverages recent work in unsupervised speech recognition as well as existing neural-based speech synthesis. Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus. Experiments demonstrate the unsupervised system can synthesize speech similar to a supervised counterpart in terms of naturalness and intelligibility measured by human evaluation.
Submitted 20 April, 2022; v1 submitted 5 April, 2022;
originally announced April 2022.
-
Analysis of Voice Conversion and Code-Switching Synthesis Using VQ-VAE
Authors:
Shuvayanti Das,
Jennifer Williams,
Catherine Lai
Abstract:
This paper presents an analysis of speech synthesis quality achieved by simultaneously performing voice conversion and language code-switching using multilingual VQ-VAE speech synthesis in German, French, English and Italian. In this paper, we utilize VQ code indices representing phone information from VQ-VAE to perform code-switching and a VQ speaker code to perform voice conversion in a single system with a neural vocoder. Our analysis examines several aspects of code-switching including the number of language switches and the number of words involved in each switch. We found that speech synthesis quality degrades after increasing the number of language switches within an utterance and decreasing the number of words. We also found some evidence of accent transfer when performing voice conversion across languages as observed when a speaker's original language differs from the language of a synthetic target utterance. We present results from our listening tests and discuss the inherent difficulties of assessing accent transfer in speech synthesis. Our work highlights some of the limitations and strengths of using a semi-supervised end-to-end system like VQ-VAE for handling multilingual synthesis. Our work provides insight into why multilingual speech synthesis is challenging and we suggest some directions for expanding work in this area.
Submitted 28 March, 2022;
originally announced March 2022.
-
A Cross-Domain Approach for Continuous Impression Recognition from Dyadic Audio-Visual-Physio Signals
Authors:
Yuanchao Li,
Catherine Lai
Abstract:
The impression we make on others depends not only on what we say, but also, to a large extent, on how we say it. As a sub-branch of affective computing and social signal processing, impression recognition has proven critical in both human-human conversations and spoken dialogue systems. However, most research has studied impressions only from the signals expressed by the emitter, ignoring the response from the receiver. In this paper, we perform impression recognition using a proposed cross-domain architecture on the dyadic IMPRESSION dataset. This improved architecture makes use of cross-domain attention and regularization. The cross-domain attention consists of intra- and inter-attention mechanisms, which capture intra- and inter-domain relatedness, respectively. The cross-domain regularization includes knowledge distillation and similarity enhancement losses, which strengthen the feature connections between the emitter and receiver. The experimental evaluation verified the effectiveness of our approach. Our approach achieved a concordance correlation coefficient of 0.770 in competence dimension and 0.748 in warmth dimension.
Submitted 25 March, 2022;
originally announced March 2022.
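The concordance correlation coefficient (CCC) quoted in the abstract above is commonly computed as 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2), so it penalizes both weak correlation and systematic offset between predictions and annotations. The following is a minimal sketch of that standard formula using population statistics; the function name is an assumption, and this is not the authors' evaluation script.

```python
def concordance_correlation_coefficient(predictions, targets):
    """CCC between two equal-length numeric sequences (population statistics)."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(predictions)
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    cov = sum((p - mean_p) * (t - mean_t)
              for p, t in zip(predictions, targets)) / n

    denominator = var_p + var_t + (mean_p - mean_t) ** 2
    if denominator == 0.0:
        # Both sequences are constant and identical: CCC is undefined here.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2.0 * cov / denominator
```

Unlike Pearson correlation, a prediction track that follows the annotation perfectly but with a constant offset scores below 1 under this measure.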
-
Robotic Speech Synthesis: Perspectives on Interactions, Scenarios, and Ethics
Authors:
Yuanchao Li,
Catherine Lai
Abstract:
In recent years, many works have investigated the feasibility of conversational robots for performing specific tasks, such as healthcare and interview. Along with this development comes a practical issue: how should we synthesize robotic voices to meet the needs of different situations? In this paper, we discuss this issue from three perspectives: 1) the difficulties of synthesizing non-verbal and interaction-oriented speech signals, particularly backchannels; 2) the scenario classification for robotic voice synthesis; 3) the ethical issues regarding the design of robot voice for its emotion and identity. We present the findings of relevant literature and our prior work, trying to bring the attention of human-robot interaction researchers to design better conversational robots in the future.
Submitted 17 March, 2022;
originally announced March 2022.
-
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities
Authors:
Hsiang-Sheng Tsai,
Heng-Jui Chang,
Wen-Chin Huang,
Zili Huang,
Kushal Lakhotia,
Shu-wen Yang,
Shuyan Dong,
Andy T. Liu,
Cheng-I Jeff Lai,
Jiatong Shi,
Xuankai Chang,
Phil Hall,
Hsuan-Jui Chen,
Shang-Wen Li,
Shinji Watanabe,
Abdelrahman Mohamed,
Hung-yi Lee
Abstract:
Transfer learning has proven to be crucial in advancing the state of speech and natural language processing research in recent years. In speech, a model pre-trained by self-supervised learning transfers remarkably well on multiple tasks. However, the lack of a consistent evaluation methodology is limiting towards a holistic understanding of the efficacy of such models. SUPERB was a step towards introducing a common benchmark to evaluate pre-trained models across various speech tasks. In this paper, we introduce SUPERB-SG, a new benchmark focused on evaluating the semantic and generative capabilities of pre-trained models by increasing task diversity and difficulty over SUPERB. We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain and quality across different types of tasks. It entails freezing pre-trained model parameters, only using simple task-specific trainable heads. The goal is to be inclusive of all researchers, and encourage efficient use of computational resources. We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
Submitted 14 March, 2022;
originally announced March 2022.
-
Continuous Speech for Improved Learning Pathological Voice Disorders
Authors:
Syu-Siang Wang,
Chi-Te Wang,
Chih-Chung Lai,
Yu Tsao,
Shih-Hau Fang
Abstract:
Goal: Numerous studies have successfully differentiated normal and abnormal voice samples. Nevertheless, further classification has rarely been attempted. This study proposes a novel approach, using continuous Mandarin speech instead of a single vowel, to classify four common voice disorders (i.e. functional dysphonia, neoplasm, phonotrauma, and vocal palsy). Methods: In the proposed framework, acoustic signals are transformed into mel-frequency cepstral coefficients, and a bi-directional long short-term memory network (BiLSTM) is adopted to model the sequential features. The experiments were conducted on a large-scale database, wherein 1,045 continuous speech samples were collected by the speech clinic of a hospital from 2012 to 2019. Results: Experimental results demonstrated that the proposed framework yields significant accuracy and unweighted average recall improvements of 78.12-89.27% and 50.92-80.68%, respectively, compared with systems that use a single vowel. Conclusions: The results are consistent with other machine learning algorithms, including gated recurrent units, random forest, deep neural networks, and LSTM. The sensitivities for each disorder were also analyzed, and the model capabilities were visualized via principal component analysis. An alternative experiment based on a balanced dataset again confirms the advantages of using continuous speech for learning voice disorders.
Submitted 22 February, 2022;
originally announced February 2022.
-
Fusing ASR Outputs in Joint Training for Speech Emotion Recognition
Authors:
Yuanchao Li,
Peter Bell,
Catherine Lai
Abstract:
Alongside acoustic information, linguistic features based on speech transcripts have been proven useful in Speech Emotion Recognition (SER). However, due to the scarcity of emotion labelled data and the difficulty of recognizing emotional speech, it is hard to obtain reliable linguistic features and models in this research area. In this paper, we propose to fuse Automatic Speech Recognition (ASR) outputs into the pipeline for joint training SER. The relationship between ASR and SER is understudied, and it is unclear what and how ASR features benefit SER. By examining various ASR outputs and fusion methods, our experiments show that in joint ASR-SER training, incorporating both ASR hidden and text output using a hierarchical co-attention fusion approach improves the SER performance the most. On the IEMOCAP corpus, our approach achieves 63.4% weighted accuracy, which is close to the baseline results achieved by combining ground-truth transcripts. In addition, we also present novel word error rate analysis on IEMOCAP and layer-difference analysis of the Wav2vec 2.0 model to better understand the relationship between ASR and SER.
Submitted 17 March, 2022; v1 submitted 29 October, 2021;
originally announced October 2021.
-
SSAST: Self-Supervised Audio Spectrogram Transformer
Authors:
Yuan Gong,
Cheng-I Jeff Lai,
Yu-An Chung,
James Glass
Abstract:
Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical usage of AST.
This paper focuses on audio and speech classification, and aims to reduce the need for large amounts of labeled data for AST by leveraging self-supervised learning using unlabeled data. Specifically, we propose to pretrain the AST model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio from AudioSet and Librispeech. We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification. The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST. To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.
Submitted 10 February, 2022; v1 submitted 19 October, 2021;
originally announced October 2021.
-
On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis
Authors:
Cheng-I Jeff Lai,
Erica Cooper,
Yang Zhang,
Shiyu Chang,
Kaizhi Qian,
Yi-Lun Liao,
Yung-Sung Chuang,
Alexander H. Liu,
Junichi Yamagishi,
David Cox,
James Glass
Abstract:
Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explored several aspects of TTS pruning: amount of finetuning data versus sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge distillation and pruning. Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility, with similar prosody. All of our experiments are conducted on publicly available models, and findings in this work are backed by large-scale subjective tests and objective measures. Code and 200 pruned models are made available to facilitate future research on efficiency in TTS.
Submitted 27 October, 2021; v1 submitted 3 October, 2021;
originally announced October 2021.
-
Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm
Authors:
Elijah Gutierrez,
Pilar Oplustil-Gallegos,
Catherine Lai
Abstract:
Text-to-Speech synthesis systems are generally evaluated using Mean Opinion Score (MOS) tests, where listeners score samples of synthetic speech on a Likert scale. A major drawback of MOS tests is that they only offer a general measure of overall quality (i.e., the naturalness of an utterance) and so cannot tell us where exactly synthesis errors occur. This can make evaluation of the appropriateness of prosodic variation within utterances inconclusive. To address this, we propose a novel evaluation method based on the Rapid Prosody Transcription paradigm. This allows listeners to mark the locations of errors in an utterance in real-time, providing a probabilistic representation of the perceptual errors that occur in the synthetic signal. We conduct experiments that confirm that the fine-grained evaluation can be mapped to system rankings of standard MOS tests, but the error marking gives a much more comprehensive assessment of synthesized prosody. In particular, for standard audiobook test set samples, we see that error marks consistently cluster around words at major prosodic boundaries indicated by punctuation. However, for question-answer based stimuli, where we control information structure, we see differences emerge in the ability of neural TTS systems to generate context-appropriate prosodic prominence.
Submitted 6 July, 2021;
originally announced July 2021.
-
PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition
Authors:
Cheng-I Jeff Lai,
Yang Zhang,
Alexander H. Liu,
Shiyu Chang,
Yi-Lun Liao,
Yung-Sung Chuang,
Kaizhi Qian,
Sameer Khurana,
David Cox,
James Glass
Abstract:
Self-supervised speech representation learning (speech SSL) has demonstrated the benefit of scale in learning rich representations for Automatic Speech Recognition (ASR) with limited paired data, such as wav2vec 2.0. We investigate the existence of sparse subnetworks in pre-trained speech SSL models that achieve even better low-resource ASR results. However, directly applying widely adopted pruning methods such as the Lottery Ticket Hypothesis (LTH) is suboptimal in the computational cost needed. Moreover, we show that the discovered subnetworks yield minimal performance gain compared to the original dense network. We present Prune-Adjust-Re-Prune (PARP), which discovers and finetunes subnetworks for much better performance, while only requiring a single downstream ASR finetuning run. PARP is inspired by our surprising observation that subnetworks pruned for pre-training tasks need merely a slight adjustment to achieve a sizeable performance boost in downstream ASR tasks. Extensive experiments on low-resource ASR verify (1) sparse subnetworks exist in mono-lingual/multi-lingual pre-trained speech SSL, and (2) the computational advantage and performance gain of PARP over baseline pruning methods. In particular, on the 10min Librispeech split without LM decoding, PARP discovers subnetworks from wav2vec 2.0 with an absolute 10.9%/12.6% WER decrease compared to the full model. We further demonstrate the effectiveness of PARP via: cross-lingual pruning without any phone recognition degradation, the discovery of a multi-lingual subnetwork for 10 spoken languages in 1 finetuning run, and its applicability to pre-trained BERT/XLNet for natural language tasks.
Submitted 26 October, 2021; v1 submitted 10 June, 2021;
originally announced June 2021.
-
SUPERB: Speech processing Universal PERformance Benchmark
Authors:
Shu-wen Yang,
Po-Han Chi,
Yung-Sung Chuang,
Cheng-I Jeff Lai,
Kushal Lakhotia,
Yist Y. Lin,
Andy T. Liu,
Jiatong Shi,
Xuankai Chang,
Guan-Ting Lin,
Tzu-Hsien Huang,
Wei-Cheng Tseng,
Ko-tik Lee,
Da-Rong Liu,
Zili Huang,
Shuyan Dong,
Shang-Wen Li,
Shinji Watanabe,
Abdelrahman Mohamed,
Hung-yi Lee
Abstract:
Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge this gap, we introduce Speech processing Universal PERformance Benchmark (SUPERB). SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data. Among multiple usages of the shared model, we especially focus on extracting the representation learned from SSL due to its preferable re-usability. We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model. Our results demonstrate that the framework is promising as SSL representations show competitive generalizability and accessibility across SUPERB tasks. We release SUPERB as a challenge with a leaderboard and a benchmark toolkit to fuel the research in representation learning and general speech processing.
Submitted 15 October, 2021; v1 submitted 3 May, 2021;
originally announced May 2021.
-
Strike on Stage: a percussion and media performance
Authors:
Charles Martin,
Chi-Hsia Lai
Abstract:
This paper describes Strike on Stage, an interface and corresponding audio-visual performance work developed and performed in 2010 by percussionists and media artists Chi-Hsia Lai and Charles Martin. The concept of Strike on Stage is to integrate computer visuals and sound into an improvised percussion performance. A large projection surface is positioned directly behind the performers, while a computer vision system tracks their movements. The setup allows computer visualisation and sonification to be directly responsive and unified with the performers' gestures.
Submitted 30 November, 2020;
originally announced December 2020.
-
Anatomically-Informed Deep Learning on Contrast-Enhanced Cardiac MRI for Scar Segmentation and Clinical Feature Extraction
Authors:
Haley G. Abramson,
Dan M. Popescu,
Rebecca Yu,
Changxin Lai,
Julie K. Shade,
Katherine C. Wu,
Mauro Maggioni,
Natalia A. Trayanova
Abstract:
Visualizing disease-induced scarring and fibrosis in the heart on cardiac magnetic resonance (CMR) imaging with contrast enhancement (LGE) is paramount in characterizing disease progression and quantifying pathophysiological substrates of arrhythmias. However, segmentation and scar/fibrosis identification from LGE-CMR is an intensive manual process prone to large inter-observer variability. Here, we present a novel fully-automated anatomically-informed deep learning solution for left ventricle (LV) and scar/fibrosis segmentation and clinical feature extraction from LGE-CMR. The technology involves three cascading convolutional neural networks that segment myocardium and scar/fibrosis from raw LGE-CMR images and constrain these segmentations within anatomical guidelines, thus facilitating seamless derivation of clinically-significant parameters. In addition to available LGE-CMR images, training used "LGE-like" synthetically enhanced cine scans. Results show excellent agreement with those of trained experts in terms of segmentation (balanced accuracy of $96\%$ and $75\%$ for LV and scar segmentation), clinical features ($2\%$ difference in mean scar-to-LV wall volume fraction), and anatomical fidelity. Our segmentation technology is extendable to other computer vision medical applications and to problems requiring guidelines adherence of predicted outputs.
Submitted 8 January, 2021; v1 submitted 21 October, 2020;
originally announced October 2020.
-
The Coverage Overlapping Problem of Serving Arbitrary Crowds in 3D Drone Cellular Networks
Authors:
Chuan-Chi Lai,
Li-Chun Wang,
Zhu Han
Abstract:
Providing coverage for flash crowds is an important application for drone base stations (DBSs). However, any arbitrary crowd is likely to be distributed at a high density. Under the condition for each DBS to serve the same number of ground users, multiple DBSs may be placed at the same horizontal location but different altitudes and will cause severe co-channel interference, to which we refer as the coverage overlapping problem. To solve this problem, we then proposed the data-driven 3D placement (DDP) and the enhanced DDP (eDDP) algorithms. The proposed DDP and eDDP can effectively find the appropriate number, altitude, location, and coverage of DBSs in the serving area in polynomial time to maximize the system sum rate and guarantee the minimum data rate requirement of the user equipment. The simulation results show that, compared with the balanced k-means approach, the proposed eDDP can increase the system sum rate by 200% and reduce the computation time by 50%. In particular, eDDP can effectively reduce the occurrence of the coverage overlapping problem and then outperform DDP by about 100% in terms of system sum rate.
Submitted 20 August, 2020;
originally announced August 2020.
-
Quasi-Deterministic Channel Model for mmWaves: Mathematical Formalization and Validation
Authors:
Mattia Lecci,
Michele Polese,
Chiehping Lai,
Jian Wang,
Camillo Gentile,
Nada Golmie,
Michele Zorzi
Abstract:
5G and beyond networks will use, for the first time ever, the millimeter wave (mmWave) spectrum for mobile communications. Accurate performance evaluation is fundamental to the design of reliable mmWave networks, with accuracy rooted in the fidelity of the channel models. At mmWaves, the model must account for the spatial characteristics of propagation since networks will employ highly directional antennas to counter the much greater pathloss. In this regard, Quasi-Deterministic (QD) models are highly accurate channel models, which characterize the propagation in terms of clusters of multipath components, given by a reflected ray and multiple diffuse components of any given Computer Aided Design (CAD) scenario. This paper introduces a detailed mathematical formulation for QD models at mmWaves, that can be used as a reference for their implementation and development. Moreover, it compares channel instances obtained with an open source NIST QD model implementation against real measurements at 60 GHz, substantiating the accuracy of the model. Results show that, when comparing the proposed model and deterministic rays alone with a measurement campaign, the Kolmogorov-Smirnov (KS) test of the QD model improves by up to 0.537.
Submitted 9 February, 2021; v1 submitted 1 June, 2020;
originally announced June 2020.
-
Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction
Authors:
Yi Zhao,
Haoyu Li,
Cheng-I Lai,
Jennifer Williams,
Erica Cooper,
Junichi Yamagishi
Abstract:
Vector Quantized Variational AutoEncoders (VQ-VAE) are a powerful representation learning framework that can discover discrete groups of features from a speech signal without supervision. Until now, the VQ-VAE architecture has modeled individual types of speech features, such as only phones or only F0. This paper introduces an important extension to VQ-VAE for learning F0-related suprasegmental information simultaneously along with traditional phone features. The proposed framework uses two encoders such that the F0 trajectory and speech waveform are both input to the system, therefore two separate codebooks are learned. We used a WaveRNN vocoder as the decoder component of VQ-VAE. Our speaker-independent VQ-VAE was trained with raw speech waveforms from multi-speaker Japanese speech databases. Experimental results show that the proposed extension reduces F0 distortion of reconstructed speech for all unseen test speakers, and results in significantly higher preference scores from a listening test. We additionally conducted experiments using single-speaker Mandarin speech to demonstrate advantages of our architecture in another language which relies heavily on F0.
Submitted 16 May, 2020;
originally announced May 2020.
-
Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?
Authors:
Erica Cooper,
Cheng-I Lai,
Yusuke Yasuda,
Junichi Yamagishi
Abstract:
Previous work on speaker adaptation for end-to-end speech synthesis still falls short in speaker similarity. We investigate an orthogonal approach to the current speaker adaptation paradigms, speaker augmentation, by creating artificial speakers and by taking advantage of low-quality data. The base Tacotron2 model is modified to account for the channel and dialect factors inherent in these corpora. In addition, we describe a warm-start training strategy that we adopted for Tacotron2 training. A large-scale listening test is conducted, and a distance metric is adopted to evaluate synthesis of dialects. This is followed by an analysis on synthesis quality, speaker and dialect similarity, and a remark on the effectiveness of our speaker augmentation approach. Audio samples are available online.
Submitted 7 August, 2020; v1 submitted 3 May, 2020;
originally announced May 2020.
-
Inverse Problems, Deep Learning, and Symmetry Breaking
Authors:
Kshitij Tayal,
Chieh-Hsin Lai,
Vipin Kumar,
Ju Sun
Abstract:
In many physical systems, inputs related by intrinsic system symmetries are mapped to the same output. When inverting such systems, i.e., solving the associated inverse problems, there is no unique solution. This causes fundamental difficulties for deploying the emerging end-to-end deep learning approach. Using the generalized phase retrieval problem as an illustrative example, we show that careful symmetry breaking on the training data can help get rid of the difficulties and significantly improve the learning performance. We also extract and highlight the underlying mathematical principle of the proposed solution, which is directly applicable to other inverse problems.
Submitted 19 March, 2020;
originally announced March 2020.
-
Perception of prosodic variation for speech synthesis using an unsupervised discrete representation of F0
Authors:
Zack Hodari,
Catherine Lai,
Simon King
Abstract:
In English, prosody adds a broad range of information to segment sequences, from information structure (e.g. contrast) to stylistic variation (e.g. expression of emotion). However, when learning to control prosody in text-to-speech voices, it is not clear what exactly the control is modifying. Existing research on discrete representation learning for prosody has demonstrated high naturalness, but no analysis has been performed on what these representations capture, or if they can generate meaningfully-distinct variants of an utterance. We present a phrase-level variational autoencoder with a multi-modal prior, using the mode centres as "intonation codes". Our evaluation establishes which intonation codes are perceptually distinct, finding that the intonation codes from our multi-modal latent model were significantly more distinct than a baseline using k-means clustering. We carry out a follow-up qualitative study to determine what information the codes are carrying. Most commonly, listeners commented on the intonation codes having a statement or question style. However, many other affect-related styles were also reported, including: emotional, uncertain, surprised, sarcastic, passive aggressive, and upset.
Submitted 14 March, 2020;
originally announced March 2020.