-
ToneUnit: A Speech Discretization Approach for Tonal Language Speech Synthesis
Authors:
Dehua Tao,
Daxin Tan,
Yu Ting Yeung,
Xiao Chen,
Tan Lee
Abstract:
Representing speech as discretized units has numerous benefits in supporting downstream spoken language processing tasks. However, the approach has been less explored in speech synthesis of tonal languages like Mandarin Chinese. Our preliminary experiments on Chinese speech synthesis reveal the issue of "tone shift", where a synthesized speech utterance contains correct base syllables but incorrect tones. To address the issue, we propose the ToneUnit framework, which leverages annotated data with tone labels as CTC supervision to learn tone-aware discrete speech units for Mandarin Chinese speech. Our findings indicate that the discrete units acquired through ToneUnit resolve the "tone shift" issue in synthesized Chinese speech and yield favorable results in English synthesis. Moreover, the experimental results suggest that finite scalar quantization enhances the effectiveness of ToneUnit. Notably, ToneUnit can work effectively even with minimal annotated data.
Submitted 13 June, 2024;
originally announced June 2024.
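The ToneUnit abstract above mentions finite scalar quantization. As a rough illustration of the general idea only (not the authors' implementation), the sketch below bounds each latent dimension with tanh and rounds it to a small, fixed number of integer levels, so the codebook is implicit in the level counts. The function names, the restriction to odd level counts, and the example values are assumptions made for this sketch.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each dimension of z to one of levels[i] integer values.

    Simplified sketch: assumes every entry of `levels` is odd, so the
    codes are symmetric around zero, e.g. 5 levels -> {-2, -1, 0, 1, 2}.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2
        bounded = math.tanh(value) * half  # squash into (-half, half)
        codes.append(int(round(bounded)))  # snap to the nearest level
    return codes


def fsq_index(codes, levels):
    """Map per-dimension codes to a single discrete unit index."""
    index = 0
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index = index * n_levels + (code + half)  # mixed-radix encoding
    return index


codes = fsq_quantize([0.0, 10.0, -10.0], [5, 5, 5])
unit = fsq_index(codes, [5, 5, 5])
```

With three dimensions of five levels each, this toy setup yields 125 possible unit indices without any learned codebook vectors.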
-
Analysis and Utilization of Entrainment on Acoustic and Emotion Features in User-agent Dialogue
Authors:
Daxin Tan,
Nikos Kargas,
David McHardy,
Constantinos Papayiannis,
Antonio Bonafonte,
Marek Strelec,
Jonas Rohnke,
Agis Oikonomou Filandras,
Trevor Wood
Abstract:
Entrainment is the phenomenon by which an interlocutor adapts their speaking style to align with their partner in conversations. It has been found in different dimensions, such as acoustic, prosodic, lexical or syntactic. In this work, we explore and utilize the entrainment phenomenon to improve spoken dialogue systems for voice assistants. We first examine the existence of the entrainment phenomenon in human-to-human dialogues with respect to acoustic features and then extend the analysis to emotion features. The analysis results show strong evidence of entrainment in terms of both acoustic and emotion features. Based on these findings, we implement two entrainment policies and assess whether the integration of the entrainment principle into a Text-to-Speech (TTS) system improves the synthesis performance and the user experience. It is found that the integration of the entrainment principle into a TTS system brings performance improvement when considering acoustic features, while no obvious improvement is observed when considering emotion features.
Submitted 6 December, 2022;
originally announced December 2022.
-
Automated Sex Classification of Children's Voices and Changes in Differentiating Factors with Age
Authors:
Fuling Chen,
Roberto Togneri,
Murray Maybery,
Diana Weiting Tan
Abstract:
Sex classification of children's voices allows for an investigation of the development of secondary sex characteristics which has been a key interest in the field of speech analysis. This research investigated a broad range of acoustic features from scripted and spontaneous speech and applied a hierarchical clustering-based machine learning model to distinguish the sex of children aged between 5 and 15 years. We proposed an optimal feature set and our modelling achieved an average F1 score (the harmonic mean of the precision and recall) of 0.84 across all ages. Our results suggest that the sex classification is generally more accurate when a model is developed for each year group rather than for children in 4-year age bands, with classification accuracy being better for older age groups. We found that spontaneous speech could provide more helpful cues in sex classification than scripted speech, especially for children younger than 7 years. For younger age groups, a broad range of acoustic factors contributed evenly to sex classification, while for older age groups, F0-related acoustic factors were found to be the most critical predictors generally. Other important acoustic factors for older age groups include vocal tract length estimators, spectral flux, loudness and unvoiced features.
Submitted 26 September, 2022;
originally announced September 2022.
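The abstract above reports an F1 score, defined there as the harmonic mean of precision and recall. A minimal sketch of that calculation follows; the function name and the example counts are illustrative assumptions and are not taken from the paper.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 as the harmonic mean of precision and recall.

    Returns 0.0 when there are no true positives, which also avoids
    dividing by zero.
    """
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct detections, 2 false alarms, 2 misses.
score = f1_score(true_pos=8, false_pos=2, false_neg=2)
```

Because the harmonic mean is dominated by the smaller of the two values, a classifier cannot reach a high F1 by trading recall away for precision or the reverse.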
-
CorrectSpeech: A Fully Automated System for Speech Correction and Accent Reduction
Authors:
Daxin Tan,
Liqun Deng,
Nianzu Zheng,
Yu Ting Yeung,
Xin Jiang,
Xiao Chen,
Tan Lee
Abstract:
This study proposes a fully automated system for speech correction and accent reduction. Consider the application scenario in which a recorded speech audio contains certain errors, e.g., inappropriate words or mispronunciations, that need to be corrected. The proposed system, named CorrectSpeech, performs the correction in three steps: recognizing the recorded speech and converting it into a time-stamped symbol sequence, aligning the recognized symbol sequence with the target text to determine locations and types of required edit operations, and generating the corrected speech. Experiments show that the quality and naturalness of corrected speech depend on the performance of speech recognition and alignment modules, as well as the granularity level of editing operations. The proposed system is evaluated on two corpora: a manually perturbed version of VCTK and L2-ARCTIC. The results demonstrate that our system is able to correct mispronunciation and reduce accent in speech recordings. Audio samples are available online for demonstration https://daxintan-cuhk.github.io/CorrectSpeech/ .
Submitted 13 October, 2022; v1 submitted 11 April, 2022;
originally announced April 2022.
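The second step described in the CorrectSpeech abstract, aligning a recognized symbol sequence with the target text to find the locations and types of edits, can be illustrated with a generic sequence alignment. The sketch below uses Python's standard `difflib.SequenceMatcher` on word sequences; this is a stand-in chosen for illustration, not the alignment method used by the authors, and the function name and example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def edit_operations(recognized, target):
    """List the non-matching spans between two symbol sequences.

    Each item is (operation, recognized_span, target_span), where the
    operation is 'replace', 'delete' or 'insert'.
    """
    matcher = SequenceMatcher(a=recognized, b=target, autojunk=False)
    return [
        (tag, recognized[i1:i2], target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


recognized = "i like red apples".split()
target = "i like green apples".split()
ops = edit_operations(recognized, target)
```

In a real system the recognized symbols would also carry time stamps, so each reported span could be mapped back to a region of the audio.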
-
Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech
Authors:
Guangyan Zhang,
Kaitao Song,
Xu Tan,
Daxin Tan,
Yuzi Yan,
Yanqing Liu,
Gang Wang,
Wei Zhou,
Tao Qin,
Tan Lee,
Sheng Zhao
Abstract:
Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech (TTS) has drawn increasing attention. However, existing works apply pre-training with character-based units to enhance the TTS phoneme encoder, which is inconsistent with the TTS fine-tuning that takes phonemes as input. Pre-training only with phonemes as input can alleviate the input mismatch but lacks the ability to model rich representations and semantic information due to the limited phoneme vocabulary. In this paper, we propose Mixed-Phoneme BERT, a novel variant of the BERT model that uses mixed phoneme and sup-phoneme representations to enhance the learning capability. Specifically, we merge the adjacent phonemes into sup-phonemes and combine the phoneme sequence and the merged sup-phoneme sequence as the model input, which can enhance the model capacity to learn rich contextual representations. Experiment results demonstrate that our proposed Mixed-Phoneme BERT significantly improves the TTS performance with 0.30 CMOS gain compared with the FastSpeech 2 baseline. The Mixed-Phoneme BERT achieves 3x inference speedup and similar voice quality to the previous TTS pre-trained model PnG BERT.
Submitted 19 July, 2022; v1 submitted 31 March, 2022;
originally announced March 2022.
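The Mixed-Phoneme BERT abstract describes merging adjacent phonemes into sup-phonemes and feeding both sequences to the model. The sketch below shows one way such a pairing could look, assuming a given sup-phoneme vocabulary and a greedy longest-match merge; the abstract does not specify how the merging is done, so the merge rule, the function name, the `max_len` parameter and the example vocabulary are all assumptions for illustration.

```python
def merge_sup_phonemes(phonemes, vocab, max_len=3):
    """Greedily merge adjacent phonemes into sup-phonemes.

    `vocab` is a set of phoneme tuples treated as known sup-phonemes
    (hypothetical here). The result has one sup-phoneme label per input
    phoneme, so the two sequences stay aligned position by position.
    """
    aligned = []
    i = 0
    while i < len(phonemes):
        # Try the longest candidate first; a single phoneme always matches.
        for span in range(min(max_len, len(phonemes) - i), 0, -1):
            unit = tuple(phonemes[i:i + span])
            if span == 1 or unit in vocab:
                aligned.extend(["_".join(unit)] * span)
                i += span
                break
    return aligned


phonemes = ["HH", "AH", "L", "OW"]
vocab = {("HH", "AH"), ("L", "OW")}
sup_phonemes = merge_sup_phonemes(phonemes, vocab)
mixed_input = list(zip(phonemes, sup_phonemes))
```

Repeating the sup-phoneme label for each phoneme it covers keeps both sequences the same length, which makes it straightforward to combine them as a single model input.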
-
Using Deep Learning with Large Aggregated Datasets for COVID-19 Classification from Cough
Authors:
Esin Darici Haritaoglu,
Nicholas Rasmussen,
Daniel C. H. Tan,
Jennifer Ranjani J.,
Jaclyn Xiao,
Gunvant Chaudhari,
Akanksha Rajput,
Praveen Govindan,
Christian Canham,
Wei Chen,
Minami Yamaura,
Laura Gomezjurado,
Aaron Broukhim,
Amil Khanzada,
Mert Pilanci
Abstract:
The COVID-19 pandemic has been one of the most devastating events in recent history, claiming the lives of more than 5 million people worldwide. Even with the worldwide distribution of vaccines, there is an apparent need for affordable, reliable, and accessible screening techniques to serve parts of the world that do not have access to Western medicine. Artificial Intelligence can provide a solution utilizing cough sounds as a primary screening mode for COVID-19 diagnosis. This paper presents multiple models that have achieved relatively respectable performance on the largest evaluation dataset currently presented in academic literature. Through investigation of a self-supervised learning model (Area under the ROC curve, AUC = 0.807) and a convolutional neural network (CNN) model (AUC = 0.802), we observe the possibility of model bias with limited datasets. Moreover, we observe that performance increases with training data size, showing the need for the worldwide collection of data to help combat the COVID-19 pandemic with non-traditional means.
Submitted 29 March, 2022; v1 submitted 5 January, 2022;
originally announced January 2022.
-
Environment Aware Text-to-Speech Synthesis
Authors:
Daxin Tan,
Guangyan Zhang,
Tan Lee
Abstract:
This study aims at designing an environment-aware text-to-speech (TTS) system that can generate speech to suit specific acoustic environments. It is also motivated by the desire to leverage massive data of speech audio from heterogeneous sources in TTS system development. The key idea is to model the acoustic environment in speech audio as a factor of data variability and incorporate it as a condition in the process of neural network based speech synthesis. Two embedding extractors are trained with two purposely constructed datasets for characterization and disentanglement of speaker and environment factors in speech. A neural network model is trained to generate speech from extracted speaker and environment embeddings. Objective and subjective evaluation results demonstrate that the proposed TTS system is able to effectively disentangle speaker and environment factors and synthesize speech audio that carries designated speaker characteristics and environment attribute. Audio samples are available online for demonstration https://daxintan-cuhk.github.io/Environment-Aware-TTS/ .
Submitted 6 August, 2022; v1 submitted 8 October, 2021;
originally announced October 2021.
-
A study on the efficacy of model pre-training in developing neural text-to-speech system
Authors:
Guangyan Zhang,
Yichong Leng,
Daxin Tan,
Ying Qin,
Kaitao Song,
Xu Tan,
Sheng Zhao,
Tan Lee
Abstract:
In the development of neural text-to-speech systems, model pre-training with a large amount of non-target speakers' data is a common approach. However, in terms of ultimately achieved system performance for target speaker(s), the actual benefits of model pre-training are uncertain and unstable, depending very much on the quantity and text content of training data. This study aims to understand better why and how model pre-training can positively contribute to TTS system performance. It is postulated that the pre-training process plays a critical role in learning text-related variation in speech, while further training with the target speaker's data aims to capture the speaker-related variation. Different test sets are created with varying degrees of similarity to target speaker data in terms of text content. Experiments show that leveraging a speaker-independent TTS trained on speech data with diverse text content can improve the target speaker TTS on domain-mismatched text. We also attempt to reduce the amount of pre-training data for a new text domain and improve the data and computational efficiency. It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
Submitted 7 October, 2021;
originally announced October 2021.
-
Applying the Information Bottleneck Principle to Prosodic Representation Learning
Authors:
Guangyan Zhang,
Ying Qin,
Daxin Tan,
Tan Lee
Abstract:
This paper describes a novel design of a neural network-based speech generation model for learning prosodic representation. The problem of representation learning is formulated according to the information bottleneck (IB) principle. A modified VQ-VAE quantized layer is incorporated in the speech generation model to control the IB capacity and adjust the balance between reconstruction power and disentanglement capability of the learned representation. The proposed model is able to learn word-level prosodic representations from speech data. With an optimized IB capacity, the learned representations not only are adequate to reconstruct the original speech but also can be used to transfer the prosody onto different textual content. Extensive results of the objective and subjective evaluation are presented to demonstrate the effect of IB capacity control, as well as the effectiveness and potential usage of the learned prosodic representation in controllable neural speech generation.
Submitted 5 August, 2021;
originally announced August 2021.
-
EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion
Authors:
Daxin Tan,
Liqun Deng,
Yu Ting Yeung,
Xin Jiang,
Xiao Chen,
Tan Lee
Abstract:
This paper presents the design, implementation and evaluation of a speech editing system, named EditSpeech, which allows a user to perform deletion, insertion and replacement of words in a given speech utterance, without causing audible degradation in speech quality and naturalness. The EditSpeech system is developed upon a neural text-to-speech (NTTS) synthesis framework. Partial inference and bidirectional fusion are proposed to effectively incorporate the contextual information related to the edited region and achieve smooth transition at both left and right boundaries. Distortion introduced to the unmodified parts of the utterance is alleviated. The EditSpeech system is developed and evaluated on English and Chinese in multi-speaker scenarios. Objective and subjective evaluation demonstrate that EditSpeech outperforms a few baseline systems in terms of low spectral distortion and preferred speech quality. Audio samples are available online for demonstration https://daxintan-cuhk.github.io/EditSpeech/ .
Submitted 7 October, 2021; v1 submitted 4 July, 2021;
originally announced July 2021.
-
CUHK-EE Voice Cloning System for ICASSP 2021 M2VoC Challenge
Authors:
Daxin Tan,
Hingpang Huang,
Guangyan Zhang,
Tan Lee
Abstract:
This paper presents the CUHK-EE voice cloning system for the ICASSP 2021 M2VoC challenge. The challenge provides two Mandarin speech corpora: the AIShell-3 corpus of 218 speakers with noise and reverberation and the MST corpus including high-quality speech of one male and one female speaker. 100 and 5 utterances of 3 target speakers in different voice and style are provided in track 1 and 2 respectively, and the participants are required to synthesize speech in the target speaker's voice and style. We take part in track 1 and carry out voice cloning based on 100 utterances of target speakers. An end-to-end voice cloning system is developed to accomplish the task, which includes: 1. a text and speech front-end module with the help of forced alignment, 2. an acoustic model combining Tacotron2 and DurIAN to predict the mel-spectrogram, 3. a HiFi-GAN vocoder for waveform generation. Our system comprises three stages: multi-speaker training stage, target speaker adaptation stage and target speaker synthesis stage. Our team is identified as T17. The subjective evaluation results provided by the challenge organizer demonstrate the effectiveness of our system. Audio samples are available at our demo page: https://daxintan-cuhk.github.io/CUHK-EE-system-M2VoC-challenge/ .
Submitted 5 July, 2021; v1 submitted 8 March, 2021;
originally announced March 2021.
-
Voice Gender Scoring and Independent Acoustic Characterization of Perceived Masculinity and Femininity
Authors:
Fuling Chen,
Roberto Togneri,
Murray Maybery,
Diana Tan
Abstract:
Previous research has found that voices can provide reliable information to be used for gender classification with a high level of accuracy. In social psychology, perceived masculinity and femininity (masculinity and femininity rated by humans) has often been considered an important feature when investigating the influence of vocal features on social behaviours. While previous studies have characterised the acoustic features that contributed to perceivers' judgements of speakers' masculinity or femininity, there is limited research on developing a machine masculinity/femininity scoring model and characterizing the independent acoustic factors that contribute to perceivers' masculinity and femininity judgements. In this work, we first propose a machine scoring model of perceived masculinity/femininity based on the Extreme Random Forest and then characterize the independent and meaningful acoustic factors that contribute to perceivers' judgements by using a correlation matrix based hierarchical clustering method. Our results show that the machine ratings of masculinity and femininity strongly correlated with the human ratings of masculinity and femininity when we used an optimal speech duration of 7 seconds, with a correlation coefficient of up to .63 for females and .77 for males. Nine independent clusters of acoustic measures were generated from our modelling of femininity judgements for female voices and eight clusters were found for masculinity judgements for male voices. The results revealed that, for both genders, the F0 mean is the most important acoustic measure affecting the judgement of acoustic-related masculinity and femininity. The F3 mean, F4 mean and VTL estimators were found to be highly inter-correlated and appeared in the same cluster, forming the second most significant factor in influencing the assessment of acoustic-related masculinity and femininity.
Submitted 4 August, 2022; v1 submitted 16 February, 2021;
originally announced February 2021.
-
Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement
Authors:
Daxin Tan,
Tan Lee
Abstract:
This paper presents a novel design of neural network system for fine-grained style modeling, transfer and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling is realized by extracting style embeddings from the mel-spectrograms of phone-level speech segments. Collaborative learning and adversarial learning strategies are applied in order to achieve effective disentanglement of content and style factors in speech and alleviate the "content leakage" problem in style modeling. The proposed system can be used for varying-content speech style transfer in the single-speaker scenario. The results of objective and subjective evaluation show that our system performs better than other fine-grained speech style transfer models, especially in the aspect of content preservation. By incorporating a style predictor, the proposed system can also be used for text-to-speech synthesis. Audio samples are provided for system demonstration https://daxintan-cuhk.github.io/pl-csd-speech .
Submitted 7 October, 2021; v1 submitted 8 November, 2020;
originally announced November 2020.
-
SoftPoolNet: Shape Descriptor for Point Cloud Completion and Classification
Authors:
Yida Wang,
David Joseph Tan,
Nassir Navab,
Federico Tombari
Abstract:
Point clouds are often the default choice for many applications as they exhibit more flexibility and efficiency than volumetric data. Nevertheless, their unorganized nature -- points are stored in an unordered way -- makes them less suited to be processed by deep learning pipelines. In this paper, we propose a method for 3D object completion and classification based on point clouds. We introduce a new way of organizing the extracted features based on their activations, which we name soft pooling. For the decoder stage, we propose regional convolutions, a novel operator aimed at maximizing the global activation entropy. Furthermore, inspired by the local refining procedure in Point Completion Network (PCN), we also propose a patch-deforming operation to simulate deconvolutional operations for point clouds. This paper proves that our regional activation can be incorporated in many point cloud architectures like AtlasNet and PCN, leading to better performance for geometric completion. We evaluate our approach on different 3D tasks such as object completion and classification, achieving state-of-the-art accuracy.
Submitted 17 August, 2020;
originally announced August 2020.
-
ForkNet: Multi-branch Volumetric Semantic Completion from a Single Depth Image
Authors:
Yida Wang,
David Joseph Tan,
Nassir Navab,
Federico Tombari
Abstract:
We propose a novel model for 3D semantic completion from a single depth image, based on a single encoder and three separate generators used to reconstruct different geometric and semantic representations of the original and completed scene, all sharing the same latent space. To transfer information between the geometric and semantic branches of the network, we introduce paths between them concatenating features at corresponding network layers. Motivated by the limited amount of training samples from real scenes, an interesting attribute of our architecture is the capacity to supplement the existing dataset by generating a new training dataset with high quality, realistic scenes that even includes occlusion and real noise. We build the new dataset by sampling the features directly from latent space which generates a pair of partial volumetric surface and completed volumetric semantic surface. Moreover, we utilize multiple discriminators to increase the accuracy and realism of the reconstructions. We demonstrate the benefits of our approach on standard benchmarks for the two most common completion tasks: semantic 3D scene completion and 3D object completion.
Submitted 3 September, 2019;
originally announced September 2019.