-
Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder
Authors:
Haohan Guo,
Fenglong Xie,
Dongchao Yang,
Hui Lu,
Xixin Wu,
Helen Meng
Abstract:
VQ-VAE, a mainstream approach to speech tokenization, has been troubled by "index collapse", where only a small number of codewords are activated in large codebooks. This work proposes product-quantized (PQ) VAE with more codebooks but fewer codewords to address this problem and build large-codebook speech tokenizers. It encodes speech features into multiple VQ subspaces and composes them into codewords in a larger codebook. In addition, to make good use of each VQ subspace, we enhance PQ-VAE via a dual-decoding training strategy with the encoding and quantized sequences. The experimental results demonstrate that PQ-VAE addresses "index collapse" effectively, especially for larger codebooks. The model with the proposed training strategy further improves codebook perplexity and reconstruction quality, outperforming other multi-codebook VQ approaches. Finally, PQ-VAE demonstrates its effectiveness in language-model-based TTS, supporting higher-quality speech generation with larger codebooks.
Submitted 5 June, 2024;
originally announced June 2024.
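The abstract above describes encoding features into several VQ subspaces and composing the per-subspace codes into one codeword of a larger codebook. The following is a minimal sketch of generic product quantization under that reading, not the authors' PQ-VAE: the codebooks are hand-picked toy values, and the names `nearest`, `pq_encode`, and `pq_decode` are hypothetical. With M codebooks of K codewords each, the composite index ranges over K^M values.

```python
from typing import List, Sequence, Tuple


def nearest(sub: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Index of the codeword closest to `sub` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(sub, codebook[k])),
    )


def pq_encode(
    x: Sequence[float], codebooks: Sequence[Sequence[Sequence[float]]]
) -> Tuple[List[int], int]:
    """Quantize each equal-width slice of `x` with its own codebook.

    Returns the per-subspace indices and a single composite index,
    formed by reading the indices as digits in base K.
    """
    m = len(codebooks)
    d = len(x) // m  # width of each subspace (assumes len(x) divisible by m)
    k = len(codebooks[0])  # codewords per codebook (assumed equal)
    indices = [nearest(x[i * d:(i + 1) * d], codebooks[i]) for i in range(m)]
    composite = 0
    for idx in indices:
        composite = composite * k + idx
    return indices, composite


def pq_decode(
    indices: Sequence[int], codebooks: Sequence[Sequence[Sequence[float]]]
) -> List[float]:
    """Concatenate the selected codeword from each subspace."""
    out: List[float] = []
    for i, idx in enumerate(indices):
        out.extend(codebooks[i][idx])
    return out


# Toy setup (hypothetical values): M = 2 codebooks, K = 4 codewords, 2-D subspaces.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0], [0.0, -1.0]],
]
x = [0.9, 1.2, 0.1, -0.8]
indices, composite = pq_encode(x, codebooks)
effective_size = len(codebooks[0]) ** len(codebooks)  # K ** M
reconstruction = pq_decode(indices, codebooks)
```

Here two small codebooks of 4 codewords behave like one codebook of 16 entries, which is the sense in which "more codebooks but fewer codewords" can still yield a large effective codebook.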
-
QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning
Authors:
Haohan Guo,
Fenglong Xie,
Jiawen Kang,
Yujia Xiao,
Xixin Wu,
Helen Meng
Abstract:
This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve TTS quality with lower supervised data requirements via Vector-Quantized Self-Supervised Speech Representation Learning (VQ-S3RL) utilizing more unlabeled speech audio. This framework comprises two VQ-S3R learners: first, the principal learner aims to provide a generative Multi-Stage Multi-Codebook (MSMC) VQ-S3R via the MSMC-VQ-GAN combined with the contrastive S3RL, while decoding it back to the high-quality audio; then, the associate learner further abstracts the MSMC representation into a highly-compact VQ representation through a VQ-VAE. These two generative VQ-S3R learners provide profitable speech representations and pre-trained models for TTS, significantly improving synthesis quality with the lower requirement for supervised data. QS-TTS is evaluated comprehensively under various scenarios via subjective and objective tests in experiments. The results powerfully demonstrate the superior performance of QS-TTS, winning the highest MOS over supervised or semi-supervised baseline TTS approaches, especially in low-resource scenarios. Moreover, comparing various speech representations and transfer learning methods in TTS further validates the notable improvement of the proposed VQ-S3RL to TTS, showing the best audio quality and intelligibility metrics. The trend of slower decay in the synthesis quality of QS-TTS with decreasing supervised data further highlights its lower requirements for supervised data, indicating its great potential in low-resource scenarios.
Submitted 31 August, 2023;
originally announced September 2023.
-
Multi-Arm Robot Task Planning for Fruit Harvesting Using Multi-Agent Reinforcement Learning
Authors:
Tao Li,
Feng Xie,
Ya Xiong,
Qingchun Feng
Abstract:
The emergence of harvesting robotics offers a promising solution to the issue of limited agricultural labor resources and the increasing demand for fruits. Despite notable advancements in the field of harvesting robotics, the utilization of such technology in orchards is still limited. The key challenge is to improve operational efficiency. Taking into account inner-arm conflicts, couplings of DoFs, and dynamic tasks, we propose a task planning strategy for a harvesting robot with four arms in this paper. The proposed method employs a Markov game framework to formulate the four-arm robotic harvesting task, which avoids the computational complexity of solving an NP-hard scheduling problem. Furthermore, a multi-agent reinforcement learning (MARL) structure with a fully centralized collaboration protocol is used to train a MARL-based task planning network. Several simulations and orchard experiments are conducted to validate the effectiveness of the proposed method for a multi-arm harvesting robot in comparison with the existing method.
Submitted 1 March, 2023;
originally announced March 2023.
-
Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Speech Representations
Authors:
Haohan Guo,
Fenglong Xie,
Xixin Wu,
Hui Lu,
Helen Meng
Abstract:
This paper aims to enhance low-resource TTS by reducing training data requirements using compact speech representations. A Multi-Stage Multi-Codebook (MSMC) VQ-GAN is trained to learn the representation, MSMCR, and decode it to waveforms. Subsequently, we train the multi-stage predictor to predict MSMCRs from the text for TTS synthesis. Moreover, we optimize the training strategy by leveraging more audio to learn MSMCRs better for low-resource languages. The strategy selects audio from other languages using a speaker similarity metric to augment the training set, and applies transfer learning to improve training quality. In MOS tests, the proposed system significantly outperforms FastSpeech and VITS in standard and low-resource scenarios, showing lower data requirements. The proposed training strategy effectively enhances MSMCRs on waveform reconstruction. It further improves TTS performance, winning 77% of votes in the preference test for low-resource TTS with only 15 minutes of paired data.
Submitted 26 October, 2022;
originally announced October 2022.
-
A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS
Authors:
Haohan Guo,
Fenglong Xie,
Frank K. Soong,
Xixin Wu,
Helen Meng
Abstract:
We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis. A vector-quantized variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data by down-sampling progressively in multiple stages into MSMC Representations (MSMCRs) with different time resolutions, and quantizing them with multiple VQ codebooks, respectively. Multi-stage predictors are trained to map the input text sequence to MSMCRs progressively by minimizing a combined loss of the reconstruction Mean Square Error (MSE) and "triplet loss". In synthesis, the neural vocoder converts the predicted MSMCRs into final speech waveforms. The proposed approach is trained and tested with an English TTS database of 16 hours by a female speaker. The proposed TTS achieves an MOS of 4.41, which outperforms the baseline with an MOS of 3.62. Compact versions of the proposed TTS with far fewer parameters can still preserve high MOS scores. Ablation studies show that both multiple stages and multiple codebooks are effective for achieving high TTS performance.
Submitted 22 September, 2022;
originally announced September 2022.
-
Weighted Concordance Index Loss-based Multimodal Survival Modeling for Radiation Encephalopathy Assessment in Nasopharyngeal Carcinoma Radiotherapy
Authors:
Jiansheng Fang,
Anwei Li,
Pu-Yun OuYang,
Jiajian Li,
Jingwen Wang,
Hongbo Liu,
Fang-Yun Xie,
Jiang Liu
Abstract:
Radiation encephalopathy (REP) is the most common complication of nasopharyngeal carcinoma (NPC) radiotherapy. It is highly desirable to assist clinicians in optimizing the NPC radiotherapy regimen to reduce radiotherapy-induced temporal lobe injury (RTLI) according to the probability of REP onset. To the best of our knowledge, this is the first exploration of predicting radiotherapy-induced REP by jointly exploiting image and non-image data in the NPC radiotherapy regimen. We cast REP prediction as a survival analysis task and evaluate the predictive accuracy in terms of the concordance index (CI). We design a deep multimodal survival network (MSN) with two feature extractors to learn discriminative features from multimodal data. One feature extractor imposes feature selection on non-image data, and the other learns visual features from images. Because the prior balanced CI (BCI) loss function, which directly maximizes the CI, is sensitive to uneven sampling per batch, we propose a novel weighted CI (WCI) loss function that leverages all REP samples effectively by assigning them different weights with a dual average operation. We further introduce a temperature hyper-parameter for our WCI to sharpen the risk difference of sample pairs to help model convergence. We extensively evaluate our WCI on a private dataset to demonstrate its favourability against its counterparts. The experimental results also show that multimodal data of NPC radiotherapy can bring more gains for REP risk prediction.
Submitted 22 June, 2022;
originally announced June 2022.
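The abstract above evaluates predictions with the concordance index (CI). As a reference point, here is a minimal sketch of the standard pairwise CI for right-censored data, not the authors' WCI or BCI loss. The function name `concordance_index` and the toy data are hypothetical. A pair (i, j) is treated as comparable when subject i had an observed event strictly before time j, and it is concordant when the model assigns i the higher risk. Ties in risk count as half.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float], events: Sequence[int], risks: Sequence[float]
) -> float:
    """Fraction of comparable pairs whose predicted risks are correctly ordered."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): the third subject is censored at time 6.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
ci = concordance_index(times, events, risks)  # 4 of 5 comparable pairs agree
```

Because the CI is a count over ordered pairs, it is not differentiable as written. Loss functions that target it, such as those the abstract mentions, replace the hard comparison with a smooth surrogate.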
-
Nana-HDR: A Non-attentive Non-autoregressive Hybrid Model for TTS
Authors:
Shilun Lin,
Wenchao Su,
Li Meng,
Fenglong Xie,
Xinhui Li,
Li Lu
Abstract:
This paper presents Nana-HDR, a new non-attentive non-autoregressive model with hybrid Transformer-based Dense-fuse encoder and RNN-based decoder for TTS. It mainly consists of three parts: Firstly, a novel Dense-fuse encoder with dense connections between basic Transformer blocks for coarse feature fusion and a multi-head attention layer for fine feature fusion. Secondly, a single-layer non-autoregressive RNN-based decoder. Thirdly, a duration predictor instead of an attention model that connects the above hybrid encoder and decoder. Experiments indicate that Nana-HDR gives full play to the advantages of each component, such as strong text encoding ability of Transformer-based encoder, stateful decoding without being bothered by exposure bias and local information preference, and stable alignment provided by duration predictor. Due to these advantages, Nana-HDR achieves competitive performance in naturalness and robustness on two Mandarin corpora.
Submitted 28 September, 2021;
originally announced September 2021.
-
GNSS Radio Occultation on Aerial Platforms with Commercial Off-The-Shelf Receivers
Authors:
Bryan C. Chan,
Ashish Goel,
Jonathan Kosh,
Tyler G. R. Reid,
Corey R. Snyder,
Paul M. Tarantino,
Saraswati Soedarmadji,
Widyadewi Soedarmadji,
Kevin Nelson,
Feiqin Xie,
Michael Vergalla
Abstract:
In recent decades, GNSS Radio Occultation soundings have proven an invaluable input to global weather forecasting. The success of government-sponsored programs such as COSMIC is now complemented by commercial low-cost cubesat implementations. The result is access to more than 10,000 soundings per day and improved weather forecasting accuracy. This movement towards commercialization has been supported by several agencies, including the National Aeronautics and Space Administration (NASA), National Oceanic and Atmospheric Administration (NOAA) and the U.S. Air Force (USAF) with programs such as the Commercial Weather Data Pilot (CWDP). This has resulted in further interest in commercially deploying GNSS-RO on complementary platforms. Here, we examine a so far underutilized platform: the high-altitude weather balloon. Such meteorological radiosondes are deployed twice daily at over 900 locations globally and form an essential in-situ data source as a long-standing input to weather forecasting models. Adding GNSS-RO capability to existing radiosonde platforms would greatly expand capability, allowing for persistent and local area monitoring, a feature particularly useful for hurricane and other severe weather monitoring. A prohibitive barrier to entry to this inclusion is cost and complexity as GNSS-RO traditionally requires highly specialized and sensitive equipment. This paper describes a multi-year effort to develop a low-cost and scalable approach to balloon GNSS-RO based on Commercial-Off-The-Shelf (COTS) GNSS receivers. We present hardware prototypes and data processing techniques which demonstrate the technical feasibility of the approach through results from several flight testing campaigns.
Submitted 27 September, 2021;
originally announced September 2021.
-
Triple M: A Practical Text-to-speech Synthesis System With Multi-guidance Attention And Multi-band Multi-time LPCNet
Authors:
Shilun Lin,
Fenglong Xie,
Li Meng,
Xinhui Li,
Li Lu
Abstract:
In this work, a robust and efficient text-to-speech (TTS) synthesis system named Triple M is proposed for large-scale online application. The key components of Triple M are: 1) A sequence-to-sequence model adopts a novel multi-guidance attention to transfer complementary advantages from guiding attention mechanisms to the basic attention mechanism without in-domain performance loss and online service modification. Compared with single attention mechanism, multi-guidance attention not only brings better naturalness to long sentence synthesis, but also reduces the word error rate by 26.8%. 2) A new efficient multi-band multi-time vocoder framework, which reduces the computational complexity from 2.8 to 1.0 GFLOP and speeds up LPCNet by 2.75x on a single CPU.
Submitted 7 April, 2021; v1 submitted 30 January, 2021;
originally announced February 2021.
-
Alleviating Class-wise Gradient Imbalance for Pulmonary Airway Segmentation
Authors:
Hao Zheng,
Yulei Qin,
Yun Gu,
Fangfang Xie,
Jie Yang,
Jiayuan Sun,
Guang-zhong Yang
Abstract:
Automated airway segmentation is a prerequisite for pre-operative diagnosis and intra-operative navigation for pulmonary intervention. Due to the small size and scattered spatial distribution of peripheral bronchi, this is hampered by severe class imbalance between foreground and background regions, which makes it challenging for CNN-based methods to parse distal small airways. In this paper, we demonstrate that this problem arises from gradient erosion and dilation of the neighborhood voxels. During back-propagation, if the ratio of the foreground gradient to background gradient is small while the class imbalance is local, the foreground gradients can be eroded by their neighborhoods. This process cumulatively increases the noise information included in the gradient flow from top layers to the bottom ones, limiting the learning of small structures in CNNs. To alleviate this problem, we use group supervision and the corresponding WingsNet to provide complementary gradient flows to enhance the training of shallow layers. To further address the intra-class imbalance between large and small airways, we design a General Union loss function which obviates the impact of airway size by distance-based weights and adaptively tunes the gradient ratio based on the learning process. Extensive experiments on public datasets demonstrate that the proposed method can predict the airway structures with higher accuracy and better morphological completeness than the baselines.
Submitted 29 April, 2021; v1 submitted 24 November, 2020;
originally announced November 2020.
-
Diesel Generator Model Parameterization for Microgrid Simulation Using Hybrid Box-Constrained Levenberg-Marquardt Algorithm
Authors:
Qian Long,
Hui Yu,
Fuhong Xie,
Ning Lu,
David Lubkeman
Abstract:
Existing generator parameterization methods, typically developed for large turbine generator units, are difficult to apply to small kW-level diesel generators in microgrid applications. This paper presents a model parameterization method that estimates a complete set of kW-level diesel generator parameters simultaneously using only load-step-change tests with limited measurement points. This method provides a more cost-efficient and robust approach to achieve high-fidelity modeling of diesel generators for microgrid dynamic simulation. A two-stage hybrid box-constrained Levenberg-Marquardt (H-BCLM) algorithm is developed to search the optimal parameter set given the parameter bounds. A heuristic algorithm, namely Generalized Opposition-based Learning Genetic Algorithm (GOL-GA), is applied to identify proper initial estimates at the first stage, followed by a modified Levenberg-Marquardt algorithm designed to fine tune the solution based on the first-stage result. The proposed method is validated against dynamic simulation of a diesel generator model and field measurements from a 16kW diesel generator unit.
Submitted 25 September, 2020; v1 submitted 22 September, 2020;
originally announced September 2020.
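The abstract above describes a second stage in which a modified Levenberg-Marquardt algorithm refines parameters within given bounds. The sketch below illustrates that general idea only and is not the authors' H-BCLM: it omits the GOL-GA initialization stage, handles bounds by simple clipping (projection) of each trial point, and fits a hypothetical two-parameter decay model `a * exp(-b * t)` to synthetic data. All names, bounds, and values are assumptions.

```python
import math
from typing import List, Sequence


def model(p: Sequence[float], t: float) -> float:
    """Hypothetical two-parameter model: a * exp(-b * t)."""
    return p[0] * math.exp(-p[1] * t)


def cost(p: Sequence[float], ts: Sequence[float], ys: Sequence[float]) -> float:
    """Half the sum of squared residuals."""
    return 0.5 * sum((model(p, t) - y) ** 2 for t, y in zip(ts, ys))


def clip(p: Sequence[float], lo: Sequence[float], hi: Sequence[float]) -> List[float]:
    """Project a parameter vector onto the box [lo, hi]."""
    return [min(max(v, l), h) for v, l, h in zip(p, lo, hi)]


def projected_lm(p0, ts, ys, lo, hi, iters: int = 200) -> List[float]:
    """Damped Gauss-Newton steps, clipped to the box; accept only if cost drops."""
    p = clip(p0, lo, hi)
    lam = 1e-2  # damping factor
    for _ in range(iters):
        e = [math.exp(-p[1] * t) for t in ts]
        r = [p[0] * ei - y for ei, y in zip(e, ys)]          # residuals
        j0 = e                                                # d(model)/da
        j1 = [-p[0] * t * ei for t, ei in zip(ts, e)]         # d(model)/db
        # Damped normal equations: (J^T J + lam * I) step = -J^T r
        a00 = sum(v * v for v in j0) + lam
        a11 = sum(v * v for v in j1) + lam
        a01 = sum(u * v for u, v in zip(j0, j1))
        g0 = sum(u * v for u, v in zip(j0, r))
        g1 = sum(u * v for u, v in zip(j1, r))
        det = a00 * a11 - a01 * a01
        step = [(-g0 * a11 + g1 * a01) / det, (-g1 * a00 + g0 * a01) / det]
        trial = clip([p[0] + step[0], p[1] + step[1]], lo, hi)
        if cost(trial, ts, ys) < cost(p, ts, ys):
            p = trial
            lam = max(lam * 0.1, 1e-12)   # trust the quadratic model more
        else:
            lam = min(lam * 10.0, 1e12)   # fall back toward gradient descent
    return p


# Synthetic, noise-free data from hypothetical true parameters a = 2.0, b = 0.5.
ts = [0.5 * k for k in range(9)]
ys = [2.0 * math.exp(-0.5 * t) for t in ts]
lo, hi = [0.1, 0.01], [5.0, 3.0]
start = clip([1.0, 1.0], lo, hi)
fitted = projected_lm(start, ts, ys, lo, hi)
```

Clipping is the simplest way to keep iterates feasible. It can stall when a bound is active, which is one reason a good initial estimate, such as the one the abstract obtains from its first stage, matters for this kind of local method.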
-
An Networked HIL Simulation System for Modeling Large-scale Power Systems
Authors:
Fuhong Xie,
Catie McEntee,
Mingzhi Zhang,
Ning Lu,
Xinda Ke,
Mallikarjuna R. Vallem,
Nader Samaan
Abstract:
This paper presents a network hardware-in-the-loop (HIL) simulation system for modeling large-scale power systems. Researchers have developed many HIL test systems for power systems in recent years. Those test systems can model both microsecond-level dynamic responses of power electronic systems and millisecond-level transients of transmission and distribution grids. By integrating individual HIL test systems into a network of HIL test systems, we can create large-scale power grid digital twins with flexible structures at required modeling resolution that fits for a wide range of system operating conditions. This will not only significantly reduce the need for field tests when developing new technologies but also greatly shorten the model development cycle. In this paper, we present a networked OPAL-RT based HIL test system for developing transmission-distribution coordinative Volt-VAR regulation technologies as an example to illustrate system setups, communication requirements among different HIL simulation systems, and system connection mechanisms. Impacts of communication delays, information exchange cycles, and computing delays are illustrated. Simulation results show that the performance of a networked HIL test system is satisfactory.
Submitted 17 February, 2020;
originally announced February 2020.