-
HDR Imaging for Dynamic Scenes with Events
Authors:
Li Xiaopeng,
Zeng Zhaoyuan,
Fan Cien,
Zhao Chen,
Deng Lei,
Yu Lei
Abstract:
High dynamic range imaging (HDRI) for real-world dynamic scenes is challenging because moving objects may lead to hybrid degradation of low dynamic range and motion blur. Existing event-based approaches only focus on a separate task, while cascading HDRI and motion deblurring would lead to sub-optimal solutions, and unavailable ground-truth sharp HDR images aggravate the predicament. To address these challenges, we propose an Event-based HDRI framework within a Self-supervised learning paradigm, i.e., Self-EHDRI, which generalizes HDRI performance in real-world dynamic scenarios. Specifically, a self-supervised learning strategy is carried out by learning cross-domain conversions from blurry LDR images to sharp LDR images, which enables sharp HDR images to be accessible in the intermediate process even though ground-truth sharp HDR images are missing. Then, we formulate the event-based HDRI and motion deblurring model and conduct a unified network to recover the intermediate sharp HDR results, where both the high dynamic range and high temporal resolution of events are leveraged simultaneously for compensation. We construct large-scale synthetic and real-world datasets to evaluate the effectiveness of our method. Comprehensive experiments demonstrate that the proposed Self-EHDRI outperforms state-of-the-art approaches by a large margin. The codes, datasets, and results are available at https://lxp-whu.github.io/Self-EHDRI.
Submitted 4 April, 2024;
originally announced April 2024.
-
Ground-to-UAV sub-Terahertz channel measurement and modeling
Authors:
Da Li,
Peian Li,
Jiabiao Zhao,
Jianjian Liang,
Jiacheng Liu,
Guohao Liu,
Yuanshuai Lei,
Wenbo Liu,
Jianqin Deng,
Fuyong Liu,
Jianjun Ma
Abstract:
Unmanned Aerial Vehicle (UAV) assisted terahertz (THz) wireless communications are expected to play a vital role in the next generation of wireless networks. UAVs can serve as either repeaters or data collectors within the communication link, thereby potentially augmenting the efficacy of communication systems. Despite their promise, the channel analysis and modeling specific to THz wireless channels leveraging UAVs remain underexplored. This work delves into a ground-to-UAV channel at 140 GHz, with a specific focus on the influence of UAV hovering behavior on channel performance. Employing experimental measurements through an unmodulated channel setup and a geometry-based stochastic model (GBSM) that integrates three-dimensional positional coordinates and beamwidth, this work evaluates the impact of UAV dynamic movements and antenna orientation on channel performance. Our findings highlight the minimal impact of UAV orientation adjustments on channel performance and underscore the diminishing necessity for precise alignment between UAVs and ground stations as beamwidth increases.
Submitted 28 June, 2024; v1 submitted 3 April, 2024;
originally announced April 2024.
-
LightCAM: A Fast and Light Implementation of Context-Aware Masking based D-TDNN for Speaker Verification
Authors:
Di Cao,
Xianchen Wang,
Junfeng Zhou,
Jiakai Zhang,
Yanjing Lei,
Wenpeng Chen
Abstract:
Traditional Time Delay Neural Networks (TDNN) have achieved state-of-the-art performance at the cost of high computational complexity and slower inference speed, making them difficult to implement in an industrial environment. The Densely Connected Time Delay Neural Network (D-TDNN) with Context Aware Masking (CAM) module has proven to be an efficient structure to reduce complexity while maintaining system performance. In this paper, we propose a fast and lightweight model, LightCAM, which further adopts a depthwise separable convolution module (DSM) and uses multi-scale feature aggregation (MFA) for feature fusion at different levels. Extensive experiments are conducted on the VoxCeleb dataset; the comparative results show that it achieves an EER of 0.83 and MinDCF of 0.0891 on VoxCeleb1-O, which outperforms the other mainstream speaker verification methods. In addition, complexity analysis further demonstrates that the proposed architecture has lower computational cost and faster inference speed.
Submitted 12 February, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
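The complexity saving that the LightCAM abstract attributes to depthwise separable convolution can be illustrated with a simple weight count. This is a minimal sketch, not the authors' implementation: the function names and the layer size (512 channels, kernel width 3) are hypothetical, and bias terms are ignored.

```python
def standard_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a standard 1-D convolution (bias terms ignored)."""
    return c_in * c_out * kernel


def depthwise_separable_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a depthwise stage plus a 1x1 pointwise stage (bias ignored)."""
    depthwise = c_in * kernel   # one length-`kernel` filter per input channel
    pointwise = c_in * c_out    # 1x1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer size, chosen only for illustration (not from the paper).
C_IN, C_OUT, KERNEL = 512, 512, 3
standard = standard_conv1d_params(C_IN, C_OUT, KERNEL)
separable = depthwise_separable_conv1d_params(C_IN, C_OUT, KERNEL)
print(standard, separable, round(standard / separable, 2))  # 786432 263680 2.98
```

For this assumed layer the separable variant needs roughly a third of the weights; the actual savings in LightCAM depend on its real layer configuration, which the abstract does not give.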
-
HOPE: Hybrid-granularity Ordinal Prototype Learning for Progression Prediction of Mild Cognitive Impairment
Authors:
Chenhui Wang,
Yiming Lei,
Tao Chen,
Junping Zhang,
Yuxin Li,
Hongming Shan
Abstract:
Mild cognitive impairment (MCI) is often at high risk of progression to Alzheimer's disease (AD). Existing works to identify the progressive MCI (pMCI) typically require MCI subtype labels, pMCI vs. stable MCI (sMCI), determined by whether or not an MCI patient will progress to AD after a long follow-up. However, prospectively acquiring MCI subtype data is time-consuming and resource-intensive; the resultant small datasets could lead to severe overfitting and difficulty in extracting discriminative information. Inspired by the observation that various longitudinal biomarkers and cognitive measurements present an ordinal pathway on AD progression, we propose a novel Hybrid-granularity Ordinal PrototypE learning (HOPE) method to characterize AD ordinal progression for MCI progression prediction. First, HOPE learns an ordinal metric space that enables progression prediction by prototype comparison. Second, HOPE leverages a novel hybrid-granularity ordinal loss to learn the ordinal nature of AD via effectively integrating instance-to-instance ordinality, instance-to-class compactness, and class-to-class separation. Third, to make the prototype learning more stable, HOPE employs an exponential moving average strategy to learn the global prototypes of NC and AD dynamically. Experimental results on the internal ADNI and the external NACC datasets demonstrate the superiority of the proposed HOPE over existing state-of-the-art methods as well as its interpretability. Source code is made available at https://github.com/thibault-wch/HOPE-for-mild-cognitive-impairment.
Submitted 19 January, 2024;
originally announced January 2024.
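The exponential moving average (EMA) strategy mentioned in the HOPE abstract can be sketched generically as follows. The abstract does not state the exact update rule, so the form below (blend the old prototype with the current batch mean), the function names, and the momentum value of 0.9 are all assumptions for illustration.

```python
from typing import List, Sequence


def batch_mean(features: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a batch of equal-length feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def ema_update(prototype: Sequence[float],
               features: Sequence[Sequence[float]],
               momentum: float = 0.9) -> List[float]:
    """Blend the old prototype with the batch mean (assumed EMA form)."""
    mean = batch_mean(features)
    return [momentum * p + (1.0 - momentum) * m
            for p, m in zip(prototype, mean)]


# Hypothetical 2-D class prototype and one mini-batch of class features.
prototype = [0.0, 0.0]
batch = [[1.0, 3.0], [3.0, 5.0]]
updated = ema_update(prototype, batch, momentum=0.9)
print(updated)  # close to [0.2, 0.4], up to floating-point rounding
```

A high momentum makes the prototype change slowly between batches, which is the usual reason an EMA is chosen when per-batch estimates are noisy.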
-
Accent-VITS: accent transfer for end-to-end TTS
Authors:
Linhan Ma,
Yongmao Zhang,
Xinfa Zhu,
Yi Lei,
Ziqian Ning,
Pengcheng Zhu,
Lei Xie
Abstract:
Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker's voice. The main challenge is how to effectively disentangle speaker timbre and accent, which are entangled in speech. This paper presents a VITS-based end-to-end accent transfer model named Accent-VITS. Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer. We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, respectively, using bottleneck features and mel spectrums as constraints. Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes more stable and effective. Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity and speech naturalness as compared with a strong baseline.
Submitted 29 December, 2023; v1 submitted 28 December, 2023;
originally announced December 2023.
-
Battery-Care Resource Allocation and Task Offloading in Multi-Agent Post-Disaster MEC Environment
Authors:
Yiwei Tang,
Hualong Huang,
Wenhan Zhan,
Geyong Min,
Zhekai Duan,
Yuchuan Lei
Abstract:
As an emerging application scenario of mobile edge computing (MEC), post-disaster rescue involves numerous computing-intensive tasks but lacks reliably guaranteed network connectivity. In rescue environments, quality of service (QoS), such as task execution delay, energy consumption and battery state of health (SoH), is of great importance. This paper studies a multi-user post-disaster MEC environment with unstable 5G communication, where device-to-device (D2D) link communication and dynamic voltage and frequency scaling (DVFS) are adopted to balance each user's requirement for task delay and energy consumption. A battery degradation evaluation approach to prolong battery lifetime is also presented. The distributed optimization problem is formulated into a mixed cooperative-competitive (MCC) multi-agent Markov decision process (MAMDP) and is tackled with recurrent multi-agent Proximal Policy Optimization (rMAPPO). Extensive simulations and comprehensive comparisons with other representative algorithms clearly demonstrate the effectiveness of the proposed rMAPPO-based offloading scheme.
Submitted 23 December, 2023;
originally announced December 2023.
-
Channel Modeling for Terahertz Communications in Rain
Authors:
Peian Li,
Wenbo Liu,
Jiacheng Liu,
Da Li,
Guohao Liu,
Yuanshuai Lei,
Jiabiao Zhao,
Xiaopeng Wang,
Houjun Sun,
Jianjun Ma,
John F. Federici
Abstract:
Terahertz (THz) communication channels, integral to outdoor applications, are critically influenced by natural factors like rainfall. Our research focused on the nuanced effects of rain on these channels, employing an advanced rainfall emulation system. By analyzing key parameters such as rain rate, altitude based variations in rainfall, and diverse raindrop sizes, we identified the paramount significance of the number of raindrops in the THz channel, particularly in scenarios with constant rain rates but varying drop sizes. Central to our findings is a novel model grounded in Mie scattering theory, which adeptly incorporates the variability of raindrop size distributions at different altitudes. This model has displayed strong congruence with our experimental results. In essence, our study underscores the inadequacy of solely depending on a fixed ground-based rain rate and emphasizes the imperative of calibrating distribution metrics to cater to specific environmental and operational contexts.
Submitted 28 November, 2023;
originally announced November 2023.
-
PuzzleTuning: Explicitly Bridge Pathological and Natural Image with Puzzles
Authors:
Tianyi Zhang,
Shangqing Lyu,
Yanli Lei,
Sicheng Chen,
Nan Ying,
Yufang He,
Yu Zhao,
Yunlu Feng,
Hwee Kuan Lee,
Guanglei Zhang
Abstract:
Pathological image analysis is a crucial field in computer vision. Due to the annotation scarcity in the pathological field, pre-training with self-supervised learning (SSL) is widely applied to learn on unlabeled images. However, the current SSL-based pathological pre-training: (1) does not explicitly explore the essential focuses of the pathological field, and (2) does not effectively bridge with and thus take advantage of the knowledge from natural images. To explicitly address them, we propose our large-scale PuzzleTuning framework, containing the following innovations. Firstly, we define three task focuses that can effectively bridge knowledge of pathological and natural domain: appearance consistency, spatial consistency, and restoration understanding. Secondly, we devise a novel multiple puzzle restoring task, which explicitly pre-trains the model regarding these focuses. Thirdly, we introduce an explicit prompt-tuning process to incrementally integrate the domain-specific knowledge. It builds a bridge to align the large domain gap between natural and pathological images. Additionally, a curriculum-learning training strategy is designed to regulate task difficulty, making the model adaptive to the puzzle restoring complexity. Experimental results show that our PuzzleTuning framework outperforms the previous state-of-the-art methods in various downstream tasks on multiple datasets. The code, demo, and pre-trained weights are available at https://github.com/sagizty/PuzzleTuning.
Submitted 22 April, 2024; v1 submitted 11 November, 2023;
originally announced November 2023.
-
CPIA Dataset: A Comprehensive Pathological Image Analysis Dataset for Self-supervised Learning Pre-training
Authors:
Nan Ying,
Yanli Lei,
Tianyi Zhang,
Shangqing Lyu,
Chunhui Li,
Sicheng Chen,
Zeyu Liu,
Yu Zhao,
Guanglei Zhang
Abstract:
Pathological image analysis is a crucial field in computer-aided diagnosis, where deep learning is widely applied. Transfer learning using pre-trained models initialized on natural images has effectively improved the downstream pathological performance. However, the lack of sophisticated domain-specific pathological initialization hinders their potential. Self-supervised learning (SSL) enables pre-training without sample-level labels, which has great potential to overcome the challenge of expensive annotations. Thus, studies focusing on pathological SSL pre-training call for a comprehensive and standardized dataset, similar to the ImageNet in computer vision. This paper presents the comprehensive pathological image analysis (CPIA) dataset, a large-scale SSL pre-training dataset combining 103 open-source datasets with extensive standardization. The CPIA dataset contains 21,427,877 standardized images, covering over 48 organs/tissues and about 100 kinds of diseases, which includes two main data types: whole slide images (WSIs) and characteristic regions of interest (ROIs). A four-scale WSI standardization process is proposed based on the uniform resolution in microns per pixel (MPP), while the ROIs are divided into three scales artificially. This multi-scale dataset is built with the diagnosis habits under the supervision of experienced senior pathologists. The CPIA dataset facilitates a comprehensive pathological understanding and enables pattern discovery explorations. Additionally, to launch the CPIA dataset, several state-of-the-art (SOTA) baselines of SSL pre-training and downstream evaluation are specially conducted. The CPIA dataset along with baselines is available at https://github.com/zhanglab2021/CPIA_Dataset.
Submitted 27 October, 2023;
originally announced October 2023.
-
Boosting Multi-Speaker Expressive Speech Synthesis with Semi-supervised Contrastive Learning
Authors:
Xinfa Zhu,
Yuke Li,
Yi Lei,
Ning Jiang,
Guoqing Zhao,
Lei Xie
Abstract:
This paper aims to build a multi-speaker expressive TTS system, synthesizing a target speaker's speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, contrastive learning from different levels, i.e. utterance and category level, is leveraged to extract the disentangled style, emotion, and speaker representations from speech for style and emotion transfer. Furthermore, a semi-supervised training strategy is introduced to improve the data utilization efficiency by involving multi-domain data, including style-labeled data, emotion-labeled data, and abundant unlabeled data. To achieve expressive speech with diverse styles and emotions for a target speaker, the learned disentangled representations are integrated into an improved VITS model. Experiments on multi-domain data demonstrate the effectiveness of the proposed method.
Submitted 25 April, 2024; v1 submitted 25 October, 2023;
originally announced October 2023.
-
Interference Management by Harnessing Multi-Domain Resources in Spectrum-Sharing Aided Satellite-Ground Integrated Networks
Authors:
Xiaojin Ding,
Yue Lei,
Yulong Zou,
Gengxin Zhang,
Lajos Hanzo
Abstract:
A spectrum-sharing satellite-ground integrated network is conceived, consisting of a pair of non-geostationary orbit (NGSO) constellations and multiple terrestrial base stations, which impose the co-frequency interference (CFI) on each other. The CFI may increase upon increasing the number of satellites. To manage the potentially severe interference, we propose to rely on joint multi-domain resource aided interference management (JMDR-IM). Specifically, the coverage overlap of the constellations considered is analyzed. Then, multi-domain resources - including both the beam-domain and power-domain - are jointly utilized for managing the CFI in an overlapping coverage region. This joint resource utilization is performed by relying on our specifically designed beam-shut-off and switching based beam scheduling, as well as on long short-term memory based joint autoregressive moving average assisted deep Q network aided power scheduling. Moreover, the outage probability (OP) of the proposed JMDR-IM scheme is derived, and the asymptotic analysis of the OP is also provided. Our performance evaluations demonstrate the superiority of the proposed JMDR-IM scheme in terms of its increased throughput and reduced OP.
Submitted 29 January, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
Vec-Tok Speech: speech vectorization and tokenization for neural speech generation
Authors:
Xinfa Zhu,
Yuanjun Lv,
Yi Lei,
Tao Li,
Wendi He,
Hongbin Zhou,
Heng Lu,
Lei Xie
Abstract:
Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity texts or images in various tasks. In contrast, the current speech generative models are still struggling regarding speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that resembles multiple speech generation tasks, generating expressive and high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain acoustic details contributing to high-fidelity speech reconstruction, while semantic tokens focus on the linguistic content of speech, facilitating language modeling. Based on the proposed speech codec, Vec-Tok Speech leverages an LM to undertake the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce the token length and bit rate for lower exposure bias and longer context coverage, improving the performance of LMs. Vec-Tok Speech can be used for intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50k hours of speech, performs better than other SOTA models. Code will be available at https://github.com/BakerBunker/VecTok .
Submitted 12 October, 2023; v1 submitted 11 October, 2023;
originally announced October 2023.
-
VITS-based Singing Voice Conversion System with DSPGAN post-processing for SVCC2023
Authors:
Yiquan Zhou,
Meng Chen,
Yi Lei,
Jihua Zhu,
Weifeng Zhao
Abstract:
This paper presents the T02 team's system for the Singing Voice Conversion Challenge 2023 (SVCC2023). Our system entails a VITS-based SVC model, incorporating three modules: a feature extractor, a voice converter, and a post-processor. Specifically, the feature extractor provides F0 contours and extracts speaker-independent linguistic content from the input singing voice by leveraging a HuBERT model. The voice converter is employed to recompose the speaker timbre, F0, and linguistic content to generate the waveform of the target speaker. Besides, to further improve the audio quality, a fine-tuned DSPGAN vocoder is introduced to re-synthesise the waveform. Given the limited target speaker data, we utilize a two-stage training strategy to adapt the base model to the target speaker. During model adaptation, several tricks, such as data augmentation and joint training with auxiliary singer data, are involved. Official challenge results show that our system achieves superior performance, especially in the cross-domain task, ranking 1st and 2nd in naturalness and similarity, respectively. Further ablation justifies the effectiveness of our system design.
Submitted 8 October, 2023;
originally announced October 2023.
-
PromptSpeaker: Speaker Generation Based on Text Descriptions
Authors:
Yongmao Zhang,
Guanghou Liu,
Yi Lei,
Yunlin Chen,
Hao Yin,
Lei Xie,
Zhifei Li
Abstract:
Recently, text-guided content generation has received extensive attention. In this work, we explore the possibility of text description-based speaker generation, i.e., using text prompts to control the speaker generation process. Specifically, we propose PromptSpeaker, a text-guided speaker generation system. PromptSpeaker consists of a prompt encoder, a zero-shot VITS, and a Glow model, where the prompt encoder predicts a prior distribution based on the text description and samples from this distribution to obtain a semantic representation. The Glow model subsequently converts the semantic representation into a speaker representation, and the zero-shot VITS finally synthesizes the speaker's voice based on the speaker representation. We verify by objective metrics that PromptSpeaker can generate speakers distinct from those in the training set, and that the synthetic speaker voice has reasonable subjective matching quality with the speaker prompt.
Submitted 8 October, 2023;
originally announced October 2023.
-
Zero-Shot Emotion Transfer For Cross-Lingual Speech Synthesis
Authors:
Yuke Li,
Xinfa Zhu,
Yi Lei,
Hai Li,
Junhui Liu,
Danming Xie,
Lei Xie
Abstract:
Zero-shot emotion transfer in cross-lingual speech synthesis aims to transfer emotion from an arbitrary speech reference in the source language to the synthetic speech in the target language. Building such a system faces challenges of unnatural foreign accents and difficulty in modeling the shared emotional expressions of different languages. Building on the DelightfulTTS neural architecture, this paper addresses these challenges by introducing specifically-designed modules to model the language-specific prosody features and language-shared emotional expressions separately. Specifically, the language-specific speech prosody is learned by a non-autoregressive predictive coding (NPC) module to improve the naturalness of the synthetic cross-lingual speech. The shared emotional expression between different languages is extracted from a pre-trained self-supervised model HuBERT with strong generalization capabilities. We further use hierarchical emotion modeling to capture more comprehensive emotions across different languages. Experimental results demonstrate the proposed framework's effectiveness in synthesizing bi-lingual emotional speech for the monolingual target speaker without emotional training data.
Submitted 5 October, 2023;
originally announced October 2023.
-
Nonlinear Multi-Carrier System with Signal Clipping: Measurement, Analysis, and Optimization
Authors:
Yuyang Du,
Liang Hao,
Yiming Lei,
Qun Yang,
Shiqi Xu
Abstract:
Signal clipping is a classic technique for reducing peak-to-average power ratio (PAPR) in orthogonal frequency division multiplexing (OFDM) systems. It has been widely applied in consumer electronic devices owing to its low complexity and high efficiency. Although clipping reduces the nonlinear distortion caused by power amplifiers (PAs), it induces additional clipping distortion. Optimizing the joint system performance with consideration of both PA nonlinearity and clipping distortion remains an open problem due to the complex PA modeling. In this paper, we analyze the PA nonlinearity through the Bessel-Fourier PA (BFPA) model and simplify its power expression using inter-modulation product (IMP) analysis. We derive expressions of the receiver signal-to-noise ratio (SNR) and system symbol error rate (SER) for the nonlinear clipped OFDM system. With the derivations, we investigate the optimal system setting to achieve the SER lower bound in a practical OFDM system that considers both PA nonlinearity and clipping distortion. The methods and results presented in this paper can serve as a useful reference for the system-level optimization of clipped OFDM systems with nonlinear PA.
Submitted 16 February, 2024; v1 submitted 1 October, 2023;
originally announced October 2023.
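The trade-off named in the abstract above (clipping lowers PAPR but adds clipping distortion) can be seen in a small simulation. This is a minimal sketch under stated assumptions, not the paper's BFPA-based analysis: the 64-subcarrier QPSK symbol, the clipping ratio of 1.5 times the RMS amplitude, and all function names are hypothetical, and no power amplifier is modeled.

```python
import cmath
import math
import random


def idft(freq):
    """Naive inverse DFT: turns subcarrier symbols into time-domain samples."""
    n = len(freq)
    return [sum(freq[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def mean_power(x):
    """Average power of a complex sample sequence."""
    return sum(abs(s) ** 2 for s in x) / len(x)


def papr_db(x):
    """Peak-to-average power ratio in dB."""
    return 10 * math.log10(max(abs(s) ** 2 for s in x) / mean_power(x))


def clip(x, clip_ratio):
    """Limit each amplitude to clip_ratio * RMS while preserving the phase."""
    a_max = clip_ratio * math.sqrt(mean_power(x))
    return [s if abs(s) <= a_max else a_max * s / abs(s) for s in x]


rng = random.Random(0)  # fixed seed so the run is repeatable
qpsk = [complex(rng.choice((-1, 1)), rng.choice((-1, 1))) / math.sqrt(2)
        for _ in range(64)]  # hypothetical: 64 subcarriers, QPSK
x = idft(qpsk)
y = clip(x, clip_ratio=1.5)  # hypothetical clipping ratio
distortion_power = mean_power([b - a for a, b in zip(x, y)])

print(f"PAPR before clipping: {papr_db(x):.2f} dB")
print(f"PAPR after clipping:  {papr_db(y):.2f} dB")
print(f"Clipping distortion power: {distortion_power:.3e}")
```

Lowering `clip_ratio` reduces PAPR further but increases the distortion power, which is the tension that a joint optimization over clipping and PA nonlinearity has to balance.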
-
PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts
Authors:
Jixun Yao,
Yuguang Yang,
Yi Lei,
Ziqian Ning,
Yanni Hu,
Yu Pan,
Jingjing Yin,
Hongbin Zhou,
Heng Lu,
Lei Xie
Abstract:
Style voice conversion aims to transform the style of source speech to a desired style according to real-world application demands. However, the current style voice conversion approach relies on pre-defined labels or reference speech to control the conversion process, which leads to limitations in style diversity or falls short in terms of the intuitiveness and interpretability of style representation. In this study, we propose PromptVC, a novel style voice conversion approach that employs a latent diffusion model to generate a style vector driven by natural language prompts. Specifically, the style vector is extracted by a style encoder during training, and then the latent diffusion model is trained independently to sample the style vector from noise, with this process being conditioned on natural language prompts. To improve style expressiveness, we leverage HuBERT to extract discrete tokens and replace them with the K-Means center embedding to serve as the linguistic content, which minimizes residual style information. Additionally, we deduplicate the same discrete token and employ a differentiable duration predictor to re-predict the duration of each token, which can adapt the duration of the same linguistic content to different styles. The subjective and objective evaluation results demonstrate the effectiveness of our proposed system.
△ Less
Submitted 26 December, 2023; v1 submitted 17 September, 2023;
originally announced September 2023.
-
On the Performance of Multidimensional Constellation Shaping for Linear and Nonlinear Optical Fiber Channel
Authors:
Bin Chen,
Zhiwei Liang,
Shen Li,
Yi Lei,
Gabriele Liga,
Alex Alvarado
Abstract:
Multidimensional constellation shaping of up to 32 dimensions with different spectral efficiencies are compared through AWGN and fiber-optic simulations. The results show that no constellation is universal and the balance of required and effective SNRs should be jointly considered for the specific optical transmission scenario.
Submitted 18 October, 2023; v1 submitted 17 August, 2023;
originally announced August 2023.
-
Unsupervised Image Denoising in Real-World Scenarios via Self-Collaboration Parallel Generative Adversarial Branches
Authors:
Xin Lin,
Chao Ren,
Xiao Liu,
Jie Huang,
Yinjie Lei
Abstract:
Deep learning methods have shown remarkable performance in image denoising, particularly when trained on large-scale paired datasets. However, acquiring such paired datasets for real-world scenarios poses a significant challenge. Although unsupervised approaches based on generative adversarial networks offer a promising solution for denoising without paired datasets, they are difficult in surpassing the performance limitations of conventional GAN-based unsupervised frameworks without significantly modifying existing structures or increasing the computational complexity of denoisers. To address this problem, we propose a SC strategy for multiple denoisers. This strategy can achieve significant performance improvement without increasing the inference complexity of the GAN-based denoising framework. Its basic idea is to iteratively replace the previous less powerful denoiser in the filter-guided noise extraction module with the current powerful denoiser. This process generates better synthetic clean-noisy image pairs, leading to a more powerful denoiser for the next iteration. This baseline ensures the stability and effectiveness of the training network. The experimental results demonstrate the superiority of our method over state-of-the-art unsupervised methods.
Submitted 13 August, 2023;
originally announced August 2023.
-
METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer
Authors:
Xinfa Zhu,
Yi Lei,
Tao Li,
Yongmao Zhang,
Hongbin Zhou,
Heng Lu,
Lei Xie
Abstract:
Previous multilingual text-to-speech (TTS) approaches have considered leveraging monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored synthesizing emotional aspects of speech due to the challenges of cross-speaker cross-lingual emotion transfer - the heavy entanglement of speaker timbre, emotion, and language factors in the speech signal will make a system produce cross-lingual synthetic speech with an undesired foreign accent and weak emotion expressiveness. This paper proposes the Multilingual Emotional TTS (METTS) model to mitigate these problems, realizing both cross-speaker and cross-lingual emotion transfer. Specifically, METTS takes DelightfulTTS as the backbone model and proposes the following designs. First, to alleviate the foreign accent problem, METTS introduces multi-scale emotion modeling to disentangle speech prosody into coarse-grained and fine-grained scales, producing language-agnostic and language-specific emotion representations, respectively. Second, as a pre-processing step, formant shift-based information perturbation is applied to the reference signal for better disentanglement of speaker timbre in the speech. Third, a vector quantization-based emotion matcher is designed for reference selection, leading to decent naturalness and emotion diversity in cross-lingual synthetic speech. Experiments demonstrate the good design of METTS.
Submitted 29 July, 2023;
originally announced July 2023.
-
The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task
Authors:
Kun Song,
Yi Lei,
Peikun Chen,
Yiqing Cao,
Kun Wei,
Yongmao Zhang,
Lei Xie,
Ning Jiang,
Guoqing Zhao
Abstract:
This paper describes the NPU-MSXF system for the IWSLT 2023 speech-to-speech translation (S2ST) task which aims to translate from English speech of multi-source to Chinese speech. The system is built in a cascaded manner consisting of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). We make tremendous efforts to handle the challenging multi-source input. Specifically, to improve the robustness to multi-source speech input, we adopt various data augmentation strategies and a ROVER-based score fusion on multiple ASR model outputs. To better handle the noisy ASR transcripts, we introduce a three-stage fine-tuning strategy to improve translation accuracy. Finally, we build a TTS model with high naturalness and sound quality, which leverages a two-stage framework, using network bottleneck features as a robust intermediate representation for speaker timbre and linguistic content disentanglement. Based on the two-stage framework, pre-trained speaker embedding is leveraged as a condition to transfer the speaker timbre in the source English speech to the translated Chinese speech. Experimental results show that our system has high translation accuracy, speech naturalness, sound quality, and speaker similarity. Moreover, it shows good robustness to multi-source data.
Submitted 10 July, 2023;
originally announced July 2023.
-
PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions
Authors:
Guanghou Liu,
Yongmao Zhang,
Yi Lei,
Yunlin Chen,
Rui Wang,
Zhifei Li,
Lei Xie
Abstract:
Style transfer TTS has shown impressive performance in recent years. However, style control is often restricted to systems built on expressive speech recordings with discrete style categories. In practical situations, users may be interested in transferring style by typing text descriptions of desired styles, without the reference speech in the target style. The text-guided content generation techniques have drawn wide attention recently. In this work, we explore the possibility of controllable style transfer with natural language descriptions. To this end, we propose PromptStyle, a text prompt-guided cross-speaker style transfer system. Specifically, PromptStyle consists of an improved VITS and a cross-modal style encoder. The cross-modal style encoder constructs a shared space of stylistic and semantic representation through a two-stage training process. Experiments show that PromptStyle can achieve proper style transfer with text prompts while maintaining relatively high stability and speaker similarity. Audio samples are available on our demo page.
Submitted 1 June, 2023; v1 submitted 30 May, 2023;
originally announced May 2023.
-
StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation
Authors:
Kun Song,
Yi Ren,
Yi Lei,
Chunfeng Wang,
Kun Wei,
Lei Xie,
Xiang Yin,
Zejun Ma
Abstract:
Direct speech-to-speech translation (S2ST) has gradually become popular as it has many advantages compared with cascade S2ST. However, current research mainly focuses on the accuracy of semantic translation and ignores the speech style transfer from a source language to a target language. The lack of high-fidelity expressive parallel data makes such style transfer challenging, especially in more practical zero-shot scenarios. To solve this problem, we first build a parallel corpus using a multi-lingual multi-speaker text-to-speech synthesis (TTS) system and then propose the StyleS2ST model with cross-lingual speech style transfer ability based on a style adaptor on a direct S2ST system framework. Enabling continuous style space modeling of an acoustic model through parallel corpus training and non-parallel TTS data augmentation, StyleS2ST captures cross-lingual acoustic feature mapping from the source to the target language. Experiments show that StyleS2ST achieves good style similarity and naturalness in both in-set and out-of-set zero-shot scenarios.
Submitted 25 July, 2023; v1 submitted 28 May, 2023;
originally announced May 2023.
-
FAN-Net: Fourier-Based Adaptive Normalization For Cross-Domain Stroke Lesion Segmentation
Authors:
Weiyi Yu,
Yiming Lei,
Hongming Shan
Abstract:
Since stroke is the main cause of various cerebrovascular diseases, deep learning-based stroke lesion segmentation on magnetic resonance (MR) images has attracted considerable attention. However, the existing methods often neglect the domain shift among MR images collected from different sites, which has limited performance improvement. To address this problem, we intend to change style information without affecting high-level semantics via adaptively changing the low-frequency amplitude components of the Fourier transform so as to enhance model robustness to varying domains. Thus, we propose a novel FAN-Net, a U-Net-based segmentation network incorporated with a Fourier-based adaptive normalization (FAN) and a domain classifier with a gradient reversal layer. The FAN module is tailored for learning adaptive affine parameters for the amplitude components of different domains, which can dynamically normalize the style information of source images. Then, the domain classifier provides domain-agnostic knowledge to endow FAN with strong domain generalizability. The experimental results on the ATLAS dataset, which consists of MR images from 9 sites, show the superior performance of the proposed FAN-Net compared with baseline methods.
Submitted 23 April, 2023;
originally announced April 2023.
-
Analytical Model of Nonlinear Fiber Propagation for General Dual-Polarization Four-Dimensional Modulation Format
Authors:
Zhiwei Liang,
Bin Chen,
Yi Lei,
Gabriele Liga,
Alex Alvarado
Abstract:
Coherent dual-polarization (DP) optical transmission systems encode information on the four available degrees of freedom of an optical field: the two polarization states, each with two quadrature components. Such systems naturally operate based on a four-dimensional (4D) signal space. Having a general analytical model to accurately estimate nonlinear interference (NLI) is key to analyze such transmission systems as well as to study how different DP-4D formats are affected by NLI. However, the available models in the literature are not completely general. They either do not apply to the entire DP-4D formats or do not consider all the NLI contributions. In this paper, we develop a model that applies to all DP-4D modulation formats with independent symbols. Our model takes self-channel interference, cross-channel interference and multiple-channel interference effects into account. As an application of our model, we further study the effects of signal-noise interactions in long-haul transmission via the proposed model. When compared to previous results in the literature, our model is more accurate at predicting the contribution of NLI for both low and high dispersion fibers in single- and multi-channel transmission systems. For the NLI, we report an average gap from split step Fourier simulation results below 0.15 dB. The simulation results further show that by considering signal-noise interactions, the proposed model in long-haul transmission can reduce the transmission reach prediction error by 4%.
Submitted 9 October, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis
Authors:
Yi Lei,
Shan Yang,
Xinsheng Wang,
Qicong Xie,
Jixun Yao,
Lei Xie,
Dan Su
Abstract:
Text-to-speech (TTS) and singing voice synthesis (SVS) aim at generating high-quality speaking and singing voice according to textual input and music scores, respectively. Unifying TTS and SVS into a single system is crucial to the applications requiring both of them. Existing methods usually suffer from some limitations, which rely on either both singing and speaking data from the same person or cascaded models of multiple tasks. To address these problems, a simplified elegant framework for TTS and SVS, named UniSyn, is proposed in this paper. It is an end-to-end unified model that can make a voice speak and sing with only singing or speaking data from this person. To be specific, a multi-conditional variational autoencoder (MC-VAE), which constructs two independent latent sub-spaces with the speaker- and style-related (i.e. speak or sing) conditions for flexible control, is proposed in UniSyn. Moreover, supervised guided-VAE and timbre perturbation with the Wasserstein distance constraint are leveraged to further disentangle the speaker timbre and style. Experiments conducted on two speakers and two singers demonstrate that UniSyn can generate natural speaking and singing voice without corresponding training data. The proposed approach outperforms the state-of-the-art end-to-end voice generation work, which proves the effectiveness and advantages of UniSyn.
Submitted 6 December, 2022; v1 submitted 3 December, 2022;
originally announced December 2022.
-
Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling
Authors:
Xinfa Zhu,
Yi Lei,
Kun Song,
Yongmao Zhang,
Tao Li,
Lei Xie
Abstract:
This paper aims to synthesize the target speaker's speech with desired speaking style and emotion by transferring the style and emotion from reference speech recorded by other speakers. We address this challenging problem with a two-stage framework composed of a text-to-style-and-emotion (Text2SE) module and a style-and-emotion-to-wave (SE2Wave) module, bridging by neural bottleneck (BN) features. To further solve the multi-factor (speaker timbre, speaking style and emotion) decoupling problem, we adopt the multi-label binary vector (MBV) and mutual information (MI) minimization to respectively discretize the extracted embeddings and disentangle these highly entangled factors in both Text2SE and SE2Wave modules. Moreover, we introduce a semi-supervised training strategy to leverage data from multiple speakers, including emotion-labeled data, style-labeled data, and unlabeled data. To better transfer the fine-grained expression from references to the target speaker in non-parallel transfer, we introduce a reference-candidate pool and propose an attention-based reference selection approach. Extensive experiments demonstrate the good design of our model.
Submitted 14 March, 2023; v1 submitted 18 November, 2022;
originally announced November 2022.
-
Distinguishable Speaker Anonymization based on Formant and Fundamental Frequency Scaling
Authors:
Jixun Yao,
Qing Wang,
Yi Lei,
Pengcheng Guo,
Lei Xie,
Namin Wang,
Jie Liu
Abstract:
Speech data on the Internet are proliferating exponentially because of the emergence of social media, and the sharing of such personal data raises obvious security and privacy concerns. One solution to mitigate these concerns involves concealing speaker identities before sharing speech data, also referred to as speaker anonymization. In our previous work, we have developed an automatic speaker verification (ASV)-model-free anonymization framework to protect speaker privacy while preserving speech intelligibility. Although the framework ranked first place in VoicePrivacy 2022 challenge, the anonymization was imperfect, since the speaker distinguishability of the anonymized speech was deteriorated. To address this issue, in this paper, we directly model the formant distribution and fundamental frequency (F0) to represent speaker identity and anonymize the source speech by the uniformly scaling formant and F0. By directly scaling the formant and F0, the speaker distinguishability degradation of the anonymized speech caused by the introduction of other speakers is prevented. The experimental results demonstrate that our proposed framework can improve the speaker distinguishability and significantly outperforms our previous framework in voice distinctiveness. Furthermore, our proposed method also can trade off the privacy-utility by using different scaling factors.
Submitted 6 November, 2022;
originally announced November 2022.
-
Preserving background sound in noise-robust voice conversion via multi-task learning
Authors:
Jixun Yao,
Yi Lei,
Qing Wang,
Pengcheng Guo,
Ziqian Ning,
Lei Xie,
Hai Li,
Junhui Liu,
Danming Xie
Abstract:
Background sound is an informative form of art that is helpful in providing a more immersive experience in real-application voice conversion (VC) scenarios. However, prior research on VC, which mainly focuses on clean voices, pays little attention to VC with background sound. The critical problem for preserving background sound in VC is inevitable speech distortion by the neural separation model and the cascade mismatch between the source separation model and the VC model. In this paper, we propose an end-to-end framework via multi-task learning which sequentially cascades a source separation (SS) module, a bottleneck feature extraction module and a VC module. Specifically, the source separation task explicitly considers critical phase information and confines the distortion caused by the imperfect separation process. The source separation task, the typical VC task and the unified task share a uniform reconstruction loss constrained by joint training to reduce the mismatch between the SS and VC modules. Experimental results demonstrate that our proposed framework significantly outperforms the baseline systems while achieving comparable quality and speaker similarity to the VC models trained with clean data.
Submitted 6 November, 2022;
originally announced November 2022.
-
DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP
Authors:
Kun Song,
Yongmao Zhang,
Yi Lei,
Jian Cong,
Hanzhao Li,
Lei Xie,
Gang He,
Jinfeng Bai
Abstract:
Recent development of neural vocoders based on the generative adversarial neural network (GAN) has shown obvious advantages of generating raw waveform conditioned on mel-spectrogram with fast inference speed and lightweight networks. Whereas, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech from various scenarios with unseen speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a GAN-based universal vocoder for high-fidelity speech synthesis by applying the time-frequency domain supervision from digital signal processing (DSP). To eliminate the mismatch problem caused by the ground-truth spectrograms in the training phase and the predicted spectrograms in the inference phase, we leverage the mel-spectrogram extracted from the waveform generated by a DSP module, rather than the predicted mel-spectrogram from the Text-to-Speech (TTS) acoustic model, as the time-frequency domain supervision to the GAN-based vocoder. We also utilize sine excitation as the time-domain supervision to improve the harmonic modeling and eliminate various artifacts of the GAN-based vocoder. Experiments show that DSPGAN significantly outperforms the compared approaches and it can generate high-fidelity speech for various TTS models trained using diverse data.
Submitted 28 May, 2023; v1 submitted 2 November, 2022;
originally announced November 2022.
-
Performance-Driven Controller Tuning via Derivative-Free Reinforcement Learning
Authors:
Yuheng Lei,
Jianyu Chen,
Shengbo Eben Li,
Sifa Zheng
Abstract:
Choosing an appropriate parameter set for the designed controller is critical for the final performance but usually requires a tedious and careful tuning process, which implies a strong need for automatic tuning methods. However, among existing methods, derivative-free ones suffer from poor scalability or low efficiency, while gradient-based ones are often unavailable due to possibly non-differentiable controller structure. To resolve the issues, we tackle the controller tuning problem using a novel derivative-free reinforcement learning (RL) framework, which performs timestep-wise perturbation in parameter space during experience collection and integrates derivative-free policy updates into the advanced actor-critic RL architecture to achieve high versatility and efficiency. To demonstrate the framework's efficacy, we conduct numerical experiments on two concrete examples from autonomous driving, namely, adaptive cruise control with PID controller and trajectory tracking with MPC controller. Experimental results show that the proposed method outperforms popular baselines and highlight its strong potential for controller tuning.
Submitted 11 September, 2022;
originally announced September 2022.
-
Deformable Image Registration using Unsupervised Deep Learning for CBCT-guided Abdominal Radiotherapy
Authors:
Huiqiao Xie,
Yang Lei,
Yabo Fu,
Tonghe Wang,
Justin Roper,
Jeffrey D. Bradley,
Pretesh Patel,
Tian Liu,
Xiaofeng Yang
Abstract:
CBCTs in image-guided radiotherapy provide crucial anatomy information for patient setup and plan evaluation. Longitudinal CBCT image registration could quantify the inter-fractional anatomic changes. The purpose of this study is to propose an unsupervised deep learning based CBCT-CBCT deformable image registration. The proposed deformable registration workflow consists of training and inference stages that share the same feed-forward path through a spatial transformation-based network (STN). The STN consists of a global generative adversarial network (GlobalGAN) and a local GAN (LocalGAN) to predict the coarse- and fine-scale motions, respectively. The network was trained by minimizing the image similarity loss and the deformable vector field (DVF) regularization loss without the supervision of ground truth DVFs. During the inference stage, patches of local DVF were predicted by the trained LocalGAN and fused to form a whole-image DVF. The local whole-image DVF was subsequently combined with the GlobalGAN generated DVF to obtain final DVF. The proposed method was evaluated using 100 fractional CBCTs from 20 abdominal cancer patients in the experiments and 105 fractional CBCTs from a cohort of 21 different abdominal cancer patients in a holdout test. Qualitatively, the registration results show great alignment between the deformed CBCT images and the target CBCT image. Quantitatively, the average target registration error (TRE) calculated on the fiducial markers and manually identified landmarks was 1.91 ± 1.11 mm. The average mean absolute error (MAE) and normalized cross correlation (NCC) between the deformed CBCT and target CBCT were 33.42 ± 7.48 HU and 0.94 ± 0.04, respectively. This promising registration method could provide fast and accurate longitudinal CBCT alignment to facilitate inter-fractional anatomic changes analysis and prediction.
Submitted 29 August, 2022;
originally announced August 2022.
-
Shuffle Instances-based Vision Transformer for Pancreatic Cancer ROSE Image Classification
Authors:
Tianyi Zhang,
Youdan Feng,
Yunlu Feng,
Yu Zhao,
Yanli Lei,
Nan Ying,
Zhiling Yan,
Yufang He,
Guanglei Zhang
Abstract:
The rapid on-site evaluation (ROSE) technique can significantly accelerate the diagnosis of pancreatic cancer by immediately analyzing the fast-stained cytopathological images. Computer-aided diagnosis (CAD) can potentially address the shortage of pathologists in ROSE. However, the cancerous patterns vary significantly between different samples, making the CAD task extremely challenging. Besides, the ROSE images have complicated perturbations regarding color distribution, brightness, and contrast due to different staining qualities and various acquisition device types. To address these challenges, we proposed a shuffle instances-based Vision Transformer (SI-ViT) approach, which can reduce the perturbations and enhance the modeling among the instances. With the regrouped bags of shuffle instances and their bag-level soft labels, the approach utilizes a regression head to make the model focus on the cells rather than various perturbations. Simultaneously, combined with a classification head, the model can effectively identify the general distributive patterns among different instances. The results demonstrate significant improvements in the classification accuracy with more accurate attention regions, indicating that the diverse patterns of ROSE images are effectively extracted, and the complicated perturbations are significantly reduced. It also suggests that the SI-ViT has excellent potential in analyzing cytopathological images. The code and experimental results are available at https://github.com/sagizty/MIL-SI.
Submitted 14 August, 2022;
originally announced August 2022.
-
Delay-Doppler Reversal for OTFS System in Doubly-selective Fading Channels
Authors:
Xiangxiang Li,
Haiyan Wang,
Yao Ge,
Xiaohong Shen,
Yuanyuan Lei
Abstract:
The recent proposed orthogonal time frequency space (OTFS) modulation shows signifcant advantages than conventional orthogonal frequency division multiplexing (OFDM) for high mobility wireless communications. However, a challenging problem is the development of effcient receivers for practical OTFS systems with low complexity. In this paper, we propose a novel delay-Doppler reversal (DDR) technolo…
▽ More
The recent proposed orthogonal time frequency space (OTFS) modulation shows signifcant advantages than conventional orthogonal frequency division multiplexing (OFDM) for high mobility wireless communications. However, a challenging problem is the development of effcient receivers for practical OTFS systems with low complexity. In this paper, we propose a novel delay-Doppler reversal (DDR) technology for OTFS system with desired performance and low complexity. We present the DDR technology from a perspective of two-dimensional cascaded channel model, analyze its computational complexity and also analyze its performance gain compared to the direct processing (DP) receiver without DDR. Simulation results demonstrate that our proposed DDR receiver outperforms traditional receivers in doubly-selective fading channels.
Submitted 22 July, 2022;
originally announced July 2022.
-
Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion
Authors:
Yi Lei,
Shan Yang,
Jian Cong,
Lei Xie,
Dan Su
Abstract:
The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice with only one utterance of the target speaker. Although the challenges of adapting to new voices in the zero-shot scenario exist in both stages -- acoustic modeling and vocoder -- previous works usually consider the problem from only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming to solve the problem from both stages for high-quality zero-shot text-to-speech and any-to-any voice conversion. We first build a universal WaveGAN model for extracting the latent distribution $p(z)$ of speech and reconstructing the waveform from it. Then a flow-based acoustic model only needs to learn the same $p(z)$ from texts, which naturally avoids the mismatch between the acoustic model and the vocoder, resulting in high-quality generated speech without model fine-tuning. Based on a continuous speaker space and the reversible property of flows, the conditional distribution can be obtained for any speaker, and thus we can further conduct high-quality zero-shot speech generation for new speakers. We particularly investigate two methods to construct the speaker space, namely a pre-trained speaker encoder and a jointly-trained speaker encoder. The superiority of Glow-WaveGAN 2 has been demonstrated through TTS and VC experiments conducted on the LibriTTS corpus and the VCTK corpus.
Submitted 5 July, 2022;
originally announced July 2022.
-
Geometrically-Shaped Multi-Dimensional Modulation Formats in Coherent Optical Transmission Systems
Authors:
Bin Chen,
Yi Lei,
Gabriele Liga,
Zhiwei Liang,
Wei Ling,
Xuwei Xue,
Alex Alvarado
Abstract:
Shaping modulation formats in multi-dimensional (MD) space is an effective approach to harvest spectral efficiency gains in both the additive white Gaussian noise (AWGN) channel and the optical fiber channel. In the first part of this paper, existing MD geometrically-shaped modulations for fiber optical communications are reviewed. It is shown that large gains can be obtained by exploiting correlation in the dimensions and/or by increasing the cardinality of the modulation format. Practical limitations and challenges are also discussed together with efficient solutions. In the second part, we extend the recently proposed four-dimensional (4D) modulation format family based on the constraint of orthant-symmetry to high spectrum efficiencies up to 10 bit/4D-sym by maximizing generalized mutual information for the AWGN channel. Reach increases of up to 25% for a multi-span optical fiber transmission system are reported. Lastly, with the help of a recently introduced nonlinear interference (NLI) model, an optimization for designing nonlinear-tolerant 4D modulation formats is introduced for a single-span optical fiber system. Simulation results show that the proposed NLI model-based 4D modulation format could increase the effective SNRs by 0.25 dB with respect to the AWGN channel-optimal 4D modulation format.
Submitted 31 August, 2022; v1 submitted 3 July, 2022;
originally announced July 2022.
-
End-to-End Voice Conversion with Information Perturbation
Authors:
Qicong Xie,
Shan Yang,
Yi Lei,
Lei Xie,
Dan Su
Abstract:
The ideal goal of voice conversion is to convert the source speaker's speech to sound naturally like the target speaker while maintaining the linguistic content and the prosody of the source speech. However, current approaches are insufficient to achieve comprehensive source prosody transfer and target speaker timbre preservation in the converted speech, and the quality of the converted speech is also unsatisfactory due to the mismatch between the acoustic model and the vocoder. In this paper, we leverage the recent advances in information perturbation and propose a fully end-to-end approach to conduct high-quality voice conversion. We first adopt information perturbation to remove speaker-related information in the source speech, disentangling speaker timbre from linguistic content, so that the linguistic information can subsequently be modeled by a content encoder. To better transfer the prosody of the source speech to the target, we particularly introduce a speaker-related pitch encoder which can maintain the general pitch pattern of the source speaker while flexibly modifying the pitch intensity of the generated speech. Finally, one-shot voice conversion is set up through continuous speaker space modeling. Experimental results indicate that the proposed end-to-end approach significantly outperforms the state-of-the-art models in terms of intelligibility, naturalness, and speaker similarity.
Submitted 15 June, 2022;
originally announced June 2022.
-
Analytical SNR Prediction in Long-Haul Optical Transmission using General Dual-Polarization 4D Formats
Authors:
Zhiwei Liang,
Bin Chen,
Yi Lei,
Gabriele Liga,
Alex Alvarado
Abstract:
Nonlinear interference models for dual-polarization 4D (DP-4D) modulation have only been used so far to predict signal-signal nonlinear interference. We show that including the signal-noise term in the prediction of the effective signal-to-noise ratio in long distance DP-4D transmission improves the accuracy by up to 0.2 dB.
Submitted 15 July, 2022; v1 submitted 2 June, 2022;
originally announced June 2022.
-
Zeroth-Order Actor-Critic
Authors:
Yuheng Lei,
Jianyu Chen,
Shengbo Eben Li,
Sifa Zheng
Abstract:
The recently advanced evolution-based zeroth-order optimization methods and the policy gradient-based first-order methods are two promising alternatives for solving reinforcement learning (RL) problems, with complementary advantages. The former methods work with arbitrary policies, drive state-dependent and temporally-extended exploration, and possess a robustness-seeking property, but suffer from high sample complexity, while the latter methods are more sample efficient but are restricted to differentiable policies, and the learned policies are less robust. To address these issues, we propose a novel Zeroth-Order Actor-Critic algorithm (ZOAC), which unifies these two methods into an on-policy actor-critic architecture to preserve the advantages of both. ZOAC alternates, in each iteration, between rollout collection with timestep-wise perturbation in parameter space, first-order policy evaluation (PEV), and zeroth-order policy improvement (PIM). We extensively evaluate our proposed method on a wide range of challenging continuous control benchmarks using different types of policies, where ZOAC outperforms zeroth-order and first-order baseline algorithms.
Submitted 11 June, 2022; v1 submitted 29 January, 2022;
originally announced January 2022.
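The zeroth-order policy improvement mentioned in the abstract above relies on estimating an update direction from function values alone. The following is a minimal sketch of a generic evolution-strategy-style estimator with antithetic Gaussian perturbations in parameter space; it is not the ZOAC algorithm itself (there is no critic and no timestep-wise perturbation), and `toy_return`, `improve`, and all hyperparameter values are hypothetical names and choices made only for illustration.

```python
import random


def zeroth_order_gradient(objective, theta, sigma=0.1, num_pairs=20, rng=None):
    """Estimate the gradient of `objective` at `theta` from function values only.

    Uses antithetic pairs theta + sigma*eps and theta - sigma*eps with
    Gaussian noise eps, as in evolution-strategy-style methods.
    """
    rng = rng or random.Random(0)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(num_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = objective([t + sigma * e for t, e in zip(theta, eps)])
        minus = objective([t - sigma * e for t, e in zip(theta, eps)])
        scale = (plus - minus) / (2.0 * sigma * num_pairs)
        for i in range(dim):
            grad[i] += scale * eps[i]
    return grad


def toy_return(theta):
    # Hypothetical stand-in for an episode return; it is maximized at theta = 0.
    return -sum(t * t for t in theta)


def improve(theta, steps=100, lr=0.05, seed=0):
    """Run plain gradient ascent using only zeroth-order gradient estimates."""
    rng = random.Random(seed)
    for _ in range(steps):
        grad = zeroth_order_gradient(toy_return, theta, rng=rng)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta
```

Because the estimator only queries `objective`, the same loop would work for a non-differentiable policy, which is the property the abstract attributes to zeroth-order methods.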
-
MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis
Authors:
Yi Lei,
Shan Yang,
Xinsheng Wang,
Lei Xie
Abstract:
Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods performed expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model the emotion at different levels. Specifically, the proposed method is a typical attention-based sequence-to-sequence model with three proposed modules, namely a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, utterance-level emotion variation, and syllable-level emotion strength, respectively. In addition to modeling the emotion at different levels, the proposed method also allows us to synthesize emotional speech in different ways, i.e., transferring the emotion from reference audio, predicting the emotion from input text, and controlling the emotion strength manually. Extensive experiments conducted on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference audio-based and text-based emotional speech synthesis methods on emotion transfer speech synthesis and text-based emotion prediction speech synthesis, respectively. Besides, the experiments also show that the proposed method can control the emotion expressions flexibly. Detailed analysis shows the effectiveness of each module and the good design of the proposed method.
Submitted 17 January, 2022;
originally announced January 2022.
-
Shaped Four-Dimensional Modulation Formats for Optical Fiber Communication Systems
Authors:
Bin Chen,
Gabriele Liga,
Yi Lei,
Wei Ling,
Zhengyan Huan,
Xuwei Xue,
Alex Alvarado
Abstract:
We review the design of multidimensional modulations by maximizing generalized mutual information and compare the maximum transmission reach of recently introduced 4D formats. A model-based optimization for nonlinear-tolerant 4D modulations is also discussed.
Submitted 23 December, 2021;
originally announced December 2021.
-
Low-Complexity Geometrical Shaping for 4D Modulation Formats via Amplitude Coding
Authors:
Bin Chen,
Wei Ling,
Yunus Can Gültekin,
Yi Lei,
Chigo Okonkwo,
Alex Alvarado
Abstract:
Signal shaping is vital to approach Shannon's capacity, yet it is challenging to implement at very high speeds. For example, probabilistic shaping often requires arithmetic coding to realize the target distribution. Geometric shaping requires look-up tables to store the constellation points. In this paper, we propose a four-dimensional amplitude coding (4D-AC) geometrical shaper architecture. The proposed architecture can generate in real time geometrically shaped 4D formats via simple logic circuit operations and two conventional quadrature amplitude modulation (QAM) modulators. This paper describes the 4D-AC used in generating approximated versions of two recently proposed 4D orthant symmetric modulation formats with spectral efficiencies of 6 bit/4D-sym and 7 bit/4D-sym, respectively. Numerical results show losses below 0.05 dB when compared against the baseline formats.
Submitted 29 October, 2021;
originally announced October 2021.
-
On Parameter Optimization and Reach Enhancement for the Improved Soft-Aided Staircase Decoder
Authors:
Yi Lei,
Bin Chen,
Gabriele Liga,
Alex Alvarado
Abstract:
The so-called improved soft-aided bit-marking algorithm was recently proposed for staircase codes (SCCs) in the context of fiber optical communications. This algorithm is known as iSABM-SCC. With the help of channel soft information, the iSABM-SCC decoder marks bits via thresholds to deal with both miscorrections and failures of hard-decision (HD) decoding. In this paper, we study iSABM-SCC focusing on the parameter optimization of the algorithm and its performance analysis, in terms of the gap to the achievable information rates (AIRs) of HD codes and the fiber reach enhancement. We show in this paper that the marking thresholds and the number of modified component decodings heavily affect the performance of iSABM-SCC, and thus, they need to be carefully optimized. By replacing standard decoding with the optimized iSABM-SCC decoding, the gap to the AIRs of HD codes can be reduced to 0.26-1.02 dB for code rates of 0.74-0.87 in the additive white Gaussian noise channel with 8-ary pulse amplitude modulation. The obtained reach increase is up to 22% for data rates between 401 Gbps and 468 Gbps in an optical fiber channel.
Submitted 12 May, 2021;
originally announced May 2021.
-
Artificial Intelligence in Tumor Subregion Analysis Based on Medical Imaging: A Review
Authors:
Mingquan Lin,
Jacob Wynne,
Yang Lei,
Tonghe Wang,
Walter J. Curran,
Tian Liu,
Xiaofeng Yang
Abstract:
Medical imaging is widely used in cancer diagnosis and treatment, and artificial intelligence (AI) has achieved tremendous success in various tasks of medical image analysis. This paper reviews AI-based tumor subregion analysis in medical imaging. We summarize the latest AI-based methods for tumor subregion analysis and their applications. Specifically, we categorize the AI-based methods by training strategy: supervised and unsupervised. A detailed review of each category is presented, highlighting important contributions and achievements. Specific challenges and potential AI applications in tumor subregion analysis are discussed.
Submitted 24 March, 2021;
originally announced March 2021.
-
A Soft-Aided Staircase Decoder Using Three-Level Channel Reliabilities
Authors:
Yi Lei,
Bin Chen,
Gabriele Liga,
Alexios Balatsoukas-Stimming,
Kaixuan Sun,
Alex Alvarado
Abstract:
The soft-aided bit-marking (SABM) algorithm is based on the idea of marking bits as highly reliable bits (HRBs), highly unreliable bits (HUBs), and uncertain bits to improve the performance of hard-decision (HD) decoders. The HRBs and HUBs are used to assist the HD decoders to prevent miscorrections and to decode those originally uncorrectable cases via bit flipping (BF), respectively. In this paper, an improved SABM algorithm (called iSABM) is proposed for staircase codes (SCCs). Similar to the SABM, iSABM marks bits with the help of channel reliabilities, i.e., using the absolute values of the log-likelihood ratios. The improvements offered by iSABM include: (i) HUBs being classified using a reliability threshold, (ii) BF randomly selecting HUBs, and (iii) soft-aided decoding over multiple SCC blocks. The decoding complexity of iSABM is comparable to that of SABM. This is due to the fact that on the one hand no sorting is required (lower complexity) because of the use of a threshold for HUBs, while on the other hand multiple SCC blocks use soft information (higher complexity). Additional gains of up to 0.53 dB with respect to SABM and 0.91 dB with respect to standard SCC decoding at a bit error rate of $10^{-6}$ are reported. Furthermore, it is shown that using 1-bit reliability marking, i.e., only having HRBs and HUBs, only causes a gain penalty of up to 0.25 dB with a significantly reduced memory requirement.
Submitted 17 March, 2021;
originally announced March 2021.
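The three-level marking described in the abstract above can be sketched as a simple threshold test on the absolute log-likelihood ratio (LLR) of each bit. This is only an illustration of the marking step, not the iSABM decoder: the threshold values, the function names, and the sign convention (non-negative LLR maps to bit 0) are assumptions not taken from the source.

```python
def mark_bits(llrs, reliable_threshold=10.0, unreliable_threshold=2.5):
    """Label each bit as 'HRB', 'HUB', or 'uncertain' from its |LLR|.

    Both thresholds are hypothetical placeholders; the abstract notes that
    such parameters need to be chosen for the code and channel at hand.
    """
    marks = []
    for llr in llrs:
        reliability = abs(llr)
        if reliability >= reliable_threshold:
            marks.append("HRB")        # highly reliable bit
        elif reliability <= unreliable_threshold:
            marks.append("HUB")        # highly unreliable bit
        else:
            marks.append("uncertain")
    return marks


def hard_decisions(llrs):
    """Hard-decide each bit, assuming a non-negative LLR means bit 0."""
    return [0 if llr >= 0 else 1 for llr in llrs]
```

For example, `mark_bits([12.3, -0.4, 5.0, -11.0])` labels the four bits as HRB, HUB, uncertain, and HRB. Because the classification is a per-bit comparison, no sorting of reliabilities is needed, which matches the complexity argument in the abstract.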
-
Generative Adversarial Network for Image Synthesis
Authors:
Yang Lei,
Richard L. J. Qiu,
Tonghe Wang,
Walter J. Curran,
Tian Liu,
Xiaofeng Yang
Abstract:
This chapter reviews recent developments of generative adversarial networks (GAN)-based methods for medical and biomedical image synthesis tasks. These methods are classified into conditional GAN and Cycle-GAN according to the network architecture designs. For each category, a literature survey is given, which covers discussions of the network architecture designs, highlights important contributions and identifies specific challenges.
Submitted 30 December, 2020;
originally announced December 2020.
-
Fine-grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis
Authors:
Yi Lei,
Shan Yang,
Lei Xie
Abstract:
This paper proposes a unified model to conduct emotion transfer, control and prediction for sequence-to-sequence based fine-grained emotional speech synthesis. Conventional emotional speech synthesis often needs manual labels or reference audio to determine the emotional expressions of synthesized speech. Such coarse labels cannot control the details of speech emotion, often resulting in an averaged emotion expression delivery, and it is also hard to choose suitable reference audio during inference. To conduct fine-grained emotion expression generation, we introduce phoneme-level emotion strength representations through a learned ranking function to describe the local emotion details, and the sentence-level emotion category is adopted to render the global emotions of synthesized speech. With the global render and local descriptors of emotions, we can obtain fine-grained emotion expressions from reference audio via its emotion descriptors (for transfer) or directly from phoneme-level manual labels (for control). As for the emotional speech synthesis with arbitrary text inputs, the proposed model can also predict phoneme-level emotion expressions from texts, which does not require any reference audio or manual label.
Submitted 17 November, 2020;
originally announced November 2020.
-
Learn2Sing: Target Speaker Singing Voice Synthesis by learning from a Singing Teacher
Authors:
Heyang Xue,
Shan Yang,
Yi Lei,
Lei Xie,
Xiulin Li
Abstract:
Singing voice synthesis has received rising attention with the rapid development of the speech synthesis area. In general, a studio-level singing corpus is usually necessary to produce a natural singing voice from lyrics and music-related transcription. However, such a corpus is difficult to collect since it's hard for many of us to sing like a professional singer. In this paper, we propose an approach -- Learn2Sing -- that only needs a singing teacher to generate the target speakers' singing voice without their singing voice data. In our approach, a teacher's singing corpus and speech from multiple target speakers are used to train a frame-level auto-regressive acoustic model in which singing and speaking share a common speaker embedding and style tag embedding. Meanwhile, since there is no music-related transcription for the target speaker, we use the log-scale fundamental frequency (LF0) as an auxiliary input feature of the acoustic model to build a unified input representation. In order to enable the target speaker to sing without singing reference audio in the inference stage, a duration model and an LF0 prediction model are also trained. Particularly, we employ domain adversarial training (DAT) in the acoustic model, which aims to enhance the singing performance of target speakers by disentangling style from acoustic features of singing and speaking data. Our experiments indicate that the proposed approach is capable of synthesizing singing voice for target speakers given only their speech samples.
Submitted 17 November, 2020;
originally announced November 2020.
-
Synthetic MRI-aided Head-and-Neck Organs-at-Risk Auto-Delineation for CBCT-guided Adaptive Radiotherapy
Authors:
Xianjin Dai,
Yang Lei,
Tonghe Wang,
Anees H. Dhabaan,
Mark McDonald,
Jonathan J. Beitler,
Walter J. Curran,
Jun Zhou,
Tian Liu,
Xiaofeng Yang
Abstract:
Purpose: Organ-at-risk (OAR) delineation is a key step for cone-beam CT (CBCT) based adaptive radiotherapy planning that can be a time-consuming, labor-intensive, and subject-to-variability process. We aim to develop a fully automated approach aided by synthetic MRI for rapid and accurate CBCT multi-organ contouring in head-and-neck (HN) cancer patients. MRI has superb soft-tissue contrasts, while CBCT offers bony-structure contrasts. Using the complementary information provided by MRI and CBCT is expected to enable accurate multi-organ segmentation in HN cancer patients. In our proposed method, MR images are firstly synthesized using a pre-trained cycle-consistent generative adversarial network given CBCT. The features of CBCT and synthetic MRI are then extracted using dual pyramid networks for final delineation of organs. CBCT images and their corresponding manual contours were used as pairs to train and test the proposed model. Quantitative metrics including Dice similarity coefficient (DSC) were used to evaluate the proposed method. The proposed method was evaluated on a cohort of 65 HN cancer patients. CBCT images were collected from those patients who received proton therapy. Overall, DSC values of 0.87, 0.79/0.79, 0.89/0.89, 0.90, 0.75/0.77, 0.86, 0.66, 0.78/0.77, 0.96, 0.89/0.89, 0.832, and 0.84 for commonly used OARs for treatment planning including brain stem, left/right cochlea, left/right eye, larynx, left/right lens, mandible, optic chiasm, left/right optic nerve, oral cavity, left/right parotid, pharynx, and spinal cord, respectively, were achieved. In this study, we developed a synthetic MRI-aided HN CBCT auto-segmentation method based on deep learning. It provides a rapid and accurate OAR auto-delineation approach, which can be used for adaptive radiation therapy.
Submitted 8 October, 2020;
originally announced October 2020.
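The Dice similarity coefficient (DSC) used to report the results above is defined as DSC = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. The following is a minimal sketch of that formula on flattened binary masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, not details from the paper.

```python
def dice_coefficient(pred_mask, ref_mask):
    """Dice similarity coefficient between two flattened binary masks.

    Each mask is a sequence of 0/1 (or False/True) voxel labels of equal length.
    """
    if len(pred_mask) != len(ref_mask):
        raise ValueError("masks must have the same number of voxels")
    pred_count = sum(1 for p in pred_mask if p)
    ref_count = sum(1 for r in ref_mask if r)
    overlap = sum(1 for p, r in zip(pred_mask, ref_mask) if p and r)
    if pred_count + ref_count == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * overlap / (pred_count + ref_count)
```

For instance, a prediction `[1, 1, 0, 0]` against a reference `[1, 0, 1, 0]` shares one voxel out of two in each mask, giving a DSC of 0.5.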
-
Learning-Based Synthetic Dual Energy CT Imaging from Single Energy CT for Stopping Power Ratio Calculation in Proton Radiation Therapy
Authors:
Serdar Charyyev,
Tonghe Wang,
Yang Lei,
Beth Ghavidel,
Jonathan J. Beitler,
Mark McDonald,
Walter J. Curran,
Tian Liu,
Jun Zhou,
Xiaofeng Yang
Abstract:
Purpose: Dual-energy CT (DECT) has been shown to derive stopping power ratio (SPR) maps with higher accuracy than conventional single energy CT (SECT) by obtaining the energy dependence of photon interactions. However, DECT is not as widely implemented as SECT in proton radiation therapy simulation. This work presents a learning-based method to synthesize DECT images from SECT for proton radiation therapy. Methods: The proposed method uses a residual attention generative adversarial network. Residual blocks with attention gates were used to force the model to focus on the difference between DECT maps and SECT images. To evaluate the accuracy of the method, we retrospectively investigated 20 head-and-neck cancer patients with both DECT and SECT scans available. The high and low energy CT images acquired from DECT acted as learning targets in the training process for SECT datasets and were evaluated against results from the proposed method using a leave-one-out cross-validation strategy. To evaluate our method in the context of a practical application, we generated SPR maps from synthetic DECT (sDECT) using a physics-based dual-energy stoichiometric method and compared the maps to those generated from DECT. Results: The synthesized DECT images showed an average mean absolute error of around 30 Hounsfield Units (HU) across the whole-body volume. The corresponding SPR maps generated from synthetic DECT showed an average normalized mean square error of about 1%, with lower noise levels and fewer artifacts than those from the original DECT. Conclusions: The accuracy of the DECT images synthesized by our machine-learning-based method was evaluated on head-and-neck patients, and potential feasibility for proton treatment planning and dose calculation was shown by generating SPR maps using the synthesized DECT.
Submitted 25 May, 2020;
originally announced May 2020.