Search | arXiv e-print repository

Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion

Authors: Ruiqi Li, Rongjie Huang, Yongqi Wang, Zhiqing Hong, Zhou Zhao

Abstract: Speech-to-singing voice conversion (STS) task always suffers from data scarcity, because it requires paired speech and singing data. Compounding this issue are the challenges of content-pitch alignment and the suboptimal quality of generated outputs, presenting significant hurdles in STS research. This paper presents SVPT, an STS approach boosted by a self-supervised singing voice pre-training model. We leverage spoken language model techniques to tackle the rhythm alignment problem and the in-context learning capability to achieve zero-shot conversion. We adopt discrete-unit random resampling and pitch corruption strategies, enabling training with unpaired singing data and thus mitigating the issue of data scarcity. SVPT also serves as an effective backbone for singing voice synthesis (SVS), offering insights into scaling up SVS models. Experimental results indicate that SVPT delivers notable improvements in both STS and SVS endeavors. Audio samples are available at https://speech2sing.github.io.

Submitted 4 June, 2024; originally announced June 2024.

Comments: 13 pages

arXiv:2405.09940 [pdf, other]

Robust Singing Voice Transcription Serves Synthesis

Authors: Ruiqi Li, Yu Zhang, Yongqi Wang, Zhiqing Hong, Rongjie Huang, Zhou Zhao

Abstract: Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications. Current AST methods, however, struggle with accuracy and robustness when used for practical annotation. This paper presents ROSVOT, the first robust AST model that serves SVS, incorporating a multi-scale framework that effectively captures coarse-grained note information and ensures fine-grained frame-level segmentation, coupled with an attention-based pitch decoder for reliable pitch prediction. We also established a comprehensive annotation-and-training pipeline for SVS to test the model in real-world settings. Experimental findings reveal that ROSVOT achieves state-of-the-art transcription accuracy with either clean or noisy inputs. Moreover, when trained on enlarged, automatically annotated datasets, the SVS model outperforms its baseline, affirming the capability for practical application. Audio samples are available at https://rosvot.github.io.

Submitted 3 June, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

Comments: ACL 2024

arXiv:2404.17064 [pdf, other]

Detection of Peri-Pancreatic Edema using Deep Learning and Radiomics Techniques

Authors: Ziliang Hong, Debesh Jha, Koushik Biswas, Zheyuan Zhang, Yury Velichko, Cemal Yazici, Temel Tirkes, Amir Borhani, Baris Turkbey, Alpay Medetalibeyoglu, Gorkem Durak, Ulas Bagci

Abstract: Identifying peri-pancreatic edema is a pivotal indicator for identifying disease progression and prognosis, emphasizing the critical need for accurate detection and assessment in pancreatitis diagnosis and management. This study introduces a novel CT dataset sourced from 255 patients with pancreatic diseases, featuring annotated pancreas segmentation masks and corresponding diagnostic labels for peri-pancreatic edema condition. With the novel dataset, we first evaluate the efficacy of the LinTransUNet model, a linear Transformer based segmentation algorithm, to segment the pancreas accurately from CT imaging data. Then, we use segmented pancreas regions with two distinctive machine learning classifiers to identify existence of peri-pancreatic edema: deep learning-based models and a radiomics-based eXtreme Gradient Boosting (XGBoost). The LinTransUNet achieved promising results, with a dice coefficient of 80.85%, and mIoU of 68.73%. Among the nine benchmarked classification models for peri-pancreatic edema detection, the Swin-Tiny transformer model demonstrated the highest recall of $98.85 \pm 0.42$ and precision of $98.38 \pm 0.17$. Comparatively, the radiomics-based XGBoost model achieved an accuracy of $79.61 \pm 4.04$ and recall of $91.05 \pm 3.28$, showcasing its potential as a supplementary diagnostic tool given its rapid processing speed and reduced training time. Our code is available at https://github.com/NUBagciLab/Peri-Pancreatic-Edema-Detection.

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.09313 [pdf, other]

Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment

Authors: Zhiqing Hong, Rongjie Huang, Xize Cheng, Yongqi Wang, Ruiqi Li, Fuming You, Zhou Zhao, Zhimeng Zhang

Abstract: A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently. Little attention has been paid to exploring song synthesis. In this work, we propose a novel task called text-to-song synthesis, which incorporates both vocal and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representation for controllable V2A synthesis. A Chinese song dataset mined from a music website is built up to alleviate data scarcity for our research. The evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found at https://text2songMelodist.github.io/Sample/.

Submitted 20 May, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

Comments: ACL 2024 Main

arXiv:2403.11780 [pdf, other]

Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

Authors: Yongqi Wang, Ruofan Hu, Rongjie Huang, Zhiqing Hong, Ruiqi Li, Wenrui Liu, Fuming You, Tao Jin, Zhou Zhao

Abstract: Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly. We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while keeping melodic accuracy. Furthermore, we explore various experiment settings, including different types of text representations, text encoder fine-tuning, and introducing speech data to alleviate data scarcity, aiming to facilitate further research. Experiments show that our model achieves favorable controlling ability and audio quality. Audio samples are available at http://prompt-singer.github.io.

Submitted 18 March, 2024; originally announced March 2024.

Comments: Accepted by NAACL 2024 (main conference)

arXiv:2309.07566 [pdf, other]

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Authors: Yongqi Wang, Jionghao Bai, Rongjie Huang, Ruiqi Li, Zhiqing Hong, Zhou Zhao

Abstract: Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech during translation. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer between source and target speech. We propose an S2ST framework with an acoustic language model based on discrete units from a self-supervised model and a neural codec for style transfer. The acoustic language model leverages self-supervised in-context learning, acquiring the ability for style transfer without relying on any speaker-parallel data, thereby overcoming the issue of data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speeches with high fidelity and style similarity. Audio samples are available at http://stylelm.github.io/.

Submitted 14 September, 2023; originally announced September 2023.

Comments: 5 pages, 1 figure. submitted to ICASSP 2024

arXiv:2304.12995 [pdf, other]

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Authors: Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe

Abstract: Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. With an increasing demand to evaluate multi-modal LLMs of human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease. Our system is publicly available at https://github.com/AIGC-Audio/AudioGPT.

Submitted 25 April, 2023; originally announced April 2023.

arXiv:2303.09248 [pdf, other]

Cross-Dimensional Refined Learning for Real-Time 3D Visual Perception from Monocular Video

Authors: Ziyang Hong, C. Patrick Yue

Abstract: We present a novel real-time capable learning method that jointly perceives a 3D scene's geometry structure and semantic labels. Recent approaches to real-time 3D scene reconstruction mostly adopt a volumetric scheme, where a Truncated Signed Distance Function (TSDF) is directly regressed. However, these volumetric approaches tend to focus on the global coherence of their reconstructions, which leads to a lack of local geometric detail. To overcome this issue, we propose to leverage the latent geometric prior knowledge in 2D image features by explicit depth prediction and anchored feature generation, to refine the occupancy learning in TSDF volume. Besides, we find that this cross-dimensional feature refinement methodology can also be adopted for the semantic segmentation task by utilizing semantic priors. Hence, we proposed an end-to-end cross-dimensional refinement neural network (CDRNet) to extract both 3D mesh and 3D semantic labeling in real time. The experiment results show that this method achieves a state-of-the-art 3D perception efficiency on multiple datasets, which indicates the great potential of our method for industrial applications.

Submitted 10 September, 2023; v1 submitted 16 March, 2023; originally announced March 2023.

Comments: Accepted to ICCV 2023 Workshops. Project page: https://hafred.github.io/cdrnet/

arXiv:2207.14174 [pdf, other]

Bayesian Optimization-Based Beam Alignment for MmWave MIMO Communication Systems

Authors: Songjie Yang, Baojuan Liu, Zhiqin Hong, Zhongpei Zhang

Abstract: Due to the very narrow beam used in millimeter wave communication (mmWave), beam alignment (BA) is a critical issue. In this work, we investigate the issue of mmWave BA and present a novel beam alignment scheme on the basis of a machine learning strategy, Bayesian optimization (BO). In this context, we consider the beam alignment issue to be a black box function and then use BO to find the possible optimal beam pair. During the BA procedure, this strategy exploits information from the measured beam pairs to predict the best beam pair. In addition, we suggest a novel BO algorithm based on the gradient boosting regression tree model. The simulation results demonstrate the spectral efficiency performance of our proposed schemes for BA using three different surrogate models. They also demonstrate that the proposed schemes can achieve spectral efficiency with a small overhead when compared to the orthogonal match pursuit (OMP) algorithm and the Thompson sampling-based multi-armed bandit (TS-MAB) method.

Submitted 28 July, 2022; originally announced July 2022.

arXiv:2207.14103 [pdf, other]

doi 10.1109/LAWP.2022.3192308

Low-complexity Sparse Array Synthesis Based on Off-grid Compressive Sensing

Authors: Songjie Yang, Baojuan Liu, Zhiqin Hong, Zhongpei Zhang

Abstract: A novel sparse array synthesis method for non-uniform planar arrays is proposed, which belongs to compressive sensing (CS)-based synthesis. Particularly, we propose an off-grid refinement technique to simultaneously optimize the antenna element positions and excitations with a low complexity, in response to the antenna position optimization problem that is difficult for standard CS. More importantly, we take into account the minimum inter-element spacing constraint for ensuring the physically realizable solution. Specifically, the off-grid Orthogonal Match Pursuit (OMP) algorithm is first proposed with low complexity and then off-grid Look Ahead Orthogonal Match Pursuit (LAOMP) is designed with better synthesis performance but higher complexity. In addition, simulation results have shown the proposed schemes have more advantages in computational complexity and synthesis performances compared with the related method.

Submitted 28 July, 2022; originally announced July 2022.

arXiv:2205.13249 [pdf, other]

DT-SV: A Transformer-based Time-domain Approach for Speaker Verification

Authors: Nan Zhang, Jianzong Wang, Zhenhou Hong, Chendong Zhao, Xiaoyang Qu, Jing Xiao

Abstract: Speaker verification (SV) aims to determine whether the speaker's identity of a test utterance is the same as the reference speech. In the past few years, extracting speaker embeddings using deep neural networks for SV systems has gone mainstream. Recently, different attention mechanisms and Transformer networks have been explored widely in SV fields. However, utilizing the original Transformer in SV directly may have frame-level information waste on output features, which could lead to restrictions on capacity and discrimination of speaker embeddings. Therefore, we propose an approach to derive utterance-level speaker embeddings via a Transformer architecture that uses a novel loss function named diffluence loss to integrate the feature information of different Transformer layers. Therein, the diffluence loss aims to aggregate frame-level features into an utterance-level representation, and it could be integrated into the Transformer expediently. Besides, we also introduce a learnable mel-fbank energy feature extractor named time-domain feature extractor that computes the mel-fbank features more precisely and efficiently than the standard mel-fbank extractor. Combining Diffluence loss and Time-domain feature extractor, we propose a novel Transformer-based time-domain SV model (DT-SV) with faster training speed and higher accuracy. Experiments indicate that our proposed model can achieve better performance in comparison with other models.

Submitted 26 May, 2022; originally announced May 2022.

Comments: Accepted by IJCNN2022 (The 2022 International Joint Conference on Neural Networks)

arXiv:2107.05162 [pdf, other]

Board-level Code-Modulated Embedded Test and Calibration of an X-band Phased-Array Transceiver

Authors: Zhangjie Hong, Simon Schönherr, Vikas Chauhan, Brian Floyd

Abstract: We present methods for built-in test and calibration of phased arrays using code-modulated embedded test (CoMET). Our approach employs Cartesian modulation of test signals within each element using existing phase shifters, combining of these signals into an aggregate code-multiplexed response, downconversion and creation of code-modulated element-to-element "interference products" using a built-in power detector, demodulation of correlations from the digitized interference response, and extraction of amplitude and phase per element using an equation solver. Rotated-axis methodology is discussed for accurate extraction of phase near the original 0/90/180/270 degree axes. Our techniques are demonstrated at board level for both receive and transmit modes using an eight-element 8-16 GHz phased array constructed using ADAR1000 chips from ADI. At 6 GHz, CoMET-extracted gain and phase are accurate to within 0.2 dB and 3 degrees compared to network-analyzer measurements. We then employ CoMET in a calibration loop to determine optimum control settings at 6 GHz, outside the 8-16 GHz band for which the array was designed. We achieve seven-bit phase resolution with equalized gain. The root-mean-squared gain and phase errors are improved from 0.8 dB and 8 degrees before calibration to 0.1 dB and 1.7 degrees after calibration.

Submitted 11 July, 2021; originally announced July 2021.

arXiv:2103.13581 [pdf, other]

doi 10.1109/TASLP.2022.3182856

EfficientTDNN: Efficient Architecture Search for Speaker Recognition

Authors: Rui Wang, Zhihua Wei, Haoran Duan, Shouling Ji, Yang Long, Zhen Hong

Abstract: Convolutional neural networks (CNNs), such as the time-delay neural network (TDNN), have shown their remarkable capability in learning speaker embedding. However, they meanwhile bring a huge computational cost in storage size, processing, and memory. Discovering the specialized CNN that meets a specific constraint requires a substantial effort of human experts. Compared with hand-designed approaches, neural architecture search (NAS) appears as a practical technique in automating the manual architecture design process and has attracted increasing interest in spoken language processing tasks such as speaker recognition. In this paper, we propose EfficientTDNN, an efficient architecture search framework consisting of a TDNN-based supernet and a TDNN-NAS algorithm. The proposed supernet introduces temporal convolution of different ranges of the receptive field and feature aggregation of various resolutions from different layers to TDNN. On top of it, the TDNN-NAS algorithm quickly searches for the desired TDNN architecture via weight-sharing subnets, which surprisingly reduces computation while handling the vast number of devices with various resources requirements. Experimental results on the VoxCeleb dataset show the proposed EfficientTDNN enables approximately $10^{13}$ architectures concerning depth, kernel, and width. Considering different computation constraints, it achieves a 2.20% equal error rate (EER) with 204M multiply-accumulate operations (MACs), 1.41% EER with 571M MACs as well as 0.94% EER with 1.45G MACs. Comprehensive investigations suggest that the trained supernet generalizes subnets not sampled during training and obtains a favorable trade-off between accuracy and efficiency.

Submitted 18 June, 2022; v1 submitted 24 March, 2021; originally announced March 2021.

Comments: 13 pages, 12 figures, accepted to TASLP

arXiv:1802.00285 [pdf, other]

Virtual-to-Real: Learning to Control in Visual Semantic Segmentation

Authors: Zhang-Wei Hong, Chen Yu-Ming, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, Hsuan-Kung Yang, Brian Hsi-Lin Ho, Chih-Chieh Tu, Yueh-Chuan Chang, Tsu-Ching Hsiao, Hsin-Wei Hsiao, Sih-Pin Lai, Chun-Yi Lee

Abstract: Collecting training data from the physical world is usually time-consuming and even dangerous for fragile robots, and thus, recent advances in robot learning advocate the use of simulators as the training platform. Unfortunately, the reality gap between synthetic and real visual data prohibits direct migration of the models trained in virtual worlds to the real world. This paper proposes a modular architecture for tackling the virtual-to-real problem. The proposed architecture separates the learning model into a perception module and a control policy module, and uses semantic image segmentation as the meta representation for relating these two modules. The perception module translates the perceived RGB image to semantic image segmentation. The control policy module is implemented as a deep reinforcement learning agent, which performs actions based on the translated image segmentation. Our architecture is evaluated in an obstacle avoidance task and a target following task. Experimental results show that our architecture significantly outperforms all of the baseline methods in both virtual and real environments, and demonstrates a faster learning curve than them. We also present a detailed analysis for a variety of variant configurations, and validate the transferability of our modular architecture.

Submitted 28 October, 2018; v1 submitted 1 February, 2018; originally announced February 2018.

Comments: 7 pages, accepted by IJCAI-18

Showing 1–14 of 14 results for author: Hong, Z