RefXVC: Cross-Lingual Voice Conversion with Enhanced Reference Leveraging

Mingyang Zhang, Yi Zhou, Yi Ren, Chen Zhang, Xiang Yin, Haizhou Li

Mingyang Zhang and Haizhou Li are with the Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen, China (e-mail: [email protected]; [email protected]). Yi Zhou is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore (e-mail: [email protected]). Yi Ren, Chen Zhang, and Xiang Yin are with the Speech & Audio team, ByteDance AI Lab (e-mail: [email protected], [email protected], [email protected]).

Abstract

This paper proposes RefXVC, a method for cross-lingual voice conversion (XVC) that leverages reference information to improve conversion performance. Previous XVC works generally take an average speaker embedding to condition the speaker identity, which does not account for the changing timbre of speech that occurs with different pronunciations. To address this, our method uses both global and local speaker embeddings to capture the timbre changes during speech conversion. Additionally, we observed a connection between timbre and pronunciation in different languages and utilized this by incorporating a timbre encoder and a pronunciation matching network into our model. Furthermore, we found that the variation in tones is not adequately reflected in a sentence, and therefore, we used multiple references to better capture the range of a speaker’s voice. The proposed method outperformed existing systems in terms of both speech quality and speaker similarity, highlighting the effectiveness of leveraging reference information in cross-lingual voice conversion. The converted speech samples can be found on the website: http://refxvc.dn3point.com

Index Terms:
cross-lingual voice conversion (XVC), speaker embedding, multi-reference, pitch normalization

I Introduction

Among speech synthesis tasks, cross-lingual voice conversion (XVC) is an interesting research topic: it converts a speaker’s voice from one language to another while maintaining the speaker’s identity [1, 2, 3, 4]. For instance, XVC enables the actor in a Hollywood English movie to speak perfect Spanish, Mandarin, Hindi, and other languages. XVC is challenging, but it is the enabling technology for various real-life applications, e.g., foreign language education [1], speech-to-speech translation [5], foreign movie dubbing, and so on [4]. In the XVC task, the converted speech is expected to sound as if it were pronounced by a native speaker.

If a large amount of high-quality speech data from the target speaker is available, one can easily build an acoustic model with the data for conversion. However, it is usually unrealistic to obtain such a quantity of speech data for each target speaker. Zero-shot XVC has therefore drawn researchers’ attention, as it requires only a few enrolled speech samples to generate the desired voice. Thanks to the power of deep learning, zero-shot XVC has achieved great success. It is generally trained on speech data from multiple speakers and generates the desired voice by conditioning on a speaker identity representation. A speaker can be simply represented by a fixed vector produced by a pre-trained neural speaker recognition model [6] or by a disentangled speaker embedding obtained from an encoder module [7]. Such speaker representations usually carry features averaged per speaker [6] or per utterance [7].

The main motivation behind XVC is to enable seamless communication between speakers of different languages while preserving the naturalness and identity of the target speaker. However, XVC is a challenging task because it requires the conversion of not only the phonetic and linguistic features of the source speech but also the speaker identity and prosodic features [8]. One of the major challenges in XVC is to deal with the timbre changes that occur when a speaker produces different pronunciations across languages. Previous works in XVC have generally used an average speaker embedding to represent the speaker’s voice, which does not account for these timbre changes. As a result, the converted speech may sound unnatural or contain artifacts [9].

In addition to the timbre problem, a connection between timbre and pronunciation in different languages can be observed. Therefore, the conversion model is also required to leverage the content information of the references to improve the converted speech quality [10]. However, the variation in tones is not adequately reflected in a single sentence, and therefore multiple references are necessary to better capture the range of a speaker’s voice.

Furthermore, the task of XVC involves two language systems. The differences in pronunciation, intonation, and other linguistic features create further obstacles to accurately describing the vocal tract of a speaker. Consequently, traces of dissimilarity to the actual speaker still exist in the converted speech [11].

To address the challenges, we present the RefXVC system, which seeks to leverage speaker information from the reference to the maximal extent in order to improve XVC performance. Our proposed XVC network utilizes the autoencoder architecture to map input self-supervised learning (SSL) representations to acoustic features, which are conditioned on fine-grained speaker embeddings extracted using several techniques. Our approach is centered around the following key aspects:

1. We introduce a timbre encoder that extracts both global and local speaker embeddings from the reference speech. They are combined to capture the time-varying speaker characteristics in a given sentence. The global speaker embedding characterizes the overall characteristics of the speaker’s voice, while the local speaker embeddings represent the fine-grained variations in timbre that occur with different pronunciations. By using a timbre encoder to extract both types of speaker information, we can ensure that the synthesized speech has the correct timbre and tone of the target speaker, which can improve the naturalness and authenticity of the generated speech.

2. The second aspect of our approach is the design of a pronunciation matching network to utilize content-related speaker information. The pronunciation matching network is trained to align SSL features of the source speech with those of the reference speech. By using content-related speaker information, such as the pronunciation of specific words and phrases, we can ensure that the converted speech has the correct pronunciation of the source sentence, which can improve the intelligibility and accuracy of the converted speech.

3. We further employ the use of multi-reference encoding to enrich the content information. In many cases, a single reference speech may not contain enough information to cover the nuances and variations of the source speech. Using multiple reference speech samples, we can enrich the content information and ensure that the converted speech has the correct intonation of the source content. This can improve the naturalness and expressiveness of the converted speech, which can be particularly important in applications of XVC.

Overall, our method tackles the challenges of XVC by utilizing a timbre encoder to extract both global and local speaker embeddings, a pronunciation matching network to utilize content-related speaker information, a multi-reference encoding technique to enrich the content information, and a normalized pitch as input to better preserve the native prosody. By combining these techniques, we can improve the quality of the converted speech and make it sound more natural and similar to the target speaker.

In the experiments, we convert the voice between English and Spanish speakers. We verify that, when prompted with reference speech, the proposed system synthesizes natural speech with high speaker similarity in the zero-shot XVC task.

II Related Works

In this section, we revisit several voice conversion frameworks and self-supervised learning representations for zero-shot XVC. We also study the related works in speaker code representation.

II-A Cross-Lingual Voice Conversion

In cross-lingual voice conversion (XVC) tasks, source refers to an utterance from a native speaker in one language, while target is defined as the utterance from another speaker who speaks a different language. Ideally, converted speech carries the source’s speech content while presenting the target’s timbre. Popular XVC frameworks generally adopt the encoder-decoder architecture to disentangle the speaker-dependent component (speaker identity) from the speaker-independent component (speech content). In this way, one can convert the voice from one speaker to another by changing only the speaker-dependent component while keeping the speaker-independent component.

Variational autoencoder (VAE) [12, 13, 14, 15] is one such implementation, which learns a latent space for speaker-independent representation. Similarly, generative adversarial networks (GAN) [16] disentangle the speech attributes with an extra adversarial loss to guarantee a distribution match between the generated and true data [17, 18]. AutoVC uses a carefully designed bottleneck to constrain the information flow and is equipped with a speaker verification module to learn the speaker embedding [19]. Whisper [20] is also employed as a content feature extractor for XVC in [21]. The authors introduced the speaker consistency loss to enhance the speaker information contained within the extracted speaker embedding. However, these methods do not explicitly leverage linguistic content as supervision and thus sometimes produce unclear or distorted samples.

Alternatively, pretraining techniques greatly enhance the conversion performance by providing a linguistic representation learned by a well-trained model. Automatic speech recognition (ASR) is a perfect option, which can provide linguistic representations by either Phonetic PosteriorGram (PPG) [22, 23, 4], or directly the discrete output phonemes. Yet, this requires extensive labeled data for training the ASR models. Additionally, recognition errors highly affect the generated speech intelligibility.

II-B SSL Representation

Self-supervised learning (SSL) representation has gained significant attention in recent years due to its ability to efficiently learn meaningful features from large and unstructured datasets [24], which is especially useful in domains such as computer vision [25], natural language processing [26], and speech recognition [27]. Several well-known SSL representations, e.g., Autoregressive Predictive Coding (APC) [28], Contrastive Predictive Coding (CPC) [29], and wav2vec [30] have been studied for voice conversion [31, 28, 32].

II-B1 HuBERT Token

HuBERT [33] is a self-supervised pretraining model for speech processing, similar in spirit to the popular BERT model: it learns to predict masked portions of the audio signal. HuBERT tokens are the discrete units obtained by quantizing the model’s frame-level hidden representations, typically with k-means clustering, so that an utterance becomes a sequence of unit indices. These tokens encode meaningful representations of the audio signal that can be useful for a variety of speech processing tasks, such as speech recognition [34], speaker identification [35], and speech-to-speech translation [36].

The HuBERT token is advantageous over other token-based approaches because it enables the model to process speech directly without the need for any additional preprocessing. This means that the model can learn speech features more efficiently and accurately. Additionally, the HuBERT model is designed to handle long sequences of speech, which is critical for many speech generation tasks. It has been investigated in XVC and obtained impressive performance [37]; hence, the HuBERT representation is an ideal option to set our starting point.

II-C Neural Speaker Encoding

The use of HuBERT tokens makes the content representation a fixed embedding. The choice of speaker representation is critical to obtain the desired voice. There are two popular approaches to obtaining a speaker embedding: 1) extracting the hidden representations from a pre-trained speaker recognition system such as d-vector [6], x-vector [38], and ECAPA-TDNN [39]; 2) jointly training a speaker encoder through multitask learning and extracting bottleneck features as a speaker embedding. The latter method typically involves disentangling the speaker information with an adversarial loss on the speaker classification task [40, 41, 42].

Pretraining a speaker recognition system offers the advantage of using large-scale speaker databases, enabling the learned speaker representation to exhibit high speaker similarity in several multi-speaker speech generation frameworks [43, 19, 44, 45]. On the other hand, joint training provides a more flexible optimization process dedicated to the speech synthesis task, providing further insights to characterize the speaker details [46, 47].

Resemblyzer (github.com/resemble-ai/Resemblyzer) is a popular choice in recent speech synthesis studies that derives a high-level representation of a voice through a deep learning model. Given an audio file of speech, Resemblyzer creates a 256-dimensional summary vector that captures the voice’s characteristics. This is taken as a suitable reference for this work.

II-D Personalized Speech/Singing Voice Synthesis

Personalized text-to-speech (TTS) and singing voice synthesis (SVS) techniques generate speech and singing, respectively. They share with voice conversion the objective of generating realistic and natural-sounding audio in a specific speaker’s voice, while the inputs of TTS and SVS are text and musical information, respectively, instead of speech. Current TTS and SVS frameworks mainly adopt the encoder-decoder architecture, where the encoder projects input into a latent embedding, and the decoder predicts acoustic features conditioned on a speaker representation [48, 43, 49, 50].

It can be noted that both TTS and SVS decoders work in the same way as the decoder in a VC system, which relies on speaker embeddings to vary the speaker identity. Hence, their efforts in the speaker encoding adoption serve as a valuable source of inspiration for our research in XVC. For example, the TTS work described in [51] attempts to encode timbre information using a content-dependent time-varying speaker embedding. The model successfully captures timbre information of target speakers unseen during training. On the other hand, several SVS studies [52, 53] have established a correlation between speech and singing and demonstrated the benefits of learning a unified speaker representation. Moreover, certain multilingual TTS works [54] have incorporated a secondary fine-tuning step to optimize a speaker identity-preserving loss, enabling the model to output a consistent voice regardless of language. These methods have a common goal of encouraging sharing of model capacity in speaker representation learning across linguistic and prosodic variations [55, 56].

These findings support our assumption that it is crucial to discover and establish the relationships in one’s vocal tract while pronouncing various content in different languages. The resultant speaker representation should be robust and comprehensive across languages, which is ideal for XVC.

Figure 1: (a) The overall architecture of the proposed RefXVC. (b) The details of the pronunciation matching network. $H_S$ denotes the hidden representation of source HuBERT, $H_R$ denotes the hidden representation of reference HuBERT, $S_L$ denotes the local speaker embedding, and $S_G$ denotes the global speaker embedding.

III Proposed RefXVC

This section introduces the proposed RefXVC framework, including the flow-based XVC framework, multilingual HuBERT token, timbre encoder, pronunciation matching network, multi-reference encoding, and pitch normalization.

III-A Flow-Based XVC Framework

The proposed RefXVC model is based on an autoencoder-like architecture that employs several information extraction modules to generate disentangled representations of Mel-spectrograms, as illustrated in Figure 1a. These modules consist of the following components: a content encoder that encodes input SSL representations into latent content information, a timbre encoder that extracts the target speaker’s timbre information from the reference speech, a pitch normalization module that normalizes the pitch on a per-sentence basis, a pronunciation matching network that represents a fine-grained time-varying timbral representation, and a speech decoder that takes all representations as input to generate speech with source content and target voice.

The content encoder and speech decoder follow the work presented in [37]. The content encoder is a stack of feed-forward Transformer layers with relative position encoding. The speech decoder reconstructs the Mel-spectrogram using a mean absolute error (MAE) loss and a multi-length adversarial loss. The following sections provide more details about each of the other sub-modules.

III-B Multilingual HuBERT Token

SSL pretraining has demonstrated its strength in learning high-level feature representations for various downstream tasks, where speech processing is one successful instance. In this section, we introduce the use of Multilingual HuBERT Token, a self-supervised learning model, to extract the feature representation from the reference speech. This multilingual model is capable of extracting phonetic features from input speech, irrespective of the language being spoken. This SSL pretraining allows us to leverage the massive amount of unlabeled speech data across different languages to learn a more robust and comprehensive feature representation.

We first encode the input speech using HuBERT to obtain its SSL representations, which provide a high-level representation of the input speech by capturing its phonetic and acoustic features. We then use this SSL representation as input to our XVC network. The multilingual HuBERT projects speech in different languages onto a common feature space representing the linguistic information and serves as a bridge between languages, thus enabling XVC.

Additionally, this SSL representation is also taken as input to our pronunciation matching network. It allows the network to focus on the phonetic features of the input speech and match them with corresponding features in the reference speech. This ensures accurate pronunciation and reduces foreign accents. The use of HuBERT is essential to handle XVC with high efficiency and accuracy.

III-C Timbre Encoder

The timbre of a speaker’s voice can vary depending on attributes such as pronunciation and intonation. Previous works commonly used an average speaker embedding to represent the speaker information, which may not capture the time-varying characteristics of the speaker’s voice details. To tackle this problem, we suggest utilizing a timbre encoder to extract global and local speaker embeddings. This approach aims to combine both representations to accurately characterize dynamic speaker information that varies over time, which is expected to enhance the preservation of speaker identity and the quality of the converted speech in XVC tasks. These embeddings can be used to benefit the following modules of the system, such as the multi-reference encoding and pronunciation matching network, by providing additional speaker-related information.

The timbre encoder is a critical component in our proposed XVC system. It is responsible for extracting the speaker embeddings, which are used to characterize the target speaker. We refer to this module as the “timbre” encoder because it captures the timbral characteristics of the speaker’s voice. To extract both utterance-level and frame-level speaker embeddings, we use a three-layer bidirectional LSTM neural network. The speaker embedding is obtained by passing the entire reference speech through the LSTM network. The resulting output from the last LSTM layer is used as the utterance-level or global speaker embedding $S_G$, which characterizes the speaker’s overall voice characteristics. The output of the last LSTM layer at each frame is used as the frame-level or local speaker embedding $S_L$, which characterizes the speaker’s voice characteristics at that specific moment in time. By extracting both utterance-level and frame-level speaker embeddings, our timbre encoder captures both the overall characteristics of the speaker’s voice as well as the finer nuances of their speech. These speaker embeddings are then used as inputs to the following modules, which generate the converted speech that matches the content of the input speech while maintaining the speaker’s characteristics.

III-D Pronunciation Matching Network

In this section, we introduce the proposed Pronunciation Matching Network (PMN), which aims to leverage the content information from the reference speech during training. As discussed earlier, the timbre of a voice in voice conversion tasks is not constant and changes with different pronunciations. Previous works mainly relied on an averaged speaker embedding, which is fixed at the utterance level or even for a speaker. Consequently, this embedding could fail to capture the dynamic variation in timbre. Additionally, in XVC, we believe a correlation exists between timbre and pronunciation in different languages, which has not been fully established in previous works. In this work, we propose PMN to uncover the subtle relation between SSL representations of the source and reference speech by considering the pronunciation similarity of different languages. In this way, the network gains an enhanced capability to leverage the dynamic content information from the reference speech. The resultant conversion performance is expected to be robust in both speech quality and speaker identity preservation.

Cross-attention [57] is a widely used mechanism in natural language processing and deep learning. It allows a network to use queries derived from one sequence to attend to all positions of another sequence and to weigh the importance of each position differently. This is achieved by computing attention scores between each query and all keys, and then using the normalized scores to form a weighted sum of the corresponding values. Cross-attention has been used in a variety of tasks, including machine translation, image captioning, and question answering. Its ability to model complex relationships between two sequences makes it a key component in many state-of-the-art models.

In this work, we introduce a novel PMN module to address the issue of speaker variability in XVC, as illustrated in Figure 1b. The network seeks to represent a fine-grained time-varying timbral representation dedicated to XVC tasks in a content-aware manner. This network enables the model to learn the optimal alignment between the input and reference speech, leading to improved speaker similarity in the generated speech. We incorporate a content-aware cross-attention module into the XVC network to achieve this. The cross-attention module takes the hidden representation of the source HuBERT tokens $H_S$ as queries, the hidden representation of the reference HuBERT tokens $H_R$ as keys, and the frame-level speaker embedding $S_L$ extracted from the reference utterances as values. The cross-attention mechanism allows the model to learn the alignment between the source and reference speech and to extract the fine-grained speaker embedding that is applied to the utterance-level speaker embedding $S_G$. The PMN is designed to learn the phonetic features of the source and reference speech and to identify the similar pronunciation units between the two languages. By using the frame-level speaker embedding extracted from the reference speech, PMN can capture the subtle variations in speaker characteristics that are critical for speaker similarity in XVC.
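The query, key, and value roles described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single attention head without learned projections, the function name `pronunciation_matching` is hypothetical, and adding the attended embedding to $S_G$ is only an assumption about how the fine-grained embedding is "applied to" the global one.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pronunciation_matching(h_s, h_r, s_l, s_g):
    """Single-head scaled dot-product cross-attention (illustrative only).

    h_s: source hidden states, T_s x d (queries)
    h_r: reference hidden states, T_r x d (keys)
    s_l: local speaker embeddings, T_r x d_spk (values)
    s_g: global speaker embedding, d_spk
    Returns one fine-grained speaker embedding per source frame (T_s x d_spk).
    """
    d = len(h_s[0])
    output = []
    for query in h_s:
        # Similarity between this source frame and every reference frame.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in h_r]
        weights = softmax(scores)
        # Weighted sum of the local speaker embeddings of the reference.
        attended = [sum(w * value[j] for w, value in zip(weights, s_l))
                    for j in range(len(s_g))]
        # Assumption: combine with the global embedding by addition.
        output.append([a + g for a, g in zip(attended, s_g)])
    return output


if __name__ == "__main__":
    h_s = [[10.0, 0.0], [0.0, 10.0]]          # two source frames
    h_r = [[1.0, 0.0], [0.0, 1.0]]            # two reference frames
    s_l = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # local embeddings of the reference
    s_g = [0.5, 0.5, 0.5]                     # global embedding
    for frame in pronunciation_matching(h_s, h_r, s_l, s_g):
        print([round(x, 3) for x in frame])
```

Each source frame receives its own speaker embedding, dominated by the reference frames whose content representation is most similar to it, which is the content-aware behavior the PMN is meant to provide.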

III-E Multi-Reference Encoding

In many real-world scenarios, obtaining a reference speech that contains similar pronunciation to the source input is not trivial. Even if a reference speech is available, it may not be enough to capture all the acoustic variations that exist in the source speech. Moreover, it is easy to obtain several sentences or utterances spoken by the target speaker as reference speech during inference. To address these challenges, we introduce a multi-reference encoding technique that can effectively leverage multiple references to improve cross-lingual voice conversion performance.

Multi-reference encoding is a technique used in natural language processing and information retrieval to improve the accuracy of language models by incorporating multiple reference sources [58]. Traditionally, language models use a single reference document or text corpus to learn patterns and generate predictions. However, this approach can be limited as it may not capture the full diversity and complexity of the language. Multi-reference encoding aims to overcome this limitation by encoding multiple reference sources simultaneously, allowing the model to learn from a wider range of examples and improve its ability to generalize to new inputs. This technique has been applied to a range of speech processing areas, including style transfer [59, 60], singing voice synthesis [61], and speech recognition [62], and has led to significant improvements in performance. Multi-reference encoding is a promising area of research in natural language processing and is expected to continue to play an important role in advancing the field.

In this study, we tackle the challenge of employing a solitary reference speech in XVC by proposing a multi-reference scheme for network training. Rather than relying on a lone reference speech, we leverage multiple reference speech samples to enhance the content information and account for nuanced pronunciation variations of the target speaker. Our approach enables the XVC network to optimize the alignment between languages and elevate the overall quality of the converted speech.
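One simple way to expose several references to a single attention module is to concatenate them along the time axis, so that each source frame can attend to frames from any reference. The sketch below illustrates that idea only; the text does not specify how the references are merged, so the concatenation and the helper name `concat_references` are assumptions.

```python
def concat_references(ref_hidden, ref_local):
    """Concatenate per-reference sequences along the time axis.

    ref_hidden: list of N sequences of reference hidden states (keys)
    ref_local:  list of N sequences of local speaker embeddings (values)
    Returns (keys, values) with one entry per reference frame.
    """
    if len(ref_hidden) != len(ref_local):
        raise ValueError("need one local-embedding sequence per reference")
    keys, values = [], []
    for h_r, s_l in zip(ref_hidden, ref_local):
        if len(h_r) != len(s_l):
            raise ValueError("keys and values must be frame-aligned")
        keys.extend(h_r)
        values.extend(s_l)
    return keys, values


if __name__ == "__main__":
    # Three references with 2, 3, and 1 frames (toy dimensions).
    hidden = [[[0.1, 0.2]] * 2, [[0.3, 0.4]] * 3, [[0.5, 0.6]] * 1]
    local = [[[1.0]] * 2, [[2.0]] * 3, [[3.0]] * 1]
    keys, values = concat_references(hidden, local)
    print(len(keys), len(values))
```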

To ensure consistent speaker identity across the entire reference speech, we use the timbre encoder to extract the global speaker embedding $S_G$ for each reference utterance. We introduce a speaker similarity loss that sums the cosine embedding loss for any two $S_G$:

$$L_{ss} = \sum_{i,j \in N} \mathrm{cel}(S_{G_i}, S_{G_j}) \tag{1}$$

where $N$ is the number of references and $\mathrm{cel}(\cdot)$ represents the cosine embedding loss. This loss encourages the embeddings for the same speaker to be close together in the embedding space, making the speaker identity more consistent and improving the overall quality of the converted speech. We chose to use three utterances of the reference speech during training, setting $N=3$ in our work. This is due to the GPU memory limitation, as using a larger $N$ would require more memory to store and process the additional reference utterances. By setting $N=3$, we balance the trade-off between using enough reference information to improve the conversion quality and keeping the memory consumption within feasible limits. Furthermore, using more than 3 reference utterances may not lead to further improvement in conversion quality, as the benefit of additional reference information could saturate after a certain point.
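As a rough illustration of Eq. (1), the sketch below sums a cosine embedding loss over all pairs of global speaker embeddings. Two assumptions are made here: every pair is treated as a positive pair, so the loss reduces to one minus the cosine similarity, and each unordered pair is counted once. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two vectors, guarded against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def speaker_similarity_loss(global_embeddings):
    """Sum of (1 - cosine similarity) over all unordered pairs of S_G."""
    return sum(1.0 - cosine_similarity(a, b)
               for a, b in combinations(global_embeddings, 2))


if __name__ == "__main__":
    # N = 3 reference utterances from the same speaker (toy 2-D embeddings).
    s_g = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]]
    print(round(speaker_similarity_loss(s_g), 4))
```

The loss is zero when all global embeddings point in the same direction and grows as they diverge, which matches the stated goal of keeping the speaker identity consistent across references.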

During training, we also investigated the efficiency of our approach when multiple references are used without involving the source input. When we use references during training, we want to provide the model with additional information about the target speaker’s voice without biasing it toward the source speaker’s voice. If we were to include the source input as one of the reference utterances, the model might simply learn to copy the source speaker’s voice rather than learn to generate a new voice that is similar to the target speaker’s voice. By using multiple reference utterances that do not include the source input, we encourage the model to learn a more general mapping from the input speech to the target speaker’s voice. This can help to reduce the risk of overfitting to a particular reference utterance and can also help to capture more variation in the target speaker’s voice.

III-F Pitch Normalization

Figure 2: Illustration of the impact of normalized F0 on the system. (a) F0 of the source speech; (b) F0 of the converted speech with normalized F0; (c) F0 of the converted speech without normalized F0. The vertical axis shows F0 in Hz, and the horizontal axis shows time in seconds.

In XVC, one of the main challenges is to ensure that the converted speech retains the prosodic characteristics of the source language. Prosody, which includes pitch, rhythm, and intonation, plays an important role in conveying emotions and meaning in speech. However, the pitch distribution of speakers varies widely depending on factors such as age, gender, and speaking style, and this can cause the converted speech to sound foreign or unnatural.

To address this issue, we propose to introduce normalized pitch as an additional input to the XVC system [63, 64]. This enables explicit control of prosody in the converted speech, making it more similar to the input and reducing foreign accents. By normalizing the pitch on a per-sentence basis, we can ensure that the output speech has a similar prosody to the input, which is especially important in XVC tasks where a foreign accent can be a major issue. Overall, this approach allows for greater flexibility and control in XVC, ensuring that the output speech is not only recognizable but also natural-sounding and fluent.

We then incorporate the normalized pitch values as an additional input to the decoder module of the XVC network. The decoder takes the content information and fine-grained speaker embedding extracted from the timbre encoder, as well as the normalized pitch values as input, to generate the converted speech.
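The text does not give the exact normalization formula, so the following is only a sketch under stated assumptions: F0 is converted to the log domain and standardized to zero mean and unit variance over the voiced frames of each sentence, and unvoiced frames (F0 of 0) are left at 0. The function name `normalize_pitch` is hypothetical.

```python
import math


def normalize_pitch(f0_hz):
    """Per-sentence z-score normalization of log-F0 over voiced frames.

    f0_hz: list of F0 values in Hz, with 0 marking unvoiced frames.
    Returns a list of the same length; unvoiced frames stay at 0.0.
    """
    voiced = [math.log(f) for f in f0_hz if f > 0]
    if not voiced:
        return [0.0] * len(f0_hz)
    mean = sum(voiced) / len(voiced)
    variance = sum((v - mean) ** 2 for v in voiced) / len(voiced)
    std = math.sqrt(variance) or 1.0  # avoid division by zero for flat pitch
    return [(math.log(f) - mean) / std if f > 0 else 0.0 for f in f0_hz]


if __name__ == "__main__":
    low_voice = [100.0, 0.0, 120.0, 140.0]
    high_voice = [200.0, 0.0, 240.0, 280.0]  # same contour, one octave higher
    print([round(x, 3) for x in normalize_pitch(low_voice)])
    print([round(x, 3) for x in normalize_pitch(high_voice)])
```

Under this normalization, two speakers producing the same contour in different registers yield the same input, so the decoder receives the shape of the prosody rather than the speaker's absolute pitch range.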

The F0 contour of the source speech and converted speech is compared in Figure 2, with and without the use of normalized pitch as an additional input. It can be observed that utilizing the normalized pitch as an additional input leads to the converted speech better following the prosody of the source speech, resulting in improved native sound and reduced foreign accent.

IV Experiment

IV-A Database

Our proposed system was evaluated on Spanish-English XVC through a series of experiments. For English, we used the train-clean-360 subset of the LibriTTS corpus, consisting of 191.29 hours of speech from 904 speakers (430 female and 474 male). For Spanish, we used the Spanish subset of the Multilingual LibriSpeech (MLS) dataset, which contains 917.68 hours of speech from 86 speakers (50 female and 36 male). The data were split 90% for training and 10% for validation. For evaluation, we employed the VCTK dataset [65] for English and the M-AILABS Speech Dataset [66] for Spanish, allowing us to assess the performance of our method on unseen speakers and verify its generalization capability. All speech data were downsampled to 16 kHz, and we extracted 80-dimensional Mel-spectrogram features and HuBERT tokens with a 20 ms frame shift and a 64 ms frame length.
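As a quick check of the feature configuration, the frame shift and frame length can be converted to samples at 16 kHz. The sketch below is only arithmetic; the constant names and the frame-count convention (centered, padded framing) are assumptions for illustration and are not taken from the paper.

```python
SAMPLE_RATE = 16_000      # Hz, after downsampling
FRAME_SHIFT_MS = 20       # frame shift from the text
FRAME_LENGTH_MS = 64      # frame length from the text

# Convert milliseconds to samples.
hop_length = SAMPLE_RATE * FRAME_SHIFT_MS // 1000    # 320 samples
win_length = SAMPLE_RATE * FRAME_LENGTH_MS // 1000   # 1024 samples
frames_per_second = 1000 // FRAME_SHIFT_MS           # 50 frames per second

def num_frames(num_samples, hop=hop_length):
    """Frame count under centered (padded) framing: 1 + floor(n / hop).

    This convention is an assumption; other toolkits may differ by one frame.
    """
    return 1 + num_samples // hop
```

With these values, one second of audio corresponds to roughly 50 feature frames, each covering 1024 samples.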

IV-B Model Architecture

The content encoder and decoder architectures in this work are based on the VC system presented in [37]. The content encoder comprises a feed-forward Transformer [57] with relative position encoding and a hidden size of 192. The decoder is composed of a posterior encoder, a speech decoder, and a multi-length discriminator. The posterior encoder consists of a 1D-convolution layer with stride 4, followed by ReLU activation and layer normalization, and a non-causal WaveNet layer. The number of encoder layers, WaveNet channel size, and kernel size are 8, 192, and 5, respectively. The speech decoder consists of a non-causal WaveNet layer and a 1D transposed convolution layer with stride 4, also followed by ReLU and layer normalization. The number of speech decoder layers, WaveNet channel size, and kernel size are set to 4, 192, and 5, respectively. The multi-length discriminator is an ensemble of three CNN-based discriminators that evaluate the Mel-spectrogram based on random windows with lengths of 32, 64, and 128 frames. Each CNN-based discriminator consists of $N+1$ layers of 2D convolutions, each followed by a Leaky ReLU activation and a dropout layer. The latter $N$ convolutional layers are additionally followed by an instance normalization layer. After the convolutional layers, a linear layer projects the hidden states of the Mel-spectrogram slice to a scalar that represents the prediction of whether the input Mel-spectrogram is real or fake. In our experiments, we set $N=2$ and the channel size of these discriminators to 32.

The timbre encoder is a 3-layer bidirectional LSTM network that takes the Mel-spectrogram as input and generates 256-dimensional utterance-level and frame-level speaker embeddings. The pronunciation matching network is a cross-attention module composed of feed-forward Transformer layers. We also utilize a neural vocoder, HiFi-GAN, to convert the Mel-spectrogram to the waveform [67].

We trained our model on a single NVIDIA GeForce RTX 3090 GPU with a batch size of 16, using the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.98$. The initial learning rate was 0.002 with the Noam decay scheme.
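For readers unfamiliar with the Noam scheme, the sketch below shows one common rescaled form: linear warm-up followed by inverse-square-root decay, scaled so that the peak equals a base learning rate. The text does not state the number of warm-up steps or how the value 0.002 enters the schedule, so `warmup_steps=4000` and the peak-scaling convention are assumptions for illustration only.

```python
def noam_lr(step, base_lr=0.002, warmup_steps=4000):
    """Noam-style schedule, rescaled so the peak learning rate is base_lr.

    lr = base_lr * sqrt(warmup) * min(step^-0.5, step * warmup^-1.5)
    warmup_steps is a hypothetical value; the paper does not report it.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    scale = warmup_steps ** 0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return base_lr * scale

# Learning rate during warm-up, at the peak, and during decay.
lr_warmup = noam_lr(1000)
lr_peak = noam_lr(4000)
lr_decay = noam_lr(16000)
```

In this form the rate rises linearly until `warmup_steps` and then decays in proportion to the inverse square root of the step count.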

Figure 3: Speaker embedding visualization using t-SNE. (a) speaker embedding obtained from Resemblyzer; (b) speaker embedding from the timbre encoder. Colors and shapes represent the speaker and language, respectively. The numbers are the identification codes of speakers in the database.

IV-C Evaluations

IV-C1 Evaluation on Speaker Embedding

In this section, we aim to evaluate the performance of the speaker embedding generated by the timbre encoder of our proposed system. We use t-SNE [68], a dimensionality reduction technique, to visualize the speaker embedding spaces, providing a qualitative and intuitive understanding of how our system operates.

To compare our speaker embedding with the state-of-the-art deep learning-based voice encoder, we used Resemblyzer [69], which has been widely adopted in voice cloning projects. As shown in Figure 3, the speaker embedding space generated by Resemblyzer has a clear boundary for different languages. This means that speakers of different languages are represented differently in the speaker embedding space, which can make it challenging for XVC systems to preserve the speaker identity when converting between languages. In contrast, the speaker embeddings generated by our proposed system are language-agnostic, meaning that speakers from different languages are represented more similarly in the speaker embedding space. This makes our system more suitable for XVC across different languages, which is an important consideration for real-world applications.

Besides its language-agnostic property, we also analyzed the speaker embedding space produced by our proposed system and Resemblyzer in terms of intra-speaker and inter-speaker distances. Intra-speaker distance refers to the distance between embeddings of different utterances from the same speaker, whereas inter-speaker distance measures the distance between embeddings of different speakers. The analysis revealed that the speaker embedding space generated by our system has a smaller intra-speaker distance and a larger inter-speaker distance than Resemblyzer. This indicates that our system generates more compact speaker embeddings for the same speaker, enabling better differentiation between different speakers. In contrast, Resemblyzer tends to have larger intra-speaker distances, which may lead to less consistency in representing the same speaker.
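The intra-speaker and inter-speaker distances defined above can be computed as in the following sketch. It assumes cosine distance between utterance-level embeddings and averages over all utterance pairs; the distance metric, the function names, and the toy vectors are assumptions for illustration, since the analysis does not specify them.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def intra_inter_distance(embeddings):
    """embeddings: dict mapping speaker id -> list of utterance embeddings.

    Returns (mean intra-speaker distance, mean inter-speaker distance).
    Requires at least one same-speaker pair and one cross-speaker pair.
    """
    items = [(spk, e) for spk, utts in embeddings.items() for e in utts]
    intra, inter = [], []
    for (spk1, e1), (spk2, e2) in combinations(items, 2):
        (intra if spk1 == spk2 else inter).append(cosine_distance(e1, e2))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical 2-D embeddings for two speakers, two utterances each.
toy = {
    "spk_a": [[1.0, 0.0], [0.9, 0.1]],
    "spk_b": [[0.0, 1.0], [0.1, 0.9]],
}
intra_mean, inter_mean = intra_inter_distance(toy)
```

A well-separated embedding space gives a small first value and a large second value, which is the pattern reported for the timbre encoder.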

The results of our evaluation demonstrate that our proposed system’s timbre encoder generates a more suitable and effective speaker embedding space for XVC than Resemblyzer. The language-agnostic property of our speaker embedding space and its ability to capture the unique speaker characteristics make it more appropriate for XVC tasks, where speaker similarity across languages is crucial.

IV-C2 Algorithm Comparison

We conducted experiments to evaluate the performance of our proposed system in two source-to-target speaker conversion settings: English-to-Spanish and Spanish-to-English. For each setting, we considered four gender-to-gender combinations: male-to-male (m2m), male-to-female (m2f), female-to-male (f2m), and female-to-female (f2f). We selected two female and two male speakers from each language, resulting in 16 ($= 2 \times 2 \times 4$) conversion pairs for each setting.
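The pair count can be verified by enumerating the combinations. The speaker identifiers below are placeholders invented for this example, not the speakers actually used in the evaluation.

```python
from itertools import product

# Hypothetical speaker ids: two female and two male speakers per language.
speakers = {
    "en": {"f": ["en_f1", "en_f2"], "m": ["en_m1", "en_m2"]},
    "es": {"f": ["es_f1", "es_f2"], "m": ["es_m1", "es_m2"]},
}

def conversion_pairs(src_lang, tgt_lang):
    """List (gender combination, source speaker, target speaker) triples."""
    pairs = []
    for src_gender, tgt_gender in product("mf", repeat=2):   # m2m, m2f, f2m, f2f
        for src, tgt in product(speakers[src_lang][src_gender],
                                speakers[tgt_lang][tgt_gender]):  # 2 x 2 pairs
            pairs.append((f"{src_gender}2{tgt_gender}", src, tgt))
    return pairs

en_to_es = conversion_pairs("en", "es")
es_to_en = conversion_pairs("es", "en")
```

Each of the four gender combinations contributes 2 x 2 speaker pairs, giving 16 pairs per setting.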

We evaluate the conversion results of the following systems:

  • Baseline: XVC network with only the content encoder, timbre encoder, and decoder. Only the global speaker embedding is used, and the PMN is excluded.

  • Single-RefXVC: Our proposed RefXVC system with a single reference.

  • RefXVC (source-included): Our proposed RefXVC system with the multi-reference technique, where the source is included in the references during training.

  • RefXVC (source-excluded): Our proposed RefXVC system with the multi-reference technique, where the source is excluded from the references during training.

  • NANSY [51]: A recent work similar to ours. It uses a content-dependent, time-varying speaker embedding to improve the speaker identity. We trained the model with the official implementation using the same dataset and Mel-spectrogram configuration.

  • Diff-HierVC [70]: A recent work that employs SSL features to represent content information and utilizes the style encoder from [71] to extract global speaker embeddings, with a diffusion model used as the generator.

TABLE I: Results of MOS test on speech quality, CMOS test on speaker similarity and WER. English-to-Spanish denotes that the English source speech is converted into the voice of a Spanish speaker and vice-versa for Spanish-to-English. P-values are calculated between different systems and the baseline system.
| System | En→Es quality | p-value | En→Es similarity | p-value | En→Es WER | Es→En quality | p-value | Es→En similarity | p-value | Es→En WER |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 4.28±0.07 | - | 3.86±0.05 | - | 3.76 | 4.30±0.07 | - | 3.93±0.05 | - | 4.37 |
| Diff-HierVC | 4.21±0.03 | 0.007 | 3.91±0.02 | <0.001 | 3.89 | 4.23±0.04 | <0.001 | 3.92±0.03 | 0.23 | 4.52 |
| NANSY | 4.23±0.06 | 0.002 | 3.98±0.06 | <0.001 | 3.52 | 4.33±0.05 | 0.015 | 4.11±0.07 | <0.001 | 4.34 |
| Single-RefXVC | 4.34±0.08 | <0.001 | 4.01±0.05 | <0.001 | 3.18 | 4.31±0.07 | 0.14 | 4.24±0.07 | <0.001 | 4.21 |
| RefXVC (source-included) | 4.32±0.06 | <0.001 | 4.24±0.07 | <0.001 | 3.23 | 4.35±0.06 | <0.001 | 4.34±0.05 | <0.001 | 4.17 |
| RefXVC (source-excluded) | 4.35±0.08 | <0.001 | 4.39±0.07 | <0.001 | 3.15 | 4.36±0.07 | <0.001 | 4.35±0.07 | <0.001 | 4.15 |
| Ground-truth | 4.85±0.05 | <0.001 | 4.94±0.07 | <0.001 | 2.23 | 4.91±0.05 | <0.001 | 4.95±0.06 | <0.001 | 3.58 |

We measured naturalness with a 5-point mean opinion score (MOS [1-5]). Speaker similarity was also measured on a 5-point scale with a comparison mean opinion score (CMOS [1-5]). We invited 20 native English speakers to conduct the listening experiments. We also calculated the Word Error Rate (WER) using Whisper-large [20] for both conversion directions as an objective metric of intelligibility. In all experiments, the source and target speakers were both unseen during training, which means we performed zero-shot any-to-any XVC. The results are presented in Table I.
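For reference, WER is the word-level edit distance between the reference transcript and the recognized transcript, divided by the number of reference words. The sketch below implements this standard definition on whitespace-tokenized strings; it does not include the ASR step or any text normalization, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Both arguments are plain strings split on whitespace.
    The reference must contain at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

# One deleted word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

In practice the hypothesis would be the ASR transcript of the converted speech and the reference would be the transcript of the source utterance.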

As shown in Table I, our proposed multi-reference RefXVC systems outperformed the baseline, NANSY, and Diff-HierVC in terms of both speech quality and speaker similarity in both conversion directions. Single-RefXVC, which adds the PMN with a single reference, improved the speaker similarity score over the baseline, and the multi-reference systems further improved the scores by incorporating multiple references with or without the source speech. In addition, we compared using a single speaker reference against averaging multiple speaker references with the baseline method. The performance differences are minimal, which is consistent with the visualized speaker embeddings in Figure 3(b), where the variance within the same speaker is small. The results demonstrate the effectiveness of our proposed system in XVC tasks.

We observed that excluding the source from the references yields better results than including it. Figure 4 illustrates an example of the alignment between $H_S$ and $H_R$ during training under the two settings. In this example, we used three utterances as reference speech, with the first one being identical to the source in (a). The figure shows that when the source is included during training, the pronunciation matching network tends to focus solely on the source utterance and disregards content information from the other references, which defeats the purpose of the mechanism. When the source is excluded from the references, the pronunciation matching network attends to the entire reference and utilizes the rich content information that the references provide. This supports our conclusion that excluding the source is more effective in leveraging information from multiple references.
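The effect described above can be reproduced on toy data with plain scaled dot-product cross-attention. The sketch assumes that queries come from the source content ($H_S$), keys from the reference content ($H_R$), and values from the reference frame-level timbre features; this role assignment, the helper names, and all vectors are assumptions for illustration, not the actual PMN implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[c] for wi, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

def build_references(utterances, source_id, include_source):
    """Concatenate (content, timbre) frames of all reference utterances."""
    frames = []
    for utt_id, utt_frames in utterances.items():
        if utt_id == source_id and not include_source:
            continue  # source-excluded setting
        frames.extend(utt_frames)
    return frames

# Hypothetical utterances: each frame is (content vector, timbre vector).
utterances = {
    "src": [([4.0, 0.0], [1.0]), ([0.0, 4.0], [2.0])],
    "ref1": [([3.0, 1.0], [3.0]), ([1.0, 3.0], [4.0])],
}
h_s = [content for content, _ in utterances["src"]]

refs_incl = build_references(utterances, "src", include_source=True)
_, w_incl = cross_attention(h_s, [c for c, _ in refs_incl], [t for _, t in refs_incl])

refs_excl = build_references(utterances, "src", include_source=False)
out_excl, w_excl = cross_attention(h_s, [c for c, _ in refs_excl], [t for _, t in refs_excl])
```

In the source-included case each query puts most of its weight on its own identical frame, so the other reference contributes little; in the source-excluded case the weights are spread over the remaining reference frames.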

From the results presented in Table I, there appear to be some differences in performance between converting Spanish to English and English to Spanish audio. While the differences are relatively small, they do indicate that there might be subtle challenges unique to each conversion direction. One potential difficulty in converting Spanish to English could stem from the distinct phonetic and prosodic characteristics of the two languages. Spanish, for example, has a more consistent syllable-timed rhythm and a relatively simpler vowel system compared to the more stress-timed rhythm and complex vowel system of English [72]. These differences can pose challenges for voice conversion models, which need to accurately capture and reproduce the nuanced phonetic and prosodic features of the target language.

Figure 4: An illustration of the multi-reference alignment between $H_S$ and $H_R$ as computed by the PMN. (a) Source-included reference: The PMN focuses predominantly on the source utterance itself, as indicated by the concentration of attention weights along the diagonal. (b) Source-excluded reference: The PMN attends to the entire reference sentence, as evidenced by the more distributed pattern of attention weights across the heatmap.

Figure 5: The prosody similarity test results between converted speech with and without using pitch normalization information. A higher percentage of ‘Same (not sure)’ and ‘Same (sure)’ together suggests a higher similarity to the natural source speech, which is more preferred.

IV-C3 Evaluation on prosody

In addition to the evaluation of speech quality and speaker similarity, we also conducted a subjective evaluation to assess the effectiveness of incorporating normalized pitch as an additional input to the RefXVC system. This evaluation aimed to investigate whether the inclusion of normalized pitch can improve the prosody of the converted speech, making it more similar to the source native speech.

We generated converted speech using both RefXVC systems, with and without the normalized pitch input, and compared the prosody between the converted speech and the source ground-truth speech. To assess the prosody similarity, we recruited 20 native English speakers to participate in the subjective listening tests. The participants were instructed to focus only on the prosody similarity, irrespective of speaker identity, and were provided with four response options: “Different, Sure”, “Different, Not Sure”, “Same, Not Sure”, and “Same, Sure”.

The results of the evaluation are presented in Figure 5. As shown in the figure, the RefXVC system with the normalized pitch input achieved a higher percentage of “Same, Sure” responses compared to the system without the normalized pitch input. This indicates that the normalized pitch input can effectively improve the prosody of the converted speech.
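The percentages in such a test can be tallied as in the following sketch, which counts the four response options and sums the two "Same" categories. The response list is made-up toy data, not the listening test results, and the function names are chosen for this example.

```python
from collections import Counter

OPTIONS = ["Different, Sure", "Different, Not Sure", "Same, Not Sure", "Same, Sure"]

def response_percentages(responses):
    """Percentage of responses per option (options with no votes get 0.0)."""
    counts = Counter(responses)
    total = len(responses)
    return {option: 100.0 * counts[option] / total for option in OPTIONS}

def same_share(responses):
    """Combined percentage of 'Same, Not Sure' and 'Same, Sure' responses."""
    pct = response_percentages(responses)
    return pct["Same, Not Sure"] + pct["Same, Sure"]

# Hypothetical responses for one system (toy data only).
toy_responses = (["Same, Sure"] * 5
                 + ["Same, Not Sure"] * 3
                 + ["Different, Not Sure"] * 2)
toy_percentages = response_percentages(toy_responses)
toy_same = same_share(toy_responses)
```

Comparing the combined "Same" share between the two systems gives a single number per system that matches the reading suggested for Figure 5.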

These results further demonstrate the importance of considering prosody in XVC and the effectiveness of leveraging reference information from the source to improve prosody. By incorporating normalized pitch as an additional input, RefXVC can improve the prosody of the converted speech, reducing foreign accent in XVC and making the speech more similar to the source native speech.

V Conclusion

In this paper, we introduced a novel approach to cross-lingual voice conversion that maximizes reference leveraging in multiple ways. We proposed a timbre encoder and a pronunciation matching network to exploit the relationship between timbre and pronunciation in different languages and employed multiple reference sources to capture the tonal variations in a speaker’s speech more accurately. Furthermore, we introduced the use of normalized pitch as an additional input to enhance the prosody of the converted speech and prevent foreign accents. Our experimental results demonstrate that our proposed approach outperformed state-of-the-art methods in terms of objective evaluation metrics, and subjective evaluation results confirmed the effectiveness of our approach in improving the naturalness and similarity of the converted speech. However, we acknowledge certain limitations in our work. The generalization of our method to unseen languages remains an area for further investigation. In future work, we plan to explore the generalization capabilities of RefXVC to a wider array of languages.

References

  • [1] M. Abe, K. Shikano, and H. Kuwabara, “Cross-language voice conversion,” in International Conference on Acoustics, Speech, and Signal Processing, 1990, pp. 345–348 vol.1.
  • [2] M. Charlier, Y. Ohtani, T. Toda, A. Moinet, and T. Dutoit, “Cross-language voice conversion based on eigenvoices,” in INTERSPEECH, 2009, pp. 1635–1638.
  • [3] D. Sündermann, H. Höge, A. Bonafonte, H. Ney, and J. Hirschberg, “Text-independent cross-language voice conversion,” in INTERSPEECH, 2009.
  • [4] Y. Zhou, X. Tian, H. Xu, R. K. Das, and H. Li, “Cross-lingual voice conversion with bilingual phonetic posteriorgram and average modeling,” in IEEE ICASSP, 2019, pp. 6790–6794.
  • [5] D. Erro, A. Moreno, and A. Bonafonte, “Voice conversion based on weighted frequency warping,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 922–931, 2010.
  • [6] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in IEEE ICASSP, 2014, pp. 4052–4056.
  • [7] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.
  • [8] Y. Zhou, Z. Wu, X. Tian, and H. Li, “Optimization of cross-lingual voice conversion with linguistics losses to reduce foreign accents,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1916–1926, 2023.
  • [9] Y. Zhou, X. Tian, and H. Li, “Language agnostic speaker embedding for cross-lingual personalized speech generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3427–3439, 2021.
  • [10] J. Anderson-Hsieh, R. Johnson, and K. Koehler, “The relationship between native speaker judgments of nonnative pronunciation and deviance in segmentals, prosody, and syllable structure,” Language Learning, vol. 42, no. 4, pp. 529–555, 1992. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-1770.1992.tb01043.x
  • [11] R. Duan, T. Kawahara, M. Dantsuji, and H. Nanjo, “Cross-lingual transfer learning of non-native acoustic modeling for pronunciation error detection and diagnosis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 391–401, 2020.
  • [12] D. P. Kingma, M. Welling et al., “An introduction to variational autoencoders,” Foundations and Trends® in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019.
  • [13] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks,” in INTERSPEECH, 2017, pp. 3364–3368.
  • [14] A. Tjandra, B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, “Vqvae unsupervised unit discovery and multi-scale code2spec inverter for zerospeech challenge 2019,” arXiv preprint arXiv:1905.11449, 2019.
  • [15] K. Ezzine, J. Di Martino, and M. Frikha, “Any-to-one non-parallel voice conversion system using an autoregressive conversion model and lpcnet vocoder,” Applied Sciences, vol. 13, no. 21, 2023. [Online]. Available: https://www.mdpi.com/2076-3417/13/21/11988
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
  • [17] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “Cyclegan-vc2: Improved cyclegan-based non-parallel voice conversion,” in IEEE ICASSP, 2019, pp. 6820–6824.
  • [18] B. Sisman, M. Zhang, M. Dong, and H. Li, “On the study of generative adversarial networks for cross-lingual voice conversion,” in ASRU, 2019.
  • [19] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, “Autovc: Zero-shot voice style transfer with only autoencoder loss,” in International Conference on Machine Learning.   PMLR, 2019, pp. 5210–5219.
  • [20] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202.   PMLR, 23–29 Jul 2023, pp. 28 492–28 518. [Online]. Available: https://proceedings.mlr.press/v202/radford23a.html
  • [21] H. Guo, C. Liu, C. T. Ishi, and H. Ishiguro, “Using joint training speaker encoder with consistency loss to achieve cross-lingual voice conversion and expressive voice conversion,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1–8.
  • [22] T. J. Hazen, W. Shen, and C. White, “Query-by-example spoken term detection using phonetic posteriorgram templates,” in IEEE ASRU, 2009, pp. 421–426.
  • [23] L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, “Phonetic posteriorgrams for many-to-one voice conversion without parallel data training,” in IEEE ICME, 2016, pp. 1–6.
  • [24] I. Misra and L. v. d. Maaten, “Self-supervised learning of pretext-invariant representations,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6707–6717.
  • [25] H.-Y. Tung, H.-W. Tung, E. Yumer, and K. Fragkiadaki, “Self-supervised learning of motion capture,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [26] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.
  • [27] M. Ravanelli, J. Zhong, S. Pascual, P. Swietojanski, J. Monteiro, J. Trmal, and Y. Bengio, “Multi-task self-supervised learning for robust speech recognition,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 6989–6993.
  • [28] Y.-A. Chung and J. Glass, “Generative pre-training for speech with autoregressive predictive coding,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 3497–3501.
  • [29] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • [30] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020.
  • [31] W.-C. Huang, S.-W. Yang, T. Hayashi, H.-Y. Lee, S. Watanabe, and T. Toda, “S3prl-vc: Open-source voice conversion framework with self-supervised speech representations,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 6552–6556.
  • [32] M. Riviere, A. Joulin, P.-E. Mazaré, and E. Dupoux, “Unsupervised pretraining transfers well across languages,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 7414–7418.
  • [33] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
  • [34] W.-N. Hsu, Y.-H. H. Tsai, B. Bolte, R. Salakhutdinov, and A. Mohamed, “Hubert: How much can a bad teacher benefit asr pre-training?” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 6533–6537.
  • [35] Y. Wang, A. Boumadane, and A. Heba, “A fine-tuned wav2vec 2.0/hubert benchmark for speech emotion recognition, speaker verification and spoken language understanding,” arXiv preprint arXiv:2111.02735, 2021.
  • [36] A. Lee, H. Gong, P.-A. Duquenne, H. Schwenk, P.-J. Chen, C. Wang, S. Popuri, Y. Adi, J. Pino, J. Gu et al., “Textless speech-to-speech translation on real data,” arXiv preprint arXiv:2112.08352, 2021.
  • [37] Y. Ren, C. Zhang, and S. YAN, “Bag of tricks for unsupervised text-to-speech,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=SbR9mpTuBn
  • [38] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in IEEE ICASSP, 2018, pp. 5329–5333.
  • [39] B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,” arXiv preprint arXiv:2005.07143, 2020.
  • [40] Y.-H. Chen, D.-Y. Wu, T.-H. Wu, and H.-y. Lee, “Again-vc: A one-shot voice conversion using activation guidance and adaptive instance normalization,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 5954–5958.
  • [41] D.-Y. Wu and H.-y. Lee, “One-shot voice conversion by vector quantization,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 7734–7738.
  • [42] J.-c. Chou, C.-c. Yeh, and H.-y. Lee, “One-shot voice conversion by separating speaker and content representations with instance normalization,” arXiv preprint arXiv:1904.05742, 2019.
  • [43] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. Lopez Moreno, Y. Wu et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” Advances in neural information processing systems, vol. 31, 2018.
  • [44] M. Zhang, Y. Zhou, L. Zhao, and H. Li, “Transfer learning from speech synthesis to voice conversion with non-parallel training data,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1290–1302, 2021.
  • [45] Y. Zhou, X. Tian, and H. Li, “Language agnostic speaker embedding for cross-lingual personalized speech generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3427–3439, 2021.
  • [46] S. Ding, G. Zhao, and R. Gutierrez-Osuna, “Improving the speaker identity of non-parallel many-to-many voice conversion with adversarial speaker recognition.” in INTERSPEECH, 2020, pp. 776–780.
  • [47] W.-C. Huang, H. Luo, H.-T. Hwang, C.-C. Lo, Y.-H. Peng, Y. Tsao, and H.-M. Wang, “Unsupervised representation disentanglement using cross domain features and adversarial learning in variational autoencoder based voice conversion,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 4, no. 4, pp. 468–479, 2020.
  • [48] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep voice 3: Scaling text-to-speech with convolutional sequence learning,” arXiv preprint arXiv:1710.07654, 2017.
  • [49] M. Blaauw and J. Bonada, “A neural parametric singing synthesizer,” arXiv preprint arXiv:1704.03809, 2017.
  • [50] Y.-P. Cho, F.-R. Yang, Y.-C. Chang, C.-T. Cheng, X.-H. Wang, and Y.-W. Liu, “A survey on recent deep learning-driven singing voice synthesis systems,” in 2021 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR).   IEEE, 2021, pp. 319–323.
  • [51] H.-S. Choi, J. Yang, J. Lee, and H. Kim, “Nansy++: Unified voice synthesis with neural analysis and synthesis,” arXiv preprint arXiv:2211.09407, 2022.
  • [52] L. Zhang, C. Yu, H. Lu, C. Weng, C. Zhang, Y. Wu, X. Xie, Z. Li, and D. Yu, “Durian-sc: Duration informed attention network based singing voice conversion system,” arXiv preprint arXiv:2008.03009, 2020.
  • [53] J. Lee, H.-S. Choi, J. Koo, and K. Lee, “Disentangling timbre and singing style with multi-singer singing synthesis system,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 7224–7228.
  • [54] T. Nekvinda and O. Dušek, “One model, many languages: Meta-learning for multilingual text-to-speech,” arXiv preprint arXiv:2008.00768, 2020.
  • [55] A. W. Black and K. A. Lenzo, “Multilingual text-to-speech synthesis,” in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3.   IEEE, 2004, pp. iii–761.
  • [56] K. Azizah, M. Adriani, and W. Jatmiko, “Hierarchical transfer learning for multilingual, multi-speaker, and style transfer dnn-based tts on low-resource languages,” IEEE Access, vol. 8, pp. 179 798–179 812, 2020.
  • [57] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30.   Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  • [58] R. Zheng, M. Ma, and L. Huang, “Multi-reference training with pseudo-references for neural translation and text generation,” Proceedings of EMNLP 2018, 2018.
  • [59] Y. Bian, C. Chen, Y. Kang, and Z. Pan, “Multi-reference tacotron by intercross training for style disentangling, transfer and control in speech synthesis,” arXiv preprint arXiv:1904.02373, 2019.
  • [60] M. Whitehill, S. Ma, D. McDuff, and Y. Song, “Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency,” in Proc. Interspeech 2020, 2020, pp. 4442–4446. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2020-2985
  • [61] S. Wang, J. Liu, Y. Ren, Z. Wang, C. Xu, and Z. Zhao, “Mr-svs: Singing voice synthesis with multi-reference encoder,” arXiv preprint arXiv:2201.03864, 2022.
  • [62] A. Ali, W. Magdy, P. Bell, and S. Renals, “Multi-reference wer for evaluating asr for languages with no orthographic rules,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 576–580.
  • [63] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
  • [64] K. Qian, Z. Jin, M. Hasegawa-Johnson, and G. J. Mysore, “F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6284–6288.
  • [65] C. Veaux, J. Yamagishi, K. MacDonald et al., “Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit,” University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2017.
  • [66] I. Solak, “The m-ailabs speech dataset,” Jun 2021. [Online]. Available: https://www.caito.de/2019/01/03/the-m-ailabs-speech-dataset/
  • [67] J. Kong, J. Kim, and J. Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33.   Curran Associates, Inc., 2020, pp. 17 022–17 033. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/c5d736809766d46260d816d8dbc9eb44-Paper.pdf
  • [68] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, 2008.
  • [69] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in IEEE ICASSP, 2018, pp. 4879–4883.
  • [70] H.-Y. Choi, S.-H. Lee, and S.-W. Lee, “Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation,” in Proc. INTERSPEECH 2023, 2023, pp. 2283–2287.
  • [71] D. Min, D. B. Lee, E. Yang, and S. J. Hwang, “Meta-stylespeech : Multi-speaker adaptive text-to-speech generation,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139.   PMLR, 18–24 Jul 2021, pp. 7748–7759. [Online]. Available: https://proceedings.mlr.press/v139/min21b.html
  • [72] P. M. Carter, “Quantifying rhythmic differences between spanish, english, and hispanic english,” in Theoretical and Experimental Approaches to Romance Linguistics.   John Benjamins, 2005, pp. 63–75.