Hyun Myung Kim, Kangwook Jang, Hoirin Kim

One-Class Learning with Adaptive Centroid Shift for Audio Deepfake Detection

Abstract

As speech synthesis systems continue to advance rapidly, robust deepfake detection systems that perform well on unseen synthesis systems have become increasingly important. In this paper, we propose a novel adaptive centroid shift (ACS) method that continually updates the centroid representation as the weighted average of bonafide representations. Our approach uses only bonafide samples to define the centroid, which can yield a centroid specialized for one-class learning. Integrating ACS with one-class learning gathers bonafide representations into a single cluster, forming well-separated embeddings that are robust to unseen spoofing attacks. Our proposed method achieves an equal error rate (EER) of 2.19% on the ASVspoof 2021 deepfake dataset, outperforming all existing systems. Furthermore, the t-SNE visualization illustrates that our method effectively maps the bonafide embeddings into a single cluster and successfully disentangles the bonafide and spoof classes.

keywords:
audio deepfake detection, one-class learning, ASVspoof challenge, anti-spoofing

1 Introduction

Speech synthesis systems such as text-to-speech (TTS) [1] and voice conversion (VC) [2] are evolving rapidly with the development of deep learning. These systems are easily accessible to the public at a low cost and produce sophisticated synthetic speech that is indistinguishable from genuine human speech. Despite their positive aspects, these systems can also be misused, for example by using a deepfake voice for criminal purposes [3]. To address and prevent such issues, research on audio deepfake detection (ADD), which distinguishes between bonafide and fake speech, is essential.

The primary challenge in ADD is to enhance generalization ability, ensuring effective detection of speech from unseen synthesis systems. Some studies [4, 5] have indicated that spoofing detection systems suffer significant performance degradation when facing unseen spoofing attacks. To enhance generalization ability, recent studies have focused on data augmentation techniques such as reverberation [6] and transmission effects [7]. Furthermore, there have been attempts to find a general representation by fine-tuning speech foundation models [8, 9] pre-trained on large-scale speech datasets such as LibriSpeech [10]. Additionally, attention mechanisms have been introduced to learn discriminative features for anti-spoofing by focusing on spoof-related features and suppressing unrelated ones [11, 12].

Meanwhile, the fundamental distinction between fake and bonafide speech lies in their origin. Bonafide speech originates from the vocal cords, while fake speech can be generated by a variety of different speech synthesis systems. In this regard, formulating the ADD task as a binary classification between bonafide and fake speech is impractical because the binary classification method intrinsically assumes that fake speech shares a similar distribution [13]. Since fake speech utterances have different distributions depending on the synthesis system [14], assuming consistent characteristics across all systems is unreasonable. In addition, rapidly evolving speech synthesis systems are making the distribution of fake speech more diverse.

Figure 1: Illustration of the ACS method when the $(i+1)$-th minibatch is the input and each minibatch contains one bonafide sample. All dashes show their previous states, and the optimization movements are represented by arrows. Panels: $i$-th minibatch and $(i+1)$-th minibatch.

Instead of binary classification, recent studies [13, 15] proposed a one-class classification method [16] for ADD. The main idea of the one-class classification method is to learn the distribution of bonafide samples and set an appropriate boundary around them, considering all samples outside the boundary as fake speech. OC-Softmax [13], a representative one-class method for ADD, sets a tight margin for the bonafide samples, creating a compact boundary around the centroid vector. However, the centroid is influenced by both classes during the training process. Considering the important role of the centroid, which represents the bonafide class, its representativeness can be affected by fake samples.

To address this issue, we propose an ACS method, which determines the centroid using only bonafide speech. During training, the centroid vector is updated as a weighted average of bonafide samples. Using the centroid obtained through ACS, we apply one-class learning so that bonafide samples move closer to the updated centroid while fake samples move further away. Our method maps the bonafide samples into a single cluster and forms a well-separated and simplified feature space, demonstrating superior generalization ability. Additionally, we employ a speech foundation model pre-trained on large-scale datasets [17, 18] consisting solely of bonafide speech. This can facilitate the extraction of bonafide speech features that are highly discriminative from fake speech.

In this work, our proposed method demonstrates its superior generalization ability by outperforming the state-of-the-art (SOTA) system on the ASVspoof 2021 deepfake (DF) and 2019 logical access (LA) datasets. Furthermore, the t-SNE [19] visualization illustrates that our method effectively maps the bonafide representations into a single cluster and forms a well-separated and simplified feature space.

2 Method

2.1 XLS-R feature encoder

We employ a speech foundation model, XLS-R [20], as the front-end feature encoder. XLS-R is a model for self-supervised cross-lingual speech representation learning based on wav2vec 2.0 [21]. XLS-R is pre-trained on 128 languages and approximately 436k hours of unlabeled speech data, showing strong performance across various downstream speech tasks such as speech translation and speech recognition [20]. Since XLS-R is pre-trained on large-scale bonafide speech data from various languages, XLS-R can capture the inherent features of bonafide speech. Moreover, XLS-R shows superior anti-spoofing performance compared to other speech foundation models [8]. Therefore, we utilize XLS-R as the most appropriate front-end feature encoder for ADD.

2.2 Attentive statistics pooling

To incorporate both local- and global-level spoofing evidence present in the utterance, we adopt attentive statistics pooling (ASP) [22]. Some studies [23, 24] have observed a significant performance degradation when specific time frames are removed. This implies that evidence of voice spoofing, which helps to distinguish between fake and bonafide utterances, can vary across time frames. On the other hand, the evidence of voice spoofing may exist not only at the local level but also at the global level, such as excessive smoothing [25].

ASP uses an attention mechanism to obtain utterance-level features by assigning weights to each frame. The utterance embedding is obtained by concatenating the weighted mean and standard deviation vectors. The weighted mean vector focuses on important frames relevant to spoofing detection, and the standard deviation vector contains spoofing characteristics in terms of temporal variability over long contexts.

Let $h_t \in \mathbb{R}^{C}$ denote the frame-level feature at time step $t$. The attention module computes scalar score $e_t$ for each frame-level feature

$$e_t = v^{\top} f(W h_t + b) \tag{1}$$

where $v, b \in \mathbb{R}^{C}$ and $W \in \mathbb{R}^{C \times C}$ are learnable parameters and $f(\cdot)$ is a non-linear activation function. The score is normalized across all time frames to obtain $\alpha_t$.

$$\alpha_t = \frac{\exp(e_t)}{\sum_{\tau} \exp(e_{\tau})} \tag{2}$$

The normalized score $\alpha_t$ represents the importance of each frame relevant to spoofing detection and is reflected as the weight of each frame to compute the weighted mean vector $\widetilde{\mu} \in \mathbb{R}^{C}$.

$$\widetilde{\mu} = \sum_{\tau} \alpha_{\tau} h_{\tau} \tag{3}$$

The weighted standard deviation $\widetilde{\sigma} \in \mathbb{R}^{C}$ is defined as follows:

$$\widetilde{\sigma} = \sqrt{\sum_{\tau} \alpha_{\tau} h_{\tau} \odot h_{\tau} - \widetilde{\mu} \odot \widetilde{\mu}} \tag{4}$$

where $\odot$ represents the element-wise product.
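To make Eqs. (1)-(4) concrete, the following is a minimal pure-Python sketch of attentive statistics pooling for a single utterance. It is an illustration under stated assumptions rather than the implementation used in this work: the function name `attentive_stats_pooling` is hypothetical, the activation $f$ is assumed to be tanh (the text only requires a non-linear activation), and the small negative values that rounding can produce under the square root are clamped to zero.

```python
import math
from typing import List, Sequence

Vector = List[float]


def attentive_stats_pooling(
    frames: Sequence[Vector],  # T frame-level features h_t, each of size C
    W: Sequence[Vector],       # C x C weight matrix
    b: Vector,                 # bias of size C
    v: Vector,                 # projection vector of size C
) -> Vector:
    """Return the concatenation [weighted mean ; weighted std] of size 2C."""
    # Eq. (1): scalar score e_t = v^T f(W h_t + b); f = tanh is an assumption.
    scores = []
    for h in frames:
        hidden = [
            math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
            for row, b_i in zip(W, b)
        ]
        scores.append(sum(v_i * z_i for v_i, z_i in zip(v, hidden)))

    # Eq. (2): softmax over time frames (max subtracted for numerical stability).
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    alphas = [x / total for x in exps]

    dim = len(frames[0])
    # Eq. (3): attention-weighted mean.
    mean = [sum(a * h[c] for a, h in zip(alphas, frames)) for c in range(dim)]
    # Eq. (4): attention-weighted standard deviation, computed element-wise.
    second = [sum(a * h[c] * h[c] for a, h in zip(alphas, frames)) for c in range(dim)]
    std = [math.sqrt(max(m2 - m * m, 0.0)) for m2, m in zip(second, mean)]
    return mean + std
```

With an all-zero `W` and `b`, every score is zero and the attention weights become uniform, so for the two frames `[1, 2]` and `[3, 6]` the function returns the plain mean and standard deviation, `[2.0, 4.0, 1.0, 2.0]`.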

Figure 2: The pipeline of our proposed model. The blue boxes indicate the ASP module. The frame-level feature is extracted from the XLS-R feature encoder, and the utterance-level feature is obtained through the ASP module.

2.3 One-class learning with Adaptive Centroid Shift

We use a one-class learning approach rather than binary classification methods. Binary classification methods intrinsically assume that fake speech has a similar distribution [13], but this assumption is not suitable as fake speech exhibits different distributions [14] and continues to evolve. In one-class learning, the fundamental idea is to map embeddings of the target class closer to each other while pushing embeddings of non-target classes further away. OC-Softmax [13] is a representative one-class learning method for anti-spoofing. It uses a trainable centroid vector that is influenced by both bonafide and fake samples. However, the fake samples can affect the representativeness of the centroid characterizing the bonafide class.

In order to obtain the specialized centroid vector representing the bonafide class, we determine the centroid directly with ACS. The ACS method continuously updates the centroid vector by a weighted average of bonafide samples only when the bonafide samples are present within the mini-batch. After applying ACS, we optimize the bonafide samples to be closer to the centroid vector while pushing the fake samples away from the centroid vector. The key point of the ACS method is to define the centroid vector using only bonafide samples.

To elaborate, we initialize the centroid as the first encountered bonafide speech representation. Then, we continuously update the centroid by calculating the weighted average of the bonafide speech representations. The ratio of bonafide to fake samples in a single batch is 1 to 9. Including too many bonafide samples in a batch can alter the centroid, leading to decreased training stability. In Eq. (5), $C_i$ denotes the bonafide centroid vector determined by a total of $n$ bonafide samples up to the $i$-th mini-batch. When there are $s$ bonafide samples in the $(i+1)$-th mini-batch, the $(i+1)$-th centroid vector becomes

$$C_{i+1} = \frac{n C_i + s E_{i+1}}{n + s}, \tag{5}$$

where $E_i \in \mathbb{R}^{D}$ is the average of bonafide embeddings of $i$-th mini-batch.
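The update in Eq. (5) amounts to a running mean over all bonafide embeddings seen so far. The sketch below illustrates this in pure Python. The class name `AdaptiveCentroid` and its attributes are hypothetical, and two details are assumptions: when the first mini-batch contains several bonafide samples, the centroid is initialized to their average, and a mini-batch without bonafide samples leaves the centroid unchanged.

```python
from typing import List, Optional, Sequence

Vector = List[float]


class AdaptiveCentroid:
    """Running centroid computed from bonafide embeddings only (Eq. 5)."""

    def __init__(self) -> None:
        self.centroid: Optional[Vector] = None
        self.count = 0  # n: number of bonafide samples seen so far

    def update(self, bonafide_embeddings: Sequence[Vector]) -> Optional[Vector]:
        s = len(bonafide_embeddings)
        if s == 0:
            # No bonafide sample in this mini-batch: keep the centroid as is.
            return self.centroid
        dim = len(bonafide_embeddings[0])
        # E_{i+1}: average of the bonafide embeddings in the current mini-batch.
        batch_mean = [sum(e[d] for e in bonafide_embeddings) / s for d in range(dim)]
        if self.centroid is None:
            # Initialization from the first bonafide embedding(s) encountered.
            self.centroid = batch_mean
        else:
            n = self.count
            # Eq. (5): C_{i+1} = (n * C_i + s * E_{i+1}) / (n + s)
            self.centroid = [
                (n * c + s * e) / (n + s) for c, e in zip(self.centroid, batch_mean)
            ]
        self.count += s
        return self.centroid
```

Because the weights are the sample counts $n$ and $s$, the centroid after any number of updates equals the plain average of all bonafide embeddings seen so far, so the influence of any single mini-batch shrinks as $n$ grows.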

Once the centroid is defined by ACS for a certain mini-batch, our one-class loss function optimizes bonafide class embeddings to be closer to the centroid and fake class embeddings to be far from the centroid. We design the intuitive and straightforward one-class loss function based on metric learning. The one-class loss function $\mathcal{L}_{\mathcal{OC}}$, integrated with the centroid $C$ obtained by ACS, is designed with the cosine distance metric, given by

$$\mathcal{L}_{\mathcal{OC}} = -\frac{1}{M_b} \sum_{i=1}^{M_b} \frac{r_{b,i}^{\top} C}{\|r_{b,i}\| \cdot \|C\|} + \frac{1}{M_s} \sum_{j=1}^{M_s} \frac{r_{s,j}^{\top} C}{\|r_{s,j}\| \cdot \|C\|} \tag{6}$$

where $r_s, r_b, C \in \mathbb{R}^{D}$ are respectively a vector of spoof, bonafide and centroid and $\|\cdot\|$ denotes the computation of the 2-Norm. $M_b$, $M_s$ are the number of bonafide and fake samples in a mini-batch, respectively.
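As a numerical illustration of Eq. (6), the sketch below evaluates the loss for one mini-batch in pure Python. The function names are hypothetical, all vectors are assumed to have non-zero norm, and skipping the term of a class that is absent from the mini-batch is an assumption, since Eq. (6) does not define that case. In actual training the same expression would be computed with an automatic-differentiation framework so that gradients reach the embeddings.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def one_class_loss(
    bonafide: Sequence[Vector],  # r_b: bonafide embeddings in the mini-batch
    spoof: Sequence[Vector],     # r_s: fake embeddings in the mini-batch
    centroid: Vector,            # C: centroid obtained by ACS
) -> float:
    """Eq. (6): pull bonafide embeddings toward C, push fake ones away."""
    loss = 0.0
    if bonafide:  # assumption: an absent class contributes nothing
        loss -= sum(cosine_similarity(r, centroid) for r in bonafide) / len(bonafide)
    if spoof:
        loss += sum(cosine_similarity(r, centroid) for r in spoof) / len(spoof)
    return loss
```

The loss is bounded between -2 and 2. It reaches -2 when every bonafide embedding points in the direction of the centroid and every fake embedding points in the opposite direction.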

3 Experimental setup

3.1 Datasets and Metrics

In all experiments, we trained and validated our models using the train and development partitions of the ASVspoof 2019 LA dataset [26]. We evaluated our method on three subsets to investigate the generalization ability: ASVspoof 2019 LA (19LA), ASVspoof 2021 LA (21LA), and ASVspoof 2021 DF (21DF) [27]. The 21DF dataset consists of 600k utterances and includes more than 100 different spoofing attack algorithms involving audio coding and compression artifacts. The 19LA evaluation set consists of 13 different TTS and VC systems. The 21LA uses the same algorithms as 19LA for generating spoofed speech data and also reflects encoding and transmission effects.

We used the EER as the primary evaluation metric, and for the 19LA and 21LA, we also utilized the minimum normalized Tandem Detection Cost Function (min t-DCF) [28] as the additional metric.
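For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (fake speech accepted as bonafide) equals the false rejection rate (bonafide speech rejected). The following is a simple threshold-sweep sketch, not the official ASVspoof evaluation script. The function name is hypothetical, both score lists are assumed to be non-empty, and higher scores are assumed to indicate bonafide speech. When the two rates never coincide exactly, it returns their average at the closest threshold.

```python
from typing import Sequence


def equal_error_rate(
    bonafide_scores: Sequence[float],
    spoof_scores: Sequence[float],
) -> float:
    """Approximate EER in [0, 1]; higher scores mean 'more bonafide'."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # extra threshold that rejects everything
    best_gap, eer = float("inf"), 1.0
    for thr in thresholds:
        # False acceptance: fake samples scored at or above the threshold.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        # False rejection: bonafide samples scored below the threshold.
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

This sweep is quadratic in the number of scores and is meant only to clarify the definition. Perfectly separated scores give an EER of 0.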

3.2 Implementation Details

Data pre-processing. All audio data are cropped or concatenated to create segments of around 4 seconds duration [29]. Rawboost [7] is utilized for data augmentation. For the 21LA database, we use a combination of linear and non-linear convolutive noise and impulsive signal-dependent additive noise strategies, while for the others, we use stationary signal-independent additive noise with random coloration [29].

XLS-R feature encoder. We use a pre-trained XLS-R model comprising 0.3B parameters to extract a feature representation from the raw input waveform. The XLS-R model is implemented using the fairseq framework [30].

Training details. In the training phase, the XLS-R model is jointly optimized with the back-end classifier without freezing. The output sequences of the feature encoder are passed through the ASP module to obtain the utterance-level embedding. The Adam optimizer is used with an initial learning rate of $10^{-6}$, a weight decay of $10^{-4}$, and a batch size of 20. With the maximum number of training epochs set to 100, we apply early stopping to prevent over-fitting when the EER on the validation set shows no improvement for 7 consecutive iterations. The final system is derived by averaging the weights [31] of the model checkpoints from the 5 epochs with the best EER on the validation set. All experiments are conducted on a single GeForce RTX 4090 GPU.
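The checkpoint selection and weight averaging step described above can be sketched as follows. This is a simplified illustration with hypothetical names: each checkpoint is represented as a dictionary from parameter name to a flat list of floats, whereas real checkpoints hold framework tensors, and "best" is taken to mean the lowest validation EER.

```python
from typing import Dict, List, Sequence

StateDict = Dict[str, List[float]]


def select_best_epochs(validation_eer: Dict[int, float], k: int = 5) -> List[int]:
    """Return the k epochs with the lowest validation EER, best first."""
    return sorted(validation_eer, key=lambda epoch: validation_eer[epoch])[:k]


def average_checkpoints(checkpoints: Sequence[StateDict]) -> StateDict:
    """Element-wise average of parameters across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged: StateDict = {}
    for name in checkpoints[0]:
        # Group the values of the same parameter entry across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / len(checkpoints) for values in columns]
    return averaged
```

A typical use would first call `select_best_epochs` on the per-epoch validation EERs, load the corresponding checkpoints, and pass them to `average_checkpoints` to obtain the weights of the final system.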

Figure 3: Visualization of the embedding space using t-SNE on the ASVspoof 2021 LA evaluation dataset. Panels: (a) WCE, (b) OC-softmax, (c) ACS+OC.

4 Results and Analysis

Table 1: Comparison of our system with other recent systems evaluated on the ASVspoof 2021 DF dataset in terms of EER (%).

| System | EER (%) ↓ |
|---|---|
| Hubert+LCNN [8] | 12.39 |
| XLS-R+LCNN [8] | 4.75 |
| XLS-R+AASIST [29] | 2.85 |
| WavLM+MFA [32] | 2.56 |
| OCKD [33] | 2.27 |
| Ours | 2.19 |

Table 2: Comparison of our system with other recent systems evaluated on the ASVspoof 2019 LA and 2021 LA datasets, reported in terms of min t-DCF and EER (%).

| Eval Set | System | EER (%) ↓ | min t-DCF ↓ |
|---|---|---|---|
| 19LA | SENet [24] | 1.14 | 0.0368 |
| 19LA | RawGAT-ST [12] | 1.06 | 0.0335 |
| 19LA | AASIST [34] | 0.83 | 0.0275 |
| 19LA | DFSincNet [35] | 0.52 | 0.0176 |
| 19LA | WavLM+MFA [32] | 0.42 | - |
| 19LA | Wav2Vec2+VIB [9] | 0.40 | 0.0107 |
| 19LA | OCKD [33] | 0.39 | - |
| 19LA | XLS-R+AASIST [33] | 0.22 | - |
| 19LA | Ours | 0.17 | 0.0050 |
| 21LA | RawNet2 [7] | 5.31 | 0.3099 |
| 21LA | WavLM+MFA [32] | 5.08 | - |
| 21LA | Wav2Vec2+VIB [9] | 4.92 | - |
| 21LA | DFSincNet [35] | 3.05 | 0.2601 |
| 21LA | LCNN-LSTM [36] | 2.21 | - |
| 21LA | Ours | 1.30 | 0.2172 |
| 21LA | OCKD [33] | 0.90 | - |
| 21LA | XLS-R+AASIST [29] | 0.82 | 0.2066 |

4.1 Results

In Table 1, we compare our proposed system with other existing systems evaluated on the 21DF dataset. As shown in Table 1, our system outperforms all existing systems with an EER of 2.19%. Also, all systems except ours utilize binary classification methods. This highlights the importance of the one-class learning approach in the ADD task. Note that OCKD [33] utilizes one-class learning but employs a teacher model based on binary classification. Additionally, our system achieves superior performance by using only a simple ASP module, compared to conventional systems that use complicated classifiers.

In Table 2, we report the min t-DCF and EER of the proposed system on the 19LA and 21LA evaluation subsets, comparing its performance with other existing anti-spoofing systems. On 19LA, our system achieves an EER of 0.17% and a min t-DCF of 0.0050, outperforming all existing systems. On 21LA, our system achieves an EER of 1.30% and a min t-DCF of 0.2172, demonstrating competitive performance compared to the recent SOTA system. In summary, our proposed system shows excellent anti-spoofing performance not only on the 21DF dataset but also on the 19LA and 21LA datasets. This illustrates its generalization ability against unseen spoofing attacks.

4.2 Comparison of one-class and binary classification methods

To demonstrate the effectiveness of our method, we conduct a comparative analysis among our approach, a binary classification method, and another one-class method. In the ADD task, weighted cross entropy (WCE) and OC-softmax [13] are representative loss functions for binary and one-class classification, respectively. In Table 3, we report the anti-spoofing performance of binary and one-class classification methods, including our proposed method. All training details are the same except for the loss function and a linear layer added for the WCE. In the one-class methods, the utterance embedding is fed directly to the loss function, whereas in the WCE, the utterance embedding passes through a linear layer to ensure the final output has a dimension of 2.

As shown in Table 3, the one-class methods outperform the binary classification method. A likely cause of this degradation is that binary classification methods assume that fake speech shares a similar distribution. Among the one-class methods, ACS+OC shows the best performance, due to its method of determining the centroid. The centroid definition is analyzed in Sec. 4.3. Also, in Fig. 3, we visualize embeddings evaluated on the 21LA evaluation partition using t-SNE [19] to compare our method with the others. In contrast to WCE and OC-softmax, our proposed method maps bonafide embeddings into a single cluster, thereby demonstrating a clear separation between bonafide and spoof classes.

4.3 Analysis of centroid definition methods

We investigate the anti-spoofing performance based on various centroid definition methods to further support our method. Table 4 compares the anti-spoofing performance for four different centroid definition methods. In Table 4, we indicate the classes used to define the centroids for each method.

The fixed type defines the centroid as a random value and does not update it. The partially fixed ACS type defines the centroid in the same way as ACS for the first 5 epochs and then remains unchanged and fixed. In the trainable type, the neural network learns the centroid by backpropagation from cosine similarity.

According to Table 4, the fixed type shows the lowest performance. We can infer that the fixed centroid may not adequately represent the bonafide class. With the partially fixed centroid, the EER on the development set decreased only up to the 5th epoch and increased after the centroid was fixed. Comparing Trainable with ACS, the ACS approach shows better performance. In ACS, the centroid is influenced only by bonafide samples during the training process and not by fake samples, which can enhance the representativeness of the centroid. Furthermore, although outliers may be present in bonafide samples, the ACS method can improve the stability of the training by minimizing the impact of outliers on the centroid.

Table 3: Performance comparison between binary and one-class methods in terms of EER (%). All systems are evaluated on the ASVspoof 2021 LA and DF partition.

| Loss | Binary | One-class | DF EER (%) ↓ | LA EER (%) ↓ |
|---|---|---|---|---|
| WCE | ✓ | | 3.14 | 1.67 |
| OC-softmax | | ✓ | 2.48 | 1.55 |
| ACS+OC | | ✓ | 2.19 | 1.30 |

Table 4: Performance comparison between centroid types in terms of EER (%). All systems are evaluated on the ASVspoof 2021 DF partition.

| Type | Spoof | Bonafide | EER (%) ↓ |
|---|---|---|---|
| Fixed | | | 3.13 |
| Trainable | ✓ | ✓ | 2.52 |
| Partially fixed ACS | | ✓ | 2.44 |
| ACS | | ✓ | 2.19 |

5 Conclusion

We propose an ACS method for one-class learning to enhance the robustness of the model against unseen spoofing attacks. The ACS method calculates the centroid using only bonafide samples, enhancing the representativeness of the centroid. By incorporating ACS with our one-class loss function, we optimize the bonafide samples to be close to the centroid and the fake samples to be away from the centroid. Our approach maps the bonafide speech representations into a single cluster within the embedding space, improving generalization ability against unseen spoofing attacks. Furthermore, the ASP module effectively captures spoofing artifacts that exist both locally and globally within each utterance. Our system exhibits outstanding generalization ability against unseen spoofing attacks by outperforming all existing systems on the 21DF and 19LA datasets. To the best of our knowledge, these are the lowest reported EERs for both the 21DF and 19LA databases.

6 Acknowledgements

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT, 2022-0-00653).

References

  • [1] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” in International Conference on Learning Representations, 2020.
  • [2] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, “Autovc: Zero-shot voice style transfer with only autoencoder loss,” in International Conference on Machine Learning. PMLR, 2019.
  • [3] C. Stupp, “Fraudsters used ai to mimic ceo’s voice in unusual cybercrime case,” The Wall Street Journal, vol. 30, no. 08, 2019.
  • [4] X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch et al., “Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • [5] N. Müller, P. Czempin, F. Diekmann, A. Froghyar, and K. Böttinger, “Does Audio Deepfake Detection Generalize?” in Proc. Interspeech, 2022.
  • [6] W. Cai, D. Cai, W. Liu, G. Li, and M. Li, “Countermeasures for Automatic Speaker Verification Replay Spoofing Attack : On Data Augmentation, Feature Representation, Classification and Fusion,” in Proc. Interspeech, 2017.
  • [7] H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in Proc. ICASSP, 2022.
  • [8] X. Wang and J. Yamagishi, “Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2022), 2022.
  • [9] Y. Eom, Y. Lee, J. S. Um, and H. Kim, “Anti-spoofing using transfer learning with variational information bottleneck,” in Proc. Interspeech, 2022.
  • [10] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Proc. ICASSP, 2015.
  • [11] H. Ling, L. Huang, J. Huang, B. Zhang, and P. Li, “Attention-based convolutional neural network for asv spoofing detection.” in Proc. Interspeech, 2021.
  • [12] H. Tak, J.-w. Jung, J. Patino, M. Kamble, M. Todisco, and N. Evans, “End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection,” in Proc. ASVspoof workshop, 2021.
  • [13] Y. Zhang, F. Jiang, and Z. Duan, “One-class learning towards synthetic voice spoofing detection,” IEEE Signal Processing Letters, 2021.
  • [14] X. Yan, J. Yi, J. Tao, C. Wang, H. Ma, T. Wang, S. Wang, and R. Fu, “An initial investigation for detecting vocoder fingerprints of fake audio,” in Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia, 2022.
  • [15] F. Alegre, A. Amehraye, and N. Evans, “A one-class classification approach to generalised speaker verification spoofing countermeasures using local binary patterns,” in 2013 IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS), 2013.
  • [16] M. M. Moya, M. W. Koch, and L. D. Hostetler, “One-class classifier networks for target recognition applications,” NASA STI/Recon Technical Report N, 1993.
  • [17] C. Wang, M. Rivière, A. Lee et al., “Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in ACL 2021-59th Annual Meeting of the Association for Computational Linguistics, 2021.
  • [18] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A Large-Scale Multilingual Dataset for Speech Research,” in Proc. Interspeech, 2020.
  • [19] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, 2008.
  • [20] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino et al., “Xls-r: Self-supervised cross-lingual speech representation learning at scale,” in Proc. Interspeech, 2022.
  • [21] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, 2020.
  • [22] K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” in Proc. Interspeech, 2018.
  • [23] Y. Zhang, Z. Li, J. Lu, H. Hua, W. Wang, and P. Zhang, “The impact of silence on speech anti-spoofing,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • [24] Y. Zhang, W. Wang, and P. Zhang, “The Effect of Silence and Dual-Band Fusion in Anti-Spoofing System,” in Proc. Interspeech, 2021.
  • [25] X. Liu, M. Liu, L. Wang, K. A. Lee, H. Zhang, and J. Dang, “Leveraging positional-related local-global dependency for synthetic speech detection,” in Proc. ICASSP, 2023.
  • [26] X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch et al., “Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, 2020.
  • [27] J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, X. Liu, K. A. Lee, T. Kinnunen, N. Evans et al., “Asvspoof 2021: accelerating progress in spoofed and deepfake speech detection,” in ASVspoof 2021 Workshop, 2021.
  • [28] T. Kinnunen, K. A. Lee, H. Delgado, N. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, “t-dcf: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification,” arXiv preprint arXiv:1804.09618, 2018.
  • [29] H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in The Speaker and Language Recognition Workshop, 2022.
  • [30] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, “fairseq: A fast, extensible toolkit for sequence modeling,” in Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
  • [31] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson, “Averaging weights leads to wider optima and better generalization,” in 34th Conference on Uncertainty in Artificial Intelligence, 2018.
  • [32] Y. Guo, H. Huang, X. Chen, H. Zhao, and Y. Wang, “Audio deepfake detection with self-supervised wavlm and multi-fusion attentive classifier,” in Proc. ICASSP, 2024.
  • [33] J. Lu, Y. Zhang, W. Wang, Z. Shang, and P. Zhang, “One-class knowledge distillation for spoofing speech detection,” in Proc. ICASSP, 2024.
  • [34] J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in Proc. ICASSP, 2022.
  • [35] B. Huang, S. Cui, J. Huang, and X. Kang, “Discriminative frequency information learning for end-to-end speech anti-spoofing,” IEEE Signal Processing Letters, vol. 30, 2023.
  • [36] A. Tomilov, A. Svishchev, M. Volkova, A. Chirkovskiy, A. Kondratev, and G. Lavrentyeva, “Stc antispoofing systems for the asvspoof2021 challenge,” in Proc. ASVspoof 2021 Workshop, 2021.