[affiliation=1,2]ZhiyongWang \name[affiliation=1,*]RuiboFu \name[affiliation=1,]ZhengqiWen \name[affiliation=4]YuankunXie \name[affiliation=2]YukunLiu \name[affiliation=1,2]XiaopengWang \name[affiliation=1]XuefeiLiu \name[affiliation=1]YongweiLi \name[affiliation=3]JianhuaTao \name[affiliation=1,2]YiLu \name[affiliation=1,2]XinQi \name[affiliation=5]ShuchenShi
Generalized Fake Audio Detection via Deep Stable Learning
Abstract
Although current fake audio detection approaches have achieved remarkable success on specific datasets, they often fail when evaluated with datasets from different distributions. Previous studies typically address distribution shift by focusing on using extra data or applying extra loss restrictions during training. However, these methods either require a substantial amount of data or complicate the training process. In this work, we propose a stable learning-based training scheme that involves a Sample Weight Learning (SWL) module, addressing distribution shift by decorrelating all selected features via learning weights from training samples. The proposed portable plug-in-like SWL is easy to apply to multiple base models and generalizes them without using extra data during training. Experiments conducted on the ASVspoof datasets clearly demonstrate the effectiveness of SWL in generalizing different models across three evaluation datasets from different distributions.
keywords:
audio spoof, fake audio detection, stable learning, generalization1 Introduction
Existing synthesis techniques [1, 2] have the capability to create realistic imitations of human voices. What is even more concerning is that automatic speaker verification (ASV) systems [3] are susceptible to various types of spoofing attacks. To establish a more reliable automatic speaker verification system, Fake Audio Detection (FAD) model based on machine learning has emerged which aims to discern whether the input audio is produced by synthesis techniques (spoofed) or human (bona-fide).
While many machine learning approaches perform well under the I.I.D. (i.e., Independent and Identically Distributed) hypothesis, where the evaluation set and training set are independently sampled from the identical distribution. For FAD, there will always be new synthesis techniques and unexpected interferences not included in the training set, resulting in an unavoidable distribution shift in the test set. Therefore, researchers considered the distribution shift scenario when setting up datasets and introduced unseen conditions in the test set. For instance, ASVspoof2019 [4] and ASVspoof2021 [5] datasets exhibit significant domain shifts between the training and evaluation sets due to unseen spoofing attack methods and channel variation, making the datasets suitable for training FAD models and evaluating the generalization of FAD models.
Several studies have explored overcoming the distribution shift conducting experiments on ASVspoof datasets and related literature can be roughly divided in to two strands. One strand explores various training strategies, including multi-task learning [6, 7, 8], fusion-based methods [9, 10, 11], adversarial learning [12, 13], continual learning [14, 15], transfer learning using large pretrained models [16, 17, 18], and contrastive learning [19]. This type of method primarily enhances the training constraints by adding additional losses, implicitly supervising the model to obtain more discriminative features. While this approach can improve the generalization of the base model, it makes the training process intricate. Another strand focuses on the model’s input and can be further subdivided into data integration and data augmentation. Data integration means co-training a FAD model with multi-dataset, often combined with new training strategies, like [20] and [21] respectively introduced strategies named aggregation-separation domain generalization method and adaptive sharpness-aware minimization (ASAM). The basic idea of data integration is plain and simple, but this approach is sometimes effective and indicates no guarantee of improvements of generalization on every other dataset. Additionally, the interferences caused by the different characteristics of datasets remain unknown. Data augmentation uses technical means to modify training set data, making the distribution of the training set as close as possible to the distribution of the test set. For example, RawBoost [22] can be applied to raw audio data to simulate telephone scenarios. In [23], different vocoders are used to create various spoofing attack types of data based on the training set. Data augmentation can achieve good results under the premise of knowing the target distribution, which often cannot be satisfied in real applications.
To address the distribution shift issues and make the FAD model more applicable in more different distributions, this paper proposes a stable learning [24] based training scheme that involves a Sample Weight Learning (SWL) module. The aim of stable learning is to learn a predictive model that can achieve uniformly good performance in any possible environment, which perfectly fits the needs of FAD. The SWL module can be integrated into existing FAD models. This module takes the selected features as input and calculates sample weights to influence the final sample loss. Within the SWL module, we employ an approach based on Random Fourier Features (RFF) [25] to decorrelate all input features and utilize an iterative optimization approach [26] to gradually approximate global sample weights by iteratively optimizing local sample weights. Importantly, this module can be used as a portable plug-in during training and does not participate in the model’s inference process. Experimental results demonstrate that the SWL is applicable to multiple base FAD models and achieves consistent performance improvements when evaluated across multiple datasets from different distributions without the need for additional datasets for training or pre-training.
![Refer to caption](x1.png)
The contributions of this paper are presented as follows:
-
•
We propose a stable-learning based training scheme for FAD called SWL, which involves a SWL module based on RFF and utilizes an iterative optimization strategy. To the best of our knowledge, this is the first study to apply stable learning to improve the generalization of FAD models.
- •
-
•
We further explored what combination of node features in the AASIST model might be optimal for decorrelation when applying SWL.
2 Proposed Method
Different from the method mentioned earlier that uses extra data or complicates the training process by computing additional loss, we use stable learning to addressing distribution shift by decorrelating all input features without using extra data. Figure 1 shows the overall architecture of proposed SWL.
In stable learning, a sample weighting method [30] is proven to be useful to decorrelate features and help linear models produce more stable predictions under distribution shift. When extending these ideas into FAD models to handle more complex data types like audio, we first need to consider how to quantify independence between features. Moreover, this sample weighting method requires meticulous calculations for all features, which is not suitable for training deep neural networks.
In the following subsections, we detail sample weighting with RFF and iteratively learn global sample weights, which correspond to solving the two problems mentioned above.
2.1 Sample weighting with RFF
In stead of focusing only on eliminating dependence between bona-fide and spoofed samples, we tend to eliminate dependence between all input features so that we can get more dispersed representations of all samples. To eliminate the dependence between any pair of features and from samples in the representation space, we first need to quantify the independence between features. In [31], Hilbert-Schmidt Independence Criterion (HSIC) is proven to be capable of applying as a criterion to supervise feature decorrelation, but HSIC computationally cost too much on large datasets. In Euclidean space, Frobenius norm corresponds to the HilbertSchmidt norm [32], in this paper, Frobenius norm is used to quantify the independence between features.
We sample and from the distribution of and . Consider a measurable, positive definite kernel on the domain of random variable and we denote the corresponding Reproducing Kernel Hilbert Space as . If we denote cross-covariance operator from to by the symbol , then the partial cross-covariance matrix will be:
(1) |
where
(2) |
Here u and v are the RFF map** functions, and represent the numbers of functions from and denotes the function space of RFF with the following form:
(3) |
where and are sampled from the standard Normal distribution and the Uniform distribution respectively. Then, the independence quantification metric is defined as the Frobenius norm of the partial cross-covariance matrix (). Note that is always non-negative. So the lower the , the more independent the two variables and .
After establishing the quantification criteria, we can then measure general independence via RFF and eliminate the dependence between hidden state features through sample weighting. We denote the sample weights as and . After weighting, the changes into which can be calculated using Equation (2.1) as follows:
(4) |
Specifically, for selected features and mentioned above, we need to optimize w by
(5) |
where . Hence, weighting training samples with the optimal can mitigate the dependence between all selected hidden state features.
Generally, this algorithm iteratively optimize sample weights w, feature extractor function , and classifier function as follows:
(6) |
where , and means an input audio and prediction of the classifier, represents the weighted cross entropy loss function and represents the time stamp. Initially, .
2.2 Iteratively learn global sample weights
Equation 2.1 requires weight learned for every single input data. However, it cost too much for a FAD model based on deep neural networks trained on a large dataset to learn sample weights globally. For every batch during training, the features and the sample weights appear as follows:
(7) |
and , represent the current features and weights, are used to optimize the new sample weights in the next batch. and means features and weights represent the former global information learned from all previous batches. and means the features and weights in the current batch, representing local information. When a FAD model is trained with a large dataset, the concatenation would incur significant space consumption and lead to an increase in computational complexity.
To address this problem, we utilize an approximation method proposed in [26] which fuses the former global information and the local information at the end of each epoch as follows:
(8) |
We substitute with for the next batch. and use the hyperparameter to control the information, where a large value indicates long-term memory in the information, and a smaller value indicates short-term memory. In this paper, we set to . Detailed procedure can be found in Appendix A.1 of [26], we use Adam optimizer instead of SGD for optimization to converge faster and more stable.
3 Experiments
3.1 Datasets and Evaluation metrics
All experiments are trained on the Logical Access (LA) subset of the ASVspoof 2019 dataset. To verify the generalization of models, we utilized the ASVspoof 2021 dataset, which only contains evaluation sets and is the latest and most challenging edition of the ASVspoof challenge series. In comparison to ASVspoof 2019, the utterances in the LA scenario are transmitted across real telephone systems and the utterances in the DF scenario are processed through various audio compressors. Table 1 presents the overall statistics of the datasets used in the experiments.
Equal error rate (EER) is used to evaluate the performance of models. The lower the EER value, the better the models.
Dataset | # Spks | # Utts | # Attks |
---|---|---|---|
19LA | 20 / 48 | 25380 / 108978 | 6 / 13 |
21LA | - / 48 | - / 181566 | - / 13 |
21DF | - / 48 | - / 611829 | - / 13 |
3.2 Experimental setup
For experiments, we choose three FAD models as base models to apply SWL, and we will briefly explain why we choose them and how we apply SWL to each of them.
First base FAD model is AASIST model, both the original version and the light version respectively denoted as AASIST and AASIST-L. There two reasons for choosing this model. Firstly, this model is open source and has been recently utilized in multiple research studies, demonstrating state-of-the-art (SOTA) performance. Secondly, the features sent into the classifier in this model are composed of multiple node features extracted based on temporal and spectral domains. This setup allows for further exploration of feature interpretability regarding the optimal combination of node features for decorrelation.
Rawnet2 is well-known as one of the baselines of the ASVspoof challenge and a typical end to end model. As for the TSSD model, its performance is SOTA among end-to-end models. We use the input of the last fully connection layer in both models to apply SWL. We do not adjust parameters (e.g. learning rate, model architecture configuration) of all three base FAD models. The best models are chosen based on the lowest EER on the development set of 19LA.
Model | 19LA | 21LA | 21DF |
---|---|---|---|
AASIST | 1.52 | 9.96 | 20.45 |
AASIST + SWL | 1.38 | 9.40 | 19.09 |
AASIST-L | 1.26 | 11.44 | 22.40 |
AASIST-L + SWL | 1.14 | 8.02 | 19.67 |
\hdashlineAASIST-L + ASAM | 1.48 | 10.18 | 19.58 |
AASIST-L + ASAM∗ | 1.27 | 12.41 | 19.84 |
RawNet2 | 4.72 | 11.04 | 22.13 |
RawNet2 + SWL | 4.47 | 8.69 | 21.14 |
TSSD | 2.12 | 17.75 | 32.10 |
TSSD + SWL | 2.00 | 15.83 | 29.55 |
4 Results and Analyses
4.1 SWL generalizes base FAD models
In Table2, we validate the proposed SWL method by evaluating models on three datasets from different distributions, namely 19LA, 21LA and 21DF. From the results, it can be observed that WSL is able to simultaneously generalize the base FAD models on multiple datasets from different distributions.
![Refer to caption](extracted/5645670/aasist_l_n.png)
In addition, We compare SWL with another method known as ASAM, which has shown significant advancements in enhancing the generalization of base FAD models [21]. The results demonstrate that SWL achieves competitive performance comparing ASAM in assisting the base model when trained on the same single dataset. Furthermore, upon comparing the performance of the base model using ASAM and training it with multiple datasets, it becomes evident that SWL can better generalize the base model without the need for extra training data.
4.2 More RFF map** functions, higher generalization
In Section 2.1, we demonstrate how proposed SWL method work theoretically. There is a hyperparameter that might influence the performance, which is the number of RFF map** functions. In this part, we conduct some experiments applying SWL to AASIST-L model selecting all node features to be decorrelated and using different numbers of RFF map** functions. The results are recorded in the Figure 2.
It turns out that, for AASIST-L model, generalization is better when number of RFF map** functions is higher, but at the same time, both the training duration and computational complexity will also increase. In additional, there is a decrease in performance of all models on 19LA. We attribute this to the non-optimal combination of node features selected for decorrelation, and we will discuss it in the next section.
Node Features | 19LA | 21LA | 21DF |
---|---|---|---|
5N | 1.55 | 9.02 | 20.99 |
4N | 1.15 | 11.68 | 20.24 |
2N-T | 1.40 | 10.23 | 21.22 |
2N-S | 1.14 | 8.02 | 19.67 |
4.3 What combination of nodes is better for decorrelation
In AASIST framework [27], the temporal and spectral domains are modeled by two graph moduls in parallel and finally turn to four node features representing temporal and spectral information. Finally, the four node features and a stack node features are sent to classifier after simple concatenation.
In this section, we explore what combination of the five node features is optimal to be selected for feature decorrelation. Four combinations were considered, namely, all five node features (5N), removing the stack node features (4N), two node features related to spectral information (2N-S), and two node features related to temporal information (2N-T). Then, we conduct experiments based on AASIST-L model setting the number of RFF map** functions to 20. The experimental results in Table 3 show that, among the four combinations, 2N-S is optimal for feature decorrelation and yields the best results, indicating that decorrelating the 2N-S is easier, and the spectral domain is more helpful for FAD.
5 Conclusions
In this paper, we propose a stable learning based training scheme named SWL for FAD, which involves a sample weight learning module based on Random Fourier Features and utilizes an iterative optimization strategy. SWL focuses more on selected feature that extracted by FAD models and decorrelates them to help model produce more stable predictions under distribution shift. Experiments conducted on the ASVspoof datasets demonstrate that our methods can generalize multiple FAD models when evaluated across three different datasets. We further find out that, for AASIST-L, generalization becomes better when the number of RFF map** functions is higher, and the node features related to spectral features are better for decorrelating when applying SWL.
6 Acknowledgements
This work is supported by the National Natural Science Foundation of China (NSFC) (No.62101553, No.62306316, No.U21B20210, No. 62201571).
References
- [1] R. Fu, J. Tao, Z. Wen, and Y. Zheng, “Phoneme dependent speaker embedding and model factorization for multi-speaker speech synthesis and adaptation,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6930–6934.
- [2] R. Fu, J. Tao, Z. Wen, J. Yi, and T. Wang, “Focusing on attention: prosody transfer and adaptative optimization strategy for multi-speaker end-to-end speech synthesis,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6709–6713.
- [3] D. A. Reynolds, “An overview of automatic speaker recognition technology,” in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, 2002, pp. IV–4072–IV–4075.
- [4] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. H. Kinnunen, and K. A. Lee, “ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection,” in Proc. Interspeech 2019, 2019, pp. 1008–1012.
- [5] X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch et al., “Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
- [6] K. Ma, Y. Feng, B. Chen, and G. Zhao, “End-to-end dual-branch network towards synthetic speech detection,” IEEE Signal Processing Letters, vol. 30, pp. 359–363, 2023.
- [7] Y. Mo and S. Wang, “Multi-task learning improves synthetic speech detection,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6392–6396.
- [8] X. Wang and J. Yamagishi, “Estimating the confidence of speech spoofing countermeasure,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6372–6376.
- [9] P. Wen, K. Hu, W. Yue, S. Zhang, W. Zhou, and Z. Wang, “Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms,” in Proc. INTERSPEECH 2023, 2023, pp. 271–275.
- [10] Y. Zhang, W. Wang, and P. Zhang, “The Effect of Silence and Dual-Band Fusion in Anti-Spoofing System,” in Proc. Interspeech 2021, 2021, pp. 4279–4283.
- [11] J. Monteiro, J. Alam, and T. H. Falk, “An ensemble based approach for generalized detection of spoofing attacks to automatic speaker recognizers,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6599–6603.
- [12] H. Wu, S. Liu, H. Meng, and H.-y. Lee, “Defense against adversarial attacks on spoofing countermeasures of asv,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6564–6568.
- [13] Y. Ren, H. Peng, L. Li, X. Xue, Y. Lan, and Y. Yang, “Generalized voice spoofing detection via integral knowledge amalgamation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
- [14] H. Ma, J. Yi, J. Tao, Y. Bai, Z. Tian, and C. Wang, “Continual Learning for Fake Audio Detection,” in Proc. Interspeech 2021, 2021, pp. 886–890.
- [15] X. Zhang, J. Yi, J. Tao, C. Wang, and C. Y. Zhang, “Do you remember? Overcoming catastrophic forgetting for fake audio detection,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–29 Jul 2023, pp. 41 819–41 831. [Online]. Available: https://proceedings.mlr.press/v202/zhang23au.html
- [16] H. Tak, M. Todisco, X. Wang, J. weon Jung, J. Yamagishi, and N. Evans, “Automatic Speaker Verification Spoofing and Deepfake Detection Using Wav2vec 2.0 and Data Augmentation,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2022), 2022, pp. 112–119.
- [17] P. Kawa, M. Plata, M. Czuba, P. Szymański, and P. Syga, “Improved DeepFake Detection Using Whisper Features,” in Proc. INTERSPEECH 2023, 2023, pp. 4009–4013.
- [18] X. Wang and J. Yamagishi, “Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2022), 2022, pp. 100–106.
- [19] C. Goel, S. Koppisetti, B. Colman, A. Shahriyari, and G. Bharaj, “Towards Attention-based Contrastive Learning for Audio Spoof Detection,” in Proc. INTERSPEECH 2023, 2023, pp. 2758–2762.
- [20] Y. Xie, H. Cheng, Y. Wang, and L. Ye, “Learning A Self-Supervised Domain-Invariant Feature Representation for Generalized Audio Deepfake Detection,” in Proc. INTERSPEECH 2023, 2023, pp. 2808–2812.
- [21] H. ** Shim, J. weon Jung, and T. Kinnunen, “Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing,” in Proc. INTERSPEECH 2023, 2023, pp. 3804–3808.
- [22] H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6382–6386.
- [23] X. Wang and J. Yamagishi, “Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [24] P. Cui and S. Athey, “Stable learning establishes some common ground between causal inference and machine learning,” Nature Machine Intelligence, vol. 4, pp. 110 – 115, 2022.
- [25] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” Advances in neural information processing systems, vol. 20, 2007.
- [26] X. Zhang, P. Cui, R. Xu, L. Zhou, Y. He, and Z. Shen, “Deep stable learning for out-of-distribution generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5372–5382.
- [27] J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6367–6371.
- [28] H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, “End-to-end anti-spoofing with rawnet2,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6369–6373.
- [29] G. Hua, A. B. J. Teoh, and H. Zhang, “Towards end-to-end synthetic speech detection,” IEEE Signal Processing Letters, vol. 28, pp. 1265–1269, 2021.
- [30] K. Kuang, R. Xiong, P. Cui, S. Athey, and B. Li, “Stable prediction with model misspecification and agnostic distribution shift,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 4485–4492.
- [31] H. Bahng, S. Chun, S. Yun, J. Choo, and S. J. Oh, “Learning de-biased representations with biased representations,” in International Conference on Machine Learning. PMLR, 2020, pp. 528–539.
- [32] E. V. Strobl, K. Zhang, and S. Visweswaran, “Approximate kernel-based conditional independence tests for fast non-parametric causal discovery,” Journal of Causal Inference, vol. 7, no. 1, p. 20180017, 2019.