\interspeechcameraready\name

[affiliation=1,2]ZhiyongWang \name[affiliation=1,*]RuiboFu \name[affiliation=1,]ZhengqiWen \name[affiliation=4]YuankunXie \name[affiliation=2]YukunLiu \name[affiliation=1,2]XiaopengWang \name[affiliation=1]XuefeiLiu \name[affiliation=1]YongweiLi \name[affiliation=3]JianhuaTao \name[affiliation=1,2]YiLu \name[affiliation=1,2]XinQi \name[affiliation=5]ShuchenShi

Generalized Fake Audio Detection via Deep Stable Learning

Abstract

Although current fake audio detection approaches have achieved remarkable success on specific datasets, they often fail when evaluated with datasets from different distributions. Previous studies typically address distribution shift by focusing on using extra data or applying extra loss restrictions during training. However, these methods either require a substantial amount of data or complicate the training process. In this work, we propose a stable learning-based training scheme that involves a Sample Weight Learning (SWL) module, addressing distribution shift by decorrelating all selected features via learning weights from training samples. The proposed portable plug-in-like SWL is easy to apply to multiple base models and generalizes them without using extra data during training. Experiments conducted on the ASVspoof datasets clearly demonstrate the effectiveness of SWL in generalizing different models across three evaluation datasets from different distributions.

keywords:
audio spoof, fake audio detection, stable learning, generalization

1 Introduction

Existing synthesis techniques [1, 2] have the capability to create realistic imitations of human voices. What is even more concerning is that automatic speaker verification (ASV) systems [3] are susceptible to various types of spoofing attacks. To establish a more reliable automatic speaker verification system, Fake Audio Detection (FAD) model based on machine learning has emerged which aims to discern whether the input audio is produced by synthesis techniques (spoofed) or human (bona-fide).

While many machine learning approaches perform well under the I.I.D. (i.e., Independent and Identically Distributed) hypothesis, where the evaluation set and training set are independently sampled from the identical distribution. For FAD, there will always be new synthesis techniques and unexpected interferences not included in the training set, resulting in an unavoidable distribution shift in the test set. Therefore, researchers considered the distribution shift scenario when setting up datasets and introduced unseen conditions in the test set. For instance, ASVspoof2019 [4] and ASVspoof2021 [5] datasets exhibit significant domain shifts between the training and evaluation sets due to unseen spoofing attack methods and channel variation, making the datasets suitable for training FAD models and evaluating the generalization of FAD models.

Several studies have explored overcoming the distribution shift conducting experiments on ASVspoof datasets and related literature can be roughly divided in to two strands. One strand explores various training strategies, including multi-task learning [6, 7, 8], fusion-based methods [9, 10, 11], adversarial learning [12, 13], continual learning [14, 15], transfer learning using large pretrained models [16, 17, 18], and contrastive learning [19]. This type of method primarily enhances the training constraints by adding additional losses, implicitly supervising the model to obtain more discriminative features. While this approach can improve the generalization of the base model, it makes the training process intricate. Another strand focuses on the model’s input and can be further subdivided into data integration and data augmentation. Data integration means co-training a FAD model with multi-dataset, often combined with new training strategies, like [20] and [21] respectively introduced strategies named aggregation-separation domain generalization method and adaptive sharpness-aware minimization (ASAM). The basic idea of data integration is plain and simple, but this approach is sometimes effective and indicates no guarantee of improvements of generalization on every other dataset. Additionally, the interferences caused by the different characteristics of datasets remain unknown. Data augmentation uses technical means to modify training set data, making the distribution of the training set as close as possible to the distribution of the test set. For example, RawBoost [22] can be applied to raw audio data to simulate telephone scenarios. In [23], different vocoders are used to create various spoofing attack types of data based on the training set. Data augmentation can achieve good results under the premise of knowing the target distribution, which often cannot be satisfied in real applications.

To address the distribution shift issues and make the FAD model more applicable in more different distributions, this paper proposes a stable learning [24] based training scheme that involves a Sample Weight Learning (SWL) module. The aim of stable learning is to learn a predictive model that can achieve uniformly good performance in any possible environment, which perfectly fits the needs of FAD. The SWL module can be integrated into existing FAD models. This module takes the selected features as input and calculates sample weights to influence the final sample loss. Within the SWL module, we employ an approach based on Random Fourier Features (RFF) [25] to decorrelate all input features and utilize an iterative optimization approach [26] to gradually approximate global sample weights by iteratively optimizing local sample weights. Importantly, this module can be used as a portable plug-in during training and does not participate in the model’s inference process. Experimental results demonstrate that the SWL is applicable to multiple base FAD models and achieves consistent performance improvements when evaluated across multiple datasets from different distributions without the need for additional datasets for training or pre-training.

Refer to caption
Figure 1: The overall architecture of the proposed stable learning based method. LSWD refers to Learning Sample Weighting for Decorrelation as described in Section 2.1. The number of RFF map** fuctions and hidden state feature are flexible to be adjusted. In the training stage, we only need to feed the selected hidden state feature into SWL module and multiply the computed sample weights with the Weighted Cross-Entropy (WCE) loss. In the inference phase, the model directly conduct prediction without calculation of sample weights.

The contributions of this paper are presented as follows:

  • We propose a stable-learning based training scheme for FAD called SWL, which involves a SWL module based on RFF and utilizes an iterative optimization strategy. To the best of our knowledge, this is the first study to apply stable learning to improve the generalization of FAD models.

  • Experiments conducted on the ASVspoof datasets show that SWL can generalize AASIST [27], RawNet2 [28], and TSSD [29] when evaluated across three different datasets.

  • We further explored what combination of node features in the AASIST model might be optimal for decorrelation when applying SWL.

2 Proposed Method

Different from the method mentioned earlier that uses extra data or complicates the training process by computing additional loss, we use stable learning to addressing distribution shift by decorrelating all input features without using extra data. Figure 1 shows the overall architecture of proposed SWL.

In stable learning, a sample weighting method [30] is proven to be useful to decorrelate features and help linear models produce more stable predictions under distribution shift. When extending these ideas into FAD models to handle more complex data types like audio, we first need to consider how to quantify independence between features. Moreover, this sample weighting method requires meticulous calculations for all features, which is not suitable for training deep neural networks.

In the following subsections, we detail sample weighting with RFF and iteratively learn global sample weights, which correspond to solving the two problems mentioned above.

2.1 Sample weighting with RFF

In stead of focusing only on eliminating dependence between bona-fide and spoofed samples, we tend to eliminate dependence between all input features so that we can get more dispersed representations of all samples. To eliminate the dependence between any pair of features Z:,isubscript𝑍:𝑖Z_{:,i}italic_Z start_POSTSUBSCRIPT : , italic_i end_POSTSUBSCRIPT and Z:,jsubscript𝑍:𝑗Z_{:,j}italic_Z start_POSTSUBSCRIPT : , italic_j end_POSTSUBSCRIPT from samples in the representation space, we first need to quantify the independence between features. In [31], Hilbert-Schmidt Independence Criterion (HSIC) is proven to be capable of applying as a criterion to supervise feature decorrelation, but HSIC computationally cost too much on large datasets. In Euclidean space, Frobenius norm corresponds to the HilbertSchmidt norm [32], in this paper, Frobenius norm is used to quantify the independence between features.

We sample (A1,A2,,An)subscript𝐴1subscript𝐴2subscript𝐴𝑛(A_{1},A_{2},\ldots,A_{n})( italic_A start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_A start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) and (B1,B2,,Bn)subscript𝐵1subscript𝐵2subscript𝐵𝑛(B_{1},B_{2},\ldots,B_{n})( italic_B start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_B start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_B start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) from the distribution of A𝐴Aitalic_A and B𝐵Bitalic_B. Consider a measurable, positive definite kernel kDsubscript𝑘𝐷k_{D}italic_k start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT on the domain of random variable D𝐷Ditalic_D and we denote the corresponding Reproducing Kernel Hilbert Space as Dsubscript𝐷\mathcal{H}_{D}caligraphic_H start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT. If we denote cross-covariance operator from Bsubscript𝐵\mathcal{H}_{B}caligraphic_H start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT to Asubscript𝐴\mathcal{H}_{A}caligraphic_H start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT by the symbol ΣABsubscriptΣAB\Sigma_{\text{AB}}roman_Σ start_POSTSUBSCRIPT AB end_POSTSUBSCRIPT, then the partial cross-covariance matrix will be:

Σ^ABsubscript^Σ𝐴𝐵\displaystyle\hat{\Sigma}_{AB}over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_A italic_B end_POSTSUBSCRIPT =1n1i=1n[(u(Ai)1nj=1nu(Aj))T\displaystyle=\frac{1}{n-1}\sum_{i=1}^{n}[(\textbf{u}(A_{i})-\frac{1}{n}\sum_{% j=1}^{n}\textbf{u}(A_{j}))^{T}\cdot= divide start_ARG 1 end_ARG start_ARG italic_n - 1 end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT [ ( u ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT u ( italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ⋅
(v(Bi)1nj=1nv(Bj))],\displaystyle\phantom{=AAAA==}(\textbf{v}(B_{i})-\frac{1}{n}\sum_{j=1}^{n}% \textbf{v}(B_{j}))],( v ( italic_B start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT v ( italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ) ] , (1)

where

u(A)u𝐴\displaystyle\textbf{u}(A)u ( italic_A ) =(u1(A),u2(A),,unA(A)),uj(A)RFF,j,formulae-sequenceabsentsubscript𝑢1𝐴subscript𝑢2𝐴subscript𝑢subscript𝑛𝐴𝐴subscript𝑢𝑗𝐴subscriptRFFfor-all𝑗\displaystyle=(u_{1}(A),u_{2}(A),\ldots,u_{n_{A}}(A)),\quad u_{j}(A)\in% \mathcal{H}_{\text{RFF}},\quad\forall j,= ( italic_u start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_A ) , italic_u start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_A ) , … , italic_u start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_A ) ) , italic_u start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_A ) ∈ caligraphic_H start_POSTSUBSCRIPT RFF end_POSTSUBSCRIPT , ∀ italic_j ,
v(B)v𝐵\displaystyle\textbf{v}(B)v ( italic_B ) =(v1(B),v2(B),,vnB(B)),vj(B)RFF,j,formulae-sequenceabsentsubscript𝑣1𝐵subscript𝑣2𝐵subscript𝑣subscript𝑛𝐵𝐵subscript𝑣𝑗𝐵subscriptRFFfor-all𝑗\displaystyle=(v_{1}(B),v_{2}(B),\ldots,v_{n_{B}}(B)),\quad v_{j}(B)\in% \mathcal{H}_{\text{RFF}},\quad\forall j,= ( italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_B ) , italic_v start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_B ) , … , italic_v start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_B ) ) , italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_B ) ∈ caligraphic_H start_POSTSUBSCRIPT RFF end_POSTSUBSCRIPT , ∀ italic_j , (2)

Here u and v are the RFF map** functions, nAsubscript𝑛𝐴n_{A}italic_n start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT and nBsubscript𝑛𝐵n_{B}italic_n start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT represent the numbers of functions from RFFsubscriptRFF\mathcal{H}_{\text{RFF}}caligraphic_H start_POSTSUBSCRIPT RFF end_POSTSUBSCRIPT and RFFsubscriptRFF\mathcal{H}_{\text{RFF}}caligraphic_H start_POSTSUBSCRIPT RFF end_POSTSUBSCRIPT denotes the function space of RFF with the following form:

RFFsubscriptRFF\displaystyle\mathcal{H}_{\text{RFF}}caligraphic_H start_POSTSUBSCRIPT RFF end_POSTSUBSCRIPT ={h:x2cos(ωx+ϕ)\displaystyle=\{h:x\rightarrow\sqrt{2}\cos(\omega x+\phi)\mid= { italic_h : italic_x → square-root start_ARG 2 end_ARG roman_cos ( italic_ω italic_x + italic_ϕ ) ∣
ωN(0,1),ϕUniform(0,2π)},\displaystyle\phantom{==}\omega\sim{N}(0,1),\phi\sim\text{Uniform}(0,2\pi)\},italic_ω ∼ italic_N ( 0 , 1 ) , italic_ϕ ∼ Uniform ( 0 , 2 italic_π ) } , (3)

where ω𝜔\omegaitalic_ω and ϕitalic-ϕ\phiitalic_ϕ are sampled from the standard Normal distribution and the Uniform distribution respectively. Then, the independence quantification metric IABsubscript𝐼𝐴𝐵I_{AB}italic_I start_POSTSUBSCRIPT italic_A italic_B end_POSTSUBSCRIPT is defined as the Frobenius norm of the partial cross-covariance matrix (IAB=Σ^ABF2subscript𝐼𝐴𝐵superscriptsubscriptnormsubscript^Σ𝐴𝐵𝐹2I_{AB}=\|\hat{\Sigma}_{{AB}}\|_{F}^{2}italic_I start_POSTSUBSCRIPT italic_A italic_B end_POSTSUBSCRIPT = ∥ over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_A italic_B end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT). Note that IABsubscript𝐼𝐴𝐵I_{AB}italic_I start_POSTSUBSCRIPT italic_A italic_B end_POSTSUBSCRIPT is always non-negative. So the lower the IABsubscript𝐼𝐴𝐵I_{AB}italic_I start_POSTSUBSCRIPT italic_A italic_B end_POSTSUBSCRIPT, the more independent the two variables A𝐴Aitalic_A and B𝐵Bitalic_B.

After establishing the quantification criteria, we can then measure general independence via RFF and eliminate the dependence between hidden state features through sample weighting. We denote the sample weights as w+n𝑤superscriptsubscript𝑛w\in\mathbb{R}_{+}^{n}italic_w ∈ blackboard_R start_POSTSUBSCRIPT + end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and Σi=1nωi=nsuperscriptsubscriptΣi=1𝑛subscript𝜔𝑖𝑛\Sigma_{\text{i=1}}^{n}\omega_{i}=nroman_Σ start_POSTSUBSCRIPT i=1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_n. After weighting, the Σ^ABsubscript^ΣAB\hat{\Sigma}_{\text{AB}}over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT AB end_POSTSUBSCRIPT changes into Σ^AB;wsubscript^ΣAB;w\hat{\Sigma}_{\text{AB;w}}over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT AB;w end_POSTSUBSCRIPT which can be calculated using Equation (2.1) as follows:

Σ^AB;wsubscript^Σ𝐴𝐵w\displaystyle\hat{\Sigma}_{AB;\text{w}}over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_A italic_B ; w end_POSTSUBSCRIPT =1n1i=1n[(ωiu(Ai)1nj=1nωju(Aj))T\displaystyle=\frac{1}{n-1}\sum_{i=1}^{n}[(\omega_{i}\textbf{u}(A_{i})-\frac{1% }{n}\sum_{j=1}^{n}\omega_{j}\textbf{u}(A_{j}))^{T}\cdot= divide start_ARG 1 end_ARG start_ARG italic_n - 1 end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT [ ( italic_ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT u ( italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT u ( italic_A start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ⋅
(ωiv(Bi)1nj=1nωjv(Bj))],\displaystyle\phantom{=AAAA==}(\omega_{i}\textbf{v}(B_{i})-\frac{1}{n}\sum_{j=% 1}^{n}\omega_{j}\textbf{v}(B_{j}))],( italic_ω start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT v ( italic_B start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT v ( italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ) ] , (4)

Specifically, for selected features Z:,isubscript𝑍:,iZ_{\text{:,i}}italic_Z start_POSTSUBSCRIPT :,i end_POSTSUBSCRIPT and Z:,jsubscript𝑍:,jZ_{\text{:,j}}italic_Z start_POSTSUBSCRIPT :,j end_POSTSUBSCRIPT mentioned above, we need to optimize w by

wsuperscriptw\displaystyle\textbf{w}^{*}w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT =argminwΔn1i<jmzΣ^Z:,i,Z:,j;wF2,absentwsubscriptΔ𝑛𝑎𝑟𝑔𝑚𝑖𝑛1𝑖𝑗subscript𝑚𝑧superscriptsubscriptnormsubscript^Σsubscript𝑍:𝑖subscript𝑍:𝑗w𝐹2\displaystyle=\underset{\textbf{w}\in\Delta_{n}}{argmin}\underset{1\leq i<j% \leq m_{z}}{\sum}\|\hat{\Sigma}_{{Z_{:,i},Z_{:,j};\textbf{w}}}\|_{F}^{2},= start_UNDERACCENT w ∈ roman_Δ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_UNDERACCENT start_ARG italic_a italic_r italic_g italic_m italic_i italic_n end_ARG start_UNDERACCENT 1 ≤ italic_i < italic_j ≤ italic_m start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_UNDERACCENT start_ARG ∑ end_ARG ∥ over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_Z start_POSTSUBSCRIPT : , italic_i end_POSTSUBSCRIPT , italic_Z start_POSTSUBSCRIPT : , italic_j end_POSTSUBSCRIPT ; w end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (5)

where Δn={w+ni=1nwi=n}subscriptΔ𝑛conditional-setwsuperscriptsubscript𝑛superscriptsubscript𝑖1𝑛subscript𝑤𝑖𝑛\Delta_{n}=\{\textbf{w}\in\mathbb{R}_{+}^{n}\mid\sum_{i=1}^{n}w_{i}=n\}roman_Δ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = { w ∈ blackboard_R start_POSTSUBSCRIPT + end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ∣ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_n }. Hence, weighting training samples with the optimal wsuperscriptw\textbf{w}^{*}w start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT can mitigate the dependence between all selected hidden state features.

Generally, this algorithm iteratively optimize sample weights w, feature extractor function f𝑓fitalic_f , and classifier function g𝑔gitalic_g as follows:

f(t+1),g(t+1)superscript𝑓𝑡1superscript𝑔𝑡1\displaystyle f^{(t+1)},g^{(t+1)}italic_f start_POSTSUPERSCRIPT ( italic_t + 1 ) end_POSTSUPERSCRIPT , italic_g start_POSTSUPERSCRIPT ( italic_t + 1 ) end_POSTSUPERSCRIPT =argminf,gi=1nwi(t)L(g(f(Xi)),yi),absent𝑓𝑔𝑎𝑟𝑔𝑚𝑖𝑛superscriptsubscript𝑖1𝑛superscriptsubscript𝑤𝑖𝑡𝐿𝑔𝑓subscript𝑋𝑖subscript𝑦𝑖\displaystyle=\underset{f,g}{argmin}\sum_{i=1}^{n}w_{i}^{(t)}L(g(f(X_{i})),y_{% i}),= start_UNDERACCENT italic_f , italic_g end_UNDERACCENT start_ARG italic_a italic_r italic_g italic_m italic_i italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT italic_L ( italic_g ( italic_f ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ,
w(t+1)superscriptw𝑡1\displaystyle\textbf{w}^{(t+1)}w start_POSTSUPERSCRIPT ( italic_t + 1 ) end_POSTSUPERSCRIPT =argminwΔn1i<jmzΣ^Z:,i(t+1),Z:,j(t+1);wF2,absentwsubscriptΔ𝑛𝑎𝑟𝑔𝑚𝑖𝑛1𝑖𝑗subscript𝑚𝑧superscriptsubscriptnormsubscript^Σsubscriptsuperscript𝑍𝑡1:𝑖subscriptsuperscript𝑍𝑡1:𝑗w𝐹2\displaystyle=\underset{\textbf{w}\in\Delta_{n}}{argmin}\underset{1\leq i<j% \leq m_{z}}{\sum}\|\hat{\Sigma}_{{Z^{(t+1)}_{:,i},Z^{(t+1)}_{:,j};\textbf{w}}}% \|_{F}^{2},= start_UNDERACCENT w ∈ roman_Δ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_UNDERACCENT start_ARG italic_a italic_r italic_g italic_m italic_i italic_n end_ARG start_UNDERACCENT 1 ≤ italic_i < italic_j ≤ italic_m start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_UNDERACCENT start_ARG ∑ end_ARG ∥ over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_Z start_POSTSUPERSCRIPT ( italic_t + 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT : , italic_i end_POSTSUBSCRIPT , italic_Z start_POSTSUPERSCRIPT ( italic_t + 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT : , italic_j end_POSTSUBSCRIPT ; w end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (6)

where Z(t+1)=f(t+1)(X)superscript𝑍𝑡1superscript𝑓𝑡1𝑋Z^{(t+1)}=f^{(t+1)}(X)italic_Z start_POSTSUPERSCRIPT ( italic_t + 1 ) end_POSTSUPERSCRIPT = italic_f start_POSTSUPERSCRIPT ( italic_t + 1 ) end_POSTSUPERSCRIPT ( italic_X ), Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT means an input audio and prediction of the classifier, L(·,·)𝐿··L(\textperiodcentered,\textperiodcentered)italic_L ( · , · ) represents the weighted cross entropy loss function and t𝑡titalic_t represents the time stamp. Initially, w(0)=(1,1,,1)Tsuperscriptw0superscript111𝑇\textbf{w}^{(0)}=(1,1,...,1)^{T}w start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT = ( 1 , 1 , … , 1 ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT .

2.2 Iteratively learn global sample weights

Equation 2.1 requires weight learned for every single input data. However, it cost too much for a FAD model based on deep neural networks trained on a large dataset to learn sample weights globally. For every batch during training, the features and the sample weights appear as follows:

ZO,subscriptZ𝑂\displaystyle\text{Z}_{O},Z start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT , =Concat(ZG1,ZG2,,ZGk,ZL)absentConcatsubscriptZ𝐺1subscriptZ𝐺2subscriptZ𝐺𝑘subscriptZ𝐿\displaystyle=\text{Concat}(\text{Z}_{G1},\text{Z}_{G2},\ldots,\text{Z}_{Gk},% \text{Z}_{L})= Concat ( Z start_POSTSUBSCRIPT italic_G 1 end_POSTSUBSCRIPT , Z start_POSTSUBSCRIPT italic_G 2 end_POSTSUBSCRIPT , … , Z start_POSTSUBSCRIPT italic_G italic_k end_POSTSUBSCRIPT , Z start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT )
wO,subscriptw𝑂\displaystyle\text{w}_{O},w start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT , =Concat(wG1,wG2,,wGk,wL)absentConcatsubscriptw𝐺1subscriptw𝐺2subscriptw𝐺𝑘subscriptw𝐿\displaystyle=\text{Concat}(\text{w}_{G1},\text{w}_{G2},\ldots,\text{w}_{Gk},% \text{w}_{L})= Concat ( w start_POSTSUBSCRIPT italic_G 1 end_POSTSUBSCRIPT , w start_POSTSUBSCRIPT italic_G 2 end_POSTSUBSCRIPT , … , w start_POSTSUBSCRIPT italic_G italic_k end_POSTSUBSCRIPT , w start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ) (7)

ZOsubscriptZ𝑂\text{Z}_{O}Z start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT and wOsubscriptw𝑂\text{w}_{O}w start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT, represent the current features and weights, are used to optimize the new sample weights in the next batch. ZG1,,ZGksubscriptZ𝐺1subscriptZ𝐺𝑘\text{Z}_{G1},\ldots,\text{Z}_{Gk}Z start_POSTSUBSCRIPT italic_G 1 end_POSTSUBSCRIPT , … , Z start_POSTSUBSCRIPT italic_G italic_k end_POSTSUBSCRIPT and wG1,,wGksubscriptw𝐺1subscriptw𝐺𝑘\text{w}_{G1},\ldots,\text{w}_{Gk}w start_POSTSUBSCRIPT italic_G 1 end_POSTSUBSCRIPT , … , w start_POSTSUBSCRIPT italic_G italic_k end_POSTSUBSCRIPT means features and weights represent the former global information learned from all previous batches. ZLsubscriptZ𝐿\text{Z}_{L}Z start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT and wLsubscriptw𝐿\text{w}_{L}w start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT means the features and weights in the current batch, representing local information. When a FAD model is trained with a large dataset, the concatenation would incur significant space consumption and lead to an increase in computational complexity.

To address this problem, we utilize an approximation method proposed in [26] which fuses the former global information and the local information at the end of each epoch as follows:

ZGi,superscriptsubscriptZsubscript𝐺𝑖\displaystyle\text{Z}_{G_{i}}^{{}^{\prime}},Z start_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT , =αiZGi+(1αi)ZLabsentsubscript𝛼𝑖subscriptZ𝐺𝑖1subscript𝛼𝑖subscriptZ𝐿\displaystyle=\alpha_{i}\text{Z}_{Gi}+(1-\alpha_{i})\text{Z}_{L}= italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT Z start_POSTSUBSCRIPT italic_G italic_i end_POSTSUBSCRIPT + ( 1 - italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) Z start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT
wGi,superscriptsubscriptwsubscript𝐺𝑖\displaystyle\text{w}_{G_{i}}^{{}^{\prime}},w start_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT , =αiwGi+(1αi)wLabsentsubscript𝛼𝑖subscriptw𝐺𝑖1subscript𝛼𝑖subscriptw𝐿\displaystyle=\alpha_{i}\text{w}_{Gi}+(1-\alpha_{i})\text{w}_{L}= italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT w start_POSTSUBSCRIPT italic_G italic_i end_POSTSUBSCRIPT + ( 1 - italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) w start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT (8)

We substitute (ZGi,wGi)subscriptZ𝐺𝑖subscriptw𝐺𝑖(\text{Z}_{Gi},\text{w}_{Gi})( Z start_POSTSUBSCRIPT italic_G italic_i end_POSTSUBSCRIPT , w start_POSTSUBSCRIPT italic_G italic_i end_POSTSUBSCRIPT ) with (ZGi,wGi)superscriptsubscriptZ𝐺𝑖subscriptsuperscriptw𝐺𝑖(\text{Z}_{Gi}^{{}^{\prime}},\text{w}^{{}^{\prime}}_{Gi})( Z start_POSTSUBSCRIPT italic_G italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT , w start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_G italic_i end_POSTSUBSCRIPT ) for the next batch. and use the hyperparameter α𝛼\alphaitalic_α to control the information, where a large value indicates long-term memory in the information, and a smaller value indicates short-term memory. In this paper, we set α𝛼\alphaitalic_α to 0.90.90.90.9. Detailed procedure can be found in Appendix A.1 of [26], we use Adam optimizer instead of SGD for optimization to converge faster and more stable.

3 Experiments

3.1 Datasets and Evaluation metrics

All experiments are trained on the Logical Access (LA) subset of the ASVspoof 2019 dataset. To verify the generalization of models, we utilized the ASVspoof 2021 dataset, which only contains evaluation sets and is the latest and most challenging edition of the ASVspoof challenge series. In comparison to ASVspoof 2019, the utterances in the LA scenario are transmitted across real telephone systems and the utterances in the DF scenario are processed through various audio compressors. Table 1 presents the overall statistics of the datasets used in the experiments.

Equal error rate (EER) is used to evaluate the performance of models. The lower the EER value, the better the models.

Table 1: Datasets used in the experiments. # Spks, # Utts, and # Attks indicate the number of speakers, utterances, and spoofing attacks, respectively. Training and evaluation set is divided by /. 19LA, 21LA and 21DF respectively represent ASVspoof 2019 LA, ASVspoof 2021 LA, ASVspoof 2021 DF.
Dataset # Spks # Utts # Attks
19LA 20 / 48 25380 / 108978 6 / 13
21LA - / 48 - / 181566 - / 13
21DF - / 48 - / 611829 - / 13

3.2 Experimental setup

For experiments, we choose three FAD models as base models to apply SWL, and we will briefly explain why we choose them and how we apply SWL to each of them.

First base FAD model is AASIST model, both the original version and the light version respectively denoted as AASIST and AASIST-L. There two reasons for choosing this model. Firstly, this model is open source and has been recently utilized in multiple research studies, demonstrating state-of-the-art (SOTA) performance. Secondly, the features sent into the classifier in this model are composed of multiple node features extracted based on temporal and spectral domains. This setup allows for further exploration of feature interpretability regarding the optimal combination of node features for decorrelation.

Rawnet2 is well-known as one of the baselines of the ASVspoof challenge and a typical end to end model. As for the TSSD model, its performance is SOTA among end-to-end models. We use the input of the last fully connection layer in both models to apply SWL. We do not adjust parameters (e.g. learning rate, model architecture configuration) of all three base FAD models. The best models are chosen based on the lowest EER on the development set of 19LA.

Table 2: EER(%) results of the base FAD models and applying SWL to them. The results about ASAM are from [21] and the represents the FAD model is trained on multi-dataset.
Model 19LA 21LA 21DF
AASIST 1.52 9.96 20.45
AASIST + SWL 1.38 9.40 19.09
AASIST-L 1.26 11.44 22.40
AASIST-L + SWL 1.14 8.02 19.67
\hdashlineAASIST-L + ASAM 1.48 10.18 19.58
AASIST-L + ASAM 1.27 12.41 19.84
RawNet2 4.72 11.04 22.13
RawNet2 + SWL 4.47 8.69 21.14
TSSD 2.12 17.75 32.10
TSSD + SWL 2.00 15.83 29.55

4 Results and Analyses

4.1 SWL generalizes base FAD models

In Table2, we validate the proposed SWL method by evaluating models on three datasets from different distributions, namely 19LA, 21LA and 21DF. From the results, it can be observed that WSL is able to simultaneously generalize the base FAD models on multiple datasets from different distributions.

Refer to caption
Figure 2: Applying SWL to AASIST-L using different numbers of RFF map** functions. The dashed line represents the performance of AASIST-L without applying SWL.

In addition, We compare SWL with another method known as ASAM, which has shown significant advancements in enhancing the generalization of base FAD models [21]. The results demonstrate that SWL achieves competitive performance comparing ASAM in assisting the base model when trained on the same single dataset. Furthermore, upon comparing the performance of the base model using ASAM and training it with multiple datasets, it becomes evident that SWL can better generalize the base model without the need for extra training data.

4.2 More RFF map** functions, higher generalization

In Section 2.1, we demonstrate how proposed SWL method work theoretically. There is a hyperparameter that might influence the performance, which is the number of RFF map** functions. In this part, we conduct some experiments applying SWL to AASIST-L model selecting all node features to be decorrelated and using different numbers of RFF map** functions. The results are recorded in the Figure 2.

It turns out that, for AASIST-L model, generalization is better when number of RFF map** functions is higher, but at the same time, both the training duration and computational complexity will also increase. In additional, there is a decrease in performance of all models on 19LA. We attribute this to the non-optimal combination of node features selected for decorrelation, and we will discuss it in the next section.

Table 3: Different combination of node features applying SWL to AASIST-L model.
Node Features 19LA 21LA 21DF
5N 1.55 9.02 20.99
4N 1.15 11.68 20.24
2N-T 1.40 10.23 21.22
2N-S 1.14 8.02 19.67

4.3 What combination of nodes is better for decorrelation

In AASIST framework [27], the temporal and spectral domains are modeled by two graph moduls in parallel and finally turn to four node features representing temporal and spectral information. Finally, the four node features and a stack node features are sent to classifier after simple concatenation.

In this section, we explore what combination of the five node features is optimal to be selected for feature decorrelation. Four combinations were considered, namely, all five node features (5N), removing the stack node features (4N), two node features related to spectral information (2N-S), and two node features related to temporal information (2N-T). Then, we conduct experiments based on AASIST-L model setting the number of RFF map** functions to 20. The experimental results in Table 3 show that, among the four combinations, 2N-S is optimal for feature decorrelation and yields the best results, indicating that decorrelating the 2N-S is easier, and the spectral domain is more helpful for FAD.

5 Conclusions

In this paper, we propose a stable learning based training scheme named SWL for FAD, which involves a sample weight learning module based on Random Fourier Features and utilizes an iterative optimization strategy. SWL focuses more on selected feature that extracted by FAD models and decorrelates them to help model produce more stable predictions under distribution shift. Experiments conducted on the ASVspoof datasets demonstrate that our methods can generalize multiple FAD models when evaluated across three different datasets. We further find out that, for AASIST-L, generalization becomes better when the number of RFF map** functions is higher, and the node features related to spectral features are better for decorrelating when applying SWL.

6 Acknowledgements

This work is supported by the National Natural Science Foundation of China (NSFC) (No.62101553, No.62306316, No.U21B20210, No. 62201571).

References

  • [1] R. Fu, J. Tao, Z. Wen, and Y. Zheng, “Phoneme dependent speaker embedding and model factorization for multi-speaker speech synthesis and adaptation,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).  IEEE, 2019, pp. 6930–6934.
  • [2] R. Fu, J. Tao, Z. Wen, J. Yi, and T. Wang, “Focusing on attention: prosody transfer and adaptative optimization strategy for multi-speaker end-to-end speech synthesis,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).  IEEE, 2020, pp. 6709–6713.
  • [3] D. A. Reynolds, “An overview of automatic speaker recognition technology,” in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, 2002, pp. IV–4072–IV–4075.
  • [4] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. H. Kinnunen, and K. A. Lee, “ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection,” in Proc. Interspeech 2019, 2019, pp. 1008–1012.
  • [5] X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch et al., “Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • [6] K. Ma, Y. Feng, B. Chen, and G. Zhao, “End-to-end dual-branch network towards synthetic speech detection,” IEEE Signal Processing Letters, vol. 30, pp. 359–363, 2023.
  • [7] Y. Mo and S. Wang, “Multi-task learning improves synthetic speech detection,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).  IEEE, 2022, pp. 6392–6396.
  • [8] X. Wang and J. Yamagishi, “Estimating the confidence of speech spoofing countermeasure,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).  IEEE, 2022, pp. 6372–6376.
  • [9] P. Wen, K. Hu, W. Yue, S. Zhang, W. Zhou, and Z. Wang, “Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms,” in Proc. INTERSPEECH 2023, 2023, pp. 271–275.
  • [10] Y. Zhang, W. Wang, and P. Zhang, “The Effect of Silence and Dual-Band Fusion in Anti-Spoofing System,” in Proc. Interspeech 2021, 2021, pp. 4279–4283.
  • [11] J. Monteiro, J. Alam, and T. H. Falk, “An ensemble based approach for generalized detection of spoofing attacks to automatic speaker recognizers,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).  IEEE, 2020, pp. 6599–6603.
  • [12] H. Wu, S. Liu, H. Meng, and H.-y. Lee, “Defense against adversarial attacks on spoofing countermeasures of asv,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).  IEEE, 2020, pp. 6564–6568.
  • [13] Y. Ren, H. Peng, L. Li, X. Xue, Y. Lan, and Y. Yang, “Generalized voice spoofing detection via integral knowledge amalgamation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • [14] H. Ma, J. Yi, J. Tao, Y. Bai, Z. Tian, and C. Wang, “Continual Learning for Fake Audio Detection,” in Proc. Interspeech 2021, 2021, pp. 886–890.
  • [15] X. Zhang, J. Yi, J. Tao, C. Wang, and C. Y. Zhang, “Do you remember? Overcoming catastrophic forgetting for fake audio detection,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202.  PMLR, 23–29 Jul 2023, pp. 41 819–41 831. [Online]. Available: https://proceedings.mlr.press/v202/zhang23au.html
  • [16] H. Tak, M. Todisco, X. Wang, J. weon Jung, J. Yamagishi, and N. Evans, “Automatic Speaker Verification Spoofing and Deepfake Detection Using Wav2vec 2.0 and Data Augmentation,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2022), 2022, pp. 112–119.
  • [17] P. Kawa, M. Plata, M. Czuba, P. Szymański, and P. Syga, “Improved DeepFake Detection Using Whisper Features,” in Proc. INTERSPEECH 2023, 2023, pp. 4009–4013.
  • [18] X. Wang and J. Yamagishi, “Investigating Self-Supervised Front Ends for Speech Spoofing Countermeasures,” in Proc. The Speaker and Language Recognition Workshop (Odyssey 2022), 2022, pp. 100–106.
  • [19] C. Goel, S. Koppisetti, B. Colman, A. Shahriyari, and G. Bharaj, “Towards Attention-based Contrastive Learning for Audio Spoof Detection,” in Proc. INTERSPEECH 2023, 2023, pp. 2758–2762.
  • [20] Y. Xie, H. Cheng, Y. Wang, and L. Ye, “Learning A Self-Supervised Domain-Invariant Feature Representation for Generalized Audio Deepfake Detection,” in Proc. INTERSPEECH 2023, 2023, pp. 2808–2812.
  • [21] H. ** Shim, J. weon Jung, and T. Kinnunen, “Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing,” in Proc. INTERSPEECH 2023, 2023, pp. 3804–3808.
  • [22] H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).  IEEE, 2022, pp. 6382–6386.
  • [23] X. Wang and J. Yamagishi, “Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).  IEEE, 2023, pp. 1–5.
  • [24] P. Cui and S. Athey, “Stable learning establishes some common ground between causal inference and machine learning,” Nature Machine Intelligence, vol. 4, pp. 110 – 115, 2022.
  • [25] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” Advances in neural information processing systems, vol. 20, 2007.
  • [26] X. Zhang, P. Cui, R. Xu, L. Zhou, Y. He, and Z. Shen, “Deep stable learning for out-of-distribution generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5372–5382.
  • [27] J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).  IEEE, 2022, pp. 6367–6371.
  • [28] H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, “End-to-end anti-spoofing with rawnet2,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).  IEEE, 2021, pp. 6369–6373.
  • [29] G. Hua, A. B. J. Teoh, and H. Zhang, “Towards end-to-end synthetic speech detection,” IEEE Signal Processing Letters, vol. 28, pp. 1265–1269, 2021.
  • [30] K. Kuang, R. Xiong, P. Cui, S. Athey, and B. Li, “Stable prediction with model misspecification and agnostic distribution shift,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 4485–4492.
  • [31] H. Bahng, S. Chun, S. Yun, J. Choo, and S. J. Oh, “Learning de-biased representations with biased representations,” in International Conference on Machine Learning.  PMLR, 2020, pp. 528–539.
  • [32] E. V. Strobl, K. Zhang, and S. Visweswaran, “Approximate kernel-based conditional independence tests for fast non-parametric causal discovery,” Journal of Causal Inference, vol. 7, no. 1, p. 20180017, 2019.