
Kolmogorov–Smirnov GAN

Maciej Falkiewicz^{1,2}, Naoya Takeishi^{3,4}, Alexandros Kalousis^{2}
^{1} Computer Science Department, University of Geneva; ^{2} HES-SO/HEG Genève
^{3} The University of Tokyo; ^{4} RIKEN
{maciej.falkiewicz, alexandros.kalousis}@hesge.ch
Abstract

We propose a novel deep generative model, the Kolmogorov-Smirnov Generative Adversarial Network (KSGAN). Unlike existing approaches, KSGAN formulates the learning process as a minimization of the Kolmogorov-Smirnov (KS) distance, generalized to handle multivariate distributions. This distance is calculated using the quantile function, which acts as the critic in the adversarial training process. We formally demonstrate that minimizing the KS distance leads to the trained approximate distribution aligning with the target distribution. We propose an efficient implementation and evaluate its effectiveness through experiments. The results show that KSGAN performs on par with existing adversarial methods, exhibiting stability during training, resistance to mode dropping and collapse, and tolerance to variations in hyperparameter settings. Additionally, we review the literature on the Generalized KS test and discuss the connections between KSGAN and existing adversarial generative models.

Figure 1: A schematic depiction of how the Generalized Kolmogorov-Smirnov (KS) distance between the target $\mathds{P}_{F}$ and approximate $\mathds{P}_{G}$ distributions with respect to the critic $c_{\phi}$ is computed. The critic is evaluated on samples $x_{F}$ ($\color[rgb]{0.5390625,0.703125,0.3203125}\boldsymbol{|}$) and $x_{G}$ ($\color[rgb]{0.109375,0.44140625,0.65234375}\boldsymbol{|}$) from the target and approximate distributions, respectively. The threshold $\lambda$ moves from $-\infty$ to $+\infty$, establishing a stack of level sets. At each level, the fraction of datapoints ($\color[rgb]{0.5390625,0.703125,0.3203125}\bullet$ and $\color[rgb]{0.109375,0.44140625,0.65234375}\bullet$) below the threshold is calculated for each distribution independently. This produces the $\mathds{P}_{F}\left(\Gamma_{c_{\phi}}(\lambda)\right)$ and $\mathds{P}_{G}\left(\Gamma_{c_{\phi}}(\lambda)\right)$ curves. The Generalized KS distance is the largest absolute difference between the curves, shown as $\color[rgb]{0.86328125,0.078125,0.234375}\boldsymbol{\updownarrow}$ in the right figure. Best viewed in color.
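The procedure described in the caption can be sketched in a few lines of Python. This is only an illustrative estimator from finite samples, not the paper's implementation: the function name `generalized_ks` and the use of an arbitrary Python callable as the critic are assumptions made here.

```python
from bisect import bisect_left

def generalized_ks(critic, xs_f, xs_g):
    """Largest gap between the two empirical coverage curves induced by a critic.

    `critic` is any callable mapping a sample to a scalar (hypothetical stand-in
    for c_phi); `xs_f` and `xs_g` are samples from the target and the
    approximate distribution.
    """
    # Evaluate the critic on both samples and sort the resulting scalars.
    cf = sorted(critic(x) for x in xs_f)
    cg = sorted(critic(x) for x in xs_g)
    best = 0.0
    # Both curves are step functions that only change at observed critic
    # values, so sweeping the threshold over those values is sufficient.
    for lam in cf + cg:
        below_f = bisect_left(cf, lam) / len(cf)  # fraction strictly below lam
        below_g = bisect_left(cg, lam) / len(cg)
        best = max(best, abs(below_f - below_g))
    return best
```

For example, `generalized_ks(lambda x: x[0] ** 2 + x[1] ** 2, xs_f, xs_g)` would compare two sets of 2-D points through their squared distance from the origin. With a fixed critic this reduces to a one-dimensional two-sample KS statistic on the critic values.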

1 Introduction

Generative modeling is about fitting a model to a target distribution, usually the data. A fundamental taxonomy of models assigns them into prescribed and implicit statistical models [9], with partial overlap between the two classes. Prescribed models directly parameterize the distribution’s probability density function, while implicit models parameterize the generator that allows samples to be drawn from the distribution. The ultimate application of the model primarily dictates the choice between the two approaches. It does, however, have consequences regarding the available types of divergences that we can minimize when fitting the model. The divergences differ in the stability of optimization and computational efficiency, as well as statistical efficiency, which all affect the final performance of the model.

The natural approach for fitting a prescribed model is maximum likelihood estimation (MLE), equivalently formulated as minimization of the Kullback–Leibler divergence. Likelihood evaluation for normalized models is straightforward. In non-normalized models, density evaluation is expensive; in this context, Hyvärinen [22] proposed the score matching objective, which can be interpreted as the Fisher divergence [30]. This approach is very effective for simulation-free training of ODE-based [7] and SDE-based [42, 19] models, which are state-of-the-art in multiple domains today.

The principle driving the fitting of implicit statistical models is to push the model to generate samples that are indistinguishable from the target. An inflection point for this family of models came with the Generative Adversarial Network (GAN) [13], which took the principle literally and introduced an auxiliary classifier trained in an adversarial process to discriminate between the two distributions. The classification error given an optimal classifier relates to the Jensen–Shannon divergence between the generator and the target. Initial work in this area involved applying heuristic tricks to deal with learning problems, namely vanishing gradients, unstable training, and mode dropping or collapse. Further advancements focused on using other distances based on the principle of adversarial learning of auxiliary models, which were supposed to have certain favorable properties with respect to the original GAN.

The Bayesian inference community has been reluctant to adopt adversarial methods [8], and the attempts to apply them in this context [40] indicate a credibility problem. A significant drawback of approximate methods is the excessive reduction of diversity in the distribution [17], the extreme of which leads to mode dropping [1]. In this work, we consider another distance for training implicit statistical models, namely the Kolmogorov-Smirnov (KS) distance, which, to the best of our knowledge, has not been used in this context before. The distinctive feature of the KS distance is that it directly measures, at all confidence levels, the discrepancy in how the distributions under analysis cover each other’s credibility regions. Thus, its minimization straightforwardly leads to the correct spread of the probability mass, avoiding mode dropping, overconfidence, and mode collapse when applied with a sufficient sampling budget.

We term the proposed model the Kolmogorov-Smirnov Generative Adversarial Network (KSGAN). In section 2, we show how to generalize the standard KS distance to higher dimensions based on Polonik [38], allowing our method to be used for multidimensional distributions. Next, in section 3, we show how to efficiently leverage the distance in an adversarial training process and show formally that the proposed algorithm leads to an alignment of the approximate and target distributions. We support the theoretical findings with empirical results presented in section 6.

2 Generalized Kolmogorov–Smirnov distance

We generalize the Kolmogorov–Smirnov (KS) distance (sometimes called simply the Kolmogorov distance) between continuous probability distributions on one-dimensional spaces to multidimensional spaces and show that it is a metric. The test statistic of the KS test is the KS distance between the empirical and target distributions (or between two empirical distributions in the two-sample case). For this reason, our proposal is directly inspired by the generalization of the test introduced in Polonik [38].

Let us consider two probability measures $\mathds{P}_{F}$ and $\mathds{P}_{G}$ on a measurable space $(\mathcal{X},\mathcal{A})$, where the sample space $\mathcal{X}$ is a vector space such as $I\!\!R^{d}$ and $\mathcal{A}$ is the corresponding event space; $F:\mathcal{X}\rightarrow[0,1]$ and $G:\mathcal{X}\rightarrow[0,1]$ are the cumulative distribution functions (CDFs) of $\mathds{P}_{F}$ and $\mathds{P}_{G}$, respectively. (Footnote 1: In what follows, we use $\mathds{P}_{F}$ for the true data distribution and $\mathds{P}_{G}$ for the learnt one.) We say that $\mathds{P}_{F}=\mathds{P}_{G}$ iff $\forall\ A\in\mathcal{A},\ \mathds{P}_{F}(A)=\mathds{P}_{G}(A)$. When $\dim(\mathcal{X})=1$, the KS distance is

$$D_{\mathrm{KS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right):=\sup_{x\in\mathcal{X}}|F(x)-G(x)|. \tag{1}$$

In the multivariate case, the problem with using the KS distance directly is that on a $d$-dimensional space there are $2^{d}-1$ ways of defining a CDF. The distance has to be independent of the particular definition and thus should be the largest across all the possibilities [35]. This, however, becomes prohibitive for any $d>2$. In other words, the challenge comes from a multidimensional vector space not being a partially ordered set. The rest of this section proposes a partial order, shows that, under certain conditions, a probability distribution can be uniquely determined on its basis, and operationalizes it in an optimization problem.

We begin by recalling the classical result that

$$D_{\mathrm{KS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)=\sup_{\alpha\in[0,1]}|F(G^{-1}(\alpha))-\alpha|, \tag{2}$$

where $G^{-1}:[0,1]\rightarrow\mathcal{X}$ is the inverse CDF, also called the quantile function. Einmahl and Mason [10] show that there exists a natural generalization of the quantile function to multivariate distributions, which we restate below.
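The agreement between eq. 1 and eq. 2 can be checked numerically. The sketch below is an illustration under assumptions chosen here, not taken from the paper: the two distributions are the Gaussians $\mathcal{N}(0,1)$ and $\mathcal{N}(1,1)$, both suprema are approximated on finite grids, and the function names are hypothetical.

```python
from statistics import NormalDist

F = NormalDist(mu=0.0, sigma=1.0)  # plays the role of P_F (illustrative choice)
G = NormalDist(mu=1.0, sigma=1.0)  # plays the role of P_G (illustrative choice)

def ks_cdf_form(f, g, lo=-10.0, hi=10.0, n=20001):
    """Eq. (1): sup over x of |F(x) - G(x)|, approximated on a grid in x."""
    step = (hi - lo) / (n - 1)
    return max(abs(f.cdf(lo + i * step) - g.cdf(lo + i * step)) for i in range(n))

def ks_quantile_form(f, g, n=20001):
    """Eq. (2): sup over alpha of |F(G^{-1}(alpha)) - alpha|, on a grid in (0, 1)."""
    # inv_cdf is only defined on the open interval, so the endpoints are skipped.
    alphas = (i / (n + 1) for i in range(1, n + 1))
    return max(abs(f.cdf(g.inv_cdf(a)) - a) for a in alphas)

print(ks_cdf_form(F, G), ks_quantile_form(F, G))
```

Both values should agree up to the grid resolution. For two Gaussians with equal variance the supremum is attained midway between the means, which gives a closed-form reference value of $2\Phi(0.5)-1$ for this particular pair, where $\Phi$ is the standard Gaussian CDF.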

Definition 1 (Generalized Quantile Function).

Let $\operatorname*{v}:\mathcal{A}\rightarrow I\!\!R_{+}$ be a measure, and $\mathcal{C}\subset\mathcal{A}$ an arbitrary subset of the event space. Then a function $C_{\mathds{P},\mathcal{C}}(\alpha):[0,1]\rightarrow\mathcal{C}$ such that

$$C_{\mathds{P},\mathcal{C}}(\alpha)\in\operatorname*{arg\,min}_{C\in\mathcal{C}}\{\operatorname*{v}(C):\mathds{P}(C)\geqslant\alpha\} \tag{3}$$

is called the generalized quantile function in $\mathcal{C}$ for $\mathds{P}$ with respect to $\operatorname*{v}$. (Footnote 2: In the general case, $C_{\mathds{P},\mathcal{C}}(\alpha)$ at any given level $\alpha$ is not uniquely determined, i.e., there may exist several sets $C,C^{\prime}\in\mathcal{C}$ s.t. $C\neq C^{\prime}$ that satisfy the condition in eq. 3. For simplicity, we will call all such sets the (generalized) quantile sets at level $\alpha$ and write $C_{\mathds{P},\mathcal{C}}(\alpha)=C$ and $C_{\mathds{P},\mathcal{C}}(\alpha)=C^{\prime}$ for all of them.)

The generalized quantile function evaluated at level $\alpha$ yields a minimum-volume set [36]: a set whose probability is at least $\alpha$ and which is the smallest such set in $\mathcal{C}$ with respect to $\operatorname*{v}$, hence the name. For the remainder of this paper, we assume that $\operatorname*{v}$ is the Lebesgue measure.
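To make eq. 3 concrete, the sketch below computes a generalized quantile set in a deliberately simple setting that is assumed here for illustration only: $\mathds{P}$ is the empirical measure of a one-dimensional sample, $\mathcal{C}$ is the class of closed intervals, and $\operatorname*{v}$ is interval length. The helper name `shortest_interval` is hypothetical.

```python
import math

def shortest_interval(sample, alpha):
    """Shortest closed interval [a, b] containing at least a fraction alpha of the sample."""
    xs = sorted(sample)
    n = len(xs)
    k = max(1, math.ceil(alpha * n))  # number of points the interval must cover
    # Slide a window over k consecutive order statistics and keep the narrowest;
    # on ties, min() returns the first (leftmost) window.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]
```

For the sample `[1, 2, 10, 11]` and `alpha=0.5`, the intervals `[1, 2]` and `[10, 11]` have the same length, so the quantile set is not unique and the sketch simply returns the first one. This mirrors the non-uniqueness noted in the definition above.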

It may seem that it is enough to plug $C_{\mathds{P}_{G},\mathcal{C}}(\alpha)$ in place of $G^{-1}(\alpha)$ and $\mathds{P}_{F}$ in place of $F$ in eq. 2 to establish the Generalized KS distance, but it turns out that such a distance does not satisfy the positivity condition $D_{\mathrm{KS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)>0$ if $\mathds{P}_{F}\neq\mathds{P}_{G}$, as the example below shows.

Example 1 (Polonik [38]).

Let $\mathds{P}_{F}$ be the probability measure of a chi distribution with one degree of freedom, $\sqrt{\chi_{1}^{2}}$, which has support on $I\!\!R_{+}$, and $\mathds{P}_{G}$ the probability measure of a standard Gaussian distribution $\mathcal{N}(0,1)$, which has support on the whole $I\!\!R$. Given $\mathcal{C}=\mathcal{A}$ we have

$$\mathds{P}_{F}(C_{\mathds{P}_{G},\mathcal{C}}(\alpha))=\alpha\ \ \forall\alpha\in[0,1], \tag{4}$$

while clearly $\mathds{P}_{F}\neq\mathds{P}_{G}$. The statement in eq. 4 is easy to show by observing that $\forall x\in[0,\infty)$ the density of $\mathds{P}_{F}$ is twice the density of $\mathds{P}_{G}$, and the sets $C_{\mathds{P}_{G},\mathcal{C}}(\alpha)$ are intervals centered at 0.
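Eq. 4 can be verified numerically with the standard library alone. The sketch below relies on two facts stated in the example: the Gaussian quantile sets are symmetric intervals around 0, and the chi distribution with one degree of freedom is the distribution of $|Z|$ for $Z\sim\mathcal{N}(0,1)$. The function names are introduced here for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()

def gaussian_quantile_set(alpha):
    """Minimum-volume set of N(0, 1) at level alpha: the interval [-q, q]."""
    q = std_normal.inv_cdf((1.0 + alpha) / 2.0)
    return -q, q

def chi1_probability(a, b):
    """P_F([a, b]) for the chi distribution with one degree of freedom, i.e. |Z|."""
    a, b = max(a, 0.0), max(b, 0.0)  # no mass on the negative half-line
    return 2.0 * (std_normal.cdf(b) - std_normal.cdf(a))

for alpha in (0.1, 0.5, 0.9):
    a, b = gaussian_quantile_set(alpha)
    # The second printed value should match alpha up to floating-point error.
    print(alpha, chi1_probability(a, b))
```

The coverage matches the nominal level at every $\alpha$ even though the two distributions differ, which is exactly why a construction based on a single quantile function cannot serve as a metric.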

Instead, a solution based on the quantile functions of both distributions is needed, which we present in definition 2.

Definition 2 (Generalized Kolmogorov-Smirnov distance).

Let the Generalized Kolmogorov-Smirnov distance be formulated as follows:

$$D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right):=\sup_{\begin{subarray}{c}\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{G},\mathcal{C}},C_{\mathds{P}_{F},\mathcal{C}}\}\end{subarray}}\left[|\mathds{P}_{F}(C(\alpha))-\mathds{P}_{G}(C(\alpha))|\right]. \tag{5}$$
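As a worked instance of eq. 5, consider again the two distributions of Example 1 with $\mathcal{C}=\mathcal{A}$. The derivation below is a sketch under two assumptions made here: the quantile sets of the chi distribution are taken to be the intervals $[0,q_{\alpha}]$ (its density is decreasing on $I\!\!R_{+}$), and $\Phi$ denotes the standard Gaussian CDF. Both $q_{\alpha}$ and $\Phi$ are notation introduced for this illustration.

```latex
\begin{align*}
q_{\alpha} &:= \Phi^{-1}\!\left(\tfrac{1+\alpha}{2}\right), \qquad
C_{\mathds{P}_{G},\mathcal{C}}(\alpha) = [-q_{\alpha}, q_{\alpha}], \qquad
C_{\mathds{P}_{F},\mathcal{C}}(\alpha) = [0, q_{\alpha}], \\
\left|\mathds{P}_{F}(C_{\mathds{P}_{G},\mathcal{C}}(\alpha)) - \mathds{P}_{G}(C_{\mathds{P}_{G},\mathcal{C}}(\alpha))\right|
  &= |\alpha - \alpha| = 0, \\
\left|\mathds{P}_{F}(C_{\mathds{P}_{F},\mathcal{C}}(\alpha)) - \mathds{P}_{G}(C_{\mathds{P}_{F},\mathcal{C}}(\alpha))\right|
  &= \left|\alpha - \left(\Phi(q_{\alpha}) - \tfrac{1}{2}\right)\right| = \tfrac{\alpha}{2}, \\
D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)
  &= \sup_{\alpha\in[0,1]} \tfrac{\alpha}{2} = \tfrac{1}{2} > 0.
\end{align*}
```

Under these assumptions, the quantile sets of $\mathds{P}_{F}$ separate the two distributions even though the quantile sets of $\mathds{P}_{G}$ alone do not.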

The Generalized KS distance in eq. 5 is symmetric and satisfies the triangle inequality, as shown in section A.1. In the remainder of this section, we show that it also meets the necessary condition $D_{\mathrm{GKS}}\left(\mathds{P},\mathds{P}\right)=0$ and the sufficient condition $D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)>0$ if $\mathds{P}_{F}\neq\mathds{P}_{G}$ required to consider it a metric. In the proof, we rely on the probability density function of $\mathds{P}$ with respect to a reference measure $\operatorname*{v}$, which we denote by $p:\mathcal{X}\rightarrow[0,\infty)$. Let

$$\Gamma_{p}(\lambda):=\{x:p(x)\geqslant\lambda\} \tag{6}$$

denote the density level set of $p$ at level $\lambda\geqslant 0$ (also called the highest density region [21]), and let $\Pi_{p}:=\{\Gamma_{p}(\lambda):\lambda\geqslant 0\}$. The following observations about level sets introduce the fundamental tools to prove the necessary and sufficient conditions for the Generalized KS distance.

Remark 1 (The silhouette [37]).

For any density $p$, the following holds

$$p(x)=\int_{0}^{\infty}\mathds{1}_{\Gamma_{p}(\lambda)}(x)\,\mathrm{d}\lambda, \tag{7}$$

where $\mathds{1}_{C}$ denotes the indicator function of a set $C$. The RHS of eq. 7 is called the silhouette.

An immediate consequence of remark 1 is that $\Pi_{p}$ ordered with respect to $\lambda\geqslant 0$ fully characterizes $\mathds{P}$, because $p$ does. Graphically, the silhouette is a multidimensional stack of level sets.
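The identity in eq. 7 follows in one step from the definition in eq. 6: for a fixed $x$ and $\lambda\geqslant 0$, the point $x$ belongs to $\Gamma_{p}(\lambda)$ exactly when $\lambda\leqslant p(x)$, so the indicator, viewed as a function of $\lambda$, is the indicator of the interval $[0,p(x)]$.

```latex
\int_{0}^{\infty}\mathds{1}_{\Gamma_{p}(\lambda)}(x)\,\mathrm{d}\lambda
  = \int_{0}^{\infty}\mathds{1}_{[0,\,p(x)]}(\lambda)\,\mathrm{d}\lambda
  = \int_{0}^{p(x)}\mathrm{d}\lambda
  = p(x).
```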

Remark 2.

Density level sets are minimum-volume sets [38]. The quantity $\mathds{P}(C)-\lambda\operatorname*{v}(C)$ is maximized over $\mathcal{A}$ by $\Gamma_{p}(\lambda)$, and thus if $\Gamma_{p}(\lambda)\in\mathcal{C}$, then $\Gamma_{p}(\lambda)=C_{\mathds{P},\mathcal{C}}(\alpha)$ (Footnote 3: There may be other sets $C=C_{\mathds{P},\mathcal{C}}(\alpha)$, but $\Gamma_{p}(\lambda)$ will certainly be one of them.) at level $\alpha=\mathds{P}(\Gamma_{p}(\lambda))=\int p(x)\mathds{1}_{[\lambda,\infty)}(p(x))\mathrm{d}x$.
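The maximization property in remark 2 can be illustrated with a brute-force search. The sketch below is a simplification assumed here, not part of the paper: $\mathds{P}$ is the standard Gaussian, the search is restricted to closed intervals with endpoints on a grid rather than all measurable sets, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

nd = NormalDist()

def excess_mass(a, b, lam):
    """P([a, b]) - lam * length([a, b]) for the standard Gaussian."""
    return nd.cdf(b) - nd.cdf(a) - lam * (b - a)

def level_set(lam):
    """Gamma_p(lam) = {x : p(x) >= lam} = [-r, r], valid for 0 < lam <= p(0)."""
    r = math.sqrt(-2.0 * math.log(lam * math.sqrt(2.0 * math.pi)))
    return -r, r

def best_interval_on_grid(lam, half_width=4.0, step=0.05):
    """Grid search for the interval maximizing the excess mass at level lam."""
    n = round(half_width / step)
    grid = [i * step for i in range(-n, n + 1)]
    pairs = ((a, b) for a in grid for b in grid if a <= b)
    return max(pairs, key=lambda ab: excess_mass(ab[0], ab[1], lam))
```

Calling `best_interval_on_grid(0.2)` should return endpoints within one grid step of `level_set(0.2)`, and the excess mass of the exact level set should be at least as large as that of any grid interval.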

Below, we present the fundamental theoretical result behind the proposed method, which restates Lemma 1.2 of Polonik [38].

Theorem (Necessary and sufficient conditions). Let $\operatorname*{v}$ be a measure on $(\mathcal{X},\mathcal{A})$. Suppose that $\mathds{P}_{F}$ and $\mathds{P}_{G}$ are probability measures on $(\mathcal{X},\mathcal{A})$ with densities (with reference measure $\operatorname*{v}$) $f$ and $g$, respectively. Assuming that

  A.1. $\Pi_{f}\cup\Pi_{g}\subset\mathcal{C}$;

  A.2. $C_{\mathds{P}_{F},\mathcal{C}}(\alpha)$ and $C_{\mathds{P}_{G},\mathcal{C}}(\alpha)$ are uniquely determined (Footnote 4: In the sense defined in Polonik [38].) in $\mathcal{C}$ with respect to $\operatorname*{v}$,

the following two statements are equivalent:

  S.1. $\mathds{P}_{F}=\mathds{P}_{G}$;

  S.2. $D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)=0$.

Proof.

The S.1 $\implies$ S.2 direction is trivial to show and holds without the assumptions [38]. Therefore, we focus on showing that S.2 $\implies$ S.1. Let

$$S_{\mathcal{C}}(\mathds{P})=\{(\operatorname*{v}(C),\mathds{P}(C)):C\in\mathcal{C}\}\subset I\!\!R_{+}\times[0,1], \tag{8}$$

and denote by $\Gamma(\lambda)$ the level set of the density of $\mathds{P}$ as defined in eq. 6, and let $\Pi:=\{\Gamma(\lambda):\lambda\geqslant 0\}$. Further, let $\tilde{S}_{\mathcal{C}}$ denote the least concave majorant [5] of $S_{\mathcal{C}}(\mathds{P})$, that is, the smallest concave function from $I\!\!R_{+}$ to $[0,1]$ lying above $S_{\mathcal{C}}(\mathds{P})$. $\tilde{S}_{\mathcal{C}}$ is supported on the generalized quantiles of $\mathds{P}$ in $\mathcal{C}$, i.e., on the points $(\operatorname*{v}(C_{\mathds{P},\mathcal{C}}(\alpha)),\mathds{P}(C_{\mathds{P},\mathcal{C}}(\alpha)))$. Finally, let $\partial\tilde{S}_{\mathcal{C}}(\mathds{P})$ be the intersection of the extremal points of the convex hull of $S_{\mathcal{C}}(\mathds{P})$ with the graph of $\tilde{S}_{\mathcal{C}}$.
Given $\Pi\subset\mathcal{C}$, which we assume in A.1 for $\mathds{P}_{F}$ and $\mathds{P}_{G}$, and in light of remark 2, we have that for any set $C$ such that $(\operatorname*{v}(C),\mathds{P}(C))\in\partial\tilde{S}_{\mathcal{C}}(\mathds{P})$ there is a level $\lambda$ for which $C=\Gamma(\lambda)$, and this level equals the left-hand derivative of $\tilde{S}_{\mathcal{C}}$ at the point $\operatorname*{v}(C)$. From remark 1, the silhouette fully characterizes $\mathds{P}$, and therefore $\partial\tilde{S}_{\mathcal{C}}(\mathds{P})$ does so as well.

Finally, we conclude the proof with the observation that given S.2, under Lemma 2.1 of Polonik [38] (where A.2 is utilized), the extremal points of the convex hulls of $S_{\mathcal{C}}(\mathds{P}_{F})$ and $S_{\mathcal{C}}(\mathds{P}_{G})$ are the same points, thus $\partial\tilde{S}_{\mathcal{C}}(\mathds{P}_{F})=\partial\tilde{S}_{\mathcal{C}}(\mathds{P}_{G})$, and finally $\mathds{P}_{F}=\mathds{P}_{G}$.

Meeting assumption A.1 is a demanding challenge, almost equivalent to learning the target distribution. Below, we propose a relaxation of it, which we will use to show the validity of our method.

Theorem (Relaxation of assumption A.1). The theorem on necessary and sufficient conditions holds if assumption A.1 is relaxed to the case that $\mathcal{C}$ contains sets that are uniquely determined with density level sets of $\mathds{P}_{F}$ and $\mathds{P}_{G}$ up to a set $C$ such that

$$\forall_{C^{\prime}\in 2^{C}}\ \mathds{P}_{F}(C^{\prime})=\mathds{P}_{G}(C^{\prime}), \tag{9}$$

and, with $r:=\mathds{P}_{F}(C)=\mathds{P}_{G}(C)$, the supremum in statement S.2 is restricted to $[0,1-r]$.

Proof. The statement in eq. 9 is equivalent to saying that $\mathds{P}_F = \mathds{P}_G$ on $(C, 2^{C})$. Analogously to the proof of remark 2, we can show that $\mathds{P}_F = \mathds{P}_G$ on $(\mathcal{X} \setminus C, 2^{\mathcal{X} \setminus C})$. By observing that probability measures are $\sigma$-additive, we conclude that $\mathds{P}_F = \mathds{P}_G$ on $(\mathcal{X}, \mathcal{A})$, and thus the result of remark 2 holds.
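The final additivity step of the proof can be spelled out as follows. This is only a sketch, and it assumes that $C \in \mathcal{A}$, so that every measurable set $A$ splits into the two measurable parts $A \cap C$ and $A \setminus C$. For any $A \in \mathcal{A}$:

```latex
\[
\mathds{P}_F(A)
  = \mathds{P}_F(A \cap C) + \mathds{P}_F(A \setminus C)
  = \mathds{P}_G(A \cap C) + \mathds{P}_G(A \setminus C)
  = \mathds{P}_G(A).
\]
```

The middle equality uses the agreement of the two measures on subsets of $C$ (eq. 9) and on subsets of $\mathcal{X} \setminus C$.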

3 Kolmogorov–Smirnov GAN

For the remainder of the paper, we will consider $\mathds{P}_F$ as the target distribution represented by a dataset $\{x_F\}$, and $\mathds{P}_G$ as the approximate distribution that we want to train by minimizing the Generalized KS distance in eq. 5 with Stochastic Gradient Descent. We model $\mathds{P}_G$ as a pushforward ${g_\theta}_{\#}\mathds{P}_Z$ of a simple (e.g., Gaussian or Uniform) latent distribution $\mathds{P}_Z$ supported on $\mathcal{Z}$, with a neural network $g_\theta : \mathcal{Z} \rightarrow \mathcal{X}$, parameterized with $\theta$, which we call the generator.

The major challenge in utilizing eq. 5 is the necessity of finding the $C_{\mathds{P},\mathcal{C}}(\alpha)$ terms, which is an optimization problem on its own. The idea that we propose in this work is to amortize the procedure by modeling the generalized quantile functions $C_{\mathds{P}_F,\mathcal{C}}(\alpha)$ and $C_{\mathds{P}_G,\mathcal{C}}(\alpha)$ with additional neural networks, which have to be trained in parallel to the generator $g_\theta$. Therefore, our method is based on adversarial training [13], where optimization proceeds in alternating phases of minimization and maximization for different sets of parameters. Hence the name of the proposed method, the Kolmogorov–Smirnov Generative Adversarial Network.

3.1 Neural Quantile Function

The generalized quantile function defined in definition 1 is an infinite-dimensional vector function $C_{\mathds{P},\mathcal{C}} : [0,1] \rightarrow C \in \mathcal{C}$. Such objects do not have an expressive, explicit representation that allows for gradient-based optimization. Therefore, we use an implicit representation inspired by density level sets in eq. 6. We propose to use neural level sets defined in definition 3 that are modeled by a neural network $c : \mathcal{X} \rightarrow \mathbb{R}$, which we will refer to as the critic.

Definition 3 (Neural level set).

Given a neural network $c : \mathcal{X} \rightarrow \mathbb{R}$, the neural level set at level $\lambda$ is defined as\footnote{Please note that the direction of the inequality in eq. 10 is opposite to the one in eq. 6, which is a convention that aligns the critic with the energy function of Energy-Based models.}

\[ \Gamma_c(\lambda) := \{x : c(x) \leqslant \lambda\}, \text{ and let } \Pi_c := \{\Gamma_c(\lambda) : \lambda \in \mathbb{R}\}. \tag{10} \]

Neural level sets are used, for example, in image segmentation [6, 20] and surface reconstruction from point clouds [3]. They fit our application because for computing the Generalized KS distance in eq. 5, the explicit materialization of generalized quantiles is not required as long as the probability measure can be efficiently evaluated on the implicitly specified sets. We set $\mathcal{C} = \Pi_c$, and thus $C_{\mathds{P},\Pi_c}(\alpha) = \Gamma_c(\lambda_\alpha)$, with $\lambda_\alpha = \operatorname*{arg\,min}_{\lambda \in \mathbb{R}} \{\lambda : \mathds{P}(\Gamma_c(\lambda)) \geqslant \alpha\}$. For a probability measure $\mathds{P}'$ the following holds:

\[ \mathds{P}'\left(C_{\mathds{P},\Pi_c}(\alpha)\right) = \mathbb{E}_{x \sim \mathds{P}'}\left[\mathds{1}_{(-\infty,\lambda_\alpha]}(c(x))\right], \tag{11} \]

which shows that the terms in eq. 5 under neural level sets can be Monte-Carlo estimated given samples from the respective distributions. Assumption A.2 is satisfied by neural level sets by construction.
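As a minimal sketch of the Monte Carlo estimate in eq. 11, the snippet below uses a hand-written function `critic` as a hypothetical stand-in for a trained network $c$, and small finite samples in place of $\mathds{P}$ and $\mathds{P}'$. The helper names `level_for_mass` and `mass_of_level_set` are illustrative and do not come from the paper.

```python
import math

def critic(x):
    # Hypothetical stand-in for a trained critic c: squared Euclidean norm of x.
    return sum(v * v for v in x)

def level_for_mass(samples, alpha, c=critic):
    """Empirical lambda_alpha: the smallest critic value whose level set
    contains at least a fraction alpha of `samples` (drawn from P)."""
    values = sorted(c(x) for x in samples)
    k = math.ceil(alpha * len(values))
    return -math.inf if k == 0 else values[k - 1]

def mass_of_level_set(samples, lam, c=critic):
    """Monte Carlo estimate of P'(Gamma_c(lam)) from `samples` drawn from P'."""
    return sum(1 for x in samples if c(x) <= lam) / len(samples)

samples_p = [(0.0,), (1.0,), (2.0,), (3.0,)]    # made-up draws standing in for P
samples_q = [(0.5,), (0.1,), (-0.9,), (4.0,)]   # made-up draws standing in for P'

lam = level_for_mass(samples_p, alpha=0.5)
print(lam, mass_of_level_set(samples_q, lam))
```

Only critic evaluations and comparisons are needed; the set $\Gamma_c(\lambda_\alpha)$ itself is never materialized.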

The formulation of the Generalized KS distance in eq. 5 includes two generalized quantile functions: $C_{\mathds{P}_F,\mathcal{C}}(\alpha)$, corresponding to the target distribution $\mathds{P}_F$, and $C_{\mathds{P}_G,\mathcal{C}}(\alpha)$, corresponding to the approximate distribution $\mathds{P}_G$. Both have to be modeled with the respective neural networks $c_{\phi_F}$ and $c_{\phi_G}$, where we use $\phi = \{\phi_F, \phi_G\}$ to denote the joint set of their parameters. In section 3.3, we show how to parameterize both critics with a single neural network. We set $\mathcal{C} = \Pi_{c_{\phi_F}} \cup \Pi_{c_{\phi_G}}$.

3.2 Optimizing generator’s parameters $\theta$

The Generalized KS distance in eq. 5 is a supremum over a unit interval and two functions; thus, it can be upper-bounded as

\[ D_{\mathrm{GKS}}\left(\mathds{P}_F, \mathds{P}_G\right) \leqslant \sum_{C \in \{C_{\mathds{P}_G,\mathcal{C}},\, C_{\mathds{P}_F,\mathcal{C}}\}} \sup_{\alpha \in [0,1]} \left[ \left| \mathds{P}_F(C(\alpha)) - \mathds{P}_G(C(\alpha)) \right| \right]. \tag{12} \]
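The bound in eq. 12 rests on an elementary inequality. As a short sketch, let $a$ and $b$ (symbols introduced here only for illustration) denote the two suprema over $\alpha$ obtained with $C_{\mathds{P}_F,\mathcal{C}}$ and $C_{\mathds{P}_G,\mathcal{C}}$, and assume, as described above, that eq. 5 takes the supremum jointly over $\alpha$ and over the two quantile functions:

```latex
\[
D_{\mathrm{GKS}}\left(\mathds{P}_F, \mathds{P}_G\right)
  = \max\{a, b\} \leqslant a + b \leqslant 2 \max\{a, b\},
  \qquad a, b \geqslant 0.
\]
```

Under this assumption the sum overestimates the distance by at most a factor of two, and it vanishes exactly when the distance does.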

Next, we plug $\mathcal{C} = \Pi_{c_{\phi_F}} \cup \Pi_{c_{\phi_G}}$ into eq. 12 and use eq. 11 to get the generator’s objective:

\[ \mathcal{L}_g = \sum_{c_\phi \in \{c_{\phi_G},\, c_{\phi_F}\}} \sup_{\lambda \in \mathbb{R}} \left[ \left| \mathbb{E}_{x \sim \mathds{P}_F}\left[\mathds{1}_{(-\infty,\lambda]}(c_\phi(x))\right] - \mathbb{E}_{x \sim \mathds{P}_G}\left[\mathds{1}_{(-\infty,\lambda]}(c_\phi(x))\right] \right| \right]. \tag{13} \]

In practice, the expectations in eq. 13 are estimated on finite samples from the two distributions, i.e., $\{x_F\}$ mentioned before, and $\{x_G\}$ sampled from the approximate distribution $\mathds{P}_G$ using the reparametrization trick to facilitate backpropagation of gradients. Therefore, the two terms become step functions in $\lambda$, and the supremum is located on one of the steps. That way, a line search on $\mathbb{R}$ reduces to a maximum over a finite set. To preserve the differentiability of the cost function calculated in this way, we apply the Straight-Through Estimator [4] in place of the indicator function $\mathds{1}$. A schematic depiction of the process for a single critic is shown in fig. 1.
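The reduction of the line search to a finite maximum can be sketched as follows. This is a plain-Python illustration for a single critic, operating on made-up critic values rather than network outputs; it is not differentiable and leaves out the Straight-Through Estimator. The function names are hypothetical.

```python
def fraction_below(values, lam):
    """Empirical estimate of E[1{c(x) <= lam}] from a list of critic values."""
    return sum(1 for v in values if v <= lam) / len(values)

def empirical_ks(c_f, c_g):
    """Largest absolute gap between the two empirical step functions.

    Both step functions only change at observed critic values, so it is
    enough to test those values as thresholds.
    """
    thresholds = set(c_f) | set(c_g)
    return max(abs(fraction_below(c_f, lam) - fraction_below(c_g, lam))
               for lam in thresholds)

c_f = [0.1, 0.4, 0.7, 0.9]   # critic values on target samples (made-up numbers)
c_g = [0.2, 0.3, 0.5, 1.2]   # critic values on generated samples (made-up numbers)
print(empirical_ks(c_f, c_g))
```

Identical sets of critic values give a distance of zero, while fully separated sets give the maximal value of one.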

3.3 Optimizing critics’ parameters $\phi$

By optimizing the critics’ parameters $\phi$, we want to satisfy assumption A.1 so that the Generalized KS distance becomes a metric. For the problem posed in such a way, we lack supervision, i.e., we do not know the target sets’ shapes. However, we can reformulate the problem as an estimation of the density functions of the two considered measures $\mathds{P}_F$ and $\mathds{P}_G$ and use the obtained approximate density models to build level sets. We can pose an optimization problem for such a task based solely on finite sets of samples, which we have for $\mathds{P}_F$ and can arbitrarily generate from $\mathds{P}_G$. As the estimator, we propose to use the Energy-based model (EBM) [43], which, thanks to the lack of constraints in the choice of architecture, can be very expressive while having favorable computational complexity at inference. To carry out EBM training effectively, we will introduce a new min-max game, the “min phase” of which will turn out to be the initial objective in eq. 5, and in this way, we will close the adversarial cycle.

Let the critic $c_{\phi_F}(x)$ serve as the energy function. The density given by the EBM is then $p_{c_{\phi_F}}(x) = \exp(-c_{\phi_F}(x)) / Z_{c_{\phi_F}}$, where $Z_{c_{\phi_F}} = \int \exp(-c_{\phi_F}(x))\,\mathrm{d}x$ is the normalizing constant, called the partition function. The standard technique for learning the model given the target data distribution $\mathds{P}_F$ is maximum likelihood estimation (MLE), where the expected log-likelihood

\[ \mathbb{E}_{x \sim \mathds{P}_F}\left[\log p_{c_{\phi_F}}(x)\right] = \mathbb{E}_{x \sim \mathds{P}_F}\left[-c_{\phi_F}(x)\right] - \log Z_{c_{\phi_F}} \tag{14} \]

is maximized with respect to $\phi_F$. An unbiased estimate of the gradient of the second term can be obtained with samples from the EBM itself, typically achieved with MCMC sampling. Many approaches to avoid this expensive procedure have been described in the literature [43], and among them, the one based on adversarial training [23] is the most appealing to us. It introduces an auxiliary distribution $\mathds{P}_{aux(F)}$, such that the gradient of eq. 14 with respect to $\phi_F$ is approximated with the gradient of

\[ \mathbb{E}_{x \sim \mathds{P}_F}\left[-c_{\phi_F}(x)\right] - \mathbb{E}_{x \sim \mathds{P}_{aux(F)}}\left[-c_{\phi_F}(x)\right]. \tag{15} \]

Consequently, an additional objective $\mathcal{L}_{aux(F)}$ must be introduced, the optimization of which will lead to the alignment of $\mathds{P}_{aux(F)}$ and $\mathds{P}_{c_{\phi_F}}$, where $\mathds{P}_{c_{\phi_F}}$ denotes the probability distribution with density $p_{c_{\phi_F}}(x)$. We take an analogous approach to estimate $c_{\phi_G}(x)$.
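The reason samples from the EBM appear in the gradient of eq. 14, and why eq. 15 is a natural surrogate, can be seen from the standard identity below. It is a sketch that assumes differentiation and integration can be interchanged:

```latex
\begin{align*}
\nabla_{\phi_F} \log Z_{c_{\phi_F}}
  &= \frac{1}{Z_{c_{\phi_F}}} \int \nabla_{\phi_F} \exp\left(-c_{\phi_F}(x)\right) \mathrm{d}x \\
  &= -\int p_{c_{\phi_F}}(x)\, \nabla_{\phi_F} c_{\phi_F}(x)\, \mathrm{d}x
   = -\mathbb{E}_{x \sim \mathds{P}_{c_{\phi_F}}}\left[\nabla_{\phi_F} c_{\phi_F}(x)\right].
\end{align*}
```

Substituting this into eq. 14 gives the gradient $-\mathbb{E}_{x \sim \mathds{P}_F}[\nabla_{\phi_F} c_{\phi_F}(x)] + \mathbb{E}_{x \sim \mathds{P}_{c_{\phi_F}}}[\nabla_{\phi_F} c_{\phi_F}(x)]$, which coincides with the gradient of eq. 15 once the expectation under $\mathds{P}_{c_{\phi_F}}$ is replaced by one under $\mathds{P}_{aux(F)}$.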

We show in the appendix that when we (i) set $c_{\phi_G}(x) := -c_{\phi_F}(x)$, and (ii) repurpose $\mathds{P}_G$ as $\mathds{P}_{aux(F)}$ and $\mathds{P}_F$ as $\mathds{P}_{aux(G)}$, the MLE objectives for the critics (now denoted as $c_\phi$) simplify to $\mathcal{L}_c = \mathbb{E}_{x \sim \mathds{P}_G}[c_\phi(x)] - \mathbb{E}_{x \sim \mathds{P}_F}[c_\phi(x)]$, which is then maximized in an adversarial game against the Generalized KS distance in eq. 5.

The standard approach for aligning the auxiliary distributions with their targets is to use the Kullback–Leibler divergence. We propose using the Generalized KS distance instead. We set $\mathcal{L}_{aux(F)} = D_{\mathrm{GKS}}\left(\mathds{P}_G, \mathds{P}_{c_\phi}\right)$ and $\mathcal{L}_{aux(G)} = D_{\mathrm{GKS}}\left(\mathds{P}_F, \mathds{P}_{-c_\phi}\right)$. By analyzing these objectives in the fashion of section 3.2, we note that $\mathcal{L}_{aux(F)}$ is the same as our original objective $D_{\mathrm{GKS}}\left(\mathds{P}_F, \mathds{P}_G\right)$, which is symmetric, when we approximate sampling from $\mathds{P}_{c_\phi}$ with the target distribution $\mathds{P}_F$. The same holds for $\mathcal{L}_{aux(G)}$, where sampling from $\mathds{P}_{-c_\phi}$ is approximated with $\mathds{P}_G$. Therefore, the auxiliary objectives are already integrated into the adversarial game.

In practice, we find the score penalty regularizer of Kumar et al. [26], derived from the score matching objective, helpful to stabilize training. Therefore, we subtract it from $\mathcal{L}_c$, weighted by a hyperparameter $\beta$. In this way, we get a critic that is smoother and, therefore, generates regular level sets that facilitate optimization. We summarize the proposed training procedure in algorithm 1.

Input : Target distribution Fsubscript𝐹\mathds{P}_{F}blackboard_P start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT; latent distribution Zsubscript𝑍\mathds{P}_{Z}blackboard_P start_POSTSUBSCRIPT italic_Z end_POSTSUBSCRIPT; generator network gθsubscript𝑔𝜃g_{\theta}italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT; critic network cϕsubscript𝑐italic-ϕc_{\phi}italic_c start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT; number of critic updates kϕsubscript𝑘italic-ϕk_{\phi}italic_k start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT; number of generator updates kθsubscript𝑘𝜃k_{\theta}italic_k start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT; score penalty weight β𝛽\betaitalic_β;
Output : Trained model Gsubscript𝐺\mathds{P}_{G}blackboard_P start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT approximating Fsubscript𝐹\mathds{P}_{F}blackboard_P start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT;
1 repeat
2       for i=1𝑖1i=1italic_i = 1 to kϕsubscript𝑘italic-ϕk_{\phi}italic_k start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT do
             Draw batch {x}Fsimilar-to𝑥subscript𝐹\{x\}\sim\mathds{P}_{F}{ italic_x } ∼ blackboard_P start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT and {z}Zsimilar-to𝑧subscript𝑍\{z\}\sim\mathds{P}_{Z}{ italic_z } ∼ blackboard_P start_POSTSUBSCRIPT italic_Z end_POSTSUBSCRIPT ;
              // critic’s inner loop
3             c1|{z}|{z}xcϕ(gθ(z))22+1|{x}|{x}xcϕ(x)22subscript𝑐1𝑧subscript𝑧superscriptsubscriptdelimited-∥∥subscript𝑥subscript𝑐italic-ϕsubscript𝑔𝜃𝑧221𝑥subscript𝑥superscriptsubscriptdelimited-∥∥subscript𝑥subscript𝑐italic-ϕ𝑥22\mathcal{R}_{c}\leftarrow\frac{1}{|\{z\}|}\sum_{\{z\}}\lVert\nabla_{x}c_{\phi}% (g_{\theta}(z))\rVert_{2}^{2}+\frac{1}{|\{x\}|}\sum_{\{x\}}\lVert\nabla_{x}c_{% \phi}(x)\rVert_{2}^{2}caligraphic_R start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ← divide start_ARG 1 end_ARG start_ARG | { italic_z } | end_ARG ∑ start_POSTSUBSCRIPT { italic_z } end_POSTSUBSCRIPT ∥ ∇ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z ) ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 1 end_ARG start_ARG | { italic_x } | end_ARG ∑ start_POSTSUBSCRIPT { italic_x } end_POSTSUBSCRIPT ∥ ∇ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_x ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT;
4             c1|{z}|{z}cϕ(gθ(z))1|{x}|{x}cϕ(x)subscript𝑐1𝑧subscript𝑧subscript𝑐italic-ϕsubscript𝑔𝜃𝑧1𝑥subscript𝑥subscript𝑐italic-ϕ𝑥\mathcal{L}_{c}\leftarrow\frac{1}{|\{z\}|}\sum_{\{z\}}c_{\phi}(g_{\theta}(z))-% \frac{1}{|\{x\}|}\sum_{\{x\}}c_{\phi}(x)caligraphic_L start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ← divide start_ARG 1 end_ARG start_ARG | { italic_z } | end_ARG ∑ start_POSTSUBSCRIPT { italic_z } end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_g start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_z ) ) - divide start_ARG 1 end_ARG start_ARG | { italic_x } | end_ARG ∑ start_POSTSUBSCRIPT { italic_x } end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_x );
5             Update ϕitalic-ϕ\phiitalic_ϕ by using (cβc)ϕsubscript𝑐𝛽subscript𝑐italic-ϕ\frac{\partial(\mathcal{L}_{c}-\beta\mathcal{R}_{c})}{\partial\phi}divide start_ARG ∂ ( caligraphic_L start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT - italic_β caligraphic_R start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ) end_ARG start_ARG ∂ italic_ϕ end_ARG to maximize cβcsubscript𝑐𝛽subscript𝑐\mathcal{L}_{c}-\beta\mathcal{R}_{c}caligraphic_L start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT - italic_β caligraphic_R start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT;
6            
7      for i=1𝑖1i=1italic_i = 1 to kθsubscript𝑘𝜃k_{\theta}italic_k start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT do
             Draw batch $\{x\} \sim \mathds{P}_{F}$ and $\{z\} \sim \mathds{P}_{Z}$;
              // generator's inner loop
8             $\{c_{F}\} \leftarrow \{c_{\phi}(x) : \{x\}\}$ and $\{c_{G}\} \leftarrow \{c_{\phi}(g_{\theta}(z)) : \{z\}\}$;
9             $\{\lambda\} \leftarrow \{c_{F}\} \cup \{c_{G}\}$;
10             $\mathcal{L}_{g,F} \leftarrow \max_{\{\lambda\}} \left| \frac{1}{|\{z\}|} \sum_{\{c_{G}\}} \mathds{1}_{(-\infty,\lambda]}(c_{G}) - \frac{1}{|\{x\}|} \sum_{\{c_{F}\}} \mathds{1}_{(-\infty,\lambda]}(c_{F}) \right|$;
11             $\mathcal{L}_{g,G} \leftarrow \max_{\{\lambda\}} \left| \frac{1}{|\{x\}|} \sum_{\{c_{F}\}} \mathds{1}_{(-\infty,-\lambda]}(-c_{F}) - \frac{1}{|\{z\}|} \sum_{\{c_{G}\}} \mathds{1}_{(-\infty,-\lambda]}(-c_{G}) \right|$;
12             $\mathcal{L}_{g} \leftarrow \mathcal{L}_{g,F} + \mathcal{L}_{g,G}$;
13             Update $\theta$ by using $\frac{\partial \mathcal{L}_{g}}{\partial \theta}$ to minimize $\mathcal{L}_{g}$;
14            
15      
16until converged;
return ${g_{\theta}}_{\#}\mathds{P}_{Z}$
Algorithm 1 Learning a generative model by minimizing Generalized KS distance.
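
The generator's loss computation in algorithm 1 can be sketched in a few lines of NumPy. This is a minimal, non-differentiable illustration rather than the authors' implementation: the names `empirical_cdf` and `generator_loss` are hypothetical, and the hard indicator functions used here provide no gradient with respect to $\theta$, which a trainable implementation has to address.

```python
import numpy as np


def empirical_cdf(values, thresholds):
    """Fraction of `values` that are <= each entry of `thresholds`."""
    return (values[None, :] <= thresholds[:, None]).mean(axis=1)


def generator_loss(c_F, c_G):
    """L_g = L_{g,F} + L_{g,G} from critic values on target and generated batches."""
    c_F = np.asarray(c_F, dtype=float)
    c_G = np.asarray(c_G, dtype=float)
    lam = np.concatenate([c_F, c_G])  # {lambda} <- {c_F} U {c_G}

    # L_{g,F}: compare the mass of the level sets {c <= lambda}
    loss_F = np.max(np.abs(empirical_cdf(c_G, lam) - empirical_cdf(c_F, lam)))
    # L_{g,G}: the same comparison for the negated critic, {-c <= -lambda}
    loss_G = np.max(np.abs(empirical_cdf(-c_F, -lam) - empirical_cdf(-c_G, -lam)))
    return loss_F + loss_G
```

For two batches whose critic values are fully separated, e.g. `generator_loss([0, 1], [2, 3])`, each term reaches its maximum of 1; for identical batches both terms vanish.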

4 Discussion

In section 3.3, where we justify the choice of the critic's objective function, we refer to methods for training EBMs, which are approximate density models. The reader might therefore expect that, in the limit of convergence of the algorithm, our proposed critic $c_{\phi}$ becomes a source of information about the density of the target distribution $\mathds{P}_{F}$, accompanying the model $\mathds{P}_{G}$ that generates samples. However, this does not happen, as a consequence of design choice (i), that is, the setup $c_{\phi_{F}} = -c_{\phi_{G}} = c_{\phi}$: an EBM can only be equivalent to its inverse in the case of a uniform distribution. In addition, because of design choice (ii), the critic is not evaluated outside the supports of $\mathds{P}_{F}$ and $\mathds{P}_{G}$ during training and can therefore take arbitrary values there. Despite these observations, the Generalized KS distance present in our algorithm exposes sufficient conditions because of remark 2.

The feature distinguishing KSGAN from other adversarial generative modeling approaches is that, regardless of the outcome of the critic's inner problem, minimizing eq. 5 is justified because the Generalized KS distance, despite not meeting assumption A.1, is a pseudo-metric [38]. For comparison, the dual representation of the Wasserstein distance used in WGAN [2] requires attaining the supremum in the inner problem.

The distances used for training generative models all fall into either the category of $f$-divergences $D_{f}(\mathds{P}_{F},\mathds{P}_{G}) = \int_{\mathcal{A}} f\left(\mathrm{d}\mathds{P}_{F}/\mathrm{d}\mathds{P}_{G}\right)\mathrm{d}\mathds{P}_{G}$ or integral probability metrics (IPMs) $D_{\mathcal{F}}(\mathds{P}_{F},\mathds{P}_{G}) = \sup_{f\in\mathcal{F}} |\mathbb{E}_{x\sim\mathds{P}_{F}} f(x) - \mathbb{E}_{x\sim\mathds{P}_{G}} f(x)|$.
The classical one-dimensional KS distance is an instance of IPM with $\mathcal{F} = \{\mathds{1}_{(-\infty,t]} \mid t \in \mathbb{R}\}$ or $\mathcal{F} = \{\mathds{1}_{G^{-1}(\alpha)} \mid \alpha \in [0,1]\}$ when having access to the inverse CDF of one of the distributions based on eq. 2. One can see the Generalized KS distance from the perspective of IPM with $\mathcal{F} = \{\mathds{1}_{C(\alpha)} \mid \alpha \in [0,1]\ \&\ C \in \{C_{\mathds{P}_{F},\mathcal{C}}, C_{\mathds{P}_{G},\mathcal{C}}\}\}$. Assuming direct access to $C_{\mathds{P}_{F},\mathcal{C}}$ and $C_{\mathds{P}_{G},\mathcal{C}}$, for example when both $\mathds{P}_{F}$ and $\mathds{P}_{G}$ are Normalizing Flows [24, 34], measuring the distance comes down to a line search.
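
To make the IPM view of the classical one-dimensional KS distance concrete, the following sketch evaluates an empirical IPM over a finite family of indicator functions $\mathds{1}_{(-\infty,t]}$. The helper names `empirical_ipm` and `indicator_class` are illustrative assumptions, and restricting the thresholds $t$ to the observed sample points is a choice made here for the empirical case (the empirical CDFs only change at those points).

```python
import numpy as np


def indicator_class(thresholds):
    """Finite family of indicators f_t(x) = 1[x <= t], one per threshold t."""
    return [lambda x, t=t: (x <= t).astype(float) for t in thresholds]


def empirical_ipm(x_F, x_G, function_class):
    """sup over f in the class of |mean f(x_F) - mean f(x_G)| on two samples."""
    x_F = np.asarray(x_F, dtype=float)
    x_G = np.asarray(x_G, dtype=float)
    return max(abs(f(x_F).mean() - f(x_G).mean()) for f in function_class)


x_F = np.array([0.0, 1.0, 2.0, 3.0])
x_G = np.array([2.0, 3.0, 4.0, 5.0])
# Thresholds at all observed points suffice for the empirical KS distance.
ks = empirical_ipm(x_F, x_G, indicator_class(np.concatenate([x_F, x_G])))
```

Swapping `indicator_class` for another function family yields other IPMs under the same interface.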

5 Related work

The need to generalize the KS test, and therefore the distance, to multiple dimensions arose naturally among practitioners who collected such data and wished to test related hypotheses. It was first addressed by Peacock [35], who proposed a two-dimensional test for applications in astronomy. The test involves considering all possible orderings in this space and using the one that maximizes the distance between the distributions. A modification of this procedure was proposed by Fasano and Franceschini [11], in which only four candidate CDFs have to be considered; this makes the test applicable in three dimensions, with eight candidates, under similar computational constraints. Chronologically, the next approach is the one on which we base our work, proposed in Polonik [38] and made possible by the author's earlier work [36, 37]. To the best of our knowledge, the first work that practically uses the theory developed by Polonik is Glazer et al. [12], which we recommend as an introduction to our work. It proposes applying the Generalized KS test, based on support vector machines, to detect distribution shifts in data streams.

As an instance of the adversarial generative modeling family, our work is related to the many GAN [13] follow-ups. We highlight those that study the learning process from the perspective of the distance being minimized. The work of Arjovsky and Bottou [1] provides a formal analysis of the heuristic tricks used for stabilizing the training of GANs. The $f$-GAN [33] proposes a unified training framework targeting $f$-divergences, which relies on a variational lower bound of the objective that results in the adversarial process. Approaches relying on integral probability metrics include FisherGAN [32], the Generative Moment Matching Networks [29] based on MMD, the later, more sophisticated MMD GAN [28], and finally the Wasserstein GAN (WGAN) [2] with its WGAN-GP follow-up [16], which shares common features with our work. Our maximum likelihood approach to fitting the critic results in the same functional form of the loss as WGAN(-GP) uses. In addition, the score penalty we use is similar to the gradient penalty of WGAN-GP.

6 Experiments

We evaluate the proposed method on eight synthetic 2D distributions (see section B.1 for details) and two image datasets, i.e. MNIST [27] and CIFAR-10 [25]. We compare against other adversarial methods, GAN and WGAN-GP, using the same neural network architectures and training hyper-parameters unless specified otherwise (see appendix C for details). All the quantitative results are presented based on five random initializations of the models. The source code for all the experiments is provided at https://github.com/DMML-Geneva/ksgan.

In all KSGAN experiments, we relax the maximum in the computation of $\mathcal{L}_{g,F}$ and $\mathcal{L}_{g,G}$ in algorithm 1 with a sample average. In all experiments, we re-use the last batch of samples from the latent distribution (and from the target distribution in the case of KSGAN) from the critic's optimization inner loop as the first batch for the generator's optimization inner loop.
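
One possible reading of this relaxation is sketched below: the maximum over the thresholds $\{\lambda\}$ is replaced by the mean of the absolute gaps over the same thresholds. This interpretation, along with the function name `ks_gap` and its `reduce` argument, is an assumption made for illustration and may differ from the released code.

```python
import numpy as np


def ks_gap(c_F, c_G, reduce="max"):
    """Absolute gap between empirical CDFs of critic values, reduced over thresholds."""
    c_F = np.asarray(c_F, dtype=float)
    c_G = np.asarray(c_G, dtype=float)
    lam = np.concatenate([c_F, c_G])  # thresholds: union of critic values
    cdf_F = (c_F[None, :] <= lam[:, None]).mean(axis=1)
    cdf_G = (c_G[None, :] <= lam[:, None]).mean(axis=1)
    gaps = np.abs(cdf_G - cdf_F)
    if reduce == "max":   # hard maximum, as written in algorithm 1
        return gaps.max()
    if reduce == "mean":  # relaxed variant: average over thresholds (assumed reading)
        return gaps.mean()
    raise ValueError(f"unknown reduce: {reduce!r}")
```

The averaged value never exceeds the maximum, so under this reading the relaxed quantity is a lower bound on the hard one that depends on all thresholds rather than a single one.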

6.1 Synthetic distributions

Analyzing adversarial methods on synthetic, low-dimensional distributions is uncommon. However, we conduct such an experiment because we are interested in whether the model generates samples from the support of the target distribution and how accurately it approximates the distribution. Working with low-dimensional distributions, we do not have to be as concerned about the curse of dimensionality when calculating sample-based distances, and we can visually compare the resulting histograms.

In table 1, we report the squared population MMD [15] between target and approximate distributions, computed with a Gaussian kernel on 65536 samples from each distribution. Details about how we chose the kernel's bandwidth can be found in section B.1. GAN and WGAN-GP fail to converge with $k_{\phi} = k_{\theta} = 1$ (we do not report the results to economize on space); thus, we set $k_{\phi} = 5$ for them, as listed in table 1. The proposed KSGAN with $k_{\phi} = 1$ performs at a similar level to WGAN-GP, the better of the two baselines, despite using one fifth of the training budget. We present additional results on the synthetic datasets in D.1, which include performance with different training dataset sizes, non-default hyper-parameter setups for KSGAN, and histograms of the samples for qualitative comparison.
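
As a rough illustration of the evaluation metric, the sketch below computes a biased (V-statistic) estimate of the squared MMD with a Gaussian kernel. The estimator variant, the kernel parameterization $k(a,b) = \exp(-\lVert a-b \rVert^2 / (2\sigma^2))$, and the names `gaussian_kernel` and `mmd2_biased` are assumptions; the paper's exact estimator and bandwidth selection are described in section B.1 and may differ.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dist = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against small negative round-off
    return np.exp(-sq_dist / (2.0 * bandwidth**2))


def mmd2_biased(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x and y of shape (n, d)."""
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy
```

Note that the full Gram matrices grow quadratically with the sample size, so for sample counts like those in table 1 a blockwise accumulation would be needed in practice.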

Table 1: Squared population MMD $\times 10^{3}$ ($\downarrow$) between test data and samples from the methods trained on 65536 samples, averaged over five random initializations, with the standard deviation (calculated with Bessel's correction) in parentheses. The proposed KSGAN with $k_{\phi} = 1$ performs on par with the WGAN-GP trained with five times the budget, $k_{\phi} = 5$. See D.1 for qualitative comparison.

| Distribution | GAN ($k_{\phi}$, $k_{\theta}$) = (5, 1) | WGAN-GP ($k_{\phi}$, $k_{\theta}$) = (5, 1) | KSGAN ($k_{\phi}$, $k_{\theta}$) = (1, 1) |
|---|---|---|---|
| swissroll | 3.37 (1.023) | 0.29 (0.119) | 0.39 (0.100) |
| circles | 2.98 (1.501) | 0.27 (0.215) | 0.49 (0.240) |
| rings | 2.00 (1.264) | 0.13 (0.082) | 0.43 (0.162) |
| moons | 1.41 (0.757) | 0.35 (0.136) | 0.53 (0.189) |
| 8gaussians | 3.57 (2.719) | 0.35 (0.248) | 0.32 (0.277) |
| pinwheel | 1.66 (1.451) | 0.27 (0.184) | 0.40 (0.086) |
| 2spirals | 0.93 (0.822) | 0.27 (0.191) | 0.44 (0.232) |
| checkerboard | 1.43 (0.899) | 0.38 (0.296) | 0.86 (0.468) |

6.2 MNIST

We use the 50000 training instances to train the models, and based on visual inspection of the generated samples (reported in D.2), we conclude that all the methods achieve comparable, high sample quality. To assess the quality of the distribution approximation, we use a classifier pre-trained on the same data as the generative models (details in section B.2). We run the same experiment on 3StackedMNIST [44], which has 1000 modes. We report the results in table 2.

In this experiment, we set the training budget for all methods to $k_{\phi} = 1$, $k_{\theta} = 1$ for a fair comparison. We find that all methods always recover all the modes with the standard MNIST target. However, GAN fails to distribute the probability mass uniformly between the digits. As the number of modes increases with the 3StackedMNIST target, GAN demonstrates its inferiority to other methods by losing 192 modes on average (four initializations cover approx. 985 modes, and one fails to converge, achieving only 98 modes). WGAN-GP and KSGAN consistently recover all the modes while being on par regarding KL divergence, which varies little across network initializations.

Table 2: The number of captured modes and Kullback-Leibler divergence between the distribution of sampled digits and the target uniform distribution, averaged over five random initializations, with the standard deviation (calculated with Bessel's correction) in parentheses. All the methods were trained with the same budget $k_{\phi} = 1$, $k_{\theta} = 1$. WGAN-GP and KSGAN cover all the modes in all experiments while demonstrating low KL divergence.

| Method ($k_{\phi}$, $k_{\theta}$) | MNIST # modes $\uparrow$ | MNIST KL $\downarrow$ | 3StackedMNIST # modes $\uparrow$ | 3StackedMNIST KL $\downarrow$ |
|---|---|---|---|---|
| GAN (1,1) | 10 (0.00) | 0.6007 (0.27550) | 808 (396.91) | 1.4160 (1.36819) |
| WGAN-GP (1,1) | 10 (0.00) | 0.0087 (0.00499) | 1000 (0.00) | 0.0336 (0.00461) |
| KSGAN (1,1) | 10 (0.00) | 0.0056 (0.00045) | 1000 (0.00) | 0.0362 (0.00534) |
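
The two metrics in table 2 can be computed from the classifier's predicted labels for generated samples along the following lines. This is a hedged sketch: the function name `modes_and_kl` is hypothetical, and the KL direction used here (empirical label distribution relative to the uniform target, in nats) is an assumption, since the exact convention is specified in section B.2 rather than here.

```python
import numpy as np


def modes_and_kl(pred_labels, num_modes):
    """Count captured modes and compute KL(empirical || uniform) from predicted labels."""
    pred_labels = np.asarray(pred_labels, dtype=int)
    counts = np.bincount(pred_labels, minlength=num_modes)
    p = counts / counts.sum()              # empirical distribution over modes
    captured = int(np.count_nonzero(counts))
    nonzero = p > 0                        # terms with p = 0 contribute 0 to the KL
    kl = float(np.sum(p[nonzero] * np.log(p[nonzero] * num_modes)))
    return captured, kl
```

With `num_modes=10` this corresponds to the MNIST columns and with `num_modes=1000` to the 3StackedMNIST columns; a perfectly uniform label histogram gives a KL of zero, while collapsing onto a single mode gives $\log(\text{num\_modes})$.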

6.3 CIFAR-10

We use the 50000 training instances to train the models and report the generated samples in D.3. We train the models in a fully unconditional manner, i.e., without using the class information at all, contrary to many unconditional models that use class information in normalization layers. We quantify the quality of fitted models by computing the Inception Score (IS) [41] and Fréchet inception distance (FID) [18] from the test set and report the results in table 3 based on five random initializations. For reference, in the table, we include the IS of the training dataset and the FID between the training and test sets.

In this experiment, we set the training budget for all methods to $k_{\phi} = 1$, $k_{\theta} = 1$ for a fair comparison. All models fail to accurately approximate the target distribution, which is evident from a quantitative comparison in table 3 and a qualitative one in D.3. KSGAN is characterized by the lowest variance between initializations among the methods considered.

Table 3: Inception Score (IS) and Fréchet inception distance (FID) metrics averaged over five random initializations, with the standard deviation (calculated with Bessel's correction) in parentheses. All the methods were trained with the same budget $k_{\phi} = 1$, $k_{\theta} = 1$. The scores for the training dataset are included in the top row, as "Real data", for reference. WGAN-GP and KSGAN perform similarly on average, while KSGAN exhibits lower variance across network initializations.

| Method ($k_{\phi}$, $k_{\theta}$) | IS $\uparrow$ | FID $\downarrow$ |
|---|---|---|
| Real data | 11.2643 | 5.8369 |
| GAN (1,1) | 6.6209 (0.59187) | 47.9414 (10.78435) |
| WGAN-GP (1,1) | 6.7351 (0.31735) | 44.3026 (6.61652) |
| KSGAN (1,1) | 6.6429 (0.16785) | 41.1555 (3.26385) |

7 Conclusions and future work

In this work, we investigated the use of Generalized Kolmogorov–Smirnov distance for training deep implicit statistical models, i.e., generative networks. We proposed an efficient way to compute the distance and termed the resulting model Kolmogorov–Smirnov Generative Adversarial Network because it uses adversarial learning. Based on the empirical evaluation of the proposed model, the results of which we report, we conclude that it can be considered as an alternative to existing models in its class. At the same time, we point out that many properties of KSGAN have not been studied, and we leave this as a future work direction.

Interesting aspects to explore are the characteristics of learning dynamics with the number of generator updates exceeding the number of critic updates, alternative ways to train the critic, and alternative representations of generalized quantile sets. The natural scaling of the Generalized KS distance may also prove beneficial regarding the interpretability of learning curves, learning rate scheduling, or early stopping. In addition, we hope that our work will draw the attention of the machine learning community to the Generalized KS distance, applications of which remain to be explored.

Acknowledgments and Disclosure of Funding

We acknowledge the financial support of the Swiss National Science Foundation within the MIGRATE project (grant no. 209434). The computations were performed at the University of Geneva on "Baobab" and "Yggdrasil" HPC clusters.

References

  • Arjovsky and Bottou [2017] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017.
  • Arjovsky et al. [2017] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  • Atzmon et al. [2019] M. Atzmon, N. Haim, L. Yariv, O. Israelov, H. Maron, and Y. Lipman. Controlling neural level sets. Advances in Neural Information Processing Systems, 32, 2019.
  • Bengio et al. [2013] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Carolan [2002] C. A. Carolan. The least concave majorant of the empirical distribution function. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 30(2):317–328, 2002.
  • Chen et al. [2023] G. Chen, Z. Yu, H. Liu, Y. Ma, and B. Yu. DevelSet: Deep Neural Level Set for Instant Mask Optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42(12):5020–5033, 2023.
  • Chen et al. [2018] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.
  • Cranmer et al. [2020] K. Cranmer, J. Brehmer, and G. Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055–30062, 2020.
  • Diggle and Gratton [1984] P. J. Diggle and R. J. Gratton. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society. Series B (Methodological), 46(2):193–227, 1984.
  • Einmahl and Mason [1992] J. H. J. Einmahl and D. M. Mason. Generalized Quantile Processes. The Annals of Statistics, 20(2), jun 1992.
  • Fasano and Franceschini [1987] G. Fasano and A. Franceschini. A multidimensional version of the Kolmogorov–Smirnov test. Monthly Notices of the Royal Astronomical Society, 225(1):155–170, mar 1987.
  • Glazer et al. [2012] A. Glazer, M. Lindenbaum, and S. Markovitch. Learning high-density regions for a generalized Kolmogorov-Smirnov test in high-dimensional data. Advances in Neural Information Processing Systems, 1:728–736, 2012.
  • Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
  • Grathwohl et al. [2018] W. Grathwohl, R. T. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
  • Gretton et al. [2012] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773, 2012.
  • Gulrajani et al. [2017] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. Advances in neural information processing systems, 30, 2017.
  • Hermans et al. [2022] J. Hermans, A. Delaunoy, F. Rozet, A. Wehenkel, V. Begy, and G. Louppe. A crisis in simulation-based inference? beware, your posterior approximations can be unfaithful. Transactions on Machine Learning Research, 2022.
  • Heusel et al. [2017] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • Ho et al. [2020] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Hu et al. [2017] P. Hu, B. Shuai, J. Liu, and G. Wang. Deep level sets for salient object detection. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017-January:540–549, 2017.
  • Hyndman [1996] R. J. Hyndman. Computing and graphing highest density regions. The American Statistician, 50(2):120–126, 1996.
  • Hyvärinen [2005] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695–709, 2005.
  • Kim and Bengio [2016] T. Kim and Y. Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.
  • Kobyzev et al. [2020] I. Kobyzev, S. J. Prince, and M. A. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE transactions on pattern analysis and machine intelligence, 43(11):3964–3979, 2020.
  • Krizhevsky [2009] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  • Kumar et al. [2019] R. Kumar, S. Ozair, A. Goyal, A. Courville, and Y. Bengio. Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508, 2019.
  • Lecun et al. [1998] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li et al. [2017] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. MMD GAN: Towards deeper understanding of moment matching network. Advances in neural information processing systems, 30, 2017.
  • Li et al. [2015] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In International conference on machine learning, pages 1718–1727. PMLR, 2015.
  • Lyu [2009] S. Lyu. Interpretation and generalization of score matching. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 359–366, 2009.
  • Miyato et al. [2018] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
  • Mroueh and Sercu [2017] Y. Mroueh and T. Sercu. Fisher GAN. Advances in neural information processing systems, 30, 2017.
  • Nowozin et al. [2016] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • Papamakarios et al. [2021] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021.
  • Peacock [1983] J. A. Peacock. Two-dimensional goodness-of-fit testing in astronomy. Monthly Notices of the Royal Astronomical Society, 202(3):615–627, mar 1983.
  • Polonik [1997] W. Polonik. Minimum volume sets in statistics: Recent developments. In R. Klar and O. Opitz, editors, Classification and Knowledge Organization, pages 187–194, Berlin, Heidelberg, 1997. Springer Berlin Heidelberg.
  • Polonik [1998] W. Polonik. The silhouette, concentration functions and ML-density estimation under order restrictions. The Annals of Statistics, 26(5):1857–1877, 1998.
  • Polonik [1999] W. Polonik. Concentration and goodness-of-fit in higher dimensions: (Asymptotically) distribution-free methods. Annals of Statistics, 27(4):1210–1229, 1999.
  • Radford et al. [2016] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Y. Bengio and Y. LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
  • Ramesh et al. [2022] P. Ramesh, J.-M. Lueckmann, J. Boelts, Á. Tejero-Cantero, D. S. Greenberg, P. J. Goncalves, and J. H. Macke. GATSBI: Generative adversarial training for simulation-based inference. In International Conference on Learning Representations, 2022.
  • Salimans et al. [2016] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. Advances in neural information processing systems, 29, 2016.
  • Song and Ermon [2019] Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • Song and Kingma [2021] Y. Song and D. P. Kingma. How to train your energy-based models. arXiv preprint arXiv:2101.03288, 2021.
  • Srivastava et al. [2017] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. Advances in neural information processing systems, 30, 2017.

Appendix A Proofs

A.1 Generalized KS distance satisfies triangle inequality

Let us consider three probability measures $\mathds{P}_{F}$, $\mathds{P}_{G}$, and $\mathds{P}_{H}$ on a measurable space $(\mathcal{X},\mathcal{A})$.

$\displaystyle D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{H}\right)+D_{\mathrm{GKS}}\left(\mathds{P}_{H},\mathds{P}_{G}\right)$
$\displaystyle =\sup_{\substack{\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{F},\mathcal{C}},C_{\mathds{P}_{H},\mathcal{C}}\}}}\left[|\mathds{P}_{F}(C(\alpha))-\mathds{P}_{H}(C(\alpha))|\right]+\sup_{\substack{\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{H},\mathcal{C}},C_{\mathds{P}_{G},\mathcal{C}}\}}}\left[|\mathds{P}_{H}(C(\alpha))-\mathds{P}_{G}(C(\alpha))|\right]$
$\displaystyle \stackrel{\text{(i)}}{=}\sup_{\substack{\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{F},\mathcal{C}},C_{\mathds{P}_{H},\mathcal{C}},C_{\mathds{P}_{G},\mathcal{C}}\}}}\left[|\mathds{P}_{F}(C(\alpha))-\mathds{P}_{H}(C(\alpha))|\right]+\sup_{\substack{\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{H},\mathcal{C}},C_{\mathds{P}_{G},\mathcal{C}},C_{\mathds{P}_{F},\mathcal{C}}\}}}\left[|\mathds{P}_{H}(C(\alpha))-\mathds{P}_{G}(C(\alpha))|\right]$
$\displaystyle =\sup_{\substack{\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{F},\mathcal{C}},C_{\mathds{P}_{H},\mathcal{C}},C_{\mathds{P}_{G},\mathcal{C}}\}}}\left[|\mathds{P}_{F}(C(\alpha))-\mathds{P}_{H}(C(\alpha))|+|\mathds{P}_{H}(C(\alpha))-\mathds{P}_{G}(C(\alpha))|\right]$
$\displaystyle \stackrel{\text{(ii)}}{\geqslant}\sup_{\substack{\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{F},\mathcal{C}},C_{\mathds{P}_{H},\mathcal{C}},C_{\mathds{P}_{G},\mathcal{C}}\}}}\left[|\mathds{P}_{F}(C(\alpha))-\mathds{P}_{G}(C(\alpha))|\right]$
=supα[0,1]C{CG,𝒞,CF,𝒞}[|F(C(α))G(C(α))|]=DGKS(F,G)absentsubscriptsupremum𝛼01𝐶subscript𝐶subscript𝐺𝒞subscript𝐶subscript𝐹𝒞delimited-[]subscript𝐹𝐶𝛼subscript𝐺𝐶𝛼subscript𝐷GKSsubscript𝐹subscript𝐺\displaystyle=\sup_{\begin{subarray}{c}\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{G},\mathcal{C}},C_{\mathds{P}_{F},\mathcal{C}}\}\end{% subarray}}\left[|\mathds{P}_{F}(C(\alpha))-\mathds{P}_{G}(C(\alpha))|\right]=D% _{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)= roman_sup start_POSTSUBSCRIPT start_ARG start_ROW start_CELL italic_α ∈ [ 0 , 1 ] end_CELL end_ROW start_ROW start_CELL italic_C ∈ { italic_C start_POSTSUBSCRIPT blackboard_P start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT , caligraphic_C end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT blackboard_P start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT , caligraphic_C end_POSTSUBSCRIPT } end_CELL end_ROW end_ARG end_POSTSUBSCRIPT [ | blackboard_P start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT ( italic_C ( italic_α ) ) - blackboard_P start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ( italic_C ( italic_α ) ) | ] = italic_D start_POSTSUBSCRIPT roman_GKS end_POSTSUBSCRIPT ( blackboard_P start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT , blackboard_P start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT )

In (i), we use the fact that the supremum of the absolute difference in coverage between two distributions is attained with the generalized quantile function of one of them, so adding the quantile function of the third distribution to the set does not change its value. In (ii), we apply the triangle inequality for the absolute value. Thus we have shown that $D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{H}\right)+D_{\mathrm{GKS}}\left(\mathds{P}_{H},\mathds{P}_{G}\right)\geqslant D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)$, which is the triangle inequality for the Generalized KS distance.
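As a numerical illustration of the inequality, the following minimal sketch computes an empirical Generalized KS distance in the simplified setting of Figure 1, where the level sets come from a single fixed critic and coverage is the fraction of samples whose critic value lies below a threshold. This is an assumption made for illustration: the definition used in the proof takes the supremum over the quantile functions of both distributions. The function names and the Gaussian toy critic outputs are hypothetical.

```python
import bisect
import random


def empirical_coverage(critic_values, thresholds):
    """Fraction of samples whose critic value is <= each threshold."""
    ordered = sorted(critic_values)
    n = len(ordered)
    return [bisect.bisect_right(ordered, t) / n for t in thresholds]


def gks_fixed_critic(values_f, values_g):
    """Largest absolute coverage gap over the level sets of one fixed critic.

    The coverage curves only change at observed critic values, so it is
    enough to evaluate the gap at those thresholds.
    """
    thresholds = sorted(set(values_f) | set(values_g))
    cov_f = empirical_coverage(values_f, thresholds)
    cov_g = empirical_coverage(values_g, thresholds)
    return max(abs(a - b) for a, b in zip(cov_f, cov_g))


random.seed(0)
# Hypothetical critic outputs on samples from three distributions F, H, G.
c_f = [random.gauss(0.0, 1.0) for _ in range(500)]
c_h = [random.gauss(0.5, 1.0) for _ in range(500)]
c_g = [random.gauss(1.0, 1.0) for _ in range(500)]

d_fh = gks_fixed_critic(c_f, c_h)
d_hg = gks_fixed_critic(c_h, c_g)
d_fg = gks_fixed_critic(c_f, c_g)
print(d_fh + d_hg >= d_fg)
```

With a fixed critic the quantity reduces to the one-dimensional KS distance between critic values, so the check is a sanity test of the statement rather than a substitute for the proof.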

A.2 Objective for the critic

Given two adversarial maximum likelihood objectives from Kim and Bengio [23], we (i) set $c_{\phi_{G}}(x):=-c_{\phi_{F}}(x)$, and (ii) repurpose $\mathds{P}_{G}$ as $\mathds{P}_{aux(F)}$ and $\mathds{P}_{F}$ as $\mathds{P}_{aux(G)}$, and show that:

$$
\begin{aligned}
&\frac{1}{2}\left(\mathbb{E}_{x\sim\mathds{P}_{F}}[-c_{\phi_{F}}(x)]-\mathbb{E}_{x\sim\mathds{P}_{aux(F)}}[-c_{\phi_{F}}(x)]\right)+\frac{1}{2}\left(\mathbb{E}_{x\sim\mathds{P}_{G}}[-c_{\phi_{G}}(x)]-\mathbb{E}_{x\sim\mathds{P}_{aux(G)}}[-c_{\phi_{G}}(x)]\right)\\
&\quad=\frac{1}{2}\left(\mathbb{E}_{x\sim\mathds{P}_{F}}[-c_{\phi}(x)]-\mathbb{E}_{x\sim\mathds{P}_{G}}[-c_{\phi}(x)]+\mathbb{E}_{x\sim\mathds{P}_{G}}[c_{\phi}(x)]-\mathbb{E}_{x\sim\mathds{P}_{F}}[c_{\phi}(x)]\right)\\
&\quad=\mathbb{E}_{x\sim\mathds{P}_{G}}[c_{\phi}(x)]-\mathbb{E}_{x\sim\mathds{P}_{F}}[c_{\phi}(x)].
\end{aligned}
$$
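The identity can be checked numerically. The sketch below is a minimal illustration, assuming expectations are replaced by sample means; the toy critic and the Gaussian samples are hypothetical stand-ins, not the networks or data used in the paper.

```python
import random


def mean(values):
    return sum(values) / len(values)


def two_sided_objective(c, x_f, x_g):
    """Average of both objectives with c_G := -c_F, aux(F) := G, aux(G) := F."""
    c_f = c                      # critic attached to P_F

    def c_g(x):                  # critic attached to P_G, tied to c_F
        return -c(x)

    term_f = mean([-c_f(x) for x in x_f]) - mean([-c_f(x) for x in x_g])
    term_g = mean([-c_g(x) for x in x_g]) - mean([-c_g(x) for x in x_f])
    return 0.5 * term_f + 0.5 * term_g


def collapsed_objective(c, x_f, x_g):
    """Right-hand side: E_{x ~ P_G}[c(x)] - E_{x ~ P_F}[c(x)]."""
    return mean([c(x) for x in x_g]) - mean([c(x) for x in x_f])


def toy_critic(x):
    # Hypothetical scalar critic, chosen only for illustration.
    return x * x - 0.5 * x


random.seed(0)
samples_f = [random.gauss(0.0, 1.0) for _ in range(1000)]
samples_g = [random.gauss(1.0, 2.0) for _ in range(1000)]

lhs = two_sided_objective(toy_critic, samples_f, samples_g)
rhs = collapsed_objective(toy_critic, samples_f, samples_g)
print(abs(lhs - rhs) < 1e-9)
```

Because the identity is algebraic, it holds for any critic and any pair of sample sets, up to floating-point rounding.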

Appendix B Experiments details

In this section, we provide additional details about the experiments conducted in the paper that did not fit in the main text. Each model reported in the paper was trained in under 12 hours on a single Nvidia GeForce GTX TITAN X GPU (12GB vRAM) with 32GB of RAM and 2 CPU cores. We report results based on 645 trained models, which amounts to at most 7740 GPU hours. We estimate that about three times as much computing time was used for preliminary experiments not reported in the paper.

B.1 Synthetic

The synthetic 2D distributions are adopted from the official code of Grathwohl et al. [14] (https://github.com/rtqichen/ffjord). We randomly generate 65536 training and 65536 test instances from each distribution. In Appendix D.1, we report the results of training the models with fewer instances while evaluating them on the entire test set.

We choose the bandwidth of the Gaussian filter in squared population MMD as the median of L2 pairwise distances between 65536 instances sampled from the simulator. The resulting values can be found in the code we provide with the paper.
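The bandwidth selection can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the kernel is parameterized as $\exp(-\|x-y\|^{2}/(2\sigma^{2}))$ and the squared MMD is estimated with the biased V-statistic, neither of which is specified in the text. The function names are hypothetical, and the sample sizes are kept small because the loops are quadratic.

```python
import math
import random
from statistics import median


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def median_bandwidth(points):
    """Median of the pairwise L2 distances between the given points."""
    dists = [math.sqrt(sq_dist(points[i], points[j]))
             for i in range(len(points))
             for j in range(i + 1, len(points))]
    return median(dists)


def gaussian_kernel(a, b, bandwidth):
    # Assumed parameterization: exp(-||a - b||^2 / (2 * bandwidth^2)).
    return math.exp(-sq_dist(a, b) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def avg_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return avg_kernel(xs, xs) + avg_kernel(ys, ys) - 2.0 * avg_kernel(xs, ys)


random.seed(0)


def sample_2d(n, shift):
    return [(random.gauss(shift, 1.0), random.gauss(shift, 1.0)) for _ in range(n)]


reference = sample_2d(100, 0.0)
same = sample_2d(100, 0.0)
shifted = sample_2d(100, 3.0)

sigma = median_bandwidth(reference)
print(mmd_squared(reference, same, sigma), mmd_squared(reference, shifted, sigma))
```

For sample sizes such as the 65536 instances used here, a vectorized implementation would be needed in practice; the nested loops above are only meant to make the computation explicit.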

B.2 MNIST

To detect the modes in the (3Stacked)MNIST experiments, we use a pre-trained classifier from the PyTorch examples, trained for 14 epochs on the training set of the original MNIST dataset. We expect to find 10 and 1000 modes for MNIST and 3StackedMNIST, respectively. For both distributions, we measure the KL divergence between the classifier's output and the discrete uniform distribution.
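A minimal sketch of this evaluation is given below. It makes several assumptions that the text does not state: each generated image is assigned the classifier's most likely digit, a 3StackedMNIST sample is mapped to a mode by reading its three per-channel digits as a three-digit number, and the divergence is computed as $\mathrm{KL}(p\,\|\,u)$ with natural logarithms, where $p$ is the empirical distribution of predicted modes and $u$ is uniform. The helper names are hypothetical.

```python
import math
from collections import Counter


def stacked_mode(digits):
    """Map per-channel digit predictions, e.g. (3, 0, 7), to a mode index such as 307."""
    index = 0
    for d in digits:
        index = 10 * index + d
    return index


def modes_and_kl(predicted_modes, num_modes):
    """Return (number of distinct modes found, KL(empirical || uniform))."""
    counts = Counter(predicted_modes)
    n = len(predicted_modes)
    kl = 0.0
    for count in counts.values():
        p = count / n
        # p * log(p / (1 / num_modes)); modes with zero mass contribute nothing.
        kl += p * math.log(p * num_modes)
    return len(counts), kl


# Hypothetical predictions: a generator that covers all ten digits evenly,
# and one that has collapsed onto a single digit.
balanced = [d for d in range(10) for _ in range(10)]
collapsed = [1] * 100

print(modes_and_kl(balanced, 10))
print(modes_and_kl(collapsed, 10))
print(stacked_mode((3, 0, 7)))
```

Under these assumptions the divergence is zero when all modes are equally represented and reaches $\log(\text{number of modes})$ when every sample falls into a single mode.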

B.3 CIFAR-10

We sample 32768 instances from each model. We compute the Inception Score using the implementation from https://github.com/sbarratt/inception-score-pytorch. We compute the Fréchet inception distance using the implementation from https://github.com/mseitzer/pytorch-fid.

Appendix C Architectures and hyper-parameters

C.1 Synthetic

For all methods and distributions, we use the same architecture, described in Table 4, with spectral normalization [31] on the linear layers for GAN. In all cases, we train the generator and critic with the Adam optimizer ($\beta_{1}=0.5$, $\beta_{2}=0.9$) with a constant learning rate of 0.0001, without L2 regularization or weight decay, for 128000 generator updates with a batch size of 512. We use the standard loss for GAN, enforcing class 1 for real samples and 0 for generated samples. In WGAN-GP, we use a weight of 0.1 on the gradient penalty (identified as a good value in preliminary experiments, which we do not report), and in KSGAN we use $\beta=1.0$ as the weight for the score penalty.

Table 4: Architectures for synthetic 2D datasets.
$z\in\mathbb{R}^{8}\sim\mathcal{N}(0,I)$
Linear(bias=True), $8\rightarrow 512$
ReLU
Linear(bias=True), $512\rightarrow 512$
ReLU
Linear(bias=True), $512\rightarrow 512$
ReLU
Linear(bias=True), $512\rightarrow 2$
(a) Generator
Linear(bias=True), $2\rightarrow 512$
LeakyReLU(slope=0.2)
Linear(bias=True), $512\rightarrow 512$
LeakyReLU(slope=0.2)
Linear(bias=True), $512\rightarrow 512$
LeakyReLU(slope=0.2)
Linear(bias=True), $512\rightarrow 1$
(b) Critic
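As a quick sanity check on the layer sizes in Table 4, the following sketch counts the trainable weights and biases of the fully connected layers. It only uses the input and output widths listed in the table; the helper name is hypothetical, and any buffers introduced by spectral normalization are not counted.

```python
def linear_params(n_in, n_out, bias=True):
    """Number of parameters of a fully connected layer n_in -> n_out."""
    return n_in * n_out + (n_out if bias else 0)


# (input width, output width) of each Linear layer, read off Table 4.
generator_layers = [(8, 512), (512, 512), (512, 512), (512, 2)]
critic_layers = [(2, 512), (512, 512), (512, 512), (512, 1)]

generator_params = sum(linear_params(i, o) for i, o in generator_layers)
critic_params = sum(linear_params(i, o) for i, o in critic_layers)

print(generator_params, critic_params)  # prints: 530946 527361
```

Both networks are therefore of comparable size, at roughly half a million parameters each.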

C.2 MNIST

For the MNIST experiments, we use the DCGAN [39] architecture, without batch normalization layers, with a 128-dimensional latent Gaussian distribution. For the 3StackedMNIST distribution, we increase the number of input and output channels for the critic and generator, respectively. We train the generator and critic with the Adam optimizer ($\beta_{1}=0.5$, $\beta_{2}=0.9$) with a constant learning rate of 0.0001, without L2 regularization or weight decay, for 200000 generator updates with a batch size of 50. In the case of GAN for 3StackedMNIST, we use a learning rate of 0.001 (identified as a good value in preliminary experiments, which we do not report). We use the flipped loss for GAN, enforcing class 0 for real samples and 1 for generated samples. In WGAN-GP, we use a weight of 10.0 on the gradient penalty (identified as a good value in preliminary experiments, which we do not report), and in KSGAN we use $\beta=1.0$ as the weight for the score penalty.

C.3 CIFAR-10

For the CIFAR-10 experiments, we use the ResNet architecture from Gulrajani et al. [16]. We train the generator and critic with the Adam optimizer ($\beta_{1}=0.0$, $\beta_{2}=0.9$) with a constant learning rate of 0.0001, without L2 regularization or weight decay, for 199936 generator updates with a batch size of 64. We use the flipped loss for GAN, enforcing class 0 for real samples and 1 for generated samples. In WGAN-GP, we use a weight of 10.0 on the gradient penalty (identified as a good value in preliminary experiments, which we do not report), and in KSGAN we use $\beta=1.0$ as the weight for the score penalty.

Appendix D Extended results

In this section, we report additional experiment results that did not fit in the main text. This includes materials allowing a qualitative comparison of the trained models.

D.1 Synthetic data

In Figure 2, we report a study of the quality of the trained models, as measured by the squared population MMD, extended relative to Table 2 in the main text. Solid lines denote the average over five random initializations, and the shaded area represents the two-$\sigma$ interval. KSGAN performs on par with WGAN-GP while being trained with a five times smaller training budget. In Figure 3, we show histograms of 65536 samples from the models (a single random initialization), with a histogram of the test data in the first column for reference. For KSGAN, in addition to the configurations included in Table 2, we include one with a training budget matching that of GAN and WGAN-GP, and one with the training budget reduced by a factor of two, where the critic is updated only on every second update of the generator.

Figure 2: Squared population MMD between approximate and test distribution as a function of the number of training instances. Solid lines denote the average over five random initializations, and the shaded area represents the two-$\sigma$ interval. Best viewed in color.
Figure 3: Histograms of samples from distributions denoted on the top. Heatmap colors are shared for all figures in each row. Best viewed in color.

D.2 MNIST

In Figure 4, we show samples from one of the random initializations reported in Table 2 in the main text. All models demonstrate similar sample quality, while for GAN, the digit "1" is over-represented, which is consistent with the high KL divergence in Table 2.

(a) GAN (1, 1)
(b) WGAN-GP (1, 1)
(c) KSGAN (1, 1)
Figure 4: Samples from the respective models trained on the MNIST dataset.

D.3 CIFAR-10

In Figure 5, we show samples from one of the random initializations reported in Table 3 in the main text. All models demonstrate similar, low sample quality.

(a) GAN (1, 1)
(b) WGAN-GP (1, 1)
(c) KSGAN (1, 1)
Figure 5: Samples from the respective models trained on the CIFAR-10 dataset. Best viewed in color.