Maciej Falkiewicz¹,² Naoya Takeishi³,⁴ Alexandros Kalousis²
¹Computer Science Department, University of Geneva ²HES-SO/HEG Genève
³The University of Tokyo ⁴RIKEN
{maciej.falkiewicz, alexandros.kalousis}@hesge.ch
[email protected]


Kolmogorov–Smirnov GAN

Abstract

We propose a novel deep generative model, the Kolmogorov-Smirnov Generative Adversarial Network (KSGAN). Unlike existing approaches, KSGAN formulates the learning process as a minimization of the Kolmogorov-Smirnov (KS) distance, generalized to handle multivariate distributions. This distance is calculated using the quantile function, which acts as the critic in the adversarial training process. We formally demonstrate that minimizing the KS distance leads to the trained approximate distribution aligning with the target distribution. We propose an efficient implementation and evaluate its effectiveness through experiments. The results show that KSGAN performs on par with existing adversarial methods, exhibiting stability during training, resistance to mode dropping and collapse, and tolerance to variations in hyperparameter settings. Additionally, we review the literature on the Generalized KS test and discuss the connections between KSGAN and existing adversarial generative models.

Figure 1: A schematic depiction of how the Generalized Kolmogorov-Smirnov (KS) distance between target $\mathds{P}_{F}$ and approximate $\mathds{P}_{G}$ distributions with respect to critic $c_{\phi}$ is computed. The critic is evaluated on samples $x_{F}$ ( $\color[rgb]{0.5390625,0.703125,0.3203125}\boldsymbol{|}$ ) and $x_{G}$ ( $\color[rgb]{0.109375,0.44140625,0.65234375}\boldsymbol{|}$ ) from the target and approximate distributions respectively. The $\lambda$ threshold moves from $-\infty$ to $+\infty$ establishing a stack of level sets. At each level, the fraction of datapoints ( $\color[rgb]{0.5390625,0.703125,0.3203125}\bullet$ and $\color[rgb]{0.109375,0.44140625,0.65234375}\bullet$ ) below the threshold is calculated for each distribution independently. This produces the $\mathds{P}_{F}\left(\Gamma_{c_{\phi}}(\lambda)\right)$ and $\mathds{P}_{G}\left(\Gamma_{c_{\phi}}(\lambda)\right)$ curves. The Generalized KS distance is the largest absolute difference between the curves shown as $\color[rgb]{0.86328125,0.078125,0.234375}\boldsymbol{\updownarrow}$ in the right figure. Best viewed in color.

1 Introduction

Generative modeling is about fitting a model to a target distribution, usually the data. A fundamental taxonomy of models assigns them into prescribed and implicit statistical models [9], with partial overlap between the two classes. Prescribed models directly parameterize the distribution’s probability density function, while implicit models parameterize the generator that allows samples to be drawn from the distribution. The ultimate application of the model primarily dictates the choice between the two approaches. It does, however, have consequences regarding the available types of divergences that we can minimize when fitting the model. The divergences differ in the stability of optimization and computational efficiency, as well as statistical efficiency, which all affect the final performance of the model.

The natural approach for fitting a prescribed model is maximum likelihood estimation (MLE), equivalently formulated as minimization of Kullback–Leibler divergence. Likelihood evaluation for normalized models is straightforward. In non-normalized models, density evaluation is expensive; in this context, Hyvärinen [22] proposed the score matching objective, which can be interpreted as the Fisher divergence [30]. This approach is very effective for simulation-free training of ODE[7]/SDE[42, 19]-based models which are state-of-the-art in multiple domains today.

The principle driving the fitting of implicit statistical models is to push the model to generate samples that are indistinguishable from the target. An inflection point for this family of models came with the Generative Adversarial Network (GAN) [13], which took the principle literally and introduced an auxiliary classifier trained in an adversarial process to discriminate between the two distributions. The classification error given an optimal classifier relates to the Jensen–Shannon divergence between the generator and the target. Initial work in this area involved applying heuristic tricks to deal with learning problems, namely vanishing gradients, unstable training, and mode dropping or collapse. Further advancements focused on using other distances based on the principle of adversarial learning of auxiliary models, which were supposed to have certain favorable properties with respect to the original GAN.

The Bayesian inference community has been reluctant to adopt adversarial methods [8], and the attempts to apply them in this context [40] indicate a credibility problem. A significant drawback of approximate methods is the excessive reduction of diversity in the distribution [17], the extremes of which lead to mode dropping [1]. In this work, we consider another distance for training implicit statistical models, i.e., the Kolmogorov-Smirnov (KS) distance, which, to the best of our knowledge, has not been used in this context before. The distinctive feature of the KS distance is that it directly measures the coverage discrepancy of each other’s credibility regions by the distributions under analysis at all confidence levels. Thus, its minimization straightforwardly leads to the correct spread of the probability mass, avoiding mode dropping, overconfidence, and mode collapse when applied with a sufficient sampling budget.

We term the proposed model the Kolmogorov-Smirnov Generative Adversarial Network (KSGAN). In section 2, we show how to generalize the standard KS distance to higher dimensions based on Polonik [38], allowing our method to be used for multidimensional distributions. Next, in section 3, we show how to efficiently leverage the distance in an adversarial training process and show formally that the proposed algorithm leads to an alignment of the approximate and target distributions. We support the theoretical findings with empirical results presented in section 6.

2 Generalized Kolmogorov–Smirnov distance

We generalize the Kolmogorov–Smirnov (KS) distance (sometimes called simply the Kolmogorov distance) between continuous probability distributions on one-dimensional spaces to multidimensional spaces and show that it is a metric. The test statistic of the KS test is the KS distance between the empirical and target distributions (or between two empirical distributions in the two-sample case). For this reason, our proposal is directly inspired by the generalization of the test introduced in Polonik [38].

Let us consider two probability measures $\mathds{P}_{F}$ and $\mathds{P}_{G}$ on a measurable space $(\mathcal{X},\mathcal{A})$, where the sample space $\mathcal{X}$ is a vector space such as $I\!\!R^{d}$ and $\mathcal{A}$ is the corresponding event space; $F:\mathcal{X}\rightarrow[0,1]$ and $G:\mathcal{X}\rightarrow[0,1]$ are the cumulative distribution functions (CDFs) of $\mathds{P}_{F}$ and $\mathds{P}_{G}$ respectively.\footnote{In what follows, we use $\mathds{P}_{F}$ for the true data distribution and $\mathds{P}_{G}$ for the learnt one.} We say that $\mathds{P}_{F}=\mathds{P}_{G}$ iff $\forall\ A\in\mathcal{A},\ \mathds{P}_{F}(A)=\mathds{P}_{G}(A)$. When $\dim(\mathcal{X})=1$, the KS distance is

D_{\mathrm{KS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right):=\sup_{x\in\mathcal{X}}|F(x)-G(x)|.

(1)
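For finite samples, the supremum in eq. 1 is attained at one of the pooled sample points, so the empirical distance can be computed exactly. The following is a minimal NumPy sketch of that computation; the function name `ks_distance` is our own and is not taken from the paper.

```python
import numpy as np

def ks_distance(x_f, x_g):
    """Two-sample KS distance: max over pooled points of |F_n(x) - G_m(x)|."""
    x_f = np.sort(np.asarray(x_f, dtype=float))
    x_g = np.sort(np.asarray(x_g, dtype=float))
    grid = np.concatenate([x_f, x_g])
    # Right-continuous empirical CDFs evaluated at every pooled sample point.
    cdf_f = np.searchsorted(x_f, grid, side="right") / x_f.size
    cdf_g = np.searchsorted(x_g, grid, side="right") / x_g.size
    return float(np.max(np.abs(cdf_f - cdf_g)))
```

Identical samples give a distance of 0, and samples with disjoint, separated ranges give a distance of 1.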

In the multivariate case, the problem with using the KS distance as is is that on a $d$ -dimensional space, there are $2^{d}-1$ ways of defining a CDF. The distance has to be independent of the particular definition and thus should be the largest across all the possibilities [35]. This, however, becomes prohibitive for any $d>2$ . In other words, the challenge comes from a multidimensional vector space not being a partially ordered set. Everything that follows in this section consists of proposing a partial order, showing that, under certain conditions, a probability distribution can be uniquely determined on its basis and operationalizing it in an optimization problem.

We begin by recalling the classical result that

D_{\mathrm{KS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)=\sup_{\alpha\in[0,1]}|F(G^{-1}(\alpha))-\alpha|,

(2)

where $G^{-1}:[0,1]\rightarrow\mathcal{X}$ is the inverse CDF also called the quantile function. Einmahl and Mason [10] show that there exists a natural generalization of the quantile function to multivariate distribution, which we restate below.

Definition 1 (Generalized Quantile Function).

Let $\operatorname*{v}:\mathcal{A}\rightarrow I\!\!R_{+}$ be a measure, and $\mathcal{C}\subset\mathcal{A}$ an arbitrary subset of the event space, then a function $C_{\mathds{P},\mathcal{C}}(\alpha):[0,1]\rightarrow\mathcal{C}$ such that

C_{\mathds{P},\mathcal{C}}(\alpha)\in\operatorname*{arg\,min}_{C\in\mathcal{C}}\{\operatorname*{v}(C):\mathds{P}(C)\geqslant\alpha\}

(3)

is called the generalized quantile function in $\mathcal{C}$ for $\mathds{P}$ with respect to $\operatorname*{v}$.\footnote{In the general case, $C_{\mathds{P},\mathcal{C}}(\alpha)$ at any given level $\alpha$ is not uniquely determined, i.e. there may exist several sets $C,C^{\prime}\in\mathcal{C}\textrm{ s.t. }C\neq C^{\prime}$ that satisfy the condition in eq. 3. For simplicity, we will call all such sets the (generalized) quantile sets at level $\alpha$ and write $C_{\mathds{P},\mathcal{C}}(\alpha)=C$ and $C_{\mathds{P},\mathcal{C}}(\alpha)=C^{\prime}$ for all of them.}

The generalized quantile function evaluated at level $\alpha$ yields a minimum-volume set [36] whose probability is at least $\alpha$ , and it is the smallest with respect to $\operatorname*{v}$ such set in $\mathcal{C}$ , thus the name. For the remainder of this paper, we assume that $\operatorname*{v}$ is the Lebesgue measure.
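To make definition 1 concrete, the sketch below computes an empirical generalized quantile in one dimension. It rests on assumptions of ours: the class $\mathcal{C}$ is restricted to closed intervals with endpoints at sample points, $\mathds{P}$ is replaced by the empirical distribution, and the volume is the interval length. The function name `shortest_interval` is hypothetical.

```python
import math
import numpy as np

def shortest_interval(samples, alpha):
    """Shortest interval [lo, hi] covering at least a fraction alpha of the samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = max(1, math.ceil(alpha * n))      # number of points the interval must cover
    widths = x[k - 1:] - x[: n - k + 1]   # length of every window of k consecutive points
    i = int(np.argmin(widths))            # minimum-volume window
    return float(x[i]), float(x[i + k - 1])
```

For example, `shortest_interval([0, 1, 1.1, 1.2, 5], 0.5)` selects the tight cluster around 1 rather than an interval anchored at the smallest sample.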

It may seem that it is enough to plug $C_{\mathds{P}_{G},\mathcal{C}}(\alpha)$ in place of $G^{-1}(\alpha)$ and $\mathds{P}_{F}$ in place of $F$ in eq. 2 to establish the Generalized KS distance but it turns out that such a distance does not satisfy the positivity condition $D_{\mathrm{KS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)>0$ if $\mathds{P}_{F}\neq\mathds{P}_{G}$ as the example below shows.

Example 1 (Polonik [38]).

Let $\mathds{P}_{F}$ be the probability measure of a chi distribution with one degree of freedom $\sqrt{\chi_{1}^{2}}$ which has support on $I\!\!R_{+}$ and $\mathds{P}_{G}$ the probability measure of a standard Gaussian distribution $\mathcal{N}(0,1)$ which has support on the whole $I\!\!R$ . Given $\mathcal{C}=\mathcal{A}$ we have

\mathds{P}_{F}(C_{\mathds{P}_{G},\mathcal{C}}(\alpha))=\alpha\ \forall\alpha\in[0,1],

(4)

while clearly $\mathds{P}_{F}\neq\mathds{P}_{G}$ . The statement in eq. 4 is easy to show by observing that $\forall x\in[0,\infty)$ the density of $\mathds{P}_{F}$ is twice the density of $\mathds{P}_{G}$ and $C_{\mathds{P}_{G},\mathcal{C}}(\alpha)$ are intervals centered at 0.
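The claim in eq. 4 can be checked numerically. The sketch below compares the mass that each distribution assigns to a centered interval $[-q,q]$, using the closed form for the Gaussian and trapezoidal integration of the chi density $2\varphi(x)$ on $[0,q]$. The function names and the step count are our own choices.

```python
import math

def gaussian_mass(q):
    """P_G([-q, q]) for the standard Gaussian, in closed form."""
    return math.erf(q / math.sqrt(2.0))

def chi_mass(q, steps=20000):
    """P_F([-q, q]) for the chi distribution with one degree of freedom.

    Its density is 2 * phi(x) on [0, inf) and 0 elsewhere, so only [0, q] contributes.
    """
    def dens(x):
        return 2.0 * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

    h = q / steps
    total = 0.5 * (dens(0.0) + dens(q))
    total += sum(dens(i * h) for i in range(1, steps))
    return total * h  # trapezoidal rule

for q in (0.5, 1.0, 2.0):
    print(q, gaussian_mass(q), chi_mass(q))
```

The two masses agree up to integration error for every $q$, even though the CDFs differ by 0.5 at $x=0$, where $G(0)=0.5$ and $F(0)=0$.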

Instead, a solution based on the quantile functions of both distributions is needed, which we present in definition 2.

Definition 2 (Generalized Kolmogorov-Smirnov distance).

Let the Generalized Kolmogorov-Smirnov distance be formulated as follows:

D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right):=\sup_{\begin{subarray}{c}\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{G},\mathcal{C}},C_{\mathds{P}_{F},\mathcal{C}}\}\end{subarray}}\left[|\mathds{P}_{F}(C(\alpha))-\mathds{P}_{G}(C(\alpha))|\right].

(5)

This distance is symmetric and satisfies the triangle inequality, as shown in section A.1. For the remainder of this section, we will show that the Generalized KS distance in eq. 5 meets the necessary $D_{\mathrm{GKS}}\left(\mathds{P},\mathds{P}\right)=0$ and sufficient $D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)>0$ if $\mathds{P}_{F}\neq\mathds{P}_{G}$ conditions to consider it a metric. In the proof, we will rely on the probability density function of $\mathds{P}$ with respect to a reference measure $\operatorname*{v}$, which we denote with $p:\mathcal{X}\rightarrow[0,\infty)$. Let

\Gamma_{p}(\lambda):=\{x:p(x)\geqslant\lambda\}

(6)

denote the density level set of $p$ at level $\lambda\geqslant 0$ (also called the highest density region [21]), and let $\Pi_{p}:=\{\Gamma_{p}(\lambda):\lambda\geqslant 0\}$ . The following observations about level sets will introduce the fundamental tools to prove the necessary and sufficient conditions for the generalized KS distance.

Remark 1 (The silhouette [37]).

For any density $p$ , the following holds

p(x)=\int_{0}^{\infty}\mathds{1}_{\Gamma_{p}(\lambda)}(x)\mathrm{d}\lambda,

(7)

where $\mathds{1}_{C}$ denotes the indicator function of a set $C$ . The RHS of eq. 7 is called the silhouette.

An immediate consequence of remark 1 is that $\Pi_{p}$ ordered with respect to $\lambda\geqslant 0$ fully characterizes $\mathds{P}$ , because $p$ does. Graphically, the silhouette is a multidimensional stack of level sets.

Remark 2.

Density level sets are minimum-volume sets [38]. The quantity $\mathds{P}(C)-\lambda\operatorname*{v}(C)$ is maximized over $\mathcal{A}$ by $\Gamma_{p}(\lambda)$, and thus if $\Gamma_{p}(\lambda)\in\mathcal{C}$, then $\Gamma_{p}(\lambda)=C_{\mathds{P},\mathcal{C}}(\alpha)$\footnote{There may be other sets $C=C_{\mathds{P},\mathcal{C}}(\alpha)$ but $\Gamma_{p}(\lambda)$ will certainly be one of them.} at level $\alpha=\mathds{P}(\Gamma_{p}(\lambda))=\int p(x)\mathds{1}_{[\lambda,\infty)}(p(x))\mathrm{d}x$.

Below, we present the fundamental theoretical result behind the proposed method, which restates Lemma 1.2. of Polonik [38].

Theorem 1 (Necessary and sufficient conditions).

Let $\operatorname*{v}$ be a measure on $(\mathcal{X},\mathcal{A})$. Suppose that $\mathds{P}_{F}$ and $\mathds{P}_{G}$ are probability measures on $(\mathcal{X},\mathcal{A})$ with densities (with reference measure $\operatorname*{v}$) $f$ and $g$ respectively. Assuming that

A.1 $\Pi_{f}\cup\Pi_{g}\subset\mathcal{C}$;

A.2 $C_{\mathds{P}_{F},\mathcal{C}}(\alpha)$ and $C_{\mathds{P}_{G},\mathcal{C}}(\alpha)$ are uniquely determined\footnote{In the sense defined in Polonik [38].} in $\mathcal{C}$ with respect to $\operatorname*{v}$,

the following two statements are equivalent:

S.1 $\mathds{P}_{F}=\mathds{P}_{G}$;

S.2 $D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)=0$.

Proof.

The S.1 $\implies$ S.2 direction is trivial to show and works without satisfying the assumptions [38]. Therefore, we focus on showing that S.2 $\implies$ S.1. Let

S_{\mathcal{C}}(\mathds{P})=\{(\operatorname*{v}(C),\mathds{P}(C)):C\in\mathcal{C}\}\subset I\!\!R_{+}\times[0,1],

(8)

and denote with $\Gamma(\lambda)$ the level set of the density of $\mathds{P}$ as defined in eq. 6, and let $\Pi:=\{\Gamma(\lambda):\lambda\geqslant 0\}$. Further, let $\tilde{S}_{\mathcal{C}}$ denote the least concave majorant [5] to $S_{\mathcal{C}}(\mathds{P})$, that is, the smallest concave function from $I\!\!R_{+}$ to $[0,1]$ lying above $S_{\mathcal{C}}(\mathds{P})$. $\tilde{S}_{\mathcal{C}}$ is supported on the generalized quantiles of $\mathds{P}$ in $\mathcal{C}$, i.e. on the points $(\operatorname*{v}(C_{\mathds{P},\mathcal{C}}(\alpha)),\mathds{P}(C_{\mathds{P},\mathcal{C}}(\alpha)))$. Finally, let $\partial\tilde{S}_{\mathcal{C}}(\mathds{P})$ be the intersection of the extremal points of the convex hull of $S_{\mathcal{C}}(\mathds{P})$ with the graph of $\tilde{S}_{\mathcal{C}}$. Given $\Pi\subset\mathcal{C}$, which we assume in A.1 for $\mathds{P}_{F}$ and $\mathds{P}_{G}$, and in the light of remark 2, we have that for any set $C$ such that $(\operatorname*{v}(C),\mathds{P}(C))\in\partial\tilde{S}_{\mathcal{C}}(\mathds{P})$ there is a level $\lambda$ for which $C=\Gamma(\lambda)$, and $\lambda$ is equal to the left-hand derivative of $\tilde{S}_{\mathcal{C}}$ at the point $\operatorname*{v}(C)$. From remark 1, we have that the silhouette fully characterizes $\mathds{P}$, and therefore $\partial\tilde{S}_{\mathcal{C}}(\mathds{P})$ does so as well.

Finally, we conclude the proof with the observation that given S.2, under Lemma 2.1 of Polonik [38] (where A.2 is utilized), the extremal points of the convex hulls of $S_{\mathcal{C}}(\mathds{P}_{F})$ and $S_{\mathcal{C}}(\mathds{P}_{G})$ are the same points, thus $\partial\tilde{S}_{\mathcal{C}}(\mathds{P}_{F})=\partial\tilde{S}_{\mathcal{C}}(\mathds{P}_{G})$, and hence $\mathds{P}_{F}=\mathds{P}_{G}$.

Meeting assumption A.1 is a demanding challenge, almost equivalent to learning the target distribution. Below, we propose a relaxation of it, which we will use to show the validity of our method.

Theorem 2 (Relaxation of assumption A.1).

Remark 2 holds if assumption A.1 is relaxed to the case that $\mathcal{C}$ contains sets that are uniquely determined with density level sets of $\mathds{P}_{F}$ and $\mathds{P}_{G}$ up to a set $C$ such that

\forall_{C^{\prime}\in 2^{C}}\ \mathds{P}_{F}(C^{\prime})=\mathds{P}_{G}(C^{\prime}),

(9)

and let $r:=\mathds{P}_{F}(C)=\mathds{P}_{G}(C)$ , then the supremum in statement S.2 is restricted to $[0,1-r]$ .

Proof.

The statement in eq. 9 is equivalent to saying that $\mathds{P}_{F}=\mathds{P}_{G}$ on $(C,2^{C})$ . Analogously to the proof of remark 2 we can show that $\mathds{P}_{F}=\mathds{P}_{G}$ on $(\mathcal{X}\setminus C,2^{\mathcal{X}\setminus C})$ . By observing that probability measures are $\sigma$ -additive, we conclude that $\mathds{P}_{F}=\mathds{P}_{G}$ on $(\mathcal{X},\mathcal{A})$ , and thus the result of remark 2 holds.

3 Kolmogorov–Smirnov GAN

For the remainder of the paper, we will consider $\mathds{P}_{F}$ as the target distribution represented by a dataset $\{x_{F}\}$ , and $\mathds{P}_{G}$ as the approximate distribution that we want to train by minimizing the Generalized KS distance in eq. 5 with Stochastic Gradient Descent. We model $\mathds{P}_{G}$ as a pushforward ${g_{\theta}}_{\#}\mathds{P}_{Z}$ of a simple (e.g., Gaussian, or Uniform) latent distribution $\mathds{P}_{Z}$ supported on $\mathcal{Z}$ , with a neural network $g_{\theta}:\mathcal{Z}\rightarrow\mathcal{X}$ , parameterized with $\theta$ , which we call the generator.

The major challenge in utilizing eq. 5 is the necessity of finding the $C_{\mathds{P},\mathcal{C}}(\alpha)$ terms which is an optimization problem on its own. The idea that we propose in this work is to amortize the procedure by modeling the generalized quantile functions $C_{\mathds{P}_{F},\mathcal{C}}(\alpha)$ and $C_{\mathds{P}_{G},\mathcal{C}}(\alpha)$ with additional neural networks which have to be trained in parallel to the generator $g_{\theta}$ . Therefore, our method is based on adversarial training [13], where optimization proceeds in alternating phases of minimization and maximization for different sets of parameters. Hence the name of the proposed method, the Kolmogorov–Smirnov Generative Adversarial Network.

3.1 Neural Quantile Function

The generalized quantile function defined in definition 1 is an infinite-dimensional vector function $C_{\mathds{P},\mathcal{C}}:[0,1]\rightarrow C\in\mathcal{C}$ . Such objects do not have an expressive, explicit representation that allows for gradient-based optimization. Therefore, we use an implicit representation inspired by density level sets in eq. 6. We propose to use neural level sets defined in definition 3 that are modeled by a neural network $c:\mathcal{X}\rightarrow I\!\!R$ , which we will refer to as the critic.

Definition 3 (Neural level set).

Given a neural network $c:\mathcal{X}\rightarrow I\!\!R$, the neural level set at level $\lambda$ is defined as\footnote{Please note that the direction of the inequality in eq. 10 is opposite to the one in eq. 6, which is a convention that aligns the critic with the energy function of Energy-Based models.}

\Gamma_{c}(\lambda):=\{x:c(x)\leqslant\lambda\},\text{ and let }\Pi_{c}:=\{\Gamma_{c}(\lambda):\lambda\in I\!\!R\}.

(10)

Neural level sets are used, for example, in image segmentation [6, 20] and surface reconstruction from point clouds [3]. They fit our application because for computing the Generalized KS distance in eq. 5, the explicit materialization of generalized quantiles is not required as long as the probability measure can be efficiently evaluated on the implicitly specified sets. We set $\mathcal{C}=\Pi_{c}$, and thus $C_{\mathds{P},\Pi_{c}}(\alpha)=\Gamma_{c}(\lambda_{\alpha})$, with $\lambda_{\alpha}=\operatorname*{arg\,min}_{\lambda\in I\!\!R}\{\lambda:\mathds{P}(\Gamma_{c}(\lambda))\geqslant\alpha\}$. For a probability measure $\mathds{P}^{\prime}$ the following holds:

\mathds{P}^{\prime}\left(C_{\mathds{P},\Pi_{c}}(\alpha)\right)=\mathbb{E}_{x\sim\mathds{P}^{\prime}}\left[\mathds{1}_{(-\infty,\lambda_{\alpha}]}(c(x))\right],

(11)

which shows that the terms in eq. 5 under neural level sets can be Monte-Carlo estimated given samples from the respective distributions. Assumption A.2 is satisfied by neural level sets by construction.
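The Monte Carlo estimate suggested by eq. 11 can be sketched as follows. We assume that the critic has already been evaluated on samples from both distributions, and we take $\lambda_{\alpha}$ to be the empirical $\alpha$-quantile of the critic values under $\mathds{P}$. The function name `coverage` and its arguments are hypothetical.

```python
import math
import numpy as np

def coverage(c_ref, c_other, alpha):
    """Estimate P'(C_{P,Pi_c}(alpha)) from critic values.

    c_ref   : critic values c(x) for samples x ~ P (they define the quantile set)
    c_other : critic values c(x) for samples x ~ P'
    """
    c_ref = np.sort(np.asarray(c_ref, dtype=float))
    k = max(1, math.ceil(alpha * c_ref.size))
    lam_alpha = c_ref[k - 1]  # smallest threshold with empirical P(Gamma_c(lam)) >= alpha
    # Fraction of P' samples that fall inside the neural level set Gamma_c(lam_alpha).
    return float(np.mean(np.asarray(c_other, dtype=float) <= lam_alpha))
```

When both arguments hold the same critic values and those values are distinct, the estimate equals the empirical level $k/n$ with $k=\lceil\alpha n\rceil$, which coincides with $\alpha$ whenever $\alpha n$ is an integer.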

The formulation of the Generalized KS distance in eq. 5 includes two generalized quantile functions $C_{\mathds{P}_{F},\mathcal{C}}(\alpha)$ corresponding to target distribution $\mathds{P}_{F}$ and $C_{\mathds{P}_{G},\mathcal{C}}(\alpha)$ corresponding to the approximate distribution $\mathds{P}_{G}$ . Both have to be modeled with the respective neural networks $c_{\phi_{F}}$ and $c_{\phi_{G}}$ , where we use $\phi=\{\phi_{F},\phi_{G}\}$ to denote the joint set of their parameters. In section 3.3, we show how to parameterize both critics with a single neural network. We set $\mathcal{C}=\Pi_{c_{\phi_{F}}}\cup\Pi_{c_{\phi_{G}}}$ .

3.2 Optimizing generator’s parameters $\theta$

The Generalized KS distance in eq. 5 is a supremum over a unit interval and two functions; thus, it can be upper-bounded as

D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)\leqslant\sum_{C\in\{C_{\mathds{P}_{G},\mathcal{C}},C_{\mathds{P}_{F},\mathcal{C}}\}}\sup_{\alpha\in[0,1]}\left[|\mathds{P}_{F}(C(\alpha))-\mathds{P}_{G}(C(\alpha))|\right].

(12)

Next, we plug in $\mathcal{C}=\Pi_{c_{\phi_{F}}}\cup\Pi_{c_{\phi_{G}}}$ to eq. 12 and use eq. 11 to get generator’s objective:

\mathcal{L}_{g}=\sum_{c_{\phi}\in\{c_{\phi_{G}},c_{\phi_{F}}\}}\sup_{\lambda\in I\!\!R}\left[|\mathbb{E}_{x\sim\mathds{P}_{F}}\left[\mathds{1}_{(-\infty,\lambda]}(c_{\phi}(x))\right]-\mathbb{E}_{x\sim\mathds{P}_{G}}\left[\mathds{1}_{(-\infty,\lambda]}(c_{\phi}(x))\right]|\right].

(13)

In practice, the expectations in eq. 13 are estimated on finite samples from the two distributions, i.e. $\{x_{F}\}$ mentioned before, and $\{x_{G}\}$ sampled from the approximate distribution $\mathds{P}_{G}$ using the reparametrization trick to facilitate backpropagation of gradients. Therefore, the two terms become step functions in $\lambda$, and the supremum is located on one of the steps. That way, a line search on $I\!\!R$ reduces to a maximum over a finite set. To preserve the differentiability of the cost function calculated in this way, we apply the Straight-through Estimator [4] in place of the indicator function $\mathds{1}$. A schematic depiction of the process for a single critic is shown in fig. 1.
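The reduction of the line search to a finite maximum can be illustrated with a short NumPy sketch. The critic used here (the squared Euclidean norm), the sample sizes, and all function names are assumptions made for illustration. Only the forward value is computed, and the Straight-through Estimator is not shown.

```python
import numpy as np

def gap(c_f, c_g, lam):
    """Absolute difference of the two empirical fractions below threshold lam."""
    return abs(np.mean(c_f <= lam) - np.mean(c_g <= lam))

def gks_finite(c_f, c_g):
    """Supremum over lambda, taken as a maximum over the pooled critic values."""
    return max(gap(c_f, c_g, lam) for lam in np.concatenate([c_f, c_g]))

def gks_line_search(c_f, c_g, num=2001):
    """Brute-force search over a dense grid of thresholds, for comparison only."""
    lo = min(c_f.min(), c_g.min()) - 1.0
    hi = max(c_f.max(), c_g.max()) + 1.0
    return max(gap(c_f, c_g, lam) for lam in np.linspace(lo, hi, num))

def critic(x):
    """Hypothetical critic: squared Euclidean norm of each row."""
    return np.sum(x ** 2, axis=1)

rng = np.random.default_rng(0)
c_f = critic(rng.normal(size=(64, 2)))        # stand-in for target samples
c_g = critic(2.0 * rng.normal(size=(64, 2)))  # stand-in for generated samples
print(gks_finite(c_f, c_g), gks_line_search(c_f, c_g))
```

Because both empirical fractions are right-continuous step functions that only change at pooled critic values, the grid search can never exceed the finite maximum.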

3.3 Optimizing critics’ parameters $\phi$

By optimizing the critics’ parameters $\phi$, we want to satisfy assumption A.1 so that the Generalized KS distance becomes a metric. For the problem posed in such a way, we lack supervision, i.e., we do not know the target sets’ shapes. However, we can reformulate the problem as an estimation of the density functions of the two considered measures $\mathds{P}_{F}$ and $\mathds{P}_{G}$ and use the obtained approximate density models to build level sets. We can constitute an optimization problem for such a task based solely on finite sets of samples, which we have for $\mathds{P}_{F}$ and can arbitrarily generate from $\mathds{P}_{G}$. As the estimator, we propose to use the Energy-based model (EBM) [43], which, thanks to the lack of constraints in the choice of architecture, can be very expressive while having favorable computational complexity at inference. To carry out EBM training effectively, we will introduce a new min-max game, the “min phase” of which will turn out to be the initial objective in eq. 5, and in this way, we will close the adversarial cycle.

Let the critic $c_{\phi_{F}}(x)$ serve as the energy function. The density given by the EBM is then $p_{c_{\phi_{F}}}(x)=\exp(-c_{\phi_{F}}(x))/Z_{c_{\phi_{F}}}$ , where $Z_{c_{\phi_{F}}}=\int\exp(-c_{\phi_{F}}(x))\mathrm{d}x$ is the normalizing constant called partition function. The standard technique for learning the model given target data distribution $\mathds{P}_{F}$ is MLE, where the likelihood

\mathbb{E}_{x\sim\mathds{P}_{F}}[\log p_{c_{\phi_{F}}}(x)]=\mathbb{E}_{x\sim\mathds{P}_{F}}[-c_{\phi_{F}}(x)]-\log Z_{c_{\phi_{F}}}

(14)

is maximized wrt $\phi_{F}$ . An unbiased estimate of the gradient of the second term can be obtained with samples from the EBM itself, typically achieved with MCMC sampling. Many approaches to avoid this expensive procedure have been described in the literature [43], and among them, the one based on adversarial training [23] is the most appealing to us. It introduces an auxiliary distribution $\mathds{P}_{aux(F)}$ , such that the gradient of eq. 14 wrt $\phi_{F}$ is approximated with the gradient of

\mathbb{E}_{x\sim\mathds{P}_{F}}[-c_{\phi_{F}}(x)]-\mathbb{E}_{x\sim\mathds{P}_{aux(F)}}[-c_{\phi_{F}}(x)].

(15)
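The link between eq. 14 and eq. 15 rests on the standard identity for the gradient of the log-partition function. The short derivation below assumes that differentiation and integration can be exchanged, and writes $\mathds{P}_{c_{\phi_{F}}}$ for the distribution with density $p_{c_{\phi_{F}}}$.

```latex
\begin{aligned}
\nabla_{\phi_{F}}\log Z_{c_{\phi_{F}}}
  &=\frac{1}{Z_{c_{\phi_{F}}}}\int\nabla_{\phi_{F}}\exp(-c_{\phi_{F}}(x))\,\mathrm{d}x
   =\int\frac{\exp(-c_{\phi_{F}}(x))}{Z_{c_{\phi_{F}}}}\left(-\nabla_{\phi_{F}}c_{\phi_{F}}(x)\right)\mathrm{d}x
   =\mathbb{E}_{x\sim\mathds{P}_{c_{\phi_{F}}}}\left[-\nabla_{\phi_{F}}c_{\phi_{F}}(x)\right],\\
\nabla_{\phi_{F}}\,\mathbb{E}_{x\sim\mathds{P}_{F}}[\log p_{c_{\phi_{F}}}(x)]
  &=\mathbb{E}_{x\sim\mathds{P}_{F}}\left[-\nabla_{\phi_{F}}c_{\phi_{F}}(x)\right]
   -\mathbb{E}_{x\sim\mathds{P}_{c_{\phi_{F}}}}\left[-\nabla_{\phi_{F}}c_{\phi_{F}}(x)\right].
\end{aligned}
```

Replacing samples from $\mathds{P}_{c_{\phi_{F}}}$ with samples from $\mathds{P}_{aux(F)}$, treated as fixed with respect to $\phi_{F}$, yields the gradient of eq. 15.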

Consequently, an additional objective $\mathcal{L}_{aux(F)}$ must be introduced, the optimization of which will lead to the alignment of $\mathds{P}_{aux(F)}$ and $\mathds{P}_{c_{\phi_{F}}}$ , where $\mathds{P}_{c_{\phi_{F}}}$ denotes the probability distribution with density $p_{c_{\phi_{F}}}(x)$ . We take an analogous approach to estimate $c_{\phi_{G}}(x)$ .

When we (i) set $c_{\phi_{G}}(x):=-c_{\phi_{F}}(x)$, and (ii) repurpose $\mathds{P}_{G}$ as $\mathds{P}_{aux(F)}$ and $\mathds{P}_{F}$ as $\mathds{P}_{aux(G)}$, we show in the appendix that the MLE objectives for the critics (now denoted as $c_{\phi}$) simplify to $\mathcal{L}_{c}=\mathbb{E}_{x\sim\mathds{P}_{G}}[c_{\phi}(x)]-\mathbb{E}_{x\sim\mathds{P}_{F}}[c_{\phi}(x)]$, which is then maximized in an adversarial game against the Generalized KS distance in eq. 5.

The standard approach for aligning the auxiliary distributions with their targets is to use the Kullback–Leibler divergence. We propose using the Generalized KS distance instead. We set $\mathcal{L}_{aux(F)}=D_{\mathrm{GKS}}\left(\mathds{P}_{G},\mathds{P}_{c_{\phi}}\right)$ and $\mathcal{L}_{aux(G)}=D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{-c_{\phi}}\right)$. By analyzing these objectives in the fashion of section 3.2, we note that $\mathcal{L}_{aux(F)}$ is the same as our original objective $D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)$, which is symmetric, when we approximate sampling from $\mathds{P}_{c_{\phi}}$ with the target distribution $\mathds{P}_{F}$. The same holds for $\mathcal{L}_{aux(G)}$, where sampling from $\mathds{P}_{-c_{\phi}}$ is approximated with $\mathds{P}_{G}$. Therefore, the auxiliary objectives are already integrated into the adversarial game.

In practice, we find the score penalty regularizer of Kumar et al. [26], derived from the score matching objective, helpful to stabilize training. Therefore, we subtract it from $\mathcal{L}_{c}$ weighted by a hyperparameter $\beta$ . In this way, we get a critic that is smoother and, therefore, generates regular level sets that facilitate optimization. We summarize the proposed training procedure in algorithm 1.
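The regularized critic objective $\mathcal{L}_{c}-\beta\mathcal{R}_{c}$ can be sketched for a toy critic whose input gradient is known in closed form. The quadratic critic $c(x)=w\lVert x\rVert^{2}$ is purely an assumption for illustration (the paper uses a neural network), and the penalty is taken as the mean squared norm of $\nabla_{x}c$ on both batches.

```python
import numpy as np

def critic_objective(w, x_f, x_g, beta):
    """L_c - beta * R_c for the toy critic c(x) = w * ||x||^2 (an assumption).

    For this critic grad_x c(x) = 2 * w * x, so the score penalty is analytic.
    """
    def c(x):
        return w * np.sum(x ** 2, axis=1)

    def sq_grad_norm(x):
        return np.sum((2.0 * w * x) ** 2, axis=1)

    loss_c = c(x_g).mean() - c(x_f).mean()                    # E_G[c] - E_F[c]
    penalty = sq_grad_norm(x_g).mean() + sq_grad_norm(x_f).mean()
    return float(loss_c - beta * penalty)                     # quantity the critic maximizes
```

With a neural critic, the gradient norms would instead come from automatic differentiation with respect to the inputs.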

\begin{algorithm}
\KwIn{Target distribution $\mathds{P}_{F}$; latent distribution $\mathds{P}_{Z}$; generator network $g_{\theta}$; critic network $c_{\phi}$; number of critic updates $k_{\phi}$; number of generator updates $k_{\theta}$; score penalty weight $\beta$}
\KwOut{Trained model $\mathds{P}_{G}$ approximating $\mathds{P}_{F}$}
\Repeat{converged}{
  \For{$i=1$ \KwTo $k_{\phi}$}{
    \tcp{critic's inner loop}
    Draw batch $\{x\}\sim\mathds{P}_{F}$ and $\{z\}\sim\mathds{P}_{Z}$\;
    $\mathcal{R}_{c}\leftarrow\frac{1}{|\{z\}|}\sum_{\{z\}}\lVert\nabla_{x}c_{\phi}(g_{\theta}(z))\rVert_{2}^{2}+\frac{1}{|\{x\}|}\sum_{\{x\}}\lVert\nabla_{x}c_{\phi}(x)\rVert_{2}^{2}$\;
    $\mathcal{L}_{c}\leftarrow\frac{1}{|\{z\}|}\sum_{\{z\}}c_{\phi}(g_{\theta}(z))-\frac{1}{|\{x\}|}\sum_{\{x\}}c_{\phi}(x)$\;
    Update $\phi$ by using $\frac{\partial(\mathcal{L}_{c}-\beta\mathcal{R}_{c})}{\partial\phi}$ to maximize $\mathcal{L}_{c}-\beta\mathcal{R}_{c}$\;
  }
  \For{$i=1$ \KwTo $k_{\theta}$}{
    \tcp{generator's inner loop}
    Draw batch $\{x\}\sim\mathds{P}_{F}$ and $\{z\}\sim\mathds{P}_{Z}$\;
    $\{c_{F}\}\leftarrow\{c_{\phi}(x):\{x\}\}$ and $\{c_{G}\}\leftarrow\{c_{\phi}(g_{\theta}(z)):\{z\}\}$\;
    $\{\lambda\}\leftarrow\{c_{F}\}\cup\{c_{G}\}$\;
    $\mathcal{L}_{g,F}\leftarrow\max_{\{\lambda\}}\left|\frac{1}{|\{z\}|}\sum_{\{c_{G}\}}\mathds{1}_{(-\infty,\lambda]}(c_{G})-\frac{1}{|\{x\}|}\sum_{\{c_{F}\}}\mathds{1}_{(-\infty,\lambda]}(c_{F})\right|$\;
    $\mathcal{L}_{g,G}\leftarrow\max_{\{\lambda\}}\left|\frac{1}{|\{x\}|}\sum_{\{c_{F}\}}\mathds{1}_{(-\infty,-\lambda]}(-c_{F})-\frac{1}{|\{z\}|}\sum_{\{c_{G}\}}\mathds{1}_{(-\infty,-\lambda]}(-c_{G})\right|$\;
    $\mathcal{L}_{g}\leftarrow\mathcal{L}_{g,F}+\mathcal{L}_{g,G}$\;
    Update $\theta$ by using $\frac{\partial\mathcal{L}_{g}}{\partial\theta}$ to minimize $\mathcal{L}_{g}$\;
  }
}
\Return{${g_{\theta}}_{\#}\mathds{P}_{Z}$}
\caption{Learning a generative model by minimizing Generalized KS distance.}
\end{algorithm}
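The forward computation of the generator's loss in algorithm 1 can be sketched in NumPy as below. The sketch assumes that the critic values on both batches are already available, uses hard indicators (so it yields the loss value only, not the straight-through gradient), and uses function names of our own.

```python
import numpy as np

def sup_gap(v_f, v_g, thresholds):
    """Max over thresholds of |frac(v_f <= t) - frac(v_g <= t)|."""
    frac_f = (v_f[None, :] <= thresholds[:, None]).mean(axis=1)
    frac_g = (v_g[None, :] <= thresholds[:, None]).mean(axis=1)
    return float(np.max(np.abs(frac_f - frac_g)))

def generator_loss(c_f, c_g):
    """L_g = L_{g,F} + L_{g,G}, computed from critic values on the two batches."""
    c_f = np.asarray(c_f, dtype=float)
    c_g = np.asarray(c_g, dtype=float)
    lambdas = np.concatenate([c_f, c_g])       # candidate thresholds {lambda}
    loss_f = sup_gap(c_f, c_g, lambdas)        # level sets of the critic c
    loss_g = sup_gap(-c_f, -c_g, -lambdas)     # level sets of the negated critic -c
    return loss_f + loss_g
```

Each term lies in $[0,1]$, so the loss ranges from 0 for indistinguishable batches to 2 for batches whose critic values are fully separated.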

4 Discussion

In section 3.3, where we justify the choice of the critic’s objective function, we refer to methods for training EBMs, which are approximate density models. Thus, the reader may expect that, in the limit of convergence of the algorithm, our proposed critic $c_{\phi}$ will become a source of information about the density of the target distribution $\mathds{P}_{F}$, accompanying the model $\mathds{P}_{G}$ that generates samples. However, this does not happen as a consequence of design choice (i), that is, the setup $c_{\phi_{F}}=-c_{\phi_{G}}=c_{\phi}$. An EBM can only be equivalent to its inverse in the case of a uniform distribution. In addition, because of design choice (ii), the critic is not evaluated outside of the supports of $\mathds{P}_{F}$ and $\mathds{P}_{G}$ during training and can therefore take arbitrary values there. Despite these observations, the Generalized KS distance present in our algorithm exposes sufficient conditions because of remark 2.

The feature distinguishing KSGAN from other adversarial generative modeling approaches is that regardless of the outcome of the critic’s inner problem, minimizing eq. 5 is justified because Generalized KS distance, despite not meeting assumption A.1, is a pseudo-metric [38]. For comparison, the dual representation of Wasserstein distance, used in WGAN [2] requires attaining the supremum in the inner problem.

The distances used for training generative models all fall into either the category of $f$-divergences $D_{f}(\mathds{P}_{F},\mathds{P}_{G})=\int_{\mathcal{X}}f\left(\mathrm{d}\mathds{P}_{F}/\mathrm{d}\mathds{P}_{G}\right)\mathrm{d}\mathds{P}_{G}$ or integral probability metrics (IPMs) $D_{\mathcal{F}}(\mathds{P}_{F},\mathds{P}_{G})=\sup_{f\in\mathcal{F}}|\mathbb{E}_{x\sim\mathds{P}_{F}}f(x)-\mathbb{E}_{x\sim\mathds{P}_{G}}f(x)|$. The classical one-dimensional KS distance is an instance of an IPM with $\mathcal{F}=\{\mathds{1}_{(-\infty,t]}|t\in I\!\!R\}$, or with $\mathcal{F}=\{\mathds{1}_{G^{-1}(\alpha)}|\alpha\in[0,1]\}$ when having access to the inverse CDF of one of the distributions, based on eq. 2. One can see the Generalized KS distance from the perspective of an IPM with $\mathcal{F}=\{\mathds{1}_{C(\alpha)}|\alpha\in[0,1]\ \&\ C\in\{C_{\mathds{P}_{F},\mathcal{C}},C_{\mathds{P}_{G},\mathcal{C}}\}\}$. Assuming direct access to $C_{\mathds{P}_{F},\mathcal{C}}$ and $C_{\mathds{P}_{G},\mathcal{C}}$, for example when both $\mathds{P}_{F}$ and $\mathds{P}_{G}$ are Normalizing Flows [24, 34], measuring the distance comes down to a line search.

5 Related work

The need to generalize the KS test, and therefore distance, to multiple dimensions arose naturally from the side of practitioners who collected such data and wished to test related hypotheses. It was first addressed by Peacock [35], where a two-dimensional test for applications in astronomy was proposed. It involves considering all possible orders in this space and using the one that maximizes the distance between the distributions. A modification of this procedure has been proposed by Fasano and Franceschini [11] where only four candidate CDFs have to be considered, causing the test to be applicable in three dimensions, with eight candidates, under similar computational constraints. Chronologically, the following approach was the one on which we base our work, proposed in Polonik [38] but made possible by the author’s earlier work [36, 37]. To the best of our knowledge, the first work that practically uses the theory developed by Polonik is Glazer et al. [12], which we recommend as an introduction to our work. It proposes applying the Generalized KS test based on the support vector machines for detecting distribution shifts in data streams.

As an instance of the adversarial generative modeling family, our work is related to all the countless GAN [13] follow-ups. We highlight those that study the learning process from the perspective of the distance being minimized. The work of Arjovsky and Bottou [1] provides a formal analysis of the heuristic tricks used for stabilizing the training of GANs. The $f$ -GAN [33] proposes a unified training framework targeting $f$ -divergences, which relies on a variational lower bound of the objective that results in the adversarial process. Approaches relying on the integral probability metric include FisherGAN [32], the Generative Moment Matching Networks [29] based on MMD, just like the later, more sophisticated MMD GAN [28], and finally the Wasserstein GAN (WGAN) [2] with the WGAN-GP follow-up [16] which shares common features with our work. Our maximum likelihood approach to fitting the critic results in the same functional form of the loss as WGAN(-GP) uses. In addition, the score penalty we use is similar to the gradient penalty of WGAN-GP.

6 Experiments

We evaluate the proposed method on eight synthetic 2D distributions (see section B.1 for details) and two image datasets, i.e. MNIST [27] and CIFAR-10 [25]. We compare against other adversarial methods, GAN and WGAN-GP, using the same neural network architectures and training hyper-parameters unless specified otherwise (see appendix C for details). All the quantitative results are presented based on five random initializations of the models. The source code for all the experiments is provided at https://github.com/DMML-Geneva/ksgan.

In all KSGAN experiments, we relax the maximum in algorithm 1 with a sample average. In all experiments, we reuse the last batch of samples from the latent distribution (and target distribution in the case of KSGAN) from the critic’s optimization inner loop as the first batch for the generator’s optimization inner loop.

6.1 Synthetic distributions

Analyzing adversarial methods on synthetic, low-dimensional distributions is uncommon. However, we conduct such an experiment because we are interested in whether the model generates samples from the support of the target distribution and how accurately it approximates the distribution. Working with low-dimensional distributions, we do not have to be as concerned about the curse of dimensionality when calculating sample-based distances, and we can visually compare the resulting histograms.

In table 1, we report the squared population MMD [15] between target and approximate distributions, computed with a Gaussian kernel on 65536 samples from each distribution. Details about how we chose the kernel’s bandwidth can be found in section B.1. GAN and WGAN-GP fail to converge with $k_{\phi}=k_{\theta}=1$ (we do not report the results to economize on space); thus, we set $k_{\phi}=5$ for them, as listed in the header of table 1. The proposed KSGAN with $k_{\phi}=1$ performs at a similar level to WGAN-GP, the better of the two baselines, despite using a five times smaller training budget. We present additional results on the synthetic datasets in section D.1, which include performance with different training dataset sizes, non-default hyper-parameter setups for KSGAN, and histograms of the samples for qualitative comparison.
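For readers unfamiliar with the metric, a minimal Python sketch of a sample-based squared MMD with a Gaussian kernel follows. Several details are assumptions made for illustration and are not confirmed by the text: the kernel parametrization $\exp(-\|x-y\|^{2}/(2\sigma^{2}))$, the use of the biased (V-statistic) estimator, and the function names `gaussian_kernel` and `squared_mmd`. The double loop is quadratic in the sample size, so the sketch is meant for small inputs only.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel exp(-||x - y||^2 / (2 * bandwidth^2)) (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def squared_mmd(xs, ys, bandwidth):
    """Biased sample estimate of the squared MMD between xs and ys."""

    def mean_kernel(us, vs):
        # Average kernel value over all pairs (u, v).
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

Identical sample sets give zero, and the value grows as the two point clouds move apart relative to the bandwidth.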

Table 1: Squared population MMD $\times 10^{3}$ ($\downarrow$) between test data and samples from the methods trained on 65536 samples, averaged over five random initializations with the standard deviation calculated with Bessel’s correction in the parentheses. The proposed KSGAN with $k_{\phi}=1$ performs on par with the WGAN-GP trained with five times the budget ($k_{\phi}=5$). See D.1 for qualitative comparison.

	Method ( $k_{\phi}$ , $k_{\theta}$ )
Distribution	GAN (5, 1)	WGAN-GP (5, 1)	KSGAN (1, 1)
swissroll	3.37 (1.023)	0.29 (0.119)	0.39 (0.100)
circles	2.98 (1.501)	0.27 (0.215)	0.49 (0.240)
rings	2.00 (1.264)	0.13 (0.082)	0.43 (0.162)
moons	1.41 (0.757)	0.35 (0.136)	0.53 (0.189)
8gaussians	3.57 (2.719)	0.35 (0.248)	0.32 (0.277)
pinwheel	1.66 (1.451)	0.27 (0.184)	0.40 (0.086)
2spirals	0.93 (0.822)	0.27 (0.191)	0.44 (0.232)
checkerboard	1.43 (0.899)	0.38 (0.296)	0.86 (0.468)

6.2 MNIST

We use the $50000$ training instances to train the models, and based on visual inspection of the generated samples (reported in D.2), we conclude that all the methods achieve comparable, high sample quality. To assess the quality of the distribution approximation, we use a classifier pre-trained on the same data as the generative models (details in section B.2). We run the same experiment on 3StackedMNIST [44], which has 1000 modes. We report the results in table 2.

In this experiment, we set the training budget for all methods to $k_{\phi}=1$, $k_{\theta}=1$ for a fair comparison. We find that all methods always recover all the modes with the standard MNIST target. However, GAN fails to distribute the probability mass uniformly between the digits. As the number of modes increases with the 3StackedMNIST target, GAN demonstrates its inferiority to other methods by losing 192 modes on average, i.e., covering 808 of the 1000 modes (four initializations cover approx. 985 modes, and one fails to converge, achieving only 98 modes). WGAN-GP and KSGAN consistently recover all the modes while being on par regarding KL divergence, which differs little between the networks’ initializations.

Table 2: The number of captured modes and Kullback-Leibler divergence between the distribution of sampled digits and target uniform distribution averaged over five random initializations with the standard deviation calculated with Bessel’s correction in the parentheses. All the methods were trained with the same budget $k_{\phi}=1$, $k_{\theta}=1$. WGAN-GP and KSGAN cover all the modes in all experiments while demonstrating low KL divergence.

	MNIST		3StackedMNIST
Method ( $k_{\phi}$ , $k_{\theta}$ )	# modes $\uparrow$	KL $\downarrow$	# modes $\uparrow$	KL $\downarrow$
GAN (1,1)	10 (0.00)	0.6007 (0.27550)	808 (396.91)	1.4160 (1.36819)
WGAN-GP (1,1)	10 (0.00)	0.0087 (0.00499)	1000 (0.00)	0.0336 (0.00461)
KSGAN (1,1)	10 (0.00)	0.0056 (0.00045)	1000 (0.00)	0.0362 (0.00534)

6.3 CIFAR-10

We use the $50000$ training instances to train the models and report the generated samples in D.3. We train the models in a fully unconditional manner, i.e., not using the class information at all – contrary to many unconditional models that use class information in normalization layers. We quantify the quality of fitted models by computing the Inception Score (IS) [41] and Fréchet inception distance (FID) [18] from the test set and report the results in table 3 based on five random initializations. For reference, in the table, we include the IS of the training dataset and the FID between the training and test sets.

In this experiment, we set the training budget for all methods to $k_{\phi}=1$ , $k_{\theta}=1$ for a fair comparison. All models fail to accurately approximate the target distribution, which is evident from a quantitative comparison in table 3 and a qualitative one in D.3. KSGAN is characterized by the lowest variance between initializations among the methods considered.

Table 3: Inception Score (IS) and Fréchet inception distance (FID) metrics averaged over five random initializations with the standard deviation calculated with Bessel’s correction in the parentheses. All the methods were trained with the same budget $k_{\phi}=1$, $k_{\theta}=1$. The scores for the training dataset are included in the top row, as “Real data” for reference. WGAN-GP and KSGAN perform similarly on average, while KSGAN exhibits lower variance between networks’ initialization.

Method ( $k_{\phi}$ , $k_{\theta}$ )	IS $\uparrow$	FID $\downarrow$
Real data	11.2643	5.8369
GAN (1,1)	6.6209 (0.59187)	47.9414 (10.78435)
WGAN-GP (1,1)	6.7351 (0.31735)	44.3026 (6.61652)
KSGAN (1,1)	6.6429 (0.16785)	41.1555 (3.26385)

7 Conclusions and future work

In this work, we investigated the use of Generalized Kolmogorov–Smirnov distance for training deep implicit statistical models, i.e., generative networks. We proposed an efficient way to compute the distance and termed the resulting model Kolmogorov–Smirnov Generative Adversarial Network because it uses adversarial learning. Based on the empirical evaluation of the proposed model, the results of which we report, we conclude that it can be considered as an alternative to existing models in its class. At the same time, we point out that many properties of KSGAN have not been studied, and we leave this as a future work direction.

Interesting aspects to explore are the characteristics of learning dynamics with the number of generator updates exceeding the number of critic updates, alternative ways to train the critic, and alternative representations of generalized quantile sets. The natural scaling of the Generalized KS distance may also prove beneficial regarding the interpretability of learning curves, learning rate scheduling, or early stopping. In addition, we hope that our work will draw the attention of the machine learning community to the Generalized KS distance, applications of which remain to be explored.

Acknowledgments and Disclosure of Funding

We acknowledge the financial support of the Swiss National Science Foundation within the MIGRATE project (grant no. 209434). The computations were performed at the University of Geneva on "Baobab" and "Yggdrasil" HPC clusters.

References

Arjovsky and Bottou [2017] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations, 2017.
Arjovsky et al. [2017] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
Atzmon et al. [2019] M. Atzmon, N. Haim, L. Yariv, O. Israelov, H. Maron, and Y. Lipman. Controlling neural level sets. Advances in Neural Information Processing Systems, 32(NeurIPS), 2019.
Bengio et al. [2013] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
Carolan [2002] C. A. Carolan. The least concave majorant of the empirical distribution function. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 30(2):317–328, 2002.
Chen et al. [2023] G. Chen, Z. Yu, H. Liu, Y. Ma, and B. Yu. DevelSet: Deep Neural Level Set for Instant Mask Optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42(12):5020–5033, 2023.
Chen et al. [2018] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.
Cranmer et al. [2020] K. Cranmer, J. Brehmer, and G. Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055–30062, 2020.
Diggle and Gratton [1984] P. J. Diggle and R. J. Gratton. Monte carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society. Series B (Methodological), 46(2):193–227, 1984.
Einmahl and Mason [1992] J. H. J. Einmahl and D. M. Mason. Generalized Quantile Processes. The Annals of Statistics, 20(2), jun 1992.
Fasano and Franceschini [1987] G. Fasano and A. Franceschini. A multidimensional version of the Kolmogorov–Smirnov test. Monthly Notices of the Royal Astronomical Society, 225(1):155–170, mar 1987.
Glazer et al. [2012] A. Glazer, M. Lindenbaum, and S. Markovitch. Learning high-density regions for a generalized kolmogorov-smirnov test in high-dimensional data. Advances in Neural Information Processing Systems, 1:728–736, 2012.
Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
Grathwohl et al. [2018] W. Grathwohl, R. T. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
Gretton et al. [2012] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773, 2012.
Gulrajani et al. [2017] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. Advances in neural information processing systems, 30, 2017.
Hermans et al. [2022] J. Hermans, A. Delaunoy, F. Rozet, A. Wehenkel, V. Begy, and G. Louppe. A crisis in simulation-based inference? beware, your posterior approximations can be unfaithful. Transactions on Machine Learning Research, 2022.
Heusel et al. [2017] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
Ho et al. [2020] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Hu et al. [2017] P. Hu, B. Shuai, J. Liu, and G. Wang. Deep level sets for salient object detection. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017-Janua:540–549, 2017.
Hyndman [1996] R. J. Hyndman. Computing and graphing highest density regions. The American Statistician, 50(2):120–126, 1996.
Hyvärinen [2005] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695–709, 2005.
Kim and Bengio [2016] T. Kim and Y. Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.
Kobyzev et al. [2020] I. Kobyzev, S. J. Prince, and M. A. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE transactions on pattern analysis and machine intelligence, 43(11):3964–3979, 2020.
Krizhevsky [2009] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
Kumar et al. [2019] R. Kumar, S. Ozair, A. Goyal, A. Courville, and Y. Bengio. Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508, 2019.
Lecun et al. [1998] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Li et al. [2017] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. Mmd gan: Towards deeper understanding of moment matching network. Advances in neural information processing systems, 30, 2017.
Li et al. [2015] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In International conference on machine learning, pages 1718–1727. PMLR, 2015.
Lyu [2009] S. Lyu. Interpretation and generalization of score matching. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 359–366, 2009.
Miyato et al. [2018] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
Mroueh and Sercu [2017] Y. Mroueh and T. Sercu. Fisher gan. Advances in neural information processing systems, 30, 2017.
Nowozin et al. [2016] S. Nowozin, B. Cseke, and R. Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
Papamakarios et al. [2021] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021.
Peacock [1983] J. A. Peacock. Two-dimensional goodness-of-fit testing in astronomy. Monthly Notices of the Royal Astronomical Society, 202(3):615–627, mar 1983.
Polonik [1997] W. Polonik. Minimum volume sets in statistics: Recent developments. In R. Klar and O. Opitz, editors, Classification and Knowledge Organization, pages 187–194, Berlin, Heidelberg, 1997. Springer Berlin Heidelberg.
Polonik [1998] W. Polonik. The silhouette, concentration functions and ml-density estimation under order restrictions. The Annals of Statistics, 26(5):1857–1877, 1998.
Polonik [1999] W. Polonik. Concentration and goodness-of-fit in higher dimensions: (Asymptotically) distribution-free methods. Annals of Statistics, 27(4):1210–1229, 1999.
Radford et al. [2016] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Y. Bengio and Y. LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
Ramesh et al. [2022] P. Ramesh, J.-M. Lueckmann, J. Boelts, Á. Tejero-Cantero, D. S. Greenberg, P. J. Goncalves, and J. H. Macke. GATSBI: Generative adversarial training for simulation-based inference. In International Conference on Learning Representations, 2022.
Salimans et al. [2016] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
Song and Ermon [2019] Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
Song and Kingma [2021] Y. Song and D. P. Kingma. How to train your energy-based models. arXiv preprint arXiv:2101.03288, 2021.
Srivastava et al. [2017] A. Srivastava, L. Valkov, C. Russell, M. U. Gutmann, and C. Sutton. Veegan: Reducing mode collapse in gans using implicit variational learning. Advances in neural information processing systems, 30, 2017.

Appendix A Proofs


A.1 Generalized KS distance satisfies triangle inequality

Let us consider three probability measures $\mathds{P}_{F}$ , $\mathds{P}_{G}$ , and $\mathds{P}_{H}$ on a measurable space $(\mathcal{X},\mathcal{A})$ .

	$\displaystyle D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{H}\right)+D_{\mathrm{GKS}}\left(\mathds{P}_{H},\mathds{P}_{G}\right)$
	$\displaystyle=\sup_{\begin{subarray}{c}\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{F},\mathcal{C}},C_{\mathds{P}_{H},\mathcal{C}}\}\end{subarray}}\left[\|\mathds{P}_{F}(C(\alpha))-\mathds{P}_{H}(C(\alpha))\|\right]+\sup_{\begin{subarray}{c}\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{H},\mathcal{C}},C_{\mathds{P}_{G},\mathcal{C}}\}\end{subarray}}\left[\|\mathds{P}_{H}(C(\alpha))-\mathds{P}_{G}(C(\alpha))\|\right]$
	$\displaystyle\stackrel{{\scriptstyle\text{(i)}}}{{=}}\sup_{\begin{subarray}{c}\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{F},\mathcal{C}},C_{\mathds{P}_{H},\mathcal{C}},C_{\mathds{P}_{G},\mathcal{C}}\}\end{subarray}}\left[\|\mathds{P}_{F}(C(\alpha))-\mathds{P}_{H}(C(\alpha))\|\right]+\sup_{\begin{subarray}{c}\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{H},\mathcal{C}},C_{\mathds{P}_{G},\mathcal{C}},C_{\mathds{P}_{F},\mathcal{C}}\}\end{subarray}}\left[\|\mathds{P}_{H}(C(\alpha))-\mathds{P}_{G}(C(\alpha))\|\right]$
	$\displaystyle\geqslant\sup_{\begin{subarray}{c}\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{F},\mathcal{C}},C_{\mathds{P}_{H},\mathcal{C}},C_{\mathds{P}_{G},\mathcal{C}}\}\end{subarray}}\left[\|\mathds{P}_{F}(C(\alpha))-\mathds{P}_{H}(C(\alpha))\|+\|\mathds{P}_{H}(C(\alpha))-\mathds{P}_{G}(C(\alpha))\|\right]$
	$\displaystyle\stackrel{{\scriptstyle\text{(ii)}}}{{\geqslant}}\sup_{\begin{subarray}{c}\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{F},\mathcal{C}},C_{\mathds{P}_{H},\mathcal{C}},C_{\mathds{P}_{G},\mathcal{C}}\}\end{subarray}}\left[\|\mathds{P}_{F}(C(\alpha))-\mathds{P}_{G}(C(\alpha))\|\right]$
	$\displaystyle=\sup_{\begin{subarray}{c}\alpha\in[0,1]\\ C\in\{C_{\mathds{P}_{G},\mathcal{C}},C_{\mathds{P}_{F},\mathcal{C}}\}\end{subarray}}\left[\|\mathds{P}_{F}(C(\alpha))-\mathds{P}_{G}(C(\alpha))\|\right]=D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)$

In (i), we use the fact that the supremum of the absolute difference in distribution coverage is attained with the generalized quantile function of one of the two distributions being compared, so enlarging the set of candidate quantile functions does not change its value. The unlabeled inequality that follows holds because a sum of suprema is at least the supremum of the sum. In (ii), we apply the triangle inequality for the absolute value. Thus we have shown that $D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{H}\right)+D_{\mathrm{GKS}}\left(\mathds{P}_{H},\mathds{P}_{G}\right)\geqslant D_{\mathrm{GKS}}\left(\mathds{P}_{F},\mathds{P}_{G}\right)$, which is the triangle inequality for the Generalized KS distance.

A.2 Objective for the critic

Given two adversarial maximum likelihood objectives from Kim and Bengio [23], we (i) set $c_{\phi_{G}}(x):=-c_{\phi_{F}}(x)$ , and (ii) repurpose $\mathds{P}_{G}$ as $\mathds{P}_{aux(F)}$ and $\mathds{P}_{F}$ as $\mathds{P}_{aux(G)}$ , and show that:

	$\displaystyle\frac{1}{2}(\mathbb{E}_{x\sim\mathds{P}_{F}}[-c_{\phi_{F}}(x)]-\mathbb{E}_{x\sim\mathds{P}_{aux(F)}}[-c_{\phi_{F}}(x)])+\frac{1}{2}(\mathbb{E}_{x\sim\mathds{P}_{G}}[-c_{\phi_{G}}(x)]-\mathbb{E}_{x\sim\mathds{P}_{aux(G)}}[-c_{\phi_{G}}(x)])$
	$\displaystyle\quad=\frac{1}{2}(\mathbb{E}_{x\sim\mathds{P}_{F}}[-c_{\phi}(x)]-\mathbb{E}_{x\sim\mathds{P}_{G}}[-c_{\phi}(x)]+\mathbb{E}_{x\sim\mathds{P}_{G}}[c_{\phi}(x)]-\mathbb{E}_{x\sim\mathds{P}_{F}}[c_{\phi}(x)])$
	$\displaystyle\quad=\mathbb{E}_{x\sim\mathds{P}_{G}}[c_{\phi}(x)]-\mathbb{E}_{x\sim\mathds{P}_{F}}[c_{\phi}(x)].$
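The identity above can be checked numerically on sample averages. The minimal Python sketch below evaluates both sides given critic values $c_{\phi}(x)$ on a batch from $\mathds{P}_{F}$ and a batch from $\mathds{P}_{G}$. The function names `symmetric_objective` and `critic_objective` are hypothetical, and the score penalty used in training is deliberately left out.

```python
from statistics import fmean


def symmetric_objective(c_on_target, c_on_generated):
    """Left-hand side: average of the two adversarial objectives.

    Uses c_{phi_F} = c and c_{phi_G} = -c, with P_G as the auxiliary
    distribution for F and P_F as the auxiliary distribution for G.
    """
    neg_target = [-c for c in c_on_target]
    neg_generated = [-c for c in c_on_generated]
    term_f = fmean(neg_target) - fmean(neg_generated)
    term_g = fmean(c_on_generated) - fmean(c_on_target)
    return 0.5 * term_f + 0.5 * term_g


def critic_objective(c_on_target, c_on_generated):
    """Right-hand side: E_{x ~ P_G}[c(x)] - E_{x ~ P_F}[c(x)]."""
    return fmean(c_on_generated) - fmean(c_on_target)
```

Both functions agree up to floating-point error for any pair of batches, which mirrors the algebra of the derivation.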

Appendix B Experiments details

In this section, we provide additional details about experiments conducted in the paper that did not fit in the main text. All the models reported in the paper were trained in under 12 hours each on a single Nvidia GeForce GTX TITAN X GPU (12GB vRAM) with 32GB of RAM and 2 CPU cores. We report results based on 645 trained models, which amounts to at most $645\times 12=7740$ GPU hours. We estimate that about three times as much computing time was used for preliminary experiments not reported in the paper.

B.1 Synthetic

The synthetic 2D distributions are adopted from the official code of Grathwohl et al. [14] – https://github.com/rtqichen/ffjord. We randomly generate 65536 training and 65536 test instances from each distribution. In D.1, we report the results of training the models with fewer instances but evaluated using the entire test set.

We choose the bandwidth of the Gaussian kernel in the squared population MMD as the median of the pairwise L2 distances between 65536 instances sampled from the simulator. The resulting values can be found in the code we provide with the paper.
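The median rule can be sketched in a few lines of Python. This is an illustration under assumptions rather than the code shipped with the paper: the function name `median_bandwidth` is hypothetical, and only distinct unordered pairs are considered. Enumerating all pairs of 65536 points is expensive, so a practical implementation would likely vectorize or subsample, which is not shown.

```python
import math
from statistics import median


def median_bandwidth(points):
    """Median Euclidean distance over all distinct unordered pairs of points."""
    dists = [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]
    return median(dists)
```

For example, for the three collinear points $(0,0)$, $(3,4)$, and $(6,8)$ the pairwise distances are 5, 10, and 5, so the selected bandwidth is 5.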

B.2 MNIST

To detect the modes in the (3Stacked)MNIST experiments, we use a pre-trained classifier from PyTorch examples, trained for 14 epochs on the training set of the original MNIST dataset. We expect to find 10 and 1000 modes for the MNIST and 3StackedMNIST, respectively. We measure the KL divergence between the classifier’s output and discrete uniform distribution for both distributions.
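A minimal Python sketch of the two reported quantities follows, assuming the classifier’s predictions have already been reduced to one integer label per generated sample (for 3StackedMNIST, a hypothetical encoding such as $100d_{1}+10d_{2}+d_{3}$ for the three digits would do). The direction of the divergence, from the empirical label distribution to the uniform one, and the function name `modes_and_kl` are assumptions made for illustration.

```python
import math
from collections import Counter


def modes_and_kl(predicted_labels, num_classes):
    """Return (number of distinct predicted classes, KL(p || uniform)).

    p is the empirical distribution of the predicted labels.
    """
    counts = Counter(predicted_labels)
    n = len(predicted_labels)
    num_modes = len(counts)
    # KL(p || u) = sum_k p_k * log(p_k * K); classes with p_k = 0 contribute 0.
    kl = sum(
        (c / n) * math.log((c / n) * num_classes) for c in counts.values()
    )
    return num_modes, kl
```

A perfectly uniform label histogram gives a divergence of zero, while a generator that collapses to a single class gives $\log K$ for $K$ classes.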

B.3 CIFAR-10

We sample 32768 instances from each model. We compute the Inception Score using the implementation from https://github.com/sbarratt/inception-score-pytorch. We compute the Fréchet inception distance using the implementation from https://github.com/mseitzer/pytorch-fid.
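For orientation, the Inception Score is $\exp\left(\mathbb{E}_{x}\,\mathrm{KL}(p(y|x)\,\|\,p(y))\right)$, where $p(y|x)$ are class probabilities predicted for a generated sample and $p(y)$ is their average over samples. The minimal Python sketch below computes this quantity from a list of probability vectors. It is not the referenced implementation: the function name `inception_score` is hypothetical, the Inception network itself is not included, and any averaging over splits of the sample is omitted.

```python
import math


def inception_score(class_probs):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y).

    class_probs: list of per-sample class-probability vectors.
    """
    n = len(class_probs)
    k = len(class_probs[0])
    marginal = [sum(p[j] for p in class_probs) / n for j in range(k)]
    total_kl = 0.0
    for p in class_probs:
        # Terms with p_j = 0 contribute nothing to the divergence.
        total_kl += sum(
            pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0.0
        )
    return math.exp(total_kl / n)
```

Confident predictions spread evenly over $K$ classes give a score of $K$, and identical predictions for every sample give a score of 1.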

Appendix C Architectures and hyper-parameters

C.1 Synthetic

For all of the methods and distributions, we use the same architecture, described in table 4, with spectral normalization [31] on linear layers for GAN. In all cases, we train the generator and critic with Adam( $\beta_{1}=0.5$ , $\beta_{2}=0.9$ ) optimizer with a constant learning rate of 0.0001, without L2 regularization or weight decay, for 128000 generator updates with batch size equal to 512. We use the standard loss for GAN, enforcing class 1 for real samples and 0 for generated samples. In WGAN-GP, we use 0.1 weight on gradient penalty (identified as a good value in preliminary experiments, which we do not report), and in KSGAN $\beta=1.0$ as the weight for score penalty.

Table 4: Architectures for synthetic 2D datasets.

$z\in\mathbb{R}^{8}\sim\mathcal{N}(0,I)$
Linear(bias=True), $8\rightarrow 512$
ReLU
Linear(bias=True), $512\rightarrow 512$
ReLU
Linear(bias=True), $512\rightarrow 512$
ReLU
Linear(bias=True), $512\rightarrow 2$

(a) Generator

Linear(bias=True), $2\rightarrow 512$
LeakyReLU(slope=0.2)
Linear(bias=True), $512\rightarrow 512$
LeakyReLU(slope=0.2)
Linear(bias=True), $512\rightarrow 512$
LeakyReLU(slope=0.2)
Linear(bias=True), $512\rightarrow 1$

(b) Critic

C.2 MNIST

For the MNIST experiments, we use the DCGAN [39] architecture, without batch normalization layers, with 128-dimensional latent Gaussian distribution. For the 3StackedMNIST distribution, we increase the number of input and output channels for the critic and generator, respectively. We train the generator and critic with Adam( $\beta_{1}=0.5$ , $\beta_{2}=0.9$ ) optimizer with a constant learning rate of 0.0001, without L2 regularization or weight decay, for 200000 generator updates with batch size equal to 50. In the case of GAN for 3StackedMNIST, we use a learning rate of 0.001 (identified as a good value in preliminary experiments, which we do not report). We use the flipped loss for GAN, enforcing class 0 for real samples and 1 for generated samples. In WGAN-GP, we use 10.0 weight on gradient penalty (identified as a good value in preliminary experiments, which we do not report), and in KSGAN $\beta=1.0$ as the weight for score penalty.

C.3 CIFAR-10

For the CIFAR-10 experiments, we use the ResNet architecture from Gulrajani et al. [16]. We train the generator and critic with Adam( $\beta_{1}=0.0$ , $\beta_{2}=0.9$ ) optimizer with a constant learning rate of 0.0001, without L2 regularization or weight decay, for 199936 generator updates with batch size equal to 64. We use the flipped loss for GAN, enforcing class 0 for real samples and 1 for generated samples. In WGAN-GP, we use 10.0 weight on gradient penalty (identified as a good value in preliminary experiments, which we do not report), and in KSGAN $\beta=1.0$ as the weight for score penalty.

Appendix D Extended results

In this section, we report additional experiment results that did not fit in the main text. This includes materials allowing a qualitative comparison of the trained models.

D.1 Synthetic data

In figure 2, we report a study, extended relative to table 1 in the main text, of the quality of trained models as measured by the squared population MMD. Solid lines denote the average over five random initializations, and the shaded area represents the two-$\sigma$ interval. KSGAN performs on par with WGAN-GP while being trained with a five times smaller training budget. In figure 3, we show the histograms of 65536 samples from the models (a single random initialization), with a histogram of test data in the first column for reference. For KSGAN, in addition to the configurations included in table 1, we include one with a training budget matching that of GAN and WGAN-GP, and one with the training budget halved, where the critic is updated only at every second update of the generator.

D.2 MNIST

In figure 4, we show samples from one of the random initializations reported in table 2 in the main text. All models demonstrate similar sample quality, while for GAN, the digit “1” is over-represented, which corresponds with the high KL in table 2.

D.3 CIFAR-10

In figure 5, we show samples from one of the random initializations reported in table 3 in the main text. All models demonstrate similar, low sample quality.
