ScoreFusion: Fusing Score-based Generative Models via Kullback–Leibler Barycenters
Abstract
We study the problem of fusing pre-trained (auxiliary) generative models to enhance the training of a target generative model. We propose using KL-divergence weighted barycenters as an optimal fusion mechanism, in which the barycenter weights are optimally trained to minimize a suitable loss for the target population. While computing the optimal KL-barycenter weights can be challenging, we demonstrate that this process can be efficiently executed using diffusion score training when the auxiliary generative models are also trained based on diffusion score methods. Moreover, we show that our fusion method has a dimension-free sample complexity in total variation distance provided that the auxiliary models are well fitted for their own task and the auxiliary tasks combined capture the target well. The main takeaway of our method is that if the auxiliary models are well-trained and can borrow features from each other that are present in the target, our fusion method significantly improves the training of generative models. We provide a concise computational implementation of the fusion algorithm, and validate its efficiency in the low-data regime with numerical experiments involving mixture models and image datasets.
1 Introduction
In recent advancements within the field of generative models, diffusion models [47, 24, 44] have emerged as a potent framework for synthesizing high-quality and diverse outputs across domains such as imagery, audio, and textual content [23, 4, 39, 37, 5, 57]. Successful commercial examples include DALL·E [38], Stable Diffusion [40], and Imagen [41]. The underlying mechanism of diffusion models involves a progressive addition of noise to a data sample until it approximates a Gaussian distribution, followed by a learned reverse process to reconstruct the original data by gradually denoising it.
Diffusion models rely on large datasets of high-dimensional data to accurately model the complex distributions needed for tasks like image generation and data augmentation [42, 55, 28, 43]. Without sufficient training data, diffusion models struggle to produce high-quality, diverse outputs and can overfit to the limited data they have been trained on [55, 59, 58].
![Refer to caption](extracted/5697135/images/HD/base_samples_dynamics.png)
However, in practice, data scarcity can hinder the performance of generative models, especially in domains where data is limited due to high costs, privacy concerns, and proprietary restrictions by companies treating their data as a competitive advantage. These challenges mean that even as the demand for powerful generative models grows, the scarcity of usable data can significantly limit their development and effectiveness. To demonstrate this phenomenon, Figure 1 shows generated digits for different training sample sizes. We observe that the quality of a diffusion model deteriorates noticeably as the training data size decreases.
To address the issue of data scarcity, researchers and practitioners often utilize the idea of transfer learning [34, 56, 51, 60, 49]. Transfer learning is a technique in machine learning where a model developed for one task is reused as the starting point for a model on another task. This approach allows a model trained on large and common datasets to be adapted to a different, but related, problem or dataset with less data available. Many recent works develop transfer learning algorithms to finetune the diffusion models and achieve empirical success in areas such as image generation [52, 31, 33].
In this paper, we develop a fusion method for diffusion models. Specifically, our goal is to build a generative diffusion model for a target distribution where data availability is limited, using the assistance of multiple pretrained diffusion models. These pretrained models have been trained on several common datasets, allowing them to capture a broad range of features and patterns. By leveraging these pretrained models, we aim to enhance the performance of our diffusion model on the target distribution, despite the scarcity of training data. The difference from transfer learning is that, in transfer learning, the parameters of the diffusion models are retrained, whereas in our method we freeze the neural network weights and create a new neural network with an extra linear layer.
Our method is based on fusing diffusion processes through the computation of optimal barycenters. Given a set of weights, a barycenter is typically defined as a probability measure that minimizes the weighted sum of distances (or divergences) to a set of reference measures. The most common barycenter problem among distributions is the Wasserstein barycenter [1, 16, 36, 45, 13, 26]. However, computing Wasserstein barycenters is generally challenging [35, 7, 48, 22]. Therefore, we utilize a Kullback–Leibler (KL) barycenter [14, 6], which has an analytical solution given any selection of weights (both when the reference measures are supported in Euclidean spaces and when they can be represented as diffusion processes). In turn, the weights of our KL barycenter formulation are optimized according to a suitable class of training losses, as we will explain in the sequel.
Our goal is to find the optimal weights to approximate the target dataset. We formulate two convex optimization problems, leading to two fusion methods. The first method is intuitive but requires an estimate of the reference densities and numerical integration, which is usually challenging in high-dimensional contexts. The second method is computationally cheaper since it becomes a linear regression problem after being embedded into the diffusion space, and it still achieves good theoretical and empirical performance.
The main contributions of our work are summarized as follows:
1) We demonstrate that KL barycenter fusion of auxiliary models can be efficiently implemented when the auxiliary models are trained based on score diffusion. In this case, the optimal score is linear in the auxiliary scores.
2) We provide generalization bounds that split the error into four components. The first is the error between the optimal KL barycenter and the target at time zero (whose direct implementation is difficult due to numerical integration). The second corresponds to the sample complexity, and the third is the approximation error introduced by the diffusion embedding (which facilitates the training). The fourth reflects the quality of the auxiliary score estimations.
3) We numerically demonstrate the performance of our proposed fusion method. Specifically, we found that our method outperforms the basic diffusion method when the training sample size is small.
The rest of the paper is organized as follows. Section 2 reviews the background of KL barycenter and diffusion models. Section 3 details our proposed fusion methods. Section 4 provides convergence results for our methods. Section 5 presents numerical results. Finally, Section 6 concludes the paper with future directions. All proofs are relegated to the appendix.
2 Preliminaries and setup
2.1 Notations
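The following notation will be used. Given two functions $f, g: \mathcal{X} \to \mathbb{R}$, we say $f \lesssim g$ if there exists a constant $C > 0$ such that for all $x \in \mathcal{X}$, $f(x) \le C g(x)$. When $x \to x_0$, where $x_0 \in [-\infty, \infty]$, we say $f = O(g)$ if there exists a constant $C > 0$ such that for all $x$ close enough to $x_0$, $|f(x)| \le C |g(x)|$. In asymptotic cases, we use $\lesssim$ and $O(\cdot)$ interchangeably. $f = \Theta(g)$ if and only if $f = O(g)$ and $g = O(f)$. $C([0,T]; \mathbb{R}^d)$ is the space of all continuous functions on $[0,T]$ equipped with the uniform topology. In this paper, we consider a Polish space $\Omega$, which could be $\mathbb{R}^d$ or $C([0,T]; \mathbb{R}^d)$. For a Polish space $\Omega$ equipped with its Borel $\sigma$-algebra $\mathcal{B}(\Omega)$, we denote by $\mathcal{P}(\Omega)$ the space of probability measures on $\Omega$ equipped with the topology of weak convergence. In a normed vector space $V$, $\|\cdot\|_V$ denotes the corresponding norm. $\|\cdot\|_2$ denotes the standard Euclidean norm. Given a matrix $A$, we use $A^\top$ to denote its transpose. We denote $[k] = \{1, \dots, k\}$. We use $\Delta_k$ to represent the $k$-dimensional probability simplex, i.e., $\Delta_k = \{\lambda \in \mathbb{R}^k : \lambda_i \ge 0, \ \sum_{i=1}^k \lambda_i = 1\}$.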
2.2 Barycenter problems and Kullback–Leibler divergence
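Given a set of probability measures $\{\mu_1, \dots, \mu_k\}$ on a Polish space $\Omega$ and a measure of dissimilarity $D$ (e.g. a metric or a divergence) between two elements in $\mathcal{P}(\Omega)$, we define the barycenter problem with respect to $D$ and weight $\lambda \in \Delta_k$ as the optimization problem

$$\min_{\mu \in \mathcal{P}(\Omega)} \sum_{i=1}^k \lambda_i D(\mu, \mu_i),$$

where $\mu_1, \dots, \mu_k$ are called the reference measures. With a fixed choice of weight and reference measures, the solution of the barycenter problem is denoted as $\mu_\lambda$.
Recall the definition of the Kullback–Leibler (KL) divergence: suppose $\mu, \nu \in \mathcal{P}(\Omega)$, then $D_{\mathrm{KL}}(\mu \,\|\, \nu) = \int_\Omega \log\left(\frac{d\mu}{d\nu}\right) d\mu$ if $\mu \ll \nu$ and $+\infty$ otherwise, where $\frac{d\mu}{d\nu}$ is the Radon–Nikodym derivative of $\mu$ with respect to $\nu$. In particular, if $\Omega = \mathbb{R}^d$, and $\mu$ and $\nu$ are the laws of absolutely continuous random vectors (with respect to Lebesgue measure) in $\mathbb{R}^d$ with densities $p$ and $q$ respectively, then

$$D_{\mathrm{KL}}(\mu \,\|\, \nu) = \int_{\mathbb{R}^d} p(x) \log \frac{p(x)}{q(x)}\, dx.$$

If $D$ is the KL divergence, we recover the KL barycenter problem [14]. In fact, for any Polish space $\Omega$, the KL barycenter problem is strictly convex and hence has at most one solution.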
2.3 Background on diffusion models
Our score fusion method depends on the generative diffusion models driven by stochastic differential equations (SDEs) developed in Song et al. [47], Ho et al. [24], Sohl-Dickstein et al. [44]. In this section, we review the background of generative diffusion models.
2.3.1 Forward process: adding noise
We begin with the unsupervised learning setup. Given an unlabeled dataset drawn i.i.d. from a distribution $p_0$, the forward diffusion process is defined in differential form as

$$dX_t = f(X_t, t)\,dt + g(t)\,dW_t, \qquad X_0 \sim p_0, \tag{1}$$

where $f$ is a vector-valued function, $g$ is a scalar function, and $W_t$ denotes a standard $d$-dimensional Brownian motion. From now on, we assume the existence of, and denote by $p_t$, the marginal density function of $X_t$, and let $p_{t|s}$ be the transition kernel from $X_s$ to $X_t$, for $0 \le s < t \le T$, where $T$ is the terminal time for the forward process (time horizon). If $f(x, t) = -a x$ and $g(t) = \sigma$ with $a > 0$ and $\sigma > 0$, then Equation (1) becomes a linear SDE with Gaussian transition kernels

$$dX_t = -a X_t\,dt + \sigma\,dW_t, \tag{2}$$

which is an Ornstein–Uhlenbeck (OU) process. If $T$ is large enough, then $p_T$ is close to $\pi = \mathcal{N}\left(0, \frac{\sigma^2}{2a} I_d\right)$, a Gaussian distribution with mean 0 (vector) and covariance matrix $\frac{\sigma^2}{2a} I_d$. The forward process can be viewed as the following dynamic: given the data distribution, we gradually add noise to it such that it becomes a known distribution in the long run.
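The Gaussian transition kernel of the OU process can be sampled exactly, which is what makes the forward process cheap to simulate. The following is a minimal sketch, assuming the parametrization $dX_t = -a X_t\,dt + \sigma\,dW_t$; the symbols `a` and `sigma`, the helper names, and the toy starting point are assumptions of this example rather than part of the source.

```python
import math
import random


def ou_forward_sample(x0, t, a=0.5, sigma=1.0, rng=random):
    """Draw X_t given X_0 = x0 for dX_t = -a X_t dt + sigma dW_t.

    The transition kernel is Gaussian with mean exp(-a t) * x0 and
    per-coordinate variance sigma^2 / (2a) * (1 - exp(-2 a t)).
    """
    decay = math.exp(-a * t)
    std = math.sqrt(sigma ** 2 / (2.0 * a) * (1.0 - decay ** 2))
    return [decay * xi + std * rng.gauss(0.0, 1.0) for xi in x0]


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(0)
# Start every path far from the origin; after a long time the samples
# should look like the stationary law N(0, sigma^2 / (2a)), here N(0, 1).
terminal = [ou_forward_sample([5.0], t=20.0, rng=rng)[0] for _ in range(20000)]
terminal_mean = sum(terminal) / len(terminal)
terminal_var = sample_variance(terminal)
```

With `a = 0.5` and `sigma = 1.0` the stationary variance is 1, so the empirical mean and variance of `terminal` should be close to 0 and 1.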
2.3.2 Backward process: denoising
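If we reverse a diffusion process in time, then under some mild conditions (see, for example, Cattiaux et al. [10], Föllmer [20]) which are satisfied for all processes under consideration in this work, we still get a diffusion process. To be more precise, we want to have a process $\tilde{X}_t$ such that for $t \in [0, T]$, $\tilde{X}_t$ has the same distribution as $X_{T-t}$. From the Fokker–Planck equation and the log trick [3], the corresponding reverse process for Process (1), with drift $f$ and diffusion coefficient $g$, is

$$d\tilde{X}_t = \left[-f(\tilde{X}_t, T-t) + g(T-t)^2\, \nabla \log p_{T-t}(\tilde{X}_t)\right] dt + g(T-t)\, d\tilde{W}_t, \qquad \tilde{X}_0 \sim p_T, \tag{3}$$

where $\nabla$ represents taking the derivative with respect to the space variable $x$ and $\tilde{W}_t$ is a standard Brownian motion. We call the term $\nabla \log p_t(x)$ the (Stein) score function. If the forward process is the OU process $dX_t = -a X_t\,dt + \sigma\,dW_t$, then the reverse process is

$$d\tilde{X}_t = \left[a \tilde{X}_t + \sigma^2\, \nabla \log p_{T-t}(\tilde{X}_t)\right] dt + \sigma\, d\tilde{W}_t. \tag{4}$$

If the backward SDE can be simulated (which is typically done via the Euler–Maruyama method, see details in Appendix A.2), we can generate samples from the distribution $p_0$. We can view simulating the backward SDE as the denoising step from pure noise to the ground-truth distribution.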
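To make the denoising step concrete, here is a small Euler–Maruyama sketch of the reverse OU process. It assumes a one-dimensional Gaussian data distribution so that the score is available in closed form; in practice the score would come from a trained network. The constants `A`, `SIGMA`, `M0`, `S0`, the step count, and the function names are all illustrative assumptions.

```python
import math
import random

A, SIGMA = 0.5, 1.0          # assumed OU coefficients (stationary variance 1)
M0, S0 = 2.0, 0.5            # toy data distribution p_0 = N(M0, S0^2)
T, N_STEPS = 5.0, 250        # time horizon and number of Euler-Maruyama steps


def marginal(t):
    """Mean and variance of the forward marginal p_t when p_0 = N(M0, S0^2)."""
    decay = math.exp(-A * t)
    var = S0 ** 2 * decay ** 2 + SIGMA ** 2 / (2 * A) * (1 - decay ** 2)
    return M0 * decay, var


def score(x, t):
    """Exact score of the Gaussian marginal p_t (stands in for a trained model)."""
    mean, var = marginal(t)
    return -(x - mean) / var


def reverse_sample(rng):
    """Simulate the reverse SDE from the Gaussian prior down to time zero."""
    h = T / N_STEPS
    x = rng.gauss(0.0, math.sqrt(SIGMA ** 2 / (2 * A)))   # start from the prior
    for n in range(N_STEPS):
        t = T - n * h                                      # forward-time index
        drift = A * x + SIGMA ** 2 * score(x, t)
        x += h * drift + SIGMA * math.sqrt(h) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
samples = [reverse_sample(rng) for _ in range(4000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / (len(samples) - 1))
```

The generated samples should have mean close to `M0` and standard deviation close to `S0`, up to discretization and Monte Carlo error.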
2.3.3 Score estimation
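The only remaining task is score estimation for $\nabla \log p_t$. There are many ways to achieve this, and some of them are equivalent up to constants that are independent of the training parameters. In this paper, we choose the time-dependent score matching loss used in Song et al. [46]:

$$\mathcal{L}(\theta) = \mathbb{E}_{t \sim U[0,T]} \Big[ w(t)\, \mathbb{E}_{X_0 \sim p_0}\, \mathbb{E}_{X_t \sim p_{t|0}(\cdot \mid X_0)} \big\| s_\theta(X_t, t) - \nabla_{x} \log p_{t|0}(X_t \mid X_0) \big\|_2^2 \Big], \tag{5}$$

where $p_{t|0}$ is the transition kernel of the forward process, $w: [0,T] \to \mathbb{R}_{>0}$ is a weighting function, and $s_\theta: \mathbb{R}^d \times [0,T] \to \mathbb{R}^d$ is a score estimator, usually chosen as a neural network. Score estimation is then done by minimizing the empirical loss using SGD [29].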
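A Monte Carlo version of a denoising score matching loss can be sketched as follows. This is an illustration under stated assumptions, not the authors' training code: it is one-dimensional, uses the OU kernel with coefficients `a` and `sigma`, samples times uniformly on `[t_min, horizon]` to avoid the singularity at zero, and weights each term by the kernel variance (one common choice of weighting function). All names are hypothetical.

```python
import math
import random


def dsm_loss(score_fn, data, a=0.5, sigma=1.0, horizon=1.0, t_min=1e-3,
             n_draws=20000, seed=0):
    """Monte Carlo estimate of a weighted denoising score matching loss (1-D).

    Uses the OU kernel X_t | X_0 ~ N(exp(-a t) X_0, v_t) and the weight
    w(t) = v_t; both are assumptions of this sketch.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x0 = rng.choice(data)
        t = rng.uniform(t_min, horizon)
        decay = math.exp(-a * t)
        v_t = sigma ** 2 / (2 * a) * (1 - decay ** 2)
        eps = rng.gauss(0.0, 1.0)
        x_t = decay * x0 + math.sqrt(v_t) * eps
        target = -eps / math.sqrt(v_t)      # grad_x log p_{t|0}(x_t | x0)
        total += v_t * (score_fn(x_t, t) - target) ** 2
    return total / n_draws


data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) for _ in range(5000)]   # toy data ~ N(0, 1)

# With a = 0.5 and sigma = 1, N(0, 1) is stationary, so the true score is -x.
loss_true = dsm_loss(lambda x, t: -x, data)
loss_zero = dsm_loss(lambda x, t: 0.0, data)
```

Because both calls reuse the same seed, they see identical noise draws, and the true score should achieve a visibly smaller loss than the zero function.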
There are many ways to measure the goodness of the generative model. Suppose $D$ is a measure of dissimilarity in $\mathcal{P}(\Omega)$; then we say $D(p_0, \hat{p})$ is a generalization error with respect to $D$, where $p_0$ is the target distribution and $\hat{p}$ is the distribution of the generated samples.
Recently, several analyses of the generative properties of diffusion models have been carried out; however, even in the case of compactly supported target distributions and sufficient smoothness regularity, the basic diffusion model encounters the curse of dimensionality. Therefore, a large amount of target data is needed to generate high-quality samples. For a detailed discussion, see Appendix A.3.
3 KL barycenters and fusion methods
In Section 3.1, we propose and analytically solve two types of KL barycenter problems. These solutions lead to the development of our fusion methods, which are detailed in Section 3.2.
3.1 KL barycenter problems
Theorem 1.
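Suppose $\Omega = \mathbb{R}^d$ and each $\mu_i$, $i \in [k]$, is absolutely continuous with respect to the Lebesgue measure, with densities $p_1, \dots, p_k$ respectively. Then, the distribution-level KL barycenter $\mu_\lambda$ is unique with density

$$p_\lambda(x) = \frac{\prod_{i=1}^k p_i(x)^{\lambda_i}}{\int_{\mathbb{R}^d} \prod_{i=1}^k p_i(y)^{\lambda_i}\, dy}.$$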
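The closed form in Theorem 1 is a normalized weighted geometric mean of the reference densities. As a quick numerical illustration, the sketch below evaluates it on a one-dimensional grid for two unit-variance Gaussians, a case where the barycenter is again Gaussian with the weighted mean of the reference means. The grid, the weights, and the function names are assumptions chosen for the example.

```python
import math


def normal_logpdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))


def kl_barycenter_density(log_densities, weights, grid):
    """Normalized weighted geometric mean of densities on a uniform grid."""
    dx = grid[1] - grid[0]
    log_unnorm = [sum(w * lp(x) for w, lp in zip(weights, log_densities))
                  for x in grid]
    shift = max(log_unnorm)                        # for numerical stability
    unnorm = [math.exp(v - shift) for v in log_unnorm]
    z = sum(unnorm) * dx                           # Riemann-sum normalizer
    return [u / z for u in unnorm]


grid = [-10.0 + 0.01 * i for i in range(2001)]
refs = [lambda x: normal_logpdf(x, -2.0, 1.0),
        lambda x: normal_logpdf(x, 2.0, 1.0)]
weights = [0.3, 0.7]
density = kl_barycenter_density(refs, weights, grid)

dx = grid[1] - grid[0]
bary_mean = sum(x * p for x, p in zip(grid, density)) * dx
bary_var = sum((x - bary_mean) ** 2 * p for x, p in zip(grid, density)) * dx
```

For these inputs the barycenter should be $\mathcal{N}(0.8, 1)$, since $0.3 \cdot (-2) + 0.7 \cdot 2 = 0.8$. Note that, unlike a mixture, the result is unimodal.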
Our second barycenter problem is posed when the sample space is the continuous-function space, i.e., $\Omega = C([0,T]; \mathbb{R}^d)$. This context yields a process-level KL barycenter. When the underlying measures are represented by SDEs, we offer a closed-form solution for the process-level KL barycenter in Theorem 2.
Theorem 2.
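Suppose for each $i \in [k]$, the $i$-th SDE has the form

$$dX_t^i = b_i(t, X_t^i)\,dt + \sigma_t\,dW_t, \qquad X_0^i \sim \mu_{i,0},$$

and has a unique strong solution. The law of the solution of each SDE is denoted as $\mu_i$. We further assume that, for each $i \in [k]$, $\mu_{i,0}$ has an absolutely continuous density with respect to the Lebesgue measure and uniformly bounded; then the process-level KL barycenter can be represented as the SDE

$$dX_t = \sum_{i=1}^k \lambda_i b_i(t, X_t)\,dt + \sigma_t\,dW_t,$$

where $X_0 \sim \mu_{\lambda,0}$, $\mu_{\lambda,0}$ is the distribution-level KL barycenter of the reference measures $\mu_{1,0}, \dots, \mu_{k,0}$, and $W_t$ is a standard Brownian motion.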
In this paper, fusing distributions is viewed as computing a KL barycenter with optimized weights. This naturally connects to the idea of transfer learning. Given well-trained reference generative models, our fusion method optimizes the weights to approximate a target distribution.
3.2 Fusion methods
Recall that in our task, we are given $k$ datasets with abundant samples, and our goal is to generate samples for a target dataset with limited available data. Therefore, in this section, we denote the target measure as $\nu$ and we assume that we are given $k$ reference diffusion generative models that are able to generate samples from different reference measures $\mu_1, \dots, \mu_k$, respectively. Specifically, each reference measure corresponds to an auxiliary backward diffusion process
(6) |
where the learned score is a well-trained score function for the $i$-th reference measure. We introduce two fusion algorithms and related generalization error results.
In practice, the discretized version of the SDE (6) is used. Specifically, we employ a small time-discretization step $h$ and a total of $N$ time steps (hence $T = Nh$). Since the terminal marginal of the forward process is close to its Gaussian limit, the SDE (6) is approximated by a Gaussian initialization and
(7) |
Then, given a weight $\lambda \in \Delta_k$, Theorem 2 implies that the corresponding process-level KL barycenter follows the SDE:
(8) |
We denote the distribution of the terminal variable as , which will later serve as the distribution of generated sample.
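As a worked step, the linearity of the fused drift can be made explicit. This is a sketch under the assumption that each auxiliary reverse process has the OU form with drift $a x + \sigma^2 s^i(x, T-t)$, where $a$, $\sigma$, and the score symbol $s^i$ are notation assumed here rather than taken from the source. Averaging the drifts with weights $\lambda \in \Delta_k$, as Theorem 2 prescribes, gives:

```latex
\sum_{i=1}^{k} \lambda_i \left[ a x + \sigma^2 s^i(x, T-t) \right]
  = a x \sum_{i=1}^{k} \lambda_i + \sigma^2 \sum_{i=1}^{k} \lambda_i s^i(x, T-t)
  = a x + \sigma^2 \sum_{i=1}^{k} \lambda_i s^i(x, T-t),
\qquad \lambda \in \Delta_k .
```

The last equality uses only that the weights sum to one. Under this assumption the fused sampler needs nothing beyond the weighted sum of the frozen auxiliary scores, which is why the optimal score is linear in the auxiliary scores.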
The key component in our diffusion method is to find an optimal weight such that the output distribution is as close to the target measure as possible. To achieve this goal, we propose two fusion methods that rely on two different optimization problems.
The first method directly optimizes over probability measures defined on the Euclidean space, and is based on Theorem 1. Namely, we consider the following convex problem

$$\min_{\lambda \in \Delta_k} D_{\mathrm{KL}}(\nu \,\|\, \mu_\lambda) = \min_{\lambda \in \Delta_k} \int_{\mathbb{R}^d} q(x) \log \frac{q(x)}{p_\lambda(x)}\, dx, \qquad p_\lambda(x) \propto \prod_{i=1}^k p_i(x)^{\lambda_i}, \tag{9}$$

where $p_1, \dots, p_k$ denote the densities of the reference measures and $q$ denotes the density of the target distribution $\nu$. We refer to this fusion method as vanilla fusion. Suppose we have accurate estimates of the densities $p_i$. We then use the Frank–Wolfe method to solve Problem (9) and get an optimal $\lambda$. In the Frank–Wolfe method, the gradient term can be approximated by sample mean estimators from target data (see Remark 2 in Appendix C.1.3). To generate samples, we plug the optimal $\lambda$ into (8) and simulate the SDE.
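The vanilla fusion step can be sketched in one dimension as follows. The sketch assumes that the objective is the KL divergence from the target to the barycenter, whose gradient in the $i$-th weight is the barycenter expectation of $\log p_i$ minus the target expectation of $\log p_i$; the first term is computed by grid integration and the second by a sample mean. The step-size rule, grid, and function names are illustrative assumptions, and the source's own implementation may differ.

```python
import math
import random


def normal_logpdf(x, mean, std=1.0):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))


def frank_wolfe_weights(log_densities, target_samples, grid, n_iters=300):
    """Minimize KL(target || barycenter(lam)) over the simplex by Frank-Wolfe."""
    k = len(log_densities)
    # Tabulate log p_i on the grid once; reused by every iteration.
    table = [[lp(x) for x in grid] for lp in log_densities]
    # Sample-mean estimate of E_target[log p_i(X)].
    target_term = [sum(lp(x) for x in target_samples) / len(target_samples)
                   for lp in log_densities]
    lam = [1.0 / k] * k
    for it in range(n_iters):
        log_unnorm = [sum(lam[i] * table[i][j] for i in range(k))
                      for j in range(len(grid))]
        shift = max(log_unnorm)
        unnorm = [math.exp(v - shift) for v in log_unnorm]
        z = sum(unnorm)
        # grad_i = E_barycenter[log p_i] - E_target[log p_i]
        grad = [sum(u * table[i][j] for j, u in enumerate(unnorm)) / z
                - target_term[i] for i in range(k)]
        vertex = min(range(k), key=lambda i: grad[i])   # linear minimization step
        step = 2.0 / (it + 2.0)
        lam = [(1.0 - step) * w for w in lam]
        lam[vertex] += step
    return lam


rng = random.Random(0)
target_samples = [rng.gauss(0.8, 1.0) for _ in range(2000)]   # toy target N(0.8, 1)
grid = [-8.0 + 0.02 * j for j in range(801)]
refs = [lambda x: normal_logpdf(x, -2.0), lambda x: normal_logpdf(x, 2.0)]
lam = frank_wolfe_weights(refs, target_samples, grid)
```

In this toy setting the target coincides with the barycenter at weights $(0.3, 0.7)$, so `lam` should land near those values. The grid integration is the part that becomes impractical in high dimensions.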
We note that a similar idea of fusing component distributions via KL barycenters has been proposed in Claici et al. [14], which uses the average KL divergence as a metric to recover the mean-field approximation of the posterior distribution of the fused global model. Both methods solve a two-layer optimization problem: finding the barycenter and the optimal weight. Moreover, both methods introduce a convex optimization problem to help find optimizers. The difference is that vanilla fusion solves the barycenter problem first (since we almost know the analytical barycenter) and the main task is to find optimal weights, while Claici et al. [14] find both optimizers simultaneously and their convex problem is only a relaxation of the original target.
However, diffusion generative models usually cannot directly estimate the densities of the reference measures. Hence, for complicated high-dimensional distributions, it is usually hard to directly apply vanilla fusion. We therefore propose a practical alternative, a process-level method called ScoreFusion. The numerical results in Section 5 were generated by employing Algorithm 1.
In our second method, we first build a forward process starting from the target dataset, according to (2). We denote this forward process as and the corresponding density as . Then, we modify the loss function (5) as a linear regression problem
(10) |
where we restrict the training times to a small horizon. The intuition behind this choice is that we want to learn an optimal weight such that the fused distribution at time zero is close to the target. Therefore, when the training horizon is small (the forward process has not injected much noise), the weight obtained from the training is less affected by the noise. Theoretically, a training horizon of zero is optimal, but this is hard to implement. Algorithm 1 with a zero training horizon can be viewed as a variant of vanilla fusion, since the learning is then performed only at the distribution level (time zero), and an extremely small horizon causes numerical instability in practice, which makes sense given the numerical integration and density estimations needed in vanilla fusion. The optimization problem associated with our second method is Problem (10); the details are in Algorithm 1.
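The regression view of the second method can be sketched in one dimension. The example below is not the source's Algorithm 1: it assumes OU noising with coefficients `a` and `sigma`, a training window `[t_min, t_max]`, closed-form auxiliary scores standing in for frozen networks, and it solves the resulting quadratic problem by projected gradient descent onto the simplex. All names and constants are hypothetical.

```python
import math
import random


def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        candidate = (cumulative - 1.0) / j
        if uj - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def fit_fusion_weights(aux_scores, target_data, a=0.5, sigma=1.0, t_max=0.5,
                       t_min=0.05, n_draws=20000, n_steps=500, seed=0):
    """Least-squares fit of simplex weights so that sum_i lam_i * s_i matches
    the denoising target on noised target data (1-D sketch)."""
    rng = random.Random(seed)
    k = len(aux_scores)
    gram = [[0.0] * k for _ in range(k)]      # estimates E[s_i s_j]
    cross = [0.0] * k                         # estimates E[s_i * target]
    for _ in range(n_draws):
        x0 = rng.choice(target_data)
        t = rng.uniform(t_min, t_max)
        decay = math.exp(-a * t)
        v_t = sigma ** 2 / (2 * a) * (1 - decay ** 2)
        eps = rng.gauss(0.0, 1.0)
        x_t = decay * x0 + math.sqrt(v_t) * eps
        target = -eps / math.sqrt(v_t)        # conditional score of the kernel
        s = [score(x_t, t) for score in aux_scores]
        for i in range(k):
            cross[i] += s[i] * target / n_draws
            for j in range(k):
                gram[i][j] += s[i] * s[j] / n_draws
    # Projected gradient descent on lam' G lam - 2 c' lam over the simplex.
    lr = 1.0 / (2.0 * sum(gram[i][i] for i in range(k)))  # trace bounds top eigenvalue
    lam = [1.0 / k] * k
    for _ in range(n_steps):
        grad = [2.0 * (sum(gram[i][j] * lam[j] for j in range(k)) - cross[i])
                for i in range(k)]
        lam = project_to_simplex([w - lr * g for w, g in zip(lam, grad)])
    return lam


A = 0.5
# Exact scores of N(-2, 1) and N(2, 1) noised by the OU process with a = 0.5.
aux_scores = [lambda x, t: -(x + 2.0 * math.exp(-A * t)),
              lambda x, t: -(x - 2.0 * math.exp(-A * t))]
data_rng = random.Random(1)
target_data = [data_rng.gauss(0.8, 1.0) for _ in range(2000)]
lam = fit_fusion_weights(aux_scores, target_data, a=A)
```

Here the target score is exactly the $(0.3, 0.7)$ combination of the auxiliary scores, so the fitted weights should be close to those values. No density estimate or numerical integration is needed, only score evaluations.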
4 Convergence results
This section details the convergence results for our proposed fusion methods. We focus on sample complexities, quantified by the necessary number of samples in the target dataset, in terms of total variation distance. We show that the sample complexities of our methods are dimension-free, given that the auxiliary processes are accurately fitted to their reference distributions and together offer adequate information for the target distribution. To begin with, we assume all distributions are compactly supported.
Assumption 1.
The target and reference distributions are all compactly supported in with absolutely continuous densities. We assume that their second moments are bounded by .
Proposition 1 implies that Problem (9) is easy to solve given that the reference densities can be estimated. We further require Assumption 2 below, which guarantees that each auxiliary process is accurately trained in the sense that the score function at each time step is well-fitted.
Assumption 2.
For each and for all , is -Lipschitz with and the step size satisfies ; for each and , with small .
To proceed, we denote and to be the solutions of Problems 9 and 10, respectively. Furthermore, the corresponding barycenters are denoted as and . Assumption 3 below states that the theoretical optimal barycenters are close to the target measure, which ensures all reference distributions together are able to provide sufficient information for the target distribution.
Assumption 3.
and , with small and .
Based on Assumptions 1, 2 and 3, we provide convergence results for the vanilla fusion and ScoreFusion (Algorithm 1) in Theorems 3 and 4, respectively.
Theorem 3.
Suppose that Assumptions 1, 2, and 3 are satisfied. We further assume for each fixed , , where is the barycenter of the output distributions of auxiliary processes. Then, for and , the output distribution of the vanilla fusion method, , we have with probability at least ,
where SE is the error of auxiliary score estimation, defined as
Theorem 4.
Theorems 3 and 4 demonstrate dimension-free sample complexities, provided that the auxiliaries are well approximated and that, combined, they capture the features of the target well. More specifically, each bound in Theorems 3 and 4 has four terms, which represent different sources of error.
The quality of the combined auxiliaries is the essential assumption in both Theorems 3 and 4. The sampling error in Theorem 4 reflects the fact that, with the help of diffusion models, the optimization becomes linear in terms of scores, making the problem easier and allowing it to escape the curse of dimensionality. The term for the approximation to time zero replaces the vanilla fusion with a small controllable noise but makes the implementation much easier. It is worth noting that there is a tradeoff in choosing the training time horizon: the smaller it is, the more accurate the optimal weights are, but the more likely the algorithm is to encounter numerical instability. Finally, the score estimation term of the auxiliaries can be small with a careful choice of discretization time steps and accurate auxiliary score approximation (see Remark 3 in Appendix C.2).
5 Numerical results
We implement the ScoreFusion model and examine its performance on both synthetic and real-world image datasets. The auxiliary score functions use the same U-Net backbone as the code repository of Song et al. [47] for score-based diffusion. Our experiments vary the quantity of training data available to ScoreFusion and the baseline, which is a regular score-based diffusion model. We aim to demonstrate that in the low-data regime, ScoreFusion outperforms training a score model from scratch. This section summarizes key experimental findings, leaving implementation details and additional data to Appendix D.
5.1 Bimodal Gaussian mixture distributions
We test ScoreFusion’s ability to approximate a one-dimensional bimodal Gaussian mixture distribution using two auxiliary distributions. Since the data is synthetic, we can access the true density function of the target distribution and auxiliary distributions, shown in the right of Figure 2; the ground truth distribution is in grey. Table 1 gives the $1$-Wasserstein distance between the distribution learned by ScoreFusion and the ground truth distribution, calculated using SciPy.
Model | |||||
Baseline | |||||
ScoreFusion | |||||
![Refer to caption](extracted/5697135/images/1D/1d_64_comp1.png)
![Refer to caption](extracted/5697135/images/1D/1d_256_comp2.png)
Using only training data, ScoreFusion can already learn a good representation of the ground truth distribution. In contrast, the standard diffusion model is overly widespread and fails to capture the modes of the Gaussian mixture. Moreover, ScoreFusion consistently outperforms the baseline in distance when the number of training data is fewer than .
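The distance in Table 1 can be illustrated with a dependency-free sketch. It assumes the metric is the one-dimensional 1-Wasserstein distance between empirical samples, which is what SciPy's `scipy.stats.wasserstein_distance` computes; for two samples of equal size this reduces to the mean absolute difference of the sorted values. The helper name and the toy samples are assumptions of the example.

```python
import random


def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many samples")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


rng = random.Random(0)
truth = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [x + 1.5 for x in truth]
dist_shift = wasserstein_1d(truth, shifted)     # a pure shift moves all mass by 1.5
dist_self = wasserstein_1d(truth, truth)        # identical samples give zero
```

A model whose samples are too spread out or miss a mode would show up as a larger value of this distance against ground-truth samples.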
5.2 EMNIST with heterogeneous digits mix
We further demonstrate our algorithm on the EMNIST dataset [15], an augmentation of the original MNIST dataset comprising handwritten digits in 1x28x28 format. We selected five subsets (, ) from EMNIST, focusing on the digits 7 and 9 with varying frequencies: , , , , and . serve as auxiliary datasets for training the auxiliary scores, each rich in training samples to ensure adequate training of the auxiliaries. is used as the target dataset for training both ScoreFusion and the baseline, with variations in training and validation data to assess comparative test performance.
Two metrics are used to evaluate image samples generated by different models. First is the Negative Log Likelihood (NLL), measured on a held-out test dataset and expressed in bits per dimension (bpd) [50]; a smaller NLL implies that test images are more likely samples from the trained generative model, and is a standard metric for evaluating diffusion models [24, 47]. Table 2 displays the results for target sample sizes ranging from to , which shows that the ScoreFusion model can generalize to test samples much better than the baseline diffusion model in the low-data regime.
Sample size | ||||
Baseline | ||||
Single auxiliary | ||||
ScoreFusion | ||||
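For readers unfamiliar with the unit, the conversion from a per-image negative log-likelihood to bits per dimension can be sketched as below. This assumes the NLL is given in nats for the whole image and ignores any data-scaling offset that a particular likelihood evaluation may add; the function name is hypothetical.

```python
import math


def nll_to_bits_per_dim(nll_nats, num_dims):
    """Convert a per-image negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2.0))


# A 1x28x28 image has 784 dimensions; an NLL of 784 * ln(2) nats is 1 bit per dimension.
bpd = nll_to_bits_per_dim(nll_nats=784 * math.log(2.0), num_dims=1 * 28 * 28)
```

Dividing by the number of dimensions makes values comparable across image sizes, and dividing by $\ln 2$ changes the unit from nats to bits.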
The second metric examines the digit class distribution of generated samples, a discrete distribution over ten classes. This metric is related to the idea of sample diversity as explained in Naeem et al. [32]. To estimate the ratio of digits in the samples, we train an image classifier called SpinalNet [27] on the entire EMNIST digits class. At evaluation, we sample images from a trained generative model, feed them into the pre-trained SpinalNet, and average the outputs (i.e. the mean of the length-10 softmaxed logits vectors) to approximate the generative model’s digits distribution. A comparison is given in Table 3. ScoreFusion consistently mirrors the proportion of 7’s and 9’s of the ground truth dataset where the baseline struggles, an impressive result given that this metric was not explicitly optimized in the training of ScoreFusion.
Digit | True | Baseline | Fusion | Baseline | Fusion | Baseline | Fusion | Baseline | Fusion |
7 | 60% | 47.9% | 55.6% | 66.8% | 57.5% | 65.5% | 56.6% | 66.7% | 59.8% |
9 | 40% | 10.3% | 39.4% | 23.8% | 38.0% | 26.7% | 39.8% | 27.9% | 36.7% |
Others | 0 | 41.8% | 5.0% | 9.4% | 4.5% | 7.8% | 3.6% | 5.4% | 3.5% |
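The class-distribution estimate described above amounts to averaging softmax vectors over generated images. A minimal sketch follows; the logits are made-up placeholders standing in for the classifier's outputs, and the function names are hypothetical.

```python
import math


def softmax(logits):
    shift = max(logits)                              # for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimated_class_distribution(logits_per_image):
    """Average the softmax outputs of a classifier over generated images."""
    probs = [softmax(row) for row in logits_per_image]
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]


# Hypothetical logits for three generated images over ten digit classes.
logits = [
    [0, 0, 0, 0, 0, 0, 0, 8, 0, 1],   # confidently a 7
    [0, 0, 0, 0, 0, 0, 0, 7, 0, 2],   # confidently a 7
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 9],   # confidently a 9
]
dist = estimated_class_distribution(logits)
```

Averaging probabilities rather than hard labels keeps the classifier's uncertainty in the estimate, so ambiguous samples contribute mass to several classes.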
Finally, we present generated images from the baseline diffusion model and ScoreFusion in Figure 3. With only 64 training images, ScoreFusion can already produce high-quality digits, while the baseline diffusion method generates unrecognizable images. ScoreFusion also outperforms the baseline with 256 training images, producing clearer and more accurate digits.
![Refer to caption](extracted/5697135/images/HD/fusion_v_base.png)
6 Conclusion
In this paper, we propose a fusion method based on KL barycenters that can be easily implemented if the auxiliary score estimations are obtained from diffusion. We provide a theoretical analysis of the sample complexity, showing that it is dimension-free given accurate auxiliary score estimation and closeness between the optimal KL barycenter and the target distribution. The numerical experiments further demonstrate that our fusion method performs much better than the basic diffusion model in the low-data regime. This work forms a basic starting point for approximating a target when data is limited using the method of fusion, in which the diffusion model makes the implementation much easier. More broadly, the fusion methods may be applied to other variants in the diffusion model family, including different assumptions on initial distributions [17, 18, 8], other neural network structures [11], Schrödinger bridges [30, 54, 18], etc.
Acknowledgments
This work is supported generously by the NSF grants CCF-2312204 and CCF-2312205 and Air Force Office of Scientific Research FA9550-20-1-0397. Additional support is gratefully acknowledged from NSF 1915967, 2118199, and 2229012.
References
- Agueh and Carlier [2011] Martial Agueh and Guillaume Carlier. Barycenters in the wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
- Ambrosio et al. [2005] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows. Springer Science Business Media, 2005.
- Anderson [1982] Brian D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
- Austin et al. [2021] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021). NeurIPS, 2021.
- Avrahami et al. [2022] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- Banerjee et al. [2005] Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, and Suvrit Sra. Clustering on the unit hypersphere using von mises-fisher distributions. Journal of Machine Learning Research, 6:1345–1382, 2005.
- Benamou et al. [2015] Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
- Block et al. [2022] Alexander Block, Youssef Mroueh, and Alexander Rakhlin. Generative modeling with denoising auto-encoders and langevin sampling. arXiv e-prints, 2022.
- Braun et al. [2022] Gábor Braun, Alejandro Carderera, Cyrille W. Combettes, Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Sebastian Pokutta. Conditional gradient methods. arXiv preprint arXiv:2211.14103, 2022.
- Cattiaux et al. [2022] Patrick Cattiaux, Giovanni Conforti, Ivan Gentil, and Christian Léonard. Time reversal of diffusion processes under a finite entropy condition. arXiv preprint arXiv:2104.07708, 2022.
- Chen et al. [2023a] Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 4672–4712. PMLR, 2023a.
- Chen et al. [2023b] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.1121, 2023b.
- Claici et al. [2018] Sebastian Claici, Edward Chien, and Justin Solomon. Stochastic wasserstein barycenters. In International Conference on Machine Learning, pages 1141–1150, 2018.
- Claici et al. [2020] Sebastian Claici, Mikhail Yurochkin, Soumya Ghosh, and Justin Solomon. Model fusion with kullback-leibler divergence. In International Conference on Machine Learning. PMLR, 2020.
- Cohen et al. [2017] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 international joint conference on neural networks (IJCNN), pages 2921–2926. IEEE, 2017.
- Cuturi and Doucet [2014] Marco Cuturi and Arnaud Doucet. Fast computation of wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.
- De Bortoli [2022] Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research, 2022.
- De Bortoli et al. [2021] Valentin De Bortoli, Jacob Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. In Advances in Neural Information Processing Systems, volume 34, pages 17695–17709. Curran Associates, Inc., 2021.
- Evans [2010] Lawrence C. Evans. Partial Differential Equations. American Mathematical Society, 2010.
- Föllmer [1985] H Föllmer. An entropy approach to the time reversal of diffusion processes. Stochastic differential systems (Marseille-Luminy, 1984), 69:156–163, 1985.
- Gelfand and Fomin [2000] I. M. Gelfand and S. V. Fomin. Calculus of Variations. Dover Publications, 2000.
- Genevay et al. [2016] Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for large-scale optimal transport. Advances in Neural Information Processing Systems, 29:3432–3440, 2016.
- Gong et al. [2022] Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and LingPeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933, 2022.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Hsu et al. [2021] Daniel Hsu, Clayton Sanford, Rocco A. Servedio, and Emmanouil V. Vlatakis-Gkaragkounis. On the approximation power of two-layer networks of random relus. In Proceedings of Machine Learning Research, volume 134, pages 1–39. 34th Annual Conference on Learning Theory, 2021.
- Janati et al. [2020] Hicham Janati, Marco Cuturi, and Alexandre Gramfort. Debiased sinkhorn barycenters. In International Conference on Machine Learning, pages 4692–4701, 2020.
- Kabir et al. [2022] HM Dipu Kabir, Moloud Abdar, Abbas Khosravi, Seyed Mohammad Jafar Jalali, Amir F Atiya, Saeid Nahavandi, and Dipti Srinivasan. Spinalnet: Deep neural network with gradual input. IEEE Transactions on Artificial Intelligence, 2022.
- Kuznetsova et al. [2020] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. In International Journal of Computer Vision. ICCV, 2020.
- Li et al. [2023] Puheng Li, Zhong Li, Huishuai Zhang, and Jiang Bian. On the generalization properties of diffusion models. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023). NeurIPS, 2023.
- Liu et al. [2022] Hongjun Liu, Xiang Zhang, and Qionghai Li. Sb-ddpm: Schrödinger bridge diffusion denoising probabilistic model for generative tasks. IEEE Transactions on Neural Networks and Learning Systems, 2022.
- Mokady et al. [2022] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. arXiv preprint arXiv:2211.09794, 2022.
- Naeem et al. [2020] Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, pages 7176–7185. PMLR, 2020.
- Avrahami et al. [2022] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022.
- Pan and Yang [2009] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
- Peyré and Cuturi [2019] Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
- Peyré et al. [2016] Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, pages 2664–2672, 2016.
- Popov et al. [2021] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Learning Representations, 2021.
- Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
- Rasul et al. [2021] Kashif Rasul, Calvin Seward, Ingmar Schuster, and Roland Vollgraf. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. In International Conference on Learning Representations, 2021.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- Saharia et al. [2022a] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022). NeurIPS, 2022a.
- Saharia et al. [2022b] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022b.
- Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems 35. NeurIPS, 2022.
- Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. Proceedings of the 32nd International Conference on Machine Learning, 37:2256–2265, 2015.
- Solomon et al. [2015] Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, and Leonidas Guibas. Convolutional wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34(4):66, 2015.
- Song et al. [2021a] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021). NeurIPS, 2021a.
- Song et al. [2021b] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modelling through stochastic differential equations. In International Conference on Learning Representations (ICLR 2021). ICLR, 2021b.
- Staib et al. [2017] Matthew Staib, Sebastian Claici, Justin Solomon, and Stefanie Jegelka. Parallel streaming wasserstein barycenters. In Advances in Neural Information Processing Systems, volume 30, pages 2647–2658, 2017.
- Tan et al. [2020] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. Artificial Intelligence Review, 52(2):1089–1116, 2020.
- Theis et al. [2016] L Theis, A van den Oord, and M Bethge. A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR 2016), pages 1–10, 2016.
- Torrey and Shavlik [2010] Lisa Torrey and Jude Shavlik. Transfer learning. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pages 242–264, 2010.
- Tumanyan et al. [2022] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. arXiv preprint arXiv:2211.12572, 2022.
- van Handel [2016] Ramon van Handel. Probability in high dimension, apc 550 lecture notes, December 2016.
- Vargas et al. [2022] Francisco J. Vargas, James E. Taylor, and Valentin de Bortoli. Solving schrödinger bridges via maximum likelihood: Applications to diffusion-based generative modeling. In Proceedings of the International Conference on Learning Representations, 2022.
- Wang et al. [2023] Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang Wang, Weizhu Chen, and Mingyuan Zhou. Patch diffusion: Faster and more data-efficient training of diffusion models. arXiv preprint arXiv:2304.12526, 2023.
- Weiss et al. [2016] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. In Journal of Big Data, volume 3, pages 1–40. Springer, 2016.
- Wu et al. [2022] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022.
- Zhang et al. [2023] Ruoyu Zhang, Yanzeng Li, Yongliang Ma, Ming Zhou, and Lei Zou. Llmaaa: Making large language models as active annotators. arXiv preprint arXiv:2310.19596, 2023.
- Zhu et al. [2023] Jingyuan Zhu, Huimin Ma, Jiansheng Chen, and Jian Yuan. Domainstudio: Fine-tuning diffusion models for domain-driven image generation using limited data. arXiv preprint arXiv:2306.14153, 2023.
- Zhuang et al. [2021] Fuzhen Zhuang, Ziliang Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. In Proceedings of the IEEE, volume 109, pages 43–76. IEEE, 2021.
Appendix A More about basic diffusion models
A.1 About the time reversal formula
Note that Equations (3) and (4) are still represented as “forward” processes. If we replace by , where is a standard -dimensional Brownian motion that flows backward from time to 0, then Equation (3) becomes
which is the reverse SDE presented in Song et al. [47]. Hence for the forward OU process, the reverse process has another representation by
A.2 Discretization and backward sampling
In this section, we follow the scheme in Chen et al. [12].
Given samples from (the data distribution), we train a neural network with the loss function (5). Let be the step size of the time discretization and suppose there are steps, hence . We assume that for each time , a score estimate of is available. To simulate the reverse SDE (3), we first replace the score function with the estimate . Next, for each , we freeze the value of this coefficient in the SDE at time , which yields the new time-discretized SDE: for each ,
(11) |
and , where is the (theoretical) stationary distribution of the forward process (1).
This implementation involves several details. In practice, when we use an OU process as the forward process, Equation (11) becomes
with , which is a linear SDE. In particular, conditioned on is Gaussian, so the sampling is easier.
In theory, we should use , which we have no access to. The above implementation takes advantage of as is large enough. This introduces a small initialization error.
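The discretization above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes the forward OU process is normalized as dX = -X dt + sqrt(2) dW (the convention of Chen et al. [12]; the scaling in Equation (1) may differ), works in one dimension, and uses the exact score of a Gaussian data law in place of a trained network. The names `reverse_ou_sample` and `gaussian_score` are hypothetical. Because the frozen-score SDE is linear on each step, each update is an exact Gaussian transition.

```python
import math
import random

def reverse_ou_sample(score, T=5.0, n_steps=200, rng=random):
    """Draw one sample by running the time-discretized reverse OU process.

    Assumed forward process: dX = -X dt + sqrt(2) dW, with stationary
    law N(0, 1). `score(x, t)` approximates grad log p_t(x).
    """
    h = T / n_steps
    y = rng.gauss(0.0, 1.0)  # start from the stationary law instead of p_T
    growth = math.exp(h)
    noise_std = math.sqrt(math.exp(2.0 * h) - 1.0)
    for k in range(n_steps):
        s = score(y, T - k * h)  # score frozen at the left grid point
        # Exact solution of dY = (Y + 2 s) dt + sqrt(2) dB over one step.
        y = growth * y + 2.0 * (growth - 1.0) * s + noise_std * rng.gauss(0.0, 1.0)
    return y

def gaussian_score(m, s):
    """Exact score of p_t when the data law is N(m, s^2) (toy stand-in)."""
    def score(x, t):
        mean_t = m * math.exp(-t)
        var_t = s * s * math.exp(-2.0 * t) + 1.0 - math.exp(-2.0 * t)
        return -(x - mean_t) / var_t
    return score

rng = random.Random(0)
score_fn = gaussian_score(2.0, 0.5)
samples = [reverse_ou_sample(score_fn, rng=rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))
```

With an exact score, the remaining gap between the sample statistics and the data law comes from the step size and from starting at the stationary law rather than at the time-T marginal, which mirrors the discretization and initialization errors discussed in this section.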
A.3 About the generalization error of basic diffusion model
In Li et al. [29], a random feature model is considered as the score estimator. The basic intuition is that the generalization error with respect to the KL divergence decomposes into three terms: the training error, the approximation error of the underlying random feature model, and the convergence error of stationary measures. Among these three, the third is negligible because of the fast rate of convergence of an OU process (or, alternatively, by the log-Sobolev inequality for Gaussian random variables in van Handel [53]). The first is also small since the random feature model in this setting is essentially linear regression with least squares.
Moreover, as stated in Hsu et al. [25], the random feature model can approximate Lipschitz functions with compact support. However, the approximation error can be large and cause the curse of dimensionality if we choose . To illustrate this, we make a more general statement that includes smoothness considerations.
To be more precise, we introduce the following setting. We use the basic diffusion model with a forward OU process. The score function is parameterized by the random feature model with random features:
where is the ReLU activation function, denotes the trainable parameters, , are initially sampled from some pre-chosen distributions (related to random features) and remain frozen during training, and is the time embedding function. The precise description is given below.
Assume that and are drawn i.i.d. from a distribution . Then, as , by the strong law of large numbers, with probability 1,
(12) |
where and . From the positive homogeneity of the ReLU function, we may assume . The optimal solution is denoted by when replacing in the loss objective with .
Define a kernel and denote the induced reproducing kernel Hilbert space (RKHS) as ; if there is no misunderstanding, we denote . It follows that if and only if .
In Hsu et al. [25], a notion of approximation quality called minimum width of the neural network is defined to measure the minimum number of random features needed to guarantee an accurate enough approximation with high probability. The exact definition is given below.
Definition 1.
Given and a function with bounded norm , where is the measure in associated with the corresponding function space. We also denote . The minimum width is defined to be the smallest such that with probability at least over ,
Moreover, for , , and an open and bounded set , is the Sobolev space of order , consisting of all locally integrable functions such that for each multi-index with , the weak derivative of exists and has finite norm (see Evans [19]). If , we denote to reflect the fact that it is a Hilbert space. Finally, recall that the space of all Lipschitz functions on is the same as .
With these settings and definitions, we can state and prove the following generalization error bound for the basic diffusion model using the random feature model.
Theorem 5.
Suppose that the target distribution is continuously differentiable and has compact support, that we choose an appropriate random feature , and that there exists an RKHS such that . Assume that the initial loss, trainable parameters, the embedding function and the weighting function are all bounded. We further suppose that for all , the score function and there exists such that , where is compact. Then for fixed , with probability at least , we have
where is the training time (steps) in the gradient flow dynamics (see Li et al. [29]), is the number of random features, is the sample size of the target distribution, is the stationary Gaussian distribution, is the distribution of the forward OU process at time , is the target distribution, and is the distribution of the generated samples.
Proof.
The proof is the same as that of Theorem 1 in Li et al. [29]. The only extra work is to compute the universal approximation error of the random feature model for Sobolev functions on a compact domain. By the compact support assumption (Lemma 1 in Li et al. [29]), the forward process defines a random path contained in a compact rectangular domain in .
Theorem 35 in Hsu et al. [25] states the existence of a random feature such that for any with , , which implies the approximation error term.
Remark 1.
The random feature model has two difficulties in implementation.
- If , , and are large enough, then the generalization error is small regardless of the sample size . However, choosing the random feature is hard in practice, especially since neither Hsu et al. [25] nor Li et al. [29] specifies how to choose . Therefore, the assumption that is appropriately chosen is very strong.
- Even if is appropriately chosen, if we let and try to find an optimal early stopping time as in Li et al. [29], the term still dominates and exhibits the curse of dimensionality.
Appendix B Proof of results in Section 3.1
Before the proofs, we note the strict convexity of the KL barycenter problems via a simple lemma.
Lemma 1.
For any Polish space , the KL barycenter problem is strictly convex.
Proof.
Let and such that and , for each , then
where the inequality follows from the strict convexity of the KL divergence in terms of with fixed . Therefore, the KL barycenter problem is strictly convex.
B.1 Proof of Theorem 1
Proof.
It suffices to consider a probability measure with absolutely continuous density (otherwise the KL divergence is ) and show the existence. If there is no confusion, we use the density and measure interchangeably. We denote as the space of all absolutely continuous distributions and define a functional that for
Therefore, the barycenter problem becomes
which is a variational problem with a subsidiary condition ([21]). Therefore, from calculus of variations, a necessary condition for to be an extremal of the variational problem is for some constant
Hence, the optimal solution is
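For intuition, the finite-state analogue of this result can be checked numerically. On a finite set with weights summing to one, the minimizer of the weighted KL objective is known to be the normalized weighted geometric mean of the reference distributions, and the minimum value equals minus the log of the normalizing constant. The sketch below is illustrative only: the function names and the example distributions are hypothetical, and the references are assumed to have full support.

```python
import math
import random

def kl(p, q):
    """KL(p || q) for discrete distributions; q must have full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_barycenter(refs, weights):
    """Normalized weighted geometric mean of the reference distributions.

    Returns the barycenter and its normalizing constant Z.
    """
    unnorm = [
        math.exp(sum(w * math.log(r[x]) for w, r in zip(weights, refs)))
        for x in range(len(refs[0]))
    ]
    z = sum(unnorm)
    return [u / z for u in unnorm], z

def objective(mu, refs, weights):
    """Weighted sum of KL(mu || ref_i), the barycenter objective."""
    return sum(w * kl(mu, r) for w, r in zip(weights, refs))

refs = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7], [0.25, 0.25, 0.25, 0.25]]
weights = [0.5, 0.3, 0.2]
bary, z = kl_barycenter(refs, weights)
best = objective(bary, refs, weights)

# Compare against random competitors drawn on the simplex.
rng = random.Random(0)
competitors = []
for _ in range(200):
    raw = [rng.random() + 1e-9 for _ in range(4)]
    total = sum(raw)
    competitors.append([v / total for v in raw])
worst_margin = min(objective(c, refs, weights) - best for c in competitors)
```

No competitor attains a smaller objective than `bary`, because the objective at any candidate equals its KL divergence to the barycenter minus log Z.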
B.2 Proof of Theorem 2
Before the proof of Theorem 2, we review a consequence of Girsanov’s theorem (Theorem 8 in Chen et al. [12]). We will use a technique similar to that of Chen et al. [12] to prove Theorem 2.
Theorem 6.
Suppose . For , let and the stochastic exponential , where is a -Brownian motion. Assume . Then is a square integrable -martingale. Moreover, if then is a true -martingale and the process is a -Brownian motion, where is a probability measure such that .
In most applications of Girsanov’s theorem, we need to verify a sufficient condition known as Novikov’s condition. In the context of Theorem 6, Novikov’s condition is
(13) |
Now we begin the proof of Theorem 2.
Proof.
From Lemma 1, it suffices to show the existence. Let with initial distribution . We write for the initial distribution of the process whose law is the measure . From the chain rule of KL divergence, we have
where the first term solves the KL barycenter problem with respect to the initial distributions, and the second term solves the KL barycenter problem in which all reference processes have the same initial distribution. Therefore, to finish the proof, we can assume for each , , the same initial distribution.
Since we are finding the minimizer of the weighted sum of KL divergences, it is sufficient to assume that is the law of a diffusion process which is a strong solution of an SDE with the same diffusion (volatility) coefficient as all reference processes:
where is a standard Brownian motion, and otherwise the KL divergence would be . For now, we assume that is uniformly bounded.
When applying Girsanov’s theorem, it is more convenient to view different path measures on as the different laws of the same single stochastic process. For notational convenience, we denote the single process as .
For each , we can apply Girsanov’s theorem to and
in the setting of Theorem 6. Therefore, under the measure , there exists a Brownian motion such that
Since under the measure , with probability 1,
then this also holds -almost surely, which implies that -almost surely, , and
In other words, in law.
Therefore,
since an Itô integral with a regular integrand is a true martingale.
Therefore, the objective function of process level KL barycenter problem becomes
given our assumption that all reference laws have the same initial distribution. Therefore, as a functional optimization problem, the minimizer , which finishes the proof.
Appendix C Proof of results in Section 4
C.1 Preliminaries and basic tools
C.1.1 Preliminaries
We include this subsection to present basic definitions and notations used in our proofs.
Definition 2.
Let be a Polish space equipped with its Borel -algebra , and let be a set of probability measures. We say that converges to weakly if and only if, for each bounded and continuous function , as ,
Definition 3.
Given two measurable spaces and , is a measurable function, and is a (positive) measure space. The pushforward of is defined to be a measure such that for any ,
Definition 4.
A differentiable function is called -smooth if for any ,
Definition 5.
A stochastic process is called a local martingale if there exists a sequence of nondecreasing stopping times such that and is a true martingale.
Next we define some notations and stochastic processes that will be used in the following proofs.
Recall the process (6) is a backward SDE with score terms replaced by the estimations. We say for each , process is the theoretical backward process with exact score terms:
(14) |
The corresponding forward process is denoted as :
(15) |
We denote the marginal density of as ; when , we use the notation . Process (8) is a time-discretized SDE to be implemented in practice. It can be viewed as an approximation of the theoretical barycenter process (denoted as ) of the backward SDEs of the form (14):
(16) |
where is the distribution level KL barycenter at time with respect to the reference measures . When is large, is approximated by in Equation (8). In theory, there is a corresponding forward process with respect to process (16):
(17) |
For a fixed , we denote as the marginal distribution of process (17) at time ; when , we ignore the time subscript.
C.1.2 Basic algorithms
In this section, we recall the Frank-Wolfe method [9], which is used to solve an optimization problem with an -smooth convex function on a compact domain :
(18) |
To measure the error of the algorithm, for each we define the primal gap as
where is the minimizer of problem (18).
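As an illustration of the method in the setting used later, the sketch below runs Frank-Wolfe on the probability simplex with the standard step size 2/(k+2). The toy objective is an assumption made for the example: a finite-state version of the weight-fitting problem, namely minimizing over the weights the cross-entropy between a target and the normalized weighted geometric mean of two reference distributions, with the target expectation computed exactly rather than from samples. All names (`frank_wolfe_simplex`, `make_grad`, `barycenter`) are hypothetical.

```python
import math

def frank_wolfe_simplex(grad, dim, n_iters=2000):
    """Frank-Wolfe on the probability simplex with step size 2 / (k + 2)."""
    lam = [1.0 / dim] * dim
    for k in range(n_iters):
        g = grad(lam)
        # A linear function on the simplex is minimized at a vertex.
        i_star = min(range(dim), key=lambda i: g[i])
        gamma = 2.0 / (k + 2.0)
        lam = [(1.0 - gamma) * l for l in lam]
        lam[i_star] += gamma
    return lam

def barycenter(refs, lam):
    """Normalized weighted geometric mean of discrete distributions."""
    unnorm = [
        math.exp(sum(l * math.log(r[x]) for l, r in zip(lam, refs)))
        for x in range(len(refs[0]))
    ]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def make_grad(refs, target):
    """Gradient of lam -> -E_target[log p_lam], with p_lam the barycenter."""
    log_refs = [[math.log(v) for v in r] for r in refs]
    target_term = [sum(t * lr for t, lr in zip(target, row)) for row in log_refs]

    def grad(lam):
        p = barycenter(refs, lam)
        # Derivative of the log normalizer minus the linear target term.
        return [
            sum(pi * lr for pi, lr in zip(p, row)) - tt
            for row, tt in zip(log_refs, target_term)
        ]

    return grad

refs = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
true_lam = [0.3, 0.7]
target = barycenter(refs, true_lam)  # target realizable by construction
lam_hat = frank_wolfe_simplex(make_grad(refs, target), dim=2)
```

Because the target is built as a barycenter with known weights, the iterates should approach `true_lam`; each iterate stays on the simplex since it is a convex combination of vertices.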
C.1.3 Basic lemmas
In this subsection, we first list some basic lemmas (Lemmas 2 to 5) that serve as essential tools in our proofs. All proofs can be found in [12].
Lemma 2.
Lemma 3.
Lemma 4.
Consider a sequence of functions and a function such that there exists a nondecreasing sequence such that as and for each , , then for each , uniformly over .
Lemma 5.
is a continuous function, and such that for each , , then as , uniformly over .
Next, we review and give two results related to the fusion algorithms.
Lemma 6.
For any fixed , , the KL barycenter of .
Proof.
In this proof, we use the following notations: suppose and , we denote as the transition density of the th auxiliary process from time to . Similarly, we denote as the transition density of the barycenter process from time to .
Let be fixed, then at each time ,
Expanding LHS and RHS at the same time, we get
Note that as , and , where the limit is the delta function. Therefore, from the compactness assumption and dominated convergence theorem,
Therefore,
Since is a density function, then after normalization
which is the solution of KL barycenter problem with reference measures .
Next we give the proof of Proposition 1.
Proof.
Recall that the objective function for is
(19) |
We note that the first term is linear in , so to show convexity, it is enough to show the second term is convex in . If we denote for each and as the uniform distribution on , then
where and is the Lebesgue measure of . Since the logarithm of a moment generating function is convex, the second term in Equation (19) is convex in .
Remark 2.
In theory, the first order condition of the convex optimization problem (9) is
In practice, each is replaced by the estimated auxiliary densities, and the second term is computed independently of the target data . However, the implementation is extremely hard since the numerical integration of the second term may have a large error that is hard to control.
C.2 Proof of Theorem 3
Before the proof of the sample complexity of the whole algorithm, we first prove a lemma about the auxiliary score estimation errors. The proof is adapted from Chen et al. [12].
Lemma 7.
Remark 3.
To interpret the result, suppose and , then for fixed , if we choose and , and hiding the logarithmic factors, then with , . In particular, if we want to choose the sampling error , it suffices to have .
Proof.
We denote the laws of process (16) and (8) as and , respectively. For simplicity of the proof, we define a fictitious diffusion satisfying the SDE with :
(20) |
since in practice, it is always convenient to use Gaussian as a prior. We denote law of process (20) as .
We also denote the score estimators of the process (17) as . As before, we consider only one stochastic process in order to use Girsanov’s theorem.
For , we have the discretization error with
From Lemma 16 in Chen et al. [12], we have the bound for the second term since ,
Moreover, from -Lipschitz condition,
Hence,
From Lemma 2 and Lemma 3, we have
Therefore,
Next, we claim that
(21) |
Then from triangle inequality, Pinsker’s inequality, and data processing inequality,
Hence it suffices to prove Equation (21). We will use a localization argument and apply Girsanov’s theorem. The notations are the same as in Theorem 6.
Let , , where is an -Brownian motion and for ,
Recall that
Since is a local martingale, then there exists a non-decreasing sequence of stopping times such that is a true martingale. Note that , where , therefore
Applying Theorem 6 to , we have that under the measure , there exists a Brownian motion such that for all ,
Since under we have almost surely
which also holds -almost surely since . Therefore, -almost surely, and
In other words, is the law of the solution of the above SDE. Plugging in the Radon-Nikodym derivatives, we get
since is a martingale and is a bounded stopping time (apply the optional sampling theorem).
Now consider a coupling of , : a sequence of stochastic processes over the same probability space, a stochastic process and a single Brownian motion over that space such that almost surely, ,
and
Hence law of is and law of is . The existence of such coupling is shown in Chen et al. [12].
Fix , define the map such that
Since for each , , then from Lemma 4, we have almost surely uniformly over , which implies that weakly.
Before the proof, we introduce some notations that will only be used for the proof of Theorem 3. Recall that the vanilla fusion method requires two layers of approximation before running the Frank-Wolfe method: we use target samples to estimate an expectation and we also estimate the densities of auxiliaries. As a notation, we denote as the distribution of the generated sample by vanilla fusion, which is in Section 4. is the weight computed with target samples, denotes the barycenter of with the weight , and denotes the barycenter of with the weight , where is the collection of estimates of auxiliary densities. Note that in Section 4.
Proof.
From triangle inequality, we have
where represents the error when computing using the Frank-Wolfe method, by assumption, and is the error from auxiliary score estimations, which is bounded by Lemma 7.
Therefore it only remains to bound . From Pinsker’s inequality,
hence it is enough to bound . From the compactness assumption, we note that the objective function of problem (9) is -smooth for some constant . Since the simplex in real space is convex, we denote the diameter of the constraint set as .
We denote as the weight computed after iterations with target samples, then we claim that for , and , with probability at least ,
(22) |
We will use an induction argument to show Equation (22). The main estimate is based on the smoothness of and the compactness of the constraint set. Let , then from Hoeffding’s inequality, with probability at least ,
By rearranging the terms, we get
(23) |
Now we begin the induction argument. If , then Equation (23) becomes
which is Equation (22), hence the base case is shown. Now suppose there exists such that Equation (22) holds, then from Equation (23)
since . Hence Equation (22) is proved and if we let , then
Therefore, from Pinsker’s inequality, with probability at least ,
C.3 Proof of Theorem 4
Before the proof, we define notations that will be used in this proof. denotes the output distribution of Algorithm 1, which is in Section 4. For a fixed small , in the training phase of ScoreFusion, we have the forward process for ,
(24) |
We learn an optimal weight by solving problem (10). We denote the marginal distribution of process (24) at time for fixed as . Even though in practice we do not use the backward process of process (24), the following two versions of backward processes will help in the proof of Theorem 4: for with , and fixed ,
(25) |
and for ,
(26) |
where . Process (26) is the time-discretized version of process (25) without the initialization error (since ). We denote the law of process (25) and (26) as and , respectively. For fixed , we call and .
Proof.
From triangle inequality, we have
From Lemma 7, we bound the last term
To bound the first term, we use the chain rule of KL divergence, Girsanov’s theorem, and an approximation argument similar to that in Section C.2 to get
where represents the approximation error and represents the excess risk. Therefore, from McDiarmid’s inequality, for , with probability at least ,
since and is small.
Finally, we need to give a bound on . The intuition is that from continuity of a diffusion process, when is small, then and are similar. Therefore, the approximation error of the linear regression should be small, given Assumption 3.
Fix , then since has a Lipschitz density and by the compactness assumption, the loss is
Therefore, from Pinsker’s inequality, with probability at least ,
which finishes the proof.
Appendix D Experiment details
D.1 Training and architecture details
To standardize comparison, the baseline and the auxiliary score models are parametrized by the exact same UNet architecture; the only difference between a baseline and an auxiliary is the amount of training data they have access to. The Python classes in our supplementary codebase, model_1D.ScoreNet and model_EMNIST.ScoreNet, are both modified from the ScoreNet class given in the GitHub repository of Song et al. [47]. One caveat is that to accommodate the one-dimensional data in Section 5.1, we changed the stride and kernel size of the convolutional layers in model_1D.ScoreNet to be . The one-dimensional UNet has trainable parameters; the EMNIST UNet triples the trainable parameter count to million. ScoreFusion models have only trainable parameters, where is the number of auxiliary scores.
We follow the standard machine learning convention of splitting each dataset into train, validation, and test sets with stratified sampling to ensure class balance. The ratio of training data to validation data is . We use the ground truth digit labels only for data-splitting, hiding them from the model during training. Training runs taking more than an hour were run on two NVIDIA A40 GPUs in a computing cluster, while lightweight tasks were run on Google Colab using an A100 GPU.
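To make the parameter count concrete, the sketch below shows one way a fused score could be assembled from frozen auxiliary scores with one trainable scalar per auxiliary. It is a hypothetical illustration, not the supplementary codebase: the class name `FusedScore`, the softmax parameterization of the simplex weights, and the toy Gaussian scores are all assumptions, and plain Python callables stand in for the UNet score networks.

```python
import math

def softmax(logits):
    """Map unconstrained logits to weights on the probability simplex."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

class FusedScore:
    """Hypothetical wrapper: frozen auxiliary scores plus k trainable logits."""

    def __init__(self, aux_scores):
        self.aux_scores = list(aux_scores)          # frozen callables s_i(x, t)
        self.logits = [0.0] * len(self.aux_scores)  # the only trainable values

    def weights(self):
        return softmax(self.logits)

    def __call__(self, x, t):
        # Weighted sum of the auxiliary scores at the same (x, t).
        return sum(w * s(x, t) for w, s in zip(self.weights(), self.aux_scores))

# Toy auxiliaries: scores of N(m, 1) laws, which ignore t in this example.
aux = [lambda x, t, m=m: -(x - m) for m in (-1.0, 0.0, 3.0)]
fused = FusedScore(aux)
fused.logits = [0.0, math.log(2.0), math.log(2.0)]  # weights 0.2, 0.4, 0.4
value = fused(1.0, 0.5)
```

In this toy case the fused score is the score of a unit-variance Gaussian centered at the weighted mean of the auxiliary means, so it vanishes at x = 1.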
Model checkpoints corresponding to all our experiments, both for the pre-trained auxiliary score models and the baseline models, can be found in the subdirectory ckpt in the .zip file.
D.2 Section 5.1 supplementary data
Due to space limits, we cannot fit all the data columns into Table 1. We attach in Table 4 the complete data table for 1-Wasserstein distances from each learned distribution to the ground truth distribution when the training size varies. The standard error is calculated from the 1-Wasserstein distance of batch-pairs of random samples drawn independently from the ground truth and a trained generative model. We note that there exists randomness in fitting as a result of stochastic gradient descent.
Model | |||
Baseline | |||
ScoreFusion | |||
of ScoreFusion | |||
Model | |||
Baseline | |||
ScoreFusion | |||
of ScoreFusion | |||
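The evaluation metric described above can be sketched as follows. For two one-dimensional samples of equal size, the empirical 1-Wasserstein distance equals the mean absolute difference of the sorted samples. The number of batch pairs, the batch size, and the two Gaussian samplers below are hypothetical placeholders for the ground truth and a trained model, not the settings used in the experiments.

```python
import math
import random

def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between equal-size 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("this shortcut needs samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def mean_and_standard_error(values):
    """Sample mean and standard error of the mean."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)

rng = random.Random(0)
distances = []
for _ in range(20):  # 20 independent batch pairs (hypothetical count)
    truth = [rng.gauss(0.0, 1.0) for _ in range(500)]  # stand-in ground truth
    model = [rng.gauss(0.5, 1.0) for _ in range(500)]  # stand-in model samples
    distances.append(wasserstein_1d(truth, model))
w1_mean, w1_se = mean_and_standard_error(distances)
```

Here the two stand-in laws differ by a shift of 0.5, so the population 1-Wasserstein distance is 0.5 and `w1_mean` should land close to it.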
Additional histograms of the distributions learned by ScoreFusion versus the baseline are attached:
![Refer to caption](extracted/5697135/images/1D/1d_32_comp1.png)
![Refer to caption](extracted/5697135/images/1D/1d_64_comp1.png)
![Refer to caption](extracted/5697135/images/1D/1d_128_comp1.png)
![Refer to caption](extracted/5697135/images/1D/1d_256_comp1.png)
![Refer to caption](extracted/5697135/images/1D/1d_512_comp1.png)
![Refer to caption](extracted/5697135/images/1D/1d_1024_comp1.png)
D.3 Section 5.2 supplementary data
We also provide supplementary data for the experiments on handwritten EMNIST digits. Table 5 gives the empirical distribution of the digits sampled unconditionally from the auxiliary scores.
| Auxiliary Score | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.1% | 0.1% | 0.6% | 0.6% | 1.1% | 0.3% | 0.0% | 18.7% | 0.2% | 78.2% |
| 2 | 0.1% | 0.1% | 0.3% | 0.8% | 1.1% | 0.5% | 0.0% | 41.1% | 0.2% | 55.8% |
| 3 | 0.0% | 0.2% | 0.7% | 0.7% | 1.2% | 0.8% | 0.0% | 72.1% | 0.6% | 23.7% |
| 4 | 0.1% | 0.5% | 0.7% | 0.5% | 0.9% | 0.4% | 0.1% | 87.9% | 0.3% | 8.6% |
| Target Distribution | | | | | | | | 60% | | 40% |
| Digit | True | Baseline | Fusion | Baseline | Fusion | Baseline | Fusion | Baseline | Fusion |
|---|---|---|---|---|---|---|---|---|---|
| 7 | 60% | 47.9% | 55.6% | 57.9% | 55.5% | 66.8% | 57.5% | 64.8% | 58.2% |
| 9 | 40% | 10.3% | 39.4% | 12.8% | 41.7% | 23.8% | 38.0% | 28.3% | 38.7% |
| Others | 0 | 41.8% | 5.0% | 29.3% | 2.8% | 9.4% | 4.5% | 6.9% | 3.1% |
| Digit | True | Baseline | Fusion | Baseline | Fusion | Baseline | Fusion |
|---|---|---|---|---|---|---|---|
| 7 | 60% | 65.5% | 55.6% | 66.7% | 59.8% | 67.4% | 59.7% |
| 9 | 40% | 26.7% | 39.8% | 27.9% | 36.7% | 29.0% | 37.4% |
| Others | 0 | 7.8% | 3.6% | 5.4% | 3.5% | 3.6% | 2.9% |
0.199 | 0.187 | 0.182 | 0.181 | 0.167 | 0.183 | 0.176 | |
0.305 | 0.326 | 0.328 | 0.319 | 0.345 | 0.311 | 0.310 | |
0.279 | 0.267 | 0.284 | 0.285 | 0.319 | 0.294 | 0.295 | |
0.217 | 0.220 | 0.206 | 0.216 | 0.170 | 0.213 | 0.220 | |