ScoreFusion: Fusing Score-based Generative Models via Kullback–Leibler Barycenters

Hao Liu (Stanford University, CA 94305, US), Junze (Tony) Ye (Stanford University, CA 94305, US), Jose Blanchet (Stanford University, CA 94305, US), Nian Si (The University of Chicago, IL 60637, US)

(June 28, 2024)

Abstract

We study the problem of fusing pre-trained (auxiliary) generative models to enhance the training of a target generative model. We propose using KL-divergence weighted barycenters as an optimal fusion mechanism, in which the barycenter weights are optimally trained to minimize a suitable loss for the target population. While computing the optimal KL-barycenter weights can be challenging, we demonstrate that this process can be efficiently executed using diffusion score training when the auxiliary generative models are also trained based on diffusion score methods. Moreover, we show that our fusion method has a dimension-free sample complexity in total variation distance provided that the auxiliary models are well fitted for their own task and the auxiliary tasks combined capture the target well. The main takeaway of our method is that if the auxiliary models are well-trained and can borrow features from each other that are present in the target, our fusion method significantly improves the training of generative models. We provide a concise computational implementation of the fusion algorithm, and validate its efficiency in the low-data regime with numerical experiments involving mixture models and image datasets.

1 Introduction

In recent advancements within the field of generative models, diffusion models [47, 24, 44] have emerged as a potent framework for synthesizing high-quality and diverse outputs across domains such as imagery, audio, and textual content [23, 4, 39, 37, 5, 57]. Successful commercial examples include DALL·E [38], Stable Diffusion [40], and Imagen [41]. The underlying mechanism of diffusion models involves progressively adding noise to a data sample until it approximates a Gaussian distribution, followed by a learned reverse process that reconstructs the original data by gradually denoising it.

Diffusion models rely on large datasets of high-dimensional data to accurately model the complex distributions needed for tasks like image generation and data augmentation [42, 55, 28, 43]. Without sufficient training data, diffusion models struggle to produce high-quality, diverse outputs and can overfit to the limited data they have been trained on [55, 59, 58].

Figure 1: Quality of a diffusion model deteriorates noticeably as $n$ , the training data size, decreases.

However, in practice, data scarcity can hinder the performance of generative models, especially in domains where data is limited due to high costs, privacy concerns, and proprietary restrictions by companies treating their data as a competitive advantage. These challenges mean that even as the demand for powerful generative models grows, the scarcity of usable data can significantly limit their development and effectiveness. To demonstrate this phenomenon, Figure 1 shows the generative performance on digit images for different training sample sizes. We observe that the quality of a diffusion model deteriorates noticeably as $n$ , the training data size, decreases.

To address the issue of data scarcity, researchers and practitioners often utilize the idea of transfer learning [34, 56, 51, 60, 49]. Transfer learning is a technique in machine learning where a model developed for one task is reused as the starting point for a model on another task. This approach allows a model trained on large and common datasets to be adapted to a different, but related, problem or dataset with less data available. Many recent works develop transfer learning algorithms to finetune the diffusion models and achieve empirical success in areas such as image generation [52, 31, 33].

In this paper, we develop a fusion method for diffusion models. Specifically, our goal is to build a generative diffusion model for a target distribution where data availability is limited, using the assistance of multiple pretrained diffusion models. These pretrained models have been trained on several common datasets, allowing them to capture a broad range of features and patterns. By leveraging these pretrained models, we aim to enhance the performance of our diffusion model on the target distribution, despite the scarcity of training data. The difference from transfer learning is that transfer learning retrains the parameters of the diffusion models, whereas our method freezes the neural network weights and creates a new neural network with an extra linear layer.

Our method is based on fusing diffusion processes through the computation of an optimal barycenter. Given a set of weights, a barycenter is typically defined as a probability measure that minimizes the weighted sum of distances (or divergences) to a set of reference measures. The most common barycenter problem among distributions is the Wasserstein barycenter [1, 16, 36, 45, 13, 26]. However, computing Wasserstein barycenters is generally challenging [35, 7, 48, 22]. Therefore, we utilize a Kullback–Leibler (KL) barycenter [14, 6], which has an analytical solution for any choice of weights (both when the reference measures are supported in Euclidean spaces and when they can be represented as diffusion processes). In turn, the weights of our KL barycenter formulation are optimized according to a suitable class of training losses, as we will explain in the sequel.

Our goal is to find the optimal weights to approximate the target dataset. We formulate two convex optimization problems, leading to two fusion methods. The first method is intuitive but requires an estimate of the reference densities and numerical integration, which is usually challenging in high-dimensional contexts. The second method is computationally cheaper since it becomes a linear regression problem after being embedded into the diffusion space, and it still achieves good theoretical and empirical performance.

The main contributions of our work are summarized as follows:

1) We demonstrate that KL barycenter fusion of auxiliary models can be efficiently implemented when the auxiliary models are trained based on score diffusion. In this case, the optimal score is linear in the auxiliary scores.

2) We provide generalization bounds which split the error into four components. The first is the error between the optimal KL barycenter and the target at time zero (whose direct implementation is difficult due to numerical integration). The second corresponds to the sample complexity $O(n^{-1/4})$ , and the third is the approximation error introduced by the diffusion embedding (which facilitates the training). The fourth reflects the quality of the auxiliary score estimations.

3) We numerically demonstrate the performance of our proposed fusion method. Specifically, we found that our method outperforms the basic diffusion method when the training sample size is small.

The rest of the paper is organized as follows. Section 2 reviews the background of KL barycenter and diffusion models. Section 3 details our proposed fusion methods. Section 4 provides convergence results for our methods. Section 5 presents numerical results. Finally, Section 6 concludes the paper with future directions. All proofs are relegated to the appendix.

2 Preliminaries and setup

2.1 Notations

The following notation will be used. Given two functions $f,g:D\to\mathbb{R}$ , we say $f\lesssim g$ if there exists a constant $C>0$ such that for all $x\in D$ , $f(x)\leq Cg(x)$ . When $x\to a$ , where $a\in[-\infty,\infty]$ , we say $f(x)=\mathcal{O}(g(x))$ if there exists a constant $C>0$ such that for all $x$ close enough to $a$ , $|f(x)|\leq Cg(x)$ . In asymptotic cases, we use $\mathcal{O}$ and $\lesssim$ interchangeably. We write $f\sim g$ if and only if $f\lesssim g$ and $g\lesssim f$ . $C([0,T]:\mathbb{R}^{d})$ is the space of all continuous functions from $[0,T]$ to $\mathbb{R}^{d}$ equipped with the uniform topology. In this paper, we consider a Polish space $S$ , which could be $\mathbb{R}^{d}$ or $C([0,T]:\mathbb{R}^{d})$ . For a Polish space $S$ equipped with Borel $\sigma$ -algebra $\mathcal{B}(S)$ , we denote by $\mathcal{P}(S)$ the space of probability measures on $S$ equipped with the topology of weak convergence. In a normed vector space $\left(X,\left\lVert.\right\rVert\right)$ , $\left\lVert.\right\rVert$ denotes the corresponding norm. $\left\lVert.\right\rVert_{p}$ denotes the standard $L^{p}$ norm. Given a matrix $\boldsymbol{A}$ , we use $\boldsymbol{A}^{T}$ to denote its transpose. We denote $\boldsymbol{\lambda}=\left(\lambda_{1},\ldots,\lambda_{k}\right)^{T}\in[0,1]^{k}$ . We use $\Delta_{k}$ to denote the $k$ -dimensional probability simplex, i.e., $\Delta_{k}=\{\boldsymbol{\lambda}\in[0,1]^{k}:\sum_{i=1}^{k}\lambda_{i}=1\}$ .

2.2 Barycenter problems and Kullback–Leibler divergence

Given a set of probability measures $P_{1},\ldots,P_{k}\in\mathcal{P}(S)$ on a Polish space $S$ and a measure of dissimilarity (e.g. a metric or a divergence) between two elements in $\mathcal{P}(S)$ , $D$ , we define the barycenter problem with respect to $D$ and weight $\boldsymbol{\lambda}$ as the optimization problem

\min_{\mu\in\mathcal{P}(S)}\sum_{i=1}^{k}\lambda_{i}D\left(\mu,P_{i}\right)\quad\text{s.t. }\boldsymbol{\lambda}\in\Delta_{k},

where $P_{1},\ldots,P_{k}$ are called the reference measures. With a fixed choice of weight and reference measures, the solution of the barycenter problem is denoted as $\mu_{\boldsymbol{\lambda}}$ .

Recall the definition of the Kullback–Leibler (KL) divergence: suppose $P,Q\in\mathcal{P}(S)$ , then $D_{\text{KL}}\left(P\parallel Q\right)=\int\log\left(\frac{dP}{dQ}\right)\,dP$ if $P\ll Q$ and $D_{\text{KL}}\left(P\parallel Q\right)=\infty$ otherwise, where $\frac{dP}{dQ}$ is the Radon–Nikodym derivative of $P$ with respect to $Q$ . In particular, if $S=\mathbb{R}^{d}$ and $P$ and $Q$ are absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^{d}$ , with densities $p$ and $q$ respectively, then $D_{\text{KL}}\left(P\parallel Q\right)=\int p(x)\log\left(\frac{p(x)}{q(x)}\right)dx.$ If $D$ is the KL divergence, we recover the KL barycenter problem [14]. In fact, for any Polish space $S$ , the KL barycenter problem is strictly convex and hence has at most one solution.
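To make the density form of the KL divergence concrete, the following minimal sketch compares a Riemann-sum approximation of $\int p\log(p/q)\,dx$ with the standard closed form for two univariate Gaussians. The function names, the integration range, and the grid size are illustrative assumptions and are not part of the paper.

```python
import math

def log_normal_pdf(x, mean, std):
    """Log-density of N(mean, std^2) at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def kl_numeric(p_params, q_params, lo=-12.0, hi=12.0, n=2400):
    """Riemann-sum approximation of the integral of p log(p/q) on [lo, hi]."""
    dx = (hi - lo) / n
    total = 0.0
    for j in range(n + 1):
        x = lo + j * dx
        log_p = log_normal_pdf(x, *p_params)
        log_q = log_normal_pdf(x, *q_params)
        total += math.exp(log_p) * (log_p - log_q) * dx
    return total

def kl_gaussian(p_params, q_params):
    """Closed-form KL divergence between two univariate Gaussians."""
    (m1, s1), (m2, s2) = p_params, q_params
    return math.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2.0 * s2**2) - 0.5

# Toy pair: P = N(0, 1), Q = N(1, 4); parameters are (mean, std).
kl_num = kl_numeric((0.0, 1.0), (1.0, 2.0))
kl_exact = kl_gaussian((0.0, 1.0), (1.0, 2.0))
```

The two values agree closely on this toy pair, and swapping the arguments gives a different number, which illustrates that the KL divergence is not symmetric.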

2.3 Background on diffusion models

Our score fusion method depends on the generative diffusion model driven by stochastic differential equations (SDEs) developed in Song et al. [47], Ho et al. [24], Sohl-Dickstein et al. [44]. In this section, we review the background of generative diffusion models.

2.3.1 Forward process: adding noise

We begin with the unsupervised learning setup. Given an unlabeled dataset drawn i.i.d. from a distribution $p_{0}$ , the forward diffusion process is defined in differential form as

dX(t)=f(t,X(t))dt+g(t)dW(t),X(0)\sim p_{0},

(1)

where $f:\mathbb{R}_{+}\times\mathbb{R}^{d}\to\mathbb{R}^{d}$ is a vector-valued function, $g:\mathbb{R}_{+}\to\mathbb{R}$ is a scalar function, and $W(t)$ denotes a standard $d$ -dimensional Brownian motion. From now on, we assume the existence and denote by $p_{t}(x)$ the marginal density function of $X(t)$ , and let $p_{t|s}\left(X(t)|X(s)\right)$ be the transition kernel from $X(s)$ to $X(t)$ , for $0\leq s\leq t\leq T<\infty$ , where $T$ is the terminal time for the forward process (time horizon). If $f(t,x)=-ax$ and $g(t)=\sigma$ with $a>0$ and $\sigma>0$ , then Equation (1) becomes a linear SDE with Gaussian transition kernels

dX(t)=-aX(t)dt+\sigma dW(t),X(0)\sim p_{0},

(2)

which is an Ornstein–Uhlenbeck (OU) process. If $T$ is large enough, then $p_{T}$ is close to $\pi\sim\mathcal{N}\left(\textbf{0},\frac{\sigma^{2}}{2a}\textbf{I}\right)$ , a Gaussian distribution with mean 0 (vector) and covariance matrix $\frac{\sigma^{2}}{2a}\textbf{I}$ . The forward process can be viewed as the following dynamic: given the data distribution, we gradually add noise to it such that it becomes a known distribution in the long run.
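The Gaussian transition kernel of the OU process (2) can be sampled exactly: given $X(0)=x_{0}$ , $X(t)$ is Gaussian with mean $x_{0}e^{-at}$ and covariance $\frac{\sigma^{2}}{2a}\left(1-e^{-2at}\right)\textbf{I}$ . The sketch below is only an illustration of this standard fact; the function name, the default parameters $a=1$ , $\sigma=\sqrt{2}$ , and the starting point are assumptions chosen so that the stationary variance equals one.

```python
import math
import random

def ou_forward_sample(x0, t, a=1.0, sigma=math.sqrt(2.0), rng=random):
    """Draw X(t) given X(0) = x0 for dX = -a X dt + sigma dW, using the exact Gaussian kernel."""
    mean_scale = math.exp(-a * t)
    std = math.sqrt(sigma**2 / (2.0 * a) * (1.0 - math.exp(-2.0 * a * t)))
    return [mean_scale * xi + std * rng.gauss(0.0, 1.0) for xi in x0]

# After a long horizon the law of X(t) should be close to pi = N(0, sigma^2 / (2a)),
# regardless of the (assumed) starting point x0 = 5.
rng = random.Random(0)
samples = [ou_forward_sample([5.0], t=10.0, rng=rng)[0] for _ in range(20000)]
emp_mean = sum(samples) / len(samples)
emp_var = sum((s - emp_mean) ** 2 for s in samples) / len(samples)
```

With these assumed parameters, the empirical mean and variance at $t=10$ are close to $0$ and $1$ , the moments of $\pi$ .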

2.3.2 Backward process: denoising

If we reverse a diffusion process in time, then under some mild conditions (see, for example, Cattiaux et al. [10], Föllmer [20]) which are satisfied for all processes under consideration in this work, we still get a diffusion process. To be more precise, we want to have a process $\tilde{X}$ such that for $t\in[0,T]$ , $\tilde{X}(t)=X(T-t)$ . From the Fokker–Planck equation and the log trick [3], the corresponding reverse process for Process (1) is

d\tilde{X}(t)=\left(-f(T-t,\tilde{X}(t))+g^{2}(T-t)\nabla\log p_{T-t}\left(\tilde{X}(t)\right)\right)dt+g(T-t)dW(t),\quad\tilde{X}(0)\sim p_{T},

(3)

where $\nabla$ denotes the derivative with respect to the space variable $x$ . We call $\nabla\log p_{t}(x)$ the (Stein) score function. If the forward process is an OU process, then the reverse process is

d\tilde{X}(t)=\left(a\tilde{X}(t)+\sigma^{2}\nabla\log p_{T-t}\left(\tilde{X}(t)\right)\right)dt+\sigma dW(t),\quad\tilde{X}(0)\sim p_{T}.

(4)

If the backward SDE can be simulated (which is typically done via the Euler–Maruyama method, see details in Appendix A.2), we can generate samples from the distribution $p_{0}$ . We can view simulating the backward SDE as the denoising step from pure noise to the ground-truth distribution.
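As a hedged illustration of simulating the reverse OU process (4) with the Euler–Maruyama method, the sketch below uses a one-dimensional Gaussian $p_{0}$ , for which the marginals $p_{t}$ and their scores are available in closed form, so no neural network is needed. All constants and function names are assumptions made for this toy example; it is not the scheme of Appendix A.2.

```python
import math
import random

A, SIGMA, T = 1.0, math.sqrt(2.0), 5.0   # assumed OU parameters; stationary law is N(0, 1)
M0, S0 = 2.0, 0.5                        # assumed toy data distribution p_0 = N(M0, S0^2)

def marginal(t):
    """Mean and variance of p_t when p_0 is Gaussian and the forward SDE is (2)."""
    decay = math.exp(-A * t)
    var = S0**2 * decay**2 + SIGMA**2 / (2 * A) * (1 - decay**2)
    return M0 * decay, var

def score(t, x):
    """Exact score of the Gaussian marginal p_t."""
    mean, var = marginal(t)
    return -(x - mean) / var

def reverse_sample(n_steps, rng):
    """Euler-Maruyama discretization of the reverse SDE (4), started from pi instead of p_T."""
    h = T / n_steps
    x = rng.gauss(0.0, math.sqrt(SIGMA**2 / (2 * A)))
    for l in range(n_steps):
        drift = A * x + SIGMA**2 * score(T - l * h, x)
        x += drift * h + SIGMA * math.sqrt(h) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(1)
draws = [reverse_sample(400, rng) for _ in range(2000)]
mean_hat = sum(draws) / len(draws)
var_hat = sum((d - mean_hat) ** 2 for d in draws) / len(draws)
```

In this toy setting the generated samples have mean close to $2$ and variance close to $0.25$ , up to discretization and Monte Carlo error.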

2.3.3 Score estimation

The only remaining task is score estimation for $\nabla\log p_{t}(x)$ . There are many ways to achieve this, and some of them are equivalent up to constants that are independent of the training parameters. In this paper, we choose the time-dependent score matching loss used in Song et al. [46]:

\mathcal{L}\left(\theta;\gamma\right):=\mathbb{E}_{t\sim\mathcal{U}[0,T]}\left[\gamma(t)\mathbb{E}_{X(t)\sim p_{t}}\left[\left\lVert s_{t,\theta}\left(X(t)\right)-\nabla\log p_{t}(X(t))\right\rVert_{2}^{2}\right]\right],

(5)

where $\gamma:[0,T]\to\mathbb{R}_{+}$ is a weighting function, and $s_{t,\theta}:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is a score estimator, usually chosen as a neural network. Score estimation is then done by minimizing the empirical loss using SGD [29].

There are many ways to measure the goodness of the generative model. Suppose $D(.,.)$ is a measure of dissimilarity in $\mathcal{P}(S)$ , then we say $D(\mu,\hat{\mu})$ is a generalization error with respect to $D$ , where $\mu$ is the target distribution and $\hat{\mu}$ is the distribution of the generated samples.

Recently, several analyses of the generative properties of diffusion models have been carried out; however, even in the case of compactly supported target distributions and sufficient smoothness regularity, the basic diffusion model encounters the curse of dimensionality. Therefore, a large amount of target data is needed to generate high quality samples. For a detailed discussion, see Appendix A.3.

3 KL barycenters and fusion methods

In Section 3.1, we propose and analytically solve two types of KL barycenter problems. These solutions lead to our fusion methods, which are detailed in Section 3.2.

3.1 KL barycenter problems

Theorem 1.

Suppose $\{\mu_{1},\ldots,\mu_{k}\}\subset\mathcal{P}(\mathbb{R}^{d})$ and for each $i=1,\ldots,k,$ $\mu_{i}$ is absolutely continuous with respect to the Lebesgue measure, with densities $p_{1},\ldots,p_{k}$ respectively. Then, the distribution-level KL barycenter $\mu_{\boldsymbol{\lambda}}$ is unique with density $p_{\boldsymbol{\lambda}}(x)=\frac{\prod_{i=1}^{k}p_{i}(x)^{\lambda_{i}}}{\int_{\mathbb{R}^{d}}\prod_{i=1}^{k}p_{i}(x)^{\lambda_{i}}dx}.$
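A quick way to see Theorem 1 at work is the univariate Gaussian case, where the normalized weighted geometric mean is again Gaussian, with precision $\sum_{i}\lambda_{i}/s_{i}^{2}$ and a precision-weighted mean. The sketch below checks this on a grid; the grid, the two reference Gaussians, the weights, and the function names are illustrative assumptions.

```python
import math

def log_gaussian_pdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def kl_barycenter_density(xs, weights, params):
    """Normalized weighted geometric mean of Gaussian densities on a uniform grid xs."""
    dx = xs[1] - xs[0]
    unnorm = [
        math.exp(sum(w * log_gaussian_pdf(x, m, s) for w, (m, s) in zip(weights, params)))
        for x in xs
    ]
    z = sum(unnorm) * dx   # Riemann-sum approximation of the normalizing integral
    return [u / z for u in unnorm]

def gaussian_barycenter(weights, params):
    """Closed-form (mean, std) of the KL barycenter when every reference is Gaussian."""
    precision = sum(w / s**2 for w, (_, s) in zip(weights, params))
    mean = sum(w * m / s**2 for w, (m, s) in zip(weights, params)) / precision
    return mean, 1.0 / math.sqrt(precision)

xs = [-10.0 + 0.01 * i for i in range(2001)]
weights = [0.3, 0.7]
params = [(-1.0, 1.0), (2.0, 0.5)]   # assumed reference laws N(-1, 1) and N(2, 0.25)
grid_density = kl_barycenter_density(xs, weights, params)
bary_mean, bary_std = gaussian_barycenter(weights, params)
max_gap = max(
    abs(d - math.exp(log_gaussian_pdf(x, bary_mean, bary_std)))
    for x, d in zip(xs, grid_density)
)
```

The grid-based density and the closed-form Gaussian coincide up to numerical integration error, which suggests why the distribution-level barycenter is tractable whenever the reference densities themselves are available.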

Our second barycenter problem is performed when the sample space is the continuous-function space, i.e., $S=C([0,T]:\mathbb{R}^{d})$ . This context yields a process-level KL barycenter. When the underlying measures are represented by SDEs, we offer a closed-form solution for the process-level KL barycenter in Theorem 2.

Theorem 2.

Suppose for each $i=1,2,\ldots,k$ , the $i$ -th SDE has the form

dX_{i}(t)=\left[c\left(t,X_{i}(t)\right)+\sigma(t)^{2}a_{i}\left(t,X_{i}(t)\right)\right]dt+\sigma(t)dW_{i}(t),\quad X_{i}(0)\sim\mu_{i},

and has a unique strong solution. The law of the solution of each SDE is denoted as $P_{i}\in\mathcal{P}(C([0,T]:\mathbb{R}^{d}))$ . We further assume that, for each $i=1,2,\ldots,k$ , $\mu_{i}$ is absolutely continuous with respect to the Lebesgue measure and $a_{i}$ is uniformly bounded. Then the process-level KL barycenter can be represented as the SDE

dX(t)=\left[c\left(t,X(t)\right)+\sigma(t)^{2}a\left(t,X(t)\right)\right]dt+\sigma(t)dW(t),\quad X(0)\sim\mu,

where $a(t,x)=\sum_{i=1}^{k}\lambda_{i}a_{i}(t,x)$ , $\mu$ is the distribution-level KL barycenter of reference measures $\mu_{1},\ldots,\mu_{k}$ , and $W$ is a standard Brownian motion.

In this paper, fusing $k$ distributions is viewed as computing a KL barycenter with optimized weights. This naturally connects to the idea of transfer learning. Given $k$ well-trained reference generative models, our fusion method optimizes the weights $\lambda_{1},\ldots,\lambda_{k}$ to approximate a target distribution.
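As a short worked step (an illustration added here, not part of the original argument), the linearity of the barycenter score at time zero follows directly from the density in Theorem 1. Taking logarithms and differentiating in $x$ , the normalizing constant drops out because it does not depend on $x$ :

\nabla\log p_{\boldsymbol{\lambda}}(x)=\nabla\left[\sum_{i=1}^{k}\lambda_{i}\log p_{i}(x)-\log\int_{\mathbb{R}^{d}}\prod_{i=1}^{k}p_{i}(y)^{\lambda_{i}}dy\right]=\sum_{i=1}^{k}\lambda_{i}\nabla\log p_{i}(x).

This holds wherever the densities are positive and differentiable, and it mirrors the drift identity $a(t,x)=\sum_{i=1}^{k}\lambda_{i}a_{i}(t,x)$ of Theorem 2.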

3.2 Fusion methods

Recall that in our task, we are given $k$ datasets with abundant samples, and our goal is to generate samples for a target dataset with limited available data. Therefore, in this section, we denote the target measure as $\nu$ and we assume that we are given $k$ reference diffusion generative models and they are able to generate samples from $k$ different reference measures $\mu_{1},\ldots,\mu_{k}$ , respectively. Specifically, each reference measure corresponds to an auxiliary backward diffusion process

d\tilde{X}_{i}(t)=\left(a\tilde{X}_{i}(t)+\sigma^{2}s^{i}_{T-t,\theta^{*}}\left(\tilde{X}_{i}(t)\right)\right)dt+\sigma dW_{i}(t),\quad\tilde{X}_{i}(0)\sim p^{i}_{T},

(6)

where $s^{i}_{T-t,\theta^{*}}$ is a well-trained score function for the $i$ -th reference measure. We introduce two fusion algorithms and related generalization error results.

In practice, the discretized version of the SDE (6) is used. Specifically, we employ a small time-discretization step $h$ and a total of $N$ time steps (hence $T=Nh$ ). Since $p_{T}^{i}$ is close to the Gaussian distribution $\pi$ , the SDE (6) is approximated by $\hat{X}_{i}(0)\sim\pi$ and

d\hat{X}_{i}(t)=\left(a\hat{X}_{i}(t)+\sigma^{2}s^{i}_{T-lh,\theta^{*}}\left(\hat{X}_{i}(lh)\right)\right)dt+\sigma dW(t),\quad t\in[lh,(l+1)h].

(7)

Then, given a weight $\boldsymbol{\lambda}$ , Theorem 2 implies that the corresponding process-level KL barycenter follows the SDE:

d\hat{Y}(t)=\left(a\hat{Y}(t)+\sigma^{2}\sum_{i=1}^{k}\lambda_{i}s^{i}_{T-lh,\theta^{*}}\left(\hat{Y}(lh)\right)\right)dt+\sigma dW(t),\quad t\in[lh,(l+1)h],\quad\hat{Y}(0)\sim\pi.

(8)

We denote the distribution of the terminal variable $\hat{Y}(T)$ as $\hat{p}_{\boldsymbol{\lambda}}$ , which will later serve as the distribution of the generated samples.

The key component of our method is to find an optimal $\boldsymbol{\lambda}^{*}$ such that $\hat{p}_{\boldsymbol{\lambda}^{*}}$ is as close to the target measure $\nu$ as possible. To achieve this goal, we propose two fusion methods that rely on two different optimization problems.

The first method directly optimizes on the probability measure defined on the Euclidean space, which is based on Theorem 1. Namely, we consider the following convex problem

\min_{\boldsymbol{\lambda}\in\Delta_{k}}\quad D_{\text{KL}}(\nu\parallel\mu_{\boldsymbol{\lambda}})=\min_{\boldsymbol{\lambda}\in\Delta_{k}}\quad\mathbb{E}_{\nu}\left[\log q(X)-\sum_{i=1}^{k}\lambda_{i}\log p_{i}(X)\right]+\log\left(\int\prod_{i=1}^{k}p_{i}(y)^{\lambda_{i}}dy\right),

(9)

where $p_{1},\ldots,p_{k}$ denote the densities of the reference measures and $q(x)$ denotes the density of the target distribution $\nu$ . We refer to this fusion method as vanilla fusion. Suppose we have an accurate estimation of the densities $p_{i}$ . We then use the Frank-Wolfe method to solve Problem (9) and get an optimal $\boldsymbol{\lambda}^{*}$ . In the Frank-Wolfe method, the gradient term can be approximated by sample mean estimators from target data $\nu$ (see Remark 2 in Appendix C.1.3). To generate samples, we plug $\boldsymbol{\lambda}^{*}$ into (8) and simulate the SDE.
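The following sketch illustrates how vanilla fusion could look in one dimension when the reference densities are known exactly. Differentiating the objective in (9) with respect to $\lambda_{i}$ gives $\mathbb{E}_{p_{\boldsymbol{\lambda}}}\left[\log p_{i}\right]-\mathbb{E}_{\nu}\left[\log p_{i}\right]$ ; the first term is computed here by grid integration and the second by a sample mean over target data. The Gaussian references, the grid, the step-size rule $2/(t+2)$ , and all names are assumptions for illustration; this is not the paper's implementation.

```python
import math
import random

PARAMS = [(-1.0, 1.0), (2.0, 0.5)]             # assumed auxiliary Gaussians (mean, std)
GRID = [-8.0 + 0.04 * i for i in range(401)]   # integration grid for the log-normalizer

def log_pdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

LOG_P_GRID = [[log_pdf(x, m, s) for x in GRID] for (m, s) in PARAMS]

def gradient(lam, target_mean_logp):
    """Gradient of objective (9): E_{p_lam}[log p_i] - E_nu[log p_i] for each i."""
    k = len(lam)
    unnorm = [math.exp(sum(lam[i] * LOG_P_GRID[i][j] for i in range(k)))
              for j in range(len(GRID))]
    z = sum(unnorm)
    bary = [sum(u * LOG_P_GRID[i][j] for j, u in enumerate(unnorm)) / z for i in range(k)]
    return [bary[i] - target_mean_logp[i] for i in range(k)]

def frank_wolfe(target_samples, n_iters=500):
    """Frank-Wolfe on the simplex: move toward the vertex with the smallest gradient entry."""
    k = len(PARAMS)
    target_mean_logp = [sum(log_pdf(x, m, s) for x in target_samples) / len(target_samples)
                        for (m, s) in PARAMS]
    lam = [1.0 / k] * k
    for t in range(n_iters):
        grad = gradient(lam, target_mean_logp)
        j = min(range(k), key=lambda i: grad[i])
        step = 2.0 / (t + 2.0)
        lam = [(1.0 - step) * l + (step if i == j else 0.0) for i, l in enumerate(lam)]
    return lam

# Toy target: the exact barycenter for lambda = (0.3, 0.7), which is Gaussian here.
true_lam = [0.3, 0.7]
prec = sum(w / s**2 for w, (_, s) in zip(true_lam, PARAMS))
mean = sum(w * m / s**2 for w, (m, s) in zip(true_lam, PARAMS)) / prec
rng = random.Random(0)
target = [rng.gauss(mean, 1.0 / math.sqrt(prec)) for _ in range(2000)]
lam_hat = frank_wolfe(target)
```

On this toy problem the recovered weights are close to $(0.3,0.7)$ . The grid integration step is exactly what becomes impractical in high dimensions.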

We note that a similar idea of fusing component distributions via a KL barycenter has been proposed in Claici et al. [14], which uses the average KL divergence as a metric to recover the mean-field approximation of the posterior distribution of the fused global model. Both methods solve a two-layer optimization problem: finding the barycenter and the optimal weights. Moreover, both methods introduce a convex optimization problem to help find optimizers. The difference is that vanilla fusion solves the barycenter problem first (since the barycenter is known almost analytically), so the main task is to find the optimal weights, while Claici et al. [14] finds both optimizers simultaneously and their convex problem is only a relaxation of the original target.

However, diffusion generative models usually cannot directly estimate the densities $p_{1},\ldots,p_{k}$ , so vanilla fusion is usually hard to apply directly to complicated high-dimensional distributions. We therefore propose a practical, process-level alternative called ScoreFusion. The numerical results in Section 5 were generated by employing Algorithm 1.

In our second method, we first build a forward process starting from the target dataset, according to (2). We denote this forward process as $\tilde{X}^{\nu}(t)$ and the corresponding density as $p_{t}^{\nu}(x)$ . Then, we modify the loss function (5) into a linear regression problem

\tilde{\mathcal{L}}\left(\boldsymbol{\lambda};\theta^{*},\gamma\right)=\mathbb{E}_{t\sim\mathcal{U}[0,\tilde{T}]}\left[\gamma(t)\left(\mathbb{E}_{X(t)\sim p^{\nu}_{t}}\left[\left\lVert\sum_{i=1}^{k}\left(\lambda_{i}s^{i}_{t,\theta^{*}}\left(X(t)\right)\right)-\nabla\log p^{\nu}_{t}(X(t))\right\rVert_{2}^{2}\right]\right)\right]

(10)

where we choose $\tilde{T}\ll T$ . The intuition behind the choice of $\tilde{T}$ is that we want to learn an optimal $\boldsymbol{\lambda}^{*}$ such that $p_{\boldsymbol{\lambda}^{*}}$ is close to the target $\nu$ . Therefore, when $\tilde{T}\ll T$ (the forward process has not injected much noise), the $\hat{\boldsymbol{\lambda}}$ obtained from the training is less affected by the noise. Theoretically, choosing $\tilde{T}=0$ is optimal, but this is hard to implement. Algorithm 1 with $\tilde{T}=0$ can be viewed as a variant of vanilla fusion since the learning is only performed at the distribution level ( $p_{0}$ ), and an extremely small $\tilde{T}$ causes numerical instability in practice, which makes sense given the numerical integration and density estimations needed in vanilla fusion. The optimization problem associated with our second method is $\min_{\boldsymbol{\lambda}\in\Delta_{k}}\tilde{\mathcal{L}}\left(\boldsymbol{\lambda};\theta^{*},\gamma\right).$ The details are in Algorithm 1.

Algorithm 1 ScoreFusion

Input: Training data $\mathcal{D}$ , pre-trained score functions $s^{1}_{t,\theta^{*}},\ldots,s^{k}_{t,\theta^{*}}$ , hyperparameter $\tilde{T}$ .
Output: Samples generated from a distribution $\hat{\nu}_{D}$ .

I. Training Phase
Randomly initialize non-negative $\lambda_{1},\ldots,\lambda_{k}$ s.t. $\sum\lambda_{i}=1$ .
repeat
    Run forward process $\tilde{X}^{\nu}(t)$ using a mini-batch from $\mathcal{D}$ .
    Evaluate the loss function (10) and back-propagate onto $\lambda_{1},\ldots,\lambda_{k}$ . $\triangleright$ The $\lambda_{i}$ 's are softmaxed to enforce the probability simplex constraint.
until converged. Save the optimal $\boldsymbol{\lambda}^{*}=\{\lambda^{*}_{1},\lambda^{*}_{2},\ldots,\lambda_{k}^{*}\}$ .

II. Sampling Phase
Set $s_{t,\boldsymbol{\lambda}^{*}}(\hat{Y}(t)):=\sum_{i=1}^{k}\lambda^{*}_{i}s_{t,\theta^{*}}^{i}(\hat{Y}(t))$ .
Simulate the backward SDE (8) with $s_{t,\boldsymbol{\lambda}^{*}}(\cdot)$ starting from a Gaussian prior and generate samples.
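The training phase of Algorithm 1 can be sketched on a one-dimensional toy problem as follows. This is a minimal sketch under several assumptions that are not taken from the paper: the auxiliary scores are exact Gaussian scores standing in for trained networks, the unknown target score in (10) is replaced by the standard denoising score matching target $\nabla\log p_{t|0}(X(t)|X(0))$ , the weighting $\gamma(t)$ is set to the conditional variance, times are drawn from $[t_{\min},\tilde{T}]$ with a small positive $t_{\min}$ , and the logits are initialized at zero rather than randomly. All names and constants are hypothetical.

```python
import math
import random

A, SIGMA2 = 1.0, 2.0              # assumed OU parameters in (2): a and sigma^2
AUX = [(-2.0, 0.5), (2.0, 0.5)]   # assumed auxiliary data laws N(mean, std^2)
T_MIN, T_TILDE = 0.05, 0.5        # training times t ~ U[T_MIN, T_TILDE]

def cond_var(t):
    """Variance of X(t) given X(0) under the OU forward process."""
    return SIGMA2 / (2.0 * A) * (1.0 - math.exp(-2.0 * A * t))

def aux_score(i, t, x):
    """Exact score of the i-th auxiliary marginal (stand-in for a trained network)."""
    mean, std = AUX[i]
    var_t = std**2 * math.exp(-2.0 * A * t) + cond_var(t)
    return -(x - mean * math.exp(-A * t)) / var_t

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def train_weights(data, rng, n_iters=1000, batch=64, lr=0.02):
    k = len(AUX)
    z = [0.0] * k                                   # logits; softmax keeps lambda on the simplex
    for _ in range(n_iters):
        lam = softmax(z)
        grad_lam = [0.0] * k
        for _ in range(batch):
            x0 = rng.choice(data)
            t = rng.uniform(T_MIN, T_TILDE)
            v = cond_var(t)
            eps = rng.gauss(0.0, 1.0)
            xt = x0 * math.exp(-A * t) + math.sqrt(v) * eps   # forward process sample
            target = -eps / math.sqrt(v)                      # denoising score matching target
            scores = [aux_score(i, t, xt) for i in range(k)]
            resid = sum(l * s for l, s in zip(lam, scores)) - target
            for i in range(k):
                grad_lam[i] += 2.0 * v * resid * scores[i] / batch   # gamma(t) = cond_var(t)
        avg = sum(l * g for l, g in zip(lam, grad_lam))
        # Chain rule through the softmax: dL/dz_j = lam_j * (g_j - sum_i lam_i g_i).
        z = [zi - lr * l * (g - avg) for zi, l, g in zip(z, lam, grad_lam)]
    return softmax(z)

rng = random.Random(0)
# Toy target: barycenter of the two auxiliaries with lambda = (0.6, 0.4), i.e. N(-0.4, 0.25).
data = [rng.gauss(0.6 * -2.0 + 0.4 * 2.0, 0.5) for _ in range(256)]
lam_hat = train_weights(data, rng)
```

Because the two assumed auxiliaries share the same variance, the target score is an exact convex combination of the auxiliary scores at every time $t$ , so the learned weights land near $(0.6,0.4)$ with only $256$ target samples. The sampling phase would then plug these weights into the backward SDE (8).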

4 Convergence results

This section details the convergence results for our proposed fusion methods. We focus on sample complexities, quantified by the necessary number of samples in the target dataset, in terms of total variation distance. We show that the sample complexities of our methods are dimension-free, given that the auxiliary processes are accurately fitted to their reference distributions and together offer adequate information for the target distribution. To begin with, we assume all distributions are compactly supported.

Assumption 1.

The target and reference distributions are all compactly supported in $\mathbb{K}\subset\mathbb{R}^{d}$ with absolutely continuous densities. We assume that their second moments are bounded by $M\in(0,\infty)$ .

Proposition 1.

Under Assumption 1, Problem (9) is convex in $\boldsymbol{\lambda}$ .

Proposition 1 implies that Problem (9) is easy to solve given that the reference densities can be estimated. We further require Assumption 2 below, which guarantees that each auxiliary process is accurately trained in the sense that the score function at each time step is well-fitted.

Assumption 2.

For each $i=1,2,\ldots,k$ and for all $t\in[0,T]$ , $\nabla\log p^{i}_{t}$ is $L$ -Lipschitz with $L\geq 1$ and the step size $h=T/N$ satisfies $h\lesssim 1/L$ ; for each $i=1,\ldots,k$ and $l=0,1,\ldots,N$ , $\mathbb{E}_{p^{i}_{lh}}\left[\left\|s^{i}_{lh,\theta^{*}}-\nabla\log p^{i}_{lh}\right\|_{2}^{2}\right]\leq\epsilon_{\text{score}}^{2}$ with small $\epsilon_{\text{score}}$ .

Assumption 2 is widely used in the diffusion model literature (see, for example, Chen et al. [12]).

To proceed, we denote by $\boldsymbol{\lambda}^{\ast}$ and $\boldsymbol{\Lambda}^{\ast}$ the solutions of Problems (9) and (10), respectively. Furthermore, the corresponding barycenters are denoted as $\mu_{\boldsymbol{\lambda}^{\ast}}$ and $\mu_{\boldsymbol{\Lambda}^{\ast}}$ . Assumption 3 below states that the theoretical optimal barycenters are close to the target measure, which ensures that all reference distributions together are able to provide sufficient information for the target distribution.

Assumption 3.

$D_{\text{KL}}\left(\nu\parallel\mu_{\boldsymbol{\lambda}^{*}}\right)\leq\epsilon_{0}^{2}$ and $D_{\text{KL}}\left(\nu\parallel\mu_{\boldsymbol{\Lambda}^{*}}\right)\leq\epsilon_{1}^{2}$ , with small $\epsilon_{0}$ and $\epsilon_{1}$ .

Based on Assumptions 1, 2 and 3, we provide convergence results for the vanilla fusion and ScoreFusion (Algorithm 1) in Theorems 3 and 4, respectively.

Theorem 3.

Suppose that Assumptions 1, 2, and 3 are satisfied. We further assume that for each fixed $\boldsymbol{\lambda}\in\Delta_{k}$ , $\text{TV}\left(\mu_{\boldsymbol{\lambda}},\hat{\mu}_{\boldsymbol{\lambda}}\right)\leq\epsilon_{2}$ , where $\hat{\mu}_{\boldsymbol{\lambda}}$ is the barycenter of the output distributions of the $k$ auxiliary processes. Then, for $\delta>0$ with $\delta\ll 1$ , the output distribution of the vanilla fusion method, $\hat{\nu}_{D}$ , satisfies, with probability at least $1-\delta$ ,

\text{TV}\left(\nu,\hat{\nu}_{D}\right)\lesssim\underbrace{\epsilon_{0}}_{\text{quality of combined auxiliaries}}+\underbrace{\mathcal{O}\left(\left(\log\left(\frac{1}{\delta}\right)\right)^{1/4}n^{-1/4}\right)}_{\text{mean estimation error}}+\underbrace{\epsilon_{2}}_{\text{auxiliary density estimation}}+\text{SE},

where SE is the error of auxiliary score estimation, defined as

\text{SE}=\left[\exp(-T)\max_{i=1,2,\ldots,k}\sqrt{D_{\text{KL}}\left(p^{i}_{T}\parallel\pi\right)}+\sigma\sqrt{kT}\left(\epsilon_{\text{score}}+L\sqrt{dh}+Lh\sqrt{M}\right)\right].

Theorem 4.

Suppose that Assumptions 1, 2, and 3 are satisfied. Then, for $\delta>0$ with $\delta\ll 1$ , the output distribution of Algorithm 1, $\hat{\nu}_{P}$ , satisfies, with probability at least $1-\delta$ ,

\text{TV}\left(\nu,\hat{\nu}_{P}\right)\lesssim\underbrace{\sigma\epsilon_{1}}_{\text{quality of combined auxiliaries}}+\underbrace{\mathcal{O}\left(\sigma\left(\log\left(\frac{1}{\delta}\right)\right)^{1/4}n^{-1/4}\right)}_{\text{sampling errors}}+\underbrace{\sigma\sqrt{k}\mathcal{O}\left(\tilde{T}^{1/4}\right)}_{\text{approximation of time 0}}+\text{SE}.

Theorems 3 and 4 demonstrate dimension-free sample complexities provided that the auxiliaries are well approximated and that the auxiliaries combined capture the features of the target well. More specifically, each bound in Theorems 3 and 4 has four terms, which represent different sources of error.

The quality of combined auxiliaries is the essential assumption in both Theorems 3 and 4. The sampling error in Theorem 4 reflects the fact that, with the help of diffusion models, the optimization becomes linear in terms of scores, making the problem easier and allowing it to escape the curse of dimensionality. The approximation of time $0$ term arises from replacing vanilla fusion with a small controllable noise, which makes the implementation much easier. It is worth noting that there is a tradeoff in choosing $\tilde{T}$ : the smaller $\tilde{T}$ , the more accurate the optimal weights are, but the more likely the algorithm is to encounter numerical instability. Finally, the score estimation term of the auxiliaries can be small with a careful choice of discretization time steps and accurate auxiliary score approximation (see Remark 3 in Appendix C.2).

5 Numerical results

We implement the ScoreFusion model and examine its performance on both synthetic and real-world image datasets. The auxiliary score functions use the same U-Net backbone as the code repository of Song et al. [47] for score-based diffusion. Our experiments vary the quantity of training data available to ScoreFusion and the baseline, which is a regular score-based diffusion model. We aim to demonstrate that in the low-data regime, ScoreFusion outperforms training a score model from scratch. This section summarizes key experiment findings, leaving implementation details and additional data to Appendix D.

5.1 Bimodal Gaussian mixture distributions

We test ScoreFusion’s ability to approximate a one-dimensional bimodal Gaussian mixture distribution using two auxiliary distributions. Since the data is synthetic, we can access the true density function of the target distribution and auxiliary distributions, shown in the right of Figure 2; the ground truth distribution is in grey. Table 1 gives the $1$ -Wasserstein distance $\mathcal{W}_{1}$ between the distribution learned by ScoreFusion and the ground truth distribution, calculated using SciPy.

Table 1: 1-Wasserstein distance from the ground truth distribution. Standard error is calculated from the $\mathcal{W}_{1}$ distances of $10$ random draws of $8096$ samples from each generator.

Model	$2^{5}$	$2^{6}$	$2^{7}$	$2^{9}$	$2^{10}$
Baseline	$106.93\pm 1.43$	$13.46\pm 0.28$	$16.74\pm 0.27$	$0.55\pm 0.04$	$\mathbf{0.15\pm 0.02}$
ScoreFusion	$\mathbf{0.39\pm 0.02}$	$\mathbf{0.51\pm 0.03}$	$\mathbf{0.36\pm 0.02}$	$\mathbf{0.38\pm 0.02}$	$0.30\pm 0.02$

Using only $64$ training samples, ScoreFusion can already learn a good representation of the ground truth distribution. In contrast, the distribution learned by the standard diffusion model is overly spread out and fails to capture the modes of the Gaussian mixture. Moreover, ScoreFusion consistently outperforms the baseline in $\mathcal{W}_{1}$ distance when the number of training samples is fewer than $2^{10}$ .
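For readers who want to reproduce this kind of evaluation, the sketch below computes the empirical $1$ -Wasserstein distance between two equal-size one-dimensional samples by matching sorted values, and reports a mean and standard error over repeated draws. It is a plain-Python illustration, not the evaluation script behind Table 1; SciPy's `scipy.stats.wasserstein_distance` computes the same quantity for empirical distributions. The sample sizes and generators are assumptions.

```python
import math
import random
import statistics

def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size empirical samples on the real line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def mean_and_stderr(values):
    """Mean and standard error of the mean over repeated evaluations."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))

rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Shifting every point by 0.3 moves the empirical distribution by exactly 0.3 in W1.
shifted = [x + 0.3 for x in reference]
w1_shift = wasserstein_1d(reference, shifted)

# Ten independent draws from a hypothetical generator, summarized as mean and standard error.
distances = [
    wasserstein_1d(reference, [rng.gauss(0.0, 1.0) for _ in range(1000)])
    for _ in range(10)
]
w1_mean, w1_se = mean_and_stderr(distances)
```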

5.2 EMNIST with heterogeneous digits mix

We further demonstrate our algorithm on the EMNIST dataset [15], an augmentation of the original MNIST dataset comprising handwritten digits in 1x28x28 format. We selected five subsets ( $D_{i}$ , $i=1,\ldots,5$ ) from EMNIST, focusing on the digits 7 and 9 with varying frequencies: $(10\%,90\%)$ , $(30\%,70\%)$ , $(70\%,30\%)$ , $(90\%,10\%)$ , and $(60\%,40\%)$ . $D_{1},\ldots,D_{4}$ serve as auxiliary datasets for training the auxiliary scores, each rich in training samples to ensure adequate training of the auxiliaries. $D_{5}$ is used as the target dataset for training both ScoreFusion and the baseline, with variations in training and validation data to assess comparative test performance.

Two metrics are used to evaluate image samples generated by different models. First is the Negative Log Likelihood (NLL), measured on a held-out test dataset and expressed in bits per dimension (bpd) [50]; a smaller NLL implies that test images are more likely samples from the trained generative model, and is a standard metric for evaluating diffusion models [24, 47]. Table 2 displays the results for target sample sizes ranging from $2^{6}$ to $2^{12}$ , which shows that the ScoreFusion model can generalize to test samples much better than the baseline diffusion model in the low-data regime.
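For reference, a per-image NLL in nats is converted to bits per dimension by dividing by the number of dimensions and by $\ln 2$. The sketch below assumes $1\times 28\times 28$ images and ignores any offset coming from data rescaling or dequantization; the NLL value is made up for illustration.

```python
import math

def nats_to_bits_per_dim(nll_nats, num_dims):
    """Convert a per-sample negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2.0))

num_dims = 1 * 28 * 28           # EMNIST image shape used in this section
hypothetical_nll = 2570.0        # illustrative value in nats, not a reported result
bpd = nats_to_bits_per_dim(hypothetical_nll, num_dims)
print(f"{bpd:.3f} bits per dimension")
```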

Table 2: Mean NLL (test) under different counts of training data

Sample size	$2^{6}$	$2^{8}$	$2^{10}$	$2^{12}$
Baseline	$7.186\pm 0.019$	$6.235\pm 0.016$	$5.725\pm 0.024$	$4.979\pm 0.028$
Single auxiliary	$4.768\pm 0.024$
ScoreFusion	$\mathbf{4.733\pm 0.029}$	$\mathbf{4.733\pm 0.018}$	$\mathbf{4.718\pm 0.022}$	$\mathbf{4.715\pm 0.021}$

The second metric examines the digit class distribution of generated samples, a discrete distribution over ten classes. This metric is related to the idea of sample diversity as explained in Naeem et al. [32]. To estimate the ratio of digits in the samples, we train an image classifier called SpinalNet [27] on the entire EMNIST digits class, achieving a $99.5\%$ classification accuracy. At evaluation, we sample $1024$ images from a trained generative model, feed them into the pre-trained SpinalNet, and average the outputs (i.e., the mean of $1024$ length-$10$ softmaxed logit vectors) to approximate the generative model’s digit distribution. A comparison is given in Table 3. ScoreFusion consistently mirrors the proportions of 7’s and 9’s in the ground truth dataset, whereas the baseline struggles. This is an impressive result given that this metric was not explicitly optimized in the training of ScoreFusion.
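The averaging step can be sketched as below. The classifier is replaced by hand-written logits, so `batch_logits` is a stand-in for SpinalNet outputs, and `summarize_7_9` is a hypothetical helper that produces the three rows reported in Table 3.

```python
import math

def softmax(logits):
    """Numerically stable softmax of one logit vector."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_distribution(batch_logits):
    """Average the softmax outputs over a batch of generated images."""
    probs = [softmax(row) for row in batch_logits]
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]

def summarize_7_9(dist):
    """Collapse a ten-class distribution into the rows 7, 9, and others."""
    return {"7": dist[7], "9": dist[9], "others": 1.0 - dist[7] - dist[9]}

def confident_logits(digit):
    # Toy logits: the classifier is almost certain about one digit.
    return [20.0 if c == digit else 0.0 for c in range(10)]

batch_logits = [confident_logits(7), confident_logits(7), confident_logits(9)]
dist = class_distribution(batch_logits)
print(summarize_7_9(dist))
```

Averaging probabilities rather than hard predictions keeps the classifier's uncertainty in the estimate, so ambiguous samples contribute fractionally to several classes.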

Table 3: Digit distribution estimated by SpinalNet. Bolded columns are the breakdown for ScoreFusion. The “Others” category is the fraction of samples that more closely resemble digits other than 7 or 9.

Digit	True	$2^{6}$		$2^{8}$		$2^{10}$		$2^{12}$
Digit	True	Baseline	Fusion	Baseline	Fusion	Baseline	Fusion	Baseline	Fusion
7	60%	47.9%	55.6%	66.8%	57.5%	65.5%	56.6%	66.7%	59.8%
9	40%	10.3%	39.4%	23.8%	38.0%	26.7%	39.8%	27.9%	36.7%
Others	0	41.8%	5.0%	9.4%	4.5%	7.8%	3.6%	5.4%	3.5%

Finally, we present generated images from the baseline diffusion model and ScoreFusion in Figure 3. With only 64 training images, ScoreFusion can already produce high-quality digits, while the baseline diffusion method generates unrecognizable images. ScoreFusion also outperforms the baseline with 256 training images, producing clearer and more accurate digits.

6 Conclusion

In this paper, we propose a fusion method based on the KL barycenter that can be easily implemented when the auxiliary score estimates are obtained from diffusion models. We provide a theoretical analysis of the sample complexity, showing that it is dimension-free given accurate auxiliary score estimation and closeness between the optimal KL barycenter and the target distribution. The numerical experiments further demonstrate that our fusion method performs much better than the basic diffusion model in the low-data regime. This work is a starting point for approximating a target distribution by fusion when data is limited, a task whose implementation is made much easier by diffusion models. More broadly, fusion methods may be applied to other variants in the diffusion model family, including different assumptions on the initial distributions [17, 18, 8], other neural network structures [11], and Schrödinger bridges [30, 54, 18].

Acknowledgments

This work is supported generously by the NSF grants CCF-2312204 and CCF-2312205 and Air Force Office of Scientific Research FA9550-20-1-0397. Additional support is gratefully acknowledged from NSF 1915967, 2118199, and 2229012.

References

Agueh and Carlier [2011] Martial Agueh and Guillaume Carlier. Barycenters in the wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
Ambrosio et al. [2005] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows. Springer Science & Business Media, 2005.
Anderson [1982] Brian D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
Austin et al. [2021] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021). NeurIPS, 2021.
Avrahami et al. [2022] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
Banerjee et al. [2005] Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, and Suvrit Sra. Clustering on the unit hypersphere using von mises-fisher distributions. Journal of Machine Learning Research, 6:1345–1382, 2005.
Benamou et al. [2015] Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
Block et al. [2022] Alexander Block, Youssef Mroueh, and Alexander Rakhlin. Generative modeling with denoising auto-encoders and langevin sampling. arXiv e-prints, 2022.
Braun et al. [2022] Gábor Braun, Alejandro Carderera, Cyrille W. Combettes, Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Sebastian Pokutta. Conditional gradient methods. arXiv preprint arXiv:2211.14103, 2022.
Cattiaux et al. [2022] Patrick Cattiaux, Giovanni Conforti, Ivan Gentil, and Christian Léonard. Time reversal of diffusion processes under a finite entropy condition. arXiv preprint arXiv:2104.07708, 2022.
Chen et al. [2023a] Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 4672–4712. PMLR, 2023a.
Chen et al. [2023b] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R. Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. arXiv preprint arXiv:2209.1121, 2023b.
Claici et al. [2018] Sebastian Claici, Edward Chien, and Justin Solomon. Stochastic wasserstein barycenters. In International Conference on Machine Learning, pages 1141–1150, 2018.
Claici et al. [2020] Sebastian Claici, Mikhail Yurochkin, Soumya Ghosh, and Justin Solomon. Model fusion with kullback-leibler divergence. In International Conference on Machine Learning. PMLR, 2020.
Cohen et al. [2017] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 international joint conference on neural networks (IJCNN), pages 2921–2926. IEEE, 2017.
Cuturi and Doucet [2014] Marco Cuturi and Arnaud Doucet. Fast computation of wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.
De Bortoli [2022] Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research, 2022.
De Bortoli et al. [2021] Valentin De Bortoli, Jacob Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. In Advances in Neural Information Processing Systems, volume 34, pages 17695–17709. Curran Associates, Inc., 2021.
Evans [2010] Lawrence C. Evans. Partial Differential Equations. American Mathematical Society, 2010.
Föllmer [1985] H Föllmer. An entropy approach to the time reversal of diffusion processes. Stochastic differential systems (Marseille-Luminy, 1984), 69:156–163, 1985.
Gelfand and Fomin [2000] I. M. Gelfand and S. V. Fomin. Calculus of Variations. Dover Publications, 2000.
Genevay et al. [2016] Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for large-scale optimal transport. Advances in Neural Information Processing Systems, 29:3432–3440, 2016.
Gong et al. [2022] Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and LingPeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933, 2022.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Hsu et al. [2021] Daniel Hsu, Clayton Sanford, Rocco A. Servedio, and Emmanouil V. Vlatakis-Gkaragkounis. On the approximation power of two-layer networks of random relus. In Proceedings of Machine Learning Research, volume 134, pages 1–39. 34th Annual Conference on Learning Theory, 2021.
Janati et al. [2020] Hicham Janati, Marco Cuturi, and Alexandre Gramfort. Debiased sinkhorn barycenters. In International Conference on Machine Learning, pages 4692–4701, 2020.
Kabir et al. [2022] HM Dipu Kabir, Moloud Abdar, Abbas Khosravi, Seyed Mohammad Jafar Jalali, Amir F Atiya, Saeid Nahavandi, and Dipti Srinivasan. Spinalnet: Deep neural network with gradual input. IEEE Transactions on Artificial Intelligence, 2022.
Kuznetsova et al. [2020] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. In International Journal of Computer Vision. ICCV, 2020.
Li et al. [2023] Puheng Li, Zhong Li, Huishuai Zhang, and Jiang Bian. On the generalization properties of diffusion models. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023). NeurIPS, 2023.
Liu et al. [2022] Hongjun Liu, Xiang Zhang, and Qionghai Li. Sb-ddpm: Schrödinger bridge diffusion denoising probabilistic model for generative tasks. IEEE Transactions on Neural Networks and Learning Systems, 2022.
Mokady et al. [2022] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. arXiv preprint arXiv:2211.09794, 2022.
Naeem et al. [2020] Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, pages 7176–7185. PMLR, 2020.
Omri Avrahami [2022] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022.
Pan and Yang [2009] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
Peyré and Cuturi [2019] Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
Peyré et al. [2016] Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, pages 2664–2672, 2016.
Popov et al. [2021] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Learning Representations, 2021.
Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
Rasul et al. [2021] Kashif Rasul, Calvin Seward, Ingmar Schuster, and Roland Vollgraf. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. In International Conference on Learning Representations, 2021.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
Saharia et al. [2022a] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems 36 (NeurIPS 2022). NeurIPS, 2022a.
Saharia et al. [2022b] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022b.
Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems 36. NeurIPS, 2022.
Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. Proceedings of the 32nd International Conference on Machine Learning, 37:2256–2265, 2015.
Solomon et al. [2015] Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, and Leonidas Guibas. Convolutional wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34(4):66, 2015.
Song et al. [2021a] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In Advances in Neural Information Processing Systems 35 (NeurIPS 2021). NeurIPS, 2021a.
Song et al. [2021b] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modelling through stochastic differential equations. In International Conference on Learning Representations (ICLR 2021). ICLR, 2021b.
Staib et al. [2017] Matthew Staib, Sebastian Claici, Justin Solomon, and Stefanie Jegelka. Parallel streaming wasserstein barycenters. In Advances in Neural Information Processing Systems, volume 30, pages 2647–2658, 2017.
Tan et al. [2020] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. Artificial Intelligence Review, 52(2):1089–1116, 2020.
Theis et al. [2016] L Theis, A van den Oord, and M Bethge. A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR 2016), pages 1–10, 2016.
Torrey and Shavlik [2010] Lisa Torrey and Jude Shavlik. Transfer learning. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pages 242–264, 2010.
Tumanyan et al. [2022] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. arXiv preprint arXiv:2211.12572, 2022.
van Handel [2016] Ramon van Handel. Probability in high dimension, apc 550 lecture notes, December 2016.
Vargas et al. [2022] Francisco J. Vargas, James E. Taylor, and Valentin de Bortoli. Solving schrödinger bridges via maximum likelihood: Applications to diffusion-based generative modeling. In Proceedings of the International Conference on Learning Representations, 2022.
Wang et al. [2023] Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang Wang, Weizhu Chen, and Mingyuan Zhou. Patch diffusion: Faster and more data-efficient training of diffusion models. arXiv preprint arXiv:2304.12526, 2023.
Weiss et al. [2016] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. In Journal of Big Data, volume 3, pages 1–40. Springer, 2016.
Wu et al. [2022] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv preprint arXiv:2212.11565, 2022.
Zhang et al. [2023] Ruoyu Zhang, Yanzeng Li, Yongliang Ma, Ming Zhou, and Lei Zou. Llmaaa: Making large language models as active annotators. arXiv preprint arXiv:2310.19596, 2023.
Zhu et al. [2023] Jingyuan Zhu, Huimin Ma, Jiansheng Chen, and Jian Yuan. Domainstudio: Fine-tuning diffusion models for domain-driven image generation using limited data. arXiv preprint arXiv:2306.14153, 2023.
Zhuang et al. [2021] Fuzhen Zhuang, Ziliang Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. In Proceedings of the IEEE, volume 109, pages 43–76. IEEE, 2021.

Appendix A More about basic diffusion models

A.1 About the time reversal formula

Note that Equations (3) and (4) are still represented as “forward” processes. If we replace $W(t)$ by $\tilde{W}(t)$, where $\tilde{W}(t)$ is a standard $d$-dimensional Brownian motion that flows backward from time $T$ to 0, then Equation (3) becomes

d\hat{X}(t)=\left(f(T-t,\hat{X}(t))-g^{2}(T-t)\nabla\log p_{T-t}\left(\hat{X}(t)\right)\right)dt+g(T-t)d\tilde{W}(t),\quad\hat{X}(T)\sim p_{T},

which is the reverse SDE presented in Song et al. [47]. Hence, for the forward OU process, the reverse process admits the alternative representation

d\hat{X}(t)=\left(-a\hat{X}(t)-\sigma^{2}\nabla\log p_{T-t}\left(\hat{X}(t)\right)\right)dt+\sigma d\tilde{W}(t),\quad\hat{X}(T)\sim p_{T}.

A.2 Discretization and backward sampling

In this section, we follow the scheme in Chen et al. [12].

Given $n$ samples $X_{0}^{(1)},\ldots,X_{0}^{(n)}$ from $p_{0}$ (the data distribution), we train a neural network with the loss function (5). Let $h>0$ be the step size of the time discretization and $N$ the number of steps, so that $T=Nh$. We assume that for each time $lh$, $l=0,1,\ldots,N$, the score estimate $s_{lh,\theta^{*}}$ of $\nabla\log p_{lh}$ is obtained. In order to simulate the reverse SDE (3), we first replace the score function $\nabla\log p_{T-t}$ with the estimate $s_{T-t,\theta^{*}}$. Next, for each $t\in[lh,(l+1)h]$, we freeze the value of this coefficient in the SDE at time $lh$, which yields the new time-discretized SDE for $t\in[lh,(l+1)h]$,

d\hat{X}(t)=\left(-f(T-t,\hat{X}(t))+g^{2}(T-t)s_{T-t,\theta^{*}}\left(\hat{X}_{lh}\right)\right)dt+g(T-t)dW(t)

(11)

and $\hat{X}(0)\sim\Pi$ , where $\Pi$ is the (theoretical) stationary distribution of the forward process (1).

There are several details in this implementation. In practice, when the forward process is an OU process, Equation (11) becomes

d\hat{X}(t)=\left(a\hat{X}(t)+\sigma^{2}s_{T-t,\theta^{*}}\left(\hat{X}_{lh}\right)\right)dt+\sigma dW(t),\quad t\in[lh,(l+1)h],

with $\Pi=\pi$ , which is a linear SDE. In particular, $X_{(l+1)h}$ conditioned on $X_{lh}$ is Gaussian, so the sampling is easier.
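Because the score term is constant over each step, the linear SDE above can be integrated exactly, which is why the conditional law is Gaussian. The sketch below is a one-dimensional illustration under stated assumptions: the score is frozen at time $T-lh$ and state $\hat{X}_{lh}$, the functions `reverse_ou_step` and `sample_reverse` are hypothetical names, and the learned score is replaced by the closed-form score of a toy Gaussian target $N(2, 0.25)$ so that the result can be checked.

```python
import math
import random
import statistics

def reverse_ou_step(x, score, a, sigma, h, noise):
    """Exact step of dX = (a X + sigma^2 s) dt + sigma dW with s frozen over [lh, (l+1)h]."""
    growth = math.exp(a * h)
    mean = growth * x + sigma ** 2 * score * (growth - 1.0) / a
    std = sigma * math.sqrt((math.exp(2.0 * a * h) - 1.0) / (2.0 * a))
    return mean + std * noise

def sample_reverse(score_fn, a, sigma, T, n_steps, rng):
    """Draw one sample by running the discretized reverse process from the stationary law."""
    h = T / n_steps
    x = rng.gauss(0.0, sigma / math.sqrt(2.0 * a))   # X(0) ~ pi = N(0, sigma^2 / (2a))
    for l in range(n_steps):
        s = score_fn(T - l * h, x)                   # score frozen at the left endpoint
        x = reverse_ou_step(x, s, a, sigma, h, rng.gauss(0.0, 1.0))
    return x

# Toy setting (an assumption for illustration): p_0 = N(2, 0.25), forward OU with a = 1.
a, sigma, T = 1.0, math.sqrt(2.0), 5.0

def gaussian_score(t, x):
    """Closed-form score of p_t when p_0 = N(2, 0.25) under the forward OU process."""
    decay = math.exp(-2.0 * a * t)
    mean_t = 2.0 * math.exp(-a * t)
    var_t = 0.25 * decay + (sigma ** 2 / (2.0 * a)) * (1.0 - decay)
    return -(x - mean_t) / var_t

rng = random.Random(0)
samples = [sample_reverse(gaussian_score, a, sigma, T, 200, rng) for _ in range(2000)]
print(statistics.mean(samples), statistics.variance(samples))
```

With a learned network, `gaussian_score` would be replaced by the trained estimate $s_{t,\theta^{*}}$; the update rule itself is unchanged.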

In theory, we should initialize the reverse process from $p_{T}$, which we have no access to. The above implementation takes advantage of $p_{T}\approx\Pi$ when $T$ is large enough, which introduces a small initialization error.

A.3 About the generalization error of basic diffusion model

In Li et al. [29], a random feature model is considered as the score estimator. The basic intuition is that the generalization error with respect to the KL divergence, $D_{\text{KL}}\left(\mu\parallel\hat{\mu}\right)$, is decomposed into three terms: the training error, the approximation error of the underlying random feature model, and the convergence error of stationary measures. Among these three, the third is negligible owing to the fast convergence of an OU process (which also follows from the log-Sobolev inequality for Gaussian random variables in van Handel [53]). The first is also small, since the random feature model in this setting is essentially linear regression with least squares.

Moreover, as stated in Hsu et al. [25], the random feature model can approximate Lipschitz functions with compact support. However, the approximation error can be large and can exhibit the curse of dimensionality if we choose $m\sim n$. To illustrate this, we make a more general statement including smoothness considerations.

To be more precise, we introduce the following setting. We use the basic diffusion model with a forward OU process. The score function $s_{t,\theta}(x)$ is parameterized by the random feature model with $m$ random features:

s_{t,\theta}(x)=\frac{1}{m}A\sigma\left(Wx+Ue(t)\right)=\frac{1}{m}\sum_{j=1}^{m}a_{j}\sigma\left(w_{j}^{T}x+u_{j}^{T}e(t)\right),

where $\sigma$ is the ReLU activation function, $A=(a_{1},\ldots,a_{m})\in\mathbb{R}^{d\times m}$ collects the trainable parameters, $W=(w_{1},\ldots,w_{m})^{T}\in\mathbb{R}^{m\times d}$ and $U=(u_{1},\ldots,u_{m})^{T}\in\mathbb{R}^{m\times d_{e}}$ are initially sampled from some pre-chosen distributions (related to random features) and remain frozen during training, and $e:\mathbb{R}_{+}\to\mathbb{R}^{d_{e}}$ is the time embedding function. The precise description is given below.
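The parameterization above can be sketched in a few lines. Everything beyond the formula itself is an assumption made for illustration: the standard Gaussian sampling of $W$, $U$, and the initial $A$, and the bounded cosine time embedding in `time_embedding`, are hypothetical choices, since the distribution of the random features is not specified here.

```python
import math
import random

def relu(z):
    return max(z, 0.0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def time_embedding(t, d_e):
    """Hypothetical bounded embedding e(t) in R^{d_e}."""
    return [math.cos((k + 1) * t) for k in range(d_e)]

def make_features(m, d, d_e, rng):
    """Sample frozen features W (m x d), U (m x d_e) and initial trainable A (d x m)."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    U = [[rng.gauss(0.0, 1.0) for _ in range(d_e)] for _ in range(m)]
    A = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(d)]
    return A, W, U

def score(A, W, U, x, t):
    """Random feature score s_{t,theta}(x) = (1/m) A relu(W x + U e(t))."""
    e = time_embedding(t, len(U[0]))
    hidden = [relu(dot(w, x) + dot(u, e)) for w, u in zip(W, U)]
    m = len(hidden)
    return [dot(row, hidden) / m for row in A]

A, W, U = make_features(m=8, d=3, d_e=2, rng=random.Random(0))
x = [0.5, -1.0, 2.0]
print(score(A, W, U, x, t=0.3))
```

Only `A` would be updated during training; `W` and `U` stay fixed, which is what makes the fitting problem linear in the trainable parameters.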

Assume that $a_{j},w_{j},$ and $u_{j}$ are drawn i.i.d. from a distribution $\rho$. Then, as $m\to\infty$, by the strong law of large numbers, with probability 1,

s_{t,\theta}(x)\to\bar{s}_{t,\bar{\theta}}(x)=\mathbb{E}_{(w,u)\sim\rho_{0}}\left[a(w,u)\sigma\left(w^{T}x+u^{T}e(t)\right)\right],

(12)

where $a(w,u)=\frac{1}{\rho_{0}(w,u)}\int a\rho(a,w,u)da$ and $\rho_{0}(w,u)=\int\rho(a,w,u)da$. By the positive homogeneity of the ReLU function, we may assume $\left\lVert u\right\rVert+\left\lVert w\right\rVert\leq 1$. The optimal solution obtained when $s_{t,\theta}(x)$ is replaced by $\bar{s}_{t,\bar{\theta}}(x)$ in the loss objective is denoted by $\bar{\theta^{*}}$.

Define a kernel $K_{\rho_{0}}(x,y)=\mathbb{E}_{(w,u)\sim\rho_{0}}\left[\sigma\left(w^{T}x+u^{T}e(t)\right)\sigma\left(w^{T}y+u^{T}e(t)\right)\right]$ and denote the induced reproducing kernel Hilbert space (RKHS) by $\mathcal{H}_{K_{\rho_{0}}}$; when there is no ambiguity, we write $\mathcal{H}:=\mathcal{H}_{K_{\rho_{0}}}$. It follows that $\bar{s}_{t,\bar{\theta}}\in\mathcal{H}$ if and only if $\left\lVert\bar{s}_{t,\bar{\theta}}\right\rVert_{\mathcal{H}}=\mathbb{E}_{(w,u)\sim\rho_{0}}\left[\left\lVert a(w,u)\right\rVert_{2}^{2}\right]<\infty$.

In Hsu et al. [25], a notion of approximation quality called minimum width of the neural network is defined to measure the minimum number of random features needed to guarantee an accurate enough approximation with high probability. The exact definition is given below.

Definition 1.

Let $\epsilon,\delta>0$ and let $f:\mathbb{R}^{d}\to\mathbb{R}$ be a function with bounded norm $\left\lVert f\right\rVert_{\alpha}<\infty$, where $\alpha$ is the measure on $\mathbb{R}^{d}$ associated with the corresponding function space. We also denote $g^{(i)}(x)=\sigma\left(w^{T}x+u^{T}e(t)\right)$. The minimum width $m_{f,\epsilon,\delta,\alpha,\rho_{0}}$ is defined to be the smallest $r\in\mathbb{Z}^{+}$ such that with probability at least $1-\delta$ over $g^{(1)},\ldots,g^{(r)}$,

\inf_{g\in\text{span}\left(g^{(1)},\ldots,g^{(r)}\right)}\left\lVert f-g\right\rVert_{\alpha}<\epsilon.

Moreover, for $s\geq 0$, $p\in[1,\infty]$, and an open and bounded set $U\subset\mathbb{R}^{d}$, $W^{s,p}(U)$ denotes the Sobolev space of order $s,p$, which consists of all locally integrable functions $f$ such that for each multiindex $\alpha$ with $|\alpha|\leq s$, the weak derivative of $f$ exists and has finite $L^{p}$ norm (see Evans [19]). If $p=2$, we write $W^{s,2}(U)=H^{s}(U)$ to reflect the fact that it is a Hilbert space. Finally, recall that the space of all Lipschitz functions on $U$ is the same as $W^{1,\infty}(U)$.

With these settings and definitions, we can state and prove the following generalization error for the basic diffusion model using random feature model.

Theorem 5.

Suppose that the target distribution $\mu$ is continuously differentiable and has compact support, that we choose an appropriate random feature distribution $\rho_{0}$, and that there exists an RKHS $\mathcal{H}$ such that $\bar{s}_{0,\bar{\theta^{*}}}\in\mathcal{H}$. Assume that the initial loss, trainable parameters, the embedding function $e(t)$ and the weighting function $\gamma(t)$ are all bounded. We further suppose that for all $t\in[0,T]$, the score function $\nabla\log p_{t}\in H^{s}(K)\cap W^{1,\infty}(K)$ and there exists $\gamma>0$ such that $\left\lVert\nabla\log p_{t}\right\rVert_{H^{s}(K)}\leq\gamma$, where $K\subset\mathbb{R}^{d}$ is compact. Then for fixed $0<\epsilon,\delta\ll 1$, with probability at least $1-\delta$, we have

	$\displaystyle D_{\text{KL}}\left(\mu\parallel\hat{\mu}\right)$	$\displaystyle\lesssim\left(\frac{\tau^{4}}{m^{3}n}+\frac{\tau^{2}}{mn}+\frac{\tau^{3}}{m^{2}}+\frac{1}{\tau}+\frac{1}{m}\right)$
		$\displaystyle+\min\left(\left(\frac{s}{\log m}\right)^{s/2},\left(\frac{d\left(m^{1/d}-2\right)}{s\gamma^{2/s}}\right)^{-s/2}\right)+D_{\text{KL}}\left(p_{T}\parallel\pi\right),$

where $\tau$ is the training time (steps) in the gradient flow dynamics (see Li et al. [29]), $m$ is the number of random features, $n$ is the sample size of the target distribution, $\pi$ is the stationary Gaussian distribution, $p_{T}$ is the distribution of the forward OU process at time $T$ , $\mu$ is the target distribution, and $\hat{\mu}$ is the distribution of the generated samples.

Proof.

The proof follows that of Theorem 1 in Li et al. [29]. The only extra work is to compute the universal approximation error of the random feature model for Sobolev functions on a compact domain. By the compact support assumption (Lemma 1 in Li et al. [29]), the forward process defines a random path $\left(X(t),t\right)_{t\in[0,T]}$ contained in a compact rectangular domain in $\mathbb{R}^{d+1}$.

Theorem 35 in Hsu et al. [25] states the existence of a random feature distribution $\rho_{0}$ such that for any $f\in H^{s}(K)$ with $\left\lVert f\right\rVert_{H^{s}(K)}\leq\gamma$, $m_{f,\epsilon,\delta,\alpha,\rho_{0}}\lesssim\frac{s^{2}\gamma^{2+4/s}d^{2}}{\epsilon^{2+4/s}}\log\left(\frac{1}{\delta}\right)\exp\left(\min\left(d\log\left(\frac{\gamma^{2}}{\epsilon^{2}d}+2\right),\frac{\gamma^{2}}{\epsilon^{2}}\log\left(\frac{d\epsilon^{2}}{\gamma^{2}}+2\right)\right)\right)$, which implies the approximation error term.

Remark 1.

The random feature model has two difficulties in implementation.

If $m$, $T$, and $\tau$ are large enough, then the generalization error is small regardless of the sample size $n$. However, choosing the random feature distribution $\rho_{0}$ is hard in practice, especially since neither Hsu et al. [25] nor Li et al. [29] specifies a method for choosing $\rho_{0}$. Therefore, the assumption that $\rho_{0}$ is appropriately chosen is very strong.

Even if $\rho_{0}$ is appropriately chosen, if we let $m\sim n$ and try to find an optimal early stopping time as in Li et al. [29], the term $\min\left(\left(\frac{s}{\log n}\right)^{s/2},\left(\frac{d\left(n^{1/d}-2\right)}{s\gamma^{2/s}}\right)^{-s/2}\right)$ still dominates and exhibits the curse of dimensionality.
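To see the dimension dependence numerically, the dominating term can be evaluated for a fixed sample size and growing $d$. The values $n=10^{4}$, $s=2$, and $\gamma=1$ below are arbitrary illustrative choices, and the function name `approximation_term` is hypothetical. When $n^{1/d}\leq 2$ the second branch is not a real number, so this sketch falls back to the first branch in that case.

```python
import math

def approximation_term(n, d, s, gamma):
    """min((s / log n)^(s/2), (d (n^(1/d) - 2) / (s gamma^(2/s)))^(-s/2)) with m ~ n."""
    first = (s / math.log(n)) ** (s / 2.0)
    base = d * (n ** (1.0 / d) - 2.0) / (s * gamma ** (2.0 / s))
    if base <= 0.0:
        # Second branch undefined for a non-positive base; only the log rate remains.
        return first
    return min(first, base ** (-s / 2.0))

n, s, gamma = 10 ** 4, 2, 1.0
terms = [approximation_term(n, d, s, gamma) for d in (1, 2, 8, 32)]
for d, value in zip((1, 2, 8, 32), terms):
    print(f"d = {d:2d}: {value:.4f}")
```

For small $d$ the polynomial branch is active and the term is small, while for larger $d$ the term saturates at the slow $(s/\log n)^{s/2}$ rate, so adding samples helps only logarithmically.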

Appendix B Proof of results in Section 3.1

Before the proofs, we note the strict convexity of the KL barycenter problems via a simple lemma.

Lemma 1.

For any Polish space $S$, the KL barycenter problem $\min_{\mu\in\mathcal{P}(S)}\sum_{i=1}^{k}\lambda_{i}D_{\text{KL}}\left(\mu\parallel P_{i}\right)\ \text{s.t.}\ \sum_{i=1}^{k}\lambda_{i}=1$ is strictly convex.

Proof.

Let $t\in(0,1)$ and let $\mu_{1},\mu_{2}\in\mathcal{P}(S)$ with $\mu_{1}\neq\mu_{2}$ be such that $\mu_{1}\ll P_{i}$ and $\mu_{2}\ll P_{i}$ for each $i=1,2,\ldots,k$. Then

	$\displaystyle\sum_{i=1}^{k}\lambda_{i}D_{\text{KL}}\left(t\mu_{1}+(1-t)\mu_{2}\parallel P_{i}\right)$	$\displaystyle<\sum_{i=1}^{k}\lambda_{i}\left[tD_{\text{KL}}\left(\mu_{1}\parallel P_{i}\right)+(1-t)D_{\text{KL}}\left(\mu_{2}\parallel P_{i}\right)\right]$
		$\displaystyle=t\sum_{i=1}^{k}\lambda_{i}D_{\text{KL}}\left(\mu_{1}\parallel P_{i}\right)+(1-t)\sum_{i=1}^{k}\lambda_{i}D_{\text{KL}}\left(\mu_{2}\parallel P_{i}\right),$

where the inequality follows from the strict convexity of the KL divergence in $\mu$ for fixed $P_{i}$. Therefore, the KL barycenter problem is strictly convex.

B.1 Proof of Theorem 1

Proof.

It suffices to consider a probability measure $\mu\in\mathcal{P}(\mathbb{R}^{d})$ that is absolutely continuous with density $q(x)$ (otherwise the KL divergence is $\infty$) and to show existence. When there is no confusion, we use the density and the measure interchangeably. We denote by $\mathcal{P}_{\text{ac}}(\mathbb{R}^{d})$ the space of all absolutely continuous distributions and define a functional $F$ on $\mathcal{P}_{\text{ac}}(\mathbb{R}^{d})$ such that, for $x\in\mathbb{R}^{d},$

F(q,x)=\sum_{i=1}^{k}\lambda_{i}q(x)\log\left(\frac{q(x)}{p_{i}(x)}\right).

Therefore, the barycenter problem becomes

\min_{\mu\in\mathcal{P}_{\text{ac}}(\mathbb{R}^{d})}\int_{x\in\mathbb{R}^{d}}F(q,x)dx\quad\text{s.t.}\ \sum_{i=1}^{k}\lambda_{i}=1\text{ and }\int_{x\in\mathbb{R}^{d}}q(x)dx=1,

which is a variational problem with a subsidiary condition [21]. Therefore, from the calculus of variations, a necessary condition for $q$ to be an extremal of the variational problem is that, for some constant $m$,

\frac{\partial}{\partial q}F(q,x)+m=0.

Since $\sum_{i=1}^{k}\lambda_{i}=1$, this condition reads $\log q(x)+1-\sum_{i=1}^{k}\lambda_{i}\log p_{i}(x)+m=0$, so $q(x)\propto\prod_{i=1}^{k}p_{i}(x)^{\lambda_{i}}$. Normalizing, the optimal solution is

q^{*}(x)=\frac{\prod_{i=1}^{k}p_{i}(x)^{\lambda_{i}}}{\int\prod_{i=1}^{k}p_{i}(x)^{\lambda_{i}}dx}.
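As a sanity check of the closed form, consider one-dimensional Gaussian references $p_{i}=N(m_{i},v_{i})$, an assumption made only for this example. The normalized geometric mean is then again Gaussian, with precision $\sum_{i}\lambda_{i}/v_{i}$ and a precision-weighted mean. The sketch below compares the weighted KL objective at this point with nearby Gaussians; the function names and the numerical values are hypothetical.

```python
import math

def kl_gauss(m, v, m_i, v_i):
    """KL( N(m, v) || N(m_i, v_i) ) for one-dimensional Gaussians (v, v_i are variances)."""
    return 0.5 * (math.log(v_i / v) + (v + (m - m_i) ** 2) / v_i - 1.0)

def gaussian_kl_barycenter(means, variances, weights):
    """Normalized geometric mean of Gaussians: precisions and natural means average linearly."""
    precision = sum(w / v for w, v in zip(weights, variances))
    var = 1.0 / precision
    mean = var * sum(w * m / v for w, m, v in zip(weights, means, variances))
    return mean, var

def objective(m, v, means, variances, weights):
    """Weighted sum of KL divergences from N(m, v) to each reference."""
    return sum(w * kl_gauss(m, v, m_i, v_i)
               for w, m_i, v_i in zip(weights, means, variances))

means, variances, weights = [-1.0, 2.0], [1.0, 4.0], [0.3, 0.7]
m_star, v_star = gaussian_kl_barycenter(means, variances, weights)
best = objective(m_star, v_star, means, variances, weights)
nearby = [objective(m_star + dm, v_star * scale, means, variances, weights)
          for dm in (-0.2, 0.0, 0.2) for scale in (0.8, 1.0, 1.25)
          if not (dm == 0.0 and scale == 1.0)]
print(m_star, v_star, best, min(nearby))
```

The comparison only probes Gaussian competitors, so it illustrates the formula rather than proving optimality over all densities.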

B.2 Proof of Theorem 2

Before the proof of Theorem 2, we review a consequence of Girsanov’s theorem (Theorem 8 in Chen et al. [12]). We will use a technique similar to that in Chen et al. [12] to prove Theorem 2.

Theorem 6.

Suppose $Q\in\mathcal{P}(C([0,T]:\mathbb{R}^{d}))$. For $t\in[0,T]$, let $\mathcal{L}(t)=\int_{0}^{t}b(s)dB(s)$ and define the stochastic exponential $\mathcal{E}\left(\mathcal{L}\right)(t)=\exp\left(\int_{0}^{t}b(s)dB(s)-\frac{1}{2}\int_{0}^{t}\left\|b(s)\right\|_{2}^{2}ds\right)$, where $B$ is a $Q$-Brownian motion. Assume $\mathbb{E}_{Q}\left[\int_{0}^{T}\left\|b(s)\right\|_{2}^{2}ds\right]<\infty$. Then $\mathcal{L}$ is a square integrable $Q$-martingale. Moreover, if $\mathbb{E}_{Q}\left[\mathcal{E}\left(\mathcal{L}\right)(T)\right]=1,$ then $\mathcal{E}\left(\mathcal{L}\right)$ is a true $Q$-martingale and the process $B(t)-\int_{0}^{t}b(s)ds$ is a $P$-Brownian motion, where $P$ is a probability measure such that $P=\mathcal{E}\left(\mathcal{L}\right)(T)Q$.

In most applications of Girsanov’s theorem, we verify a sufficient condition known as Novikov’s condition. In the context of Theorem 6, Novikov’s condition is

\mathbb{E}_{Q}\left[\exp\left(\frac{1}{2}\int_{0}^{T}\left\|b(s)\right\|_{2}^{2}ds\right)\right]<\infty.

(13)

Now we begin the proof of Theorem 2.

Proof.

From Lemma 1, it suffices to show existence. Let $\alpha\in\mathcal{P}(C([0,T]:\mathbb{R}^{d}))$ with initial distribution $\alpha_{0}$. We write $\alpha(0)$ for the initial state of the process whose law is $\alpha$. From the chain rule of KL divergence, we have

	$\displaystyle\sum_{i=1}^{k}\lambda_{i}D_{\text{KL}}\left(\alpha\parallel P_{i}\right)$	$\displaystyle=\sum_{i=1}^{k}\lambda_{i}D_{\text{KL}}\left(\alpha_{0}\parallel\mu_{i}\right)$
		$\displaystyle+\mathbb{E}_{z\sim\alpha_{0}}\left[\sum_{i=1}^{k}\lambda_{i}D_{\text{KL}}\left(\alpha\left(\cdot\mid\alpha(0)=z\right)\parallel P_{i}\left(\cdot\mid P_{i}(0)=z\right)\right)\right],$

where the first term corresponds to the KL barycenter problem for the initial distributions, and the second term corresponds to the KL barycenter problem in which all reference processes have the same initial distribution. Therefore, to finish the proof, we can assume that for each $i=1,\ldots,k$, $\mu_{i}\sim\mu$, that is, all reference processes share the same initial distribution.

Since we are minimizing the weighted sum of KL divergences, it is sufficient to assume that $\alpha$ is the law of a diffusion process which is a strong solution of an SDE with the same diffusion (volatility) coefficient as all reference processes:

dX(t)=\left[c\left(t,X(t)\right)+\sigma(t)^{2}a\left(t,X(t)\right)\right]dt+% \sigma(t)dB(t),X(0)\sim\mu,

where $B$ is a standard Brownian motion, and otherwise the KL divergence would be $\infty$ . For now, we assume that $a(t,x)$ is uniformly bounded.

When applying Girsanov’s theorem, it is more convenient to view different path measures in $\mathcal{P}(C([0,T]:\mathbb{R}^{d}))$ as different laws of a single stochastic process. For notational convenience, we denote this process by $\{Z(t)\}_{t\in[0,T]}$.

For each $i=1,\ldots,k$, we can apply Girsanov’s theorem with $Q=\alpha$ and

b(t)=\sigma(t)\left(a_{i}(t,Z(t))-a(t,Z(t))\right)

in the setting of Theorem 6. Therefore, under the measure $P=\mathcal{E}\left(\mathcal{L}\right)(T)\alpha$ , there exists a Brownian motion $\{\beta(t)\}_{t\in[0,T]}$ such that

dB(t)=\sigma(t)\left(a_{i}(t,Z(t))-a(t,Z(t))\right)dt+d\beta(t).

Since under the measure $\alpha$ , with probability 1,

dZ(t)=\left[c\left(t,Z(t)\right)+\sigma(t)^{2}a\left(t,Z(t)\right)\right]dt+% \sigma(t)dB(t),Z(0)\sim\mu,

then this also holds $P$ -almost surely, which implies that $P$ -almost surely, $Z(0)\sim\mu$ , and

	$\displaystyle dZ(t)$	$\displaystyle=\left[c\left(t,Z(t)\right)+\sigma(t)^{2}a\left(t,Z(t)\right)\right]dt+\sigma(t)^{2}\left[a_{i}(t,Z(t))-a(t,Z(t))\right]dt+\sigma(t)d\beta(t)$
		$\displaystyle=\left[c\left(t,Z(t)\right)+\sigma(t)^{2}a_{i}\left(t,Z(t)\right)\right]dt+\sigma(t)d\beta(t).$

In other words, $P$ coincides with $P_{i}$.

Therefore,

	$\displaystyle D_{\text{KL}}\left(\alpha\parallel P_{i}\right)$	$\displaystyle=\mathbb{E}_{\alpha}\left[\log\left(\frac{d\alpha}{dP_{i}}\right)\right]$
		$\displaystyle=\mathbb{E}_{\alpha}\left[\log\left(\frac{1}{\mathcal{E}\left(\mathcal{L}\right)(T)}\right)\right]$
		$\displaystyle=\frac{1}{2}\mathbb{E}_{\alpha}\left[\int_{0}^{T}\sigma(t)^{2}\left\|a_{i}(t,Z(t))-a(t,Z(t))\right\|_{2}^{2}dt\right]$
		$\displaystyle+\mathbb{E}_{\alpha}\left[\int_{0}^{T}\sigma(t)\left(a(t,Z(t))-a_{i}(t,Z(t))\right)dB(t)\right]$
		$\displaystyle=\frac{1}{2}\mathbb{E}_{\alpha}\left[\int_{0}^{T}\sigma(t)^{2}\left\|a_{i}(t,Z(t))-a(t,Z(t))\right\|_{2}^{2}dt\right],$

since the Itô integral of a sufficiently regular integrand is a true martingale and therefore has zero expectation.

Therefore, the objective function of process level KL barycenter problem becomes

\frac{1}{2}\sum_{i=1}^{k}\lambda_{i}\mathbb{E}_{\alpha}\left[\int_{0}^{T}\sigma(t)^{2}\left\|a(t,Z(t))-a_{i}(t,Z(t))\right\|_{2}^{2}dt\right],

given that all reference laws are assumed to have the same initial distribution. Therefore, as a functional optimization problem, the minimizer is $a^{*}(t,x)=\sum_{i=1}^{k}\lambda_{i}a_{i}(t,x)$, which finishes the proof.
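The last step of the proof rests on an elementary fact: for fixed $(t,x)$ and weights on the simplex, the weighted sum of squared distances $\sum_{i}\lambda_{i}\|a-a_{i}\|_{2}^{2}$ is minimized at the weighted average. The following is a minimal numerical sketch of that fact only; the drift values and the helper name `weighted_objective` are hypothetical and are not part of the paper's implementation.

```python
import numpy as np

def weighted_objective(a, drifts, weights):
    """Sum_i lambda_i * ||a - a_i||^2 for one fixed (t, x)."""
    diffs = drifts - a                      # shape (k, d)
    return float(np.sum(weights * np.sum(diffs**2, axis=1)))

rng = np.random.default_rng(0)
k, d = 4, 3
drifts = rng.normal(size=(k, d))            # hypothetical values a_i(t, x)
weights = rng.dirichlet(np.ones(k))         # lambda on the simplex

a_star = weights @ drifts                   # candidate minimizer sum_i lambda_i a_i
best = weighted_objective(a_star, drifts, weights)

# Moving away from a_star by p raises the objective by ||p||^2,
# because the cross term vanishes when the weights sum to one.
perturbations = rng.normal(size=(100, d))
gaps = np.array([
    weighted_objective(a_star + p, drifts, weights) - best
    for p in perturbations
])
```

Since $\sigma(t)^{2}$ is nonnegative, the same comparison applies to the integrand at every $(t,x)$.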

Appendix C Proof of results in Section 4

C.1 Preliminaries and basic tools

C.1.1 Preliminaries

We include this subsection to present basic definitions and notations used in our proofs.

Definition 2.

Let $S$ be a Polish space equipped with its Borel $\sigma$-algebra $\mathcal{B}(S)$, and let $\{P_{n}\}_{n\in\mathbb{N}}\subset\mathcal{P}(S)$ be a sequence of probability measures. We say that $P_{n}$ converges weakly to $P\in\mathcal{P}(S)$ if and only if for each bounded and continuous function $f:S\to\mathbb{R}$, as $n\to\infty$,

\int_{S}f(x)dP_{n}(x)\to\int_{S}f(x)dP(x).

Definition 3.

Let $\left(X,\mathcal{F}\right)$ and $\left(Y,\mathcal{G}\right)$ be two measurable spaces, let $f:X\to Y$ be a measurable function, and let $\left(X,\mathcal{F},\mu\right)$ be a (positive) measure space. The pushforward of $\mu$ by $f$ is the measure $f_{\#}\mu$ such that for any $B\in\mathcal{G}$,

f_{\#}\mu(B)=\mu\left(f^{-1}(B)\right).

Definition 4.

A differentiable function $F:\mathbb{R}^{d}\to\mathbb{R}$ is called $L$ -smooth if for any $x,y\in\mathbb{R}^{d}$ ,

\lvert F(x)-F(y)-F^{\prime}(y)(x-y)\rvert\leq\frac{L}{2}\left\|y-x\right\|_{2}^{2}.

Definition 5.

A stochastic process $\{X_{t}\}_{t\in[0,T]}$ is called a local martingale if there exists a nondecreasing sequence of stopping times $\{T_{n}\}_{n\in\mathbb{N}}$ such that $T_{n}\to T$ and $\{X_{t\wedge T_{n}}\}_{t\in[0,T]}$ is a true martingale.

Next we define some notations and stochastic processes that will be used in the following proofs.

Recall that process (6) is a backward SDE with the score terms replaced by their estimates. For each $i=1,2,\ldots,k$, we let $\bar{X}_{i}$ denote the theoretical backward process with exact score terms:

d\bar{X}_{i}(t)=\left(a\bar{X}_{i}(t)+\sigma^{2}\nabla\log p^{i}_{T-t}\left(% \bar{X}_{i}(t)\right)\right)dt+\sigma dW_{i}(t),\bar{X}_{i}(0)\sim p^{i}_{T}.

(14)

The corresponding forward process is denoted as $X_{i}$ :

dX_{i}(t)=-aX_{i}(t)dt+\sigma dW(t),X_{i}(0)\sim p_{i}\sim\mu_{i}.

(15)

We denote the marginal density of $X_{i}(t)$ as $p^{i}_{t}$ ; when $t=0$ , we use the notation $p_{i}\sim\mu_{i}$ . Process (8) is a time-discretized SDE to be implemented in practice. It can be viewed as an approximation of the theoretical barycenter process (denoted as $\tilde{Y}$ ) of the backward SDEs of the form (14):

d\tilde{Y}(t)=\left(a\tilde{Y}(t)+\sigma^{2}\sum_{i=1}^{k}\lambda_{i}\nabla% \log p^{i}_{T-t}\left(\tilde{Y}(t)\right)\right)dt+\sigma dW(t),\tilde{Y}(0)% \sim\gamma^{d}_{T},

(16)

where $\gamma^{d}_{T}$ is the distribution level KL barycenter at time $T$ with respect to the reference measures $\{p^{1}_{T},\ldots,p^{k}_{T}\}$. When $T$ is large, $\gamma^{d}_{T}$ is approximated by $\pi$ in Equation (8). In theory, there is a corresponding forward process associated with process (16):

dY(t)=-aY(t)dt+\sigma dW(t),Y(0)\sim\tilde{Y}(T).

(17)

For a fixed $\boldsymbol{\lambda}$, we denote by $p_{\boldsymbol{\lambda},t}$ the marginal distribution of process (17) at time $t$; when $t=0$, we omit the time subscript.
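To make the notation concrete, the sketch below works in one dimension and assumes that each auxiliary initial law $\mu_{i}$ is Gaussian, which is an illustrative assumption and not a setting imposed by the paper. In that case the marginals $p^{i}_{t}$ of the forward process (15) are Gaussian with explicit mean and variance, so the exact scores and the weighted score sum appearing in the drift of process (16) can be written in closed form. All function names are hypothetical.

```python
import numpy as np

def ou_marginal(m0, v0, a, sigma, t):
    """Mean and variance at time t of dX = -a X dt + sigma dW, X(0) ~ N(m0, v0)."""
    decay = np.exp(-a * t)
    mean = m0 * decay
    var = v0 * decay**2 + sigma**2 * (1.0 - decay**2) / (2.0 * a)
    return mean, var

def gaussian_score(x, mean, var):
    """Score (gradient of the log-density) of N(mean, var) at x."""
    return -(x - mean) / var

def fused_score(x, t, lambdas, means0, vars0, a, sigma):
    """Weighted sum of the auxiliary scores at forward time t."""
    total = 0.0
    for lam, m0, v0 in zip(lambdas, means0, vars0):
        mean, var = ou_marginal(m0, v0, a, sigma, t)
        total += lam * gaussian_score(x, mean, var)
    return total

# Two hypothetical Gaussian auxiliaries fused with weights (0.3, 0.7)
score_value = fused_score(0.0, 0.5, [0.3, 0.7], [-1.0, 2.0], [1.0, 0.5],
                          a=1.0, sigma=np.sqrt(2.0))
```

With $a=1$ and $\sigma=\sqrt{2}$ the forward marginals approach the standard Gaussian as $t$ grows, which is the usual reason for taking $\pi$ to be Gaussian.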

C.1.2 Basic algorithms

In this section, we recall the Frank-Wolfe method [9], which is used to solve an optimization problem with $L$ -smooth convex function $f:\mathcal{X}\to\mathbb{R}$ on a compact domain $\mathcal{X}$ :

\min_{x\in\mathcal{X}}f(x)

(18)

Algorithm 2 (vanilla) Frank-Wolfe with function-agnostic step size rule [9]

1: Input: Start atom $x_{0}\in\mathcal{X}$, objective function $f$, smoothness $L$

2: Output: Iterates $x_{1},\ldots,x_{\tau}\in\mathcal{X}$

3: for $\tau=0\text{ to }\ldots$ do

4: $v_{\tau}\leftarrow\arg\min_{v\in\mathcal{X}}\langle\nabla f(x_{\tau}),v\rangle$

5: $\gamma_{\tau}\leftarrow\begin{cases}1&\text{if }\tau=1\\ \frac{2}{\tau+3}&\text{if }\tau>1\end{cases}$

6: $x_{\tau+1}\leftarrow x_{\tau}+\gamma_{\tau}(v_{\tau}-x_{\tau})$

7: end for

To measure the error of the algorithm, for each $\tau\geq 1$ we define the primal gap

h_{\tau}=h(x_{\tau})=f(x_{\tau})-f(x^{*}),

where $x^{*}$ is the minimizer of problem (18).
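As an illustration of Algorithm 2 on the domain used later in this appendix, the sketch below runs Frank-Wolfe over the probability simplex, where the linear minimization step reduces to picking the vertex with the smallest gradient coordinate. The toy quadratic objective and the function name `frank_wolfe_simplex` are assumptions made for this example. The step size takes a full step on the first iteration and $2/(\tau+3)$ afterwards, which is one reading of how the rule in Algorithm 2 is indexed.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, num_iters):
    """Vanilla Frank-Wolfe over the probability simplex."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for tau in range(num_iters):
        g = grad(x)
        # Linear minimization over the simplex is attained at a vertex.
        v = np.zeros_like(x)
        v[np.argmin(g)] = 1.0
        # Assumed indexing: full step first, then 2 / (tau + 3).
        gamma = 1.0 if tau == 0 else 2.0 / (tau + 3)
        x = x + gamma * (v - x)
        iterates.append(x.copy())
    return iterates

# Toy 1-smooth objective f(x) = 0.5 * ||x - target||^2 with target in the simplex
target = np.array([0.2, 0.5, 0.3])

def f(x):
    return 0.5 * np.sum((x - target) ** 2)

def grad(x):
    return x - target

num_iters = 2000
iterates = frank_wolfe_simplex(grad, np.array([1.0, 0.0, 0.0]), num_iters)
final_gap = f(iterates[-1])          # primal gap, since f(x*) = 0 here
```

Every iterate is a convex combination of simplex vertices, so no projection step is needed; this is the main practical appeal of the method on the simplex.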

C.1.3 Basic lemmas

In this subsection, we first list some basic lemmas (Lemmas 2 to 5) that serve as essential tools in our proofs. All proofs can be found in [12].

Lemma 2.

Suppose that Assumptions 1 and 2 hold. For each $i=1,2,\ldots,k$, let $Z_{i}(t)$ denote the forward auxiliary process (15). Then for all $t\geq 0$,

\mathbb{E}\left[\left\|Z_{i}(t)\right\|_{2}^{2}\right]\leq d\vee M\text{ and }% \mathbb{E}\left[\left\|\nabla\log p^{i}_{t}\left(Z_{i}(t)\right)\right\|_{2}^{% 2}\right]\leq Ld.

Lemma 3.

Suppose that Assumption 1 holds. For each $i=1,2,\ldots,k$ , let $Z_{i}(t)$ denote the forward auxiliary process (15). For $0\leq s<t$ , let $\delta=t-s$ . If $\delta\leq 1$ , then

\mathbb{E}\left[\left\|Z_{i}(t)-Z_{i}(s)\right\|_{2}^{2}\right]\lesssim\delta^% {2}M+\delta d.

Lemma 4.

Consider a sequence of functions $f_{n}:[0,T]\to\mathbb{R}^{d}$ and a function $f:[0,T]\to\mathbb{R}^{d}$. Suppose there exists a nondecreasing sequence $\{T_{n}\}_{n\in\mathbb{N}}\subset[0,T]$ such that $T_{n}\to T$ as $n\to\infty$ and $f_{n}(t)=f(t)$ for each $t\leq T_{n}$. Then for each $\epsilon>0$, $f_{n}\to f$ uniformly over $[0,T-\epsilon]$.

Lemma 5.

Let $f:[0,T]\to\mathbb{R}^{d}$ be a continuous function, and for each $\epsilon>0$ define $f_{\epsilon}:[0,T]\to\mathbb{R}^{d}$ by $f_{\epsilon}(t)=f\left(t\wedge(T-\epsilon)\right)$. Then as $\epsilon\to 0$, $f_{\epsilon}\to f$ uniformly over $[0,T]$.

Next, we review and give two results related to the fusion algorithms.

Lemma 6.

For any fixed $\boldsymbol{\lambda}\in\Delta_{k}$ , $\tilde{Y}(T)\sim\mu_{\boldsymbol{\lambda}}$ , the KL barycenter of $\{\mu_{1},\ldots,\mu_{k}\}$ .

Proof.

In this proof, we use the following notation: for $x,y\in\mathbb{R}^{d}$ and $0\leq s\leq t\leq T$, we denote by $p^{i}(x,t|y,s)$ the transition density of the $i$-th auxiliary process from time $s$ to $t$. Similarly, $p^{\boldsymbol{\lambda}}(x,t|y,s)$ denotes the transition density of the barycenter process from time $s$ to $t$.

Let $\boldsymbol{\lambda}$ be fixed, then at each time $t\in[0,T]$ ,

\displaystyle\nabla\log\left(p_{\boldsymbol{\lambda},t}(x)\right)=\nabla\sum_{i=1}^{k}\lambda_{i}\log\left(p_{t}^{i}(x)\right).

Expanding both sides, we get

\nabla\log\left(\int p^{\boldsymbol{\lambda}}(x,t|y,0)p_{\boldsymbol{\lambda}}(y)dy\right)=\nabla\sum_{i=1}^{k}\lambda_{i}\log\left(\int p^{i}(x,t|y,0)p_{i}(y)dy\right).

Note that as $t\to 0$ , $p^{i}(x,t|y,0)\to\delta(x-y)$ and $p^{\boldsymbol{\lambda}}(x,t|y,0)\to\delta(x-y)$ , where the limit is the delta function. Therefore, from the compactness assumption and dominated convergence theorem,

	$\displaystyle\nabla\log p_{\boldsymbol{\lambda}}(x)$	$\displaystyle=\lim_{t\to 0}\nabla\log\left(\int p^{\boldsymbol{\lambda}}(x,t|y,0)p_{\boldsymbol{\lambda}}(y)dy\right)$
		$\displaystyle=\lim_{t\to 0}\nabla\sum_{i=1}^{k}\lambda_{i}\log\left(\int p^{i}(x,t|y,0)p_{i}(y)dy\right)$
		$\displaystyle=\nabla\sum_{i=1}^{k}\lambda_{i}\log p_{i}(x).$

Therefore,

	$\displaystyle\log p_{\boldsymbol{\lambda}}(x)$	$\displaystyle=\sum_{i=1}^{k}\lambda_{i}\log p_{i}(x)+C$
		$\displaystyle=\log\left(\prod_{i=1}^{k}p_{i}(x)^{\lambda_{i}}\right)+C$

for some constant $C$ that does not depend on $x$.

Since $p_{\boldsymbol{\lambda}}(x)$ is a density function, then after normalization

p_{\boldsymbol{\lambda}}(x)=\frac{\prod_{i=1}^{k}p_{i}(x)^{\lambda_{i}}}{\int% \prod_{i=1}^{k}p_{i}(x)^{\lambda_{i}}dx},

which is the solution of the KL barycenter problem with reference measures $p_{1},\ldots,p_{k}$.
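The normalized weighted geometric mean in the formula above can be checked numerically in a simple case. The sketch below assumes two one-dimensional Gaussian auxiliary densities (an illustrative choice, not a setting from the paper) and compares the grid-normalized product $\prod_{i}p_{i}^{\lambda_{i}}$ with the Gaussian whose precision and precision-weighted mean are the $\boldsymbol{\lambda}$-averages of those of the auxiliaries.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

x = np.linspace(-12.0, 12.0, 20001)
dx = x[1] - x[0]
lambdas = np.array([0.3, 0.7])
means = np.array([-1.0, 2.0])         # hypothetical auxiliary means
variances = np.array([1.0, 0.5])      # hypothetical auxiliary variances

# Normalized weighted geometric mean of the auxiliary densities
log_unnormalized = sum(
    lam * np.log(gaussian_pdf(x, m, v))
    for lam, m, v in zip(lambdas, means, variances)
)
p_lambda = np.exp(log_unnormalized)
p_lambda /= p_lambda.sum() * dx       # Riemann-sum normalization

# Closed form for Gaussians: precisions and precision-weighted means average
precision = np.sum(lambdas / variances)
bary_mean = np.sum(lambdas * means / variances) / precision
closed_form = gaussian_pdf(x, bary_mean, 1.0 / precision)
max_error = np.max(np.abs(p_lambda - closed_form))
```

The barycenter is again Gaussian here, and it concentrates closer to the auxiliary with the larger weight and the smaller variance.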

Next we give the proof of Proposition 1.

Proof.

Recall that the objective function for $\boldsymbol{\lambda}$ is

F(\boldsymbol{\lambda})=\mathbb{E}_{\nu}\left[\log\nu(X)-\sum_{i=1}^{k}\lambda% _{i}\log p_{i}(X)\right]+\log\left(\int\prod_{i=1}^{k}p_{i}(y)^{\lambda_{i}}dy% \right).

(19)

We note that the first term is linear in $\boldsymbol{\lambda}$, so to show convexity, it is enough to show that the second term is convex in $\boldsymbol{\lambda}$. Let $h_{i}(x)=\log\left(p_{i}(x)\right)$ for each $i=1,2,\ldots,k$ and let $X$ be uniformly distributed on $\mathbb{K}$. Then

	$\displaystyle\log\left(\int\prod_{i=1}^{k}p_{i}(y)^{\lambda_{i}}dy\right)$	$\displaystyle=\log\left(\int_{\mathbb{K}}\prod_{i=1}^{k}p_{i}(y)^{\lambda_{i}}dy\right)$
		$\displaystyle=\log\left(\frac{1}{|\mathbb{K}|}\int_{\mathbb{K}}\exp\left(\sum_{i=1}^{k}h_{i}(y)\lambda_{i}\right)dy\right)+\log\left(|\mathbb{K}|\right)$
		$\displaystyle=\log\left(\mathbb{E}\left[\exp\left(\boldsymbol{\lambda}^{T}Z\right)\right]\right)+\log\left(|\mathbb{K}|\right),$

where $Z=\left(h_{1}(X),\ldots,h_{k}(X)\right)$ and $|\mathbb{K}|$ is the Lebesgue measure of $\mathbb{K}$. Since the logarithm of a moment generating function is convex, the second term in Equation (19) is convex in $\boldsymbol{\lambda}$.

Remark 2.

In theory, the first order condition of the convex optimization problem (9) is

	$\displaystyle\frac{\partial F}{\partial\lambda_{i}}(\boldsymbol{\lambda})$	$\displaystyle=-\int\nu(x)h_{i}(x)dx+\frac{\partial}{\partial\lambda_{i}}\log% \left(\int\prod_{l=1}^{k}p_{l}(y)^{\lambda_{l}}dy\right)$
		$\displaystyle=-\mathbb{E}_{\nu}\left[h_{i}(X)\right]+\frac{\int\prod_{l=1}^{k}% p_{l}(y)^{\lambda_{l}}\log p_{i}(y)dy}{\int\prod_{l=1}^{k}p_{l}(y)^{\lambda_{l% }}dy}$
		$\displaystyle=-\mathbb{E}_{\nu}\left[h_{i}(X)\right]+\frac{\int\exp\left(\sum_% {l=1}^{k}\lambda_{l}h_{l}(y)\right)h_{i}(y)dy}{\int\exp\left(\sum_{l=1}^{k}% \lambda_{l}h_{l}(y)\right)dy}.$

In practice, each $h_{i}$ is replaced by the logarithm of the estimated auxiliary density, and the second term is computed independently of the target data from $\nu$. However, this is extremely hard to implement, since numerical integration of the second term may incur a large error that is hard to control.
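The gradient formula in Remark 2 reads $-\mathbb{E}_{\nu}[h_{i}(X)]+\mathbb{E}_{p_{\boldsymbol{\lambda}}}[h_{i}(Y)]$, where $p_{\boldsymbol{\lambda}}$ is the barycenter density. The sketch below checks this against central finite differences on a one-dimensional grid, where the integrals are cheap. The Gaussian auxiliaries, the target, and the interval standing in for the compact set are all assumptions chosen for illustration; in high dimension the same integrals are exactly what makes this route impractical.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 4001)      # grid on a compact set (assumed interval)
dx = x[1] - x[0]

def log_gaussian(x, mean, var):
    return -0.5 * (x - mean) ** 2 / var - 0.5 * np.log(2.0 * np.pi * var)

# h_i = log p_i for two hypothetical auxiliaries, stacked as rows
h = np.stack([log_gaussian(x, -1.0, 1.0), log_gaussian(x, 1.5, 0.8)])
nu = np.exp(log_gaussian(x, 0.5, 0.9))
nu /= nu.sum() * dx                   # target density, renormalized on the grid

def objective(lam):
    """Grid version of F(lambda) in Equation (19)."""
    log_partition = np.log(np.sum(np.exp(lam @ h)) * dx)
    return np.sum(nu * (np.log(nu) - lam @ h)) * dx + log_partition

def gradient(lam):
    """-E_nu[h_i] + E_{p_lambda}[h_i], computed on the grid."""
    weights = np.exp(lam @ h)
    p_lam = weights / (weights.sum() * dx)
    return -(h @ nu) * dx + (h @ p_lam) * dx

lam = np.array([0.4, 0.6])
eps = 1e-6
fd = np.array([
    (objective(lam + eps * e) - objective(lam - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
```

The same grid objective also satisfies midpoint convexity between any two weight vectors, in line with Proposition 1.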

C.2 Proof of Theorem 3

Before the proof of the sample complexity of the whole algorithm, we first prove a lemma about the auxiliary score estimation errors. The proof is adapted from Chen et al. [12].

Lemma 7.

Suppose that Assumption 2 holds, $\boldsymbol{\lambda}$ is fixed, and the step size $h=T/N$ satisfies $h\lesssim 1/L$, where $L\geq 1$. Let $p_{\boldsymbol{\lambda}}$ and $\hat{p}_{\boldsymbol{\lambda}}$ denote the distributions of processes (16) and (8) at time $T$, respectively. Then we have

\text{TV}\left(p_{\boldsymbol{\lambda}},\hat{p}_{\boldsymbol{\lambda}}\right)% \lesssim\exp(-T)\max_{i=1,2,\ldots,k}\sqrt{D_{\text{KL}}\left(p^{i}_{T}% \parallel\pi\right)}+\sigma\sqrt{kT}\left(\epsilon_{\text{score}}+L\sqrt{dh}+% Lh\sqrt{M}\right).

Remark 3.

To interpret the result, suppose $\max_{i=1,2,\ldots,k}\sqrt{D_{\text{KL}}\left(p^{i}_{T}\parallel\pi\right)}\lesssim\text{poly}(d)$ and $M\leq d$. Then, for fixed $\epsilon$, if we choose $T\sim\log\left(\max_{i=1,2,\ldots,k}\sqrt{D_{\text{KL}}\left(p^{i}_{T}\parallel\pi\right)}/\epsilon\right)$ and $h\sim\frac{\epsilon^{2}}{L^{2}\sigma^{2}kd}$, then, hiding logarithmic factors, with $N\sim\frac{L^{2}\sigma^{2}kd}{\epsilon^{2}}$ steps we have $\text{SE}\lesssim\epsilon+\epsilon_{\text{score}}$. In particular, to achieve sampling error $\text{SE}\lesssim\epsilon$, it suffices to have $\epsilon_{\text{score}}\lesssim\epsilon$.
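The scalings in Remark 3 can be turned into a small calculator. The sketch below sets every hidden constant to 1, which is an assumption made purely for illustration, and evaluates the two terms on the right-hand side of Lemma 7 for the resulting $T$ and $h$. The function names and the numerical inputs are hypothetical.

```python
import math

def remark3_schedule(eps, kl_sqrt_max, L, sigma, k, d):
    """T, h, N from Remark 3 with all hidden constants set to 1 (assumption)."""
    T = math.log(kl_sqrt_max / eps)
    h = eps**2 / (L**2 * sigma**2 * k * d)
    N = math.ceil(T / h)
    return T, h, N

def lemma7_terms(T, h, eps_score, kl_sqrt_max, L, sigma, k, d, M):
    """The two terms of the Lemma 7 bound, again with constants set to 1."""
    init_term = math.exp(-T) * kl_sqrt_max
    disc_term = sigma * math.sqrt(k * T) * (
        eps_score + L * math.sqrt(d * h) + L * h * math.sqrt(M)
    )
    return init_term, disc_term

eps = 0.05
T, h, N = remark3_schedule(eps, kl_sqrt_max=10.0, L=2.0, sigma=1.0, k=3, d=16)
init_term, disc_term = lemma7_terms(
    T, h, eps_score=0.0, kl_sqrt_max=10.0, L=2.0, sigma=1.0, k=3, d=16, M=16
)
```

With these choices the initialization term equals $\epsilon$ and the discretization term is of order $\epsilon\sqrt{T}$, which is where the logarithmic factor hidden in the remark comes from.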

Proof.

We denote the laws of processes (16) and (8) by $\alpha$ and $\beta\in\mathcal{P}(C([0,T]:\mathbb{R}^{d}))$, respectively. For simplicity of the proof, we define a fictitious diffusion satisfying the following SDE with $\hat{Y}(0)\sim\gamma^{d}_{T}$:

d\hat{Y}(t)=\left(a\hat{Y}(t)+\sigma^{2}\sum_{i=1}^{k}\lambda_{i}s^{i}_{T-lh,% \theta^{*}}\left(\hat{Y}(lh)\right)\right)dt+\sigma dW_{i}(t),t\in[lh,(l+1)h].

(20)

We introduce this process because, in practice, process (8) is initialized from a Gaussian prior $\pi$ for convenience, rather than from $\gamma^{d}_{T}$. We denote the law of process (20) by $\beta_{T}\in\mathcal{P}(C([0,T]:\mathbb{R}^{d}))$.

We also denote the score estimators of process (17) by $s^{\boldsymbol{\lambda}}_{lh,\theta^{*}}$. As before, we consider a single stochastic process $\{Z(t)\}_{t\in[0,T]}$ in order to use Girsanov’s theorem.

For $t\in[lh,(l+1)h]$ , we have the discretization error $\mathcal{L}$ with

	$\displaystyle\mathcal{L}$	$\displaystyle=\sigma^{2}\mathbb{E}_{\alpha}\left[\left\lVert s^{\boldsymbol{% \lambda}}_{T-lh,\theta^{*}}\left(Z(lh)\right)-\nabla\log p_{\boldsymbol{% \lambda},T-t}\left(Z(t)\right)\right\rVert_{2}^{2}\right]$
		$\displaystyle=\sigma^{2}\mathbb{E}_{\alpha}\left[\left\lVert\sum_{i=1}^{k}% \lambda_{i}\left[s^{i}_{T-lh,\theta^{*}}\left(Z(lh)\right)-\nabla\log p^{i}_{T% -t}\left(Z(t)\right)\right]\right\rVert_{2}^{2}\right]$
		$\displaystyle\lesssim\sigma^{2}\sum_{i=1}^{k}\lambda_{i}^{2}\mathbb{E}_{\alpha% }\left[\left\lVert s^{i}_{T-lh,\theta^{*}}\left(Z(lh)\right)-\nabla\log p^{i}_% {T-t}\left(Z(t)\right)\right\rVert_{2}^{2}\right]$
		$\displaystyle\lesssim\sigma^{2}\sum_{i=1}^{k}\lambda_{i}^{2}\mathbb{E}_{\alpha% }\left[\left\lVert s^{i}_{T-lh,\theta^{*}}\left(Z(lh)\right)-\nabla\log p^{i}_% {T-lh}\left(Z(lh)\right)\right\rVert_{2}^{2}\right]$
		$\displaystyle+\sigma^{2}\sum_{i=1}^{k}\lambda_{i}^{2}\mathbb{E}_{\alpha}\left[% \left\lVert\nabla\log p^{i}_{T-lh}\left(Z(lh)\right)-\nabla\log p^{i}_{T-t}% \left(Z(lh)\right)\right\rVert_{2}^{2}\right]$
		$\displaystyle+\sigma^{2}\sum_{i=1}^{k}\lambda_{i}^{2}\mathbb{E}_{\alpha}\left[% \left\lVert\nabla\log p^{i}_{T-t}\left(Z(lh)\right)-\nabla\log p^{i}_{T-t}% \left(Z(t)\right)\right\rVert_{2}^{2}\right]$
		$\displaystyle\lesssim k\sigma^{2}\left(\epsilon_{\text{score}}^{2}+\mathbb{E}_% {\alpha}\left[\left\lVert\nabla\log\left(\frac{p^{i}_{T-lh}}{p^{i}_{T-t}}% \right)\left(Z(lh)\right)\right\rVert_{2}^{2}\right]+L^{2}\mathbb{E}_{\alpha}% \left[\left\lVert Z(lh)-Z(t)\right\rVert_{2}^{2}\right]\right).$

From Lemma 16 in Chen et al. [12], we have the bound for the second term since $L\geq 1$ ,

	$\displaystyle\mathbb{E}_{\alpha}\left[\left\lVert\nabla\log\left(\frac{p^{i}_{T-lh}}{p^{i}_{T-t}}\right)\left(Z(lh)\right)\right\rVert_{2}^{2}\right]$	$\displaystyle\lesssim L^{2}dh+L^{2}h^{2}\mathbb{E}_{\alpha}\left[\left\lVert Z(lh)\right\rVert_{2}^{2}\right]$
		$\displaystyle+(1+L^{2})h^{2}\mathbb{E}_{\alpha}\left[\left\lVert\nabla\log p^{i}_{T-t}\left(Z(lh)\right)\right\rVert_{2}^{2}\right]$
		$\displaystyle\lesssim L^{2}dh+L^{2}h^{2}\mathbb{E}_{\alpha}\left[\left\lVert Z(lh)\right\rVert_{2}^{2}\right]$
		$\displaystyle+L^{2}h^{2}\mathbb{E}_{\alpha}\left[\left\lVert\nabla\log p^{i}_{T-t}\left(Z(lh)\right)\right\rVert_{2}^{2}\right].$

Moreover, from the $L$-Lipschitz condition,

	$\displaystyle\left\lVert\nabla\log p^{i}_{T-t}\left(Z(lh)\right)\right\rVert_{2}^{2}$	$\displaystyle\lesssim\left\lVert\nabla\log p^{i}_{T-t}\left(Z(t)\right)\right\rVert_{2}^{2}+\left\lVert\nabla\log p^{i}_{T-t}\left(Z(lh)\right)-\nabla\log p^{i}_{T-t}\left(Z(t)\right)\right\rVert_{2}^{2}$
		$\displaystyle\lesssim\left\lVert\nabla\log p^{i}_{T-t}\left(Z(t)\right)\right\rVert_{2}^{2}+L^{2}\left\lVert Z(lh)-Z(t)\right\rVert_{2}^{2}.$

Hence,

	$\displaystyle\mathcal{L}$	$\displaystyle=\sigma^{2}\mathbb{E}_{\alpha}\left[\left\lVert s^{\boldsymbol{% \lambda}}_{T-lh,\theta^{*}}\left(Z(lh)\right)-\nabla\log p_{\boldsymbol{% \lambda},T-t}\left(Z(t)\right)\right\rVert_{2}^{2}\right]$
		$\displaystyle\lesssim k\sigma^{2}\epsilon_{\text{score}}^{2}+k\sigma^{2}L^{2}% dh+k\sigma^{2}L^{2}h^{2}\mathbb{E}_{\alpha}\left[\left\lVert Z(lh)\right\rVert% _{2}^{2}\right]$
		$\displaystyle+k\sigma^{2}L^{2}h^{2}\mathbb{E}_{\alpha}\left[\left\lVert\nabla\log p^{i}_{T-t}\left(Z(t)\right)\right\rVert_{2}^{2}\right]+k\sigma^{2}L^{2}\mathbb{E}_{\alpha}\left[\left\lVert Z(lh)-Z(t)\right\rVert_{2}^{2}\right].$

From Lemma 2 and Lemma 3, we have

	$\displaystyle\mathcal{L}$	$\displaystyle=\sigma^{2}\mathbb{E}_{\alpha}\left[\left\lVert s^{\boldsymbol{% \lambda}}_{T-lh,\theta^{*}}\left(Z(lh)\right)-\nabla\log p_{\boldsymbol{% \lambda},T-t}\left(Z(t)\right)\right\rVert_{2}^{2}\right]$
		$\displaystyle\lesssim k\sigma^{2}\left(\epsilon_{\text{score}}^{2}+L^{2}dh+L^{% 2}h^{2}\left(d+M\right)+L^{3}dh^{2}+L^{2}\left(dh+Mh^{2}\right)\right)$
		$\displaystyle\lesssim k\sigma^{2}\left(\epsilon_{\text{score}}^{2}+L^{2}dh+L^{% 2}h^{2}M\right).$

Therefore,

	$\displaystyle\mathcal{L}$	$\displaystyle=\sigma^{2}\sum_{l=0}^{N-1}\mathbb{E}_{\alpha}\left[\int_{lh}^{(l% +1)h}\left\lVert s^{\boldsymbol{\lambda}}_{T-lh,\theta^{*}}\left(Z(lh)\right)-% \nabla\log p_{\boldsymbol{\lambda},T-t}\left(Z(t)\right)\right\rVert_{2}^{2}dt\right]$
		$\displaystyle\lesssim\sigma^{2}kT\left(\epsilon_{\text{score}}^{2}+L^{2}dh+L^{% 2}h^{2}M\right).$

Next, we claim that

D_{\text{KL}}\left(\alpha\parallel\beta_{T}\right)\lesssim k\sigma^{2}T\left(% \epsilon_{\text{score}}^{2}+L^{2}dh+L^{2}h^{2}M\right).

(21)

Then, from the triangle inequality, Pinsker’s inequality, and the data processing inequality,

	$\displaystyle\text{TV}\left(p_{\boldsymbol{\lambda}},\hat{p}_{\boldsymbol{% \lambda}}\right)$	$\displaystyle\leq\text{TV}\left(\alpha,\beta\right)$
		$\displaystyle\leq\text{TV}\left(\beta,\beta_{T}\right)+\text{TV}\left(\alpha,% \beta_{T}\right)$
		$\displaystyle\leq\text{TV}\left(\pi,\gamma^{d}_{T}\right)+\text{TV}\left(% \alpha,\beta_{T}\right)$
		$\displaystyle\lesssim\exp(-T)\max_{i=1,2,\ldots,k}\sqrt{D_{\text{KL}}\left(p^{% i}_{T}\parallel\pi\right)}+\sigma\sqrt{kT}\left(\epsilon_{\text{score}}+L\sqrt% {dh}+Lh\sqrt{M}\right).$

Hence it suffices to prove Equation (21). We will use a localization argument and apply Girsanov’s theorem. The notation is the same as in Theorem 6.

Let $t\in[0,T]$ , $\mathcal{L}(t)=\int_{0}^{t}b(s)dB(s)$ , where $B$ is an $\alpha$ -Brownian motion and for $t\in[lh,(l+1)h]$ ,

b(t)=\sigma\left(s^{\boldsymbol{\lambda}}_{T-lh,\theta^{*}}\left(Z(lh)\right)-% \nabla\log p_{\boldsymbol{\lambda},T-t}\left(Z(t)\right)\right).

Recall that

\mathbb{E}_{\alpha}\left[\int_{0}^{T}\left\|b(s)\right\|_{2}^{2}ds\right]% \lesssim kT\sigma^{2}\left(\epsilon_{\text{score}}^{2}+L^{2}dh+L^{2}h^{2}M% \right).

Since $\{\mathcal{E}\left(\mathcal{L}\right)(t)\}_{t\in[0,T]}$ is a local martingale, there exists a nondecreasing sequence of stopping times $T_{n}\to T$ such that $\{\mathcal{E}\left(\mathcal{L}\right)(t\wedge T_{n})\}_{t\in[0,T]}$ is a true martingale. Note that $\mathcal{E}\left(\mathcal{L}\right)(t\wedge T_{n})=\mathcal{E}\left(\mathcal{L}^{n}\right)(t)$, where $\mathcal{L}^{n}(t)=\mathcal{L}(t\wedge T_{n})$; therefore

\mathbb{E}_{\alpha}\left[\mathcal{E}\left(\mathcal{L}^{n}\right)(T)\right]=% \mathbb{E}_{\alpha}\left[\mathcal{E}\left(\mathcal{L}^{n}\right)(0)\right]=1.

Applying Theorem 6 to $\mathcal{L}^{n}(t)=\int_{0}^{t}b(s)\mathbf{1}_{[0,T_{n}]}(s)dB(s)$ , we have that under the measure $P^{n}=\mathcal{E}\left(\mathcal{L}^{n}\right)(T)\alpha$ , there exists a Brownian motion $\beta^{n}$ such that for all $t\in[0,T]$ ,

dB(t)=\sigma\left(s^{\boldsymbol{\lambda}}_{T-lh,\theta^{*}}\left(Z(lh)\right)% -\nabla\log p_{\boldsymbol{\lambda},T-t}\left(Z(t)\right)\right)\mathbf{1}_{[0% ,T_{n}]}(t)dt+d\beta^{n}(t).

Under $\alpha$, we have almost surely

dZ(t)=\left(aZ(t)+\sigma^{2}\nabla\log p_{\boldsymbol{\lambda},T-t}\left(Z(t)\right)\right)dt+\sigma dB(t),Z(0)\sim\gamma^{d}_{T},

and this also holds $P^{n}$-almost surely since $P^{n}\ll\alpha$. Therefore, $P^{n}$-almost surely, $Z(0)\sim\gamma^{d}_{T}$ and

	$\displaystyle dZ(t)$	$\displaystyle=\left[aZ(t)+\sigma^{2}s^{\boldsymbol{\lambda}}_{T-lh,\theta^{*}}\left(Z(lh)\right)\right]\mathbf{1}_{[0,T_{n}]}(t)dt$
		$\displaystyle+\left[aZ(t)+\sigma^{2}\nabla\log p_{\boldsymbol{\lambda},T-t}\left(Z(t)\right)\right]\mathbf{1}_{[T_{n},T]}(t)dt+\sigma d\beta^{n}(t).$

In other words, $P^{n}$ is the law of the solution of the above SDE. Plugging in the Radon-Nikodym derivatives, we get

	$\displaystyle D_{\text{KL}}\left(\alpha\parallel P^{n}\right)$	$\displaystyle=\mathbb{E}_{\alpha}\left[\log\left(\frac{d\alpha}{dP^{n}}\right)\right]$
		$\displaystyle=\mathbb{E}_{\alpha}\left[\log\left(\frac{1}{\mathcal{E}\left(\mathcal{L}\right)(T_{n})}\right)\right]$
		$\displaystyle=\mathbb{E}_{\alpha}\left[-\mathcal{L}(T_{n})+\frac{1}{2}\int_{0}^{T_{n}}\left\|b(s)\right\|_{2}^{2}ds\right]$
		$\displaystyle=\mathbb{E}_{\alpha}\left[\frac{1}{2}\int_{0}^{T_{n}}\left\|b(s)\right\|_{2}^{2}ds\right]$
		$\displaystyle\leq\mathbb{E}_{\alpha}\left[\frac{1}{2}\int_{0}^{T}\left\|b(s)\right\|_{2}^{2}ds\right]$
		$\displaystyle\lesssim kT\sigma^{2}\left(\epsilon_{\text{score}}^{2}+L^{2}dh+L^{2}h^{2}M\right),$

since $\mathcal{L}$ is a martingale and $T_{n}$ is a bounded stopping time, so that $\mathbb{E}_{\alpha}\left[\mathcal{L}(T_{n})\right]=0$ by the optional sampling theorem.

Now consider a coupling of $\left(P^{n}\right)_{n\in\mathbb{N}}$ and $\beta_{T}$: a sequence of stochastic processes $\left(Z^{n}\right)_{n\in\mathbb{N}}$, a stochastic process $Z$, and a single Brownian motion $W$, all defined over the same probability space, such that $Z(0)=Z^{n}(0)$ almost surely, $Z(0)\sim\gamma^{d}_{T}$,

	$\displaystyle dZ^{n}(t)$	$\displaystyle=\left[aZ^{n}(t)+\sigma^{2}s^{\boldsymbol{\lambda}}_{T-lh,\theta^{*}}\left(Z^{n}(lh)\right)\right]\mathbf{1}_{[0,T_{n}]}(t)dt$
		$\displaystyle+\left[aZ^{n}(t)+\sigma^{2}\nabla\log p_{\boldsymbol{\lambda},T-t}\left(Z^{n}(t)\right)\right]\mathbf{1}_{[T_{n},T]}(t)dt+\sigma dW(t),$

and

dZ(t)=\left[aZ(t)+\sigma^{2}s^{\boldsymbol{\lambda}}_{T-lh,\theta^{*}}\left(Z(lh)\right)\right]dt+\sigma dW(t).

Hence the law of $Z^{n}$ is $P^{n}$ and the law of $Z$ is $\beta_{T}$. The existence of such a coupling is shown in Chen et al. [12].

Fix $\epsilon>0$ , define the map $\pi_{\epsilon}:C([0,T]:\mathbb{R}^{d})\to C([0,T]:\mathbb{R}^{d})$ such that

\pi_{\epsilon}(\omega)(t)=\omega\left(t\wedge(T-\epsilon)\right).

Since $Z^{n}(t)=Z(t)$ for each $t\in[0,T_{n}]$, Lemma 4 gives $\pi_{\epsilon}\left(Z^{n}\right)\to\pi_{\epsilon}\left(Z\right)$ uniformly over $[0,T]$ almost surely, which implies that $(\pi_{\epsilon})_{\#}P^{n}\to(\pi_{\epsilon})_{\#}\beta_{T}$ weakly.

Since the KL divergence is lower semicontinuous, the data processing inequality gives

	$\displaystyle D_{\text{KL}}\left((\pi_{\epsilon})_{\#}\alpha\parallel(\pi_{\epsilon})_{\#}\beta_{T}\right)$	$\displaystyle\leq\liminf_{n\to\infty}D_{\text{KL}}\left((\pi_{\epsilon})_{\#}\alpha\parallel(\pi_{\epsilon})_{\#}P^{n}\right)$
		$\displaystyle\leq\liminf_{n\to\infty}D_{\text{KL}}\left(\alpha\parallel P^{n}\right)$
		$\displaystyle\lesssim kT\sigma^{2}\left(\epsilon_{\text{score}}^{2}+L^{2}dh+L^{2}h^{2}M\right).$

From Lemma 5, as $\epsilon\to 0$, $\pi_{\epsilon}(\omega)\to\omega$ uniformly over $[0,T]$. Hence, from Corollary 9.4.6 in Ambrosio et al. [2], as $\epsilon\to 0$, $D_{\text{KL}}\left((\pi_{\epsilon})_{\#}\alpha\parallel(\pi_{\epsilon})_{\#}\beta_{T}\right)\to D_{\text{KL}}\left(\alpha\parallel\beta_{T}\right)$. Therefore, passing to the limit in the bound above,

D_{\text{KL}}\left(\alpha\parallel\beta_{T}\right)\lesssim kT\sigma^{2}\left(% \epsilon_{\text{score}}^{2}+L^{2}dh+L^{2}h^{2}M\right).

Before the proof of Theorem 3, we introduce some notation that will only be used in this proof. Recall that the vanilla fusion method requires two layers of approximation before running the Frank-Wolfe method: we use target samples to estimate an expectation, and we also estimate the densities of the auxiliaries. We denote by $\hat{\bar{p}}_{\hat{\boldsymbol{\lambda}}}$ the distribution of the samples generated by vanilla fusion, which is $\hat{\nu}_{D}$ in Section 4. Here $\hat{\boldsymbol{\lambda}}$ is the weight computed with $n$ target samples, $p_{\hat{\boldsymbol{\lambda}}}$ denotes the barycenter of $\{\mu_{1},\ldots,\mu_{k}\}$ with the weight $\hat{\boldsymbol{\lambda}}$, and $\bar{p}_{\hat{\boldsymbol{\lambda}}}$ denotes the barycenter of $\{\bar{p}_{1},\ldots,\bar{p}_{k}\}$ with the weight $\hat{\boldsymbol{\lambda}}$, where $\{\bar{p}_{1},\ldots,\bar{p}_{k}\}$ is the collection of estimates of the auxiliary densities. Note that $\bar{p}_{\hat{\boldsymbol{\lambda}}}\sim\hat{\mu}_{\boldsymbol{\lambda}}$ in Section 4.

Proof.

From triangle inequality, we have

	$\displaystyle\text{TV}\left(\nu,\hat{\bar{p}}_{\hat{\boldsymbol{\lambda}}}\right)$	$\displaystyle\leq\text{TV}\left(\nu,p_{\hat{\boldsymbol{\lambda}}}\right)+% \text{TV}\left(p_{\boldsymbol{\hat{\boldsymbol{\lambda}}}},\bar{p}_{% \boldsymbol{\hat{\boldsymbol{\lambda}}}}\right)+\text{TV}\left(\bar{p}_{% \boldsymbol{\hat{\boldsymbol{\lambda}}}},\hat{\bar{p}}_{\hat{\boldsymbol{% \lambda}}}\right)$
		$\displaystyle:=I_{1}+I_{2}+I_{3},$

where $I_{1}$ represents the error when computing using the Frank-Wolfe method, $I_{2}\leq\epsilon_{2}$ by assumption, and $I_{3}$ is the error from auxiliary score estimations, which is bounded by Lemma 7.

Therefore it only remains to bound $I_{1}$ . From Pinsker’s inequality,

I_{1}=\text{TV}\left(\nu,p_{\hat{\boldsymbol{\lambda}}}\right)\lesssim\sqrt{D_% {\text{KL}}\left(\nu\parallel p_{\hat{\boldsymbol{\lambda}}}\right)},

hence it is enough to bound $D_{\text{KL}}\left(\nu\parallel p_{\hat{\boldsymbol{\lambda}}}\right)$. From the compactness assumption, we note that the objective function $F$ of problem (9) is $\tilde{L}$-smooth for some constant $\tilde{L}$. Since the simplex is convex and compact, we denote the diameter of the constraint set by $D$.

We denote $\hat{\boldsymbol{\lambda}}(\tau)$ as the weight computed after $\tau$ iterations with $n$ target samples, then we claim that for $\tau\geq 1$ , and $\delta>0$ , with probability at least $1-\delta$ ,

	$\displaystyle h_{\tau}$	$\displaystyle=F\left(\hat{\boldsymbol{\lambda}}(\tau)\right)-F(\boldsymbol{\lambda}^{*})=D_{\text{KL}}\left(\nu\parallel p_{\hat{\boldsymbol{\lambda}}(\tau)}\right)-D_{\text{KL}}\left(\nu\parallel p_{\boldsymbol{\lambda}^{*}}\right)$
		$\displaystyle\lesssim\frac{2\tilde{L}D^{2}}{\tau+3}+\mathcal{O}\left(\left(\log\left(\frac{1}{\delta}\right)\right)^{1/2}n^{-1/2}\right).$		(22)

We will use an induction argument to show Equation (22). The main estimate is based on the smoothness of $F$ and the compactness of the constraint set. Let $\delta>0$; then from Hoeffding’s inequality, with probability at least $1-\delta$,

	$\displaystyle F\left(\hat{\boldsymbol{\lambda}}(\tau+1)\right)-F\left(\hat{\boldsymbol{\lambda}}(\tau)\right)$	$\displaystyle\leq\left\langle\nabla F\left(\hat{\boldsymbol{\lambda}}(\tau)\right),\hat{\boldsymbol{\lambda}}(\tau+1)-\hat{\boldsymbol{\lambda}}(\tau)\right\rangle+\frac{\tilde{L}}{2}\left\|\hat{\boldsymbol{\lambda}}(\tau+1)-\hat{\boldsymbol{\lambda}}(\tau)\right\|_{2}^{2}$
		$\displaystyle=\gamma_{\tau}\left\langle\nabla F\left(\hat{\boldsymbol{\lambda}}(\tau)\right),v_{\tau}-\hat{\boldsymbol{\lambda}}(\tau)\right\rangle+\frac{\tilde{L}\gamma_{\tau}^{2}}{2}\left\|v_{\tau}-\hat{\boldsymbol{\lambda}}(\tau)\right\|_{2}^{2}$
		$\displaystyle\lesssim\gamma_{\tau}\left\langle\nabla F\left(\hat{\boldsymbol{\lambda}}(\tau)\right),\boldsymbol{\lambda}^{*}-\hat{\boldsymbol{\lambda}}(\tau)\right\rangle+\frac{\tilde{L}\gamma_{\tau}^{2}D^{2}}{2}$
		$\displaystyle+\mathcal{O}\left(\left(\log\left(\frac{1}{\delta}\right)\right)^{1/2}n^{-1/2}\right)\gamma_{\tau}\left\|\boldsymbol{\lambda}^{*}-\hat{\boldsymbol{\lambda}}(\tau)\right\|_{2}$
		$\displaystyle\leq\gamma_{\tau}\left(F\left(\boldsymbol{\lambda}^{*}\right)-F\left(\hat{\boldsymbol{\lambda}}(\tau)\right)\right)+\frac{\tilde{L}\gamma_{\tau}^{2}D^{2}}{2}$
		$\displaystyle+\mathcal{O}\left(\left(\log\left(\frac{1}{\delta}\right)\right)^{1/2}n^{-1/2}\right).$

By rearranging the terms, we get

h_{\tau+1}\lesssim\left(1-\gamma_{\tau}\right)h_{\tau}+\gamma_{\tau}^{2}\frac{% \tilde{L}D^{2}}{2}+\mathcal{O}\left(\left(\log\left(\frac{1}{\delta}\right)% \right)^{1/2}n^{-1/2}\right).

(23)

Now we begin the induction argument. If $\tau=1$ , then Equation (23) becomes

h_{1}\lesssim\frac{\tilde{L}D^{2}}{2}+\mathcal{O}\left(\left(\log\left(\frac{1% }{\delta}\right)\right)^{1/2}n^{-1/2}\right),

which is Equation (22), hence the base case holds. Now suppose that Equation (22) holds for some $\tau$; then from Equation (23)

	$\displaystyle h_{\tau+1}$	$\displaystyle\lesssim\left(1-\frac{2}{\tau+3}\right)\frac{2\tilde{L}D^{2}}{% \tau+3}+\frac{4}{(\tau+3)^{2}}\frac{\tilde{L}D^{2}}{2}+\mathcal{O}\left(\left(% \log\left(\frac{1}{\delta}\right)\right)^{1/2}n^{-1/2}\right)$
		$\displaystyle=\frac{2\tilde{L}D^{2}(\tau+2)}{(\tau+3)^{2}}+\mathcal{O}\left(% \left(\log\left(\frac{1}{\delta}\right)\right)^{1/2}n^{-1/2}\right)$
		$\displaystyle\leq\frac{2\tilde{L}D^{2}}{\tau+4}+\mathcal{O}\left(\left(\log% \left(\frac{1}{\delta}\right)\right)^{1/2}n^{-1/2}\right)$

since $(\tau+2)(\tau+4)\leq(\tau+3)^{2}$. Hence Equation (22) is proved, and if we let $\tau\to\infty$, then

	$\displaystyle D_{\text{KL}}\left(\nu\parallel p_{\hat{\boldsymbol{\lambda}}(% \tau)}\right)$	$\displaystyle\leq D_{\text{KL}}\left(\nu\parallel p_{\boldsymbol{\lambda}^{*}}% \right)+\frac{2\tilde{L}D^{2}}{\tau+3}+\mathcal{O}\left(\left(\log\left(\frac{% 1}{\delta}\right)\right)^{1/2}n^{-1/2}\right)$
		$\displaystyle\lesssim\epsilon_{0}^{2}+\mathcal{O}\left(\left(\log\left(\frac{1% }{\delta}\right)\right)^{1/2}n^{-1/2}\right).$

Therefore, from Pinsker’s inequality, with probability at least $1-\delta$ ,

	$\displaystyle\text{TV}\left(\nu,\hat{\bar{p}}_{\hat{\boldsymbol{\lambda}}}\right)$	$\displaystyle\lesssim\epsilon_{0}+\epsilon_{2}+\exp(-T)\max_{i=1,2,\ldots,k}\sqrt{D_{\text{KL}}\left(p^{i}_{T}\parallel\pi\right)}+\sigma\sqrt{kT}\left(\epsilon_{\text{score}}+L\sqrt{dh}+Lh\sqrt{M}\right)$
		$\displaystyle+\mathcal{O}\left(\left(\log\left(\frac{1}{\delta}\right)\right)^{1/4}n^{-1/4}\right).$
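
The induction in Section C.2 can be checked numerically. The sketch below iterates the recursion (23) with step size $\gamma_{\tau}=2/(\tau+3)$, taken with equality and with the statistical $\mathcal{O}(\cdot)$ term dropped, and confirms that the iterates stay below $2\tilde{L}D^{2}/(\tau+3)$. The function name `gap_recursion` and the value of `c` (standing in for $\tilde{L}D^{2}$) are hypothetical choices made only for this illustration.

```python
def gap_recursion(c, num_steps):
    """Iterate h_{tau+1} = (1 - g) * h_tau + g**2 * c / 2 with g = 2 / (tau + 3).

    c stands in for L~ * D^2; the statistical error term is omitted (assumption).
    """
    gaps = {1: c / 2.0}  # base case: h_1 = c / 2 = 2c / (1 + 3)
    for tau in range(1, num_steps):
        g = 2.0 / (tau + 3)
        gaps[tau + 1] = (1.0 - g) * gaps[tau] + g ** 2 * c / 2.0
    return gaps


c = 3.7  # hypothetical value of L~ * D^2
gaps = gap_recursion(c, num_steps=1000)

# Collect every tau at which the claimed bound h_tau <= 2c / (tau + 3) fails.
violations = [tau for tau, h in gaps.items() if h > 2.0 * c / (tau + 3) + 1e-12]
print("violations:", violations)
```

An empty list of violations is what the inductive step predicts, since $(\tau+2)(\tau+4)\leq(\tau+3)^{2}$.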

C.3 Proof of Theorem 4

We first introduce the notation used in this proof. $\hat{p}_{\hat{\boldsymbol{\Lambda}}}$ denotes the output distribution of Algorithm 1, which is $\hat{\nu}_{P}$ in Section 4. For a fixed small $\tilde{T}\ll 1$, the training phase of ScoreFusion uses the forward process, for $t\in[0,\tilde{T}]$,

dZ(t)=-aZ(t)dt+\sigma dW(t),\quad Z(0)\sim\nu.

(24)
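
For intuition about process (24): it is an Ornstein–Uhlenbeck process, so its transition law is Gaussian in closed form. The sketch below assumes a scalar state and $a>0$, and uses hypothetical values for $a$, $\sigma$, and the starting point; the helper name `ou_transition` is ours. It shows that both the mean shift and the added variance vanish as the horizon $\tilde{T}$ shrinks, which is why the marginal at a small time stays close to $\nu$.

```python
import math


def ou_transition(z0, t, a, sigma):
    """Mean and variance of Z(t) given Z(0) = z0 for dZ = -a Z dt + sigma dW, a > 0."""
    mean = z0 * math.exp(-a * t)
    var = sigma ** 2 * (1.0 - math.exp(-2.0 * a * t)) / (2.0 * a)
    return mean, var


# Hypothetical parameter values, chosen only for illustration.
a, sigma, z0 = 1.0, 1.0, 2.0
for t_tilde in (0.1, 0.01, 0.001):
    mean, var = ou_transition(z0, t_tilde, a, sigma)
    print(f"T~ = {t_tilde}: mean shift = {abs(mean - z0):.5f}, variance = {var:.5f}")
```

For small horizons the variance behaves like $\sigma^{2}\tilde{T}$ and the mean shift like $a\tilde{T}\lvert z_{0}\rvert$.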

We learn an optimal weight by solving problem (10). We denote the marginal distribution of process (24) at time $t$, for fixed $\boldsymbol{\Lambda}$, by $p^{\nu}_{t}$. Although in practice we do not use the backward process of process (24), the following two backward processes help in the proof of Theorem 4: for $t\in[0,\tilde{T}]$ with $\tilde{Z}(0)\sim\gamma^{d}_{\tilde{T}}\sim\hat{Z}(0)$, and fixed $\boldsymbol{\Lambda}$,

d\tilde{Z}(t)=\left(a\tilde{Z}(t)+\sigma^{2}\nabla\log p^{\nu}_{T-t}\left(\tilde{Z}(t)\right)\right)dt+\sigma dW(t),\quad\tilde{Z}(\tilde{T})\sim\nu,

(25)

and for $l=0,1,\ldots,N_{\tilde{T}}$ ,

d\hat{Z}(t)=\left(a\hat{Z}(t)+\sigma^{2}\sum_{i=1}^{k}\Lambda_{i}s^{i}_{T-lh,\theta^{*}}\left(\hat{Z}(lh)\right)\right)dt+\sigma dW(t),\quad t\in[lh,(l+1)h],

(26)

where $hN_{\tilde{T}}=\tilde{T}$. Process (26) is the time-discretized version of process (25) without the initialization error (since $\tilde{Z}(0)\sim\hat{Z}(0)$). We denote the laws of processes (25) and (26) by $\alpha_{\tilde{T}}$ and $\beta_{\tilde{T}}\in\mathcal{P}(C([0,T]:\mathbb{R}^{d}))$, respectively. For fixed $\boldsymbol{\Lambda}$, we write $\tilde{Z}(\tilde{T})\sim p^{\tilde{T}}_{\boldsymbol{\Lambda}}$ and $\hat{Z}(\tilde{T})\sim\hat{p}^{\tilde{T}}_{\boldsymbol{\Lambda}}$.

Proof.

By the triangle inequality, we have

	$\displaystyle\text{TV}\left(\nu,\hat{p}_{\hat{\boldsymbol{\Lambda}}}\right)$	$\displaystyle\leq\text{TV}\left(\nu,\hat{p}^{\tilde{T}}_{\hat{\boldsymbol{\Lambda}}}\right)+\text{TV}\left(\hat{p}^{\tilde{T}}_{\hat{\boldsymbol{\Lambda}}},p^{\tilde{T}}_{\hat{\boldsymbol{\Lambda}}}\right)+\text{TV}\left(p^{\tilde{T}}_{\hat{\boldsymbol{\Lambda}}},\hat{p}_{\hat{\boldsymbol{\Lambda}}}\right)$
		$\displaystyle=\text{TV}\left(\nu,\hat{p}^{\tilde{T}}_{\hat{\boldsymbol{\Lambda}}}\right)+\text{TV}\left(\hat{p}^{\tilde{T}}_{\hat{\boldsymbol{\Lambda}}},p^{\tilde{T}}_{\hat{\boldsymbol{\Lambda}}}\right)+\text{TV}\left(p_{\hat{\boldsymbol{\Lambda}}},\hat{p}_{\hat{\boldsymbol{\Lambda}}}\right)$
		$\displaystyle\lesssim\text{TV}\left(\nu,\hat{p}^{\tilde{T}}_{\hat{\boldsymbol{\Lambda}}}\right)+\text{TV}\left(p_{\hat{\boldsymbol{\Lambda}}},\hat{p}_{\hat{\boldsymbol{\Lambda}}}\right).$

From Lemma 7, we bound the last term

\text{TV}\left(p_{\hat{\boldsymbol{\Lambda}}},\hat{p}_{\hat{\boldsymbol{\Lambda}}}\right)\lesssim\exp(-T)\max_{i=1,2,\ldots,k}\sqrt{D_{\text{KL}}\left(p^{i}_{T}\parallel\pi\right)}+\sqrt{kT}\sigma\left(\epsilon_{\text{score}}+L\sqrt{dh}+Lh\sqrt{M}\right).

To bound the first term, we use the chain rule of KL divergence, Girsanov’s theorem, and an approximation argument similar to that in Section C.2 to get

	$\displaystyle D_{\text{KL}}\left(\nu\parallel\hat{p}^{\tilde{T}}_{\hat{\boldsymbol{\Lambda}}}\right)$	$\displaystyle\lesssim D_{\text{KL}}\left(\alpha_{\tilde{T}}\parallel\beta_{\tilde{T}}\right)$
		$\displaystyle\lesssim\frac{1}{\tilde{T}}\sum_{l=0}^{N_{\tilde{T}}-1}\mathbb{E}_{\alpha_{\tilde{T}}}\left[\int_{lh}^{(l+1)h}\sigma^{2}\left\lVert s^{\boldsymbol{\Lambda}}_{T-lh,\theta^{*}}\left(Z(lh)\right)-\nabla\log p^{\nu}_{T-t}\left(Z(t)\right)\right\rVert_{2}^{2}dt\right]$
		$\displaystyle\lesssim\frac{1}{\tilde{T}}\int_{0}^{\tilde{T}}\left[\sigma^{2}\mathbb{E}_{Z(t)\sim p^{\nu}_{t}}\left[\left\lVert\sum_{i=1}^{k}\left(\Lambda_{i}s^{i}_{t,\theta^{*}}\left(Z(t)\right)\right)-\nabla\log p^{\nu}_{t}(Z(t))\right\rVert_{2}^{2}\right]\right]dt$
		$\displaystyle\lesssim\tilde{\mathcal{L}}\left(\hat{\boldsymbol{\Lambda}};\theta^{*},\sigma^{2}\right)=\tilde{\mathcal{L}}\left(\boldsymbol{\Lambda}^{*};\theta^{*},\sigma^{2}\right)+\left[\tilde{\mathcal{L}}\left(\hat{\boldsymbol{\Lambda}};\theta^{*},\sigma^{2}\right)-\tilde{\mathcal{L}}\left(\boldsymbol{\Lambda}^{*};\theta^{*},\sigma^{2}\right)\right]$
		$\displaystyle:=I_{1}+I_{2},$

where $I_{1}$ represents the approximation error and $I_{2}$ represents the excess risk. By McDiarmid’s inequality, for $\delta>0$, with probability at least $1-\delta$,

\displaystyle I_{2}\lesssim\mathcal{O}\left(\sigma^{2}\left(\log\left(\frac{1}{\delta}\right)\right)^{-1/2}\left(N_{\tilde{T}}n\right)^{-1/2}\right)\lesssim\mathcal{O}\left(\sigma^{2}\left(\log\left(\frac{1}{\delta}\right)\right)^{-1/2}n^{-1/2}\right)

since $\tilde{T}\lesssim T$ and $N_{\tilde{T}}$ is small.

Finally, we need to bound $I_{1}$. The intuition is that, by continuity of a diffusion process, $p^{\nu}_{\tilde{T}}$ and $p^{\nu}_{0}$ are similar when $\tilde{T}$ is small. Therefore, the approximation error of the linear regression should be small, given Assumption 3.
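
To make the "linear regression" view of $I_{1}$ concrete, here is a minimal one-dimensional sketch at time $t=0$. It assumes, purely for illustration, that the auxiliary and target densities are Gaussians with a common variance and that their scores are available in closed form; this is not the training objective of problem (10), and the names `gaussian_score` and `fit_two_weights` are hypothetical. When the target score lies in the span of the auxiliary scores, the least-squares weights recover it with zero residual, which is the regime where the approximation error vanishes.

```python
def gaussian_score(x, mu, var):
    """Score (derivative of the log-density) of N(mu, var) at x."""
    return -(x - mu) / var


def fit_two_weights(xs, mu1, mu2, mu_target, var):
    """Least-squares weights (w1, w2) so that w1*s1 + w2*s2 matches the target score on xs."""
    s1 = [gaussian_score(x, mu1, var) for x in xs]
    s2 = [gaussian_score(x, mu2, var) for x in xs]
    st = [gaussian_score(x, mu_target, var) for x in xs]
    # Normal equations of the 2x2 least-squares problem, solved by Cramer's rule.
    a11 = sum(u * u for u in s1)
    a12 = sum(u * v for u, v in zip(s1, s2))
    a22 = sum(v * v for v in s2)
    b1 = sum(u * t for u, t in zip(s1, st))
    b2 = sum(v * t for v, t in zip(s2, st))
    det = a11 * a22 - a12 * a12
    w1 = (b1 * a22 - a12 * b2) / det
    w2 = (a11 * b2 - a12 * b1) / det
    residual = sum((w1 * u + w2 * v - t) ** 2 for u, v, t in zip(s1, s2, st))
    return w1, w2, residual


# Hypothetical setup: the target mean is the 0.6 / 0.4 mixture of the auxiliary means.
xs = [i / 10.0 for i in range(-50, 51)]
w1, w2, residual = fit_two_weights(xs, mu1=-1.0, mu2=2.0, mu_target=0.2, var=1.0)
print(w1, w2, residual)
```

In this toy setting the fitted weights sum to one even though no simplex constraint is imposed, because the target score is an exact convex combination of the auxiliary scores.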

Fix $t\in[0,\tilde{T}]$. Since $\nu$ has a Lipschitz density, and by the compactness assumption, the loss $\mathcal{L}$ satisfies

	$\displaystyle\mathcal{L}$	$\displaystyle=\tilde{\mathcal{L}}\left(\boldsymbol{\Lambda}^{*};\theta^{*},\sigma^{2}\right)$
		$\displaystyle=\frac{\sigma^{2}}{\tilde{T}}\int_{0}^{\tilde{T}}\mathbb{E}_{Z(t)\sim p^{\nu}_{t}}\left[\left\lVert\sum_{i=1}^{k}\Lambda^{*}_{i}s^{i}_{t,\theta^{*}}\left(Z(t)\right)-\nabla\log p^{\nu}_{t}(Z(t))\right\rVert_{2}^{2}\right]dt$
		$\displaystyle\lesssim\frac{\sigma^{2}}{\tilde{T}}\int_{0}^{\tilde{T}}\mathbb{E}_{Z(t)\sim p^{\nu}_{t}}\left[\left\lVert\sum_{i=1}^{k}\Lambda^{*}_{i}s^{i}_{t,\theta^{*}}\left(Z(t)\right)-\sum_{i=1}^{k}\Lambda^{*}_{i}\nabla\log p^{i}_{t}(Z(t))\right\rVert_{2}^{2}\right]dt$
		$\displaystyle+\frac{\sigma^{2}}{\tilde{T}}\int_{0}^{\tilde{T}}\mathbb{E}_{Z(t)\sim p^{\nu}_{t}}\left[\left\lVert\sum_{i=1}^{k}\Lambda^{*}_{i}\nabla\log p^{i}_{t}(Z(t))-\nabla\log p^{\nu}_{t}(Z(t))\right\rVert_{2}^{2}\right]dt$
		$\displaystyle\lesssim\sigma^{2}k\epsilon_{\text{score}}^{2}+\sigma^{2}\mathbb{E}_{Z(0)\sim\nu}\left[\left\lVert\sum_{i=1}^{k}\Lambda^{*}_{i}\nabla\log p^{i}_{0}(Z(0))-\nabla\log p^{\nu}_{0}(Z(0))\right\rVert_{2}^{2}\right]$
		$\displaystyle+\frac{\sigma^{2}}{\tilde{T}}\int_{0}^{\tilde{T}}\mathbb{E}_{Z(t)\sim p^{\nu}_{t}}\left[\left\lVert\sum_{i=1}^{k}\Lambda^{*}_{i}\nabla\log p^{i}_{t}(Z(t))-\sum_{i=1}^{k}\Lambda^{*}_{i}\nabla\log p^{i}_{0}(Z(t))\right\rVert_{2}^{2}\right]dt$
		$\displaystyle+\frac{\sigma^{2}}{\tilde{T}}\int_{0}^{\tilde{T}}\mathbb{E}_{Z(t)\sim p^{\nu}_{t}}\left[\left\lVert\nabla\log p^{\nu}_{t}(Z(t))-\nabla\log p^{\nu}_{0}(Z(t))\right\rVert_{2}^{2}\right]dt$
		$\displaystyle\lesssim\sigma^{2}k\epsilon_{\text{score}}^{2}+\sigma^{2}\mathbb{E}_{Z(0)\sim\nu}\left[\left\lVert p^{\nu}_{0}(Z(0))-p_{\boldsymbol{\Lambda}^{*}}(Z(0))\right\rVert_{2}^{2}\right]$
		$\displaystyle+\sigma^{2}\mathbb{E}_{Z(\tilde{T})\sim\gamma^{d}_{\tilde{T}}}\left[\left\lVert p^{\nu}_{\tilde{T}}(Z(\tilde{T}))-p^{\nu}_{0}(Z(\tilde{T}))\right\rVert_{2}^{2}\right]+\sigma^{2}k\mathbb{E}_{Z(\tilde{T})\sim\gamma^{d}_{\tilde{T}}}\left[\left\lVert p^{j\in[1,k]}_{\tilde{T}}(Z(\tilde{T}))-p^{j\in[1,k]}_{0}(Z(\tilde{T}))\right\rVert_{2}^{2}\right]$
		$\displaystyle\lesssim\sigma^{2}k\epsilon_{\text{score}}^{2}+\sigma^{2}D_{\text{KL}}\left(\nu\parallel p_{\boldsymbol{\Lambda}^{*}}\right)+\sigma^{2}kD_{\text{KL}}\left(p^{\nu}_{\tilde{T}}\parallel p^{\nu}_{0}\right)$
		$\displaystyle\lesssim\sigma^{2}k\epsilon_{\text{score}}^{2}+\sigma^{2}\epsilon_{1}^{2}+\sigma^{2}k\mathcal{O}\left(\left(\tilde{T}\right)^{1/2}\right).$

Therefore, from Pinsker’s inequality, with probability at least $1-\delta$ ,

	$\displaystyle\text{TV}\left(\nu,\hat{p}_{\hat{\boldsymbol{\Lambda}}}\right)$	$\displaystyle\lesssim\text{TV}\left(\nu,\hat{p}^{\tilde{T}}_{\hat{\boldsymbol{\Lambda}}}\right)+\text{TV}\left(p_{\hat{\boldsymbol{\Lambda}}},\hat{p}_{\hat{\boldsymbol{\Lambda}}}\right)$
		$\displaystyle\lesssim\sqrt{D_{\text{KL}}\left(\nu\parallel\hat{p}^{\tilde{T}}_{\hat{\boldsymbol{\Lambda}}}\right)}+\text{TV}\left(p_{\hat{\boldsymbol{\Lambda}}},\hat{p}_{\hat{\boldsymbol{\Lambda}}}\right)$
		$\displaystyle\lesssim\sigma\epsilon_{1}+\sigma\sqrt{k}\mathcal{O}\left(\tilde{T}^{1/4}\right)+\mathcal{O}\left(\sigma\left(\log\left(\frac{1}{\delta}\right)\right)^{-1/4}n^{-1/4}\right)$
		$\displaystyle+\exp(-T)\max_{i=1,2,\ldots,k}\sqrt{D_{\text{KL}}\left(p^{i}_{T}\parallel\pi\right)}+\sigma\sqrt{kT}\left(\epsilon_{\text{score}}+L\sqrt{dh}+Lh\sqrt{M}\right),$

which finishes the proof.
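
Both proofs close with Pinsker’s inequality, $\text{TV}(p,q)\leq\sqrt{D_{\text{KL}}(p\parallel q)/2}$, which is why a KL error of order $n^{-1/2}$ turns into a TV error of order $n^{-1/4}$. The following sketch checks the inequality on a pair of hypothetical two-point distributions and prints the resulting rate for a few sample sizes; the helper names and all numbers are illustrative assumptions, not quantities from the experiments.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_variation(p, q):
    """TV(p, q) = (1/2) * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


p = [0.6, 0.4]    # hypothetical target distribution
q = [0.55, 0.45]  # hypothetical fitted distribution
kl = kl_divergence(p, q)
tv = total_variation(p, q)
print(f"TV = {tv:.4f}, Pinsker bound sqrt(KL / 2) = {math.sqrt(kl / 2):.4f}")

# A KL error proportional to n^{-1/2} yields a TV bound proportional to n^{-1/4}.
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    print(n, math.sqrt(n ** -0.5 / 2))
```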

Appendix D Experiment details

D.1 Training and architecture details

To standardize the comparison, the baseline and the auxiliary score models are parametrized by the same UNet architecture; the only difference between a baseline and an auxiliary model is the amount of training data each has access to. The Python classes in our supplementary codebase, model_1D.ScoreNet and model_EMNIST.ScoreNet, are both modified from the ScoreNet class given in the GitHub repository of Song et al. [47]. One caveat is that, to accommodate the one-dimensional data in Section 5.1, we changed the stride and kernel size of the convolutional layers in model_1D.ScoreNet to $1$. The one-dimensional UNet has $344k$ trainable parameters; the EMNIST UNet roughly triples the trainable parameter count to $1.1$ million. ScoreFusion models have only $k$ trainable parameters, where $k$ is the number of auxiliary scores.

We follow the standard machine learning convention of splitting each dataset into train, validation, and test sets with stratified sampling to ensure class balance. The ratio of training data to validation data is $4:1$. We use the ground truth digit labels only for data-splitting, hiding them from the model during training. Training runs that took more than an hour were run on two NVIDIA A40 GPUs in a computing cluster, while lightweight tasks were run on Google Colab using an A100 GPU.
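
As a rough sketch of what a stratified $4:1$ train/validation split looks like, the snippet below shuffles the indices of each class separately and sends one fifth of each class to validation. It is not the code from the supplementary codebase: the function name `stratified_split`, the seed, and the toy label list are assumptions made for illustration, and the held-out test set is omitted.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) so that every class is split at the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class only
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Hypothetical labels: 60 sevens and 40 nines.
labels = [7] * 60 + [9] * 40
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

The labels are used only to form the split, in line with the convention above of hiding them from the model during training.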

Model checkpoints corresponding to all our experiments, both for the pre-trained auxiliary score models and the baseline models, can be found in the subdirectory ckpt in the .zip file.

D.2 Section 5.1 supplementary data

Due to space limits, we could not fit all the data columns into Table 1. Table 4 gives the complete 1-Wasserstein distances from each learned distribution to the ground truth distribution as the training size varies. The standard error is calculated from the 1-Wasserstein distances of $10$ batch-pairs of $8096$ random samples drawn independently from the ground truth and a trained generative model. We note that fitting $\boldsymbol{\lambda}^{*}$ involves randomness as a result of stochastic gradient descent.

Table 4: 1-Wasserstein distance from the ground truth Gaussian mixture

Model	$2^{5}$	$2^{6}$	$2^{7}$
Baseline	$106.93\pm 1.43$	$13.46\pm 0.28$	$16.74\pm 0.27$
ScoreFusion	$\mathbf{0.39\pm 0.02}$	$\mathbf{0.51\pm 0.03}$	$\mathbf{0.36\pm 0.02}$
$\boldsymbol{\lambda}^{*}$ of ScoreFusion	$[0.62,0.38]$	$[0.65,0.35]$	$[0.46,0.54]$

Model	$2^{8}$	$2^{9}$	$2^{10}$
Baseline	$2.13\pm 0.12$	$0.55\pm 0.04$	$\mathbf{0.15\pm 0.02}$
ScoreFusion	$\mathbf{0.58\pm 0.03}$	$\mathbf{0.38\pm 0.02}$	$0.30\pm 0.02$
$\boldsymbol{\lambda}^{*}$ of ScoreFusion	$[0.68,0.32]$	$[0.61,0.39]$	$[0.58,0.42]$
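
The entries of Table 4 are a mean and standard error over batch pairs. A minimal sketch of such a computation for one-dimensional samples is given below. It assumes that the standard error is the sample standard deviation divided by the square root of the number of batches, and it uses two Gaussians as hypothetical stand-ins for the ground truth and a trained model; the function names are ours and do not come from the supplementary codebase.

```python
import random
import statistics


def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size one-dimensional samples."""
    assert len(xs) == len(ys)
    # For equal-size samples, the optimal coupling matches sorted values.
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def batched_w1(sample_truth, sample_model, num_batches=10, batch_size=8096):
    """Mean and standard error of W1 over independent batch pairs."""
    dists = [
        wasserstein_1d(sample_truth(batch_size), sample_model(batch_size))
        for _ in range(num_batches)
    ]
    return statistics.mean(dists), statistics.stdev(dists) / num_batches ** 0.5


rng = random.Random(0)


def truth(n):
    """Hypothetical stand-in for the ground truth sampler."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def model(n):
    """Hypothetical stand-in for a trained generative model."""
    return [rng.gauss(0.3, 1.0) for _ in range(n)]


mean_w1, se_w1 = batched_w1(truth, model)
print(f"{mean_w1:.2f} +/- {se_w1:.2f}")
```

Since the two stand-in Gaussians differ only by a shift of $0.3$, their true 1-Wasserstein distance is $0.3$, which gives a sanity check on the estimate.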

Additional histograms of the distributions learned by ScoreFusion versus the baseline are attached.

D.3 Section 5.2 supplementary data

We also provide supplementary data for the experiments on handwritten EMNIST digits. Table 5 gives the empirical distribution of the digits sampled unconditionally from the auxiliary scores.

Table 5: Percentage of each digit among 1024 images sampled from the auxiliary score models without fusion, as classified by SpinalNet.

Auxiliary Score	0	1	2	3	4	5	6	7	8	9
1	0.1%	0.1%	0.6%	0.6%	1.1%	0.3%	0.0%	18.7%	0.2%	78.2%
2	0.1%	0.1%	0.3%	0.8%	1.1%	0.5%	0.0%	41.1%	0.2%	55.8%
3	0.0%	0.2%	0.7%	0.7%	1.2%	0.8%	0.0%	72.1%	0.6%	23.7%
4	0.1%	0.5%	0.7%	0.5%	0.9%	0.4%	0.1%	87.9%	0.3%	8.6%
Target Distribution								60%		40%

Table 6: Full version of Table 3, Part I. Digit distribution estimated by SpinalNet. Bolded columns are the breakdown for ScoreFusion. The “Others” category is the fraction of samples that more closely resemble a digit other than 7 or 9.

Digit	True	$2^{6}$		$2^{7}$		$2^{8}$		$2^{9}$
Digit	True	Baseline	Fusion	Baseline	Fusion	Baseline	Fusion	Baseline	Fusion
7	60%	47.9%	55.6%	57.9%	55.5%	66.8%	57.5%	64.8%	58.2%
9	40%	10.3%	39.4%	12.8%	41.7%	23.8%	38.0%	28.3%	38.7%
Others	0	41.8%	5.0%	29.3%	2.8%	9.4%	4.5%	6.9%	3.1%

Table 7: Full version of Table 3, Part II. Digit distribution estimated by SpinalNet. Bolded columns are the breakdown for ScoreFusion. The “Others” category is the fraction of samples that more closely resemble a digit other than 7 or 9.

Digit	True	$2^{10}$		$2^{12}$		$2^{14}$
Digit	True	Baseline	Fusion	Baseline	Fusion	Baseline	Fusion
7	60%	65.5%	55.6%	66.7%	59.8%	67.4%	59.7%
9	40%	26.7%	39.8%	27.9%	36.7%	29.0%	37.4%
Others	0	7.8%	3.6%	5.4%	3.5%	3.6%	2.9%

Table 8: Optimal weights $\boldsymbol{\lambda}^{*}$ corresponding to the ScoreFusion models whose NLL test losses we reported in Table 2. Each column is a weight vector that parameterizes the ScoreFusion model trained with $2^{j}$ data.

$\boldsymbol{\lambda}_{i}$	$2^{6}$	$2^{7}$	$2^{8}$	$2^{9}$	$2^{10}$	$2^{12}$	$2^{14}$
$i=1$	0.199	0.187	0.182	0.181	0.167	0.183	0.176
$i=2$	0.305	0.326	0.328	0.319	0.345	0.311	0.310
$i=3$	0.279	0.267	0.284	0.285	0.319	0.294	0.295
$i=4$	0.217	0.220	0.206	0.216	0.170	0.213	0.220