Moment Matching Denoising Gibbs Sampling
Abstract
Energy-Based Models (EBMs) offer a versatile framework for modeling complex data distributions. However, training and sampling from EBMs continue to pose significant challenges. The widely-used Denoising Score Matching (DSM) method Vincent (2011) for scalable EBM training suffers from inconsistency issues, causing the energy model to learn a ‘noisy’ data distribution. In this work, we propose an efficient sampling framework, (pseudo)-Gibbs sampling with moment matching, which enables effective sampling from the underlying clean model when given a ‘noisy’ model that has been well-trained via DSM. We explore the benefits of our approach compared to related methods and demonstrate how to scale the method to high-dimensional datasets.
1 Energy-Based Models
Energy-Based Models (EBMs) have attracted considerable attention in the generative modeling literature Ngiam et al. (2011); Xie et al. (2016); Du and Mordatch (2019); Song and Ermon (2019). EBMs are unnormalized probabilistic models: they specify a probability density function only up to an unknown normalizing constant. For continuous data $x \in \mathbb{R}^d$, the density function of an EBM is specified as $p_\theta(x) = \exp(-f_\theta(x))/Z(\theta)$, where $f_\theta(x)$ is a nonlinear function with parameter $\theta$ and $Z(\theta) = \int \exp(-f_\theta(x))\,dx$ is the normalization constant, which is independent of $x$. The energy parameterization allows greater flexibility in model design and can represent a wider range of probability distributions. However, the lack of a known normalizing constant makes training these models challenging. We start with a brief introduction to estimating $\theta$ in EBMs and refer the reader to Song and Kingma (2021) for a detailed overview of training techniques for continuous EBMs.
Likelihood-based training: A classic method to learn $\theta$ is to minimize the KL divergence between the data distribution $p_d(x)$ and the model density $p_\theta(x)$, which is defined as
$$\mathrm{KL}(p_d \,\|\, p_\theta) = \int p_d(x) \log \frac{p_d(x)}{p_\theta(x)}\,dx \doteq \int p_d(x) f_\theta(x)\,dx + \log Z(\theta), \tag{1}$$
where we use $\doteq$ to denote equivalence up to a constant that is independent of $\theta$. The integral over $p_d$ can be approximated by Monte Carlo with the training dataset $\{x_n\}_{n=1}^N \sim p_d(x)$; in this case, minimizing the KL divergence is equivalent to maximum likelihood estimation (MLE) Bishop and Nasrabadi (2006). However, for EBMs, minimizing the KL divergence requires the estimation of $Z(\theta)$, which is intractable for a nonlinear $f_\theta$ defined by a neural network. Various methods have been proposed to alleviate this intractability, using techniques such as Markov chain Monte Carlo (MCMC) Hinton (2002); Nijkamp et al. (2019); Du et al. (2020); Gao et al. (2018) or adversarial training Kim and Bengio (2016); Zhai et al. (2016); Bose et al. (2018).
Score-based training: Alternatively, Hyvärinen (2005) proposes to learn $\theta$ by minimizing the Fisher divergence, which is defined as
$$\mathrm{FD}(p_d \,\|\, p_\theta) = \frac{1}{2} \int p_d(x)\, \big\| s_{p_d}(x) - s_{p_\theta}(x) \big\|_2^2 \,dx, \tag{2}$$
where we use $s_p$ to denote the score function of a distribution $p$: $s_p(x) \equiv \nabla_x \log p(x)$. Under certain regularity conditions, the Fisher divergence is equivalent to the score-matching (SM) objective Hyvärinen (2005),
$$\mathrm{FD}(p_d \,\|\, p_\theta) \doteq \frac{1}{2}\, \mathbb{E}_{p_d(x)} \Big[ \big\| s_{p_\theta}(x) \big\|_2^2 + 2\, \mathrm{Tr}\big( \nabla_x s_{p_\theta}(x) \big) \Big], \tag{3}$$
which does not require estimation of the intractable $Z(\theta)$, since $s_{p_\theta}(x) = -\nabla_x f_\theta(x)$. However, this objective requires the Hessian trace at every gradient step during training, which is computationally expensive and either does not scale to high-dimensional data or requires approximation Song et al. (2020). In this paper, we focus on another training method, denoising score matching Vincent (2011), which overcomes the tractability and scalability issues mentioned above and is introduced in the next section.
1.1 Denoising Score Matching
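For the target data density $p_d(x)$, a noise distribution $p(\tilde{x}|x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)$ is introduced to construct a noised data distribution $\tilde{p}_d(\tilde{x}) = \int p(\tilde{x}|x)\, p_d(x)\,dx$. Denoising score matching (DSM) Vincent (2011) minimizes the Fisher divergence between the noised data distribution $\tilde{p}_d(\tilde{x})$ and an energy-based model $\tilde{p}_\theta(\tilde{x})$, with
$$\mathrm{FD}(\tilde{p}_d \,\|\, \tilde{p}_\theta) \doteq \frac{1}{2} \iint p_d(x)\, p(\tilde{x}|x)\, \big\| s_{\tilde{p}_\theta}(\tilde{x}) - \nabla_{\tilde{x}} \log p(\tilde{x}|x) \big\|_2^2 \,d\tilde{x}\,dx = \frac{1}{2} \iint p_d(x)\, p(\tilde{x}|x)\, \Big\| s_{\tilde{p}_\theta}(\tilde{x}) + \frac{\tilde{x} - x}{\sigma^2} \Big\|_2^2 \,d\tilde{x}\,dx, \tag{4}$$
where the last equality holds because $\nabla_{\tilde{x}} \log p(\tilde{x}|x) = -(\tilde{x} - x)/\sigma^2$ is tractable for the Gaussian distribution $p(\tilde{x}|x)$.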
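To make the DSM objective concrete, the following minimal sketch estimates it by Monte Carlo for Gaussian noise, taking the regression target to be $-(\tilde{x} - x)/\sigma^2$. The helper name `dsm_loss`, the toy data $p_d = \mathcal{N}(0, I)$, and the value $\sigma = 0.5$ are illustrative assumptions, not the paper's setup. The comparison at the end illustrates that the objective favors the score of the noised distribution over the score of the clean one.

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Monte Carlo estimate of the DSM objective under Gaussian noise."""
    x_tilde = x + sigma * rng.standard_normal(x.shape)
    # Regression target: grad_{x_tilde} log N(x_tilde; x, sigma^2 I).
    target = -(x_tilde - x) / sigma**2
    diff = score_fn(x_tilde) - target
    return 0.5 * np.mean(np.sum(diff**2, axis=1))

sigma = 0.5  # hypothetical noise level
# Toy data (assumption): p_d = N(0, I) in two dimensions.
x = np.random.default_rng(0).standard_normal((20000, 2))

def noisy_score(x_tilde):
    # Score of the noised distribution N(0, (1 + sigma^2) I).
    return -x_tilde / (1.0 + sigma**2)

def clean_score(x_tilde):
    # Score of the clean distribution N(0, I).
    return -x_tilde

# Use the same noise draws for both candidates so the comparison is paired.
loss_noisy = dsm_loss(noisy_score, x, sigma, np.random.default_rng(1))
loss_clean = dsm_loss(clean_score, x, sigma, np.random.default_rng(1))
print(loss_noisy, loss_clean)
```

In this toy case the noised score attains the lower loss, which is the inconsistency discussed in the text: for fixed $\sigma$, DSM targets the noised distribution.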
Compared to the KL or SM objectives, the DSM objective is scalable and remains well-defined when the data distribution is singular Zhang et al. (2020) (see footnote 1), and it can alleviate the blindness problem of score matching Song and Ermon (2019); Wenliang and Kanagawa (2020); Zhang et al. (2022b). On the other hand, the DSM objective has a notable disadvantage: for a fixed $\sigma$, it is not a consistent objective for learning the underlying data distribution, since $\mathrm{FD}(\tilde{p}_d \,\|\, \tilde{p}_\theta) = 0$ implies $\tilde{p}_\theta = \tilde{p}_d \neq p_d$. A common solution is to anneal $\sigma \to 0$ during training. However, Equation 4 is not defined when $\sigma = 0$, since the division by $\sigma^2$ in Equation 4 makes the objective unbounded, which again results in an inconsistent objective. Annealing $\sigma$ towards 0 also increases the variance of the training gradients Song and Kingma (2021); Wang et al. (2020), which makes the optimization challenging in practice.
Footnote 1: A singular distribution is not absolutely continuous (a.c.) with respect to the Lebesgue measure and thus does not admit a density function (Tao, 2011, p.172). A typical example is a data distribution supported on a lower-dimensional manifold. In this case, the KL divergence is ill-defined and cannot be used to train the models. When using DSM, the distribution after Gaussian convolution is always a.c., so DSM remains a valid training objective; see Zhang et al. (2020) or Arjovsky et al. (2017) for a detailed introduction.
To overcome these challenges, we propose an alternative data generation scheme: we use DSM with a fixed $\sigma$ to train a ‘noisy’ energy model and then construct a sampler which targets the underlying ‘clean’ model. Specifically, our contributions are summarized as follows:
- We demonstrate that for an EBM that learns a noisy data distribution, there exists a unique underlying clean model which recovers the true data distribution.
- We introduce a pseudo-Gibbs sampling scheme incorporating an analytical moment-matching approximation of the denoising distribution. This allows us to sample from the underlying clean model without requiring additional training.
- We illustrate how to scale our method for high-dimensional data and demonstrate the generation of high-quality images using only a single level of fixed noise. Furthermore, we showcase the application of our proposed method in multi-level noise scenarios, closely resembling a diffusion model.
2 Clean Model Identification
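For a fixed $\sigma$, DSM can only learn a ‘noisy’ data distribution even in the ideal case where the Fisher divergence is exactly minimized, since $\mathrm{FD}(\tilde{p}_d \,\|\, \tilde{p}_\theta) = 0$ implies $\tilde{p}_\theta = \tilde{p}_d$. In this case, the following theorem shows that there exists an implicitly defined ‘clean’ model that learns the true data distribution.
Theorem 2.1 (Existence of the underlying clean model for an optimal noisy model).
When the Fisher divergence goes to 0, $\mathrm{FD}(\tilde{p}_d \,\|\, \tilde{p}_\theta) = 0$, there exists a unique underlying clean model $p_\theta(x)$ such that $\tilde{p}_\theta(\tilde{x}) = \int p(\tilde{x}|x)\, p_\theta(x)\,dx$ and $p_\theta = p_d$.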
See Appendix A.1 for proof. This theorem shows that despite training an EBM on noisy data, there is an implicit model within it that can recover the true data distribution. Therefore, instead of annealing the noise to recover the true data distribution, we will demonstrate how to directly sample from the implicitly-defined clean model given the noisy energy-based model in the next section.
We want to highlight that the ‘perfect fit’ assumption, i.e. achieving $\mathrm{FD}(\tilde{p}_d \,\|\, \tilde{p}_\theta) = 0$, may not hold for a complex data distribution or an underpowered EBM. Therefore, we provide general sufficient conditions for the existence of the clean model for an imperfect EBM in Appendix A.2.
2.1 Gibbs Sampling with Gaussian Moment Matching
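Given a well-trained noisy energy-based model $\tilde{p}_\theta(\tilde{x})$, the clean model has the form
$$p_\theta(x) = \int q(x|\tilde{x})\, \tilde{p}_\theta(\tilde{x})\,d\tilde{x}, \tag{5}$$
where the denoising distribution can be written as $q(x|\tilde{x}) \propto p(\tilde{x}|x)\, p_\theta(x)$. Since the noise distribution $p(\tilde{x}|x)$ is known, a Gibbs sampling scheme can be constructed to sample from the underlying clean model if we know the denoising distribution $q(x|\tilde{x})$, with
$$\tilde{x}_{k} \sim p(\tilde{x} \,|\, x = x_{k-1}), \qquad x_{k} \sim q(x \,|\, \tilde{x} = \tilde{x}_{k}), \tag{6}$$
where the initial sample $x_0$ can be drawn from a standard Gaussian $\mathcal{N}(0, I)$. However, as the denoising distribution $q(x|\tilde{x})$ is usually intractable for a complex $\tilde{p}_\theta$, we propose an analytical Gaussian moment matching approximation of $q(x|\tilde{x})$.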
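Denote the mean and covariance of $q(x|\tilde{x})$ as
$$\mu_q(\tilde{x}) = \mathbb{E}_{q(x|\tilde{x})}[x], \qquad \Sigma_q(\tilde{x}) = \mathbb{E}_{q(x|\tilde{x})}\big[x x^\top\big] - \mu_q(\tilde{x})\, \mu_q(\tilde{x})^\top. \tag{7}$$
The classic Gaussian moment matching method Minka (2013) specifies a Gaussian approximation $q(x|\tilde{x}) \approx \mathcal{N}(\mu_q(\tilde{x}), \Sigma_q(\tilde{x}))$, which matches the first and second moments of $q(x|\tilde{x})$. When $p(\tilde{x}|x) = \mathcal{N}(x, \sigma^2 I)$, the mean of the denoising distribution has a well-known analytical form Song et al. (2021); Bao et al. (2022); Efron (2011); Robbins (1992)
$$\mu_q(\tilde{x}) = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log \tilde{p}_\theta(\tilde{x}). \tag{8}$$
We include the derivation in Appendix A.3. Using this identity, we can rewrite Equation 4 as
$$\mathrm{FD}(\tilde{p}_d \,\|\, \tilde{p}_\theta) \doteq \frac{1}{2\sigma^4} \iint p_d(x)\, p(\tilde{x}|x)\, \big\| x - \mu_q(\tilde{x}; \theta) \big\|_2^2 \,d\tilde{x}\,dx, \tag{9}$$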
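where we can see that the Fisher divergence depends on $\theta$ only through the mean function $\mu_q(\tilde{x}; \theta)$. Since Equation 8 recovers the score of $\tilde{p}_\theta$ from $\mu_q$, and $\tilde{p}_\theta$ determines a unique clean model $p_\theta$ (Theorem 2.1), the function $\mu_q$ fully characterizes the distribution $p_\theta$. Therefore, $\mu_q$ and the noise level $\sigma$ provide sufficient information to determine $q(x|\tilde{x})$. As a consequence, the following theorem shows that the covariance function can also be analytically derived.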
Theorem 2.2 (Analytical Covariance Identity).
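Given a clean model $p_\theta(x)$ such that $\tilde{p}_\theta(\tilde{x}) = \int p(\tilde{x}|x)\, p_\theta(x)\,dx$ with $p(\tilde{x}|x) = \mathcal{N}(x, \sigma^2 I)$, the mean $\mu_q(\tilde{x})$ and covariance $\Sigma_q(\tilde{x})$ of the denoising distribution $q(x|\tilde{x}) \propto p(\tilde{x}|x)\, p_\theta(x)$ satisfy the following relations
$$\Sigma_q(\tilde{x}) = \sigma^4\, \nabla_{\tilde{x}}^2 \log \tilde{p}_\theta(\tilde{x}) + \sigma^2 I = \sigma^2\, \nabla_{\tilde{x}}\, \mu_q(\tilde{x}). \tag{10}$$
See Appendix A.3 for proof. This analytical covariance identity can be seen as a high-dimensional generalization of the second-order Tweedie’s formula Efron (2011); Robbins (1992). Therefore, the analytical full-covariance moment matching approximation can be written as
$$q(x|\tilde{x}) \approx \mathcal{N}\Big(x;\; \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log \tilde{p}_\theta(\tilde{x}),\; \sigma^4\, \nabla_{\tilde{x}}^2 \log \tilde{p}_\theta(\tilde{x}) + \sigma^2 I \Big). \tag{11}$$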
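As a sanity check of the moment-matching identities, the sketch below takes the mean to be $\tilde{x} + \sigma^2 \nabla \log \tilde{p}(\tilde{x})$ and the covariance to be $\sigma^2 I + \sigma^4 \nabla^2 \log \tilde{p}(\tilde{x})$, and compares them against the exact Gaussian posterior in a case where everything is available in closed form. The function name `moment_match` and the toy clean model $\mathcal{N}(m, S)$ are assumptions made for illustration only.

```python
import numpy as np

def moment_match(x_tilde, score, hessian, sigma):
    """Mean and covariance of the Gaussian approximation to q(x | x_tilde)."""
    d = x_tilde.shape[0]
    mean = x_tilde + sigma**2 * score(x_tilde)
    cov = sigma**2 * np.eye(d) + sigma**4 * hessian(x_tilde)
    return mean, cov

# Toy clean model (assumption): p(x) = N(m, S), so the noisy model is
# N(m, S + sigma^2 I) and its score and Hessian are known analytically.
sigma = 0.7
m = np.array([1.0, -2.0])
S = np.array([[2.0, 0.6], [0.6, 1.0]])
noisy_cov_inv = np.linalg.inv(S + sigma**2 * np.eye(2))

def score(x_tilde):
    return -noisy_cov_inv @ (x_tilde - m)

def hessian(x_tilde):
    return -noisy_cov_inv

x_tilde = np.array([0.3, 0.5])
mean, cov = moment_match(x_tilde, score, hessian, sigma)

# Exact Gaussian posterior obtained from the standard conjugate update.
S_inv = np.linalg.inv(S)
post_cov = np.linalg.inv(S_inv + np.eye(2) / sigma**2)
post_mean = post_cov @ (S_inv @ m + x_tilde / sigma**2)
print(np.allclose(mean, post_mean), np.allclose(cov, post_cov))
```

For a Gaussian clean model the approximation is exact and the covariance does not depend on $\tilde{x}$; for non-Gaussian models the Hessian, and hence the covariance, varies with $\tilde{x}$.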
We want to highlight that since the Gaussian moment matching is only an approximation of $q(x|\tilde{x})$, the sampling scheme in Equation 6 is a ‘pseudo’ Gibbs sampler unless the true $q(x|\tilde{x})$ is also a Gaussian distribution (see footnote 2), which is not the case for a general non-Gaussian $p_\theta$. However, since $\mu_q$ and $\sigma$ are already sufficient to specify $q(x|\tilde{x})$, it should be possible to derive expressions for higher-order moments which themselves involve only $\mu_q$ and $\sigma$; we leave this to future work. To our knowledge, the $\tilde{x}$-conditioned full covariance Gaussian moment matching approximation to $q(x|\tilde{x})$ has not been derived previously. In the next section, we briefly discuss the connections between our method and other related approaches.
Footnote 2: For example, when $p_\theta(x) = \mathcal{N}(0, I)$, the true posterior is a Gaussian with mean $\tilde{x}/(1 + \sigma^2)$ and covariance $\frac{\sigma^2}{1 + \sigma^2} I$, and we can verify that Equations 8 and 10 recover exactly these moments.
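The alternating scheme can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: it uses a diagonal covariance, takes the denoising mean as $\tilde{x} + \sigma^2 \nabla \log \tilde{p}(\tilde{x})$ and variance as $\sigma^2 + \sigma^4 \nabla^2 \log \tilde{p}(\tilde{x})$, and runs on a one-dimensional Gaussian toy model (clean mean 2, clean variance 2.25, both hypothetical) where the moment-matched posterior is exact. The function name `pseudo_gibbs` is an illustrative choice.

```python
import numpy as np

def pseudo_gibbs(score, hess_diag, sigma, x0, n_steps, rng):
    """Alternate between adding noise and sampling the Gaussian denoiser."""
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_steps,) + x.shape)
    for k in range(n_steps):
        # Noising step: x_tilde ~ N(x, sigma^2 I).
        x_tilde = x + sigma * rng.standard_normal(x.shape)
        # Denoising step: moment-matched Gaussian with diagonal covariance.
        mean = x_tilde + sigma**2 * score(x_tilde)
        var = sigma**2 + sigma**4 * hess_diag(x_tilde)
        x = mean + np.sqrt(var) * rng.standard_normal(x.shape)
        chain[k] = x
    return chain

# Toy noisy model (assumption): clean N(m, s2), so noisy N(m, s2 + sigma^2).
sigma, m, s2 = 1.0, 2.0, 2.25
noisy_var = s2 + sigma**2

def score(x_tilde):
    return -(x_tilde - m) / noisy_var

def hess_diag(x_tilde):
    return -np.ones_like(x_tilde) / noisy_var

rng = np.random.default_rng(0)
chain = pseudo_gibbs(score, hess_diag, sigma, np.zeros(1), 20000, rng)
samples = chain[200:, 0]  # discard a short burn-in
print(samples.mean(), samples.var())
```

The chain only ever queries the noisy model, yet its samples follow the clean distribution, whose variance is smaller than the noisy variance by $\sigma^2$.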
2.2 Connection to Covariance Learning Approaches
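Bengio et al. (2013) propose to approximate the true posterior $p(x|\tilde{x}) \propto p(\tilde{x}|x)\, p_d(x)$ with a variational distribution $q_\phi(x|\tilde{x})$. The parameter $\phi$ is then learned by minimizing the joint KL divergence
$$\mathrm{KL}\big( p(\tilde{x}|x)\, p_d(x) \,\big\|\, q_\phi(x|\tilde{x})\, \tilde{p}_d(\tilde{x}) \big) \doteq - \iint p_d(x)\, p(\tilde{x}|x) \log q_\phi(x|\tilde{x}) \,d\tilde{x}\,dx, \tag{12}$$
where $\tilde{p}_d(\tilde{x}) = \int p(\tilde{x}|x)\, p_d(x)\,dx$. The joint KL divergence in Equation 12 encourages $q_\phi(x|\tilde{x})$ to match the moments of the true posterior $p(x|\tilde{x})$, and defines an upper bound of the marginal KL Zhang et al. (2019)
$$\mathrm{KL}\big( p_d(x) \,\big\|\, q_\phi(x) \big) \le \mathrm{KL}\big( p(\tilde{x}|x)\, p_d(x) \,\big\|\, q_\phi(x|\tilde{x})\, \tilde{p}_d(\tilde{x}) \big), \tag{13}$$
where the model $q_\phi(x)$ is implicitly defined as the marginal of the joint, $q_\phi(x) = \int q_\phi(x|\tilde{x})\, \tilde{p}_d(\tilde{x})\,d\tilde{x}$. When $q_\phi(x|\tilde{x})$ is a consistent estimator of $p(x|\tilde{x})$, the asymptotic distribution of the Gibbs sampler converges to the true data distribution Bengio et al. (2013). For continuous data, the variational distribution is chosen as a Gaussian distribution $q_\phi(x|\tilde{x}) = \mathcal{N}(\mu_\phi(\tilde{x}), \Sigma_\phi(\tilde{x}))$, where the mean and the covariance are parameterized by neural networks. We note that the only difference between the KL and DSM objectives (Equation 9) is that the KL objective additionally learns the covariance. We next show that the optimal covariance under KL minimization is the proposed analytical covariance.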
Theorem 2.3 (Optimal Gaussian Approximation).
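Let $p(\tilde{x}|x) = \mathcal{N}(x, \sigma^2 I)$ and assume a Gaussian distribution $q(x|\tilde{x}) = \mathcal{N}(\mu_q(\tilde{x}), \Sigma_q(\tilde{x}))$. Then the optimal $q^*$ such that
$$q^* = \arg\min_{q}\; \mathrm{KL}\big( p(\tilde{x}|x)\, p_d(x) \,\big\|\, q(x|\tilde{x})\, \tilde{p}_d(\tilde{x}) \big) \tag{14}$$
has the mean and covariance with the form
$$\mu_q^*(\tilde{x}) = \tilde{x} + \sigma^2\, \nabla_{\tilde{x}} \log \tilde{p}_d(\tilde{x}), \qquad \Sigma_q^*(\tilde{x}) = \sigma^4\, \nabla_{\tilde{x}}^2 \log \tilde{p}_d(\tilde{x}) + \sigma^2 I. \tag{15}$$
See Appendix A.4 for proof. Therefore, when the optimal mean function $\mu_q^*$ is learned, the optimal covariance $\Sigma_q^*(\tilde{x}) = \sigma^2\, \nabla_{\tilde{x}}\, \mu_q^*(\tilde{x})$ can be analytically derived, making the learning of the covariance redundant. In addition to the training inefficiency caused by more parameters, the amortized covariance network may suffer from poor generalization Zhang et al. (2022a). Moreover, the KL objective is also not well-defined for learning data distributions which lie on a low-dimensional manifold, e.g. MNIST; see Section 4 for a detailed discussion. In this case, the learned covariance may be a degenerate matrix, making the Gaussian density function ill-defined Zhang et al. (2020), which impedes the training; see Figure 5 for an example.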
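The following short derivation sketches why a Gaussian that minimizes an expected negative log-likelihood must match the first two moments of the posterior. It is the standard argument for a fixed $\tilde{x}$ and is not a substitute for the proof in Appendix A.4; the shorthand $\mu$, $\Sigma$ for the Gaussian parameters at that $\tilde{x}$ is introduced here for illustration.

```latex
\begin{align*}
\mathcal{L}(\mu, \Sigma)
  &= \mathbb{E}_{p(x|\tilde{x})}\!\left[-\log \mathcal{N}(x; \mu, \Sigma)\right] \\
  &= \tfrac{1}{2}\, \mathbb{E}_{p(x|\tilde{x})}\!\left[(x-\mu)^\top \Sigma^{-1} (x-\mu)\right]
     + \tfrac{1}{2} \log\det \Sigma + \text{const}, \\[4pt]
\nabla_{\mu} \mathcal{L} = 0
  &\;\Longrightarrow\; \mu^* = \mathbb{E}_{p(x|\tilde{x})}[x], \\
\nabla_{\Sigma^{-1}} \mathcal{L} = 0
  &\;\Longrightarrow\; \Sigma^* = \mathbb{E}_{p(x|\tilde{x})}\!\left[(x-\mu^*)(x-\mu^*)^\top\right].
\end{align*}
```

Since the joint KL objective is an average of such terms over $\tilde{x}$, the optimal Gaussian matches the posterior mean and covariance at every $\tilde{x}$, which is where the analytical identities for these moments enter.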
Meng et al. (2021) propose a higher-order score-matching loss to simultaneously learn both the first-order score $\nabla_{\tilde{x}} \log \tilde{p}_d(\tilde{x})$ and the second-order score $\nabla_{\tilde{x}}^2 \log \tilde{p}_d(\tilde{x})$. However, our findings indicate that the mean function (or, equivalently, the first-order score) already contains all the moment information of the underlying true distribution, and the optimal second moment can be derived from the mean function. Therefore, learning the second-order score is redundant and may lead to sub-optimal inference.
2.3 Connection to Analytic DDPM
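Bao et al. (2022) consider a constrained variational family $q(x|\tilde{x}) = \mathcal{N}(\mu_q(\tilde{x}), \sigma_q^2 I)$ in the context of diffusion models and derive the optimal $\sigma_q^2$ as
$$\sigma_q^{*2} = \sigma^2 - \frac{1}{d}\, \mathbb{E}_{\tilde{p}_d(\tilde{x})} \Big[ \big\| \mu_q(\tilde{x}) - \tilde{x} \big\|_2^2 \Big], \tag{16}$$
which can also be rewritten using the score function
$$\sigma_q^{*2} = \sigma^2 - \frac{\sigma^4}{d}\, \mathbb{E}_{\tilde{p}_d(\tilde{x})} \Big[ \big\| \nabla_{\tilde{x}} \log \tilde{p}_d(\tilde{x}) \big\|_2^2 \Big]. \tag{17}$$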
In Appendix B, we provide a detailed derivation showing how this approximation can be linked to our method using the Fisher information identity Fisher (1925). This approximation has two potential limitations: first, compared to full covariance moment matching, the assumed isotropic covariance structure may be insufficiently flexible to capture the true posterior; second, the covariance is independent of $\tilde{x}$.
The second assumption only holds when $\mu_q(\tilde{x})$ is a linear function of $\tilde{x}$ (see footnote 3), e.g. when the underlying clean distribution is Gaussian, and does not hold for non-Gaussian distributions. Therefore, our $\tilde{x}$-dependent full-covariance approximation offers a more versatile approximation family, which ultimately results in a more precise estimation. However, in certain applications such as accelerating the sampling procedure of a diffusion model Bao et al. (2022), it is advantageous to use an $\tilde{x}$-independent isotropic covariance because it is inexpensive to estimate. In contrast, our $\tilde{x}$-dependent covariance requires computing the Hessian for each $\tilde{x}$, making it inefficient for high-dimensional data. In Section 3, we explore approaches to mitigate this limitation.
Footnote 3: By Theorem 2.2, $\Sigma_q(\tilde{x}) = \sigma^2\, \nabla_{\tilde{x}}\, \mu_q(\tilde{x})$, so when $\mu_q$ is a linear function of $\tilde{x}$ the covariance does not depend on $\tilde{x}$; see also the Gaussian example in footnote 2.
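One way to see the link between an $\tilde{x}$-independent isotropic variance and the full covariance is to average the trace of the full covariance over the noisy distribution. The sketch below assumes the covariance identity $\Sigma_q(\tilde{x}) = \sigma^2 I + \sigma^4 \nabla_{\tilde{x}}^2 \log \tilde{p}(\tilde{x})$ and the usual regularity conditions for the Fisher information identity; it is an illustration and may differ in detail from the derivation in Appendix B.

```latex
\begin{align*}
\frac{1}{d}\, \mathbb{E}_{\tilde{p}(\tilde{x})}\!\left[\mathrm{Tr}\, \Sigma_q(\tilde{x})\right]
  &= \sigma^2 + \frac{\sigma^4}{d}\,
     \mathbb{E}_{\tilde{p}(\tilde{x})}\!\left[\mathrm{Tr}\, \nabla_{\tilde{x}}^2 \log \tilde{p}(\tilde{x})\right] \\
  &= \sigma^2 - \frac{\sigma^4}{d}\,
     \mathbb{E}_{\tilde{p}(\tilde{x})}\!\left[\big\| \nabla_{\tilde{x}} \log \tilde{p}(\tilde{x}) \big\|_2^2\right],
\end{align*}
% using the Fisher information identity
% E_p[ \nabla^2 \log p ] = - E_p[ \nabla \log p \, \nabla \log p^\top ].
```

Under these assumptions, the isotropic variance is the dimension-averaged, $\tilde{x}$-averaged version of the full covariance, which makes explicit what information is discarded by the isotropic approximation.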
2.4 Posterior Approximation Comparison
We now consider a toy example to compare the three denoising posterior approximations discussed above. Let $p_d$ be a Mixture of Gaussians (MoG) whose components are 2D Gaussians with fixed means and a shared isotropic covariance. The noise distribution is $p(\tilde{x}|x) = \mathcal{N}(x, \sigma^2 I)$, so $\tilde{p}_d$ is an MoG with the same component means and a diagonal covariance inflated by $\sigma^2$; see Figure 1(a) for a visualization. In this case the true posterior $p(x|\tilde{x})$ does not have a tractable form. Fortunately, given a noisy sample $\tilde{x}$ and an evaluation point $x$, we can evaluate the true density using Bayes' rule: $p(x|\tilde{x}) = p(\tilde{x}|x)\, p_d(x) / \tilde{p}_d(\tilde{x})$. Figure 1 shows the true posteriors given four different $\tilde{x}$, where we use grid data in $x$-space to visualize the density.
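The grid evaluation of the true posterior can be sketched as follows. The component means, component variance, noise level, and grid range used here are hypothetical placeholders (the values used in the experiment are not reproduced), and the helper names `gaussian_pdf` and `posterior_on_grid` are illustrative. The posterior is normalized numerically on the grid instead of dividing by the noisy marginal explicitly.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Isotropic Gaussian density N(x; mean, var * I) over the last axis."""
    d = x.shape[-1]
    sq_dist = np.sum((x - mean) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / var) / (2.0 * np.pi * var) ** (d / 2.0)

def posterior_on_grid(x_tilde, grid, cell_area, means, comp_var, sigma):
    """Evaluate p(x | x_tilde) on a grid via Bayes' rule."""
    # Prior p_d(x): equally weighted mixture of isotropic Gaussians.
    prior = np.mean([gaussian_pdf(grid, mu, comp_var) for mu in means], axis=0)
    # Likelihood p(x_tilde | x); the Gaussian is symmetric in x and x_tilde.
    likelihood = gaussian_pdf(grid, x_tilde, sigma**2)
    unnorm = prior * likelihood
    return unnorm / (unnorm.sum() * cell_area)

# Hypothetical toy configuration (not the paper's values).
means = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
comp_var, sigma = 0.05, 0.5
lin = np.linspace(-3.0, 3.0, 241)
xx, yy = np.meshgrid(lin, lin)
grid = np.stack([xx, yy], axis=-1)
cell_area = (lin[1] - lin[0]) ** 2

x_tilde = np.array([0.6, 0.4])
post = posterior_on_grid(x_tilde, grid, cell_area, means, comp_var, sigma)
mode = grid[np.unravel_index(post.argmax(), post.shape)]
print(post.sum() * cell_area, mode)
```

The resulting array can be passed directly to a contour plot; the posterior concentrates near the mixture component closest to the noisy sample, pulled slightly towards it.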
To train the model, we sample 10,000 data points from as our training data. For the KL-trained Gibbs sampler described in Section 2.2, we use a network with 3 hidden layers with 400 hidden units, Swish activation Ramachandran et al. (2017) and output size 4 to generate both mean and log standard deviation of the Gaussian approximation. For the moment-matching Gibbs sampler (including both full and isotropic covariance), we use the same network architecture but with output size 1 to get the scalar energy and DSM as the training objective. Both networks are trained with batch size 100 and Adam Kingma and Ba (2014) optimizer with learning rate for 100 epochs. For the -independent isotropic covariance, we use the Monte Carlo approximation to estimate the variance Bao et al. (2022) with 10000 samples from .
[Comparison grid: rows MoGs, Rings, Roll; columns Learn diag., Analytic iso., Analytic full. Cell contents not recoverable.]
Figure 1 visualizes the approximations to the denoising posterior estimated by each of the three methods described in the previous sections. Surprisingly, although the KL objective in Equation 12 encourages $q_\phi(x|\tilde{x})$ to match the moments of $p(x|\tilde{x})$, the learned covariance in Figure 1(e) still underestimates the variance of the posterior. This suggests that redundant covariance learning can degrade the quality of the variational approximation. Additionally, the $\tilde{x}$-independent covariance cannot adapt to the location of $\tilde{x}$ and, being isotropic, cannot capture the posterior’s elliptical shape. In contrast, our $\tilde{x}$-dependent full covariance approximation overcomes these limitations, enabling more accurate predictions that capture the geometry of the posterior distribution.
We then use the estimated posterior to conduct (pseudo) Gibbs sampling to generate samples (see footnote 4). Specifically, we initialize the first sample and run one Markov chain for 10,000 time steps to generate 10,000 samples. In addition to the mixture of Gaussians dataset, we also train on and generate samples from the 2D Swiss roll and two-ring datasets. For numerical evaluation, we calculate the Maximum Mean Discrepancy (MMD) Gretton et al. (2012) between 10k samples generated by a single-chain Gibbs sampler and 10k samples from the training dataset. The kernel inside the MMD is a sum of 5 Gaussian kernels with different bandwidths. The MMD results (including both mean and std) are calculated using 5 random seeds. We find that Gibbs sampling with the proposed analytical full covariance achieves the best results; numerical results are in Table 1, with a visual comparison in Figure 2.
Footnote 4: The code for the experiments can be found at https://github.com/zmtomorrow/MMDGS_NeurIPS.
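For readers who want to reproduce this kind of evaluation, a minimal sketch of a squared-MMD estimate with a sum of Gaussian kernels is given below. It uses the simple biased (V-statistic) estimator, and the five bandwidths are hypothetical placeholders, not the values used in the experiments; the function name `mmd2` is likewise illustrative.

```python
import numpy as np

def mmd2(x, y, bandwidths):
    """Biased estimate of squared MMD with a sum of Gaussian kernels."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances, clipped for numerical safety.
        sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
        sq = np.maximum(sq - 2.0 * a @ b.T, 0.0)
        return sum(np.exp(-sq / (2.0 * h**2)) for h in bandwidths)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

bandwidths = [0.25, 0.5, 1.0, 2.0, 4.0]  # hypothetical values
rng = np.random.default_rng(0)
x = rng.standard_normal((500, 2))
y_same = rng.standard_normal((500, 2))          # same distribution as x
y_shift = rng.standard_normal((500, 2)) + 2.0   # shifted distribution

print(mmd2(x, y_same, bandwidths), mmd2(x, y_shift, bandwidths))
```

Summing kernels over several bandwidths makes the statistic sensitive to discrepancies at multiple length scales. For 10k-by-10k comparisons the kernel matrices should be computed in blocks to limit memory use.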
3 Scalable Implementations for Image Data
Scalable Diagonal Hessian Approximation: As discussed in Section 2.3, the proposed full covariance Gaussian approximation in Equation 10 requires calculating a Hessian of size $d \times d$ for each $\tilde{x}$, which is costly in both memory and computation for high-dimensional data. A naive diagonal Hessian method (using only the diagonal entries of the Hessian) addresses the memory bottleneck but still needs $d$ backward passes to compute the diagonal exactly Martens et al. (2012). In this paper, we use the following diagonal Hessian approximation Bekas et al. (2007),
$$\mathrm{diag}\big( H(\tilde{x}) \big) \approx \frac{1}{S} \sum_{s=1}^{S} v_s \odot \big( H(\tilde{x})\, v_s \big), \qquad H(\tilde{x}) = \nabla_{\tilde{x}}^2 \log \tilde{p}_\theta(\tilde{x}), \tag{18}$$
where $v_s$ is a Rademacher random vector with i.i.d. entries in $\{-1, +1\}$ and $\odot$ denotes the element-wise product (see footnote 5). This estimator converges to the exact Hessian diagonal as $S \to \infty$ Bekas et al. (2007). The term for each $v_s$ can be computed with two forward-backward passes. It is worth emphasizing that our $\tilde{x}$-dependent diagonal moment matching approach provides a level of flexibility comparable to the variational method proposed in Bengio et al. (2013) while eliminating the need for additional training of the diagonal covariance. Furthermore, our method remains more flexible than the isotropic $\tilde{x}$-independent moment matching method proposed by Bao et al. (2022).
Footnote 5: This estimator should be distinguished from Hutchinson’s trace estimator Hutchinson (1990), $\mathrm{Tr}(H) \approx \frac{1}{S} \sum_{s=1}^{S} v_s^\top H v_s$, where a dot product is used between $v_s$ and $H v_s$.
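The diagonal estimator can be illustrated without any autodiff machinery by supplying the Hessian-vector product as a callable. In the sketch below, `estimate_diag` and the explicit test matrices are illustrative assumptions; in practice `matvec` would be a Hessian-vector product obtained by automatic differentiation, which is what keeps memory linear in the dimension.

```python
import numpy as np

def estimate_diag(matvec, dim, n_samples, rng):
    """Estimate diag(H) as the average of v * (H v) over Rademacher vectors v."""
    est = np.zeros(dim)
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe vector
        est += v * matvec(v)                   # element-wise product, not a dot product
    return est / n_samples

rng = np.random.default_rng(0)

# A diagonal matrix is recovered exactly from a single probe, since v_i^2 = 1.
D = np.diag([3.0, -1.0, 0.5])
diag_exact = estimate_diag(lambda v: D @ v, 3, 1, rng)

# For a general symmetric matrix the off-diagonal terms average out.
A = rng.standard_normal((5, 5))
H = 0.5 * (A + A.T)
diag_est = estimate_diag(lambda v: H @ v, 5, 20000, rng)
print(diag_exact, np.max(np.abs(diag_est - np.diag(H))))
```

The estimation error for each entry scales with the norm of the corresponding off-diagonal row divided by the square root of the number of probes, which is why a modest number of probes can suffice when the Hessian is close to diagonal.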
Energy or Score Parameterization: For the full-covariance moment matching in Equation 10, we require the Hessian $\nabla_{\tilde{x}}^2 \log \tilde{p}_\theta(\tilde{x})$ to be symmetric to obtain a valid Gaussian approximation. However, if we learn the score function directly using a network $s_\theta(\tilde{x})$, its Jacobian $\nabla_{\tilde{x}} s_\theta(\tilde{x})$ is not guaranteed to be symmetric. In this case, we follow Saremi et al. (2018) and directly parameterize the (unnormalized) density through a neural network energy $f_\theta(\tilde{x})$, letting the score function be $s_\theta(\tilde{x}) = -\nabla_{\tilde{x}} f_\theta(\tilde{x})$. This gradient can be obtained with automatic differentiation packages such as PyTorch Paszke et al. (2017), and the parameterization guarantees that the Jacobian of the score is symmetric. We also note that when using the diagonal Hessian approximation (Equation 18), we only need the diagonal entries of the covariance to be positive in order to obtain a valid Gaussian approximation. In this case, the score parameterization remains applicable and offers more efficient training than the energy parameterization. Therefore, the choice between full or diagonal covariance and between energy or score parameterization provides a tradeoff between flexibility and computational cost.
4 Image Generation with a Single Noise Level
We then apply the proposed method to model the grey-scale MNIST LeCun (1998) dataset. We use the standard U-Net architecture Song and Ermon (2019); Ronneberger et al. (2015) with a single fixed noise level ; the effect of varying is explored in Appendix C.1. For the KL training objective, the output channel size is 2 to generate both mean and log-std at the same time. For DSM training, we take the sum of the U-Net output to obtain the scalar energy evaluation which also relates to the product-of-experts model described in Saremi et al. (2018). We train both networks for 300 epochs with learning rate and batch-size 100.
As discussed in Section 1.1, the KL divergence is not well-defined for manifold data distributions. This limitation becomes evident when working with MNIST, where the constant black pixels in the boundary areas cause the variance of $q_\phi(x|\tilde{x})$ to decrease rapidly towards 0 during training. Consequently, the likelihood value tends to infinity, resulting in unstable training. In contrast, the DSM objective is well-defined for manifold data, providing a stable training process even in the presence of such boundary effects. Figure 5 provides a visual comparison of the two training procedures, demonstrating the improved stability of the DSM objective on manifold data distributions.
For the sample generation process, calculating the full-covariance Gaussian posterior becomes challenging. We therefore apply the scalable diagonal Hessian approximation described in Section 3 to approximate the diagonal Gaussian covariance of . We find that the estimated diagonal Hessian occasionally contains small negative values due to approximation error; we, therefore, use the function with to ensure the positivity of the diagonal covariance. The -independent isotropic covariance and the proposed -dependent diagonal covariance share the same mean function.
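A single diagonal denoising step with a positivity safeguard might look like the sketch below. The specific safeguard used here, an element-wise lower clamp `np.maximum(var, eps)` with `eps = 1e-6`, is an assumption made for illustration, as are the function name `diag_denoise_step` and the toy inputs; the mean and variance formulas assume the Tweedie-style identities $\tilde{x} + \sigma^2 s(\tilde{x})$ and $\sigma^2 + \sigma^4\, \mathrm{diag}(H)$.

```python
import numpy as np

def diag_denoise_step(x_tilde, score, hess_diag_est, sigma, rng, eps=1e-6):
    """Sample from a diagonal Gaussian denoiser with a clamped variance."""
    mean = x_tilde + sigma**2 * score
    var = sigma**2 + sigma**4 * hess_diag_est
    # Assumed safeguard: clamp small negative estimates to a positive floor.
    var = np.maximum(var, eps)
    x = mean + np.sqrt(var) * rng.standard_normal(x_tilde.shape)
    return x, var

rng = np.random.default_rng(0)
x_tilde = np.array([0.2, -0.4, 1.0])
score = np.array([-0.1, 0.3, -0.5])        # hypothetical score values
hess_diag_est = np.array([-2.0, -0.5, 0.1])  # noisy estimate; first entry too negative
x, var = diag_denoise_step(x_tilde, score, hess_diag_est, 1.0, rng)
print(x, var)
```

Without the clamp the first coordinate would have a negative variance and the square root would produce NaN; with it, that coordinate is sampled almost deterministically at its mean.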
We first visualize the covariance estimated by the three methods in Figure 4. We use 100 Rademacher samples to estimate the diagonal Hessian (Equation 18) and 50,000 samples to estimate the isotropic variance (Equation 17). We find that both $\tilde{x}$-dependent diagonal covariance approximations capture the posterior structure, whereas the isotropic $\tilde{x}$-independent covariance amounts to plain Gaussian noise, since the variance is shared across digits and pixel locations. In Figure 3, we compare samples from the three methods.
Since the isotropic covariance has the same variance in each dimension, the generated samples in Figure 3(b) contain white noise in the black background, whereas the proposed sampler generates a clean black background in Figure 3(a). On the other hand, the samples generated by the KL-trained Gibbs sampler (Figure 3(c)) have worse sample quality due to the unstable training.
We then apply the same method to model the more complicated CIFAR 10 Krizhevsky et al. (2009) dataset. We use the same U-Net structure as used in Song and Ermon (2020) and directly parameterize the score function rather than the energy function to speed up the training. The noise level is fixed at . We train the model using Adam optimizer with learning rate and batch size 100 for 1000 epochs. We visualize the denoising posterior diagonal covariance in Figure 6 when using different numbers of Rademacher samples (Equation 18). We observe that better covariance estimation can be obtained by increasing the number of samples. To balance efficiency and accuracy, we use a sample number of 10 in the subsequent Gibbs sampling stage. Figure 7 shows three independent Markov chains with the samples plotted every 10 Gibbs steps, which demonstrates that sharp images can be generated with even one fixed level of noise.
Limitation: In the CIFAR experiment, we observe a mode collapse phenomenon when running multiple independent Markov chains for a longer time. This phenomenon is likely due to the small noise level $\sigma$, which prevents the sampler from exploring the full space, as is commonly found with MCMC methods Robert et al. (1999). This effect is shown in Figure 8, where we report the Fréchet Inception Distance (FID) for 50,000 images sampled with varying numbers of Gibbs steps. Notably, the FID increases beyond 40 Gibbs steps, and visual evidence of mode collapse is observed (Figure 12). In the next section, we demonstrate the application of our method to settings with multiple noise levels, an approach that may help mitigate mode collapse.
5 Image Generation Using Multiple Noise Levels
The success of diffusion models and lessons from prior work on score-based generative models point to the importance of using multiple noise levels Ho et al. (2020); Song and Ermon (2019) when modelling data with complex multi-modal distributions. Intuitively, by learning to denoise data at a range of noise levels, a single network can learn both the fine and global structure of the distribution, which in turn allows for more effective sampling algorithms capable of efficiently exploring diverse modes (Song and Ermon, 2019). We therefore propose to adapt the denoising Gibbs sampling procedure to sample from distributions corrupted with multiple noise levels. For this purpose, we use a noise-conditioned score network trained by Song and Ermon (2020), who generated high-quality samples using a procedure inspired by annealed Langevin dynamics Welling and Teh (2011). This procedure involves generating samples from a sequence of distributions $p_{\sigma_1}(x), \dots, p_{\sigma_T}(x)$, corrupted by progressively decreasing levels of Gaussian noise (parameterized via standard deviations $\sigma_t$, with $\sigma_1 > \sigma_2 > \dots > \sigma_T$). At a given step $t$ in the sequence, Langevin dynamics is used to sample from the corresponding noised distribution $p_{\sigma_t}(x)$, using the score network to approximate the gradient of the log-density of the noised distribution. The outputs of this Langevin dynamics run are then used to initialize the same procedure at the next noise level, leading the sampling procedure to converge gradually towards the data distribution as the noise level tends to zero (i.e. $\sigma_T \to 0$). The annealed Langevin dynamics algorithm with multi-level noise used in Song and Ermon (2019, 2020) is summarized in Algorithm 1.
We show that the proposed Gibbs sampling scheme can be directly applied to a pre-trained score-based generative model as a drop-in replacement for Langevin dynamics MCMC in the generation stage. At each noise level, we use samples from the previous noise level to initialise a Gibbs sampling chain targeting the marginal distribution at the current noise level $p_{\sigma_t}(x)$, which now plays the role of the ‘clean’ distribution in Equation 5. Therefore, the noise distribution at step $t$ is a Gaussian $p(\tilde{x}|x) = \mathcal{N}\big(x, (\sigma_{t-1}^2 - \sigma_t^2) I\big)$, with $x \sim p_{\sigma_t}$ and $\tilde{x} \sim p_{\sigma_{t-1}}$. The optimal denoising distribution is thus a function of the level of noise at step $t$ relative to the level of noise at the previous step $t-1$. For generation efficiency, we employ 3 Gibbs steps at each noise level, using 3 Rademacher samples to approximate the diagonal Hessian. The sampling procedure is summarized in Algorithm 2. In Figures 9 and 10, we visualize samples from models trained on CIFAR10 and CelebA respectively. Further experimental details can be found in Appendix C.2.
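The multi-level procedure can be sketched on a one-dimensional Gaussian toy problem where the noise-conditional score and Hessian are available in closed form. This is a minimal illustration under stated assumptions and not a transcription of Algorithm 2: the relative noise variance between consecutive levels is taken to be the difference of squared noise levels, the schedule, chain count, and clean variance are hypothetical, and the function name `multilevel_gibbs` is illustrative.

```python
import numpy as np

def multilevel_gibbs(score, hess_diag, sigmas, n_chains, n_gibbs, rng):
    """Denoise through decreasing noise levels with a few Gibbs steps per level."""
    # Initialize at the largest noise level, where the marginal is close to N(0, sigma^2).
    x_tilde = sigmas[0] * rng.standard_normal(n_chains)
    x = x_tilde
    for prev, cur in zip(sigmas[:-1], sigmas[1:]):
        delta2 = prev**2 - cur**2  # relative noise variance between the two levels
        for step in range(n_gibbs):
            if step > 0:
                # Re-noise the current-level sample back to the previous level.
                x_tilde = x + np.sqrt(delta2) * rng.standard_normal(n_chains)
            # Moment-matched denoiser built from the score at the previous level.
            mean = x_tilde + delta2 * score(x_tilde, prev)
            var = np.maximum(delta2 + delta2**2 * hess_diag(x_tilde, prev), 1e-12)
            x = mean + np.sqrt(var) * rng.standard_normal(n_chains)
        x_tilde = x  # samples at this level initialize the next level
    return x

# Toy model (assumption): clean data N(0, data_var), so the marginal at noise
# level s is N(0, data_var + s^2) with a closed-form score and Hessian.
data_var = 1.0

def score(x, s):
    return -x / (data_var + s**2)

def hess_diag(x, s):
    return -np.ones_like(x) / (data_var + s**2)

sigmas = np.geomspace(10.0, 0.01, 20)  # hypothetical noise schedule
rng = np.random.default_rng(0)
samples = multilevel_gibbs(score, hess_diag, sigmas, 20000, 3, rng)
print(samples.mean(), samples.var())
```

With a learned noise-conditional score network, `score` would be the network output and `hess_diag` a stochastic diagonal Hessian estimate; the structure of the loop is otherwise unchanged.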
For direct comparison with the results of Song and Ermon (2020) on CIFAR10, we retain the same schedule of noise levels used to generate samples with Langevin dynamics. We generate 50,000 samples using this approach and report FID and Inception scores in Table 2. Our multi-level Gibbs sampling scheme produces samples of equivalent quality to the multi-level Langevin dynamics of Song and Ermon (2020), confirming its applicability to complex natural image data. The FID is also notably better than that of the single-noise-level Gibbs sampling, and the samples exhibit significant visual diversity (Figure 9). This underlines the importance of employing multi-level noise in our approach. Recent advances in sampling strategies for score-based models leveraging the framework of stochastic differential equations (Song et al., 2021) have led to significant further improvements in generation quality, as shown in Table 2; we leave the exploration of possible applications of our method to this framework to future work.
6 Conclusion
This paper focuses on addressing the inconsistency problem in training energy-based models (EBMs) using denoising score matching. Specifically, we identify the presence of an underlying clean model within a ‘noisy’ EBM and propose an efficient sampling scheme for the clean model. We demonstrate how this method can be effectively applied to high-dimensional data and showcase image generation results in both single and multi-level noise settings. More broadly, we hope our more accurate denoising posterior opens new avenues for future work on score-based methods in machine learning.
References
- Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
- Bao et al. (2022) F. Bao, C. Li, J. Zhu, and B. Zhang. Analytic-DPM: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. arXiv preprint arXiv:2201.06503, 2022.
- Bekas et al. (2007) C. Bekas, E. Kokiopoulou, and Y. Saad. An estimator for the diagonal of a matrix. Applied numerical mathematics, 57(11-12):1214–1229, 2007.
- Bengio et al. (2013) Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized denoising auto-encoders as generative models. Advances in neural information processing systems, 26, 2013.
- Bishop and Nasrabadi (2006) C. M. Bishop and N. M. Nasrabadi. Pattern recognition and machine learning, volume 4. Springer, 2006.
- Bose et al. (2018) A. J. Bose, H. Ling, and Y. Cao. Adversarial contrastive estimation. arXiv preprint arXiv:1805.03642, 2018.
- Du and Mordatch (2019) Y. Du and I. Mordatch. Implicit generation and modeling with energy based models. Advances in Neural Information Processing Systems, 32, 2019.
- Du et al. (2020) Y. Du, S. Li, J. Tenenbaum, and I. Mordatch. Improved contrastive divergence training of energy based models. arXiv preprint arXiv:2012.01316, 2020.
- Efron (2011) B. Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
- Fisher (1925) R. A. Fisher. Theory of statistical estimation. In Mathematical proceedings of the Cambridge philosophical society, volume 22, pages 700–725. Cambridge University Press, 1925.
- Gao et al. (2018) R. Gao, Y. Lu, J. Zhou, S.-C. Zhu, and Y. N. Wu. Learning generative convnets via multi-grid modeling and sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9155–9164, 2018.
- Gretton et al. (2012) A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
- Hinton (2002) G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
- Ho et al. (2020) J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- Hutchinson (1990) M. F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 19(2):433–450, 1990.
- Hyvärinen (2005) A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
- Jolicoeur-Martineau et al. (2021) A. Jolicoeur-Martineau, R. Piché-Taillefer, I. Mitliagkas, and R. T. des Combes. Adversarial score matching and improved sampling for image generation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=eLfqMl3z3lq.
- Kim and Bengio (2016) T. Kim and Y. Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.
- Kingma and Ba (2014) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krizhevsky et al. (2009) A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- LeCun (1998) Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- Martens et al. (2012) J. Martens, I. Sutskever, and K. Swersky. Estimating the Hessian by back-propagating curvature. arXiv preprint arXiv:1206.6464, 2012.
- Meng et al. (2021) C. Meng, Y. Song, W. Li, and S. Ermon. Estimating high order gradients of the data distribution by denoising. Advances in Neural Information Processing Systems, 34:25359–25369, 2021.
- Minka (2013) T. P. Minka. Expectation propagation for approximate Bayesian inference. arXiv preprint arXiv:1301.2294, 2013.
- Ngiam et al. (2011) J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng. Learning deep energy models. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 1105–1112, 2011.
- Nijkamp et al. (2019) E. Nijkamp, M. Hill, S.-C. Zhu, and Y. N. Wu. Learning non-convergent non-persistent short-run MCMC toward energy-based model. Advances in Neural Information Processing Systems, 32, 2019.
- Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
- Ramachandran et al. (2017) P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
- Robbins (1992) H. E. Robbins. An empirical Bayes approach to statistics. In Breakthroughs in Statistics: Foundations and basic theory, pages 388–394. Springer, 1992.
- Robert et al. (1999) C. P. Robert and G. Casella. Monte Carlo statistical methods, volume 2. Springer, 1999.
- Ronneberger et al. (2015) O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- Salimans and Ho (2021) T. Salimans and J. Ho. Should EBMs model the energy or the score? In Energy Based Models Workshop-ICLR 2021, 2021.
- Saremi (2019) S. Saremi. On approximating $\nabla f$ with neural networks. arXiv preprint arXiv:1910.12744, 2019.
- Saremi et al. (2018) S. Saremi, A. Mehrjou, B. Schölkopf, and A. Hyvärinen. Deep energy estimator networks. arXiv preprint arXiv:1805.08306, 2018.
- Song and Ermon (2019) Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
- Song and Ermon (2020) Y. Song and S. Ermon. Improved techniques for training score-based generative models. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/92c3b916311a5517d9290576e3ea37ad-Abstract.html.
- Song and Kingma (2021) Y. Song and D. P. Kingma. How to train your energy-based models. arXiv preprint arXiv:2101.03288, 2021.
- Song et al. (2020) Y. Song, S. Garg, J. Shi, and S. Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574–584. PMLR, 2020.
- Song et al. (2021) Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS.
- Tao (2011) T. Tao. An introduction to measure theory, volume 126. American Mathematical Society Providence, 2011.
- Vincent (2011) P. Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
- Wang et al. (2020) Z. Wang, S. Cheng, L. Yueru, J. Zhu, and B. Zhang. A Wasserstein minimum velocity approach to learning unnormalized models. In International Conference on Artificial Intelligence and Statistics, pages 3728–3738. PMLR, 2020.
- Welling and Teh (2011) M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688, 2011.
- Wendland (2004) H. Wendland. Scattered data approximation, volume 17. Cambridge university press, 2004.
- Wenliang and Kanagawa (2020) L. Wenliang and H. Kanagawa. Blindness of score-based methods to isolated components and mixing proportions. arXiv preprint arXiv:2008.10087, 2020.
- Xie et al. (2016) J. Xie, Y. Lu, S.-C. Zhu, and Y. Wu. A theory of generative convnet. In International Conference on Machine Learning, pages 2635–2644. PMLR, 2016.
- Zhai et al. (2016) S. Zhai, Y. Cheng, R. Feris, and Z. Zhang. Generative adversarial networks as variational training of energy based models. arXiv preprint arXiv:1611.01799, 2016.
- Zhang et al. (2019) M. Zhang, T. Bird, R. Habib, T. Xu, and D. Barber. Variational f-divergence minimization. arXiv preprint arXiv:1907.11891, 2019.
- Zhang et al. (2020) M. Zhang, P. Hayes, T. Bird, R. Habib, and D. Barber. Spread divergence. In International Conference on Machine Learning, pages 11106–11116. PMLR, 2020.
- Zhang et al. (2022a) M. Zhang, P. Hayes, and D. Barber. Generalization gap in amortized inference. Advances in neural information processing systems, 2022a.
- Zhang et al. (2022b) M. Zhang, O. Key, P. Hayes, D. Barber, B. Paige, and F.-X. Briol. Towards healing the blindness of score matching. arXiv preprint arXiv:2209.07396, 2022b.
Appendix A Proof and Derivations
A.1 Proof of Theorem 2.1
The existence is straightforward, since , we can simply let , which makes . To show the uniqueness, we denote density , so and can be written as convolutions
(19) |
we then have
(20) |
where $\mathcal{F}$ denotes the Fourier transform. Since the Fourier transform of a Gaussian is also a Gaussian, it is strictly positive everywhere and can be divided out of both sides, so we have
(21) |
Therefore, is the unique distribution that makes . The same technique has been used to construct the spread KL divergence (which we denote as ) Zhang et al. [2020], defined as where , to train an implicit model . Unlike in the DSM setting, when , the underlying model is directly available, whereas the EBM trained by DSM learns the noisy distribution .
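The uniqueness argument can be written out compactly. The following is only a sketch in assumed notation, since the symbols are not fixed in this appendix: $p_d$ is the clean data density, $q_\theta$ the underlying clean model, tildes mark their noisy counterparts, $k_\sigma$ is the density of $\mathcal{N}(0,\sigma^2 I)$, and the Fourier convention $\mathcal{F}[f](\omega)=\int f(x)\,e^{-i\omega^\top x}\,dx$ is our choice.

```latex
\begin{align*}
\tilde p_d &= p_d * k_\sigma, \qquad \tilde q_\theta = q_\theta * k_\sigma, \\
\tilde p_d = \tilde q_\theta
  &\iff \mathcal{F}[p_d]\,\mathcal{F}[k_\sigma]
      = \mathcal{F}[q_\theta]\,\mathcal{F}[k_\sigma]
  && \text{(convolution theorem)} \\
  &\iff \mathcal{F}[p_d] = \mathcal{F}[q_\theta]
  && \text{since } \mathcal{F}[k_\sigma](\omega)
      = \exp\!\big(-\tfrac{\sigma^2}{2}\|\omega\|^2\big) > 0 \\
  &\iff p_d = q_\theta
  && \text{(injectivity of } \mathcal{F}\text{)}.
\end{align*}
```

The step that removes $\mathcal{F}[k_\sigma]$ is the only place where the Gaussian form of the noise is used; any noise kernel whose Fourier transform has no zeros would support the same argument.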
A.2 General Conditions Characterising the Existence of the Clean Model
In the previous section, we assumed that, for a flexible neural network parameterization , the energy-based model trained by Equation 4 can recover the target noisy data distribution, so that there exists an underlying model such that and . This assumption is common in the literature on score-based methods. For example, in score-based diffusion models Song and Ermon [2019], Ho et al. [2020], Bao et al. [2022], for any data , the score function is usually parameterized directly by a neural network . However, this parameterization cannot guarantee that is a conservative vector field; in other words, it cannot guarantee that there exists a distribution such that and that is symmetric Salimans and Ho [2021], Saremi [2019]. Therefore, perfect score estimation is implicitly assumed in order to allow an EBM interpretation.
However, the underlying clean model does not always exist for an imperfect model . Here we provide necessary and sufficient conditions that guarantee the existence of the underlying clean model.
Theorem A.1 (Necessary and Sufficient conditions for the existence of the underlying clean model.).
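For a model with the convolutional noise distribution , there exists an underlying model such that if and only if is positive semi-definite (footnote 6: a continuous function $\Phi:\mathbb{R}^d\to\mathbb{C}$ is positive semi-definite if, for all $N\in\mathbb{N}$, all sets of pairwise distinct centers $\{x_1,\dots,x_N\}\subseteq\mathbb{R}^d$ and all $\alpha\in\mathbb{C}^N$, $\sum_{j=1}^{N}\sum_{k=1}^{N}\alpha_j\bar{\alpha}_k\,\Phi(x_j-x_k)\geq 0$; see [Wendland, 2004, Definition 6.1]). Additionally, the underlying distribution can be written as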
(22) |
where $\mathcal{F}^{-1}$ is the inverse Fourier transform. This theorem is a straightforward corollary of Bochner’s Theorem (footnote 7: Bochner’s Theorem [Wendland, 2004, Theorem 6.6]: a continuous function $\Phi:\mathbb{R}^d\to\mathbb{C}$ is positive semi-definite if and only if it is the Fourier transform of a finite non-negative Borel measure on $\mathbb{R}^d$). However, for the energy model , it is difficult to design a function family that both satisfies the positive semi-definite condition and has a tractable score function (footnote 8: for example, one can define a noisy energy-based model with , which always allows an underlying clean energy-based model such that with . However, the score function is intractable in this case). We thus leave the design of better energy function parameterizations as a promising future direction.
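To make the positive semi-definite condition concrete, here is a small numerical probe on a toy case. It assumes that the function tested in Theorem A.1 is the ratio of the Fourier transform of the noisy model to that of the noise kernel, and it uses a zero-mean one-dimensional Gaussian noisy model with variance `noisy_var`, for which that ratio is `exp(-(noisy_var - sigma**2) * w**2 / 2)`. The helper names are hypothetical, and a finite Gram matrix can only refute positive semi-definiteness, never prove it.

```python
import numpy as np

def fourier_ratio(omega, noisy_var, sigma):
    """F[noisy model] / F[noise kernel] for a zero-mean 1-D Gaussian noisy model."""
    # F[N(0, v)](w) = exp(-v w^2 / 2), so the ratio is exp(-(v - sigma^2) w^2 / 2).
    return np.exp(-0.5 * (noisy_var - sigma**2) * omega**2)

def min_gram_eigenvalue(phi, centers):
    """Smallest eigenvalue of the Gram matrix [phi(w_j - w_k)] over the given centers."""
    diffs = centers[:, None] - centers[None, :]
    return float(np.linalg.eigvalsh(phi(diffs)).min())

sigma = 1.0
centers = np.linspace(-2.0, 2.0, 9)

# noisy_var > sigma^2: a clean Gaussian with variance noisy_var - sigma^2 exists.
eig_valid = min_gram_eigenvalue(lambda w: fourier_ratio(w, 1.5, sigma), centers)

# noisy_var < sigma^2: the noisy model is "sharper" than the noise itself,
# so no clean density can produce it and the Gram matrix is indefinite.
eig_invalid = min_gram_eigenvalue(lambda w: fourier_ratio(w, 0.5, sigma), centers)

print(eig_valid, eig_invalid)
```

In the first case the smallest eigenvalue is non-negative up to round-off, while in the second it is clearly negative, matching the intuition that a model narrower than the noise kernel cannot arise from convolving any clean density with that kernel.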
A.3 Proof of Theorem 2.2
Derivation of the Mean Identity
We let , where , we have
where we define the model denoising posterior using Bayes rule . The second equality is due to the following Gaussian distribution property
(23) |
Derivations of the Analytical Full Covariance Identity
We have derived the mean identity
(24) |
Taking the gradient over on both sides and scaling with , we have
(25) |
We can also expand the Hessian of the :
Therefore, we obtain the analytical full covariance identity.
(26) |
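Both identities can be sanity-checked on a case where the denoising posterior is available in closed form. The sketch below assumes the identities take the standard forms mean = x̃ + σ²∇log q̃(x̃) and covariance = σ⁴∇²log q̃(x̃) + σ²I (our reading, since the displayed equations are not reproduced here), and uses a Gaussian clean model N(m, S), whose noisy marginal is N(m, S + σ²I). All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 3, 0.7

# Clean Gaussian model N(m, S) with a random symmetric positive-definite covariance.
A = rng.normal(size=(d, d))
S = A @ A.T + 0.5 * np.eye(d)
m = rng.normal(size=d)
x_noisy = rng.normal(size=d)

# The noisy marginal is N(m, S + sigma^2 I); its score and Hessian are closed-form.
P = np.linalg.inv(S + sigma**2 * np.eye(d))
score = -P @ (x_noisy - m)
hessian = -P

# Posterior moments obtained from the (assumed) mean and covariance identities.
mean_from_identity = x_noisy + sigma**2 * score
cov_from_identity = sigma**4 * hessian + sigma**2 * np.eye(d)

# Posterior moments obtained from standard Gaussian conditioning.
mean_reference = m + S @ P @ (x_noisy - m)
cov_reference = S - S @ P @ S

print(np.allclose(mean_from_identity, mean_reference))
print(np.allclose(cov_from_identity, cov_reference))
```

For a neural energy model the score and Hessian would come from automatic differentiation rather than closed-form expressions; the Gaussian case is used here only because the reference moments are exact.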
A.4 Proof of Theorem 2.3
Lemma A.2 (KL to Gaussian Bao et al. [2022]).
Let be a distribution with mean and covariance , and let . Denoting the differential entropy by , we have
(27) |
The proof can be found in Lemma 2 of Bao et al. [2022].
We can then prove Theorem 2.3. Since , where , we have
(28) |
Assume a Gaussian distribution and denote the mean and covariance of the true posterior by and ; then the optimal is
(29) | ||||
(30) | ||||
(31) | ||||
(32) |
Therefore, the optimal under the joint KL has the mean and covariance .
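The moment-matching conclusion can be illustrated on samples. Minimizing the KL divergence from a fixed distribution to a Gaussian is equivalent to minimizing the expected negative log-likelihood of that Gaussian, so the sketch below compares the average negative log-likelihood at the matched moments with two perturbed alternatives. The bimodal sample set is a stand-in for a non-Gaussian posterior and is purely an assumption for illustration; `gaussian_nll` is a hypothetical helper.

```python
import numpy as np

def gaussian_nll(x, mean, cov):
    """Average negative log-density of the rows of x under N(mean, cov)."""
    d = x.shape[1]
    diff = x - mean
    solved = np.linalg.solve(cov, diff.T).T          # rows are cov^{-1} (x_i - mean)
    maha = np.einsum("ij,ij->i", diff, solved)       # squared Mahalanobis distances
    _, logdet = np.linalg.slogdet(cov)
    return float(0.5 * (maha.mean() + logdet + d * np.log(2 * np.pi)))

rng = np.random.default_rng(0)

# Non-Gaussian samples: a two-component mixture in 2-D.
component = rng.integers(0, 2, size=5000)
x = 0.5 * rng.normal(size=(5000, 2)) + np.where(component[:, None] == 0, -1.0, 2.0)

# Matched moments of the empirical distribution (bias=True divides by N).
mu = x.mean(axis=0)
cov = np.cov(x, rowvar=False, bias=True)

nll_matched = gaussian_nll(x, mu, cov)
nll_shifted = gaussian_nll(x, mu + 0.3, cov)   # wrong mean
nll_scaled = gaussian_nll(x, mu, 1.5 * cov)    # wrong covariance

print(nll_matched, nll_shifted, nll_scaled)
```

The matched Gaussian attains the lowest value of the three even though the samples are bimodal, which is the content of the theorem: the best Gaussian under this KL direction matches the first two moments and does not need the target to be Gaussian.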
Appendix B Connection to Analytical DDPM
Bao et al. [2022] consider the constrained variational family and derive the optimal as
(33) |
which can also be rewritten using the score function
(34) |
To make the connection explicit, we can also plug our analytical full covariance (Equation 10) into Equation 16
(35) |
which recovers Equation 17, where the first equality is due to the well-known Fisher information identity Fisher [1925].
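One way to see this reduction is sketched below. The notation is assumed rather than taken from the numbered equations: $d$ is the data dimension, $\tilde q_\theta$ the noisy model, $\Sigma_q(\tilde x)=\sigma^4\nabla^2_{\tilde x}\log\tilde q_\theta(\tilde x)+\sigma^2 I$ the full covariance, and $\sigma_q^{*2}$ the optimal isotropic variance, taken to be the average trace of $\Sigma_q$ per dimension. The order of the steps may differ from Equation 35.

```latex
\begin{align*}
\sigma_q^{*2}
  &= \frac{1}{d}\,\mathbb{E}_{\tilde q_\theta(\tilde x)}
     \big[\operatorname{tr}\Sigma_q(\tilde x)\big] \\
  &= \sigma^2 + \frac{\sigma^4}{d}\,
     \mathbb{E}_{\tilde q_\theta(\tilde x)}
     \big[\operatorname{tr}\nabla^2_{\tilde x}\log\tilde q_\theta(\tilde x)\big] \\
  &= \sigma^2 - \frac{\sigma^4}{d}\,
     \mathbb{E}_{\tilde q_\theta(\tilde x)}
     \big[\|\nabla_{\tilde x}\log\tilde q_\theta(\tilde x)\|^2\big],
\end{align*}
```

where the last line uses the identity $\mathbb{E}[\nabla^2\log\tilde q_\theta]=-\mathbb{E}[\nabla\log\tilde q_\theta\,\nabla\log\tilde q_\theta^\top]$, which holds under the usual regularity conditions that allow integration by parts.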
Appendix C Experiments
All experiments in this paper were run on a single NVIDIA RTX 3090 GPU.
C.1 Effect of the Single Noise Choice on MNIST
Figure 11 shows samples generated by our method with EBMs trained using different noise scales in the noise distribution . The image quality depends heavily on the choice of the noise scale, and achieves the best visual quality; we therefore use this hyper-parameter in the subsequent comparisons.
C.2 Multi-level Noise Details
For full details on the architecture and noise schedule used in the multi-level noise experiments in Section 5, we refer to Appendix B of [Song and Ermon, 2020]. For our multi-level Gibbs sampling procedure, we used 3 Gibbs steps at each noise level and 3 Rademacher samples for each diagonal Hessian computation. Following [Song and Ermon, 2020], we used a total of 232 noise levels, distributed according to their proposed geometric schedule, and applied a final denoising step, which returns the mean of the clean distribution conditioned on the final output of the sampling procedure (this output is a sample from the noised distribution at the smallest noise level). This denoising step was previously found to improve FID scores significantly [Jolicoeur-Martineau et al., 2021].
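Two ingredients of this setup, a geometric noise schedule and a Rademacher-based estimate of the Hessian diagonal, can be sketched as follows. This is an illustration under assumptions, not the experimental code: the schedule endpoints are placeholders, the function names are hypothetical, and the Hessian-vector product is computed from an explicit matrix, whereas for a neural energy model it would come from automatic differentiation.

```python
import numpy as np

def geometric_schedule(sigma_max, sigma_min, num_levels):
    """Noise levels decreasing geometrically from sigma_max to sigma_min."""
    return np.geomspace(sigma_max, sigma_min, num_levels)

def hutchinson_diag(hvp, dim, num_samples, rng):
    """Estimate diag(H) as the average of v * (H v) over Rademacher vectors v."""
    estimate = np.zeros(dim)
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=dim)  # entries are +1 or -1 with equal probability
        estimate += v * hvp(v)
    return estimate / num_samples

rng = np.random.default_rng(0)

# 232 levels as in the text; the endpoints 50.0 and 0.01 are assumed placeholders.
sigmas = geometric_schedule(50.0, 0.01, 232)

# For a diagonal Hessian the estimate is exact with any number of samples.
H_diag = np.diag([1.0, -2.0, 3.0])
diag_exact = hutchinson_diag(lambda v: H_diag @ v, dim=3, num_samples=3, rng=rng)

# For a dense symmetric Hessian it is unbiased and converges as samples increase.
B = rng.normal(size=(4, 4))
H_full = B + B.T
diag_estimate = hutchinson_diag(lambda v: H_full @ v, dim=4, num_samples=20000, rng=rng)

print(diag_exact, diag_estimate, np.diag(H_full))
```

With only a few Rademacher samples the estimate is noisy whenever the Hessian has large off-diagonal entries, since the per-entry variance equals the sum of squared off-diagonal elements in that row; the cost is one Hessian-vector product per sample.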