Symmetric Equilibrium Learning of VAEs
Boris Flach (Czech Technical University in Prague), Dmitrij Schlesinger (Dresden University of Technology), Alexander Shekhovtsov (Czech Technical University in Prague)
Abstract
We view variational autoencoders (VAEs) as decoder–encoder pairs, which map distributions in the data space to distributions in the latent space and vice versa. The standard learning approach for VAEs is the maximisation of the evidence lower bound (ELBO). It is asymmetric in that it aims at learning a latent variable model while using the encoder as an auxiliary means only. Moreover, it requires a closed-form a priori latent distribution. This limits its applicability in more complex scenarios, such as general semi-supervised learning and employing complex generative models as priors. We propose a Nash equilibrium learning approach, which is symmetric with respect to the encoder and decoder and allows learning VAEs in situations where both the data and the latent distributions are accessible only by sampling. The flexibility and simplicity of this approach allow its application to a wide range of learning scenarios and downstream tasks.
1 INTRODUCTION
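Variational autoencoders (Kingma and Welling, 2014; Rezende et al., 2014) are a well-established and well-analysed approach to learning latent variable models of the form $p(x, z) = p(z)\, p(x \mid z)$. Given a distribution $\pi(x)$, $x \in \mathcal{X}$, in the data space and an assumed distribution $\pi(z)$, $z \in \mathcal{Z}$, in the latent space, a VAE combines a pair of parametrised distributions $p_\theta(x \mid z)$, $q_\phi(z \mid x)$, which are usually modelled in terms of deep networks. The standard way to learn this encoder–decoder pair is to maximise the evidence lower bound of the data log-likelihood,
$$L_B(\theta, \phi) = \mathbb{E}_{\pi(x)} \Bigl[ \mathbb{E}_{q_\phi(z \mid x)} \log p_\theta(x \mid z) - D_{KL}\bigl(q_\phi(z \mid x) \,\|\, \pi(z)\bigr) \Bigr]. \tag{1}$$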
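The evidence lower bound is the expected reconstruction log-likelihood under the encoder minus the KL divergence from the encoder to the assumed latent distribution. The following minimal sketch evaluates it exactly for a toy model with one binary observed and one binary latent variable. The tables `prior`, `dec`, `enc` and all function names are illustrative assumptions and do not come from the paper.

```python
import math

def elbo(x, prior, dec, enc):
    """ELBO for one observation x: E_q[log p(x|z)] - KL(q(z|x) || prior(z)).

    prior[z]  : assumed latent distribution
    dec[z][x] : decoder p(x|z)
    enc[x][z] : encoder q(z|x)
    """
    total = 0.0
    for z, qz in enumerate(enc[x]):
        if qz > 0.0:
            total += qz * (math.log(dec[z][x]) - math.log(qz) + math.log(prior[z]))
    return total

def log_evidence(x, prior, dec):
    """Exact log p(x) = log sum_z prior(z) p(x|z)."""
    return math.log(sum(pz * dec[z][x] for z, pz in enumerate(prior)))

def posterior(x, prior, dec):
    """Exact posterior p(z|x) by Bayes' rule."""
    joint = [pz * dec[z][x] for z, pz in enumerate(prior)]
    norm = sum(joint)
    return [j / norm for j in joint]

# Toy tables (hypothetical values).
prior = [0.5, 0.5]
dec = [[0.9, 0.1], [0.2, 0.8]]
enc = [[0.7, 0.3], [0.4, 0.6]]
exact_enc = [posterior(x, prior, dec) for x in range(2)]
```

For any encoder the bound stays below `log_evidence`, and it becomes tight when the encoder equals the exact posterior, as in `exact_enc`. Note that evaluating the KL term requires the density `prior[z]`, not just samples from it.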
This learning formulation is particularly well suited to situations where only the generative model is of interest. The research in this area in recent years has culminated in deep hierarchical VAEs (Vahdat and Kautz, 2020) and diffusion models (Ho et al., 2020; Rombach et al., 2022), which can also be viewed as hierarchical VAEs. The encoder’s role is auxiliary in the ELBO, and it is even fixed to a simple noisy shrinkage in diffusion models. However, a learned encoder is often of interest in its own right in applications: it can provide compact representations that are useful for downstream tasks (e.g. for semantic hashing, Dadaneh et al. 2020). Furthermore, while only samples from the data distribution $\pi(x)$ are needed in (1), an explicit model of the latent distribution $\pi(z)$ is required in order to compute (and differentiate) the KL-divergence term. Although solutions to the latter problem have been proposed, they come with other limitations (discussed in detail in Section 5).
The asymmetries of the standard VAE learning approach pointed out above make it difficult to use in semi-supervised training scenarios and in situations where both the data space $\mathcal{X}$ and the latent space $\mathcal{Z}$ are complex and possibly structured, as for instance in semantic segmentation with images $x$ and segmentations $z$. Learning an encoder–decoder pair in such a scenario would naturally allow solving inference problems in both directions between $x$ and $z$, as well as building more complex models. The requirement to model the latent distribution $\pi(z)$ by a simple and tractable density then becomes a significant limitation.
In this work, we propose a symmetric learning approach inspired by game theory, which leads to a simple learning algorithm. The method can handle implicitly given marginal distributions $\pi(x)$ and $\pi(z)$. It does not require gradients of parametric discrete expectations, such as the gradient of the ELBO w.r.t. the encoder parameters, and therefore no reparametrisation is needed. Consequently, handling discrete or continuous variables is simple. The method gives a novel view of the well-known wake-sleep algorithm (Hinton et al., 1995), as discussed in Section 5. It can be applied to models with structured latent spaces, like hierarchical VAEs, and extended to models consisting of three or more groups of variables. In the latter case, the model consists of several inference networks, one for each group of variables. They are learned jointly and can address an extended range of tasks at inference time, as we demonstrate experimentally.
The rest of the paper is organised as follows. In the next two sections we derive and analyse our novel learning approach. In the following section we exemplify its application to advanced models and learning setups. In the final experimental section we compare it with ELBO learning, show that it provides comparable model estimates, and demonstrate its applicability to more complex models not addressable by ELBO.
2 PROBLEM FORMULATION
We propose a generic learning approach, whose primary goal is to learn a decoder $p_\theta(x \mid z)$ and an encoder $q_\phi(z \mid x)$ in the following training scenarios:
Semi-supervised learning: We assume training samples $x$ and $z$ and possibly also joint samples $(x, z)$, drawn i.i.d. from an unknown distribution $\pi(x, z)$ and its marginals.
Unsupervised learning: Only samples of $x$ are observed. In this case the latent space $\mathcal{Z}$ is a free modelling choice.
Similar to VAE learning, the choice of the models for the decoder and encoder is dictated by the need to be able to evaluate (or at least differentiate) their respective log-densities and to sample from them. We will assume that the decoder and encoder belong to parametric exponential families of the form
$$p_\theta(x \mid z) \propto \exp \langle \nu(x), f_\theta(z) \rangle, \tag{2a}$$
$$q_\phi(z \mid x) \propto \exp \langle \psi(z), g_\phi(x) \rangle, \tag{2b}$$
where $\nu(x)$ and $\psi(z)$ are fixed sufficient statistics. The mappings $f_\theta$ and $g_\phi$, which output the natural parameters, are usually modelled by deep networks, parametrised by $\theta$, $\phi$. Notice that the variables $x$, $z$ can be either discrete or continuous depending on the chosen exponential family. Common choices are e.g. Bernoulli or Gaussian models.
3 SYMMETRIC EQUILIBRIUM LEARNING
We present our general approach and theoretical analysis for the semi-supervised learning task from the previous section, which naturally calls for a symmetric formulation.
For simplicity of exposition, let us assume that only marginal empirical distributions $\pi(x)$ and $\pi(z)$ are given, but no joint observations are available. The goal is to learn an encoder–decoder pair $q_\phi(z \mid x)$ and $p_\theta(x \mid z)$ by (i) optimising the likelihood of the observed data and (ii) enforcing the encoder and decoder consistency at the same time. We formulate the learning task symmetrically as finding a Nash equilibrium of a two-player game. The strategy of the first player is represented by the decoder $p_\theta(x \mid z)$. Similarly, the strategy of the second player is represented by the encoder $q_\phi(z \mid x)$. The utility function of a player is the likelihood of the training data w.r.t. its strategy. Training examples are completed by the strategy of the other player. For example, the missing information in the examples $x \sim \pi(x)$ for the decoder likelihood is completed by the encoder strategy: $z \sim q_\phi(z \mid x)$. Proceeding in the same way for the encoder, we obtain the utility functions
$$L_p(\theta, \phi) = \mathbb{E}_{\pi(x)} \mathbb{E}_{q_\phi(z \mid x)} \log p_\theta(x \mid z), \tag{3a}$$
$$L_q(\theta, \phi) = \mathbb{E}_{\pi(z)} \mathbb{E}_{p_\theta(x \mid z)} \log q_\phi(z \mid x). \tag{3b}$$
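As a hedged illustration of the two utilities, the sketch below evaluates them for finite variables, once exactly by enumeration and once by Monte Carlo, where each sample is completed by the other player's strategy. The tabular representation (`dec[z][x]` for the decoder, `enc[x][z]` for the encoder) and the function names are assumptions made for this example only.

```python
import math
import random

def utilities(pi_x, pi_z, dec, enc):
    """Exact decoder and encoder utilities for finite x and z.

    L_p = sum_x pi_x[x] sum_z enc[x][z] log dec[z][x]
    L_q = sum_z pi_z[z] sum_x dec[z][x] log enc[x][z]
    """
    pairs = [(x, z) for x in range(len(pi_x)) for z in range(len(pi_z))]
    L_p = sum(pi_x[x] * enc[x][z] * math.log(dec[z][x]) for x, z in pairs)
    L_q = sum(pi_z[z] * dec[z][x] * math.log(enc[x][z]) for x, z in pairs)
    return L_p, L_q

def sample(probs, rng):
    """Draw an index from a finite distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]

def utilities_mc(pi_x, pi_z, dec, enc, n, rng):
    """Monte Carlo estimates of both utilities from n completed samples each."""
    est_p = 0.0
    est_q = 0.0
    for _ in range(n):
        # Decoder utility: x from the data, z completed by the encoder.
        x = sample(pi_x, rng)
        z = sample(enc[x], rng)
        est_p += math.log(dec[z][x])
        # Encoder utility: z from the latent distribution, x completed by the decoder.
        z = sample(pi_z, rng)
        x = sample(dec[z], rng)
        est_q += math.log(enc[x][z])
    return est_p / n, est_q / n

# Toy tables (hypothetical values).
pi_x, pi_z = [0.3, 0.7], [0.6, 0.4]
dec = [[0.9, 0.1], [0.2, 0.8]]
enc = [[0.7, 0.3], [0.4, 0.6]]
L_p, L_q = utilities(pi_x, pi_z, dec, enc)
```

Only samples from `pi_x` and `pi_z` are needed in `utilities_mc`; neither marginal density is ever evaluated.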
As we will see later, the game aims at maximising the decoder likelihood and the encoder likelihood of the training data simultaneously, whereby the mutual completion reinforces decoder-encoder consistency.
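A Nash equilibrium of the game with decoder utility $L_p(\theta, \phi)$ (3a) and encoder utility $L_q(\theta, \phi)$ (3b) is a pair $(\theta^*, \phi^*)$ such that
$$L_p(\theta^*, \phi^*) \geq L_p(\theta, \phi^*) \;\; \forall \theta \quad \text{and} \quad L_q(\theta^*, \phi^*) \geq L_q(\theta^*, \phi) \;\; \forall \phi, \tag{4}$$
i.e. a point at which neither player can improve its objective function. Towards finding an equilibrium we consider a simple gradient algorithm with step size $\alpha > 0$, in which each player tries to improve its utility w.r.t. its strategy
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta L_p(\theta_t, \phi_t), \qquad \phi_{t+1} = \phi_t + \alpha \nabla_\phi L_q(\theta_t, \phi_t). \tag{5}$$
These updates may be executed in parallel or sequentially. Stochastic unbiased estimates of the required gradients are readily obtained by differentiating Monte-Carlo estimates of expectations (3) with as few as a single sample. Unlike in ELBO, the expectation $\mathbb{E}_{q_\phi(z \mid x)}$ in $L_p$ does not need to be differentiated with respect to the encoder parameters $\phi$, and similarly for $\mathbb{E}_{p_\theta(x \mid z)}$ in $L_q$. There is no need for the reparametrisation trick in the case of continuous variables or for specialised gradient estimators through discrete samples in the case of discrete variables.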
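The parallel gradient updates can be sketched for a toy model with binary $x$ and $z$. This example makes two simplifying assumptions that are not in the text: the deep networks are replaced by tabular logits (`theta[z]` for $p(x{=}1 \mid z)$, `phi[x]` for $q(z{=}1 \mid x)$), and the expectations are computed exactly instead of from single samples. All names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def logit(p):
    return math.log(p / (1.0 - p))

def bern(p1):
    """Return [P(0), P(1)] for a Bernoulli variable."""
    return [1.0 - p1, p1]

def gradients(theta, phi, pi_x, pi_z):
    """Exact gradients of L_p w.r.t. theta and of L_q w.r.t. phi.

    For a Bernoulli with logit a, d/da log p(v) = v - sigmoid(a).
    The distribution supplied by the other player only weights the terms;
    it is never differentiated.
    """
    g_theta = [0.0, 0.0]
    g_phi = [0.0, 0.0]
    for x in range(2):
        for z in range(2):
            q_zx = bern(sigmoid(phi[x]))[z]    # q(z|x)
            p_xz = bern(sigmoid(theta[z]))[x]  # p(x|z)
            g_theta[z] += pi_x[x] * q_zx * (x - sigmoid(theta[z]))
            g_phi[x] += pi_z[z] * p_xz * (z - sigmoid(phi[x]))
    return g_theta, g_phi

def step(theta, phi, pi_x, pi_z, lr):
    """One parallel update: each player ascends its own utility."""
    g_theta, g_phi = gradients(theta, phi, pi_x, pi_z)
    theta = [t + lr * g for t, g in zip(theta, g_theta)]
    phi = [f + lr * g for f, g in zip(phi, g_phi)]
    return theta, phi

pi_x, pi_z = [0.3, 0.7], [0.6, 0.4]
theta, phi = [-1.0, 1.0], [0.5, -0.5]
for _ in range(100):
    theta, phi = step(theta, phi, pi_x, pi_z, lr=0.5)
```

In this tabular setting, with no joint observations, one can check by hand that the independent pair $p(x \mid z) = \pi(x)$, $q(z \mid x) = \pi(z)$ has zero gradient for both players, i.e. it is a stationary point of the updates.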
Uniqueness
It is well known that nonzero-sum games can have multiple and even infinitely many Nash equilibria. It is therefore crucial to analyse uniqueness of the solution as well as the convergence properties of the algorithm (5).
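Extending the decoder and encoder to joint models via
$$p_\theta(x, z) = \pi(z)\, p_\theta(x \mid z), \qquad q_\phi(x, z) = \pi(x)\, q_\phi(z \mid x), \tag{6}$$
the game utilities (3) can be compactly written as
$$L_p(\theta, \phi) = \mathbb{E}_{q_\phi(x, z)} \log p_\theta(x \mid z), \qquad L_q(\theta, \phi) = \mathbb{E}_{p_\theta(x, z)} \log q_\phi(z \mid x). \tag{7}$$
This game is hard to analyse because of the non-linear mappings involved.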
To allow for theoretical analysis we will enlarge the spaces of feasible joint distributions by considering the following canonical exponential families
(8a) | |||
(8b) |
where are sufficient statistics on , and are free parameter vectors and and are cumulant functions ensuring normalisation. The models (8) are log-linear in and by design. At the same time, with sufficiently complex and they can represent or approximate all models from the original families which were parametrised in terms of neural networks.
We explain this model relaxation for the case of binary valued vectors and . The components of the vector of natural parameters in (2) and the corresponding cumulant function are then pseudo-Boolean functions and can be written as polynomials in the components of . The same holds for the components of the sufficient statistic vector . This means that if we take the components of in the relaxed class to contain all base monomials, then for any there would be a corresponding parameter vector making the models equal. Notice that only under this correspondence the exponent part in (8a) matches the conditional distribution while this is not true for a generic .
Theorem 1.
The two-player game with utility functions
(9a) | |||
(9b) |
and strategies given by exponential family distributions (8) has a unique, asymptotically stable equilibrium.
The proof is given in Appendix A. The idea is to construct a dual formulation of the game, which maximises the entropy under moment matching constraints. In this reformulation, it is then easy to prove the diagonal strict concavity condition (Rosen, 1965), a sufficient condition for uniqueness. By Theorems 7–10 of Rosen (1965), the theorem implies that the simple gradient ascent algorithm (5) converges to the unique equilibrium point.
The theorem applies to log-linear models (8) with free natural parameters $\theta$ and $\phi$ and guarantees that the proposed algorithm converges to a unique equilibrium in this case. This has direct applicability to e.g. EF-Harmonium models, which are however outside of our scope. Its value for VAEs defined in terms of neural networks is rather indirect: if the algorithm works in the lifted space, it gives more confidence that it would also make sense in a subspace with a non-linear parametrisation.
Consistency
Finally, we discuss the question of encoder–decoder consistency. We say that the models $p_\theta(x \mid z)$ and $q_\phi(z \mid x)$ are consistent if there exists a joint distribution of which they are conditional distributions (see also Liu et al. 2021). Since we model $p_\theta(x \mid z)$ and $q_\phi(z \mid x)$ independently, they are in general inconsistent. Enforcing the consistency strictly, while keeping the models in exponential families (2), leads to a joint necessarily collapsing to an EF-Harmonium (Arnold and Strauss 1991, Shekhovtsov et al. 2022), which is a severe limitation. However, encouraging consistency could serve as a useful regularisation and can improve learning efficiency.
We observe that our game formulation implicitly encourages consistency.
Proposition 1.
See details in Appendix A. The utility is an alternative decomposition of ELBO into the data likelihood part and the encoder–posterior divergence, encouraging consistency. The utility is a symmetric counterpart. The difference to ELBO learning is that is optimised over only and not over and vice-versa for .
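The decomposition described above can be checked directly. The short derivation below is a sketch under the assumption that the decoder's joint model is $p_\theta(x, z) = \pi(z)\, p_\theta(x \mid z)$, with marginal $p_\theta(x)$ and posterior $p_\theta(z \mid x)$, and that $L_p$ denotes the decoder utility (3a); the exact statement in the paper's Proposition 1 may be phrased differently.

```latex
\begin{align*}
L_p(\theta,\phi)
 &= \mathbb{E}_{\pi(x)} \mathbb{E}_{q_\phi(z\mid x)} \log p_\theta(x\mid z) \\
 &= \mathbb{E}_{\pi(x)} \mathbb{E}_{q_\phi(z\mid x)}
    \bigl[\log p_\theta(x) + \log p_\theta(z\mid x) - \log \pi(z)\bigr] \\
 &= \mathbb{E}_{\pi(x)} \Bigl[\log p_\theta(x)
    - D_{KL}\bigl(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\bigr)
    + D_{KL}\bigl(q_\phi(z\mid x)\,\|\,\pi(z)\bigr)\Bigr].
\end{align*}
```

The second line uses Bayes' rule, $p_\theta(x \mid z) = p_\theta(x)\, p_\theta(z \mid x) / \pi(z)$, and the third adds and subtracts $\log q_\phi(z \mid x)$ inside the expectation. The last term does not depend on $\theta$, so from the decoder's point of view maximising $L_p$ amounts to maximising the data log-likelihood minus the encoder–posterior divergence.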
Similar to ELBO learning, there is no guarantee that the proposed learning approach will result in a consistent decoder–encoder pair defining a unique joint distribution. The necessity for such a joint distribution might be however dictated by the application for which the VAE is learned. Or it might arise if the learned VAE is only a part of a larger model, which requires such a joint distribution. In such cases we may consider the distribution (e.g. Liu et al. 2021)
(10) |
with implicitly defined marginals and . They must satisfy and , which leads to the equations
(11a) | ||||
(11b) |
While it is usually not possible to compute these marginals in closed form, it is nevertheless possible to sample from them and from the joint as the limiting distributions of a Markov chain that alternates sampling of and , as considered by Lamb et al. (2017).
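For small finite spaces, the limiting distribution of the alternating chain can be computed directly, which may help build intuition. The sketch below is an assumption-laden toy: it forms the transition matrix of one encode-decode sweep, finds its stationary distribution by power iteration, and also provides a sampler that alternates $z \sim q(z \mid x)$ and $x \sim p(x \mid z)$. The tables and function names are hypothetical.

```python
import random

def transition(dec, enc):
    """T[x][x2] = sum_z q(z|x) p(x2|z): one encode-decode sweep."""
    n_x, n_z = len(enc), len(dec)
    return [[sum(enc[x][z] * dec[z][x2] for z in range(n_z))
             for x2 in range(n_x)] for x in range(n_x)]

def limiting_marginal(dec, enc, iters=200):
    """Approximate the stationary distribution over x by power iteration."""
    T = transition(dec, enc)
    n = len(T)
    dist = [1.0 / n] * n
    for _ in range(iters):
        dist = [sum(dist[x] * T[x][x2] for x in range(n)) for x2 in range(n)]
    return dist

def sample_chain(dec, enc, x0, sweeps, rng):
    """Alternate z ~ q(z|x) and x ~ p(x|z); return the final pair (x, z)."""
    x, z = x0, None
    for _ in range(sweeps):
        z = rng.choices(range(len(dec)), weights=enc[x])[0]
        x = rng.choices(range(len(enc)), weights=dec[z])[0]
    return x, z

# A consistent pair: both conditionals are derived from one joint table J[x][z].
J = [[0.1, 0.2], [0.3, 0.4]]
enc = [[J[x][z] / sum(J[x]) for z in range(2)] for x in range(2)]
dec = [[J[x][z] / (J[0][z] + J[1][z]) for x in range(2)] for z in range(2)]
marginal_x = limiting_marginal(dec, enc)
```

For a consistent pair such as the one above, the limiting marginal coincides with the marginal of the joint `J`. For an inconsistent pair the same procedure still yields a well-defined limiting distribution as long as the chain is ergodic.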
4 ADVANCED MODELS AND LEARNING SETUPS
In this section we exemplify the application of the proposed learning approach to several practically relevant learning setups and more complex models.
Semi-Supervised Learning with Mixed Data
We extend the model and learning setup from Section 3 in two respects. First, we assume that in addition to empirical distributions and we also have complete training examples, i.e., matching pairs , forming an empirical distribution . Note that here -s are empirical distributions, hence e.g. need not be a marginal of . Second, we assume that the decoder’s joint distribution is defined using its own parametrised prior for , i.e. .
The utility function of the decoder sums the -likelihoods of the training set, of which the likelihoods of examples and , are tractable. The missing information in examples with intractable -likelihood is completed by the encoder strategy . Proceeding in the same way for the encoder, we get the utility functions
(12a) | ||||
(12b) |
Although we follow the symmetric approach as before, the utilities (12) are not entirely symmetric due to the model asymmetry: the decoder has its own parametrised prior for $z$, whereas the encoder lacks a prior model for $x$.
Unsupervised Learning
By unsupervised learning we will understand the case when only is observed. The choice and interpretation of the space and the respective distribution is then completely free. We are interested in learning a decoder model and an encoder approximating .
The utility function for the decoder is given by its likelihood for the examples , completed by the encoder. To form a likelihood for the encoder, we consider examples generated by the decoder model. The resulting utility functions are
(13) |
In comparison with the ELBO approach, the required stochastic gradients of the log-likelihoods are easy to compute, as discussed in Section 3. Notice that the algorithm also applies when the latent distribution $\pi(z)$ is fixed and implicit, i.e. accessible by sampling only.
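A single-sample stochastic version of the unsupervised procedure can be sketched as follows. The example assumes binary $x$ and $z$ with tabular logits in place of networks, and an implicit latent distribution given only as a sampling callable `sample_prior`; these choices and all names are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sample_bernoulli(p1, rng):
    return 1 if rng.random() < p1 else 0

def train_unsupervised(data, sample_prior, steps, lr, rng):
    """Single-sample stochastic updates for binary x and z.

    theta[z]: logit of p(x=1|z); phi[x]: logit of q(z=1|x).
    sample_prior(rng) returns z in {0, 1}; its density is never evaluated.
    """
    theta, phi = [0.0, 0.0], [0.0, 0.0]
    for _ in range(steps):
        # Decoder step: a data example completed by the encoder.
        x = rng.choice(data)
        z = sample_bernoulli(sigmoid(phi[x]), rng)
        theta[z] += lr * (x - sigmoid(theta[z]))  # gradient of log p(x|z)
        # Encoder step: a latent sample completed by the decoder.
        z = sample_prior(rng)
        x = sample_bernoulli(sigmoid(theta[z]), rng)
        phi[x] += lr * (z - sigmoid(phi[x]))      # gradient of log q(z|x)
    return theta, phi

rng = random.Random(0)
data = [1, 1, 1, 0]  # toy observations of x (hypothetical)
theta, phi = train_unsupervised(data, lambda r: r.randint(0, 1), 2000, 0.05, rng)
```

The sampled completions enter only as fixed inputs to a log-likelihood gradient, so no gradient has to be propagated through the sampling step. Here the two steps are executed sequentially within each iteration.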
Hierarchical VAEs
Finally, we show that our unsupervised learning approach generalises to hierarchical / autoregressive VAEs. We assume that the hidden state consists of parts , and can be observed. Such models come in two variants. In the first one the factorisation order of the encoder is reverse to the factorisation order of the decoder. Examples are e.g. Helmholtz machines (Hinton et al., 1995) and deep belief networks (Hinton et al., 2006). Here, we will consider the second variant, in which the encoder and decoder have the same order of factorisation:
(14a) | ||||
(14b) |
The encoder of such models can share parameters with the decoder, in particular Sønderby et al. (2016) proposed to define the encoder by
(15) |
where is a factorised function of and are the hidden layer outputs of a deterministic encoder network , parametrised by . The strategy of the first player is represented by the decoder parameters $\theta$, while the strategy of the second player is represented by the encoder parameters $\phi$. The utility functions for unsupervised learning are as in (13). Thanks to the factorisation of the decoder and encoder, they decompose into sums over the blocks and and are tractable.
The model can be also learned “partially” semi-supervised by assuming that besides training examples we have access to a (usually smaller) set of training examples . This is relevant, for example, when represents some hidden state(s) like classes or segmentations, on which we want to condition the decoder . The additional training examples will add
(16a) | |||
(16b) |
to the respective utility functions.
5 RELATED WORK
Wake-Sleep
The learning algorithm (5) with utility functions (13) in the unsupervised case turns out to be equivalent to the wake-sleep (WS) algorithm first proposed by Hinton et al. (1995). However, we arrived at it from a conceptually new game-theoretic formulation, allowing for new analysis and generalisation to other settings (semi-supervised, partial observation scenarios). In Appendix B we give a brief overview of the original WS and follow-up works.
Implicit Prior
An important advantage of the proposed method is that it allows the prior to be implicit, i.e. accessible via samples only. Several works have extended VAEs to handle implicit encoders and priors. Mescheder et al. (2017) and Huszár (2017) estimate the log-density ratio in ELBO by learning a logistic regression discriminator. Similar to GANs, this requires an inner loop with a possibly complex discriminator. Molchanov et al. (2019) allow both the encoder and the prior to be an intractable mixture of tractable densities. At training time, a finite sample from the mixture is used to form a density estimate and a lower bound on ELBO. These approaches are substantially more complex than ours and have further limitations. The prior can be made completely implicit by assuming that the encoder–decoder model is consistent and hence defines a joint distribution and its marginals symmetrically. Towards this end, Liu et al. (2021) explicitly optimise consistency and an expression that matches the likelihood when assuming consistency.
Symmetric Learning
The asymmetry of the ELBO formulation has motivated several approaches alternative to ours. Dumoulin et al. (2017) minimise the Jensen-Shannon divergence between the joint encoder distribution $q(x, z)$ and the joint decoder distribution $p(x, z)$. To estimate this divergence, a discriminator of joint samples is learned alongside, as in GANs. Pu et al. (2017) use a similar approach to minimise the symmetrised KL divergence. Lamb et al. (2017) learn the MCMC encoder–decoder sampler by using a discriminator between data-clamped and free-running chains. An important difference to our work is that the game in these approaches is between the discriminator and the model, not between the decoder and encoder.
Unsupervised and Semi-Supervised VAEs
Unsupervised equilibrium learning with utilities (13) can be reinterpreted to facilitate theoretical comparison with ELBO alongside Proposition 1. Furthermore, the hierarchical model with observed (16) is closely related to semi-supervised learning with ELBO (Kingma et al., 2014). These connections are detailed in Appendix C.
6 EXPERIMENTS
Hierarchical VAE (MNIST)
[Figure 1: generated images. Columns: random latent codes, limiting distribution. Rows: ELBO, symmetric learning.]
The goal of this experiment is to compare the symmetric equilibrium learning and ELBO learning on a simple dataset: MNIST images binarised by a suitably chosen threshold. We consider two hierarchical VAE model variants, each with two groups of binary valued latent variables and . The decoder model is , where we assume a uniform distribution . The encoder for the first model variant (similar to ladder VAEs) factorises in the same order as the decoder, i.e. and shares parameters with the decoder as described in Sec. 4. The encoder in the second model variant factorises in reverse order, i.e. and shares no parameters with the decoder. The networks used in the encoders and decoders are standard deep convolutional networks of decreasing and increasing spatial resolution respectively. More details are provided in Appendix E. The code is available at https://github.com/dschles70/symvae-aistats2024. Training such models with ELBO requires a specialised gradient estimator for differentiating expectations over the encoder distribution w.r.t. its parameters. We use the estimator by Gregor et al. (2014), which is superior to straight-through and comparable to complex unbiased estimators for VAEs (Gu et al., 2016). Notice again that no such approximation is required for the symmetric equilibrium learning.
Besides validating the generative capabilities of the two resulting hierarchical VAEs, we want to analyse the consistency of their decoder–encoder pairs. We therefore generate images (i) from the decoder model and (ii) from the limiting distribution (see Sec. 3 for explanation). Fig. 1 and Table 1 indicate that the models obtained by symmetric learning achieve better consistency while also having slightly better FID scores. This is confirmed by tSNE embeddings of samples from the two models (see Appendix E).
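Table 1: FID scores for images generated from random latent codes and from the limiting distribution.
Model / algorithm | Random latent codes | Limiting distribution
---|---|---
LVAE, ELBO | 5.17 | 83.30
LVAE, symmetric | 1.73 | 3.63
RVAE, ELBO | 5.83 | 29.59
RVAE, symmetric | 0.81 | 5.40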
To further strengthen this finding, we conducted similar experiments for the Fashion-MNIST dataset. Results and details are given in Appendix F.
The next experiment aims to show that the internal representations of a hierarchical VAE can be learned to have good generative and discriminative capabilities at the same time, even without “supervised” terms in the encoder objective as in (16). For this we extend by ten additional binary variables, which encode the class labels (one-hot encoding). This means that combines latent variables with class labels . We learn the model by symmetric learning from labelled examples , but use the following utility functions
(17) |
This means that the class information is used only when learning the decoder (notice that factorises w.r.t. and ). The encoder is learned solely on examples generated from the decoder, i.e. without any discriminative terms. The learned encoder achieves 99% classification accuracy on the MNIST validation set, with almost no decrease of the FID scores for the generated images ( when sampled from the decoder and when sampled from the limiting distribution). We also analyse tSNE embeddings of samples of the latent part of and samples of , both from the prior distribution and from the limiting distribution . Fig. 2 reveals that the latent part of is fully class agnostic, whereas is clearly clustered w.r.t. the digit classes. This can be interpreted as follows. The latent part of is “transversal” to the class labels and presumably encodes image properties like stroke width, slant etc., whereas the internal representations in are clustered by digit classes and encode the appearance properties separately for each class.
Semantic Segmentation (CelebA)
The following experiments illustrate the flexibility of the proposed approach on an application which is not accessible by ELBO learning. We consider the task of semantic segmentation with the goal to build a generative image segmentation model which can (i) generate image and segmentation pairs, (ii) segment given images, and (iii) generate images given a segmentation.
We use the CelebA-HQ dataset (Karras et al., 2018) and downscale its images and segmentations to pixels for simplicity.
Let be an image and be a segmentation (a categorical variable for each pixel). In order to model a distribution , we might try to learn a VAE with a decoder and encoder , assuming e.g. a uniform prior distribution for the vector of binary latent variables . However, this alone will not meet our goals because we cannot access the resulting distributions and . We propose to model as the limiting distribution of a pair of parametrised conditional probability distributions and (see (10)). This means that the marginal probability distributions and are defined implicitly through the corresponding marginalisation constraints.
To summarise, the whole model consists of three learnable conditional probability distributions , and . This defines a nested game with three players. Their respective strategies are represented by , and . Their utility functions are
(18a) | ||||
(18b) | ||||
(18c) |
where Gibbs sampling is applied for obtaining pairs in the last utility function (see Appendix D for a detailed explanation).
To ease the training, we start by pre-training model parts for and separately. For the former we introduce latent variables , which should encode segmentation shapes, and define with uniform prior . The model for is a latent variable model with latent variables , also uniformly distributed a-priori, which should encode appearance properties, like e.g. segment colours, textures, characteristic shadows etc. Both and are equipped with corresponding encoders, i.e. and , and trained by symmetric learning, which is straightforward. All conditional probability distributions and are implemented as moderate complexity feed-forward networks, which output the parameters of the corresponding probability distribution. For example, is a diagonal normal distribution with means and standard deviations provided by the corresponding network.
Results for the learned are illustrated in Fig. 3 in the following way. We consider pairs of training examples, each consisting of an image and its segmentation. The first example is encoded by and the sampled latent code is used to decode the segmentation of the second example to an image by using .
After pre-training we extend the model part , learned in the previous step, to represent by adding an “additional branch”, i.e. we define
(19) |
where is the pre-trained network, denotes the segmentation in one-hot encoding and is the additional network, which makes dependent on and as well. Its initial weights are chosen so that it outputs zeros at the beginning.
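The effect of zero-initialising the additional branch can be illustrated with a toy sketch. It assumes, as one plausible reading of the construction, that the outputs of the pre-trained network and of the additional network are summed; the layer sizes, weights and function names below are hypothetical and chosen only for illustration.

```python
def linear(weights, bias, inputs):
    """Dense layer: weights[i][j] maps input j to output i."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def pretrained_net(z):
    """Stand-in for the pre-trained network (hypothetical fixed weights)."""
    return linear([[0.5, -1.0], [2.0, 0.3]], [0.1, -0.2], z)

def make_branch(n_in, n_out):
    """Additional branch whose weights start at zero, so it outputs zeros."""
    return {"w": [[0.0] * n_in for _ in range(n_out)], "b": [0.0] * n_out}

def extended_net(z, extra, branch):
    """Output of the pre-trained network plus the additional branch.

    `extra` stands for the additional conditioning input,
    e.g. a one-hot encoded segmentation.
    """
    base = pretrained_net(z)
    delta = linear(branch["w"], branch["b"], extra)
    return [a + d for a, d in zip(base, delta)]

z = [1.0, 0.0]
one_hot = [0.0, 1.0, 0.0]
branch = make_branch(n_in=3, n_out=2)
out = extended_net(z, one_hot, branch)
```

At initialisation the extended model reproduces the pre-trained one exactly, so joint training starts from the pre-trained solution. In a deeper branch, zero-initialising only the last layer would have the same effect.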
Finally, the model (6) is initialised by the pre-trained components and trained towards a Nash equilibrium for the three player game as explained above. Fig. 5 shows a few results. The model achieves 95.2% segmentation accuracy on the training set and 90.7% segmentation accuracy on the validation set.
An important property of the obtained model is its ability to complete missing information for any subset of its variables. Given a partial observation – e.g. an image part, or a segmentation part, or a combination of such parts – we can complete the missing data by sampling from the corresponding limiting distribution. We illustrate this property on the example of inference from incomplete images . Let consist of two parts: an observed part and a hidden part . In order to segment such an image by the maximum marginal decision strategy, we need to compute the marginal probabilities for each pixel . They can be estimated by Gibbs sampling, which alternately draws all unobserved random variables, including . We accumulate segmentation label frequencies for each pixel during the sampling and finally decide for the label with the highest occurrence. A few results are presented in Fig. 5. As compared to the segmentation from complete images, the segmentation accuracy drops from 95.2% to 92.8% for the training set and from 90.7% to 88.8% for the validation set. We consider this accuracy drop minor, because the segmentations inferred for the hidden image parts need not necessarily coincide with the ground truth – they should only be “plausible”, as can be seen in the figure. Although not the primary goal of this experiment, Gibbs sampling allows at the same time to reconstruct the image content in the hidden parts (i.e. in-painting). For this we employ a mean-marginal decision, i.e. we average all sampled image values observed during Gibbs sampling. Although the results are sometimes not perfect (see the last row in Fig. 5), they are nevertheless sufficient for inferring reasonable segmentations.
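The two decision rules used here, the maximum marginal decision for segmentations and the mean-marginal decision for in-painting, can be sketched as follows. The sketch assumes the Gibbs sampler has already produced lists of sampled segmentations and images (flattened to one value per pixel); the sampler itself is not shown and the function names are hypothetical.

```python
from collections import Counter

def max_marginal_segmentation(samples):
    """Per-pixel label with the highest frequency over sampled segmentations.

    samples: list of segmentations, each a list of labels (one per pixel).
    """
    n_pixels = len(samples[0])
    counts = [Counter() for _ in range(n_pixels)]
    for seg in samples:
        for i, label in enumerate(seg):
            counts[i][label] += 1
    return [c.most_common(1)[0][0] for c in counts]

def mean_marginal_image(samples):
    """Per-pixel average of sampled image values (used for in-painting)."""
    n = len(samples)
    return [sum(img[i] for img in samples) / n for i in range(len(samples[0]))]

# Toy samples standing in for the output of a Gibbs sampler.
seg_samples = [[0, 1, 2], [0, 1, 1], [0, 2, 1], [1, 1, 1]]
img_samples = [[0.0, 1.0], [1.0, 1.0], [0.5, 0.0], [0.5, 0.0]]
segmentation = max_marginal_segmentation(seg_samples)
inpainted = mean_marginal_image(img_samples)
```

The label frequencies are Monte Carlo estimates of the per-pixel marginal probabilities, so the decision becomes more reliable as more samples are accumulated.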
7 CONCLUSION
We propose an alternative learning approach for variational autoencoders. For this we view VAEs as decoder–encoder pairs and derive a symmetric learning formulation inspired by game theory, which leads to a simple learning algorithm for finding a Nash equilibrium. We prove its uniqueness under fairly general assumptions. The proposed method can be applied for various learning scenarios and for models with complex, possibly structured latent spaces. This includes implicit distributions in the latent space as well as discrete latent variables. We show experimentally that the models learned by this method are comparable to those obtained by ELBO learning and demonstrate its applicability for tasks that are not accessible by standard VAE learning.
Acknowledgements
We would like to thank our colleagues Tomas Werner and Denis Barucic for their continued interest in this work and their valuable comments and discussions which helped to improve the manuscript. We also thank the reviewers for their critical remarks, which encouraged us to present more experiments and to resolve remaining unclarities. B.F. gratefully acknowledges support by the Czech OP VVV project “Research Center for Informatics” (CZ.02.1.01/0.0/0.0/16019/0000765). D.S. was supported by the German Federal Ministry of Education and Research (BMBF) project 01/S18026A-F and by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) project 01MN23021A. A.S. was supported by the Czech Science Foundation grant GA24-12697S. The authors would like to thank the Center for Information Services and HPC (ZIH) at TU Dresden for providing computing resources.
References
- Arnold and Strauss (1991) B. C. Arnold and D. J. Strauss. Bivariate distributions with conditionals in prescribed exponential families. Journal of the Royal Statistical Society Series B (Methodological), 53(2), 1991.
- Bornschein and Bengio (2015) Jorg Bornschein and Yoshua Bengio. Reweighted wake-sleep. ArXiv, abs/1406.2751, 2015.
- Burda et al. (2016) Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In ICLR, 2016.
- Dadaneh et al. (2020) Siamak Zamani Dadaneh, Shahin Boluki, Mingzhang Yin, Mingyuan Zhou, and Xiaoning Qian. Pairwise supervised hashing with Bernoulli variational auto-encoder and self-control gradient estimator. In UAI, volume 124, 2020.
- Dumoulin et al. (2017) Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. In ICLR, 2017.
- Gregor et al. (2014) Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In ICML, 2014.
- Gu et al. (2016) Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. In ICLR, 2016.
- Hinton et al. (1995) Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey, and Radford M. Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214), 1995.
- Hinton et al. (2006) Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Comput., 18(7), 2006.
- Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, volume 33, 2020.
- Huszár (2017) Ferenc Huszár. Variational inference using implicit distributions. ArXiv, abs/1702.08235, 2017.
- Ikeda et al. (1998) Shiro Ikeda, Shun-ichi Amari, and Hiroyuki Nakahara. Convergence of the wake-sleep algorithm. In NeurIPS, volume 11, 1998.
- Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
- Kingma and Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
- Kingma et al. (2014) Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. In NeurIPS, 2014.
- Lamb et al. (2017) Alex M Lamb, Devon Hjelm, Yaroslav Ganin, Joseph Paul Cohen, Aaron C Courville, and Yoshua Bengio. Gibbsnet: Iterative adversarial inference for deep graphical models. In NeurIPS, volume 30, 2017.
- Le et al. (2020) Tuan Anh Le, Adam R. Kosiorek, N. Siddharth, Yee Whye Teh, and Frank Wood. Revisiting reweighted wake-sleep for models with stochastic control flow. In UAI, volume 115, 2020.
- Liu et al. (2021) Chang Liu, Haoyue Tang, Tao Qin, Jintao Wang, and Tie-Yan Liu. On the generative utility of cyclic conditionals. In NeurIPS, 2021.
- Mescheder et al. (2017) Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In ICML, 2017.
- Molchanov et al. (2019) Dmitry Molchanov, Valery Kharitonov, Artem Sobolev, and Dmitry P. Vetrov. Doubly semi-implicit variational inference. In AISTATS, volume 89, 2019.
- Pu et al. (2017) Yuchen Pu, Weiyao Wang, Ricardo Henao, Liqun Chen, Zhe Gan, Chunyuan Li, and Lawrence Carin. Adversarial symmetric variational autoencoder. In NeurIPS, volume 30, 2017.
- Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
- Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- Rosen (1965) J. B. Rosen. Existence and uniqueness of equilibrium points for concave n-person games. Econometrica, 33(3), 1965. doi: 10.2307/1911749.
- Shekhovtsov et al. (2022) Alexander Shekhovtsov, Dmitrij Schlesinger, and Boris Flach. VAE approximation error: ELBO and exponential families. In ICLR, 2022.
- Sønderby et al. (2016) Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In NeurIPS, volume 29, 2016.
- Vahdat and Kautz (2020) Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In NeurIPS, 2020.
- Vértes and Sahani (2018) Eszter Vértes and Maneesh Sahani. Flexible and accurate inference and learning for deep generative models. In NeurIPS, volume 31, 2018.
- Wenliang et al. (2020) Li Wenliang, Theodore Moskovitz, Heishiro Kanagawa, and Maneesh Sahani. Amortised learning by wake-sleep. In ICML, volume 119, 13–18 Jul 2020.
Appendix A PROOFS
In this section we provide proofs of the formal claims regarding uniqueness and consistency enforcement. See Theorem 1.
Proof.
We repeat here the model assumptions (3) for convenience
(20a) | ||||
(20b) |
Our proof relies on the classic result of Rosen (1965), who shows that games satisfying diagonal strict concavity (DSC), a condition stronger than concavity, have unique Nash equilibria.
Since the log-partition function of an exponential family is convex in its natural parameters, it follows that the game utilities are concave in their own strategies. A sufficient condition for the stronger DSC criterion is that the symmetrised Jacobian of the mapping
(21) |
is negative definite. The most convenient way to prove this condition is to “dualise” the game. Maximising w.r.t. is equivalent to finding the exponential family model, whose expected sufficient statistic coincides with . This follows from
(22) |
The corresponding dual task reads
(23a) | |||
(23b) |
This can be seen by noticing that (23) is a convex optimisation task with linear constraints. Hence, we can apply Fenchel duality
(24) |
where denotes the Fenchel conjugate function of . For our case, we have and the corresponding dual variables . The Fenchel conjugate of the function is . Substituting all terms in the rhs of (24) and solving for , we get the task .
Applying the same dualisation for , we obtain the following “dual” game. The strategy of the first player is represented by and the strategy of the second player is represented by . The utility functions and of the players depend on their respective strategy only. The game has additional linear constraints, where we assume the existence of an interior feasible point . The assertion of the theorem follows from Theorems 3, 4 and 9 of Rosen (1965) if we prove that the symmetrised Jacobian of the mapping
(25) |
is positive definite. This is trivial since the Jacobian is diagonal with elements in the first half of the diagonal and elements in its second half. ∎
See Proposition 1.
Proof.
For completeness, we include the fact that is an alternative form of the ELBO. It is verified as follows:
(26a) | |||
(26b) | |||
(26c) | |||
(26d) | |||
(26e) | |||
Therefore,
(27) |
Hence, for a fixed , utilities and share all local and global optima in . It is straightforward to see that is an equilibrium of the game with utilities and iff it is an equilibrium of the game with utilities (3). ∎
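For orientation, the standard decomposition of the ELBO behind this argument can be written out as follows. This is a sketch in assumed notation: $\pi(x)$ for the data distribution, $p(z)$ for the fixed latent prior, $p_\theta(x|z)$ for the decoder, $q_\phi(z|x)$ for the encoder, and $L_p$ for the decoder utility in the form of an expected conditional log-likelihood. These symbols are assumptions made for this illustration and may differ from the original equations (26) and (27).

```latex
\begin{align}
\mathbb{E}_{\pi(x)} \log p_\theta(x)
  &\geq \mathbb{E}_{\pi(x)} \mathbb{E}_{q_\phi(z|x)}
     \bigl[ \log p_\theta(x|z) + \log p(z) - \log q_\phi(z|x) \bigr] \\
  &= \underbrace{\mathbb{E}_{\pi(x)} \mathbb{E}_{q_\phi(z|x)} \log p_\theta(x|z)}_{L_p(\theta,\phi)}
     \;-\; \mathbb{E}_{\pi(x)} \, \mathrm{KL}\bigl( q_\phi(z|x) \,\|\, p(z) \bigr).
\end{align}
```

Under these assumptions the KL term does not depend on $\theta$, so for a fixed encoder the ELBO and $L_p$ differ only by a constant in $\theta$.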
Appendix B WAKE SLEEP
In this section we give a brief overview of the original wake-sleep (WS) algorithm and follow-up works.
Hinton et al. (1995) considered a multilayer network of stochastic neurons. The “recognition” (encoder) connections are used to convert the input vector into a representation in one or more layers of hidden units. The “generative” (decoder) connections are then used to reconstruct an approximation to the input vector from its underlying representation. In the wake phase of WS, given an observed sample from the training dataset, a sample of hidden states is obtained from the encoder network and the decoder is learned on the joint sample . In the sleep phase a joint sample is drawn from the decoder model and the encoder is learned.
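The two phases described above can be sketched on a toy model. The sketch below makes several simplifying assumptions that are not in the original algorithm: a single categorical latent variable instead of layers of binary units, tabular softmax encoder and decoder, a uniform latent prior, and plain stochastic gradient ascent. The function `wake_sleep` and its parameters are hypothetical names.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def wake_sleep(data, n_x, n_z, steps=2000, lr=0.1, seed=0):
    """Toy wake-sleep with tabular decoder p(x|z) and encoder q(z|x)."""
    rng = np.random.default_rng(seed)
    dec_logits = np.zeros((n_z, n_x))  # one row of logits per latent state z
    enc_logits = np.zeros((n_x, n_z))  # one row of logits per observation x
    for _ in range(steps):
        # Wake phase: x from the data, z from the encoder,
        # then one ascent step on log p(x|z) for the decoder.
        x = rng.choice(data)
        z = rng.choice(n_z, p=softmax(enc_logits[x]))
        grad = -softmax(dec_logits[z])
        grad[x] += 1.0
        dec_logits[z] += lr * grad
        # Sleep phase: z from the (assumed uniform) prior, x from the decoder,
        # then one ascent step on log q(z|x) for the encoder.
        z = rng.integers(n_z)
        x = rng.choice(n_x, p=softmax(dec_logits[z]))
        grad = -softmax(enc_logits[x])
        grad[z] += 1.0
        enc_logits[x] += lr * grad
    return softmax(dec_logits), softmax(enc_logits)

dec, enc = wake_sleep(np.array([0, 0, 1, 1, 1]), n_x=4, n_z=2, steps=500)
print(dec.round(3))
print(enc.round(3))
```

Each update uses only samples and the gradient of a log-probability with respect to the parameters of the distribution being learned, so no differentiation through the sampling step is needed.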
The model initially assumed binary units and a factorised encoder and decoder. In the case of a hierarchical encoder–decoder model, the learning decouples over layers and no back-propagation is needed. Extended to a deep exponential family model (Vértes and Sahani, 2018), it is equivalent to a hierarchical VAE with the reverse encoder structure.
Bornschein and Bengio (2015) use importance sampling, similar to IWAE (Burda et al., 2016), to tighten the bounds for the decoder and introduce a wake-phase (importance weighted) update of the encoder, which tightens the ELBO (as in VAE) as well.
Vértes and Sahani (2018) and Wenliang et al. (2020) showed that the encoder in WS can be specified implicitly by its mean parameters, which allows for encoders that are not conditionally independent. This makes encoders more flexible, so that a higher-quality decoder can be trained, but it impairs inference.
The advantage of not requiring differentiation through discrete sampling has been explored by Le et al. (2020) for models with stochastic control flow.
To our knowledge, prior work has neither extended WS to the semi-supervised setting nor discussed the question of why it is a reasonable algorithm. The only analysis attempt, by Ikeda et al. (1998), is limited to a strictly consistent encoder–decoder pair in a simple special case.
Appendix C (DIS-)SIMILARITIES TO ELBO
In this section we elaborate on the similarities and differences between symmetric learning and ELBO learning in the unsupervised as well as the semi-supervised case (Kingma et al., 2014).
Unsupervised
Recall that in the unsupervised case we consider utility functions
(28) |
As discussed in Proposition 1, the decoder utility can be equivalently replaced with the common ELBO (both have the same dependence on ). The difference to the VAE of Kingma and Welling (2014) is therefore only in the encoder learning. In a VAE the encoder is learned to tighten the ELBO, i.e. to minimise the so-called reverse KL divergence in the expectation over the data distribution:
(29) |
In equilibrium learning, minimising in (28) w.r.t. the encoder is equivalent to minimising
(30) |
which is a forward KL divergence between the same conditional distributions, and the expectation is over the generative model . The choice of the encoder as the true posterior, , when possible (i.e. for consistent models), is optimal for both ELBO and symmetric learning. In general, however, the two approaches lead to different preferred solutions.
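The last point can be illustrated with a small numerical sketch. It fixes a single observation, treats a hand-picked bimodal distribution over three latent states as the true posterior, and compares two hand-picked candidate encoders. All numbers are illustrative assumptions, not results from the paper, and the sketch ignores that the two objectives also take the outer expectation over different distributions.

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for strictly positive discrete distributions."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sum(a * (np.log(a) - np.log(b))))

posterior = np.array([0.49, 0.02, 0.49])  # bimodal "true posterior"
candidates = {
    "mode": np.array([0.90, 0.05, 0.05]),   # concentrates on one mode
    "cover": np.array([0.34, 0.32, 0.34]),  # spreads over both modes
}

# Reverse KL (as in ELBO learning): KL(encoder || posterior).
reverse = {name: kl(q, posterior) for name, q in candidates.items()}
# Forward KL (as in symmetric learning): KL(posterior || encoder).
forward = {name: kl(posterior, q) for name, q in candidates.items()}

print("reverse:", reverse)
print("forward:", forward)
```

With these numbers the reverse KL is smaller for the candidate that concentrates on one mode, whereas the forward KL is smaller for the candidate that covers both modes. Both divergences vanish when the encoder equals the posterior.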
Semi-Supervised
Semi-supervised learning of VAEs was previously considered by Kingma et al. (2014). It can be seen that the hierarchical model (14a) is a generalisation of the generative model of Kingma et al. (2014): the state consists of two parts , where is the image label, available only for a part of the images. Similarly to the unsupervised case, when learning the decoder for a fixed encoder, the learning objective (Kingma et al., 2014, Eq. 8) is equivalent to our .
Only the learning of encoder differs. In their formulation the encoder minimises
(31) | |||
where is an empirical coefficient. When there are no unlabelled pairs, the first term disappears and the ELBO learning approach (Kingma et al., 2014) decouples into the learning of a conditional VAE (decoder and encoder conditioned on : , ) and an independent discriminative learning of the encoder part from the labelled data only. Thus, the generative counterpart of the model has no effect on the learning of the recognition part (unless parameters are shared).
In our formulation the encoder maximises
(32) |
This objective is more homogeneous because both terms correspond to forward KL divergences. When there are no unlabelled training pairs, the objective stays the same and the encoder part still needs to fulfil two goals: to approximate the posterior of the decoder (in the expectation over the generated distribution , like in the unsupervised case) and to approximate the empirical distribution (in the expectation over ). A weighting coefficient might be appropriate here as well to balance the two objectives. Our semi-supervised MNIST experiment in Section 6 with utilities (6) shows that even when switching off the discriminative counterpart, the encoder still efficiently learns to classify.
Appendix D LEARNING MODELS WITH IMPLICIT MARGINALS
Here we give a more detailed derivation of the learning in situations where a joint model is given by means of its conditional distributions only, i.e. the marginal distributions are given implicitly. In particular, we used it in our experiments with CelebA to define and learn , where are images, are segmentations, and are latent variables. Since everything is conditioned on , we will omit it for clarity and use and as variables of interest, to be in line with our experiments.
With the above agreement, we want to learn two conditional probability distributions and . As both images and segmentations are rather complex, it is desirable to avoid making any assumptions about the prior (marginal) distributions and . Towards this end, we consider the MCMC process starting from a random state and alternately sampling using and . This process defines two limiting joint distributions, depending on which variable was sampled last:
(33) |
where and are solutions to the stationary equations
(34a) | ||||
(34b) |
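The alternating sampling process and its two limiting joint distributions can be sketched for small discrete variables. The names `u` and `v` below are placeholders for the two variables of interest, the conditionals are random tables rather than networks, and the stationary marginals are found by power iteration. All of these are assumptions made for illustration only.

```python
import numpy as np

def stationary(kernel, iters=10_000, tol=1e-13):
    """Stationary row vector m with m = m @ kernel, by power iteration."""
    m = np.full(kernel.shape[0], 1.0 / kernel.shape[0])
    for _ in range(iters):
        new = m @ kernel
        converged = np.abs(new - m).max() < tol
        m = new
        if converged:
            break
    return m

def limiting_joints(p_u_given_v, q_v_given_u):
    """Limiting joints, indexed [u, v], of alternately sampling u|v and v|u."""
    m_v = stationary(p_u_given_v @ q_v_given_u)  # chain v -> u -> v
    m_u = stationary(q_v_given_u @ p_u_given_v)  # chain u -> v -> u
    joint_u_last = (m_v[:, None] * p_u_given_v).T
    joint_v_last = m_u[:, None] * q_v_given_u
    return joint_u_last, joint_v_last

rng = np.random.default_rng(0)

# Consistent conditionals: both derived from one joint table, indexed [u, v].
joint = rng.random((3, 4))
joint /= joint.sum()
p = (joint / joint.sum(axis=0, keepdims=True)).T  # p(u|v), shape (4, 3)
q = joint / joint.sum(axis=1, keepdims=True)      # q(v|u), shape (3, 4)
j1, j2 = limiting_joints(p, q)
gap_consistent = np.abs(j1 - j2).max()

# Inconsistent conditionals: replace p(u|v) by an unrelated random table.
p_bad = rng.random((4, 3))
p_bad /= p_bad.sum(axis=1, keepdims=True)
j1b, j2b = limiting_joints(p_bad, q)
gap_inconsistent = np.abs(j1b - j2b).max()

print(gap_consistent, gap_inconsistent)
```

When both conditionals come from a common joint distribution, the two limiting joints coincide with it. For unrelated conditionals they share the same marginals but differ as joint distributions.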
It is natural to consider the mixture of these two limiting distributions
(35) |
as we suggest in (10). Our goal therefore will be to maximise the likelihood of the data under this mixture joint model. The likelihood can be lower-bounded w.r.t. mixture components as
(36) |
Note that this lower bound is tight if the mixture components coincide, i.e. and are consistent. The terms in (36) corresponding to and are tractable under assumption (2). However, and are not given in closed form and depend on both and . We approximate their defining equations (34) as
(37a) | ||||
(37b) |
and use these expressions in the mixture model (35). With this approximation, (36) sums the data likelihood terms with respect to separate model components , , and . Hence, optimising this sum decouples into optimising the two objectives
(38) |
independently in and , respectively. It remains only to explain how to handle and , which are still intractable. Substituting (37) and introducing a lower bound for w.r.t. summation over gives
(39) |
If we consider the equilibrium learning approach, the objective is to be optimised only w.r.t. its own parameters , and therefore we can drop and terms. Applying similar steps to leads to the following effective equilibrium learning objectives:
(40) |
(41) |
Note that the first terms in these utilities correspond to the pseudo-likelihood objective, whereas the mutual completion in the second terms additionally enforces consistency.
Appendix E ADDITIONAL DETAILS FOR MNIST EXPERIMENTS
[Figure: image grid comparing the models learned by ELBO (first row) and by symmetric learning (second row); columns: Random Latent Codes, Limiting Distribution.]
Here, we provide additional implementation details for the HVAE models used by symmetric learning and by ELBO optimisation in the first MNIST experiment. The first model variant is defined by the decoder and the encoder where is uniform and is the data distribution. The network architecture is shown in Fig. 6. The one-dimensional components are connected by a multi-layer perceptron (MLP) architecture. We used two hidden layers, hidden units each in our MLPs. Connections between and are implemented by standard convolutional encoder/decoder architectures with decreasing and increasing spatial resolutions respectively. Both encoder and decoder have 6 hidden layers, connected by 2D-convolution operations. To reduce the spatial dimension, some convolutions are performed with strides. We used the activation function everywhere. The network weights are learned using the Adam optimiser.
The hierarchical decoder consists of two “separate” networks, an MLP and a decoder, representing and respectively. The encoder corresponding to the direct factorisation order (shown in the figure) is a multi-head network. The common part is an encoder, which produces intermediate features, whereas the heads are an MLP for and a single fully connected layer for . Two network outputs and serve as multiplier to the hierarchical decoder model, so and . For the reverse factorisation order we keep the hierarchical encoder architecture basically the same but split it into two separate networks: the encoder for and the MLP for .
The learning curves for losses/utilities are shown in Fig. 7 for ELBO learning and symmetric learning respectively, as a function of gradient update steps. For better clarity all values are normalised by the number of corresponding elements, e.g. we show the per-pixel data-loss in ELBO. The convergence behaviour is very similar in both cases: all values converge very quickly to almost their final values, followed by a long period in which they change much more slowly. However, we observed that the quality of generated images keeps improving, even after the losses/utilities have almost reached saturation. Hence, we ran all our experiments with a small learning rate of for 1M gradient update steps (note: only the first 100k steps are shown in Fig. 7 for better visibility).
We further compare the HVAE models obtained by symmetric learning and by ELBO optimisation by embedding samples for and from (i) the prior distributions , , (ii) the posterior distributions , , and (iii) the limiting distributions and for each of the two models by tSNE. Fig. 10 shows that all three samples match well for the model learned by symmetric learning. This is however not the case for the model learned by ELBO.
Appendix F FASHION MNIST
We also tested our approach for HVAE with the direct encoder factorisation order on the Fashion MNIST dataset. The model is exactly the same as the one used in our first MNIST experiment, except:
- Images are grey-valued now. We model them by a Gaussian, where the means for all pixels are computed by a network, and the standard deviation is common for all pixels and does not depend on , i.e. . The network architecture for is the same as the decoder in the MNIST experiment, is learned alongside the network weights.
- We observed that the overall results are slightly better (especially for ELBO), when using ReLU activations in instead of used for MNIST.
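The Gaussian pixel model from the first item above can be sketched as follows. The per-pixel means are a placeholder array standing in for the network output, and parametrising the shared standard deviation by its logarithm is an assumption of this sketch, not a detail stated in the paper. The function name is hypothetical.

```python
import numpy as np

def gaussian_log_lik(x, mean, log_sigma):
    """Sum over pixels of log N(x_i; mean_i, sigma^2) with one shared sigma."""
    x, mean = np.asarray(x, float), np.asarray(mean, float)
    var = np.exp(2.0 * log_sigma)
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)))

rng = np.random.default_rng(0)
mean = rng.random((28, 28))                  # placeholder for the network output
x = mean + 0.1 * rng.normal(size=(28, 28))   # synthetic grey-valued image

# For fixed means, the likelihood-maximising shared sigma is the RMS residual.
sigma_ml = np.sqrt(np.mean((x - mean) ** 2))
best = gaussian_log_lik(x, mean, np.log(sigma_ml))
print(sigma_ml, best)
```

Because the standard deviation is shared by all pixels, it acts as a single learned scale that balances the reconstruction term against the other terms of the objective.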
The results are shown in Figs. 9 and 9. They confirm our finding that ELBO and symmetric learning are on par, while the latter produces more consistent encoder/decoder pairs.