Search | arXiv e-print repository

Causal normalizing flows: from theory to practice

Authors: Adrián Javaloy, Pablo Sánchez-Martín, Isabel Valera

Abstract: In this work, we deepen on the use of normalizing flows for causal reasoning. Specifically, we first leverage recent results on non-linear ICA to show that causal models are identifiable from observational data given a causal ordering, and thus can be recovered using autoregressive normalizing flows (NFs). Second, we analyze different design and learning choices for causal normalizing flows to cap… ▽ More In this work, we deepen on the use of normalizing flows for causal reasoning. Specifically, we first leverage recent results on non-linear ICA to show that causal models are identifiable from observational data given a causal ordering, and thus can be recovered using autoregressive normalizing flows (NFs). Second, we analyze different design and learning choices for causal normalizing flows to capture the underlying causal data-generating process. Third, we describe how to implement the do-operator in causal NFs, and thus, how to answer interventional and counterfactual questions. Finally, in our experiments, we validate our design and training choices through a comprehensive ablation study; compare causal NFs to other approaches for approximating causal models; and empirically demonstrate that causal NFs can be used to address real-world problems, where the presence of mixed discrete-continuous data and partial knowledge on the causal graph is the norm. The code for this work can be found at https://github.com/psanch21/causal-flows. △ Less

Submitted 8 December, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: 32 pages, 15 figures. Accepted as an Oral presentation at NeurIPS 2023

arXiv:2211.11853 [pdf, other]

Learnable Graph Convolutional Attention Networks

Authors: Adrián Javaloy, Pablo Sanchez-Martin, Amit Levi, Isabel Valera

Abstract: Existing Graph Neural Networks (GNNs) compute the message exchange between nodes by either aggregating uniformly (convolving) the features of all the neighboring nodes, or by applying a non-uniform score (attending) to the features. Recent works have shown the strengths and weaknesses of the resulting GNN architectures, respectively, GCNs and GATs. In this work, we aim at exploiting the strengths… ▽ More Existing Graph Neural Networks (GNNs) compute the message exchange between nodes by either aggregating uniformly (convolving) the features of all the neighboring nodes, or by applying a non-uniform score (attending) to the features. Recent works have shown the strengths and weaknesses of the resulting GNN architectures, respectively, GCNs and GATs. In this work, we aim at exploiting the strengths of both approaches to their full extent. To this end, we first introduce the graph convolutional attention layer (CAT), which relies on convolutions to compute the attention scores. Unfortunately, as in the case of GCNs and GATs, we show that there exists no clear winner between the three (neither theoretically nor in practice) as their performance directly depends on the nature of the data (i.e., of the graph and features). This result brings us to the main contribution of our work, the learnable graph convolutional attention network (L-CAT): a GNN architecture that automatically interpolates between GCN, GAT and CAT in each layer, by adding only two scalar parameters. Our results demonstrate that L-CAT is able to efficiently combine different GNN layers along the network, outperforming competing methods in a wide range of datasets, and resulting in a more robust model that reduces the need of cross-validating. △ Less

Submitted 28 February, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: Published as a conference paper at ICLR 2023. 35 pages, 5 figures

arXiv:2206.04496 [pdf, other]

Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization

Authors: Adrián Javaloy, Maryam Meghdadi, Isabel Valera

Abstract: A number of variational autoencoders (VAEs) have recently emerged with the aim of modeling multimodal data, e.g., to jointly model images and their corresponding captions. Still, multimodal VAEs tend to focus solely on a subset of the modalities, e.g., by fitting the image while neglecting the caption. We refer to this limitation as modality collapse. In this work, we argue that this effect is a c… ▽ More A number of variational autoencoders (VAEs) have recently emerged with the aim of modeling multimodal data, e.g., to jointly model images and their corresponding captions. Still, multimodal VAEs tend to focus solely on a subset of the modalities, e.g., by fitting the image while neglecting the caption. We refer to this limitation as modality collapse. In this work, we argue that this effect is a consequence of conflicting gradients during multimodal VAE training. We show how to detect the sub-graphs in the computational graphs where gradients conflict (impartiality blocks), as well as how to leverage existing gradient-conflict solutions from multitask learning to mitigate modality collapse. That is, to ensure impartial optimization across modalities. We apply our training framework to several multimodal VAE models, losses and datasets from the literature, and empirically show that our framework significantly improves the reconstruction performance, conditional generation, and coherence of the latent space across modalities. △ Less

Submitted 9 June, 2022; originally announced June 2022.

Comments: Accepted as a Spotlight paper at ICML 2022. 27 pages, 10 figures

arXiv:2103.02631 [pdf, other]

RotoGrad: Gradient Homogenization in Multitask Learning

Authors: Adrián Javaloy, Isabel Valera

Abstract: Multitask learning is being increasingly adopted in applications domains like computer vision and reinforcement learning. However, optimally exploiting its advantages remains a major challenge due to the effect of negative transfer. Previous works have tracked down this issue to the disparities in gradient magnitudes and directions across tasks, when optimizing the shared network parameters. While… ▽ More Multitask learning is being increasingly adopted in applications domains like computer vision and reinforcement learning. However, optimally exploiting its advantages remains a major challenge due to the effect of negative transfer. Previous works have tracked down this issue to the disparities in gradient magnitudes and directions across tasks, when optimizing the shared network parameters. While recent work has acknowledged that negative transfer is a two-fold problem, existing approaches fall short as they only focus on either homogenizing the gradient magnitude across tasks; or greedily change the gradient directions, overlooking future conflicts. In this work, we introduce RotoGrad, an algorithm that tackles negative transfer as a whole: it jointly homogenizes gradient magnitudes and directions, while ensuring training convergence. We show that RotoGrad outperforms competing methods in complex problems, including multi-label classification in CelebA and computer vision tasks in the NYUv2 dataset. A Pytorch implementation can be found in https://github.com/adrianjav/rotograd. △ Less

Submitted 16 February, 2022; v1 submitted 3 March, 2021; originally announced March 2021.

Comments: Spotlight at ICLR 2022. 24 pages, 9 figures

arXiv:2006.15090 [pdf, other]

Relative gradient optimization of the Jacobian term in unsupervised deep learning

Authors: Luigi Gresele, Giancarlo Fissore, Adrián Javaloy, Bernhard Schölkopf, Aapo Hyvärinen

Abstract: Learning expressive probabilistic models correctly describing the data is a ubiquitous problem in machine learning. A popular approach for solving it is map** the observations into a representation space with a simple joint distribution, which can typically be written as a product of its marginals -- thus drawing a connection with the field of nonlinear independent component analysis. Deep densi… ▽ More Learning expressive probabilistic models correctly describing the data is a ubiquitous problem in machine learning. A popular approach for solving it is map** the observations into a representation space with a simple joint distribution, which can typically be written as a product of its marginals -- thus drawing a connection with the field of nonlinear independent component analysis. Deep density models have been widely used for this task, but their maximum likelihood based training requires estimating the log-determinant of the Jacobian and is computationally expensive, thus imposing a trade-off between computation and expressive power. In this work, we propose a new approach for exact training of such neural networks. Based on relative gradients, we exploit the matrix structure of neural network parameters to compute updates efficiently even in high-dimensional spaces; the computational cost of the training is quadratic in the input size, in contrast with the cubic scaling of naive approaches. This allows fast training with objective functions involving the log-determinant of the Jacobian, without imposing constraints on its structure, in stark contrast to autoregressive normalizing flows. △ Less

Submitted 26 October, 2020; v1 submitted 26 June, 2020; originally announced June 2020.

arXiv:2002.11369 [pdf, other]

Lipschitz standardization for multivariate learning

Authors: Adrián Javaloy, Isabel Valera

Abstract: Probabilistic learning is increasingly being tackled as an optimization problem, with gradient-based approaches as predominant methods. When modelling multivariate likelihoods, a usual but undesirable outcome is that the learned model fits only a subset of the observed variables, overlooking the rest. In this work, we study this problem through the lens of multitask learning (MTL), where similar e… ▽ More Probabilistic learning is increasingly being tackled as an optimization problem, with gradient-based approaches as predominant methods. When modelling multivariate likelihoods, a usual but undesirable outcome is that the learned model fits only a subset of the observed variables, overlooking the rest. In this work, we study this problem through the lens of multitask learning (MTL), where similar effects have been broadly studied. While MTL solutions do not directly apply in the probabilistic setting (as they cannot handle the likelihood constraints) we show that similar ideas may be leveraged during data preprocessing. First, we show that data standardization often helps under common continuous likelihoods, but it is not enough in the general case, specially under mixed continuous and discrete likelihood models. In order for balance multivariate learning, we then propose a novel data preprocessing, Lipschitz standardization, which balances the local Lipschitz smoothness across variables. Our experiments on real-world datasets show that Lipschitz standardization leads to more accurate multivariate models than the ones learned using existing data preprocessing techniques. The models and datasets employed in the experiments can be found in https://github.com/adrianjav/lipschitz-standardization. △ Less

Submitted 21 October, 2020; v1 submitted 26 February, 2020; originally announced February 2020.

Showing 1–6 of 6 results for author: Javaloy, A