
Recovering Latent Confounders from High-dimensional Proxy Variables

Authors: Nathan Mankovich, Homer Durand, Emiliano Diaz, Gherardo Varando, Gustau Camps-Valls

Abstract: Detecting latent confounders from proxy variables is an essential problem in causal effect estimation. Previous approaches are limited to low-dimensional proxies, sorted proxies, and binary treatments. We remove these assumptions and present a novel Proxy Confounder Factorization (PCF) framework for continuous treatment effect estimation when latent confounders manifest through high-dimensional, mixed proxy variables. For specific sample sizes, our two-step PCF implementation, using Independent Component Analysis (ICA-PCF), and the end-to-end implementation, using Gradient Descent (GD-PCF), achieve high correlation with the latent confounder and low absolute error in causal effect estimation with synthetic datasets in the high sample size regime. Even when faced with climate data, ICA-PCF recovers four components that explain $75.9\%$ of the variance in the North Atlantic Oscillation, a known confounder of precipitation patterns in Europe. Code for our PCF implementations and experiments can be found here: https://github.com/IPL-UV/confound_it. The proposed methodology constitutes a stepping stone towards discovering latent confounders and can be applied to many problems in disciplines dealing with high-dimensional observed proxies, e.g., spatiotemporal fields.

Submitted 21 March, 2024; originally announced March 2024.

arXiv:2305.13341 [pdf, other]

Discovering Causal Relations and Equations from Data

Authors: Gustau Camps-Valls, Andreas Gerhardus, Urmi Ninad, Gherardo Varando, Georg Martius, Emili Balaguer-Ballester, Ricardo Vinuesa, Emiliano Diaz, Laure Zanna, Jakob Runge

Abstract: Physics is a field of science that has traditionally used the scientific method to answer questions about why natural phenomena occur and to make testable models that explain the phenomena. Discovering equations, laws and principles that are invariant, robust and causal explanations of the world has been fundamental in physical sciences throughout the centuries. Discoveries emerge from observing the world and, when possible, performing interventional studies in the system under study. With the advent of big data and the use of data-driven methods, causal and equation discovery fields have grown and made progress in computer science, physics, statistics, philosophy, and many applied fields. All these domains are intertwined and can be used to discover causal relations, physical laws, and equations from observational data. This paper reviews the concepts, methods, and relevant works on causal and equation discovery in the broad field of Physics and outlines the most important challenges and promising future lines of research. We also provide a taxonomy for observational causal and equation discovery, point out connections, and showcase a complete set of case studies in Earth and climate sciences, fluid dynamics and mechanics, and the neurosciences. This review demonstrates that discovering fundamental laws and causal relations by observing natural phenomena is being revolutionised with the efficient exploitation of observational data, modern machine learning algorithms and the interaction with domain knowledge. Exciting times are ahead with many challenges and opportunities to improve our understanding of complex systems.

Submitted 21 May, 2023; originally announced May 2023.

Comments: 137 pages

arXiv:2012.04922 [pdf, other]

Consistent regression of biophysical parameters with kernel methods

Authors: Emiliano Díaz, Adrián Pérez-Suay, Valero Laparra, Gustau Camps-Valls

Abstract: This paper introduces a novel statistical regression framework that allows the incorporation of consistency constraints. A linear and nonlinear (kernel-based) formulation are introduced, and both imply closed-form analytical solutions. The models exploit all the information from a set of drivers while being maximally independent of a set of auxiliary, protected variables. We successfully illustrate the performance in the estimation of chlorophyll content.

Submitted 9 December, 2020; originally announced December 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:1710.05578

arXiv:1704.01932 [pdf]

Estimación de la inicial de referencia utilizando simulación

Authors: Emiliano Díaz

Abstract: The method proposed by Bernardo and Smith [2000] to approximate reference priors by simulation was analyzed with the objective of improving the procedure in order to obtain consistent estimators and to allow the estimation of asymptotic probability intervals. In this sense, the variance of Bernardo's estimator was derived and was used to construct probability intervals that permitted the expression of the estimation error as a function of sample size. Additionally, a variance reduction technique (common random numbers) was explored as a means to obtain more precise estimations with smaller sample sizes. This technique was found to considerably reduce estimation error for some of the examples explored. In other cases the use of the technique resulted in zero estimation error given that the estimator does not depend on the sample.

Submitted 1 April, 2017; originally announced April 2017.

Comments: in Spanish

arXiv:1704.00829 [pdf, other]

Online deforestation detection

Authors: Emiliano Diaz

Abstract: Deforestation detection using satellite images can make an important contribution to forest management. Current approaches can be broadly divided into those that compare two images taken at similar periods of the year and those that monitor changes by using multiple images taken during the growing season. The CMFDA algorithm described in Zhu et al. (2012) is an algorithm that builds on the latter category by implementing a year-long, continuous, time-series based approach to monitoring images. This algorithm was developed for 30m resolution, 16-day frequency reflectance data from the Landsat satellite. In this work we adapt the algorithm to 1km, 16-day frequency reflectance data from the MODIS sensor aboard the Terra satellite. The CMFDA algorithm is composed of two submodels which are fitted on a pixel-by-pixel basis. The first estimates the amount of surface reflectance as a function of the day of the year. The second estimates the occurrence of a deforestation event by comparing the last few predicted and real reflectance values. For this comparison, the reflectance observations for six different bands are first combined into a forest index. Real and predicted values of the forest index are then compared and high absolute differences for consecutive observation dates are flagged as deforestation events. Our adapted algorithm also uses the two-model framework. However, since the MODIS 13A2 dataset used includes reflectance data for different spectral bands than those included in the Landsat dataset, we cannot construct the forest index. Instead we propose two contrasting approaches: a multivariate and an index approach similar to that of CMFDA.

Submitted 3 April, 2017; originally announced April 2017.

arXiv:1704.00588 [pdf, other]

Causality and surrogate variable analysis

Authors: Emiliano Diaz

Abstract: Gene expression depends on thousands of factors and we usually only have access to tens or hundreds of observations of gene expression levels, meaning we are in a high-dimensional setting. Additionally, we don't always observe or care about all the factors. However, many different gene expression levels depend on a set of common factors. By observing the joint variance of the gene expression levels together with the observed primary variables (those we care about), Surrogate Variable Analysis (SVA) seeks to estimate the remaining unobserved factors. The ultimate goal is to assess whether the primary variable (or vector) has a significant effect on the different gene expression levels, but without estimating unobserved factors first the various regression models and hypothesis tests are dependent, which complicates significance analysis. In this work we define a class of additive gene expression structural equation models (SEMs) which are convenient for modeling gene expression data and which provide a useful framework to understand the various steps of the SVA methodology. We justify the use of this class from a modeling viewpoint but also from a causality viewpoint by exploring the independence and causality properties of this class and comparing to the biologically driven data assumptions. For this we use some of the theory that has been developed elsewhere on graphical models and causality. We then give a detailed description of the SVA methodology and its implementation in the R package sva, referring each step to different parts of the additive gene expression SEM defined previously.

Submitted 3 April, 2017; originally announced April 2017.

arXiv:1704.00575 [pdf, other]

Sparse mean localization by information theory

Authors: Emiliano Diaz

Abstract: Sparse feature selection is necessary when we fit statistical models, we have access to a large group of features, don't know which are relevant, but assume that most are not. Alternatively, when the number of features is larger than the available data the model becomes overparametrized and the sparse feature selection task involves selecting the most informative variables for the model. When the model is a simple location model and the number of relevant features does not grow with the total number of features, sparse feature selection corresponds to sparse mean estimation. We deal with a simplified mean estimation problem consisting of an additive model with Gaussian noise and mean that is in a restricted, finite hypothesis space. This restriction simplifies the mean estimation problem into a selection problem of combinatorial nature. Although the hypothesis space is finite, its size is exponential in the dimension of the mean. In limited data settings and when the size of the hypothesis space depends on the amount of data or on the dimension of the data, choosing an approximation set of hypotheses is a desirable approach. Choosing a set of hypotheses instead of a single one implies replacing the bias-variance trade-off with a resolution-stability trade-off. Generalization capacity provides a resolution selection criterion based on allowing the learning algorithm to communicate the largest amount of information in the data to the learner without error. In this work the theory of approximation set coding and generalization capacity is explored in order to understand this approach. We then apply the generalization capacity criterion to the simplified sparse mean estimation problem and detail an importance sampling algorithm which at once solves the difficulty posed by large hypothesis spaces and the slow convergence of uniform sampling algorithms.

Submitted 3 April, 2017; originally announced April 2017.

Showing 1–7 of 7 results for author: Díaz, E