-
Uncertainty quantification in automated valuation models with locally weighted conformal prediction
Authors:
Anders Hjort,
Gudmund Horn Hermansen,
Johan Pensar,
Jonathan P. Williams
Abstract:
Non-parametric machine learning models, such as random forests and gradient boosted trees, are frequently used to estimate house prices due to their predictive accuracy, but such methods are often limited in their ability to quantify prediction uncertainty. Conformal Prediction (CP) is a model-agnostic framework for constructing confidence sets around machine learning prediction models with minima…
▽ More
Non-parametric machine learning models, such as random forests and gradient boosted trees, are frequently used to estimate house prices due to their predictive accuracy, but such methods are often limited in their ability to quantify prediction uncertainty. Conformal Prediction (CP) is a model-agnostic framework for constructing confidence sets around machine learning prediction models with minimal assumptions. However, due to the spatial dependencies observed in house prices, direct application of CP leads to confidence sets that are not calibrated everywhere, i.e., too large of confidence sets in certain geographical regions and too small in others. We survey various approaches to adjust the CP confidence set to account for this and demonstrate their performance on a data set from the housing market in Oslo, Norway. Our findings indicate that calibrating the confidence sets on a \textit{locally weighted} version of the non-conformity scores makes the coverage more consistently calibrated in different geographical regions. We also perform a simulation study on synthetically generated sale prices to empirically explore the performance of CP on housing market data under idealized conditions with known data-generating mechanisms.
△ Less
Submitted 11 December, 2023;
originally announced December 2023.
-
DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation
Authors:
Ghadi S. Al Hajj,
Johan Pensar,
Geir Kjetil Sandve
Abstract:
Data simulation is fundamental for machine learning and causal inference, as it allows exploration of scenarios and assessment of methods in settings with full control of ground truth. Directed acyclic graphs (DAGs) are well established for encoding the dependence structure over a collection of variables in both inference and simulation settings. However, while modern machine learning is applied t…
▽ More
Data simulation is fundamental for machine learning and causal inference, as it allows exploration of scenarios and assessment of methods in settings with full control of ground truth. Directed acyclic graphs (DAGs) are well established for encoding the dependence structure over a collection of variables in both inference and simulation settings. However, while modern machine learning is applied to data of an increasingly complex nature, DAG-based simulation frameworks are still confined to settings with relatively simple variable types and functional forms. We here present DagSim, a Python-based framework for DAG-based data simulation without any constraints on variable types or functional relations. A succinct YAML format for defining the simulation model structure promotes transparency, while separate user-provided functions for generating each variable based on its parents ensure simulation code modularization. We illustrate the capabilities of DagSim through use cases where metadata variables control shapes in an image and patterns in bio-sequences.
△ Less
Submitted 30 September, 2022; v1 submitted 6 May, 2022;
originally announced May 2022.
-
Improving generalization of machine learning-identified biomarkers with causal modeling: an investigation into immune receptor diagnostics
Authors:
Milena Pavlović,
Ghadi S. Al Hajj,
Chakravarthi Kanduri,
Johan Pensar,
Mollie Wood,
Ludvig M. Sollid,
Victor Greiff,
Geir Kjetil Sandve
Abstract:
Machine learning is increasingly used to discover diagnostic and prognostic biomarkers from high-dimensional molecular data. However, a variety of factors related to experimental design may affect the ability to learn generalizable and clinically applicable diagnostics. Here, we argue that a causal perspective improves the identification of these challenges and formalizes their relation to the rob…
▽ More
Machine learning is increasingly used to discover diagnostic and prognostic biomarkers from high-dimensional molecular data. However, a variety of factors related to experimental design may affect the ability to learn generalizable and clinically applicable diagnostics. Here, we argue that a causal perspective improves the identification of these challenges and formalizes their relation to the robustness and generalization of machine learning-based diagnostics. To make for a concrete discussion, we focus on a specific, recently established high-dimensional biomarker - adaptive immune receptor repertoires (AIRRs). Through simulations, we illustrate how major biological and experimental factors of the AIRR domain may influence the learned biomarkers. In conclusion, we argue that causal modeling improves machine learning-based biomarker robustness by identifying stable relations between variables and by guiding the adjustment of the relations and variables that vary between populations.
△ Less
Submitted 3 April, 2023; v1 submitted 20 April, 2022;
originally announced April 2022.
-
Towards Scalable Bayesian Learning of Causal DAGs
Authors:
Jussi Viinikka,
Antti Hyttinen,
Johan Pensar,
Mikko Koivisto
Abstract:
We give methods for Bayesian inference of directed acyclic graphs, DAGs, and the induced causal effects from passively observed complete data. Our methods build on a recent Markov chain Monte Carlo scheme for learning Bayesian networks, which enables efficient approximate sampling from the graph posterior, provided that each node is assigned a small number $K$ of candidate parents. We present algo…
▽ More
We give methods for Bayesian inference of directed acyclic graphs, DAGs, and the induced causal effects from passively observed complete data. Our methods build on a recent Markov chain Monte Carlo scheme for learning Bayesian networks, which enables efficient approximate sampling from the graph posterior, provided that each node is assigned a small number $K$ of candidate parents. We present algorithmic techniques to significantly reduce the space and time requirements, which make the use of substantially larger values of $K$ feasible. Furthermore, we investigate the problem of selecting the candidate parents per node so as to maximize the covered posterior mass. Finally, we combine our sampling method with a novel Bayesian approach for estimating causal effects in linear Gaussian DAG models. Numerical experiments demonstrate the performance of our methods in detecting ancestor-descendant relations, and in causal effect estimation our Bayesian method is shown to outperform previous approaches.
△ Less
Submitted 18 November, 2020; v1 submitted 30 September, 2020;
originally announced October 2020.
-
Learning pairwise Markov network structures using correlation neighborhoods
Authors:
Juri Kuronen,
Jukka Corander,
Johan Pensar
Abstract:
Markov networks are widely studied and used throughout multivariate statistics and computer science. In particular, the problem of learning the structure of Markov networks from data without invoking chordality assumptions in order to retain expressiveness of the model class has been given a considerable attention in the recent literature, where numerous constraint-based or score-based methods hav…
▽ More
Markov networks are widely studied and used throughout multivariate statistics and computer science. In particular, the problem of learning the structure of Markov networks from data without invoking chordality assumptions in order to retain expressiveness of the model class has been given a considerable attention in the recent literature, where numerous constraint-based or score-based methods have been introduced. Here we develop a new search algorithm for the network score-optimization that has several computational advantages and scales well to high-dimensional data sets. The key observation behind the algorithm is that the neighborhood of a variable can be efficiently captured using local penalized likelihood ratio (PLR) tests by exploiting an exponential decay of correlations across the neighborhood with an increasing graph-theoretic distance from the focus node. The candidate neighborhoods are then processed by a two-stage hill-climbing (HC) algorithm. Our approach, termed fully as PLRHC-BIC$_{0.5}$, compares favorably against the state-of-the-art methods in all our experiments spanning both low- and high-dimensional networks and a wide range of sample sizes. An efficient implementation of PLRHC-BIC$_{0.5}$ is freely available from the URL: https://github.com/jurikuronen/plrhc.
△ Less
Submitted 30 October, 2019;
originally announced October 2019.
-
High-dimensional structure learning of binary pairwise Markov networks: A comparative numerical study
Authors:
Johan Pensar,
Yingying Xu,
Santeri Puranen,
Maiju Pesonen,
Yoshiyuki Kabashima,
Jukka Corander
Abstract:
Learning the undirected graph structure of a Markov network from data is a problem that has received a lot of attention during the last few decades. As a result of the general applicability of the model class, a myriad of methods have been developed in parallel in several research fields. Recently, as the size of the considered systems has increased, the focus of new methods has been shifted towar…
▽ More
Learning the undirected graph structure of a Markov network from data is a problem that has received a lot of attention during the last few decades. As a result of the general applicability of the model class, a myriad of methods have been developed in parallel in several research fields. Recently, as the size of the considered systems has increased, the focus of new methods has been shifted towards the high-dimensional domain. In particular, introduction of the pseudo-likelihood function has pushed the limits of score-based methods which were originally based on the likelihood function. At the same time, methods based on simple pairwise tests have been developed to meet the challenges arising from increasingly large data sets in computational biology. Apart from being applicable to high-dimensional problems, methods based on the pseudo-likelihood and pairwise tests are fundamentally very different. To compare the accuracy of the different types of methods, an extensive numerical study is performed on data generated by binary pairwise Markov networks. A parallelizable Gibbs sampler, based on restricted Boltzmann machines, is proposed as a tool to efficiently sample from sparse high-dimensional networks. The results of the study show that pairwise methods can be more accurate than pseudo-likelihood methods in settings often encountered in high-dimensional structure learning applications.
△ Less
Submitted 12 July, 2019; v1 submitted 14 January, 2019;
originally announced January 2019.
-
Learning Gaussian Graphical Models With Fractional Marginal Pseudo-likelihood
Authors:
Janne Leppä-aho,
Johan Pensar,
Teemu Roos,
Jukka Corander
Abstract:
We propose a Bayesian approximate inference method for learning the dependence structure of a Gaussian graphical model. Using pseudo-likelihood, we derive an analytical expression to approximate the marginal likelihood for an arbitrary graph structure without invoking any assumptions about decomposability. The majority of the existing methods for learning Gaussian graphical models are either restr…
▽ More
We propose a Bayesian approximate inference method for learning the dependence structure of a Gaussian graphical model. Using pseudo-likelihood, we derive an analytical expression to approximate the marginal likelihood for an arbitrary graph structure without invoking any assumptions about decomposability. The majority of the existing methods for learning Gaussian graphical models are either restricted to decomposable graphs or require specification of a tuning parameter that may have a substantial impact on learned structures. By combining a simple sparsity inducing prior for the graph structures with a default reference prior for the model parameters, we obtain a fast and easily applicable scoring function that works well for even high-dimensional data. We demonstrate the favourable performance of our approach by large-scale comparisons against the leading methods for learning non-decomposable Gaussian graphical models. A theoretical justification for our method is provided by showing that it yields a consistent estimator of the graph structure.
△ Less
Submitted 25 February, 2016;
originally announced February 2016.
-
Labeled Directed Acyclic Graphs: a generalization of context-specific independence in directed graphical models
Authors:
Johan Pensar,
Henrik Nyman,
Timo Koski,
Jukka Corander
Abstract:
We introduce a novel class of labeled directed acyclic graph (LDAG) models for finite sets of discrete variables. LDAGs generalize earlier proposals for allowing local structures in the conditional probability distribution of a node, such that unrestricted label sets determine which edges can be deleted from the underlying directed acyclic graph (DAG) for a given context. Several properties of the…
▽ More
We introduce a novel class of labeled directed acyclic graph (LDAG) models for finite sets of discrete variables. LDAGs generalize earlier proposals for allowing local structures in the conditional probability distribution of a node, such that unrestricted label sets determine which edges can be deleted from the underlying directed acyclic graph (DAG) for a given context. Several properties of these models are derived, including a generalization of the concept of Markov equivalence classes. Efficient Bayesian learning of LDAGs is enabled by introducing an LDAG-based factorization of the Dirichlet prior for the model parameters, such that the marginal likelihood can be calculated analytically. In addition, we develop a novel prior distribution for the model structures that can appropriately penalize a model for its labeling complexity. A non-reversible Markov chain Monte Carlo algorithm combined with a greedy hill climbing approach is used for illustrating the useful properties of LDAG models for both real and synthetic data sets.
△ Less
Submitted 4 October, 2013;
originally announced October 2013.
-
Learning Chordal Markov Networks by Constraint Satisfaction
Authors:
Jukka Corander,
Tomi Janhunen,
Jussi Rintanen,
Henrik Nyman,
Johan Pensar
Abstract:
We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characteriza…
▽ More
We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. The resulting translations into propositional satisfiability and its extensions such as maximum satisfiability, satisfiability modulo theories, and answer set programming, enable us to prove optimal certain network structures which have been previously found by stochastic search.
△ Less
Submitted 3 October, 2013;
originally announced October 2013.