-
Knowledge-Guided Additive Modeling For Supervised Regression
Authors:
Yann Claes,
Vân Anh Huynh-Thu,
Pierre Geurts
Abstract:
Learning processes by exploiting restricted domain knowledge is an important task across a plethora of scientific areas, with more and more hybrid methods combining data-driven and model-based approaches. However, while such hybrid methods have been tested in various scientific applications, they have been mostly tested on dynamical systems, with only limited study about the influence of each mode…
▽ More
Learning processes by exploiting restricted domain knowledge is an important task across a plethora of scientific areas, with more and more hybrid methods combining data-driven and model-based approaches. However, while such hybrid methods have been tested in various scientific applications, they have been mostly tested on dynamical systems, with only limited study about the influence of each model component on global performance and parameter identification. In this work, we assess the performance of hybrid modeling against traditional machine learning methods on standard regression problems. We compare, on both synthetic and real regression problems, several approaches for training such hybrid models. We focus on hybrid methods that additively combine a parametric physical term with a machine learning term and investigate model-agnostic training procedures. We also introduce a new hybrid approach based on partial dependence functions. Experiments are carried out with different types of machine learning models, including tree-based models and artificial neural networks.
△ Less
Submitted 5 July, 2023;
originally announced July 2023.
-
From global to local MDI variable importances for random forests and when they are Shapley values
Authors:
Antonio Sutera,
Gilles Louppe,
Van Anh Huynh-Thu,
Louis Wehenkel,
Pierre Geurts
Abstract:
Random forests have been widely used for their ability to provide so-called importance measures, which give insight at a global (per dataset) level on the relevance of input variables to predict a certain output. On the other hand, methods based on Shapley values have been introduced to refine the analysis of feature relevance in tree-based models to a local (per instance) level. In this context,…
▽ More
Random forests have been widely used for their ability to provide so-called importance measures, which give insight at a global (per dataset) level on the relevance of input variables to predict a certain output. On the other hand, methods based on Shapley values have been introduced to refine the analysis of feature relevance in tree-based models to a local (per instance) level. In this context, we first show that the global Mean Decrease of Impurity (MDI) variable importance scores correspond to Shapley values under some conditions. Then, we derive a local MDI importance measure of variable relevance, which has a very natural connection with the global MDI measure and can be related to a new notion of local feature relevance. We further link local MDI importances with Shapley values and discuss them in the light of related measures from the literature. The measures are illustrated through experiments on several classification and regression problems.
△ Less
Submitted 3 November, 2021;
originally announced November 2021.
-
Optimizing model-agnostic Random Subspace ensembles
Authors:
Vân Anh Huynh-Thu,
Pierre Geurts
Abstract:
This paper presents a model-agnostic ensemble approach for supervised learning. The proposed approach is based on a parametric version of Random Subspace, in which each base model is learned from a feature subset sampled according to a Bernoulli distribution. Parameter optimization is performed using gradient descent and is rendered tractable by using an importance sampling approach that circumven…
▽ More
This paper presents a model-agnostic ensemble approach for supervised learning. The proposed approach is based on a parametric version of Random Subspace, in which each base model is learned from a feature subset sampled according to a Bernoulli distribution. Parameter optimization is performed using gradient descent and is rendered tractable by using an importance sampling approach that circumvents frequent re-training of the base models after each gradient descent step. The degree of randomization in our parametric Random Subspace is thus automatically tuned through the optimization of the feature selection probabilities. This is an advantage over the standard Random Subspace approach, where the degree of randomization is controlled by a hyper-parameter. Furthermore, the optimized feature selection probabilities can be interpreted as feature importance scores. Our algorithm can also easily incorporate any differentiable regularization term to impose constraints on these importance scores.
△ Less
Submitted 20 January, 2023; v1 submitted 7 September, 2021;
originally announced September 2021.
-
Gene regulatory network inference: an introductory survey
Authors:
Vân Anh Huynh-Thu,
Guido Sanguinetti
Abstract:
Gene regulatory networks are powerful abstractions of biological systems. Since the advent of high-throughput measurement technologies in biology in the late 90s, reconstructing the structure of such networks has been a central computational problem in systems biology. While the problem is certainly not solved in its entirety, considerable progress has been made in the last two decades, with matur…
▽ More
Gene regulatory networks are powerful abstractions of biological systems. Since the advent of high-throughput measurement technologies in biology in the late 90s, reconstructing the structure of such networks has been a central computational problem in systems biology. While the problem is certainly not solved in its entirety, considerable progress has been made in the last two decades, with mature tools now available. This chapter aims to provide an introduction to the basic concepts underpinning network inference tools, attempting a categorisation which highlights commonalities and relative strengths. While the chapter is meant to be self-contained, the material presented should provide a useful background to the later, more specialised chapters of this book.
△ Less
Submitted 19 December, 2018; v1 submitted 12 January, 2018;
originally announced January 2018.
-
Context-dependent feature analysis with random forests
Authors:
Antonio Sutera,
Gilles Louppe,
Vân Anh Huynh-Thu,
Louis Wehenkel,
Pierre Geurts
Abstract:
In many cases, feature selection is often more complicated than identifying a single subset of input variables that would together explain the output. There may be interactions that depend on contextual information, i.e., variables that reveal to be relevant only in some specific circumstances. In this setting, the contribution of this paper is to extend the random forest variable importances fram…
▽ More
In many cases, feature selection is often more complicated than identifying a single subset of input variables that would together explain the output. There may be interactions that depend on contextual information, i.e., variables that reveal to be relevant only in some specific circumstances. In this setting, the contribution of this paper is to extend the random forest variable importances framework in order (i) to identify variables whose relevance is context-dependent and (ii) to characterize as precisely as possible the effect of contextual information on these variables. The usage and the relevance of our framework for highlighting context-dependent variables is illustrated on both artificial and real datasets.
△ Less
Submitted 12 May, 2016;
originally announced May 2016.
-
Bridging physiological and evolutionary time scales in a gene regulatory network
Authors:
Gwenaëlle Marchand,
Vân Anh Huynh-Thu,
Nolan Kane,
Sandrine Arribat,
Didier Varès,
David Rengel,
Sandrine Balzergue,
Loren Rieseberg,
Patrick Vincourt,
Pierre Geurts,
Matthieu Vignes,
Nicolas B. Langlade
Abstract:
Gene regulatory networks (GRN) govern phenotypic adaptations and reflect the trade-offs between physiological responses and evolutionary adaptation that act at different time scales. To identify patterns of molecular function and genetic diversity in GRNs, we studied the drought response of the common sunflower, Helianthus annuus, and how the underlying GRN is related to its evolution. We examined…
▽ More
Gene regulatory networks (GRN) govern phenotypic adaptations and reflect the trade-offs between physiological responses and evolutionary adaptation that act at different time scales. To identify patterns of molecular function and genetic diversity in GRNs, we studied the drought response of the common sunflower, Helianthus annuus, and how the underlying GRN is related to its evolution. We examined the responses of 32,423 expressed sequences to drought and to abscisic acid and selected 145 co-expressed transcripts. We characterized their regulatory relationships in nine kinetic studies based on different hormones. From this, we inferred a GRN by meta-analyses of a Gaussian graphical model and a random forest algorithm and studied the genetic differentiation among populations (FST) at nodes. We identified two main hubs in the network that transport nitrate in guard cells. This suggests that nitrate transport is a critical aspect of sunflower physiological response to drought. We observed that differentiation of the network genes in elite sunflower cultivars is correlated with their position and connectivity. This systems biology approach combined molecular data at different time scales and identified important physiological processes. At the evolutionary level, we propose that network topology could influence responses to human selection and possibly adaptation to dry environments.
△ Less
Submitted 19 March, 2014; v1 submitted 24 September, 2013;
originally announced September 2013.