-
Computationally efficient multi-level Gaussian process regression for functional data observed under completely or partially regular sampling designs
Authors:
Adam Gorm Hoffmann,
Claus Thorn Ekstrøm,
Andreas Kryger Jensen
Abstract:
Gaussian process regression is a frequently used statistical method for flexible yet fully probabilistic non-linear regression modeling. A common obstacle is its computational complexity which scales poorly with the number of observations. This is especially an issue when applying Gaussian process models to multiple functions simultaneously in various applications of functional data analysis.
We consider a multi-level Gaussian process regression model where a common mean function and individual subject-specific deviations are modeled simultaneously as latent Gaussian processes. We derive exact analytic and computationally efficient expressions for the log-likelihood function and the posterior distributions in the case where the observations are sampled on either a completely or partially regular grid. This enables us to fit the model to large data sets that are currently computationally inaccessible using a standard implementation. We show through a simulation study that our analytic expressions are several orders of magnitude faster compared to a standard implementation, and we provide an implementation in the probabilistic programming language Stan.
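For orientation, the "standard implementation" whose cost motivates the paper can be sketched as below. This is not the authors' multi-level model or their Stan code. It is an ordinary single-function Gaussian process log marginal likelihood, assuming a squared-exponential kernel and Gaussian noise; the function and parameter names (`rbf_kernel`, `gp_log_marginal_likelihood`, `magnitude`, `length_scale`, `noise_sd`) are hypothetical. The Cholesky factorization of the n x n covariance matrix is the step that scales cubically with the number of observations.

```python
import numpy as np

def rbf_kernel(x1, x2, magnitude=1.0, length_scale=1.0):
    # Squared-exponential covariance between two sets of 1-D inputs (assumed kernel).
    d = x1[:, None] - x2[None, :]
    return magnitude**2 * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_log_marginal_likelihood(x, y, noise_sd=0.1, magnitude=1.0, length_scale=1.0):
    # Log density of y under a zero-mean GP prior plus independent Gaussian noise.
    n = len(x)
    K = rbf_kernel(x, x, magnitude, length_scale) + noise_sd**2 * np.eye(n)
    L = np.linalg.cholesky(K)  # O(n^3): the computational bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    log_det_half = np.log(np.diag(L)).sum()  # equals 0.5 * log|K|
    return -0.5 * y @ alpha - log_det_half - 0.5 * n * np.log(2.0 * np.pi)
```

Stacking the observations from many subjects into a single covariance matrix makes n grow quickly, which is the setting in which the abstract's analytic expressions are reported to help.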
Submitted 19 June, 2024;
originally announced June 2024.
-
Causal discovery for observational sciences using supervised machine learning
Authors:
Anne Helby Petersen,
Joseph Ramsey,
Claus Thorn Ekstrøm,
Peter Spirtes
Abstract:
Causal inference can estimate causal effects, but unless data are collected experimentally, statistical analyses must rely on pre-specified causal models. Causal discovery algorithms are empirical methods for constructing such causal models from data.
Several asymptotically correct methods already exist, but they generally struggle on smaller samples. Moreover, most methods focus on very sparse causal models, which may not always be a realistic representation of real-life data generating mechanisms. Finally, while causal relationships suggested by the methods often hold true, their claims about causal non-relatedness have high error rates. This non-conservative error tradeoff is not ideal for observational sciences, where the resulting model is directly used to inform causal inference: A causal model with many missing causal relations entails too strong assumptions and may lead to biased effect estimates.
We propose a new causal discovery method that addresses these three shortcomings: Supervised learning discovery (SLdisco). SLdisco uses supervised machine learning to obtain a mapping from observational data to equivalence classes of causal models.
We evaluate SLdisco in a large simulation study based on Gaussian data and we consider several choices of model size and sample size. We find that SLdisco is more conservative, only moderately less informative and less sensitive towards sample size than existing procedures.
We furthermore provide a real epidemiological data application. We use random subsampling to investigate real data performance on small samples and again find that SLdisco is less sensitive towards sample size and hence seems to better utilize the information available in small datasets.
Submitted 14 May, 2022; v1 submitted 25 February, 2022;
originally announced February 2022.
-
Ranking of average treatment effects with generalized random forests for time-to-event outcomes
Authors:
Helene C. W. Rytgaard,
Claus T. Ekstrøm,
Lars V. Kessing,
Thomas A. Gerds
Abstract:
In this paper we present a data-adaptive procedure for estimating average treatment effects in a time-to-event setting based on generalized random forests. In these kinds of settings, the definition of causal effect parameters is complicated by competing risks; here we distinguish between treatment effects on the crude and the net probabilities, respectively. To handle right-censoring, and to switch between crude and net probabilities, we propose a two-step procedure for estimation, applying inverse probability weighting to construct time-point specific weighted outcomes as input for the forest. The forest adaptively handles confounding of the treatment assigned by applying a splitting rule that targets a causal parameter. We demonstrate that our method is effective for a causal search through a list of treatments to be ranked according to the magnitude of their effect. We further apply our method to a dataset from the Danish health registries where it is of interest to discover drugs with an unexpected protective effect against relapse of severe depression.
Submitted 27 April, 2021;
originally announced April 2021.
-
Having a Ball: evaluating scoring streaks and game excitement using in-match trend estimation
Authors:
Claus Thorn Ekstrøm,
Andreas Kryger Jensen
Abstract:
Many popular sports involve matches between two teams or players where each team has the possibility of scoring points throughout the match. While the overall match winner and result are interesting, they convey little information about the underlying scoring trends throughout the match. Modeling approaches that accommodate a finer granularity of the score difference throughout the match are needed to evaluate in-game strategies, discuss scoring streaks, team strengths, and other aspects of the game.
We propose a latent Gaussian process to model the score difference between two teams and introduce the Trend Direction Index as an easily interpretable probabilistic measure of the current trend in the match as well as a measure of post-game trend evaluation. In addition we propose the Excitement Trend Index - the expected number of monotonicity changes in the running score difference - as a measure of overall game excitement.
Our proposed methodology is applied to all 1143 matches from the 2019-2020 National Basketball Association (NBA) season. We show how the trends can be interpreted in individual games and how the excitement score can be used to cluster teams according to how exciting they are to watch.
Submitted 22 December, 2020;
originally announced December 2020.
-
Quantifying the Trendiness of Trends
Authors:
Andreas Kryger Jensen,
Claus Thorn Ekstrøm
Abstract:
News media often report that the trend of some public health outcome has changed. These statements are frequently based on longitudinal data, and the change in trend is typically found to have occurred at the most recent data collection time point - if no change had occurred the story is less likely to be reported. Such claims may potentially influence public health decisions on a national level.
We propose two measures for quantifying the trendiness of trends. Assuming that reality evolves in continuous time we define what constitutes a trend and a change in trend, and introduce a probabilistic Trend Direction Index. This index has the interpretation of the probability that a latent characteristic has changed monotonicity at any given time conditional on observed data. We also define an index of Expected Trend Instability quantifying the expected number of changes in trend on an interval.
Using a latent Gaussian Process model we show how the Trend Direction Index and the Expected Trend Instability can be estimated in a Bayesian framework and use the methods to analyze the proportion of smokers in Denmark during the last 20 years, and the development of new COVID-19 cases in Italy from February 24th onwards.
Submitted 3 October, 2020; v1 submitted 26 December, 2019;
originally announced December 2019.
-
Evaluating one-shot tournament predictions
Authors:
Claus Thorn Ekstrøm,
Hans Van Eetvelde,
Christophe Ley,
Ulf Brefeld
Abstract:
We introduce the Tournament Rank Probability Score (TRPS) as a measure to evaluate and compare pre-tournament predictions, where predictions of the full tournament results are required to be available before the tournament begins. The TRPS handles partial ranking of teams, gives credit to predictions that are only slightly wrong, and can be modified with weights to stress the importance of particular features of the tournament prediction. Thus, the TRPS is more flexible than the commonly preferred log loss score for such tasks. In addition, we show how predictions from historic tournaments can be optimally combined into ensemble predictions in order to maximize the TRPS for a new tournament.
Submitted 6 December, 2019;
originally announced December 2019.
-
Moment-based Estimation of Mixtures of Regression Models
Authors:
Claus Thorn Ekstrøm,
Christian Bressen Pipper
Abstract:
Finite mixtures of regression models provide a flexible modeling framework for many phenomena. Using moment-based estimation of the regression parameters, we develop unbiased estimators with a minimum of assumptions on the mixture components. In particular, only the average regression model for one of the components in the mixture model is needed, and no requirements are placed on the distributions. The consistency and asymptotic distribution of the estimators are derived, and the proposed method is validated through a series of simulation studies and is shown to be highly accurate. We illustrate the use of the moment-based mixture of regression models with an application to wine quality data.
Submitted 15 May, 2019;
originally announced May 2019.
-
Sequential rank agreement methods for comparison of ranked lists
Authors:
Claus Thorn Ekstrøm,
Thomas Alexander Gerds,
Andreas Kryger Jensen,
Kasper Brink-Jensen
Abstract:
The comparison of alternative rankings of a set of items is a general and prominent task in applied statistics. Predictor variables are ranked according to magnitude of association with an outcome, prediction models rank subjects according to the personalized risk of an event, and genetic studies rank genes according to their difference in gene expression levels. This article constructs measures of the agreement of two or more ordered lists. We use the standard deviation of the ranks to define a measure of agreement that both provides an intuitive interpretation and can be applied to any number of lists even if some or all are incomplete or censored. The approach can identify change-points in the agreement of the lists, and the sequential changes of agreement as a function of the depth of the lists can be compared graphically to a permutation-based reference set. The usefulness of these tools is illustrated using gene rankings, and using data from two Danish ovarian cancer studies where we assess the within and between agreement of different statistical classification methods.
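The idea of measuring agreement through the standard deviation of the ranks can be illustrated with a simplified sketch. It assumes complete lists with no ties and pools the per-item rank variances over all items that appear in the top d of at least one list; the function name `sequential_rank_agreement` and this exact pooling rule are assumptions made for illustration and may differ from the definition in the paper, which also covers incomplete and censored lists.

```python
import numpy as np

def sequential_rank_agreement(ranks):
    """Simplified agreement curve for complete rankings.

    ranks: array of shape (n_items, n_lists); ranks[i, j] is the rank
    (1 = top) that list j assigns to item i.
    Returns an array whose entry d-1 is the pooled standard deviation of
    the ranks of all items ranked in the top d by at least one list.
    """
    ranks = np.asarray(ranks, dtype=float)
    n_items = ranks.shape[0]
    item_var = ranks.var(axis=1, ddof=1)   # per-item sample variance across lists
    best_rank = ranks.min(axis=1)          # best position of each item in any list
    sra = np.empty(n_items)
    for d in range(1, n_items + 1):
        in_depth = best_rank <= d          # items seen in some list's top d
        sra[d - 1] = np.sqrt(item_var[in_depth].mean())
    return sra
```

Values near zero at a given depth indicate that the lists agree on the items near the top; a rising curve indicates growing disagreement further down the lists.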
Submitted 27 August, 2015;
originally announced August 2015.
-
Inference for feature selection using the Lasso with high-dimensional data
Authors:
Kasper Brink-Jensen,
Claus Thorn Ekstrøm
Abstract:
Penalized regression models such as the Lasso have proved useful for variable selection in many fields - especially for situations with high-dimensional data where the number of predictors far exceeds the number of observations. These methods identify and rank variables of importance but do not generally provide any inference for the selected variables. Thus, the variables selected might be the "most important" but need not be significant. We propose a significance test for the selection found by the Lasso. We introduce a procedure that computes inference and $p$-values for features chosen by the Lasso. This method rephrases the null hypothesis and uses a randomization approach which ensures that the error rate is controlled even for small samples. We demonstrate the ability of the algorithm to compute $p$-values of the expected magnitude with simulated data using a multitude of scenarios that involve various effect strengths and correlation between predictors. The algorithm is also applied to a prostate cancer dataset that has been analyzed in recent papers on the subject. The proposed method is found to provide a powerful way to make inference for feature selection even for small samples and when the number of predictors is several orders of magnitude larger than the number of observations. The algorithm is implemented in the MESS package in R and is freely available.
Submitted 17 March, 2014;
originally announced March 2014.