-
Field demonstration of predictive heating control for an all-electric house in a cold climate
Authors:
Elias N. Pergantis,
Priyadarshan,
Nadah Al Theeb,
Parveen Dhillon,
Jonathan P. Ore,
Davide Ziviani,
Eckhard A. Groll,
Kevin J. Kircher
Abstract:
Efficient electric heat pumps that replace fossil-fueled heating systems could significantly reduce greenhouse gas emissions. However, electric heat pumps can sharply increase electricity demand, causing high utility bills and stressing the power grid. Residential neighborhoods could see particularly high electricity demand during cold weather, when heat demand rises and heat pump efficiencies fall. This paper presents the development and field demonstration of a predictive control system for an air-to-air heat pump with backup electric resistance heat. The control system adjusts indoor temperature set-points based on weather forecasts, occupancy conditions, and data-driven models of the building and heating equipment. Field tests from January to March of 2023 in an occupied, all-electric, 208 m² detached single-family house in Indiana, USA, included outdoor temperatures as low as -15 °C. On average over these tests, the control system reduced daily heating energy use by 19% (95% confidence interval: 13-24%), energy used for backup heat by 38%, and the frequency of using the highest stage (19 kW) of backup heat by 83%. Concurrent surveys of residents showed that the control system maintained satisfactory thermal comfort. The control system could reduce the house's total annual heating costs by about $300 (95% confidence interval: 23-34%). These real-world results could strengthen the case for deploying predictive home heating control, bringing the technology one step closer to reducing emissions, utility bills, and power grid impacts at scale.
Submitted 10 February, 2024;
originally announced February 2024.
-
Statistically Enhanced Learning: a feature engineering framework to boost (any) learning algorithms
Authors:
Florian Felice,
Christophe Ley,
Andreas Groll,
Stéphane Bordas
Abstract:
Feature engineering is of critical importance in the field of Data Science. While any data scientist knows the importance of rigorously preparing data to obtain well-performing models, little literature formalizes its benefits. In this work, we present the method of Statistically Enhanced Learning (SEL), a framework that formalizes existing feature engineering and extraction tasks in Machine Learning (ML). The difference from classical ML is that certain predictors are not directly observed but are obtained as statistical estimators. Our goal is to study SEL, aiming to establish a formalized framework and illustrate its improved performance by means of simulations as well as applications to real-life use cases.
Submitted 29 June, 2023;
originally announced June 2023.
-
Churn modeling of life insurance policies via statistical and machine learning methods -- Analysis of important features
Authors:
Andreas Groll,
Carsten Wasserfuhr,
Leonid Zeldin
Abstract:
Life assurance companies typically possess a wealth of data covering multiple systems and databases. These data are often used for analyzing the past and for describing the present. Taking account of the past, the future is mostly forecasted by traditional statistical methods. So far, only a few attempts have been made to perform estimations by means of machine learning approaches. In this work, the individual contract cancellation behavior of customers within two partial stocks is modeled with the aid of various classification methods. Partial stocks of private pension and endowment policies are considered. We describe the data used for the modeling, their structure, and the way in which they are cleansed. The utilized models are calibrated on the basis of an extensive tuning process and then graphically evaluated regarding their goodness-of-fit. With the help of a variable relevance concept, we investigate which features notably affect the individual contract cancellation behavior.
Submitted 18 February, 2022;
originally announced February 2022.
-
Machine Learning for Multi-Output Regression: When should a holistic multivariate approach be preferred over separate univariate ones?
Authors:
Lena Schmid,
Alexander Gerharz,
Andreas Groll,
Markus Pauly
Abstract:
Tree-based ensembles such as the Random Forest are modern classics among statistical learning methods. In particular, they are used for predicting univariate responses. In case of multiple outputs the question arises whether we separately fit univariate models or directly follow a multivariate approach. For the latter, several possibilities exist that are, e.g., based on modified splitting or stopping rules for multi-output regression. In this work we compare these methods in extensive simulations to help in answering the primary question when to use multivariate ensemble techniques.
Submitted 14 January, 2022;
originally announced January 2022.
-
Using Sequential Statistical Tests for Efficient Hyperparameter Tuning
Authors:
Philip Buczak,
Andreas Groll,
Markus Pauly,
Jakob Rehof,
Daniel Horn
Abstract:
Hyperparameter tuning is one of the most time-consuming parts in machine learning. Despite the existence of modern optimization algorithms that minimize the number of evaluations needed, evaluations of a single setting may still be expensive. Usually a resampling technique is used, where the machine learning method has to be fitted a fixed number of k times on different training datasets. The respective mean performance of the k fits is then used as performance estimator. Many hyperparameter settings could be discarded after fewer than k resampling iterations if they are clearly inferior to high-performing settings. However, resampling is often performed until the very end, wasting a lot of computational effort. To this end, we propose the Sequential Random Search (SQRS) which extends the regular random search algorithm by a sequential testing procedure aimed at detecting and eliminating inferior parameter configurations early. We compared our SQRS with regular random search using multiple publicly available regression and classification datasets. Our simulation study showed that the SQRS is able to find similarly well-performing parameter settings while requiring noticeably fewer evaluations. Our results underscore the potential for integrating sequential tests into hyperparameter tuning.
Submitted 28 November, 2022; v1 submitted 23 December, 2021;
originally announced December 2021.
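The early-elimination idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' SQRS procedure: the abstract does not specify the sequential test, so this sketch swaps in a simple normal-approximation bound (a configuration is dropped once its mean score plus z standard errors falls below the incumbent's mean). The names sequential_random_search, evaluate, sample_config, min_folds and z are all hypothetical.

```python
import random
import statistics


def sequential_random_search(evaluate, sample_config, n_configs=20, k=10,
                             min_folds=3, z=2.0, seed=0):
    """Random search with early stopping of clearly inferior configurations.

    evaluate(config, fold, rng) returns a score (higher is better) for one
    resampling iteration; sample_config(rng) draws a random configuration.
    The stopping rule is an assumed stand-in for a proper sequential test.
    """
    rng = random.Random(seed)
    best_config, best_mean = None, float("-inf")
    total_evals = 0
    for _ in range(n_configs):
        config = sample_config(rng)
        scores = []
        discarded = False
        for fold in range(k):
            scores.append(evaluate(config, fold, rng))
            total_evals += 1
            # Only test once there is an incumbent and enough folds for a variance estimate.
            if best_config is not None and len(scores) >= min_folds:
                mean = statistics.mean(scores)
                sem = statistics.stdev(scores) / len(scores) ** 0.5
                if mean + z * sem < best_mean:
                    discarded = True  # clearly worse than the incumbent
                    break
        if not discarded:
            mean = statistics.mean(scores)
            if mean > best_mean:
                best_config, best_mean = config, mean
    return best_config, best_mean, total_evals


# Toy usage: one hyperparameter x in [0, 1] with a noisy score peaking at x = 0.3.
best, best_mean, n_evals = sequential_random_search(
    evaluate=lambda x, fold, rng: -(x - 0.3) ** 2 + rng.gauss(0.0, 0.01),
    sample_config=lambda rng: rng.uniform(0.0, 1.0),
)
print(best, best_mean, n_evals)
```

Regular random search would always spend n_configs * k evaluations; here total_evals counts how many were actually run, which is the quantity the abstract reports as being reduced.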
-
Hybrid Machine Learning Forecasts for the UEFA EURO 2020
Authors:
Andreas Groll,
Lars Magnus Hvattum,
Christophe Ley,
Franziska Popp,
Gunther Schauberger,
Hans Van Eetvelde,
Achim Zeileis
Abstract:
Three state-of-the-art statistical ranking methods for forecasting football matches are combined with several other predictors in a hybrid machine learning model. The ranking methods yield an ability estimate for every team based on historic matches, an ability estimate for every team based on bookmaker consensus, and average plus-minus player ratings based on individual performances in home clubs and national teams; the other predictors are further team covariates (e.g., market value, team structure) and country-specific socio-economic factors (population, GDP). The proposed combined approach is used for learning the number of goals scored in the matches from the four previous UEFA EUROs 2004-2016 and then applied to current information to forecast the upcoming UEFA EURO 2020. Based on the resulting estimates, the tournament is simulated repeatedly and winning probabilities are obtained for all teams. A random forest model favors the current World Champion France with a winning probability of 14.8%, ahead of England (13.5%) and Spain (12.3%). Additionally, we provide survival probabilities for all teams and at all tournament stages.
Submitted 7 June, 2021;
originally announced June 2021.
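The final step described in the abstract above (simulate the tournament repeatedly, then read off winning probabilities) can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the authors' model: goals are drawn as independent Poisson counts with a fixed expected-goals rate per team, the format is a plain knockout bracket, draws are settled by a coin flip, and the team labels and rates are made up for the example.

```python
import math
import random
from collections import Counter


def poisson(lam, rng):
    """Draw a Poisson-distributed count with mean lam (Knuth's method)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def play_knockout(team_a, team_b, rates, rng):
    """Simulate one knockout match; a draw is settled by a fair coin flip."""
    goals_a = poisson(rates[team_a], rng)
    goals_b = poisson(rates[team_b], rng)
    if goals_a == goals_b:
        return rng.choice([team_a, team_b])
    return team_a if goals_a > goals_b else team_b


def simulate_tournament(teams, rates, rng):
    """Play a single-elimination bracket; len(teams) must be a power of two."""
    remaining = list(teams)
    while len(remaining) > 1:
        remaining = [
            play_knockout(remaining[i], remaining[i + 1], rates, rng)
            for i in range(0, len(remaining), 2)
        ]
    return remaining[0]


def winning_probabilities(teams, rates, n_sims=10000, seed=0):
    """Estimate each team's title probability as its share of simulated wins."""
    rng = random.Random(seed)
    wins = Counter(simulate_tournament(teams, rates, rng) for _ in range(n_sims))
    return {team: wins[team] / n_sims for team in teams}


# Hypothetical teams and expected goals per match (illustrative values only).
teams = ["A", "B", "C", "D"]
rates = {"A": 1.8, "B": 1.4, "C": 1.2, "D": 0.9}
print(winning_probabilities(teams, rates))
```

In the paper the expected goals would come from the fitted random forest and depend on both opponents; the fixed per-team rates here only keep the sketch short.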
-
Random boosting and random^2 forests -- A random tree depth injection approach
Authors:
Tobias Markus Krabel,
Thi Ngoc Tien Tran,
Andreas Groll,
Daniel Horn,
Carsten Jentsch
Abstract:
The induction of additional randomness in parallel and sequential ensemble methods has proven to be worthwhile in many aspects. In this manuscript, we propose and examine a novel random tree depth injection approach suitable for sequential and parallel tree-based approaches including Boosting and Random Forests. The resulting methods are called Random Boost and Random^2 Forest. Both approaches serve as valuable extensions to the existing literature on the gradient boosting framework and random forests. A Monte Carlo simulation, in which tree-shaped data sets with different numbers of final partitions are built, suggests that there are several scenarios where Random Boost and Random^2 Forest can improve the prediction performance of conventional hierarchical boosting and random forest approaches. The new algorithms appear to be especially successful in cases where there are merely a few high-order interactions in the generated data. In addition, our simulations suggest that our random tree depth injection approach can improve computation time by up to 40%, while at the same time the performance losses in terms of prediction accuracy turn out to be minor or even negligible in most cases.
Submitted 13 September, 2020;
originally announced September 2020.
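To make the random tree depth injection idea from the abstract above concrete, here is a minimal Python sketch of a boosting loop in which each tree gets a randomly drawn depth. It is an illustration under assumptions, not the authors' implementation: the depth is drawn uniformly from 1 to max_depth (the abstract does not state the distribution), the loss is squared error, the data have a single feature, and fit_tree and random_boost are hypothetical names.

```python
import random


def fit_tree(xs, ys, depth):
    """Fit a small least-squares regression tree on one feature.

    Returns a function mapping x to a prediction.
    """
    mean = sum(ys) / len(ys)
    values = sorted(set(xs))
    if depth == 0 or len(values) < 2:
        return lambda x: mean
    best = None
    for split in values[1:]:  # guarantees both sides are non-empty
        left = [y for x, y in zip(xs, ys) if x < split]
        right = [y for x, y in zip(xs, ys) if x >= split]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, split)
    split = best[1]
    left_pairs = [(x, y) for x, y in zip(xs, ys) if x < split]
    right_pairs = [(x, y) for x, y in zip(xs, ys) if x >= split]
    left_tree = fit_tree([x for x, _ in left_pairs], [y for _, y in left_pairs], depth - 1)
    right_tree = fit_tree([x for x, _ in right_pairs], [y for _, y in right_pairs], depth - 1)
    return lambda x: left_tree(x) if x < split else right_tree(x)


def random_boost(xs, ys, n_trees=50, max_depth=4, learning_rate=0.1, seed=0):
    """Gradient boosting for squared error with a random depth per tree."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    trees = []
    for _ in range(n_trees):
        depth = rng.randint(1, max_depth)  # the injected randomness (assumed uniform)
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_tree(xs, residuals, depth)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


# Toy usage: a step function on a grid of 20 points.
xs = [i / 20 for i in range(20)]
ys = [0.0 if x < 0.5 else 1.0 for x in xs]
model = random_boost(xs, ys)
print(model(0.1), model(0.9))
```

Because shallow trees are cheaper to fit than trees of the maximum depth, drawing the depth at random is one plausible source of the computation-time savings the abstract mentions.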
-
Deducing neighborhoods of classes from a fitted model
Authors:
Alexander Gerharz,
Andreas Groll,
Gunther Schauberger
Abstract:
In today's world the demand for very complex models for huge data sets is rising steadily. The problem with these models is that, as their complexity rises, they become much harder to interpret. The growing field of interpretable machine learning tries to make up for the lack of interpretability in these complex (or even black-box) models by using specific techniques that can help to understand those models better. In this article a new kind of interpretable machine learning method is presented, which can help to understand the partitioning of the feature space into predicted classes in a classification model using quantile shifts. To illustrate in which situations this quantile shift method (QSM) could become beneficial, it is applied to a theoretical medical example and a real data example. Basically, real data points (or specific points of interest) are used and the changes of the prediction after slightly raising or decreasing specific features are observed. By comparing the predictions before and after the manipulations, under certain conditions the observed changes in the predictions can be interpreted as neighborhoods of the classes with regard to the manipulated features. Chordgraphs are used to visualize the observed changes.
Submitted 17 September, 2020; v1 submitted 11 September, 2020;
originally announced September 2020.
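One possible reading of the quantile shift idea in the abstract above is sketched below in Python: each observation's value of one feature is moved to the empirical quantile a fixed step away from its own, and the predicted classes before and after are compared. This is an interpretation made for illustration, not the authors' algorithm; shift_by_quantile, class_transitions, the toy classifier and the data are all hypothetical.

```python
import bisect
from collections import Counter


def shift_by_quantile(value, sorted_values, shift):
    """Move value to the empirical quantile that lies `shift` away from its own."""
    n = len(sorted_values)
    rank = bisect.bisect_right(sorted_values, value) / n  # empirical CDF at value
    target = min(max(rank + shift, 0.0), 1.0)             # clip to [0, 1]
    index = min(max(round(target * n) - 1, 0), n - 1)
    return sorted_values[index]


def class_transitions(predict, rows, feature, shift):
    """Count (class before, class after) pairs caused by shifting one feature."""
    sorted_values = sorted(row[feature] for row in rows)
    transitions = Counter()
    for row in rows:
        before = predict(row)
        shifted = dict(row)  # copy so the original observation is untouched
        shifted[feature] = shift_by_quantile(row[feature], sorted_values, shift)
        after = predict(shifted)
        if before != after:
            transitions[(before, after)] += 1
    return transitions


# Toy usage with a made-up three-class rule on a single feature "x".
def toy_predict(row):
    if row["x"] < 4:
        return "low"
    return "mid" if row["x"] < 8 else "high"


rows = [{"x": value} for value in range(1, 11)]
print(class_transitions(toy_predict, rows, feature="x", shift=0.2))
```

Transitions that occur often (here between adjacent classes of the toy rule) would then be read as evidence that the two classes are neighbors with respect to the shifted feature.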
-
A flexible adaptive lasso Cox frailty model based on the full likelihood
Authors:
Maike Hohberg,
Andreas Groll
Abstract:
In this work a method to regularize Cox frailty models is proposed that accommodates time-varying covariates and time-varying coefficients and is based on the full instead of the partial likelihood. A particular advantage in this framework is that the baseline hazard can be explicitly modeled in a smooth, semi-parametric way, e.g. via P-splines. Regularization for variable selection is performed via a lasso penalty and via group lasso for categorical variables while a second penalty regularizes wiggliness of smooth estimates of time-varying coefficients and the baseline hazard. Additionally, adaptive weights are included to stabilize the estimation. The method is implemented in R as coxlasso and will be compared to other packages for regularized Cox regression. Existing packages, however, do not allow for the combination of different effects that are accommodated in coxlasso.
Submitted 31 March, 2020;
originally announced March 2020.
-
Addressing cluster-constant covariates in mixed effects models via likelihood-based boosting techniques
Authors:
Colin Griesbach,
Andreas Groll,
Elisabeth Waldmann
Abstract:
Boosting techniques from the field of statistical learning have grown to be a popular tool for estimating and selecting predictor effects in various regression models and can roughly be separated into two general approaches, namely gradient boosting and likelihood-based boosting. An extensive framework has been proposed in order to fit generalised mixed models based on boosting; however, for the case of cluster-constant covariates, likelihood-based boosting approaches tend to mischoose variables in the selection step, leading to wrong estimates. We propose an improved boosting algorithm for linear mixed models where the random effects are properly weighted, disentangled from the fixed effects updating scheme and corrected for correlations with cluster-constant covariates in order to improve the quality of estimates and, in addition, reduce the computational effort. Simulations and various data examples show that the method outperforms current state-of-the-art approaches from boosting and maximum likelihood inference.
Submitted 13 December, 2019;
originally announced December 2019.
-
A regularized hidden Markov model for analyzing the 'hot shoe' in football
Authors:
Marius Ötting,
Andreas Groll
Abstract:
Although academic research on the 'hot hand' effect (in particular, in sports, especially in basketball) has been going on for more than 30 years, it still remains a central question in different areas of research whether such an effect exists. In this contribution, we investigate the potential occurrence of a 'hot shoe' effect for the performance of penalty takers in football based on data from the German Bundesliga. For this purpose, we consider hidden Markov models (HMMs) to model the (latent) forms of players. To further account for individual heterogeneity of the penalty taker as well as the opponent's goalkeeper, player-specific abilities are incorporated in the model formulation together with a LASSO penalty. Our results suggest states which can be tied to different forms of players, thus providing evidence for the hot shoe effect, and shed some light on exceptionally well-performing goalkeepers, which are of potential interest to managers and sports fans.
Submitted 19 November, 2019;
originally announced November 2019.
-
Generalised Joint Regression for Count Data with a Focus on Modelling Football Matches
Authors:
Hendrik van der Wurp,
Andreas Groll,
Thomas Kneib,
Giampiero Marra,
Rosalba Radice
Abstract:
We propose a versatile joint regression framework for count responses. The method is implemented in the R add-on package GJRM and allows for modelling linear and non-linear dependence through the use of several copulae. Moreover, the parameters of the marginal distributions of the count responses and of the copula can be specified as flexible functions of covariates. Motivated by a football application, we also discuss an extension which forces the regression coefficients of the marginal (linear) predictors to be equal via a suitable penalisation. Model fitting is based on a trust region algorithm which estimates simultaneously all the parameters of the joint models. We investigate the proposal's empirical performance in two simulation studies, the first one designed for arbitrary count data, the other one reflecting football-specific settings. Finally, the method is applied to FIFA World Cup data, showing its competitiveness to the standard approach with regard to predictive performance.
Submitted 21 August, 2019; v1 submitted 2 August, 2019;
originally announced August 2019.
-
Hybrid Machine Learning Forecasts for the FIFA Women's World Cup 2019
Authors:
Andreas Groll,
Christophe Ley,
Gunther Schauberger,
Hans Van Eetvelde,
Achim Zeileis
Abstract:
In this work, we combine two different ranking methods together with several other predictors in a joint random forest approach for the scores of soccer matches. The first ranking method is based on the bookmaker consensus; the second estimates adequate ability parameters that best reflect the current strength of the teams. The proposed combined approach is then applied to the data from the two previous FIFA Women's World Cups 2011 and 2015. Finally, based on the resulting estimates, the FIFA Women's World Cup 2019 is simulated repeatedly and winning probabilities are obtained for all teams. The model clearly favors the defending champion USA ahead of the host France.
Submitted 3 June, 2019;
originally announced June 2019.
-
Prediction of the 2019 IHF World Men's Handball Championship - An underdispersed sparse count data regression model
Authors:
Andreas Groll,
Jonas Heiner,
Gunther Schauberger,
Jörn Uhrmeister
Abstract:
In this work, we compare several different modeling approaches for count data applied to the scores of handball matches with regard to their predictive performances based on all matches from the four previous IHF World Men's Handball Championships 2011 - 2017: (underdispersed) Poisson regression models, Gaussian response models and negative binomial models. All models are based on the teams' covariate information. Within this comparison, the Gaussian response model turns out to be the best-performing prediction method on the training data and is, therefore, chosen as the final model. Based on its estimates, the IHF World Men's Handball Championship 2019 is simulated repeatedly and winning probabilities are obtained for all teams. The model clearly favors Denmark before France. Additionally, we provide survival probabilities for all teams and at all tournament stages as well as probabilities for all teams to qualify for the main round.
Submitted 17 January, 2019;
originally announced January 2019.
-
Prediction of the FIFA World Cup 2018 - A random forest approach with an emphasis on estimated team ability parameters
Authors:
Andreas Groll,
Christophe Ley,
Gunther Schauberger,
Hans Van Eetvelde
Abstract:
In this work, we compare three different modeling approaches for the scores of soccer matches with regard to their predictive performances based on all matches from the four previous FIFA World Cups 2002 - 2014: Poisson regression models, random forests and ranking methods. While the former two are based on the teams' covariate information, the latter method estimates adequate ability parameters that reflect the current strength of the teams best. Within this comparison the best-performing prediction methods on the training data turn out to be the ranking methods and the random forests. However, we show that by combining the random forest with the team ability parameters from the ranking methods as an additional covariate we can improve the predictive power substantially. Finally, this combination of methods is chosen as the final model and based on its estimates, the FIFA World Cup 2018 is simulated repeatedly and winning probabilities are obtained for all teams. The model slightly favors Spain before the defending champion Germany. Additionally, we provide survival probabilities for all teams and at all tournament stages as well as the most probable tournament outcome.
Submitted 13 June, 2018; v1 submitted 8 June, 2018;
originally announced June 2018.