-
Ethical considerations of use of hold-out sets in clinical prediction model management
Authors:
Louis Chislett,
Louis JM Aslett,
Alisha R Davies,
Catalina A Vallejos,
James Liley
Abstract:
Clinical prediction models are statistical or machine learning models used to quantify the risk of a certain health outcome using patient data. These can then inform potential interventions on patients, causing an effect called performative prediction: predictions inform interventions which influence the outcome they were trying to predict, leading to a potential underestimation of risk in some patients if a model is updated on this data. One suggested resolution to this is the use of hold-out sets, in which a set of patients do not receive model derived risk scores, such that a model can be safely retrained. We present an overview of clinical and research ethics regarding potential implementation of hold-out sets for clinical prediction models in health settings. We focus on the ethical principles of beneficence, non-maleficence, autonomy and justice. We also discuss informed consent, clinical equipoise, and truth-telling. We present illustrative cases of potential hold-out set implementations and discuss statistical issues arising from different hold-out set sampling methods. We also discuss differences between hold-out sets and randomised control trials, in terms of ethics and statistical issues. Finally, we give practical recommendations for researchers interested in the use of hold-out sets for clinical prediction models.
Submitted 5 June, 2024;
originally announced June 2024.
-
kalis: A Modern Implementation of the Li & Stephens Model for Local Ancestry Inference in R
Authors:
Louis J. M. Aslett,
Ryan R. Christ
Abstract:
Approximating the recent phylogeny of $N$ phased haplotypes at a set of variants along the genome is a core problem in modern population genomics and central to performing genome-wide screens for association, selection, introgression, and other signals. The Li & Stephens (LS) model provides a simple yet powerful hidden Markov model for inferring the recent ancestry at a given variant, represented as an $N \times N$ distance matrix based on posterior decodings. However, existing posterior decoding implementations for the LS model cannot scale to modern datasets with tens or hundreds of thousands of genomes. This work focuses on providing a high-performance engine to compute the LS model, enabling users to rapidly develop a range of variant-specific ancestral inference pipelines on top, exposed via an easy to use package, kalis, in the statistical programming language R. kalis exploits both multi-core parallelism and modern CPU vector instruction sets to enable scaling to problem sizes that would previously have been prohibitively slow to work with. The resulting distance matrices enable local ancestry, selection, and association studies in modern large scale genomic datasets.
Submitted 21 December, 2022;
originally announced December 2022.
-
Holdouts set for predictive model updating
Authors:
Sami Haidar-Wehbe,
Samuel R Emerson,
Louis J M Aslett,
James Liley
Abstract:
In complex settings, such as healthcare, predictive risk scores play an increasingly crucial role in guiding interventions. However, directly updating risk scores used to guide intervention can lead to biased risk estimates. To address this, we propose updating using a `holdout set' - a subset of the population that does not receive interventions guided by the risk score. Striking a balance in the size of the holdout set is essential, to ensure good performance of the updated risk score whilst minimising the number of held out samples. We prove that this approach enables total costs to grow at a rate $O\left(N^{2/3}\right)$ for a population of size $N$, and argue that in general circumstances there is no competitive alternative. By defining an appropriate loss function, we describe conditions under which an optimal holdout size (OHS) can be readily identified, and introduce parametric and semi-parametric algorithms for OHS estimation, demonstrating their use on a recent risk score for pre-eclampsia. Based on these results, we make the case that a holdout set is a safe, viable and easily implemented means to safely update predictive risk scores.
Submitted 31 July, 2023; v1 submitted 13 February, 2022;
originally announced February 2022.
-
Model updating after interventions paradoxically introduces bias
Authors:
James Liley,
Samuel R Emerson,
Bilal A Mateen,
Catalina A Vallejos,
Louis J M Aslett,
Sebastian J Vollmer
Abstract:
Machine learning is increasingly being used to generate prediction models for use in a number of real-world settings, from credit risk assessment to clinical decision support. Recent discussions have highlighted potential problems in the updating of a predictive score for a binary outcome when an existing predictive score forms part of the standard workflow, driving interventions. In this setting, the existing score induces an additional causative pathway which leads to miscalibration when the original score is replaced. We propose a general causal framework to describe and address this problem, and demonstrate an equivalent formulation as a partially observed Markov decision process. We use this model to demonstrate the impact of such `naive updating' when performed repeatedly. Namely, we show that successive predictive scores may converge to a point where they predict their own effect, or may eventually tend toward a stable oscillation between two values, and we argue that neither outcome is desirable. Furthermore, we demonstrate that even if model-fitting procedures improve, actual performance may worsen. We complement these findings with a discussion of several potential routes to overcome these issues.
Submitted 22 February, 2021; v1 submitted 22 October, 2020;
originally announced October 2020.
-
Improved Concentration Bounds for Gaussian Quadratic Forms
Authors:
Robert E. Gallagher,
Louis J. M. Aslett,
David Steinsaltz,
Ryan R. Christ
Abstract:
For a wide class of monotonic functions $f$, we develop a Chernoff-style concentration inequality for quadratic forms $Q_f \sim \sum\limits_{i=1}^n f(η_i) (Z_i + δ_i)^2$, where $Z_i \sim N(0,1)$. The inequality is expressed in terms of traces that are rapid to compute, making it useful for bounding p-values in high-dimensional screening applications. The bounds we obtain are significantly tighter than those that have been previously developed, which we illustrate with numerical examples.
Submitted 13 November, 2019;
originally announced November 2019.
-
Reliability analysis of general phased mission systems with a new survival signature
Authors:
Xianzhen Huang,
Louis J. M. Aslett,
Frank P. A. Coolen
Abstract:
It is often difficult for a phased mission system (PMS) to be highly reliable, because this entails achieving high reliability in every phase of operation. Consequently, reliability analysis of such systems is of critical importance. However, efficient and interpretable analysis of PMSs enabling general component lifetime distributions, arbitrary structures, and the possibility that components skip phases has been an open problem.
In this paper, we show that the survival signature can be used for reliability analysis of PMSs with similar types of component in each phase, providing an alternative to the existing limited approaches in the literature. We then develop new methodology addressing the full range of challenges above. The new method retains the attractive survival signature property of separating the system structure from the component lifetime distributions, simplifying computation, insight into, and inference for system reliability.
Submitted 24 July, 2018;
originally announced July 2018.
-
Encrypted accelerated least squares regression
Authors:
Pedro M. Esperança,
Louis J. M. Aslett,
Chris C. Holmes
Abstract:
Information that is stored in an encrypted format is, by definition, usually not amenable to statistical analysis or machine learning methods. In this paper we present detailed analysis of coordinate and accelerated gradient descent algorithms which are capable of fitting least squares and penalised ridge regression models, using data encrypted under a fully homomorphic encryption scheme. Gradient descent is shown to dominate in terms of encrypted computational speed, and theoretical results are proven to give parameter bounds which ensure correctness of decryption. The characteristics of encrypted computation are empirically shown to favour a non-standard acceleration technique. This demonstrates the possibility of approximating conventional statistical regression methods using encrypted data without compromising privacy.
Submitted 2 March, 2017;
originally announced March 2017.
-
Multilevel Monte Carlo for Reliability Theory
Authors:
Louis J. M. Aslett,
Tigran Nagapetyan,
Sebastian J. Vollmer
Abstract:
As the size of engineered systems grows, problems in reliability theory can become computationally challenging, often due to the combinatorial growth in the cut sets. In this paper we demonstrate how Multilevel Monte Carlo (MLMC) - a simulation approach which is typically used for stochastic differential equation models - can be applied in reliability problems by carefully controlling the bias-variance tradeoff in approximating large system behaviour. In this first exposition of MLMC methods in reliability problems we address the canonical problem of estimating the expectation of a functional of system lifetime and show the computational advantages compared to classical Monte Carlo methods. The difference in computational complexity can be orders of magnitude for very large or complicated system structures.
Submitted 11 March, 2017; v1 submitted 1 September, 2016;
originally announced September 2016.
-
Cryptographically secure multiparty evaluation of system reliability
Authors:
Louis J. M. Aslett
Abstract:
The precise design of a system may be considered a trade secret which should be protected, whilst at the same time component manufacturers are sometimes reluctant to release full test data (perhaps only providing mean time to failure data). In this situation it seems impractical to both produce an accurate reliability assessment and satisfy all parties' privacy requirements. However, we present recent developments in cryptography which, when combined with the recently developed survival signature in reliability theory, allow almost total privacy to be maintained in a cryptographically strong manner in precisely this setting. Thus, the system designer does not have to reveal their trade secret design and the component manufacturer can retain component test data in-house.
Submitted 18 April, 2016;
originally announced April 2016.
-
Bayesian Nonparametric System Reliability using Sets of Priors
Authors:
Gero Walter,
Louis J. M. Aslett,
Frank P. A. Coolen
Abstract:
An imprecise Bayesian nonparametric approach to system reliability with multiple types of components is developed. This allows modelling partial or imperfect prior knowledge on component failure distributions in a flexible way through bounds on the functioning probability. Given component level test data these bounds are propagated to bounds on the posterior predictive distribution for the functioning probability of a new system containing components exchangeable with those used in testing. The method further enables identification of prior-data conflict at the system level based on component level test data. New results on first-order stochastic dominance for the Beta-Binomial distribution make the technique computationally tractable. Our methodological contributions can be immediately used in applications by reliability practitioners as we provide easy to use software tools.
Submitted 4 February, 2016;
originally announced February 2016.
-
Encrypted statistical machine learning: new privacy preserving methods
Authors:
Louis J. M. Aslett,
Pedro M. Esperança,
Chris C. Holmes
Abstract:
We present two new statistical machine learning methods designed to learn on fully homomorphic encrypted (FHE) data. The introduction of FHE schemes following Gentry (2009) opens up the prospect of privacy preserving statistical machine learning analysis and modelling of encrypted data without compromising security constraints. We propose tailored algorithms for applying extremely random forests, involving a new cryptographic stochastic fraction estimator, and naïve Bayes, involving a semi-parametric model for the class decision boundary, and show how they can be used to learn and predict from encrypted data. We demonstrate that these techniques perform competitively on a variety of classification data sets and provide detailed information about the computational practicalities of these and other FHE methods.
Submitted 27 August, 2015;
originally announced August 2015.
-
A review of homomorphic encryption and software tools for encrypted statistical machine learning
Authors:
Louis J. M. Aslett,
Pedro M. Esperança,
Chris C. Holmes
Abstract:
Recent advances in cryptography promise to enable secure statistical computation on encrypted data, whereby a limited set of operations can be carried out without the need to first decrypt. We review these homomorphic encryption schemes in a manner accessible to statisticians and machine learners, focusing on pertinent limitations inherent in the current state of the art. These limitations restrict the kind of statistics and machine learning algorithms which can be implemented and we review those which have been successfully applied in the literature. Finally, we document a high performance R package implementing a recent homomorphic scheme in a general framework.
Submitted 26 August, 2015;
originally announced August 2015.