
Variational Bayes latent class approach for EHR-based phenotyping with large real-world data

Authors: Brian Buckley, Adrian O'Hagan, Marie Galligan

Abstract: Bayesian approaches to clinical analyses for the purposes of patient phenotyping have been limited by the computational challenges associated with applying the Markov-Chain Monte-Carlo (MCMC) approach to large real-world data. Approximate Bayesian inference via optimization of the variational evidence lower bound, often called Variational Bayes (VB), has been successfully demonstrated for other applications. We investigate the performance and characteristics of currently available R and Python VB software for variational Bayesian Latent Class Analysis (LCA) of realistically large real-world observational data. We used a real-world data set, Optum™ electronic health records (EHR), containing pediatric patients with risk indicators for type 2 diabetes mellitus that is a rare form in pediatric patients. The aim of this work is to validate a Bayesian patient phenotyping model for generality and extensibility and crucially that it can be applied to a realistically large real-world clinical data set. We find currently available automatic VB methods are very sensitive to initial starting conditions, model definition, algorithm hyperparameters and choice of gradient optimiser. The Bayesian LCA model was challenging to implement using VB but we achieved reasonable results with very good computational performance compared to MCMC.

Submitted 7 April, 2023; originally announced April 2023.

Comments: 10 pages, 5 figures, submitted to Wiley Stat. arXiv admin note: substantial text overlap with arXiv:2303.13619

arXiv:2303.13619

Variational Bayes latent class approach for EHR-based phenotyping with large real-world data

Authors: Brian Buckley, Adrian O'Hagan, Marie Galligan

Abstract: Bayesian approaches to clinical analyses for the purposes of patient phenotyping have been limited by the computational challenges associated with applying the Markov-Chain Monte-Carlo (MCMC) approach to large real-world data. Approximate Bayesian inference via optimization of the variational evidence lower bound, often called Variational Bayes (VB), has been successfully demonstrated for other applications. We investigate the performance and characteristics of currently available R and Python VB software for variational Bayesian Latent Class Analysis (LCA) of realistically large real-world observational data. We used a real-world data set, Optum™ electronic health records (EHR), containing pediatric patients with risk indicators for type 2 diabetes mellitus that is a rare form in pediatric patients. The aim of this work is to validate a Bayesian patient phenotyping model for generality and extensibility and crucially that it can be applied to a realistically large real-world clinical data set. We find currently available automatic VB methods are very sensitive to initial starting conditions, model definition, algorithm hyperparameters and choice of gradient optimiser. The Bayesian LCA model was challenging to implement using VB but we achieved reasonable results with very good computational performance compared to MCMC.

Submitted 23 March, 2023; originally announced March 2023.

Comments: 11 pages, 11 figures. Supplementary material available on request

arXiv:2209.05795

doi 10.1016/j.csda.2023.107841

Joint modelling of the body and tail of bivariate data

Authors: Lídia M. André, Jennifer L. Wadsworth, Adrian O'Hagan

Abstract: In situations where both extreme and non-extreme data are of interest, modelling the whole data set accurately is important. In a univariate framework, modelling the bulk and tail of a distribution has been extensively studied before. However, when more than one variable is of concern, models that aim specifically at capturing both regions correctly are scarce in the literature. A dependence model that blends two copulas with different characteristics over the whole range of the data support is proposed. One copula is tailored to the bulk and the other to the tail, with a dynamic weighting function employed to transition smoothly between them. Tail dependence properties are investigated numerically and simulation is used to confirm that the blended model is sufficiently flexible to capture a wide variety of structures. The model is applied to study the dependence between temperature and ozone concentration at two sites in the UK and compared with a single copula fit. The proposed model provides a better, more flexible, fit to the data, and is also capable of capturing complex dependence structures.

Submitted 10 October, 2023; v1 submitted 13 September, 2022; originally announced September 2022.

Comments: 36 pages, 12 figures

arXiv:2103.10912

Copula Averaging for Tail Dependence in Insurance Claims Data

Authors: Sen Hu, Adrian O'Hagan

Abstract: Analysing dependent risks is an important task for insurance companies. A dependency is reflected in the fact that information about one random variable provides information about the likely distribution of values of another random variable. Insurance companies in particular must investigate such dependencies between different lines of business and the effects that an extreme loss event, such as an earthquake or hurricane, has across multiple lines of business simultaneously. Copulas provide a popular model-based approach to analysing the dependency between risks, and the coefficient of tail dependence is a measure of dependence for extreme losses. Besides commonly used empirical estimators for estimating the tail dependence coefficient, copula fitting can lead to estimation of such coefficients directly or can verify their existence. Generally, a range of copula models is available to fit a data set well, leading to multiple different tail dependence results; a method based on Bayesian model averaging is designed to obtain a unified estimate of tail dependence. In this article, this model-based coefficient estimation method is illustrated through a variety of copula fitting approaches and results are presented for several simulated data sets and also a real general insurance loss data set.

Submitted 19 March, 2021; originally announced March 2021.

arXiv:2102.02852

Eliciting judgements about dependent quantities of interest: The SHELF extension and copula methods illustrated using an asthma case study

Authors: Björn Holzhauer, Lisa V. Hampson, John Paul Gosling, Björn Bornkamp, Joseph Kahn, Markus R. Lange, Wen-Lin Luo, Caterina Brindicci, David Lawrence, Steffen Ballerstedt, Anthony O'Hagan

Abstract: Pharmaceutical companies regularly need to make decisions about drug development programs based on the limited knowledge from early stage clinical trials. In this situation, eliciting the judgements of experts is an attractive approach for synthesising evidence on the unknown quantities of interest. When calculating the probability of success for a drug development program, multiple quantities of interest - such as the effect of a drug on different endpoints - should not be treated as unrelated. We discuss two approaches for establishing a multivariate distribution for several related quantities within the SHeffield ELicitation Framework (SHELF). The first approach elicits experts' judgements about a quantity of interest conditional on knowledge about another one. For the second approach, we first elicit marginal distributions for each quantity of interest. Then, for each pair of quantities, we elicit the concordance probability that both lie on the same side of their respective elicited medians. This allows us to specify a copula to obtain the joint distribution of the quantities of interest. We show how these approaches were used in an elicitation workshop that was performed to assess the probability of success of the registrational program of an asthma drug. The judgements of the experts, which were obtained prior to completion of the pivotal studies, were well aligned with the final trial results.

Submitted 15 February, 2021; v1 submitted 4 February, 2021; originally announced February 2021.

Comments: 29 pages, 7 figures

MSC Class: 62P10; 62P30; 62C99

arXiv:1907.04185

Predictively Consistent Prior Effective Sample Sizes

Authors: Beat Neuenschwander, Sebastian Weber, Heinz Schmidli, Anthony O'Hagan

Abstract: Determining the sample size of an experiment can be challenging, even more so when incorporating external information via a prior distribution. Such information is increasingly used to reduce the size of the control group in randomized clinical trials. Knowing the amount of prior information, expressed as an equivalent prior effective sample size (ESS), clearly facilitates trial designs. Various methods to obtain a prior's ESS have been proposed recently. They have been justified by the fact that they give the standard ESS for one-parameter exponential families. However, despite being based on similar information-based metrics, they may lead to surprisingly different ESS for non-conjugate settings, which complicates many designs with prior information. We show that current methods fail a basic predictive consistency criterion, which requires the expected posterior-predictive ESS for a sample of size $N$ to be the sum of the prior ESS and $N$. The expected local-information-ratio ESS is introduced and shown to be predictively consistent. It corrects the ESS of current methods, as shown for normally distributed data with a heavy-tailed Student-t prior and exponential data with a generalized Gamma prior. Finally, two applications are discussed: the prior ESS for the control group derived from historical data, and the posterior ESS for hierarchical subgroup analyses.

Submitted 9 July, 2019; originally announced July 2019.

Comments: 19 pages, 1 figure

ACM Class: G.3

arXiv:1904.04699

Bivariate Gamma Mixture of Experts Models for Joint Insurance Claims Modeling

Authors: Sen Hu, T Brendan Murphy, Adrian O'Hagan

Abstract: In general insurance, risks from different categories are often modeled independently and their sum is regarded as the total risk the insurer takes on in exchange for a premium. The dependence from multiple risks is generally neglected even when correlation could exist, for example a single car accident may result in claims from multiple risk categories. It is desirable to take the covariance of different categories into consideration in modeling in order to better predict future claims and hence allow greater accuracy in ratemaking. In this work multivariate severity models are investigated using mixture of experts models with bivariate gamma distributions, where the dependence structure is modeled directly using a GLM framework, and covariates can be placed in both gating and expert networks. Furthermore, parsimonious parameterisations are considered, which leads to a family of bivariate gamma mixture of experts models. It can be viewed as a model-based clustering approach that clusters policyholders into sub-groups with different dependencies, and the parameters of the mixture models are dependent on the covariates. Clustering is shown to be important in separating the data into sub-groupings where strong dependence is often present, even if the overall data set exhibits only weak dependence. In doing so, the correlation within different components features prominently in the model. It is shown that, by applying to both simulated data and a real-world Irish GI insurer data set, claim predictions can be improved.

Submitted 9 April, 2019; originally announced April 2019.

arXiv:1710.03704

Motor Insurance Accidental Damage Claims Modeling with Factor Collapsing and Bayesian Model Averaging

Authors: Sen Hu, Adrian O'Hagan, Thomas Brendan Murphy

Abstract: Accidental damage is a typical component of motor insurance claim. Modeling of this nature generally involves analysis of past claim history and different characteristics of the insured objects and the policyholders. Generalized linear models (GLMs) have become the industry's standard approach for pricing and modeling risks of this nature. However, the GLM approach utilizes a single "best" model on which loss predictions are based, which ignores the uncertainty among the competing models and variable selection. An additional characteristic of motor insurance data sets is the presence of many categorical variables, within which the number of levels is high. In particular, not all levels of such variables may be statistically significant and rather some subsets of the levels may be merged to give a smaller overall number of levels for improved model parsimony and interpretability. A method is proposed for assessing the optimal manner in which to collapse a factor with many levels into one with a smaller number of levels, then Bayesian model averaging (BMA) is used to blend model predictions from all reasonable models to account for factor collapsing uncertainty. This method will be computationally intensive due to the number of factors being collapsed as well as the possibly large number of levels within factors. Hence a stochastic optimisation is proposed to quickly find the best collapsing cases across the model space.

Submitted 10 October, 2017; originally announced October 2017.

arXiv:1510.00551

Investigation of Parameter Uncertainty in Clustering Using a Gaussian Mixture Model Via Jackknife, Bootstrap and Weighted Likelihood Bootstrap

Authors: Adrian O'Hagan, Thomas Brendan Murphy, Luca Scrucca, Isobel Claire Gormley

Abstract: Mixture models are a popular tool in model-based clustering. Such a model is often fitted by a procedure that maximizes the likelihood, such as the EM algorithm. At convergence, the maximum likelihood parameter estimates are typically reported, but in most cases little emphasis is placed on the variability associated with these estimates. In part this may be due to the fact that standard errors are not directly calculated in the model-fitting algorithm, either because they are not required to fit the model, or because they are difficult to compute. The examination of standard errors in model-based clustering is therefore typically neglected. The widely used R package mclust has recently introduced bootstrap and weighted likelihood bootstrap methods to facilitate standard error estimation. This paper provides an empirical comparison of these methods (along with the jackknife method) for producing standard errors and confidence intervals for mixture parameters. These methods are illustrated and contrasted in both a simulation study and in the traditional Old Faithful data set and Thyroid data set.

Submitted 22 July, 2019; v1 submitted 2 October, 2015; originally announced October 2015.

arXiv:1504.06870

Improved model-based clustering performance using Bayesian initialization averaging

Authors: Adrian O'Hagan, Arthur White

Abstract: The Expectation-Maximization (EM) algorithm is a commonly used method for finding the maximum likelihood estimates of the parameters in a mixture model via coordinate ascent. A serious pitfall with the algorithm is that in the case of multimodal likelihood functions, it can get trapped at a local maximum. This problem often occurs when sub-optimal starting values are used to initialize the algorithm. Bayesian initialization averaging (BIA) is proposed as an ensemble method to generate high quality starting values for the EM algorithm. Competing sets of trial starting values are combined as a weighted average, which is then used as the starting position for a full EM run. The method can also be extended to variational Bayes (VB) methods, a class of algorithm similar to EM that is based on an approximation of the model posterior. The BIA method is demonstrated on real continuous, categorical and network data sets, and the convergent log-likelihoods and associated clustering solutions presented. These compare favorably with the output produced using competing initialization methods such as random starts, hierarchical clustering and deterministic annealing, with the highest available maximum likelihood estimates obtained in a higher percentage of cases, at reasonable computational cost. The implications of the different clustering solutions obtained by local maxima are also discussed.

Submitted 30 August, 2018; v1 submitted 26 April, 2015; originally announced April 2015.

Showing 1–10 of 10 results for author: O'Hagan, A