A Semiparametric Approach for Robust and Efficient Learning with Biobank Data
†The first two authors made equal contributions to this paper.
Abstract
With the increasing availability of electronic health records (EHR) linked with biobank data for translational research, a critical step in realizing their potential is to accurately classify phenotypes for patients. Existing approaches to achieve this goal are based on error-prone EHR surrogate outcomes, assisted and validated by a small set of labels obtained via medical chart review, which may also be subject to misclassification. Ignoring the noise in these outcomes can induce severe estimation and validation bias in both EHR phenotyping and risk modeling with biomarkers collected in the biobank. To overcome this challenge, we propose a novel unsupervised and semiparametric approach to jointly model multiple noisy EHR outcomes with their linked biobank features. Our approach primarily aims at disease risk modeling with the baseline biomarkers, and is also able to produce a predictive EHR phenotyping model and validate its performance without observations of the true disease outcome. It consists of composite and nonparametric regression steps free of any parametric model specification, followed by a parametric projection step to reduce the uncertainty and improve the estimation efficiency. We show that our method is robust to violations of the parametric assumptions while attaining the desirable root-$n$ convergence rate for risk modeling. Our method outperforms existing methods in extensive simulation studies, as well as in a real-world application to phenotyping and genetic risk modeling of type II diabetes.
Keywords: EHR linked biobank data; Surrogates; Measurement errors; Biomarker; Model misspecification; Under-smoothing.
1 Introduction
1.1 Background
With the increasing adoption of electronic health record (EHR) systems in the United States, EHR data are increasingly accessible for research. By linking EHR data with biorepositories, researchers can perform powerful phenome-genome studies on such large-scale data for discovery and translational research (Kohane, 2011; Denny et al., 2013; Wells et al., 2019). To fully realize the potential of EHR data, a critical step involves accurately and efficiently classifying phenotype status for individual patients to enable association studies and risk modeling. Although simple rule-based classification algorithms leveraging domain knowledge remain useful, they have varying degrees of accuracy and portability (Zhang et al., 2019b). Conversely, data-driven machine learning based classification algorithms have been advocated as a useful alternative with higher accuracy and portability (Shivade et al., 2014; Liao et al., 2015; Banda et al., 2018). Typically, these algorithms undergo training and/or validation with gold standard labels curated via medical chart review. Subsequently, the predicted phenotypes for all patients in the cohort serve as the observed outcomes for downstream association studies (e.g., Liao et al., 2013, 2019).
Historically, most existing phenotyping algorithms have relied on supervised methods, which suffer from scalability issues due to the labor-intensive nature of manually reviewing charts to obtain gold standard labels for the phenotype of interest. In recent years, several unsupervised methods that leverage unlabeled data using surrogate features as noisy labels (Yu et al., 2017; Banda et al., 2017; Liao et al., 2019) were proposed as promising alternatives. However, these methods can lead to poor accuracy when the surrogate features have limited accuracy, and they do not provide a reliable estimate of the classification performance of the trained models.
1.2 Problem setup
Let denote the unobserved true binary phenotype status and be its associated baseline characteristics and genetic markers from the EHR linked biobank, which could be either multi-dimensional single nucleotide polymorphisms (SNPs) or a genetic risk score derived by weighting a number of SNPs. We simultaneously consider two types of error-prone outcomes or surrogates for in our setup. First, suppose there are -dimensional EHR surrogate features such as counts of ’s related billing codes and key laboratory results. Second, let be the chart review label from experts, taking values of for , to represent different levels of certainty regarding whether the patient has the condition . In practice, is often taken as with representing not a case, a possible case, and a case.
Importantly, we assume the error-prone outcomes and only relate to genetic markers through , i.e., . An illustration of this assumption is provided in the directed acyclic graph (DAG) of Figure 1. In this DAG, the baseline biomarkers first occur to affect the chance of developing the disease , then causes the downstream hospital visits producing features and in EHR, where may encode unstructured information such as images and narrative clinical notes. Though is not directly included as an outcome in our setup, it can affect the medical review result together with the observed and structured .
Suppose there are patients with independent and identical copies of the complete set of variables described above, denoted as . Since the label is derived based on expertise and additional information like , it is usually more accurate than in characterizing the true . However, it may still have a moderate measurement error due to incomplete information collection for medical review or complication and ambiguity of certain phenotypes. Thus, we assume that is only observable in a small set of subjects indexed by , and, to account for the error of , it is marginally related to through
(1) |
Also, note that and are observed for all patients and the true outcome is not observed for any patient. So the observed data is formed as , with the labeling indicator , i.e., being completely at random. Without loss of generality, we let where is the size of the labeled sample and is the indicator function. Our primary goal is to derive a risk model of against and to conduct inference on its encoded genetic associations. Since the genetic effects are usually moderate or small, it is more favorable to model and interpret with a simple and parametric form to ensure good interpretability and control the estimation uncertainty. Specifically, we consider a working logistic model:
(2) |
where the expit link . Note that model (2) is allowed to be misspecified, and we define the target model parameter as where is the log-likelihood function of logistic regression. Though (2) may be misspecified, such is still identifiable and effective in characterizing the genetic associations. Our secondary goal is EHR phenotyping for the unobserved using by deriving a risk score , as well as validating its classification performance. Due to the absence of in our observation, all above-introduced tasks are unsupervised and, thus, more challenging than the standard supervised or recent semi-supervised scenarios reviewed in Section 1.3.
Remark 1
Though both and are surrogates of the truth with errors, we still notate and consider them separately for several reasons. First, is not accessible for a (large) fraction of subjects, so the phenotyping score of can only include the fully observed as the predictors and is formulated as . Second, although is neither perfect nor scalable, it is supposed to be more accurate and informative than . Thus, as will be discussed in Sections 2 and 3, is important under our framework for stable training and efficient estimation, especially when is of poor quality in characterizing .
1.3 Related literature and our contribution
Surrogate outcomes play an important role in data-driven biomedical research, particularly when obtaining the primary or true outcome of interest is costly or even impossible, e.g., demanding extensive human labor or long periods of follow-up. There is a rich literature on both semi-supervised and unsupervised statistical learning with surrogates. For example, Athey et al. (2019) leveraged surrogates collected in observational studies to assist learning with experimental studies when gold standard labels are scarce. Kallus and Mao (2020) and Hou et al. (2021) studied how to utilize surrogates to improve the efficiency of causal inference without incurring bias. Hou et al. (2023a) developed a semiparametric transformation approach to incorporate time-to-event surrogates and improve the learning efficiency with the true outcomes.
The aforementioned literature considered a semi-supervised setting with a small sample of the true outcome . In contrast, our problem setup does not involve any observation of . For such an unsupervised setting, Huang et al. (2018) and Hong et al. (2019) proposed maximum likelihood approaches based on parametric assumptions on the conditional model of , which enable the identification and estimation of the model coefficients. Zhang et al. (2019a) developed a method for unsupervised learning and phenotype validation with anchor-positive surrogate outcomes in EHR. All these recent methods largely rely on parametric model assumptions like (2), which is only a working assumption in our setup. Misspecification of such a model could lead to biased estimation of the target parameter due to the absence of the true label .
Meanwhile, we notice some fully nonparametric approaches for the so-called latent-structure or mixture model related to our problem setup in recent literature, including Bonhomme et al. (2016), Yu et al. (2019), and Zheng and Wu (2019). For example, Zheng and Wu (2019) proposed a novel tensor approach for learning nonparametric mixtures, with the key idea of introducing basis approximation to the component density functions. This line of work is in general free from the model misspecification issue discussed above but cannot provide desirable $\sqrt{n}$-consistent estimators and may encounter the “curse of dimensionality” for multivariate surrogate outcomes.
To address the above-introduced dilemma between the bias caused by model misspecification and the low efficiency due to the curse of dimensionality, we develop a Three-stage Unsupervised learning approach for Biomarkers linked with Error-prone outcomes, abbreviated as TUBE. Our approach primarily aims at risk modeling with the baseline biomarkers, and is also able to produce and validate a predictive EHR phenotyping score without observations of the true disease outcome. It is a semiparametric method that starts from a composite and nonparametric regression step for against that is free of any parametric assumptions. Following this step, TUBE combines multiple surrogates for EHR phenotyping and validation, and then implements a parametric projection step to improve the interpretability and estimation efficiency of the genetic risk model. We will show that our estimator for is $\sqrt{n}$-consistent and asymptotically normal without requiring model (2) to be correctly specified or to have a parametric form, requirements that are imposed by existing methods like Hong et al. (2019) and Zhang et al. (2019a). Also, TUBE demonstrates significantly better performance than existing methods in our simulation and real-world studies.
2 Three-stage unsupervised learning method
2.1 Overview of the modeling strategy
Our proposed TUBE method consists of three main stages. In stage I, we adopt an under-smoothed nonparametric and composite likelihood strategy that is free of any parametric or model structural assumptions on the forms of , and . This is to avoid the potential bias caused by model misspecification when linking the error-prone outcomes with without the supervision of the true label . In stage II, we leverage the results from stage I to condense the EHR features into a risk score for more accurate phenotyping of , and refit the data using nonparametric likelihoods to evaluate its ROC. In stage III, we rely on the imputed outcomes from stage II to derive a parametric logistic model for . Compared to the previous stages, stage III will output a more efficient characterization of the genetic risk or association with good interpretability and desirable convergence rates. Meanwhile, since it is built upon previous stages that are robust to model misspecification, stage III will be valid even when the target genetic model is misspecified.
Denote by and for . To get rid of the curse of dimensionality in modeling jointly against through a multivariate nonparametric model, we consider a working conditional independence assumption across given , implying an additive logistic form of their joint model:
(3) |
where is an intercept term introduced such that . As will be introduced in Section 2.2, under this construction, we can model each with separately and combine them with a composite likelihood to estimate ’s, as if . Then we will combine the estimators of through (3) to derive an estimate for the phenotyping score . As we will discuss later, due to our use of the composite likelihood, violation of the additive model (3) will not invalidate the downstream results.
For the genetic variants , we will consider two scenarios: (i) contains multi-dimensional discrete SNP features ranging over ; and (ii) is a univariate continuous gene risk score. For (i), we introduce the categorical functions covering all the possible combinations of the discrete SNPs in while for (ii), we use the spline (sieve) basis functions of . In both cases, we specify the nonparametric model of as
(4) |
where is a set of bases with possibly diverging dimensionality, used to approximate any (smooth) functions of . Note that model (4) is a nuisance model introduced to avoid model misspecification in the first stage of our method. Our final goal is to estimate the parametric model (2) with a more desirable convergence rate as well as easier interpretation than (4). This is especially advantageous when the genetic association is mild or small and thus requires small enough estimation uncertainty to detect.
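The two basis constructions mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `categorical_basis` and `truncated_power_basis` are hypothetical, the SNP coding `{0, 1, 2}` is an assumption (the original range is not recoverable from the text), and a truncated power spline is swapped in for whatever spline family is actually used.

```python
from itertools import product

import numpy as np


def categorical_basis(snps, levels=(0, 1, 2)):
    """One indicator column per possible combination of discrete SNP values.

    `levels` is an assumed coding of each SNP (e.g. minor-allele counts).
    """
    snps = np.asarray(snps)
    combos = list(product(levels, repeat=snps.shape[1]))
    index = {combo: k for k, combo in enumerate(combos)}
    basis = np.zeros((snps.shape[0], len(combos)))
    for i, row in enumerate(snps):
        basis[i, index[tuple(int(v) for v in row)]] = 1.0
    return basis


def truncated_power_basis(x, knots, degree=3):
    """Spline (sieve) basis for a univariate continuous score.

    Columns: 1, x, ..., x^degree, then (x - knot)_+^degree for each knot.
    """
    x = np.asarray(x, dtype=float)
    cols = [x ** d for d in range(degree + 1)]
    cols += [np.clip(x - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(cols)
```

Note that the categorical basis grows exponentially in the number of SNPs, so it is only practical for a handful of variants; increasing the number of knots in the spline basis plays the role of the diverging sieve dimension.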
2.2 Stage I: sieve-approximated composite likelihood
We first focus on the estimation of ’s and . To ensure the validity while incorporating the additional genetic information, we consider a composite log-likelihood formulated under our key assumption that and a working independence condition of given :
where is the -th EHR outcome of subject . As is outlined in Section 2.1, due to potential misspecification of the parametric models like (2), we model nonparametrically by (4), and adopt a similar sieve construction on each
where is a vector of basis functions used to approximate . For discrete , we naturally set as its dummy variables. For continuous , we again use sieve. Then we can construct the sieve-approximated composite likelihood as:
where , , , and we denote by and . To solve for that maximizes , we propose to use an expectation-maximization (EM) algorithm outlined in Algorithm 1.
Input: Observed data .
Initialize with obtained by Algorithm A2. Iterate on the following two steps for until convergence.
E-step. For each subject and outcome (or if observed: ), impute the probability for the unobserved conditional on the covariates in each component of the composite likelihood:
M-step. Update through the maximum likelihood estimation (MLE) specified with the imputed outcomes from the E-step:
Output:
Algorithm 1 iterates on two main steps. First, there is an E-step imputing the unobserved true outcome separately conditional on each or as the set of features appearing in each component of the composite likelihood. Unlike the EM algorithms for joint likelihood objectives, our method does not involve any imputation model of using the whole set of observed variables . This in turn ensures the validity free of any assumptions on the joint distribution of that is hard to characterize due to the curse of dimensionality. Second, Algorithm 1 involves an M-step solving for through MLE constructed using the imputed ’s. Again, corresponding to the composite likelihood construction, and ’s for different error-prone outcomes are solved separately based on their own imputed outcomes.
In Theorem 1 presented later, we show that Algorithm 1 maintains an ascent property on the objective composite likelihood function that is desirable for optimization. Nevertheless, it is still practically crucial to have a good initial estimator for Algorithm 1 to avoid being trapped at a poor local optimum. In response to this, we propose in Algorithm A2 of the Appendix to derive through MLE constructed as if was the true outcome, i.e., the logistic regression of against or each . For , we set it up with a proper guess presuming that is informative.
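To make the structure of the E-step and M-step concrete, the following is a simplified sketch of a composite-likelihood EM, under assumptions that go beyond the text: each surrogate is modeled as Gaussian given the latent status (instead of the sieve construction), the chart review label is omitted, and the initialization is a crude quantile-based guess instead of Algorithm A2. All names (`composite_em`, `soft_logistic`, `B`, `S`) are hypothetical.

```python
import numpy as np


def expit(z):
    return 1.0 / (1.0 + np.exp(-z))


def normal_pdf(s, mean, sd):
    return np.exp(-0.5 * ((s - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def soft_logistic(B, p, beta, n_newton=25):
    """Newton steps for logistic regression with fractional outcomes p in [0, 1]."""
    for _ in range(n_newton):
        mu = expit(B @ beta)
        grad = B.T @ (p - mu)
        hess = (B * (mu * (1.0 - mu))[:, None]).T @ B
        beta = beta + np.linalg.solve(hess + 1e-8 * np.eye(B.shape[1]), grad)
    return beta


def composite_em(B, S, n_iter=200):
    """EM for a composite likelihood with one mixture component per surrogate.

    B: (n, q) basis matrix of the risk factors; S: (n, J) surrogates.
    Assumed working model: S_j | Y = y ~ Normal(mean_jy, sd_j^2).
    """
    n, _ = S.shape
    beta = np.zeros(B.shape[1])
    # Crude start that labels the class with larger surrogate values as Y = 1.
    mean1 = np.quantile(S, 0.75, axis=0)
    mean0 = np.quantile(S, 0.25, axis=0)
    sd = S.std(axis=0)
    for _ in range(n_iter):
        # E-step: impute P(Y = 1 | S_j, X) separately for each surrogate j.
        prior = expit(B @ beta)[:, None]
        f1 = normal_pdf(S, mean1, sd)
        f0 = normal_pdf(S, mean0, sd)
        post = prior * f1 / (prior * f1 + (1.0 - prior) * f0)
        # M-step for the shared risk model: the sum over j of the soft-label
        # logistic log-likelihoods equals J times the one with averaged labels.
        beta = soft_logistic(B, post.mean(axis=1), beta)
        # M-step for each surrogate model, using only its own imputations.
        w1 = post.sum(axis=0)
        w0 = (1.0 - post).sum(axis=0)
        mean1 = (post * S).sum(axis=0) / w1
        mean0 = ((1.0 - post) * S).sum(axis=0) / w0
        resid = post * (S - mean1) ** 2 + (1.0 - post) * (S - mean0) ** 2
        sd = np.sqrt(resid.sum(axis=0) / n)
    return beta, mean0, mean1, sd
```

The point of the sketch is the separation: no imputation ever conditions on all surrogates jointly, so no joint model of the surrogates is required. The starting values also fix the labeling of the two latent classes, which is one simple way to handle label switching.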
2.3 Stage II: condensing EHR features for phenotyping
With the fitted estimator in Stage I, we derive , serving as a phenotype score condensing the outcomes . For , we further adopt a nonparametric likelihood approach that combines it with to derive an imputation model for . Since ensembles multiple EHR outcomes, it tends to be more predictive of than each single . So this procedure can be more efficient than modeling each single separately in and is thus more favorable for the downstream analysis. As implied by (3), the optimal ensemble is only when the working assumption holds. When there is strong evidence that such conditional independence does not hold, an alternative strategy is to set the phenotyping score as the first principal component of for , to make it representative of the multiple EHR outcomes.
Again, we will not rely on any parametric or model structural assumptions on the sensitivity function for and that captures . In this case, the log-likelihood function can be written as
Without any further constraint on , the above log-likelihood function will not have a unique maximizer. Thus, inspired by existing literature in nonparametric MLE (e.g., Murphy and Van der Vaart, 2000), we restrict to be a step function that can only jump at the observed data points , and denote its jump size at each as . If the true status was observed, the MLE for under this step-function constraint would be derived as
Based on this, our objective becomes to maximize
(5) |
where , under the step-function constraints on . Since we do not specify the correlation or dependence between and , we still adopt a composite strategy to model them in (5). But unlike the fully composite likelihood of Stage I, which also treats separately, we now condense ’s into a single .
2.4 Stage III: genetic risk modeling and EHR phenotype validation
In Steps (I) and (II) introduced above, we fit nonparametric models for to make the estimators and more robust to model misspecification. In practice, directly using such nonparametric models for genetic association analysis often results in large variance and low efficiency due to the curse of dimensionality. Thus, in this step, we leverage the extracted to construct a parametric genetic risk model for the true outcome against . Specifically, with , we characterize for all , and for as
which coincides with the imputation of the unobserved in the last E-step of Algorithm A2. Note that is not necessarily consistent for unless the working independence assumption (3) holds and . Then we conduct logistic regression for the imputed outcomes and separately against , to obtain estimators
Although , the standard error of may still be smaller than that of since is typically less informative than the chart review labels in terms of measuring the true . To derive a more efficient estimator, the final step is to assemble and as:
where is a weight determined using the data to minimize the variance of among all convex combinations of and . When , we can show that and are asymptotically independent, and, thus, the optimal weight , where and represent the estimated standard error of and . In general, we can take
where is the asymptotic covariance matrix of computed using bootstrap. Since the true disease status is unobserved, the estimators and are subject to the issue that the switch between and cannot be identified from the observed data. To address this, we assume the coefficient for to be greater than zero with chosen as an informative feature to . Correspondingly, we shall flip the sign of the fitted or if or . Alternatively, one could also restrict the prevalence of to be smaller than , which does not require the knowledge of some informative feature .
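The weighting step can be illustrated with standard minimum-variance formulas for a convex combination of two estimators of the same scalar parameter. This is a sketch under assumptions: the function names are hypothetical, the inputs are taken to be per-coordinate estimates with (for example, bootstrap) variances and covariance, and the weight is clipped to `[0, 1]` to stay within convex combinations.

```python
import numpy as np


def combine_independent(est_a, se_a, est_b, se_b):
    """Inverse-variance weighting for two (asymptotically) independent estimators."""
    w = se_b ** 2 / (se_a ** 2 + se_b ** 2)
    return w * est_a + (1.0 - w) * est_b, w


def combine_correlated(est_a, est_b, var_a, var_b, cov_ab):
    """Minimum-variance convex combination when the estimators are correlated.

    Minimizes w^2 var_a + (1 - w)^2 var_b + 2 w (1 - w) cov_ab over w in [0, 1].
    """
    w = (var_b - cov_ab) / (var_a + var_b - 2.0 * cov_ab)
    w = np.clip(w, 0.0, 1.0)
    return w * est_a + (1.0 - w) * est_b, w
```

With zero covariance the second function reduces to the first, so the independent case is simply the special case in which the two estimators share no labeled subjects.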
As a by-product, we are also able to validate the derived phenotyping score using the fitted sensitivity function . Denote the limiting (population-level) function of as . The true positive rate (TPR) and false positive rate (FPR) of the classifier or on the true label can be naturally estimated using and respectively. Furthermore, the receiver operating characteristic (ROC) curve of or can be estimated by for , and the area under ROC .
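As a rough illustration of validating a score without observing the true label, the sketch below computes a plug-in ROC curve in which each subject contributes to the positive class with an imputed probability and to the negative class with its complement. This is an assumed simplification, not the nonparametric likelihood estimator of the sensitivity function described above; `soft_label_roc`, `score`, and `prob` are hypothetical names, and ties in the score are not treated specially.

```python
import numpy as np


def soft_label_roc(score, prob):
    """ROC curve and AUC of `score` against probabilistic (imputed) labels.

    score: (n,) phenotyping score; prob: (n,) imputed P(Y = 1 | data).
    """
    score = np.asarray(score, dtype=float)
    prob = np.asarray(prob, dtype=float)
    order = np.argsort(-score)  # sweep the cutoff from high to low
    p = prob[order]
    tpr = np.concatenate([[0.0], np.cumsum(p) / p.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1.0 - p) / (1.0 - p).sum()])
    # Trapezoidal rule written out explicitly.
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)
    return fpr, tpr, auc
```

When `prob` contains only zeros and ones and the scores have no ties, this reduces to the usual empirical ROC curve.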
3 Asymptotic analysis
In this section, we provide asymptotic analysis of the TUBE estimators , , and resulting from the steps described in Sections 2.2–2.4. We consider as a continuous univariate gene risk score and as its spline basis function. Let and be the population-level (true) parameters. We define the norm of to be and the norm of to be . We first introduce smoothness and regularity assumptions as follows.
Assumption 1
Covariates have compact domain with their joint probability density function being twice continuously differentiable. For all and , and are twice continuously differentiable. For , , the derivative of is continuously differentiable.
Assumption 2
The parameter spaces of and are compact. The Hessian matrix has all its eigenvalues bounded away from and . For any and , and are twice continuously differentiable with respect to , , and .
Remark 2
Assumption 1 consists of mild smoothness conditions commonly used for the asymptotic analysis of M-estimation and sieve-smoothed regression (e.g., Van der Vaart, 2000; Chen, 2007). Assumption 2 requires the non-singularity of the Hessian matrix as well as the strong convexity of the loss functions, which have also been frequently used in the literature.
Remark 3
When and are discrete, e.g., being the categorical functions of several SNPs, Assumption 1 will be as given. In such a situation with discrete , the sensitivity function will only have finitely many choices of the cutoff , and the asymptotic analysis of its estimator degenerates and simplifies.
Next, we establish the consistency and asymptotic normality for the phenotyping score in Theorem 2, as well as those for the estimator of its sensitivity function in Theorem 3. Let be the dimensionality of the bases and , which is assumed to increase with .
Theorem 2
Theorem 3
Under all assumptions in Theorem 2, as , converges to in probability, and for , converges weakly to some zero-mean Gaussian process for .
Considering that our primary goal is the genetic risk estimation with , we under-smooth the sieve estimator of by taking slightly larger than , to achieve the asymptotic unbiasedness and normality of that will be established in Theorem 4. This choice of does not lead to the optimal convergence rate of the by-products and . To further refine these estimators, one just needs to take and carry out Steps I and II. This leads to the -convergence of and , an improvement compared to the current -convergence. However, the estimator derived with cannot ensure the desirable parametric rate and asymptotic normality of and obtained in Step III. See existing literature such as Chen (2007) for more relevant results.
Finally, we establish the convergence properties of and , which imply the $\sqrt{n}$-consistency and asymptotic normality of the TUBE estimator .
Theorem 4
Under all assumptions in Theorem 2, both and converge to in probability and converges weakly to a zero-mean Gaussian distribution.
4 Simulation
We conduct comprehensive simulation studies to evaluate the finite-sample performance of the proposed method. Let Binomial denote the binomial distribution with trials and a success probability of . To generate risk factors , we consider with , and , , generated independently from Binomial. For generation of the unobserved true outcome and EHR surrogates , we consider the following three settings:
-
(a)
where ; and where , are independent standard normal noises.
-
(b)
, with generated given in the same way as (a).
-
(c)
; and where , are independent standard normal noises.
In all settings, we set and generate from . As discussed earlier, is supposed to be an imperfect but more informative outcome compared to . Our setup mimics this by imposing a much stronger effect of on . We also let the size of labels range from to to investigate its influence on the efficiency of the methods.
We consider the following three methods for comparison: (1) the simple approach referred to as Naive-Logistic, which directly uses the label as the outcome for analysis; (2) our main benchmark, Hong et al. (2019), which uses the composite likelihood approach with parametric modeling on and ; (3) the proposed TUBE approach with and the basis functions and specified as the natural spline with the degree of freedom as . Note that the method of Hong et al. (2019) is fully parametric and, thus, will incur model misspecification in settings (b) and (c) due to the non-linearity of . In setting (c), we introduce a small indirect effect of on given that moderately breaks our key independence assumption . This is to examine the sensitivity to a (slight) violation of this assumption.
The parameters of interest include , the logistic model coefficients obtained by regressing against , as well as the accuracy parameter AUC of against the phenotyping score obtained in each method. The population-level parameters of and are computed by generating an extremely large sample. Our evaluation metrics include mean squared error (MSE) in Figure 2, percent bias in Figure 3, i.e., the ratio between absolute bias and root MSE, and coverage probability (CP) of the 95% CI computed using the standard resampling bootstrap procedure; see Figure 4. The results in Figures 2-4 are obtained based on simulation replications. For the multi-dimensional , we only present the average performance over in these figures, and the element-wise results can be found in the tables of Appendix B.
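The three evaluation metrics can be computed from the replicated estimates as in the short sketch below. The function name `simulation_metrics` and its arguments are hypothetical; percent bias follows the definition given above (absolute bias divided by root MSE), reported here on a 0 to 100 scale.

```python
import numpy as np


def simulation_metrics(estimates, ci_lower, ci_upper, truth):
    """Summaries over simulation replicates for one scalar parameter.

    estimates, ci_lower, ci_upper: (R,) arrays, one entry per replicate.
    truth: population-level value of the parameter.
    """
    estimates = np.asarray(estimates, dtype=float)
    ci_lower = np.asarray(ci_lower, dtype=float)
    ci_upper = np.asarray(ci_upper, dtype=float)
    bias = estimates.mean() - truth
    mse = np.mean((estimates - truth) ** 2)
    percent_bias = 100.0 * abs(bias) / np.sqrt(mse)
    coverage = np.mean((ci_lower <= truth) & (truth <= ci_upper))
    return {"mse": mse, "percent_bias": percent_bias, "coverage": coverage}
```

Because the squared bias is one component of the MSE, this ratio lies between 0 and 100, and a small value indicates that the error is dominated by variance and not by systematic bias.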
In all settings, Naive-Logistic shows large MSEs and percent biases due to the error of in measuring the true . In setting (a), TUBE attains performance close to that of the benchmark method of Hong et al. (2019), which relies on a fully parametric modeling strategy and does not encounter the model misspecification issue in this setting. Specifically, the percentage difference in the MSE between the two methods is smaller than on all parameters when in setting (a). Also, both methods attain small enough percent bias and desirable coverage probability on and AUC. Thus, although it seems redundant to use a more complex semiparametric modeling strategy in TUBE compared to Hong et al. (2019) when the true models are indeed linear and parametric, this complexity does not result in a loss of validity or efficiency for TUBE. This result is in line with our conclusions in Section 3 that the sieve estimators do not impact the parametric rate of our estimator for due to under-smoothing.
In settings (b) and (c), under which the fully parametric method of Hong et al. (2019) suffers from severe model misspecification, TUBE achieves significantly better performance than Hong et al. (2019) and ensures the validity of inference. For example, under setting (b) with , the average MSE of TUBE on is more than 90% smaller than that of Hong et al. (2019). Also, TUBE successfully maintains a small percent bias (5%–10%) and appropriate coverage probability, while Hong et al. (2019) fails to provide valid inference, with average coverage rates around 30% below the nominal level of 95% in setting (b). This substantial improvement of TUBE results from the nonparametric construction in our Steps I and II, which protects our approach against bias due to the nonlinear effects.
In addition, we notice that as the labeled sample size increases, the MSEs of TUBE on and AUC gradually decrease as provides additional information over . For example, when increases from to , TUBE’s MSE on AUC decreases by more than in all settings. Recall that in practice and in our simulation setup, is usually more informative than even though both of them contain errors in measuring the true . Thus, moderately increasing the size of could result in an efficiency gain even with the total sample size unchanged. Meanwhile, we do not see improvement of Naive-Logistic and Hong et al. (2019) as increases in settings (b) and (c), probably because of their large bias.
5 Real Example
The rising incidence of Type II diabetes mellitus (T2D) in recent years has raised great health concern. Previous genome-wide association studies (GWAS) have identified many genetic variations associated with insulin resistance or inadequate insulin production contributing to T2D (Mahajan et al., 2018). Consequently, genetic risk scores (GRS) have been developed to predict an individual’s genetic risk of developing T2D (He et al., 2021). These advancements provide great potential for precision medicine approaches in the prevention and management of T2D. In this application, we study the Mass General Brigham (MGB) biobank data (Castro et al., 2022) with a primary goal to build a genetic risk prediction model for T2D using its GRS and demographic information.
Our data set includes MGB biobank participants up to 2021 with their available EHR features updated for the same year. Their risk factors contain , a one-dimensional GRS for T2D derived using the reported variants and effect sizes of Mahajan et al. (2018), as well as gender denoted as ( for Female). The EHR surrogates include , the log-transformed total count of the International Classification of Diseases (ICD) codes for T2D and , the value of hemoglobin A1C obtained via laboratory tests. In addition, we have collected on a subset of patients as the manual chart review label for T2D status created by clinicians in 2014. Due to the gap between the time windows of data collection, is an imperfect label for the true T2D status, with its potential measurement error coming from the missingness of information between 2014 and 2021, as well as the switch of the ICD system from version 9 to 10 around 2015 at MGB. For the purpose of validation, we also extract the chart review labels created by clinicians according to all information up to 2021 on a random subsample of the data with size . These labels are closer to (arguably identical to) the true T2D status and are only used for validation and evaluation of the estimators trained on the set .
In addition to Hong et al. (2019) and Naive-Logistic studied in Section 4, we also include four simple benchmark estimators obtained through logistic regression against using I(ICD1), I(ICD2), I(A1C5.7) and I(A1C6.4), respectively, as the binary outcomes. All of them are common and convenient ways to screen subjects with T2D that are frequently used in existing biomedical studies and practice. As a secondary analysis, we also estimate the AUC of the two important surrogates ICD and A1C using the imputation for in TUBE and other methods, except the aforementioned approaches that directly use ICD or A1C to construct the outcome. This aim is slightly different from evaluating the derived phenotyping score considered in Sections 2 and 4, but it can be realized using nearly the same strategy and is typically more useful for clinicians and researchers in practice. We use 200 bootstrap replicates to quantify the variance of all the estimators. The resulting estimators with their standard errors are presented in Table 1.
Using the validation set with the true label , we obtain a validation estimator and evaluate the AUC of ICD and A1C. Evaluation metrics of the estimators for include: (1) mean square prediction error (MSPE) defined as the sample mean of ; (2) Deviance of the logistic model evaluated on the target data; (3) classifier’s correlation (Class. Cor) with , i.e., the sample correlation of and where is the sample mean of ; and (4) false classification rate (False Class.) compared to , i.e., the empirical probability of . The evaluation results are presented in Table 2.
Table 1

Method | (Intercept) | (GRS) | (Gender) | AUC(ICD) | AUC(A1C) |
---|---|---|---|---|---|
ICD | | | | – | – |
ICD | | | | – | – |
A1C | | | | – | – |
A1C | | | | – | – |
Naive-Logistic | | | | | |
Hong et al. 2019 | | | | | |
TUBE | | | | | |
Validation | | | | | |


Table 2

Method | MSPE | Deviance | Class. Cor | False Class. |
---|---|---|---|---|
ICD | | | | |
ICD | | | | |
A1C | | | | |
A1C | | | | |
Naive-Logistic | | | | |
Hong et al. 2019 | | | | |
TUBE | | | | |
Validation | | | | |

Among all methods under comparison, TUBE attains the closest point estimates to the validation estimator in terms of both and AUC. For example, the AUC of A1C evaluated using TUBE-imputed outcomes differs from the validation estimator by only around , while all the other estimators show gaps of more than to the validation estimator. The estimation performance on is examined more closely in Table 2, where TUBE achieves the best results on all metrics among all estimators except for . For example, compared to the recent method proposed by Hong et al., (2019), our method attains more than reduction in MSPE, and larger classifier’s correlation with the validation estimator. These results illustrate the effectiveness of our semiparametric modeling strategy in reducing potential bias due to misspecification. Meanwhile, although TUBE involves more complicated nonparametric regression, it does not result in significant inflation of the standard errors compared to Hong et al., (2019), which is a benefit of using parametric regression (projection) in Stage III.
Our estimator of reveals that the GRS has a significant positive effect (log(OR)=, 95% CI: ) on the risk of T2D and that men have a significantly higher risk of developing T2D than women in our study cohort. Interestingly, the effect sizes estimated using the four simple EHR outcomes, i.e., I(ICD1), I(ICD2), I(A1C5.7), and I(A1C6.4), are all smaller than and estimated by TUBE. One explanation is that, once the error-prone EHR outcomes are converted to binary variables, they are on the same scale as the true outcome and thus show weaker association with the risk factors than because of their measurement errors. This can be justified under the key assumption that ICD and A1C are independent of the baseline risk factors given the true T2D status.
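The attenuation argument can be made explicit with a short calculation. Here $S$ denotes a generic binarized EHR outcome, $Y$ the true status, $X$ the risk factors, and $\mathrm{se}$ and $\mathrm{sp}$ the sensitivity and specificity of $S$ for $Y$; this notation is introduced only for this sketch and is not the paper's. If $S$ is independent of $X$ given $Y$, then

```latex
\begin{aligned}
P(S = 1 \mid X)
  &= \mathrm{se}\, P(Y = 1 \mid X) + (1 - \mathrm{sp})\,\{1 - P(Y = 1 \mid X)\} \\
  &= (1 - \mathrm{sp}) + (\mathrm{se} + \mathrm{sp} - 1)\, P(Y = 1 \mid X).
\end{aligned}
```

Since $0 < \mathrm{se} + \mathrm{sp} - 1 \le 1$ for any surrogate that is better than chance, the dependence of $P(S = 1 \mid X)$ on $X$ is a compressed version of that of $P(Y = 1 \mid X)$, which is consistent with the smaller effect sizes seen for the thresholded outcomes.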
6 Discussion
In summary, we propose TUBE, a novel unsupervised method for analyzing multiple error-prone EHR outcomes and noisy labels against baseline risk factors, such as genetic variants extracted from EHR linked biobanks. TUBE incorporates a nonparametric composite regression step, and then uses it to combine the EHR outcomes for phenotyping and derive a parametric genetic risk model through projection. Compared to existing methods, our semiparametric strategy has two advantages. First, the nonparametric composite construction at the first stage safeguards the unsupervised learning against potential bias due to model misspecification. Second, the derived parametric genetic risk model obtained through projection enhances interpretability and achieves and significantly reduced variance in comparison to a fully nonparametric approach. These advantages are supported by our comprehensive asymptotic analysis, simulations, and a real-world study.
We acknowledge several limitations and potential extensions of our work. First, the validity of our method is vulnerable to severe violations of the conditional independence assumption between the EHR outcomes and the baseline covariates. This issue can be alleviated by incorporating (small) samples with the true labels to calibrate the unsupervised estimator derived from surrogates. Recent advancements in surrogate-assisted semi-supervised learning (Zhang et al.,, 2022; Hou et al., 2023b, ) are particularly relevant to this discussion. Second, our current setup focuses on binary disease status. In current biomedical studies, the time to onset of clinical events (e.g., cancer relapse) is often not readily available, and its EHR surrogates are subject to measurement errors. Simple estimates of the event time based on billing or procedure codes may poorly approximate the true outcome and lead to bias. Therefore, expanding TUBE to incorporate multiple sources of imperfect and temporal endpoints under the survival setting is a potential direction for future research. In addition, our current method only accommodates low-dimensional genetic variants and a single disease or phenotype. Recent large-scale genome- and phenome-wide studies (Huang and Labrecque,, 2019; Verma et al.,, 2023, e.g.) provide strong motivation for extensions that accommodate high-dimensional or machine learning estimates of the genetic risk models and multi-phenotype studies.
References
- Athey et al., (2019) Athey, S., Chetty, R., Imbens, G. W., and Kang, H. (2019). The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Technical report, National Bureau of Economic Research.
- Banda et al., (2017) Banda, J. M., Halpern, Y., Sontag, D., and Shah, N. H. (2017). Electronic phenotyping with APHRODITE and the Observational Health Sciences and Informatics (OHDSI) data network. AMIA Summits on Translational Science Proceedings, 2017:48.
- Banda et al., (2018) Banda, J. M., Seneviratne, M., Hernandez-Boussard, T., and Shah, N. H. (2018). Advances in electronic phenotyping: from rule-based definitions to machine learning models. Annual Review of Biomedical Data Science, 1:53–68.
- Bonhomme et al., (2016) Bonhomme, S., Jochmans, K., Robin, J.-M., et al. (2016). Estimating multivariate latent-structure models. The Annals of Statistics, 44(2):540–563.
- Castro et al., (2022) Castro, V. M., Gainer, V., Wattanasin, N., Benoit, B., Cagan, A., Ghosh, B., Goryachev, S., Metta, R., Park, H., Wang, D., et al. (2022). The mass general brigham biobank portal: an i2b2-based data repository linking disparate and high-dimensional patient data to support multimodal analytics. Journal of the American Medical Informatics Association, 29(4):643–651.
- Chen, (2007) Chen, X. (2007). Chapter 76 large sample sieve estimation of semi-nonparametric models. volume 6 of Handbook of Econometrics, pages 5549–5632. Elsevier.
- Denny et al., (2013) Denny, J. C., Bastarache, L., Ritchie, M. D., Carroll, R. J., Zink, R., Mosley, J. D., Field, J. R., Pulley, J. M., Ramirez, A. H., Bowton, E., et al. (2013). Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nature biotechnology, 31(12):1102–1111.
- He et al., (2021) He, Y., Lakhani, C. M., Rasooly, D., Manrai, A. K., Tzoulaki, I., and Patel, C. J. (2021). Comparisons of polyexposure, polygenic, and clinical risk scores in risk prediction of type 2 diabetes. Diabetes Care, 44(4):935–943.
- Hong et al., (2019) Hong, C., Liao, K. P., and Cai, T. (2019). Semi-supervised validation of multiple surrogate outcomes with application to electronic medical records phenotyping. Biometrics, 75(1):78–89.
- Hou et al., (2023a) Hou, J., Chan, S. F., Wang, X., and Cai, T. (2023a). Risk prediction with imperfect survival outcome information from electronic health records. Biometrics, 79(1):190–202.
- Hou et al., (2023b) Hou, J., Guo, Z., and Cai, T. (2023b). Surrogate assisted semi-supervised inference for high dimensional risk prediction. Journal of Machine Learning Research, 24(265):1–58.
- Hou et al., (2021) Hou, J., Mukherjee, R., and Cai, T. (2021). Efficient and robust semi-supervised estimation of ate with partially annotated treatment and response. arXiv preprint arXiv:2110.12336.
- Huang et al., (2018) Huang, J., Duan, R., Hubbard, R. A., Wu, Y., Moore, J. H., Xu, H., and Chen, Y. (2018). Pie: A prior knowledge guided integrated likelihood estimation method for bias reduction in association studies using electronic health records data. Journal of the American Medical Informatics Association, 25(3):345–352.
- Huang and Labrecque, (2019) Huang, J. Y. and Labrecque, J. A. (2019). From gwas to phewas: the search for causality in big data. The Lancet Digital Health, 1(3):e101–e103.
- Kallus and Mao, (2020) Kallus, N. and Mao, X. (2020). On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. arXiv preprint arXiv:2003.12408.
- Kohane, (2011) Kohane, I. S. (2011). Using electronic health records to drive discovery in disease genomics. Nature Reviews Genetics, 12(6):417–428.
- Liao et al., (2015) Liao, K. P., Cai, T., Savova, G. K., Murphy, S. N., Karlson, E. W., Ananthakrishnan, A. N., Gainer, V. S., Shaw, S. Y., Xia, Z., Szolovits, P., et al. (2015). Development of phenotype algorithms using electronic medical records and incorporating natural language processing. bmj, 350:h1885.
- Liao et al., (2013) Liao, K. P., Kurreeman, F., Li, G., Duclos, G., Murphy, S., Guzman, R., Cai, T., Gupta, N., Gainer, V., Schur, P., et al. (2013). Associations of autoantibodies, autoimmune risk alleles, and clinical diagnoses from the electronic medical records in rheumatoid arthritis cases and non–rheumatoid arthritis controls. Arthritis & Rheumatology, 65(3):571–581.
- Liao et al., (2019) Liao, K. P., Sun, J., Cai, T. A., Link, N., Hong, C., Huang, J., Huffman, J. E., Gronsbell, J., Zhang, Y., Ho, Y.-L., Castro, V., Gainer, V., Murphy, S. N., O’Donnell, C. J., Gaziano, J. M., Cho, K., Szolovits, P., Kohane, I. S., Yu, S., and Cai, Tianxi, w. t. M. V. P. (2019). High-throughput multimodal automated phenotyping (MAP) with application to PheWAS. Journal of the American Medical Informatics Association, 26(11):1255–1262.
- Mahajan et al., (2018) Mahajan, A., Taliun, D., Thurner, M., Robertson, N. R., Torres, J. M., Rayner, N. W., Payne, A. J., Steinthorsdottir, V., Scott, R. A., Grarup, N., et al. (2018). Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nature genetics, 50(11):1505–1513.
- Murphy and Van der Vaart, (2000) Murphy, S. A. and Van der Vaart, A. W. (2000). On profile likelihood. Journal of the American Statistical Association, 95(450):449–465.
- Shivade et al., (2014) Shivade, C., Raghavan, P., Fosler-Lussier, E., Embi, P. J., Elhadad, N., Johnson, S. B., and Lai, A. M. (2014). A review of approaches to identifying patient phenotype cohorts using electronic health records. Journal of the American Medical Informatics Association, 21(2):221–230.
- Van der Vaart, (2000) Van der Vaart, A. W. (2000). Asymptotic statistics, volume 3. Cambridge university press.
- Verma et al., (2023) Verma, A., Huffman, J. E., Rodriguez, A., Conery, M., Liu, M., Ho, Y.-L., Kim, Y., Heise, D. A., Guare, L., Panickan, V. A., et al. (2023). Diversity and scale: genetic architecture of 2,068 traits in the va million veteran program. medRxiv.
- Wells et al., (2019) Wells, Q. S., Gupta, D. K., Smith, J. G., Collins, S. P., Storrow, A. B., Ferguson, J., Smith, M. L., Pulley, J. M., Collier, S., Wang, X., et al. (2019). Accelerating biomarker discovery through electronic health records, automated biobanking, and proteomics. Journal of the American College of Cardiology, 73(17):2195–2205.
- Yu et al., (2017) Yu, S., Ma, Y., Gronsbell, J., Cai, T., Ananthakrishnan, A. N., Gainer, V. S., Churchill, S. E., Szolovits, P., Murphy, S. N., Kohane, I. S., et al. (2017). Enabling phenotypic big data with phenorm. Journal of the American Medical Informatics Association, 25(1):54–60.
- Yu et al., (2019) Yu, T., Li, P., Qin, J., et al. (2019). Maximum smoothed likelihood component density estimation in mixture models with known mixing proportions. Electronic Journal of Statistics, 13(2):4035–4078.
- Zhang et al., (2019a) Zhang, L., Ding, X., Ma, Y., Muthu, N., Ajmal, I., Moore, J. H., Herman, D. S., and Chen, J. (2019a). Electronic health record phenotyping with internally assessable performance (phiap) using anchor-positive and unlabeled patients. arXiv preprint arXiv:1902.10060.
- Zhang et al., (2019b) Zhang, Y., Cai, T., Yu, S., Cho, K., Hong, C., Sun, J., Huang, J., Ho, Y.-L., Ananthakrishnan, A. N., Xia, Z., et al. (2019b). High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (phecap). Nature protocols, 14(12):3426–3444.
- Zhang et al., (2022) Zhang, Y., Liu, M., Neykov, M., and Cai, T. (2022). Prior adaptive semi-supervised learning with application to ehr phenotyping. The Journal of Machine Learning Research, 23(1):3617–3641.
- Zheng and Wu, (2019) Zheng, C. and Wu, Y. (2019). Nonparametric estimation of multivariate mixtures. Journal of the American Statistical Association, pages 1–16.
Appendix
Appendix A Additional implementation details
Input: Observed data , and the phenotyping score derived in Algorithm 1.
Initialize with introduced in Algorithm A2. Iterate on the following two steps for until convergence.
E-step. For each subject , impute the probability for conditional on (if observed) or :
M-step. Update through the MLE specified with the imputed outcomes from the E-step:
Output: The imputed outcomes (if ) and for .
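The exact E- and M-step formulas of this algorithm are not legible in the text above, so the sketch below is a deliberately simplified stand-in rather than the paper's procedure. It assumes that the phenotyping score can be read as a fixed prior probability of disease, that the noisy label has an unknown sensitivity and specificity that do not depend on covariates, and that unlabeled subjects are imputed by their prior alone. The name `em_noisy_labels` and the simulated data are hypothetical.

```python
import random


def em_noisy_labels(prior, y_star, n_iter=500, tol=1e-8):
    """EM for the sensitivity/specificity of a noisy binary label.

    prior[i]  : P(Y_i = 1) from a phenotyping score, treated as fixed (assumption).
    y_star[i] : noisy label in {0, 1}, or None when subject i is unlabeled.
    Returns (se, sp, post) with post[i] the imputed P(Y_i = 1 | data).
    """
    se, sp = 0.8, 0.8  # starting values
    post = list(prior)
    for _ in range(n_iter):
        # E-step: posterior of the true status given the noisy label (if any).
        post = []
        for p, lab in zip(prior, y_star):
            if lab is None:
                post.append(p)  # no label: fall back on the score
                continue
            a = p * (se if lab == 1 else 1.0 - se)
            b = (1.0 - p) * ((1.0 - sp) if lab == 1 else sp)
            post.append(a / (a + b))
        # M-step: weighted MLE of the two misclassification rates.
        pairs = [(q, lab) for q, lab in zip(post, y_star) if lab is not None]
        new_se = sum(q * lab for q, lab in pairs) / sum(q for q, _ in pairs)
        new_sp = (sum((1 - q) * (1 - lab) for q, lab in pairs)
                  / sum(1 - q for q, _ in pairs))
        converged = abs(new_se - se) + abs(new_sp - sp) < tol
        se, sp = new_se, new_sp
        if converged:
            break
    return se, sp, post


# Simulated check: true sensitivity 0.90, specificity 0.85, 25% unlabeled.
rng = random.Random(1)
prior = [rng.uniform(0.05, 0.95) for _ in range(6000)]
truth = [int(rng.random() < p) for p in prior]
noisy = [int(rng.random() < 0.90) if t else int(rng.random() < 0.15) for t in truth]
y_star = [None if i % 4 == 0 else lab for i, lab in enumerate(noisy)]
se_hat, sp_hat, post = em_noisy_labels(prior, y_star)
```

Treating the score as a fixed prior keeps both steps in closed form; a fuller version would also update the parameters linking the score to the true status in the M-step.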
Appendix B Additional numerical results
In this section, we attach more complete simulation results as a supplement to the main results presented in Section 4.
(a) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | =-4.600 | = 1.600 | = 1.600 | = 1.600 | = 1.600 | AUC=0.702 | =0.320 | =0.490 | =0.190 | =0.700 | =0.280 | =0.030 |
Naive-Logistic100 | 2.965 | -1.354 | -1.351 | -1.343 | -1.328 | -0.088 | - | - | - | - | - | - |
Hong et al100 | -0.050 | 0.013 | 0.006 | 0.020 | 0.014 | 0.001 | 0.004 | 0.000 | -0.004 | -0.003 | -0.002 | 0.005 |
TUBE100 | -0.145 | 0.040 | 0.036 | 0.049 | 0.044 | -0.001 | 0.001 | 0.002 | -0.003 | 0.002 | -0.005 | 0.003 |
Naive-Logistic500 | 3.024 | -1.358 | -1.348 | -1.357 | -1.344 | -0.103 | - | - | - | - | - | - |
Hong et al500 | -0.011 | 0.004 | 0.000 | 0.007 | 0.004 | 0.002 | -0.001 | 0.001 | 0.001 | 0.001 | -0.001 | 0.000 |
TUBE500 | -0.089 | 0.029 | 0.025 | 0.029 | 0.031 | 0.000 | -0.004 | 0.002 | 0.002 | 0.007 | -0.003 | -0.003 |
Naive-Logistic1000 | 3.019 | -1.360 | -1.346 | -1.347 | -1.349 | -0.104 | - | - | - | - | - | - |
Hong et al1000 | -0.010 | 0.000 | -0.002 | 0.005 | 0.000 | 0.002 | -0.003 | 0.001 | 0.002 | 0.003 | -0.002 | -0.001 |
TUBE1000 | -0.073 | 0.020 | 0.020 | 0.026 | 0.022 | 0.000 | -0.006 | 0.003 | 0.003 | 0.008 | -0.005 | -0.004 |
(b) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | = 1.300 | = 0.700 | =-0.700 | =-0.700 | =-0.700 | AUC=0.702 | =0.320 | =0.490 | =0.190 | =0.700 | =0.280 | =0.030 |
Naive-Logistic100 | -1.879 | -0.514 | 0.472 | 0.497 | 0.496 | -0.080 | - | - | - | - | - | - |
Hong et al100 | -1.255 | 3.467 | -0.642 | -0.646 | -0.630 | -0.027 | 0.016 | -0.011 | -0.005 | -0.056 | 0.031 | 0.025 |
TUBE100 | 0.020 | 0.016 | -0.024 | -0.009 | -0.011 | -0.003 | -0.001 | 0.003 | -0.002 | 0.003 | -0.004 | 0.001 |
Naive-Logistic500 | -1.853 | -0.507 | 0.495 | 0.490 | 0.500 | -0.097 | - | - | - | - | - | - |
Hong et al500 | -1.272 | 3.513 | -0.648 | -0.654 | -0.644 | -0.028 | 0.010 | -0.006 | -0.004 | -0.059 | 0.033 | 0.026 |
TUBE500 | 0.011 | 0.013 | -0.017 | -0.007 | -0.004 | -0.001 | -0.002 | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 |
Naive-Logistic1000 | -1.850 | -0.509 | 0.495 | 0.493 | 0.500 | -0.097 | - | - | - | - | - | - |
Hong et al1000 | -1.281 | 3.524 | -0.650 | -0.652 | -0.643 | -0.028 | 0.008 | -0.002 | -0.005 | -0.060 | 0.033 | 0.027 |
TUBE1000 | 0.004 | 0.008 | -0.012 | -0.003 | -0.002 | -0.001 | -0.007 | 0.004 | 0.003 | 0.001 | -0.001 | 0.000 |
(c) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | = 1.300 | =-0.300 | =-0.700 | =-0.700 | =-0.800 | AUC=0.702 | =0.320 | =0.490 | =0.190 | =0.700 | =0.280 | =0.030 |
Naive-Logistic100 | -1.925 | 0.239 | 0.570 | 0.550 | 0.567 | -0.083 | - | - | - | - | - | - |
Hong et al100 | -0.228 | -1.200 | -0.380 | -0.408 | -0.393 | -0.025 | 0.037 | -0.038 | 0.000 | -0.039 | 0.026 | 0.012 |
TUBE100 | 0.090 | 0.046 | -0.021 | -0.032 | -0.029 | -0.009 | 0.017 | -0.007 | -0.010 | -0.007 | 0.004 | 0.002 |
Naive-Logistic500 | -1.887 | 0.227 | 0.564 | 0.562 | 0.563 | -0.100 | - | - | - | - | - | - |
Hong et al500 | -0.366 | -1.337 | -0.386 | -0.391 | -0.399 | -0.024 | 0.012 | -0.010 | -0.002 | -0.031 | 0.017 | 0.014 |
TUBE500 | 0.065 | 0.044 | -0.013 | -0.023 | -0.019 | -0.003 | -0.008 | 0.003 | 0.005 | 0.005 | -0.003 | -0.001 |
Naive-Logistic1000 | -1.887 | 0.226 | 0.557 | 0.568 | 0.571 | -0.102 | - | - | - | - | - | - |
Hong et al1000 | -0.340 | -1.476 | -0.452 | -0.446 | -0.456 | -0.025 | 0.012 | -0.009 | -0.003 | -0.035 | 0.020 | 0.015 |
TUBE1000 | 0.060 | 0.037 | -0.017 | -0.019 | -0.014 | -0.003 | -0.003 | -0.001 | 0.004 | 0.002 | -0.001 | -0.001 |
(a) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | =-4.600 | = 1.600 | = 1.600 | = 1.600 | = 1.600 | AUC=0.702 | =0.320 | =0.490 | =0.190 | =0.700 | =0.280 | =0.030 |
Naive-Logistic100 | 9.092 | 1.863 | 1.885 | 1.870 | 1.826 | 0.009 | - | - | - | - | - | - |
Hong et al100 | 0.678 | 0.070 | 0.075 | 0.085 | 0.072 | 0.000 | 0.005 | 0.005 | 0.003 | 0.012 | 0.011 | 0.002 |
TUBE100 | 0.780 | 0.078 | 0.090 | 0.098 | 0.087 | 0.001 | 0.005 | 0.005 | 0.003 | 0.013 | 0.011 | 0.002 |
Naive-Logistic500 | 9.194 | 1.849 | 1.830 | 1.851 | 1.815 | 0.011 | - | - | - | - | - | - |
Hong et al500 | 0.620 | 0.064 | 0.070 | 0.077 | 0.065 | 0.000 | 0.001 | 0.001 | 0.001 | 0.003 | 0.002 | 0.000 |
TUBE500 | 0.670 | 0.068 | 0.077 | 0.082 | 0.073 | 0.000 | 0.001 | 0.001 | 0.001 | 0.003 | 0.002 | 0.001 |
Naive-Logistic1000 | 9.137 | 1.852 | 1.816 | 1.819 | 1.825 | 0.011 | - | - | - | - | - | - |
Hong et al1000 | 0.604 | 0.060 | 0.065 | 0.076 | 0.066 | 0.000 | 0.001 | 0.001 | 0.000 | 0.001 | 0.001 | 0.000 |
TUBE1000 | 0.660 | 0.064 | 0.072 | 0.079 | 0.076 | 0.000 | 0.001 | 0.001 | 0.000 | 0.002 | 0.001 | 0.000 |
(b) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | = 1.300 | = 0.700 | =-0.700 | =-0.700 | =-0.700 | AUC=0.702 | =0.320 | =0.490 | =0.190 | =0.700 | =0.280 | =0.030 |
Naive-Logistic100 | 3.896 | 0.311 | 0.302 | 0.320 | 0.321 | 0.008 | - | - | - | - | - | - |
Hong et al100 | 2.145 | 13.794 | 0.564 | 0.615 | 0.557 | 0.001 | 0.023 | 0.024 | 0.014 | 0.006 | 0.004 | 0.001 |
TUBE100 | 0.081 | 0.014 | 0.015 | 0.016 | 0.015 | 0.001 | 0.017 | 0.019 | 0.009 | 0.005 | 0.005 | 0.001 |
Naive-Logistic500 | 3.489 | 0.265 | 0.259 | 0.253 | 0.262 | 0.010 | - | - | - | - | - | - |
Hong et al500 | 2.191 | 13.569 | 0.533 | 0.574 | 0.548 | 0.001 | 0.004 | 0.004 | 0.002 | 0.004 | 0.002 | 0.001 |
TUBE500 | 0.075 | 0.013 | 0.015 | 0.014 | 0.014 | 0.000 | 0.004 | 0.004 | 0.002 | 0.001 | 0.001 | 0.000 |
Naive-Logistic1000 | 3.451 | 0.264 | 0.251 | 0.249 | 0.256 | 0.010 | - | - | - | - | - | - |
Hong et al1000 | 2.176 | 13.639 | 0.539 | 0.567 | 0.541 | 0.001 | 0.002 | 0.002 | 0.001 | 0.004 | 0.001 | 0.001 |
TUBE1000 | 0.065 | 0.011 | 0.013 | 0.012 | 0.012 | 0.000 | 0.002 | 0.002 | 0.001 | 0.001 | 0.000 | 0.000 |
(c) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | = 1.300 | =-0.300 | =-0.700 | =-0.700 | =-0.800 | AUC=0.702 | =0.320 | =0.490 | =0.190 | =0.700 | =0.280 | =0.030 |
Naive-Logistic100 | 4.043 | 0.105 | 0.398 | 0.378 | 0.390 | | - | - | - | - | - | - |
Hong et al100 | 2.667 | 4.855 | 0.360 | 0.536 | 0.440 | 0.001 | 0.050 | 0.052 | 0.022 | 0.007 | 0.005 | 0.001 |
TUBE100 | 0.135 | 0.016 | 0.023 | 0.023 | 0.025 | 0.003 | 0.028 | 0.029 | 0.011 | 0.005 | 0.005 | 0.001 |
Naive-Logistic500 | 3.629 | 0.060 | 0.332 | 0.330 | 0.331 | 0.010 | - | - | - | - | - | - |
Hong et al500 | 2.297 | 4.564 | 0.344 | 0.335 | 0.359 | 0.001 | 0.012 | 0.010 | 0.004 | 0.003 | 0.001 | 0.001 |
TUBE500 | 0.130 | 0.013 | 0.022 | 0.021 | 0.023 | 0.000 | 0.006 | 0.006 | 0.003 | 0.001 | 0.001 | 0.000 |
Naive-Logistic1000 | 3.589 | 0.055 | 0.317 | 0.328 | 0.333 | 0.011 | - | - | - | - | - | - |
Hong et al1000 | 4.793 | 7.153 | 0.912 | 1.001 | 1.280 | 0.001 | 0.008 | 0.006 | 0.003 | 0.003 | 0.001 | 0.001 |
TUBE1000 | 0.113 | 0.012 | 0.019 | 0.019 | 0.022 | 0.000 | 0.003 | 0.003 | 0.002 | 0.001 | 0.001 | 0.000 |
(a) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | =-4.600 | = 1.600 | = 1.600 | = 1.600 | = 1.600 | AUC=0.702 | =0.320 | =0.490 | =0.190 | =0.700 | =0.280 | =0.030 |
Naive-Logistic100 | 0.002 | 0.000 | 0.000 | 0.000 | 0.002 | 0.398 | - | - | - | - | - | - |
Hong et al100 | 0.946 | 0.954 | 0.958 | 0.948 | 0.950 | 0.946 | 0.942 | 0.954 | 0.952 | 0.956 | 0.960 | 0.934 |
TUBE100 | 0.940 | 0.944 | 0.944 | 0.940 | 0.938 | 0.998 | 0.952 | 0.958 | 0.950 | 0.954 | 0.958 | 0.936 |
Naive-Logistic500 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | - | - | - | - | - | - |
Hong et al500 | 0.946 | 0.956 | 0.952 | 0.950 | 0.952 | 0.948 | 0.940 | 0.946 | 0.952 | 0.954 | 0.946 | 0.960 |
TUBE500 | 0.942 | 0.946 | 0.946 | 0.948 | 0.954 | 0.954 | 0.948 | 0.948 | 0.944 | 0.948 | 0.946 | 0.972 |
Naive-Logistic1000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | - | - | - | - | - | - |
Hong et al1000 | 0.948 | 0.950 | 0.946 | 0.956 | 0.950 | 0.954 | 0.938 | 0.950 | 0.948 | 0.944 | 0.952 | 0.968 |
TUBE1000 | 0.954 | 0.952 | 0.938 | 0.954 | 0.940 | 0.954 | 0.938 | 0.950 | 0.938 | 0.942 | 0.952 | 0.978 |
(b) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | = 1.300 | = 0.700 | =-0.700 | =-0.700 | =-0.700 | AUC=0.702 | =0.320 | =0.490 | =0.190 | =0.700 | =0.280 | =0.030 |
Naive-Logistic100 | 0.097 | 0.333 | 0.628 | 0.554 | 0.547 | 0.574 | - | - | - | - | - | - |
Hong et al100 | 0.634 | 0.261 | 0.675 | 0.752 | 0.697 | 0.602 | 0.952 | 0.956 | 0.956 | 0.828 | 0.903 | 0.871 |
TUBE100 | 0.947 | 0.956 | 0.929 | 0.945 | 0.958 | 0.992 | 0.952 | 0.954 | 0.941 | 0.949 | 0.952 | 0.947 |
Naive-Logistic500 | 0.000 | 0.000 | 0.008 | 0.004 | 0.002 | 0.000 | - | - | - | - | - | - |
Hong et al500 | 0.640 | 0.095 | 0.554 | 0.628 | 0.628 | 0.554 | 0.954 | 0.966 | 0.956 | 0.341 | 0.729 | 0.408 |
TUBE500 | 0.958 | 0.952 | 0.952 | 0.958 | 0.964 | 0.941 | 0.954 | 0.956 | 0.943 | 0.943 | 0.947 | 0.927 |
Naive-Logistic1000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | - | - | - | - | - | - |
Hong et al1000 | 0.604 | 0.083 | 0.566 | 0.636 | 0.618 | 0.543 | 0.943 | 0.941 | 0.947 | 0.083 | 0.475 | 0.121 |
TUBE1000 | 0.954 | 0.960 | 0.949 | 0.956 | 0.958 | 0.939 | 0.954 | 0.943 | 0.943 | 0.947 | 0.945 | 0.947 |
(c) | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Method | = 1.300 | =-0.300 | =-0.700 | =-0.700 | =-0.800 | AUC=0.702 | =0.320 | =0.490 | =0.190 | =0.700 | =0.280 | =0.030 |
Naive-Logistic100 | 0.079 | 0.797 | 0.436 | 0.482 | 0.428 | 0.522 | - | - | - | - | - | - |
Hong et al100 | 0.956 | 0.937 | 0.896 | 0.954 | 0.927 | 0.858 | 0.956 | 0.927 | 0.958 | 0.925 | 0.929 | 0.937 |
TUBE100 | 0.939 | 0.935 | 0.960 | 0.948 | 0.935 | 0.985 | 0.971 | 0.952 | 0.969 | 0.956 | 0.954 | 0.952 |
Naive-Logistic500 | 0.000 | 0.290 | 0.002 | 0.006 | 0.006 | 0.002 | - | - | - | - | - | - |
Hong et al500 | 0.933 | 0.881 | 0.862 | 0.868 | 0.864 | 0.839 | 0.933 | 0.952 | 0.952 | 0.931 | 0.944 | 0.937 |
TUBE500 | 0.950 | 0.929 | 0.950 | 0.939 | 0.942 | 0.942 | 0.946 | 0.952 | 0.950 | 0.946 | 0.944 | 0.927 |
Naive-Logistic1000 | 0.000 | 0.077 | 0.000 | 0.000 | 0.000 | 0.000 | - | - | - | - | - | - |
Hong et al1000 | 0.985 | 0.946 | 0.979 | 0.987 | 0.990 | 0.843 | 0.939 | 0.946 | 0.948 | 0.879 | 0.912 | 0.894 |
TUBE1000 | 0.942 | 0.937 | 0.946 | 0.946 | 0.948 | 0.946 | 0.952 | 0.958 | 0.933 | 0.958 | 0.954 | 0.939 |