A Semiparametric Approach for Robust and Efficient Learning with Biobank Data^†

^†The first two authors made equal contributions to this paper.

Molei Liu^∗ Department of Biostatistics, Columbia Mailman School of Public Health. Xinyi Wang^∗ Department of Statistics, University of Chicago. Chuan Hong Department of Biostatistics and Bioinformatics, Duke University.

Abstract

With the increasing availability of electronic health records (EHR) linked with biobank data for translational research, a critical step in realizing their potential is to accurately classify phenotypes for patients. Existing approaches to achieve this goal are based on error-prone EHR surrogate outcomes, assisted and validated by a small set of labels obtained via medical chart review, which may also be subject to misclassification. Ignoring the noise in these outcomes can induce severe estimation and validation bias in both EHR phenotyping and risk modeling with biomarkers collected in the biobank. To overcome this challenge, we propose a novel unsupervised and semiparametric approach to jointly model multiple noisy EHR outcomes with their linked biobank features. Our approach primarily aims at disease risk modeling with the baseline biomarkers, and is also able to produce a predictive EHR phenotyping model and validate its performance without observations of the true disease outcome. It consists of composite and nonparametric regression steps free of any parametric model specification, followed by a parametric projection step to reduce the uncertainty and improve the estimation efficiency. We show that our method is robust to violations of the parametric assumptions while attaining the desirable root-$n$ convergence rates on risk modeling. Our developed method outperforms existing methods in extensive simulation studies, as well as a real-world application in phenotyping and genetic risk modeling of type II diabetes.

Keywords: EHR linked biobank data; Surrogates; Measurement errors; Biomarker; Model misspecification; Under-smoothing.

1 Introduction

1.1 Background

With the increasing adoption of electronic health record (EHR) systems in the United States, EHR data are increasingly accessible for research. By linking EHR data with biorepositories, powerful phenome-genome studies can be performed with such large-scale data for discovery and translational research (Kohane, 2011; Denny et al., 2013; Wells et al., 2019). To fully realize the potential of EHR data, a critical step involves accurately and efficiently classifying phenotype status for individual patients to enable association studies and risk modeling. Although simple rule-based classification algorithms leveraging domain knowledge remain useful, they have varying degrees of accuracy and portability (Zhang et al., 2019b). Conversely, data-driven machine learning based classification algorithms have been advocated as a useful alternative with higher accuracy and portability (Shivade et al., 2014; Liao et al., 2015; Banda et al., 2018). Typically, these algorithms undergo training and/or validation with gold standard labels curated via medical chart review. Subsequently, the predicted phenotypes for all patients in the cohort serve as the observed outcomes for downstream association studies (e.g., Liao et al., 2013, 2019).

Historically, most existing phenotyping algorithms have relied on supervised methods, which suffer from scalability issues due to the labor-intensive nature of manually reviewing charts to obtain gold standard labels for the phenotype of interest. In recent years, several unsupervised methods that leverage unlabeled data, using surrogate features as noisy labels (Yu et al., 2017; Banda et al., 2017; Liao et al., 2019), were proposed as promising alternatives. However, these methods can lead to poor accuracy when the surrogate features have limited accuracy, and they do not provide reliable estimates of the classification performance of the trained models.

1.2 Problem setup

Let $Y$ denote the unobserved true binary phenotype status and ${\bf G}=(G_{1},\ldots,G_{q})^{{\sf\scriptscriptstyle{T}}}$ be its associated baseline characteristics and genetic markers from the EHR linked biobank, which could be either multi-dimensional single nucleotide polymorphisms (SNPs) or a genetic risk score derived by weighting a number of SNPs. We simultaneously consider two types of error-prone outcomes or surrogates for $Y$ in our setup. First, suppose there are $p$-dimensional EHR surrogate features ${\bf X}=(X_{1},\ldots,X_{p})^{{\sf\scriptscriptstyle{T}}}$, such as counts of billing codes related to $Y$ and key laboratory results. Second, let $Y^{*}$ be the chart review label from experts, taking values of $k/K$ for $k\in\{0,1,\ldots,K\}$, to represent different levels of certainty regarding whether the patient has the condition $Y$. In practice, $K$ is often taken as $2$ with $Y^{*}\in\{0,0.5,1\}$ representing not a case, a possible case, and a case.

Importantly, we assume the error-prone outcomes $Y^{*}$ and ${\bf X}$ only relate to genetic markers ${\bf G}$ through $Y$, i.e., $(Y^{*},{\bf X})\perp\!\!\!\perp{\bf G}\mid Y$. An illustration of this assumption is provided in the directed acyclic graph (DAG) of Figure 1. In this DAG, the baseline biomarkers ${\bf G}$ occur first and affect the chance of developing the disease $Y$; then $Y$ causes the downstream hospital visits producing features ${\bf X}$ and $\mathbf{U}$ in EHR, where $\mathbf{U}$ may encode unstructured information such as images and narrative clinical notes. Though $\mathbf{U}$ is not directly included as an outcome in our setup, it can affect the medical review result $Y^{*}$ together with the observed and structured ${\bf X}$.

Figure 1: An illustrative directed acyclic graph (DAG) of the data generating mechanism.

Suppose there are $N$ patients with independent and identical copies of the complete set of variables ${\bf D}=(Y,Y^{*},{\bf X},{\bf G})$ described above, denoted as $\mathscr{D}=\{{\bf D}_{i}:i=1,2,...,N\}$ . Since the label $Y^{*}$ is derived based on expertise and additional information like $\mathbf{U}$ , it is usually more accurate than ${\bf X}$ in characterizing the true $Y$ . However, it may still have a moderate measurement error due to incomplete information collection for medical review or complication and ambiguity of certain phenotypes. Thus, we assume that $Y^{*}$ is only observable in a small set of $n$ subjects indexed by $\delta=1$ , and, to account for the error of $Y^{*}$ , it is marginally related to $Y$ through

$$
{\rm Pr}(Y^{*}=k/K\mid Y=y)=\lambda_{yk},\quad\mbox{for}~k=0,\ldots,K,~y=0,1;\quad\bm{\lambda}_{y}=\{\lambda_{y1},\ldots,\lambda_{yK}\}. \tag{1}
$$

Also, note that ${\bf X}$ and ${\bf G}$ are observed for all patients and the true outcome $Y$ is not observed for any patient. So the observed data are $\mathscr{O}=\{{\bf O}_{i}=(Y^{*}_{i}\delta_{i},\delta_{i},{\bf X}_{i},{\bf G}_{i}):i=1,2,\ldots,N\}$, with the labeling indicator $\delta\perp\!\!\!\perp{\bf D}$, i.e., labels are collected completely at random. Without loss of generality, we let $\delta_{i}=I(1\leq i\leq n)$ where $n<N$ is the size of the labeled sample and $I(\cdot)$ is the indicator function. Our primary goal is to derive a risk model of $Y$ against ${\bf G}$ as well as inference on its encoded genetic associations. Since the genetic effects are usually moderate or small, it is more favorable to model and interpret $Y\sim{\bf G}$ with a simple parametric form to ensure good interpretability and control the estimation uncertainty. Specifically, we consider a working logistic model:

$$
{\rm Pr}(Y=1\mid{\bf G})=g(\bm{\beta}^{{\sf\scriptscriptstyle{T}}}{\bf G}), \tag{2}
$$

where the expit link is $g(x)=e^{x}/(1+e^{x})$. Note that model (2) is allowed to be misspecified, and we define the target model parameter as $\bar{\bm{\beta}}=\mathop{\mbox{argmax}}_{\bm{\beta}}\mathbb{E}\ell(Y,\bm{\beta}^{{\sf\scriptscriptstyle{T}}}{\bf G})$ where $\ell(y,w)=y\log\{g(w)\}+(1-y)\log\{1-g(w)\}$ is the log-likelihood function of logistic regression. Though (2) may be misspecified, such $\bar{\bm{\beta}}$ is still identifiable and effective in characterizing the genetic associations. Our secondary goal is EHR phenotyping for the unobserved $Y$ using ${\bf X}$ by deriving a risk score $\alpha({\bf X})$, as well as validating its classification performance. Due to the absence of $Y$ in our observation, all of the tasks introduced above are unsupervised and, thus, more challenging than the standard supervised or recent semi-supervised scenarios reviewed in Section 1.3.
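To make the data structure of this setup concrete, the following is a minimal simulation sketch. It is only an illustration under stated assumptions: the function name `simulate_observed_data`, the coefficient values, the misclassification matrix `lam`, the inclusion of an intercept column in ${\bf G}$, and the Poisson and Gaussian forms chosen for ${\bf X}\mid Y$ are all hypothetical and are not taken from the paper. The sketch respects the two structural features described above: $(Y^{*},{\bf X})$ depend on ${\bf G}$ only through $Y$, and $Y^{*}$ is revealed only for the first $n$ subjects.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_observed_data(N=5000, n=300, seed=0):
    """Draw (Y, Y*, X, G) and return only what the analyst would observe."""
    rng = np.random.default_rng(seed)
    # G: intercept plus two SNPs coded 0/1/2 (hypothetical allele frequency 0.3).
    snps = rng.binomial(2, 0.3, size=(N, 2))
    G = np.column_stack([np.ones(N), snps])
    beta = np.array([-1.0, 0.5, -0.3])            # hypothetical coefficients
    Y = rng.binomial(1, expit(G @ beta))          # latent truth, never returned
    # lam[y, k] = Pr(Y* = k/K | Y = y) as in (1); the values are hypothetical.
    lam = np.array([[0.80, 0.15, 0.05],
                    [0.05, 0.15, 0.80]])
    K = lam.shape[1] - 1
    cum = np.cumsum(lam[Y], axis=1)[:, :K]        # first K cumulative thresholds
    k = (rng.random(N)[:, None] > cum).sum(axis=1)
    Y_star = k / K
    # X depends on G only through Y: a billing-code count and a lab value.
    X = np.column_stack([rng.poisson(1.0 + 3.0 * Y),
                         rng.normal(loc=Y, scale=1.0)])
    delta = (np.arange(N) < n).astype(int)        # delta_i = I(1 <= i <= n)
    Y_star_obs = np.where(delta == 1, Y_star, np.nan)
    return {"Y_star": Y_star_obs, "delta": delta, "X": X, "G": G}
```

The returned dictionary deliberately omits $Y$, mirroring the fact that the true outcome is not observed for any patient.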

Remark 1

Though both $Y^{*}$ and ${\bf X}$ are surrogates of the truth $Y$ with errors, we still notate and consider them separately for several reasons. First, $Y^{*}$ is not accessible for a (large) fraction of subjects, so the phenotyping score of $Y$ can only include the fully observed ${\bf X}$ as predictors and is formulated as $\alpha({\bf X})$. Second, although $Y^{*}$ is neither perfect nor scalable, it is supposed to be more accurate and informative than ${\bf X}$. Thus, as will be discussed in Sections 2 and 3, $Y^{*}$ is important under our framework for stable training and efficient estimation, especially when ${\bf X}$ is of poor quality in characterizing $Y$.

1.3 Related literature and our contribution

Surrogate outcomes play an important role in data-driven biomedical research, particularly when obtaining the primary or true outcome of interest is costly or even impossible, e.g., demanding extensive human labor or long periods of follow-up. There is rich literature in both semi-supervised and unsupervised statistical learning with surrogates. For example, Athey et al. (2019) leveraged surrogates collected in observational studies to assist learning with experimental studies in paucity of the gold standard labels. Kallus and Mao (2020) and Hou et al. (2021) studied how to utilize surrogates to improve the efficiency of causal inference without incurring bias. Hou et al. (2023a) developed a semiparametric transformation approach to incorporate time-to-event surrogates and improve the learning efficiency with the true outcomes.

The aforementioned literature considered a semi-supervised setting with a small sample of the true outcome $Y$. In contrast, our problem setup does not involve any observation of $Y$. For such an unsupervised setting, Huang et al. (2018) and Hong et al. (2019) proposed maximum likelihood approaches based on parametric assumptions on the conditional model of $Y$, which enable the identification and estimation of the model coefficients. Zhang et al. (2019a) developed a method for unsupervised learning and phenotype validation with anchor-positive surrogate outcomes in EHR. All these recent methods largely rely on parametric model assumptions like (2), which is only a working assumption in our setup. Misspecification of such a model could lead to biased estimation of the target parameter $\bar{\bm{\beta}}=\mathop{\mbox{argmax}}_{\bm{\beta}}\mathbb{E}\ell(Y,\bm{\beta}^{{\sf\scriptscriptstyle{T}}}{\bf G})$ due to the absence of the true label $Y$.

Meanwhile, we notice some fully nonparametric approaches for the so-called latent-structure or mixture model related to our problem setup in recent literature, including Bonhomme et al. (2016), Yu et al. (2019), and Zheng and Wu (2019). For example, Zheng and Wu (2019) proposed a novel tensor approach for learning nonparametric mixtures, with a key idea of introducing basis approximation to the component density functions. This track of work is in general free from the model misspecification issue discussed above, but cannot provide desirable $n^{-1/2}$-consistent estimators and may encounter the “curse of dimensionality” for multivariate surrogate outcomes.

To address the dilemma described above, between the bias caused by model misspecification and the low efficiency due to the curse of dimensionality, we develop a Three-stage Unsupervised learning approach for Biomarkers linked with Error-prone outcomes, abbreviated as TUBE. Our approach primarily aims at risk modeling with the baseline biomarkers, and is also able to produce and validate a predictive EHR phenotyping score without observations of the true disease outcome. It is a semiparametric method that starts from a composite and nonparametric regression step for ${\bf X},Y^{*}$ against ${\bf G}$ that is free of any parametric assumptions. Following this step, TUBE combines multiple surrogates for EHR phenotyping and validation, and then implements a parametric projection step to improve the interpretability and estimation efficiency of the genetic risk model. We will show that our estimator for $\bm{\beta}$ is $n^{-1/2}$-consistent and asymptotically normal without requiring model (2) to be correctly specified or $Y\sim{\bf X}$ to have a parametric form, both of which are imposed by existing methods like Hong et al. (2019) and Zhang et al. (2019a). Also, TUBE demonstrates significantly better performance than existing methods in our simulation and real-world studies.

2 Three-stage unsupervised learning method

2.1 Overview of the modeling strategy

Our proposed TUBE method consists of three main stages. In Stage I, we adopt an under-smoothed nonparametric and composite likelihood strategy that is free of any parametric or model structural assumptions on the forms of $Y\sim Y^{*}$, $Y\sim{\bf X}$ and $Y\sim{\bf G}$. This avoids the potential bias caused by model misspecification when linking the error-prone outcomes $(Y^{*},{\bf X})$ with ${\bf G}$ without the supervision of the true label $Y$. In Stage II, we leverage the results from Stage I to condense the EHR features ${\bf X}$ into a risk score $\widehat{\alpha}({\bf X})$ for more accurate phenotyping of $Y$, and refit the data using nonparametric likelihoods to evaluate its ROC. In Stage III, we rely on the imputed outcomes from Stage II to derive a parametric logistic model for $Y\sim{\bf G}$. Compared to the previous stages, Stage III outputs a more efficient characterization of the genetic risk or association with good interpretability and desirable convergence rates. Meanwhile, because it is built upon previous stages that are robust to model misspecification, Stage III remains valid even when the target genetic model is misspecified.

Denote by $\mu={\rm Pr}(Y=1)$ and $m_{j}(x)={\rm Pr}(Y=1\mid X_{j}=x)$ for $j=1,2,\ldots,p$ . To get rid of the curse of dimensionality in modeling $Y$ jointly against $X_{1},X_{2},\ldots,X_{p}$ through a multivariate nonparametric model, we consider a working conditional independence assumption across $X_{1},X_{2},\ldots,X_{p}$ given $Y$ , implying an additive logistic form of their joint model:

$$
{\rm Pr}(Y=1\mid{\bf X})=g\{a+\bar{\alpha}({\bf X})\}\quad\mbox{with}\quad\bar{\alpha}({\bf X})=\sum_{j=1}^{p}g^{-1}\{m_{j}(X_{j})\}, \tag{3}
$$

where $a$ is an intercept term introduced such that $\mathbb{E}g\{a+\bar{\alpha}({\bf X})\}=\mu$. As will be introduced in Section 2.2, under this construction, we can model each $X_{j}$ with ${\bf G}$ separately and combine them with a composite likelihood to estimate the $m_{j}(\cdot)$'s, as if $X_{1}\perp X_{2}\perp\ldots\perp X_{p}\mid Y$. Then we will combine the estimators of $m_{j}(X_{j})$ through (3) to derive an estimate for the phenotyping score $\bar{\alpha}({\bf X})$. As we will discuss later, due to our use of the composite likelihood, violation of the additive model (3) will not invalidate the downstream results.
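The additive logistic form in (3) can be checked numerically. The sketch below is an illustration rather than part of the method: it builds a small, hypothetical model with three binary surrogates that are exactly conditionally independent given $Y$ (the prevalence `MU` and the table `PX` are made-up values), and compares the joint posterior obtained from Bayes' rule with $g\{a+\sum_{j}g^{-1}(m_{j}(X_{j}))\}$. Under exact conditional independence, a short Bayes-rule calculation gives the intercept $a=(1-p)\,g^{-1}(\mu)$, which is what the code uses.

```python
import numpy as np
from itertools import product

def logit(p):
    return np.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

MU = 0.3                                  # hypothetical prevalence Pr(Y = 1)
# PX[j, y] = Pr(X_j = 1 | Y = y) for p = 3 binary surrogates (hypothetical).
PX = np.array([[0.2, 0.7], [0.4, 0.6], [0.1, 0.5]])

def lik(j, xj, y):
    """Pr(X_j = xj | Y = y)."""
    return PX[j, y] if xj == 1 else 1.0 - PX[j, y]

def posterior_joint(x):
    """Pr(Y = 1 | X = x) by Bayes' rule under conditional independence."""
    l1 = np.prod([lik(j, xj, 1) for j, xj in enumerate(x)])
    l0 = np.prod([lik(j, xj, 0) for j, xj in enumerate(x)])
    return MU * l1 / (MU * l1 + (1.0 - MU) * l0)

def m_j(j, xj):
    """Marginal posterior m_j(x) = Pr(Y = 1 | X_j = x)."""
    l1, l0 = lik(j, xj, 1), lik(j, xj, 0)
    return MU * l1 / (MU * l1 + (1.0 - MU) * l0)

def posterior_additive(x):
    """g{a + sum_j logit m_j(x_j)} with a = (1 - p) * logit(mu)."""
    a = (1 - len(x)) * logit(MU)
    return expit(a + sum(logit(m_j(j, xj)) for j, xj in enumerate(x)))

def max_discrepancy():
    """Largest gap between the two posteriors over all 2^p configurations."""
    return max(abs(posterior_joint(x) - posterior_additive(x))
               for x in product([0, 1], repeat=PX.shape[0]))
```

Calling `max_discrepancy()` returns a value at the level of floating-point rounding, confirming that the two expressions coincide in this conditionally independent example. When the surrogates are dependent given $Y$, the two would generally differ, which is why (3) is treated only as a working model.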

For the genetic variants ${\bf G}$, we will consider two scenarios: (i) ${\bf G}$ contains multi-dimensional discrete SNP features ranging over $\{0,1,2\}$; and (ii) ${\bf G}$ is a univariate continuous genetic risk score. For (i), we introduce categorical basis functions covering all the possible combinations of the discrete SNPs in ${\bf G}$, while for (ii), we use spline (sieve) basis functions of ${\bf G}$. In both cases, we specify the nonparametric model of $Y\sim{\bf G}$ as

$$
{\rm Pr}(Y=1\mid{\bf G})=g\{\bm{\xi}^{{\sf\scriptscriptstyle{T}}}\bm{\psi}({\bf G})\}, \tag{4}
$$

where $\bm{\psi}({\bf G})=\{\psi_{1}({\bf G}),\psi_{2}({\bf G}),\ldots,\psi_{d_{g}}({\bf G})\}^{{\sf\scriptscriptstyle{T}}}$ is a set of bases with possibly diverging dimensionality, used to approximate any (smooth) function of ${\bf G}$. Note that model (4) is a nuisance model introduced to avoid model misspecification in the first stage of our method. Our final goal is to estimate the parametric model (2), which has a more desirable convergence rate as well as easier interpretation than (4). This is especially advantageous when the genetic association is mild or small and thus requires small enough estimation uncertainty to detect.

2.2 Stage I: sieve-approximated composite likelihood

We first focus on the estimation of the $m_{j}(\cdot)$'s and ${\rm Pr}(Y=1\mid{\bf G})$. To ensure validity while incorporating the additional genetic information, we consider a composite log-likelihood formulated under our key assumption that $(Y^{*},{\bf X})\perp\!\!\!\perp{\bf G}\mid Y$ and a working independence condition of $X_{1},\ldots,X_{p}$ given $Y$:

$$
\sum_{i=1}^{n}\log\left\{\sum_{y=0}^{1}{\rm Pr}(Y^{*}_{i}\mid Y_{i}=y){\rm Pr}(Y_{i}=y\mid{\bf G}_{i})\right\}+\sum_{i=1}^{N}\sum_{j=1}^{p}\log\left\{\sum_{y=0}^{1}\frac{{\rm Pr}(Y_{i}=y\mid X_{ij}){\rm Pr}(Y_{i}=y\mid{\bf G}_{i})}{{\rm Pr}(Y_{i}=y)}\right\},
$$

where $X_{ij}$ is the $j$-th EHR outcome of subject $i$. As outlined in Section 2.1, due to potential misspecification of parametric models like (2), we model ${\rm Pr}(Y=y\mid{\bf G})$ nonparametrically by (4), and adopt a similar sieve construction on each

$$
m_{j}(X_{j})={\rm Pr}(Y=1\mid X_{j})=g\{\bm{\zeta}_{j}^{{\sf\scriptscriptstyle{T}}}\bm{\varphi}_{j}(X_{j})\},
$$

where $\bm{\varphi}_{j}(x)$ is a vector of basis functions used to approximate $g^{-1}\{m_{j}(x)\}$. For discrete $X_{j}$, we naturally set $\bm{\varphi}_{j}(x)$ as its dummy variables. For continuous $X_{j}$, we again use a sieve (e.g., spline) basis. Then we can construct the sieve-approximated composite likelihood as:

$$
{\cal C}(\bm{\theta})=\sum_{i=1}^{n}\log\left(\sum_{y=0}^{1}\lambda_{yY_{i}^{*}}g_{y}\{\bm{\xi}^{{\sf\scriptscriptstyle{T}}}\bm{\psi}({\bf G}_{i})\}\right)+\sum_{i=1}^{N}\sum_{j=1}^{p}\log\left(\sum_{y=0}^{1}\mu_{y}^{-1}g_{y}\{\bm{\zeta}_{j}^{{\sf\scriptscriptstyle{T}}}\bm{\varphi}_{j}(X_{ij})\}g_{y}\{\bm{\xi}^{{\sf\scriptscriptstyle{T}}}\bm{\psi}({\bf G}_{i})\}\right),
$$

where $\bm{\theta}=\{\bm{\xi},\bm{\zeta},\bm{\lambda},\mu\}$, $\bm{\lambda}=(\bm{\lambda}_{0}^{{\sf\scriptscriptstyle{T}}},\bm{\lambda}_{1}^{{\sf\scriptscriptstyle{T}}})^{{\sf\scriptscriptstyle{T}}}$, $\bm{\zeta}=(\bm{\zeta}_{1}^{{\sf\scriptscriptstyle{T}}},\ldots,\bm{\zeta}_{p}^{{\sf\scriptscriptstyle{T}}})^{{\sf\scriptscriptstyle{T}}}$, and we denote by $g_{y}(\cdot)=yg(\cdot)+(1-y)\{1-g(\cdot)\}$ and $\mu_{y}={\rm Pr}(Y=y)=y\mu+(1-y)(1-\mu)$. To solve for the $\bm{\theta}$ that maximizes ${\cal C}(\bm{\theta})$, we propose to use an expectation-maximization (EM) algorithm outlined in Algorithm 1.
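As a concrete illustration of how ${\cal C}(\bm{\theta})$ is assembled, the following sketch evaluates it in the fully discrete case, where both $\bm{\psi}$ and each $\bm{\varphi}_{j}$ are dummy-variable bases. The parametrization is an assumption made for brevity: instead of $\bm{\xi}$ and $\bm{\zeta}_{j}$, the code stores the implied cell probabilities `pG[c]` $=g(\xi_{c})$ and `m[j, l]` $=g(\zeta_{jl})$ directly, and the names `composite_loglik`, `G_cell`, and `ks` (the integer $k$ such that $Y^{*}_{i}=k/K$) are hypothetical.

```python
import numpy as np

def composite_loglik(pG, m, lam, mu, G_cell, X, ks):
    """Composite log-likelihood C(theta) for discrete G-cells and discrete X.

    pG     : (C,)     Pr(Y=1 | G in cell c)
    m      : (p, L)   m[j, l] = Pr(Y=1 | X_j = l)
    lam    : (2, K+1) lam[y, k] = Pr(Y* = k/K | Y = y)
    mu     : float    Pr(Y=1)
    G_cell : (N,) int cell index of each subject's G
    X      : (N, p) int surrogate levels
    ks     : (n,) int chart-review level k for the first n (labeled) subjects
    """
    N, p = X.shape
    n = ks.size
    pg = pG[G_cell]                                   # g{xi' psi(G_i)}
    # Labeled part: log sum_y lam[y, Y*_i] * g_y{xi' psi(G_i)}.
    term1 = np.log(lam[1, ks] * pg[:n] + lam[0, ks] * (1.0 - pg[:n])).sum()
    # Surrogate part: log sum_y g_y{m_j(X_ij)} * g_y{xi' psi(G_i)} / mu_y.
    mx = m[np.arange(p), X]                           # (N, p) array of m_j(X_ij)
    mix = mx * pg[:, None] / mu + (1.0 - mx) * (1.0 - pg[:, None]) / (1.0 - mu)
    return float(term1 + np.log(mix).sum())
```

With continuous ${\bf G}$ or $X_{j}$, `pG[G_cell]` and `m[np.arange(p), X]` would be replaced by the expit of the corresponding spline expansions; the structure of the two sums is unchanged.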

Algorithm 1 EM algorithm for the nonparametric composite log-likelihood.

Input: Observed data $\mathscr{O}=\{{\bf O}_{i}=(Y^{*}_{i}\delta_{i},\delta_{i},{\bf X}_{i},{\bf G}_{i}):i=1,2,\ldots,N\}$.
Initialize with ${\widehat{\bm{\theta}}}^{(0)}=\{\widehat{\bm{\xi}}^{(0)},\widehat{\bm{\zeta}}^{(0)},\widehat{\bm{\lambda}}^{(0)},\widehat{\mu}^{(0)}\}$ obtained by Algorithm A2. Iterate on the following two steps for $r=0,1,\ldots,R$ until convergence.
E-step. For each subject $i$ and outcome $j$ (or $Y^{*}$ if observed: $\delta_{i}=1$ ), impute the probability for the unobserved $Y_{i}$ conditional on the covariates in each component of the composite likelihood:

$$
\widehat{Y}_{i0}^{(r+1)}=\delta_{i}\times\frac{\widehat{\lambda}_{1Y_{i}^{*}}^{(r)}g_{1}\{\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\widehat{\bm{\xi}}^{(r)}\}}{\sum_{y=0}^{1}\widehat{\lambda}_{yY_{i}^{*}}^{(r)}g_{y}\{\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\widehat{\bm{\xi}}^{(r)}\}};\quad\widehat{Y}_{ij}^{(r+1)}=\frac{g_{1}\{\bm{\varphi}^{{\sf\scriptscriptstyle{T}}}_{j}(X_{ij})\widehat{\bm{\zeta}}^{(r)}_{j}\}g_{1}\{\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\widehat{\bm{\xi}}^{(r)}\}/\widehat{\mu}_{1}^{(r)}}{\sum_{y=0}^{1}g_{y}\{\bm{\varphi}^{{\sf\scriptscriptstyle{T}}}_{j}(X_{ij})\widehat{\bm{\zeta}}^{(r)}_{j}\}g_{y}\{\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\widehat{\bm{\xi}}^{(r)}\}/\widehat{\mu}_{y}^{(r)}}.
$$

M-step. Update $\bm{\theta}$ through the maximum likelihood estimation (MLE) specified with the imputed outcomes from the E-step:

$$
\widehat{\mu}^{(r+1)}=\frac{1}{Np+n}\sum_{i=1}^{N}\sum_{j=0}^{p}\widehat{Y}_{ij}^{(r+1)};\quad\widehat{\lambda}_{yk}^{(r+1)}=\frac{\sum_{i=1}^{n}I(Y^{*}_{i}=k/K)\{{\widehat{Y}}_{i0}^{(r+1)}\}^{y}\{1-{\widehat{Y}}_{i0}^{(r+1)}\}^{1-y}}{\sum_{i=1}^{n}\{{\widehat{Y}}_{i0}^{(r+1)}\}^{y}\{1-{\widehat{Y}}_{i0}^{(r+1)}\}^{1-y}};
$$

$$
\widehat{\bm{\xi}}^{(r+1)}=\mathop{\mbox{argmax}}_{\bm{\xi}}\sum_{i=1}^{n}\ell\left({\widehat{Y}}_{i0}^{(r+1)},\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\bm{\xi}\right)+\sum_{i=1}^{N}\sum_{j=1}^{p}\ell\left(\widehat{Y}_{ij}^{(r+1)},\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\bm{\xi}\right);
$$

$$
\widehat{\bm{\zeta}}_{j}^{(r+1)}=\mathop{\mbox{argmax}}_{\bm{\zeta}_{j}}\sum_{i=1}^{N}\ell\left(\widehat{Y}_{ij}^{(r+1)},\bm{\varphi}^{{\sf\scriptscriptstyle{T}}}_{j}(X_{ij})\bm{\zeta}_{j}\right),\quad\mbox{for }j=1,2,\ldots,p.
$$

Output: ${\widehat{\bm{\theta}}}=\{\widehat{\bm{\xi}},\widehat{\bm{\zeta}},\widehat{\bm{\lambda}},\widehat{\mu}\}=\{\widehat{\bm{\xi}}^{(R)},\widehat{\bm{\zeta}}^{(R)},\widehat{\bm{\lambda}}^{(R)},\widehat{\mu}^{(R)}\}$.

Algorithm 1 iterates on two main steps. First, there is an E-step imputing the unobserved true outcome $Y$ separately conditional on each $(X_{j},{\bf G})$ or $(Y^{*},{\bf G})$, i.e., the set of features appearing in each component of the composite likelihood. Unlike EM algorithms for joint likelihood objectives, our method does not involve any imputation model of $Y$ using the whole set of observed variables $({\bf X},{\bf G},Y^{*})$. This in turn ensures validity without any assumptions on the joint distribution of ${\bf X},Y^{*}$, which is hard to characterize due to the curse of dimensionality. Second, Algorithm 1 involves an M-step solving for $\bm{\theta}$ through MLE constructed using the imputed $Y$'s. Again, corresponding to the composite likelihood construction, $\bm{\lambda}$ and the $\bm{\zeta}_{j}$'s for different error-prone outcomes are solved separately based on their own imputed outcomes.
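To illustrate the two steps just described, the sketch below implements one iteration of Algorithm 1 in the special case where ${\bf G}$ and every $X_{j}$ are discrete with dummy-variable bases. This is a simplification chosen for illustration, not the general implementation: with saturated bases, the logistic M-steps for $\bm{\xi}$ and $\bm{\zeta}_{j}$ reduce to cell-wise averages of the imputed outcomes, so the code works with the cell probabilities `pG[c]` $=g(\xi_{c})$ and `m[j, l]` $=g(\zeta_{jl})$ directly. The names `em_step`, `G_cell`, and `ks` are hypothetical, and the code assumes every cell and every level is non-empty.

```python
import numpy as np

def em_step(pG, m, lam, mu, G_cell, X, ks):
    """One E-step and M-step of Algorithm 1 in the all-discrete case.

    pG: (C,) Pr(Y=1|G-cell); m: (p, L) Pr(Y=1|X_j=l); lam: (2, K+1);
    mu: Pr(Y=1); G_cell: (N,) int; X: (N, p) int; ks: (n,) int with
    Y*_i = ks[i]/K for the first n (labeled) subjects.
    """
    N, p = X.shape
    n = ks.size
    pg = pG[G_cell]
    # E-step: impute Y separately from (Y*, G) and from each (X_j, G).
    num0 = lam[1, ks] * pg[:n]
    Y0 = num0 / (num0 + lam[0, ks] * (1.0 - pg[:n]))            # (n,)
    mx = m[np.arange(p), X]                                     # (N, p)
    num = mx * pg[:, None] / mu
    Yj = num / (num + (1.0 - mx) * (1.0 - pg[:, None]) / (1.0 - mu))
    # M-step: prevalence is the average over all Np + n imputed outcomes.
    mu_new = (Y0.sum() + Yj.sum()) / (N * p + n)
    # lam[y, k]: weighted share of level k among imputed class-y subjects.
    K1 = lam.shape[1]
    lam_new = np.vstack([
        np.bincount(ks, weights=1.0 - Y0, minlength=K1) / (1.0 - Y0).sum(),
        np.bincount(ks, weights=Y0, minlength=K1) / Y0.sum(),
    ])
    # Saturated logistic MLE with soft outcomes = within-cell mean.
    C = pG.size
    pG_num = (np.bincount(G_cell[:n], weights=Y0, minlength=C)
              + np.bincount(G_cell, weights=Yj.sum(axis=1), minlength=C))
    pG_den = (np.bincount(G_cell[:n], minlength=C)
              + p * np.bincount(G_cell, minlength=C))
    pG_new = pG_num / pG_den
    L = m.shape[1]
    m_new = np.vstack([
        np.bincount(X[:, j], weights=Yj[:, j], minlength=L)
        / np.bincount(X[:, j], minlength=L)
        for j in range(p)
    ])
    return pG_new, m_new, lam_new, mu_new
```

In use, `em_step` would be called repeatedly from an initial value until the parameters stabilize. For continuous features, the cell-wise means would be replaced by weighted logistic regressions on the spline bases.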

In Theorem 1 presented later, we show that Algorithm 1 maintains an ascent property on the objective composite likelihood function, which is desirable for optimization. Nevertheless, it is still practically crucial to have a good initial estimator ${\widehat{\bm{\theta}}}^{(0)}$ for Algorithm 1 to avoid convergence to poor local optima. In response to this, we propose in Algorithm A2 of the Appendix to derive $\widehat{\bm{\xi}}^{(0)},\widehat{\bm{\zeta}}^{(0)},\widehat{\mu}^{(0)}$ through MLE constructed as if $I(Y^{*}=1)$ were the true outcome, i.e., the logistic regression of $I(Y^{*}=1)$ against $\bm{\psi}({\bf G})$ or each $\bm{\varphi}_{j}(X_{j})$. For $\widehat{\bm{\lambda}}^{(0)}$, we set it up with a proper guess presuming that $Y^{*}$ is informative.

2.3 Stage II: condensing EHR features for phenotyping

With the fitted estimator in Stage I, we derive ${\widehat{\alpha}}({\bf X})=\sum_{j=1}^{p}\bm{\varphi}^{{\sf\scriptscriptstyle{T}}}_{j}(X_{j})\widehat{\bm{\zeta}}_{j}$, serving as a phenotype score condensing the outcomes $X_{1},X_{2},\ldots,X_{p}$. For ${\widehat{\alpha}}({\bf X})$, we further adopt a nonparametric likelihood approach that combines it with ${\bf G}$ to derive an imputation model for $Y$. Since ${\widehat{\alpha}}({\bf X})$ combines multiple EHR outcomes, it tends to be more predictive of $Y$ than each single $m_{j}(X_{j})$. So this procedure can be more efficient than modeling each single $X_{j}$ separately in ${\cal C}(\bm{\theta})$ and is thus more favorable for the downstream analysis. As implied by (3), the optimal ensemble is $\bar{\alpha}({\bf X})=\sum_{j=1}^{p}g^{-1}\{m_{j}(X_{j})\}$ only when the working assumption $X_{1}\perp X_{2}\perp\ldots\perp X_{p}\mid Y$ holds. When there is strong evidence that such conditional independence does not hold, an alternative strategy is to set the phenotyping score $\alpha({\bf X})$ as the first principal component of $g^{-1}\{m_{j}(X_{j})\}$ for $j=1,2,\ldots,p$, to make it representative of the multiple EHR outcomes.

Again, we will not rely on any parametric or model structural assumptions on the sensitivity function ${\cal S}_{\bar{\alpha},y}(c)={\rm Pr}(\bar{\alpha}({\bf X})>c\mid Y=y)$ for $c\in\mathbb{R}$ and $y\in\{0,1\}$ that captures $\bar{\alpha}({\bf X})\mid Y$ . In this case, the log-likelihood function can be written as

$$
\sum_{i=1}^{n}\log\left\{\sum_{y=0}^{1}\lambda_{yY^{*}_{i}}g_{y}\{\bm{\xi}^{{\sf\scriptscriptstyle{T}}}\bm{\psi}({\bf G}_{i})\}\right\}+\sum_{i=1}^{N}\log\left\{-\sum_{y=0}^{1}\dot{{\cal S}}_{\widehat{\alpha},y}\{{\widehat{\alpha}}({\bf X}_{i})\}g_{y}\{\bm{\xi}^{{\sf\scriptscriptstyle{T}}}\bm{\psi}({\bf G}_{i})\}\right\}.
$$

Without any further constraint on ${\cal S}_{\bar{\alpha},y}(c)={\rm Pr}(\bar{\alpha}({\bf X})>c\mid Y=y)$, the above log-likelihood function will not have a unique maximizer. Thus, inspired by existing literature on nonparametric MLE (e.g., Murphy and Van der Vaart, 2000), we restrict ${\cal S}_{\bar{\alpha},y}(c)$ to be a step function that can only jump at the observed data points $\{{\widehat{\alpha}}({\bf X}_{i}):i=1,2,\ldots,N\}$, and denote its jump size at each ${\widehat{\alpha}}({\bf X}_{i})$ as $\nabla{\cal S}_{\bar{\alpha},y}\{{\widehat{\alpha}}({\bf X}_{i})\}$. If the true status $Y_{i}$ were observed, the MLE for ${\cal S}_{\widehat{\alpha},y}(c)$ under this step-function constraint would be derived as

$$
\breve{\cal S}_{\widehat{\alpha},y}(c)=\frac{\sum_{i=1}^{N}I({\widehat{\alpha}}({\bf X}_{i})>c)I(Y_{i}=y)}{\sum_{i=1}^{N}I(Y_{i}=y)}\quad\mbox{for}\quad c={\widehat{\alpha}}({\bf X}_{i^{\prime}}).
$$

Based on this, our objective becomes to maximize

$$
{\cal L}(\bm{\eta}_{{\widehat{\alpha}}})=\sum_{i=1}^{n}\log\left\{\sum_{y=0}^{1}\lambda_{yY^{*}_{i}}g_{y}\{\bm{\xi}^{{\sf\scriptscriptstyle{T}}}\bm{\psi}({\bf G}_{i})\}\right\}+\sum_{i=1}^{N}\log\left\{-\sum_{y=0}^{1}\nabla{\cal S}_{\widehat{\alpha},y}\{{\widehat{\alpha}}({\bf X}_{i})\}g_{y}\{\bm{\xi}^{{\sf\scriptscriptstyle{T}}}\bm{\psi}({\bf G}_{i})\}\right\}, \tag{5}
$$

where $\bm{\eta}_{\alpha}=\{{\cal S}_{\alpha,0}(\cdot),{\cal S}_{\alpha,1}(\cdot),\bm{\lambda},\bm{\xi}\}$, under the step-function constraints on ${\cal S}_{\alpha,0}(\cdot),{\cal S}_{\alpha,1}(\cdot)$. Since we do not specify the correlation or dependence between $Y^{*}$ and ${\widehat{\alpha}}({\bf X})$, we still adopt a composite strategy to model them in (5). But different from the fully composite ${\cal C}(\bm{\theta})$, which also treats each $X_{j}$ separately, we now condense the $X_{j}$'s into a single ${\widehat{\alpha}}({\bf X})$.
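The complete-data estimator $\breve{\cal S}_{\widehat{\alpha},y}(c)$ above has a natural soft-label analogue in which the indicator $I(Y_{i}=y)$ is replaced by an imputed class probability. The sketch below shows this weighted step function. It is an assumption about how imputed outcomes would enter such an update, offered only for intuition; the actual update used in the Appendix algorithm is not reproduced here and may differ. The names `weighted_sensitivity` and `jump_masses` are hypothetical.

```python
import numpy as np

def weighted_sensitivity(scores, weights, c):
    """Step-function estimate of Pr(alpha(X) > c | Y = y).

    `weights[i]` plays the role of I(Y_i = y): use hard 0/1 labels to recover
    the complete-data estimator, or imputed probabilities for a soft version.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(weights[scores > c].sum() / weights.sum())

def jump_masses(weights):
    """Probability mass at each observed score (minus the jump of the step function)."""
    weights = np.asarray(weights, dtype=float)
    return weights / weights.sum()
```

With hard labels, `weighted_sensitivity(scores, Y, c)` and `weighted_sensitivity(scores, 1 - Y, c)` reproduce $\breve{\cal S}_{\widehat{\alpha},1}(c)$ and $\breve{\cal S}_{\widehat{\alpha},0}(c)$ respectively.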

Similar to Algorithm 1, we adopt an EM algorithm to numerically maximize the objective ${\cal L}(\bm{\eta}_{{\widehat{\alpha}}})$ for the solution $\widetilde{\bm{\eta}}_{{\widehat{\alpha}}}=\{\widetilde{\cal S}_{\widehat{\alpha},0}(\cdot),\widetilde{\cal S}_{\widehat{\alpha},1}(\cdot),\widetilde{\bm{\lambda}},\widetilde{\bm{\xi}}\}$; see Algorithm A2 in Appendix A. Finally, we introduce Theorem 1 to establish the ascent properties of our proposed EM algorithms for ${\cal C}(\bm{\theta})$ and ${\cal L}(\bm{\eta}_{{\widehat{\alpha}}})$ formulated in Stages I and II respectively.

Theorem 1

Let ${\widehat{\bm{\theta}}}^{(r)}$ and $\widetilde{\bm{\eta}}_{{\widehat{\alpha}}}^{(r)}$ be the estimators at the $r$-th iteration of the EM Algorithms 1 and A1 respectively. We have ${\cal C}({\widehat{\bm{\theta}}}^{(r)})\leq{\cal C}({\widehat{\bm{\theta}}}^{(r+1)})$ and ${\cal L}(\widetilde{\bm{\eta}}_{{\widehat{\alpha}}}^{(r)})\leq{\cal L}(\widetilde{\bm{\eta}}_{{\widehat{\alpha}}}^{(r+1)})$, i.e., each iteration of our EM algorithms is guaranteed not to decrease the objective log-likelihood functions.

2.4 Stage III: genetic risk modeling and EHR phenotype validation

In Stages I and II introduced above, we fit nonparametric models for $Y\mid{\bf G}$ to make the estimators ${\widehat{\alpha}}(\cdot)$ and $\widehat{{\cal S}}_{\bar{\alpha},y}(\cdot)$ more robust to model misspecification. In practice, directly using such nonparametric models for genetic association analysis often results in large variance and low efficiency due to the curse of dimensionality. Thus, in this stage, we leverage the extracted $\widetilde{\bm{\eta}}_{{\widehat{\alpha}}}$ to construct a parametric genetic risk model for the true outcome $Y_{i}$ against ${\bf G}_{i}$. Specifically, with $\widetilde{\bm{\eta}}_{{\widehat{\alpha}}}$, we characterize $\mathbb{E}[Y_{i}\mid\bar{\alpha}({\bf X}_{i}),{\bf G}_{i}]$ for all $i=1,2,\ldots,N$, and $\mathbb{E}[Y_{i}\mid Y^{*}_{i},{\bf G}_{i}]$ for $i=1,2,\ldots,n$ as

$$
\widetilde{Y}_{i0}=\frac{\widetilde{\lambda}_{1Y_{i}^{*}}g_{1}\{\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\widetilde{\bm{\xi}}\}}{\sum_{y=0}^{1}\widetilde{\lambda}_{yY_{i}^{*}}g_{y}\{\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\widetilde{\bm{\xi}}\}};\quad\widetilde{Y}_{i1}=\frac{\nabla\widetilde{\cal S}_{\widehat{\alpha},1}\{{\widehat{\alpha}}({\bf X}_{i})\}g_{1}\{\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\widetilde{\bm{\xi}}\}}{\sum_{y=0}^{1}\nabla\widetilde{\cal S}_{\widehat{\alpha},y}\{{\widehat{\alpha}}({\bf X}_{i})\}g_{y}\{\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\widetilde{\bm{\xi}}\}},
$$

which coincides with the imputation of the unobserved $Y$ in the last E-step of Algorithm A2. Note that $\widetilde{Y}_{i1}$ is not necessarily consistent for $\mathbb{E}[Y_{i}\mid{\bf X}_{i},{\bf G}_{i}]$ unless the working independence assumption (3) holds and $\mathbb{E}[Y_{i}\mid{\bf X}_{i}]=\mathbb{E}[Y_{i}\mid\bar{\alpha}({\bf X}_{i})]$ . Then we conduct logistic regression for the imputed outcomes $\widetilde{Y}_{i0}$ and $\widetilde{Y}_{i1}$ separately against ${\bf G}_{i}$ , to obtain estimators

$$
{\widetilde{\bm{\beta}}}_{0}=\mathop{\mbox{argmax}}_{\bm{\beta}}\sum_{i=1}^{n}\ell({\widetilde{Y}}_{i0},{\bf G}_{i}^{{\sf\scriptscriptstyle{T}}}\bm{\beta});\quad{\widetilde{\bm{\beta}}}_{1}=\mathop{\mbox{argmax}}_{\bm{\beta}}\sum_{i=1}^{N}\ell({\widetilde{Y}}_{i1},{\bf G}_{i}^{{\sf\scriptscriptstyle{T}}}\bm{\beta}).
$$
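The two maximizations above are ordinary logistic regressions in which the binary response is replaced by an imputed probability in $[0,1]$. A minimal sketch of such a fit by Newton-Raphson follows. The function name `fit_fractional_logistic` and the stopping rule are illustrative choices rather than the authors' implementation, and the design matrix `G` is assumed to already contain an intercept column if one is desired.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_fractional_logistic(G, y_soft, n_iter=50, tol=1e-10):
    """Maximize sum_i l(y_i, G_i' beta) with y_i in [0, 1] by Newton-Raphson."""
    G = np.asarray(G, dtype=float)
    y_soft = np.asarray(y_soft, dtype=float)
    beta = np.zeros(G.shape[1])
    for _ in range(n_iter):
        prob = expit(G @ beta)
        score = G.T @ (y_soft - prob)                       # gradient of the objective
        hess = (G * (prob * (1.0 - prob))[:, None]).T @ G   # negative Hessian
        step = np.linalg.solve(hess, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Because the objective is concave in $\bm{\beta}$ for any outcomes in $[0,1]$, the fit has the same numerical behavior as a standard logistic regression; only the response differs.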

Although $N>n$, the standard error of $\widetilde{\bm{\beta}}_{0}$ may still be smaller than that of $\widetilde{\bm{\beta}}_{1}$ since ${\bf X}$ is typically less informative than the chart review labels $Y^{*}$ in terms of measuring the true $Y$. To derive a more efficient estimator, the final step is to assemble ${\widetilde{\bm{\beta}}}_{0}$ and ${\widetilde{\bm{\beta}}}_{1}$ as:

$$
{\widetilde{\bm{\beta}}}=\widehat{\omega}{\widetilde{\bm{\beta}}}_{0}+(1-\widehat{\omega}){\widetilde{\bm{\beta}}}_{1};\quad\widehat{\omega}\in[0,1],
$$

where $\widehat{\omega}$ is a weight determined using the data to minimize the variance of ${\widetilde{\bm{\beta}}}$ among all convex combinations of ${\widetilde{\bm{\beta}}}_{0}$ and ${\widetilde{\bm{\beta}}}_{1}$. When $N\gg n$, we can show that ${\widetilde{\bm{\beta}}}_{0}$ and ${\widetilde{\bm{\beta}}}_{1}$ are asymptotically independent, and thus the optimal weight is $\widehat{\omega}={\widehat{\rm SE}_{0}^{-2}}/{(\widehat{\rm SE}_{0}^{-2}+\widehat{\rm SE}_{1}^{-2})}$, where $\widehat{\rm SE}_{0}$ and $\widehat{\rm SE}_{1}$ represent the estimated standard errors of ${\widetilde{\bm{\beta}}}_{0}$ and ${\widetilde{\bm{\beta}}}_{1}$. In general, we can take

\widehat{\omega}=\arg\min_{\omega\in[0,1]}(\omega,1-\omega)\widehat{\Sigma}_{\widetilde{\bm{\beta}}_{0},\widetilde{\bm{\beta}}_{1}}(\omega,1-\omega)^{{\sf\scriptscriptstyle{T}}},

where $\widehat{\Sigma}_{\widetilde{\bm{\beta}}_{0},\widetilde{\bm{\beta}}_{1}}$ is the estimated asymptotic covariance matrix of $({\widetilde{\bm{\beta}}}_{0},{\widetilde{\bm{\beta}}}_{1})$ computed using the bootstrap. Since the true disease status $Y$ is unobserved, the estimators $\widetilde{\bm{\beta}}_{0}$ and $\widetilde{\bm{\beta}}_{1}$ are subject to a label-switching issue: the labels $Y=0$ and $Y=1$ cannot be distinguished from the observed data alone. To address this, we assume that the coefficient of $G_{1}$ is greater than zero, with $G_{1}$ chosen as a feature informative of $Y$. Correspondingly, we flip the sign of the fitted ${\widetilde{\bm{\beta}}}_{0}$ or ${\widetilde{\bm{\beta}}}_{1}$ if ${\widetilde{\beta}}_{01}<0$ or ${\widetilde{\beta}}_{11}<0$. Alternatively, one could restrict the prevalence of $Y$ to be smaller than $0.5$, which does not require knowledge of an informative feature $G_{1}$.

As a by-product, we are also able to validate the derived phenotyping score ${\widehat{\alpha}}({\bf X})$ using the fitted sensitivity functional $\widetilde{\cal S}_{\widehat{\alpha},y}(\cdot)$. Denote the limiting (population-level) function of ${\widehat{\alpha}}({\bf X})$ as $\bar{\alpha}({\bf X})$. The true positive rate (TPR) and false positive rate (FPR) of the classifier $I(\widehat{\alpha}({\bf X})>c)$ or $I(\bar{\alpha}({\bf X})>c)$ on the true label $Y$ can be naturally estimated using $\widetilde{\cal S}_{\widehat{\alpha},1}(c)$ and $\widetilde{\cal S}_{\widehat{\alpha},0}(c)$, respectively. Furthermore, the receiver operating characteristic (ROC) curve of ${\widehat{\alpha}}({\bf X})$ or $\bar{\alpha}({\bf X})$ can be estimated by $\widehat{\mbox{ROC}}(u)=\widetilde{\cal S}_{\widehat{\alpha},1}\{\widetilde{\cal S}^{-1}_{\widehat{\alpha},0}(u)\}$ for $u\in[0,1]$, and the area under the ROC curve by $\widehat{\mbox{AUC}}=\int_{0}^{1}\widehat{\mbox{ROC}}(u)du$.
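As an illustration of how the ROC curve and AUC follow from the two fitted sensitivity functions, the sketch below traces the curve $(\widetilde{\cal S}_{\widehat{\alpha},0}(c),\widetilde{\cal S}_{\widehat{\alpha},1}(c))$ over a grid of cutoffs and integrates it with the trapezoidal rule. It assumes the two functions have already been evaluated on a common increasing grid. The function names and the toy binormal example are hypothetical and are used only to check the calculation.

```python
import math

import numpy as np


def roc_auc_from_survival(s1, s0):
    """ROC curve and AUC from TPR values s1 and FPR values s0.

    s1[k] and s0[k] are P(score > c_k | Y = 1) and P(score > c_k | Y = 0)
    on a common increasing grid of cutoffs c_k, so both are non-increasing.
    """
    # Add the trivial end points and reverse so that the FPR runs from 0 to 1.
    fpr = np.concatenate(([1.0], s0, [0.0]))[::-1]
    tpr = np.concatenate(([1.0], s1, [0.0]))[::-1]
    # Trapezoidal rule for the integral of ROC(u) over u in [0, 1].
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc


def normal_survival(c, mean):
    """P(Z > c) for Z ~ N(mean, 1); used only to build a toy example."""
    return np.array([0.5 * math.erfc((ci - mean) / math.sqrt(2.0)) for ci in c])


# Toy example: scores are N(1, 1) among cases and N(0, 1) among controls.
cutoffs = np.linspace(-6.0, 7.0, 2001)
fpr, tpr, auc = roc_auc_from_survival(
    normal_survival(cutoffs, 1.0), normal_survival(cutoffs, 0.0)
)
```

In this toy binormal setting the exact AUC is $\Phi(1/\sqrt{2})$, which the numerical value can be compared against.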

3 Asymptotic analysis

In this section, we provide an asymptotic analysis of the TUBE estimators ${\widehat{\alpha}}({\bf X})$, $\widetilde{\cal S}_{\alpha,y}(\cdot)$, and $\widetilde{\bm{\beta}}$ resulting from the steps described in Sections 2.2–2.4. We consider ${\bf G}$ as a continuous univariate genetic risk score and $\psi({\bf G})$ as its spline basis function. Let $\bar{\bm{\theta}}=\{\bar{\bm{\xi}},\bar{\bm{\zeta}},\bar{\bm{\lambda}},\bar{\mu}\}$ and $\bar{\bm{\eta}}=\{\bar{{\cal S}}_{\bar{\alpha},1},\bar{{\cal S}}_{\bar{\alpha},0},\bar{\bm{\lambda}},\bar{\bm{\xi}}\}$ be the population-level (true) parameters. We define the norm of $\bm{\theta}$ to be $\|\bm{\theta}\|_{2}=\left\{\mathbb{E}\{\|\bm{\xi}\|_{2}^{2}\}+\mathbb{E}\{\|\bm{\zeta}\|_{2}^{2}\}+\mathbb{E}\{\|\bm{\lambda}\|_{2}^{2}\}+\mathbb{E}\{\mu^{2}\}\right\}^{1/2}$ and the norm of $\bm{\eta}$ to be $\|\bm{\eta}\|_{2}=\left\{\sum_{y=0}^{1}\int({\cal S}_{\alpha,y}(c))^{2}dc+\mathbb{E}\{\|\bm{\lambda}\|_{2}^{2}\}+\mathbb{E}\{\|\bm{\xi}\|_{2}^{2}\}\right\}^{1/2}$. We first introduce smoothness and regularity assumptions as follows.

Assumption 1

Covariates $({\bf X},{\bf G})$ have a compact domain $\mathcal{X}\times\mathcal{G}$, with their joint probability density function being twice continuously differentiable. For all $j=1,2,\ldots,p$ and $y=0,1$, $m_{jy}(x)$ and $\gamma_{y}(g)$ are twice continuously differentiable. For $y=0,1$, ${\cal S}^{\prime}_{\alpha,y}(c)$, the derivative of ${\cal S}_{\alpha,y}(c)$, is continuously differentiable.

Assumption 2

The parameter spaces of $\bar{\bm{\theta}}$ and $\bar{\bm{\eta}}$ are compact. The Hessian matrix $\mathbb{E}[{\bf G}{\bf G}^{{\sf\scriptscriptstyle{T}}}g_{1}^{\prime}({\bf G}^{{\sf\scriptscriptstyle{T}}}\bm{\beta}_{0})]$ has all its eigenvalues bounded away from $0$ and $\infty$. For any $\bm{\theta}_{1},\bm{\theta}_{2}$ and $\bm{\eta}_{1},\bm{\eta}_{2}$, $\mathbb{E}[{\cal C}(\bm{\theta}_{1}+\tau(\bm{\theta}_{2}-\bm{\theta}_{1}))]$ and $\mathbb{E}[{\cal L}(\bm{\eta}_{1}+\tau(\bm{\eta}_{2}-\bm{\eta}_{1}))]$ are twice continuously differentiable with respect to $\tau\in[0,1]$, $\frac{\partial^{2}}{\partial\tau^{2}}\mathbb{E}[{\cal C}(\bm{\theta}_{1}+\tau(\bm{\theta}_{2}-\bm{\theta}_{1}))]\asymp-\|\bm{\theta}_{2}-\bm{\theta}_{1}\|_{2}^{2}$, and $\frac{\partial^{2}}{\partial\tau^{2}}\mathbb{E}[{\cal L}(\bm{\eta}_{1}+\tau(\bm{\eta}_{2}-\bm{\eta}_{1}))]\asymp-\|\bm{\eta}_{2}-\bm{\eta}_{1}\|_{2}^{2}$.

Remark 2

Assumption 1 consists of mild smoothness conditions commonly used in the asymptotic analysis of M-estimation and sieve-smoothed regression (e.g., Van der Vaart, 2000; Chen, 2007). Assumption 2 requires the non-singularity of the Hessian matrix as well as the strong convexity of the loss functions, both of which are also frequently assumed in the literature.

Remark 3

When ${\bf X}$ and ${\bf G}$ are discrete, e.g., ${\bf G}$ being the categorical functions of several SNPs, Assumption 1 will be as given. In such a situation with discrete ${\bf X}$, the sensitivity function ${\cal S}_{\alpha,y}(c)$ has only finitely many choices of the cutoff $c$, and the asymptotic analysis of its estimator degenerates and simplifies.

Next, we establish the consistency and asymptotic normality of the phenotyping score ${\widehat{\alpha}}({\bf x})$ in Theorem 2, as well as those of the estimator of its sensitivity function in Theorem 3. Let $J_{N}$ be the dimensionality of the bases $\bm{\varphi}_{j}({\bf X})$ and $\bm{\psi}({\bf G})$, which is assumed to increase with $N$.

Theorem 2

Suppose that Assumptions 1 and 2 hold and that $N^{1/4}\ll J_{N}\ll N^{1/2}$. Then, as $n,N\rightarrow\infty$, $\sup_{{\bf x}\in\mathcal{X}}|{\widehat{\alpha}}({\bf x})-\bar{\alpha}({\bf x})|$ converges to $0$ in probability. Moreover, for ${\bf x}\in\mathcal{X}$, $\sqrt{{N}/{J_{N}}}\{{\widehat{\alpha}}({\bf x})-\bar{\alpha}({\bf x})\}$ converges weakly to some zero-mean Gaussian process.

Theorem 3

Under the assumptions of Theorem 2, as $n,N\to\infty$, $\sup_{c\in\mathbb{R}}\{|\widetilde{\cal S}_{{\widehat{\alpha}},0}(c)-\bar{{\cal S}}_{\bar{\alpha},0}(c)|+|\widetilde{\cal S}_{{\widehat{\alpha}},1}(c)-\bar{{\cal S}}_{\bar{\alpha},1}(c)|\}$ converges to $0$ in probability, and for $c\in\mathbb{R}$, $\sqrt{{N}/{J_{N}}}\{\widetilde{\cal S}_{{\widehat{\alpha}},0}(c)-\bar{{\cal S}}_{\bar{\alpha},0}(c),\widetilde{\cal S}_{{\widehat{\alpha}},1}(c)-\bar{{\cal S}}_{\bar{\alpha},1}(c)\}$ converges weakly to some zero-mean Gaussian process.

Considering that our primary goal is genetic risk estimation with $\widetilde{\bm{\beta}}$, we under-smooth the sieve estimator of $\bar{\alpha}$ by taking $J_{N}$ slightly larger than $O(N^{1/4})$, to achieve the asymptotic unbiasedness and normality of $\widetilde{\bm{\beta}}$ that will be established in Theorem 4. This choice of $J_{N}$ does not lead to the optimal convergence rate of the by-products ${\widehat{\alpha}}({\bf x})$ and $\widetilde{\cal S}_{{\widehat{\alpha}},y}(c)$. To further refine these estimators, one only needs to take $J_{N}\asymp N^{1/5}$ and carry out Steps I and II again. This leads to the $N^{-2/5}$-convergence of ${\widehat{\alpha}}({\bf x})-\bar{\alpha}({\bf x})$ and $\widetilde{\cal S}_{{\widehat{\alpha}},y}(c)-\bar{{\cal S}}_{\bar{\alpha},y}(c)$, an improvement over the current $N^{-3/8}$-convergence. However, the estimators derived with $J_{N}\asymp N^{1/5}$ cannot ensure the desirable parametric rate and asymptotic normality of $\widetilde{\bm{\beta}}_{0}$ and $\widetilde{\bm{\beta}}_{1}$ obtained in Step III. See, e.g., Chen (2007) for more relevant results.
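The rates quoted above can be checked with a short calculation. The sketch below combines the stochastic error of order $\sqrt{J_{N}/N}$ implied by the scaling in Theorems 2 and 3 with an approximation bias of order $J_{N}^{-2}$. The latter is an assumption made here for illustration (the usual spline approximation rate for twice continuously differentiable functions) and is not stated explicitly in the text.

```latex
\begin{align*}
\text{stochastic error: } \sqrt{J_N/N},
  &\qquad \text{approximation bias (assumed): } J_N^{-2}, \\
J_N \asymp N^{1/5}:\quad
  \sqrt{J_N/N} \asymp N^{(1/5-1)/2} = N^{-2/5},
  &\qquad J_N^{-2} \asymp N^{-2/5}, \\
J_N \approx N^{1/4}:\quad
  \sqrt{J_N/N} \approx N^{(1/4-1)/2} = N^{-3/8},
  &\qquad J_N^{-2} \approx N^{-1/2}.
\end{align*}
```

Under this assumption, $J_{N}\asymp N^{1/5}$ balances the two error terms and gives the $N^{-2/5}$ rate, whereas $J_{N}\gg N^{1/4}$ pushes the bias below the $N^{-1/2}$ scale at the price of the slower $N^{-3/8}$ rate for the by-products.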

Finally, we establish the convergence properties of $\widetilde{\bm{\beta}}_{0}$ and $\widetilde{\bm{\beta}}_{1}$, which imply the $n^{1/2}$-consistency and asymptotic normality of the TUBE estimator $\widetilde{\bm{\beta}}$.

Theorem 4

Under all assumptions in Theorem 2, both $\widetilde{\bm{\beta}}_{0}$ and $\widetilde{\bm{\beta}}_{1}$ converge to $\bar{\bm{\beta}}$ in probability and $\{\sqrt{n}(\widetilde{\bm{\beta}}_{0}-\bar{\bm{\beta}}),\sqrt{N}(\widetilde{\bm{\beta}}_{1}-\bar{\bm{\beta}})\}$ converges weakly to a zero-mean Gaussian distribution.

4 Simulation

We conduct comprehensive simulation studies to evaluate the finite-sample performance of the proposed method. Let Binomial $\left\{n,p\right\}$ denote the binomial distribution with $n$ trials and a success probability of $p$ . To generate risk factors ${\bf G}=(G_{1},\ldots,G_{q})^{{\sf\scriptscriptstyle{T}}}$ , we consider $q=4$ with $G_{1}\sim{\rm N}(0,1)$ , and $G_{2}$ , $G_{3}$ , $G_{4}$ generated independently from Binomial $\left\{2,0.6\right\}$ . For generation of the unobserved true outcome $Y$ and EHR surrogates ${\bf X}$ , we consider the following three settings:

(a) $Y\sim\textrm{Bernoulli}\left\{g({\bf G}^{{\sf\scriptscriptstyle{T}}}\bm{\beta}^{*})\right\}$ where $\bm{\beta}^{*}=(-4.6,1.6,1.6,1.6,1.6)^{{\sf\scriptscriptstyle{T}}}$; and ${\bf X}=\{Y+0.5(1-Y)+\epsilon_{1},Y+0.5(1-Y)+\epsilon_{2},0.5Y+0.25(1-Y)+\epsilon_{3}\}^{{\sf\scriptscriptstyle{T}}}$ where $\epsilon_{1}$, $\epsilon_{2}$, $\epsilon_{3}$ are independent standard normal noises.

(b) $Y\sim\textrm{Bernoulli}\left\{g(G_{1}+G_{1}^{2}-\cos(G_{1})-G_{2}-G_{3}-G_{4}+2)\right\}$, with ${\bf X}$ generated given $Y$ in the same way as in (a).

(c) $Y\sim\textrm{Bernoulli}\left\{g(-G_{1}+G_{1}^{2}+\sin(G_{1})-G_{2}-G_{3}-G_{4}+1)\right\}$; and ${\bf X}=\{Y+0.5(1-Y)+0.005G_{1}+\epsilon_{1},Y+0.5(1-Y)+0.005G_{1}+\epsilon_{2},0.5Y+0.25(1-Y)+0.005G_{1}+\epsilon_{3}\}^{{\sf\scriptscriptstyle{T}}}$ where $\epsilon_{1}$, $\epsilon_{2}$, $\epsilon_{3}$ are independent standard normal noises.

In all settings, we set $N=10000$ and generate $Y^{*}$ from $\textrm{Binomial}\left\{2,\textrm{expit}(-2+4Y+0.1_{3}^{{\sf\scriptscriptstyle{T}}}{\bf X})\right\}$. As discussed earlier, $Y^{*}$ is supposed to be an imperfect but more informative outcome compared to ${\bf X}$. Our setup mimics this by imposing a much stronger effect of $Y$ on $Y^{*}$. We also let the number of $Y^{*}$ labels, $n$, range from $100$ to $1000$ to investigate its influence on the efficiency of the methods.
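For concreteness, the data-generating mechanism of setting (a) can be sketched in Python as follows. Several details are assumptions made for this illustration rather than facts stated above: the first entry of $\bm{\beta}^{*}$ is read as an intercept, $0.1_{3}$ is read as the vector $(0.1,0.1,0.1)^{{\sf\scriptscriptstyle{T}}}$, $g$ is taken to be the expit link, and the $n$ labeled subjects are taken to be the first $n$ of the $N$ i.i.d. draws. The function name `simulate_setting_a` is hypothetical.

```python
import numpy as np


def expit(t):
    """Logistic link, 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))


def simulate_setting_a(N=10000, n=500, seed=0):
    """Draw (G, Y, X, Y_star, delta) following setting (a)."""
    rng = np.random.default_rng(seed)
    # G1 ~ N(0, 1); G2, G3, G4 ~ Binomial{2, 0.6}, all independent.
    G = np.column_stack([rng.normal(size=N), rng.binomial(2, 0.6, size=(N, 3))])
    # Assumption: the first entry of beta* is the intercept.
    beta = np.array([-4.6, 1.6, 1.6, 1.6, 1.6])
    Y = rng.binomial(1, expit(beta[0] + G @ beta[1:]))
    # Surrogates: a mean shift depending on Y plus standard normal noise.
    mean_X = np.column_stack(
        [Y + 0.5 * (1 - Y), Y + 0.5 * (1 - Y), 0.5 * Y + 0.25 * (1 - Y)]
    )
    X = mean_X + rng.normal(size=(N, 3))
    # Noisy label: Binomial{2, expit(-2 + 4Y + 0.1 * (X1 + X2 + X3))}.
    Y_star = rng.binomial(2, expit(-2.0 + 4.0 * Y + 0.1 * X.sum(axis=1)))
    # Assumption: the first n subjects are the ones with Y_star observed.
    delta = np.zeros(N, dtype=int)
    delta[:n] = 1
    return G, Y, X, Y_star, delta


G, Y, X, Y_star, delta = simulate_setting_a(N=2000, n=200, seed=1)
```

Settings (b) and (c) differ only in the linear predictor for $Y$ and, for (c), in the additional $0.005G_{1}$ term in each component of ${\bf X}$.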

We consider the following three methods for comparison: (1) the simple approach, referred to as Naive-Logistic, that directly uses the label $Y^{*}$ as the outcome for analysis; (2) our main benchmark, the method of Hong et al. (2019), which uses the composite likelihood approach with parametric modeling on ${\bf X}$ and ${\bf G}$; (3) the proposed TUBE approach with $\bm{\psi}({\bf G})=(\bm{\psi}_{1}(G_{1}),G_{2},G_{3},G_{4})$ and the basis functions $\bm{\varphi}_{j}$ and $\bm{\psi}_{1}(G_{1})$ specified as natural splines with $4$ degrees of freedom. Note that the method of Hong et al. (2019) is fully parametric and, thus, will incur model misspecification in settings (b) and (c) due to the non-linearity of $Y\sim{\bf G}$. In setting (c), we introduce a small indirect effect of ${\bf G}$ on ${\bf X}$ given $Y$ that moderately breaks our key independence assumption ${\bf X}\perp{\bf G}\mid Y$. This is to examine the sensitivity of our method to a (slight) violation of this assumption.

The parameters of interest include $\bm{\beta}$, the logistic model coefficients obtained by regressing $Y$ against ${\bf G}$, as well as the accuracy parameter AUC of $Y$ against the phenotyping score obtained by each method. The population-level values of $\bm{\beta}$ and AUC are computed by generating an extremely large sample. Our evaluation metrics include the mean squared error (MSE) in Figure 2; the percent bias, i.e., the ratio between the absolute bias and the root MSE, in Figure 3; and the coverage probability (CP) of the 95% CI computed using the standard resampling bootstrap procedure, in Figure 4. The results in Figures 2–4 are based on $500$ simulation replications. For the multi-dimensional $\bm{\beta}$, we present only the average performance over $\beta_{1},\ldots,\beta_{4}$ in these figures; the element-wise results can be found in the tables of Appendix B.
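The three evaluation metrics can be computed from the simulation replicates as in the following minimal sketch. The function name `summarize_replicates` and its arguments are hypothetical. The percent bias follows the definition given above (absolute bias divided by root MSE), and the interval end points are assumed to have been produced by the bootstrap beforehand.

```python
import numpy as np


def summarize_replicates(estimates, ci_lower, ci_upper, truth):
    """MSE, percent bias and coverage probability for one scalar parameter.

    estimates, ci_lower and ci_upper hold one entry per simulation replicate;
    truth is the population-level value of the parameter.
    """
    estimates = np.asarray(estimates, dtype=float)
    ci_lower = np.asarray(ci_lower, dtype=float)
    ci_upper = np.asarray(ci_upper, dtype=float)
    bias = estimates.mean() - truth
    mse = float(np.mean((estimates - truth) ** 2))
    # Percent bias as defined in the text: |bias| / sqrt(MSE).
    percent_bias = float(abs(bias) / np.sqrt(mse))
    # Share of replicates whose interval contains the true value.
    coverage = float(np.mean((ci_lower <= truth) & (truth <= ci_upper)))
    return {"mse": mse, "percent_bias": percent_bias, "coverage": coverage}


# Toy example with three replicates and a true value of 1.0.
est = np.array([0.9, 1.1, 1.2])
summary = summarize_replicates(est, est - 0.15, est + 0.15, truth=1.0)
```

For the multi-dimensional $\bm{\beta}$, the same function would be applied coefficient by coefficient and the results averaged.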

In all settings, Naive-Logistic shows large MSEs and percent biases due to the error of $Y^{*}$ in measuring the true $Y$. In setting (a), TUBE attains performance close to that of the benchmark method of Hong et al. (2019), which relies on a fully parametric modeling strategy and does not encounter model misspecification in this setting. Specifically, the percentage difference in the MSE between the two methods is smaller than $5\%$ on all parameters when $n\geq 500$ in setting (a). Also, both methods attain sufficiently small percent bias and desirable coverage probability on $\bm{\beta}$ and AUC. Thus, although it may seem redundant to use the more complex semiparametric modeling strategy of TUBE instead of Hong et al. (2019) when the true models are indeed linear and parametric, this complexity does not cause TUBE to lose validity or efficiency. This result is in line with our conclusion in Section 3 that the sieve estimators do not affect the parametric rate of our estimator for $\bm{\beta}$ due to under-smoothing.

In settings (b) and (c), under which the fully parametric method of Hong et al. (2019) suffers from severe model misspecification, TUBE achieves significantly better performance than Hong et al. (2019) and ensures the validity of inference. For example, under setting (b) with $n=500$, the average MSE of TUBE on $\bm{\beta}$ is more than 90% smaller than that of Hong et al. (2019). Also, TUBE successfully maintains a small percent bias (5%–10%) and appropriate coverage probability, while Hong et al. (2019) fails to provide valid inference, with average coverage rates around 30% below the nominal level of 95% in setting (b). This substantial improvement of TUBE results from the nonparametric construction in our Steps I and II, which protects our approach against bias due to the nonlinear effects.

In addition, we notice that as the labeled sample size $n$ increases, the MSEs of TUBE on $\bm{\beta}$ and AUC gradually decrease, as $Y^{*}$ provides additional information over ${\bf X}$. For example, when $n$ increases from $100$ to $500$, TUBE’s MSE on AUC decreases by more than $50\%$ in all settings. Recall that in practice and in our simulation setup, $Y^{*}$ is usually more informative than ${\bf X}$ even though both of them contain errors in measuring the true $Y$. Thus, moderately increasing the size of $Y^{*}$ could result in an efficiency gain even with the total sample size $N$ unchanged. Meanwhile, we do not see an improvement of Naive-Logistic and Hong et al. (2019) as $n$ increases in settings (b) and (c), probably because of their large bias.

5 Real Example

The rising incidence of Type II diabetes mellitus (T2D) in recent years has raised great health concerns. Previous genome-wide association studies (GWAS) have identified many genetic variations associated with insulin resistance or inadequate insulin production contributing to T2D (Mahajan et al., 2018). Consequently, polygenic risk scores (GRS) have been developed to predict an individual’s genetic risk of developing T2D (He et al., 2021). These advancements provide great potential for precision medicine approaches in the prevention and management of T2D. In this application, we study the Mass General Brigham (MGB) biobank data (Castro et al., 2022) with the primary goal of building a genetic risk prediction model for T2D using its GRS and demographic information.

Our data set includes $N=16,963$ MGB biobank participants up to 2021 with their available EHR features updated for the same year. Their risk factors ${\bf G}$ contain $G_{1}$, a one-dimensional GRS for T2D derived using the reported variants and effect sizes of Mahajan et al. (2018), as well as gender, denoted as $G_{2}$ ($G_{2}=1$ for female). The EHR surrogates ${\bf X}$ include $X_{1}$, the log-transformed total count of the International Classification of Diseases (ICD) codes for T2D, and $X_{2}$, the value of hemoglobin A1C obtained via laboratory tests. In addition, we have collected $Y^{*}$ on a subset of $n=269$ patients as the manual chart review label for T2D status created by clinicians in 2014. Due to the gap between the time windows of data collection, $Y^{*}$ is an imperfect label for the true T2D status $Y$, with its potential measurement error coming from the missing information between 2014 and 2021, as well as the switch of the ICD system from version 9 to 10 around 2015 at MGB. For the purpose of validation, we also extract the chart review labels created by clinicians according to all information up to 2021 on a random subsample of the data with size $n_{v}=220$. These labels are closer to (arguably identical to) the true T2D status $Y$ and are only used for validation and evaluation of the estimators trained on the set $\mathscr{O}=\{{\bf O}_{i}=(Y^{*}_{i}\delta_{i},\delta_{i},{\bf X}_{i},{\bf G}_{i}):i=1,2,...,N\}$.

In addition to Hong et al. (2019) and Naive-Logistic studied in Section 4, we also include four simple benchmark estimators obtained through logistic regression against ${\bf G}$ using I(ICD $\geq$ 1), I(ICD $\geq$ 2), I(A1C $\geq$ 5.7), and I(A1C $\geq$ 6.4), respectively, as the binary outcomes. All of them are common and convenient ways to screen subjects with T2D and are frequently used in existing biomedical studies and practice. As a secondary analysis, we also estimate the AUC of the two important surrogates ICD and A1C using the imputation for $Y$ in TUBE and other methods, except the aforementioned approaches that directly use ICD or A1C to construct the outcome. This aim is slightly different from evaluating the derived phenotyping score ${\widehat{\alpha}}({\bf X})$ considered in Sections 2 and 4, but it can be realized using nearly the same strategy and is typically more useful for clinicians and researchers in practice. We use 200 bootstrap replications to quantify the variance of all the estimators. The resulting estimators with their standard errors are presented in Table 1.

Using the validation set with the true label $Y$, we obtain a validation estimator $\widehat{\bm{\beta}}_{v}$ and evaluate the AUC of ICD and A1C. Evaluation metrics of the estimators for $\bm{\beta}$ include: (1) the mean squared prediction error (MSPE), defined as the sample mean of $\{g({\bf G}_{i}^{{\sf\scriptscriptstyle{T}}}{\widehat{\bm{\beta}}}_{v})-g({\bf G}_{i}^{{\sf\scriptscriptstyle{T}}}{\widehat{\bm{\beta}}})\}^{2}$; (2) the deviance of the logistic model evaluated on the target data; (3) the classifier’s correlation (Class. Cor) with $\widehat{\bm{\beta}}_{v}$, i.e., the sample correlation of $I(g({\bf G}_{i}^{{\sf\scriptscriptstyle{T}}}{\widehat{\bm{\beta}}}_{v})>c)$ and $I(g({\bf G}_{i}^{{\sf\scriptscriptstyle{T}}}{\widehat{\bm{\beta}}})>c)$, where $c$ is the sample mean of $g({\bf G}_{i}^{{\sf\scriptscriptstyle{T}}}{\widehat{\bm{\beta}}}_{v})$; and (4) the false classification rate (False Class.) compared to $\widehat{\bm{\beta}}_{v}$, i.e., the empirical probability of $I(g({\bf G}_{i}^{{\sf\scriptscriptstyle{T}}}{\widehat{\bm{\beta}}}_{v})>c)\neq I(g({\bf G}_{i}^{{\sf\scriptscriptstyle{T}}}{\widehat{\bm{\beta}}})>c)$. The evaluation results are presented in Table 2.
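Metrics (1), (3), and (4) compare fitted risks under a candidate estimator with those under the validation estimator, and can be sketched in Python as below. The function name `compare_to_validation` is hypothetical; the design matrix is assumed to carry a leading column of ones for the intercept, and $g$ is taken to be the expit link. The deviance in (2) is omitted because its exact normalization is not spelled out here.

```python
import numpy as np


def expit(t):
    """Logistic link, 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))


def compare_to_validation(design, beta_hat, beta_val):
    """MSPE, classifier correlation and false classification rate.

    design: (m, d) matrix whose first column is assumed to be all ones.
    beta_hat: candidate coefficient vector; beta_val: validation estimator.
    """
    p_val = expit(design @ beta_val)
    p_hat = expit(design @ beta_hat)
    c = p_val.mean()  # cutoff: sample mean of the validation-based risks
    cls_val = (p_val > c).astype(float)
    cls_hat = (p_hat > c).astype(float)
    mspe = float(np.mean((p_val - p_hat) ** 2))
    false_class = float(np.mean(cls_val != cls_hat))
    if cls_val.std() == 0.0 or cls_hat.std() == 0.0:
        class_cor = float("nan")  # correlation is undefined for a constant classifier
    else:
        class_cor = float(np.corrcoef(cls_val, cls_hat)[0, 1])
    return {"mspe": mspe, "class_cor": class_cor, "false_class": false_class}


# Toy example with simulated covariates (not the MGB data).
rng = np.random.default_rng(0)
design = np.column_stack(
    [np.ones(300), rng.normal(size=300), rng.binomial(1, 0.5, size=300)]
)
beta_val = np.array([-1.3, 1.0, -1.0])
same = compare_to_validation(design, beta_val, beta_val)
other = compare_to_validation(design, np.array([-1.0, 0.5, -0.5]), beta_val)
```

By construction, an estimator identical to $\widehat{\bm{\beta}}_{v}$ attains an MSPE of $0$, a classifier’s correlation of $1$, and a false classification rate of $0$, which matches the Validation row of Table 2.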

	$\beta_{0}$ (Intercept)	$\beta_{1}$ (GRS)	$\beta_{2}$ (Gender)	AUC(ICD)	AUC(A1C)
ICD $\geq 1$	$-0.955_{0.028}$	$0.649_{0.08}$	$-0.556_{0.036}$	–	–
ICD $\geq 2$	$-1.286_{0.031}$	$0.795_{0.087}$	$-0.627_{0.04}$	–	–
A1C $\geq 5.7$	$-0.737_{0.027}$	$0.464_{0.076}$	$-0.461_{0.034}$	–	–
A1C $\geq 6.5$	$-2.1_{0.041}$	$0.818_{0.115}$	$-0.618_{0.053}$	–	–
Naive-Logistic	$-1.386_{0.31}$	$2.221_{0.639}$	$-1.572_{0.377}$	$0.949_{0.016}$	$0.805_{0.023}$
Hong et al. 2019	$-1.223_{0.136}$	$1.204_{0.160}$	$-0.806_{0.107}$	$0.856_{0.046}$	$0.787_{0.035}$
TUBE	$-1.352_{0.215}$	$1.162_{0.200}$	$-0.844_{0.140}$	$0.973_{0.016}$	$0.894_{0.013}$
Validation	$-1.341_{0.263}$	$1.007_{0.854}$	$-0.979_{0.387}$	$0.983_{0.008}$	$0.872_{0.036}$

Table 1: Estimators for the T2D genetic model coefficient $\bm{\beta}$ and the AUCs of ICD and A1C, with their empirical standard errors presented as subscripts.

	MSPE	Deviance	Class. Cor	False Class.
ICD $\geq 1$	$0.0064$	$0.004$	$0.20$	$0.46$
ICD $\geq 2$	$0.0008$	$-0.014$	$0.81$	$0.10$
A1C $\geq 5.7$	$0.0156$	$0.029$	$0$	$0.50$
A1C $\geq 6.4$	$0.0069$	$0.010$	$0.12$	$0.48$
Naive-Logistic	$0.0034$	$0.000$	$0.40$	$0.36$
Hong et al. 2019	$0.0011$	$-0.013$	$0.81$	$0.10$
TUBE	$\mathbf{0.0002}$	$\mathbf{-0.017}$	$\mathbf{0.95}$	$\mathbf{0.03}$
Validation	$0$	$-0.017$	$1$	$0$

Table 2: Estimation performance in the T2D genetic model $\bm{\beta}$ evaluated using the metrics introduced in Section 5.

Among all methods under comparison, TUBE attains the closest point estimates to the validation estimator in terms of both $\bm{\beta}$ and AUC. For example, the AUC of A1C evaluated using TUBE-imputed outcomes differs from the validation estimator by only around $0.02$, while all the other estimators show gaps of more than $0.06$ to the validation estimator. The estimation performance on $\bm{\beta}$ is examined more carefully in Table 2, where TUBE achieves the best performance on all metrics among all estimators except for $\widehat{\bm{\beta}}_{v}$. For example, compared to the recent method proposed by Hong et al. (2019), our method attains a more than $70\%$ reduction in MSPE and a $0.14$ larger classifier’s correlation with the validation estimator. These results illustrate the effectiveness of leveraging our semiparametric modeling strategy to reduce potential bias due to misspecification. Meanwhile, although TUBE involves more complicated nonparametric regression, it does not result in significant inflation of the standard errors compared to Hong et al. (2019), which is a benefit of using parametric regression (projection) in Step III.

Our estimator of $\bm{\beta}$ reveals that the GRS has a significant positive effect (log(OR) = $1.16$, 95% CI: $[0.77,1.55]$) on the risk of T2D and that men have a significantly higher risk of developing T2D than women in our study cohort. Interestingly, the effect sizes estimated using the four simple EHR outcomes, i.e., I(ICD $\geq$ 1), I(ICD $\geq$ 2), I(A1C $\geq$ 5.7), and I(A1C $\geq$ 6.4), are all smaller than $\beta_{1}$ and $\beta_{2}$ estimated by TUBE. As an explanation of this observation, after we convert the error-prone EHR outcomes to binary variables, they have the same scale as the true outcome $Y$ and, thus, show weaker association with the risk factors than $Y$ due to their measurement errors. This can be justified under the key assumption that ICD and A1C are independent of the baseline risk factors given the true T2D status.

6 Discussion

In summary, we propose TUBE, a novel unsupervised method for analyzing multiple error-prone EHR outcomes and noisy labels against baseline risk factors, such as genetic variants extracted from EHR-linked biobanks. TUBE incorporates a nonparametric composite regression step, and then uses it to combine the EHR outcomes for phenotyping and to derive a parametric genetic risk model through projection. Compared to existing methods, our semiparametric strategy has two advantages. First, the nonparametric composite construction at the first stage safeguards the unsupervised learning against potential bias due to model misspecification. Second, the parametric genetic risk model obtained through projection enhances interpretability and achieves significantly reduced variance in comparison to a fully nonparametric approach. These advantages are supported by our comprehensive asymptotic analysis, simulations, and a real-world study.

We acknowledge several limitations and potential extensions of our work. First, the validity of our method is vulnerable to severe violation of the conditional independence assumption between the EHR outcomes and the baseline covariates. This issue can be alleviated by incorporating (small) samples with the true labels to calibrate the unsupervised estimator derived from surrogates. Recent advancements in surrogate-assisted semi-supervised learning (Zhang et al., 2022; Hou et al., 2023b) are particularly relevant to this discussion. Second, our current setup focuses on binary disease status. In current biomedical studies, the time to the onset of clinical events (e.g., cancer relapse) is often not readily available, with its EHR surrogates subject to measurement errors. Simple estimates of the event time based on billing or procedure codes may poorly approximate the true outcome and lead to bias. Therefore, expanding TUBE to incorporate multiple sources of imperfect and temporal endpoints under the survival setting is a potential direction for future research. In addition, our current method only accommodates low-dimensional genetic variants and a single disease or phenotype. Recent large-scale genome- and phenome-wide studies (e.g., Huang and Labrecque, 2019; Verma et al., 2023) provide a strong motivation for its extension to accommodate high-dimensional or machine learning estimates of the genetic risk models and multi-phenotype studies.

References

Athey et al., (2019) Athey, S., Chetty, R., Imbens, G. W., and Kang, H. (2019). The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Technical report, National Bureau of Economic Research.
Banda et al., (2017) Banda, J. M., Halpern, Y., Sontag, D., and Shah, N. H. (2017). Electronic phenotyping with aphrodite and the observational health sciences and informatics (ohdsi) data network. AMIA Summits on Translational Science Proceedings, 2017:48.
Banda et al., (2018) Banda, J. M., Seneviratne, M., Hernandez-Boussard, T., and Shah, N. H. (2018). Advances in electronic phenotyping: from rule-based definitions to machine learning models. Annual Review of Biomedical Data Science, 1:53–68.
Bonhomme et al., (2016) Bonhomme, S., Jochmans, K., Robin, J.-M., et al. (2016). Estimating multivariate latent-structure models. The Annals of Statistics, 44(2):540–563.
Castro et al., (2022) Castro, V. M., Gainer, V., Wattanasin, N., Benoit, B., Cagan, A., Ghosh, B., Goryachev, S., Metta, R., Park, H., Wang, D., et al. (2022). The mass general brigham biobank portal: an i2b2-based data repository linking disparate and high-dimensional patient data to support multimodal analytics. Journal of the American Medical Informatics Association, 29(4):643–651.
Chen, (2007) Chen, X. (2007). Chapter 76 large sample sieve estimation of semi-nonparametric models. volume 6 of Handbook of Econometrics, pages 5549–5632. Elsevier.
Denny et al., (2013) Denny, J. C., Bastarache, L., Ritchie, M. D., Carroll, R. J., Zink, R., Mosley, J. D., Field, J. R., Pulley, J. M., Ramirez, A. H., Bowton, E., et al. (2013). Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nature biotechnology, 31(12):1102–1111.
He et al., (2021) He, Y., Lakhani, C. M., Rasooly, D., Manrai, A. K., Tzoulaki, I., and Patel, C. J. (2021). Comparisons of polyexposure, polygenic, and clinical risk scores in risk prediction of type 2 diabetes. Diabetes Care, 44(4):935–943.
Hong et al., (2019) Hong, C., Liao, K. P., and Cai, T. (2019). Semi-supervised validation of multiple surrogate outcomes with application to electronic medical records phenotyping. Biometrics, 75(1):78–89.
Hou et al., (2023a) Hou, J., Chan, S. F., Wang, X., and Cai, T. (2023a). Risk prediction with imperfect survival outcome information from electronic health records. Biometrics, 79(1):190–202.
Hou et al., (2023b) Hou, J., Guo, Z., and Cai, T. (2023b). Surrogate assisted semi-supervised inference for high dimensional risk prediction. Journal of Machine Learning Research, 24(265):1–58.
Hou et al., (2021) Hou, J., Mukherjee, R., and Cai, T. (2021). Efficient and robust semi-supervised estimation of ate with partially annotated treatment and response. arXiv preprint arXiv:2110.12336.
Huang et al., (2018) Huang, J., Duan, R., Hubbard, R. A., Wu, Y., Moore, J. H., Xu, H., and Chen, Y. (2018). Pie: A prior knowledge guided integrated likelihood estimation method for bias reduction in association studies using electronic health records data. Journal of the American Medical Informatics Association, 25(3):345–352.
Huang and Labrecque, (2019) Huang, J. Y. and Labrecque, J. A. (2019). From gwas to phewas: the search for causality in big data. The Lancet Digital Health, 1(3):e101–e103.
Kallus and Mao, (2020) Kallus, N. and Mao, X. (2020). On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. arXiv preprint arXiv:2003.12408.
Kohane, (2011) Kohane, I. S. (2011). Using electronic health records to drive discovery in disease genomics. Nature Reviews Genetics, 12(6):417–428.
Liao et al., (2015) Liao, K. P., Cai, T., Savova, G. K., Murphy, S. N., Karlson, E. W., Ananthakrishnan, A. N., Gainer, V. S., Shaw, S. Y., Xia, Z., Szolovits, P., et al. (2015). Development of phenotype algorithms using electronic medical records and incorporating natural language processing. bmj, 350:h1885.
Liao et al., (2013) Liao, K. P., Kurreeman, F., Li, G., Duclos, G., Murphy, S., Guzman, R., Cai, T., Gupta, N., Gainer, V., Schur, P., et al. (2013). Associations of autoantibodies, autoimmune risk alleles, and clinical diagnoses from the electronic medical records in rheumatoid arthritis cases and non–rheumatoid arthritis controls. Arthritis & Rheumatology, 65(3):571–581.
Liao et al., (2019) Liao, K. P., Sun, J., Cai, T. A., Link, N., Hong, C., Huang, J., Huffman, J. E., Gronsbell, J., Zhang, Y., Ho, Y.-L., Castro, V., Gainer, V., Murphy, S. N., O’Donnell, C. J., Gaziano, J. M., Cho, K., Szolovits, P., Kohane, I. S., Yu, S., and Cai, Tianxi, w. t. M. V. P. (2019). High-throughput multimodal automated phenotyping (MAP) with application to PheWAS. Journal of the American Medical Informatics Association, 26(11):1255–1262.
Mahajan et al., (2018) Mahajan, A., Taliun, D., Thurner, M., Robertson, N. R., Torres, J. M., Rayner, N. W., Payne, A. J., Steinthorsdottir, V., Scott, R. A., Grarup, N., et al. (2018). Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nature genetics, 50(11):1505–1513.
Murphy and Van der Vaart, (2000) Murphy, S. A. and Van der Vaart, A. W. (2000). On profile likelihood. Journal of the American Statistical Association, 95(450):449–465.
Shivade et al., (2014) Shivade, C., Raghavan, P., Fosler-Lussier, E., Embi, P. J., Elhadad, N., Johnson, S. B., and Lai, A. M. (2014). A review of approaches to identifying patient phenotype cohorts using electronic health records. Journal of the American Medical Informatics Association, 21(2):221–230.
Van der Vaart, (2000) Van der Vaart, A. W. (2000). Asymptotic statistics, volume 3. Cambridge university press.
Verma et al., (2023) Verma, A., Huffman, J. E., Rodriguez, A., Conery, M., Liu, M., Ho, Y.-L., Kim, Y., Heise, D. A., Guare, L., Panickan, V. A., et al. (2023). Diversity and scale: genetic architecture of 2,068 traits in the va million veteran program. medRxiv.
Wells et al., (2019) Wells, Q. S., Gupta, D. K., Smith, J. G., Collins, S. P., Storrow, A. B., Ferguson, J., Smith, M. L., Pulley, J. M., Collier, S., Wang, X., et al. (2019). Accelerating biomarker discovery through electronic health records, automated biobanking, and proteomics. Journal of the American College of Cardiology, 73(17):2195–2205.
Yu et al., (2017) Yu, S., Ma, Y., Gronsbell, J., Cai, T., Ananthakrishnan, A. N., Gainer, V. S., Churchill, S. E., Szolovits, P., Murphy, S. N., Kohane, I. S., et al. (2017). Enabling phenotypic big data with phenorm. Journal of the American Medical Informatics Association, 25(1):54–60.
Yu et al., (2019) Yu, T., Li, P., Qin, J., et al. (2019). Maximum smoothed likelihood component density estimation in mixture models with known mixing proportions. Electronic Journal of Statistics, 13(2):4035–4078.
Zhang et al., (2019a) Zhang, L., Ding, X., Ma, Y., Muthu, N., Ajmal, I., Moore, J. H., Herman, D. S., and Chen, J. (2019a). Electronic health record phenotyping with internally assessable performance (phiap) using anchor-positive and unlabeled patients. arXiv preprint arXiv:1902.10060.
Zhang et al., (2019b) Zhang, Y., Cai, T., Yu, S., Cho, K., Hong, C., Sun, J., Huang, J., Ho, Y.-L., Ananthakrishnan, A. N., Xia, Z., et al. (2019b). High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (phecap). Nature protocols, 14(12):3426–3444.
Zhang et al., (2022) Zhang, Y., Liu, M., Neykov, M., and Cai, T. (2022). Prior adaptive semi-supervised learning with application to EHR phenotyping. The Journal of Machine Learning Research, 23(1):3617–3641.
Zheng and Wu, (2019) Zheng, C. and Wu, Y. (2019). Nonparametric estimation of multivariate mixtures. Journal of the American Statistical Association, pages 1–16.

Appendix

Appendix A Additional implementation details

Algorithm A1 EM algorithm for maximizing the non-parametric log-likelihood function (5).

Input: Observed data $\mathscr{O}=\{{\bf O}_{i}=(Y^{*}_{i}\delta_{i},\delta_{i},{\bf X}_{i},{\bf G}_{i}):i=1,2,\ldots,N\}$, and the phenotyping score ${\widehat{\alpha}}({\bf x})$ derived in Algorithm 1.
Initialize with $\bm{\widetilde{\eta}}_{{\widehat{\alpha}}}^{(0)}=\{\widetilde{\cal S}_{\widehat{\alpha},y}^{(0)}(\cdot),\widetilde{\bm{\lambda}}^{(0)},\widetilde{\bm{\xi}}^{(0)}:y=0,1\}$ introduced in Algorithm A2. Iterate on the following two steps for $r=0,1,\ldots,R$ until convergence.
E-step. For each subject $i$, impute the probability for $Y_{i}$ conditional on $Y^{*}_{i}$ (if observed) or ${\widehat{\alpha}}({\bf X}_{i})$:

	$\displaystyle\widetilde{Y}_{i0}^{(r+1)}=\delta_{i}\times\frac{\widetilde{\lambda}_{1Y_{i}^{*}}^{(r)}g_{1}\{\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\widetilde{\bm{\xi}}^{(r)}\}}{\sum_{y=0}^{1}\widetilde{\lambda}_{yY_{i}^{*}}^{(r)}g_{y}\{\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\widetilde{\bm{\xi}}^{(r)}\}};\quad\widetilde{Y}_{i1}^{(r+1)}=\frac{-\nabla\widetilde{\cal S}^{(r)}_{\widehat{\alpha},1}\{{\widehat{\alpha}}({\bf X}_{i})\}g_{1}\{\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\widetilde{\bm{\xi}}^{(r)}\}}{-\sum_{y=0}^{1}\nabla\widetilde{\cal S}^{(r)}_{\widehat{\alpha},y}\{{\widehat{\alpha}}({\bf X}_{i})\}g_{y}\{\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\widetilde{\bm{\xi}}^{(r)}\}}.$

M-step. Update $\bm{\eta}_{{\widehat{\alpha}}}$ through the MLE specified with the imputed outcomes from the E-step:

	$\displaystyle\widetilde{\lambda}_{yk}^{(r+1)}=\frac{\sum_{i=1}^{n}I(Y^{*}_{i}=k)\{{\widetilde{Y}}_{i0}^{(r+1)}\}^{y}\{1-{\widetilde{Y}}_{i0}^{(r+1)}\}^{1-y}}{\sum_{i=1}^{n}\{{\widetilde{Y}}_{i0}^{(r+1)}\}^{y}\{1-{\widetilde{Y}}_{i0}^{(r+1)}\}^{1-y}},\quad k=0,1,\ldots,K;$
	$\displaystyle\widetilde{\bm{\xi}}^{(r+1)}=\mathop{\mbox{argmax}}_{\bm{\xi}}\sum_{i=1}^{n}\ell\left(\widetilde{Y}_{i0}^{(r+1)},\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\bm{\xi}\right)+\sum_{i=1}^{N}\ell\left(\widetilde{Y}_{i1}^{(r+1)},\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\bm{\xi}\right);$
	$\displaystyle\widetilde{\cal S}^{(r+1)}_{\widehat{\alpha},y}(c)=\frac{\sum_{i=1}^{N}I({\widehat{\alpha}}({\bf X}_{i})>c)\{{\widetilde{Y}}_{i1}^{(r+1)}\}^{y}\{1-{\widetilde{Y}}_{i1}^{(r+1)}\}^{1-y}}{\sum_{i=1}^{N}\{{\widetilde{Y}}_{i1}^{(r+1)}\}^{y}\{1-{\widetilde{Y}}_{i1}^{(r+1)}\}^{1-y}},\quad y=0,1.$

Output: The imputed outcomes ${\widetilde{Y}}_{i0}=\widetilde{Y}_{i0}^{(R)}$ (if $\delta_{i}=1$ ) and ${\widetilde{Y}}_{i1}=\widetilde{Y}_{i1}^{(R)}$ for $i=1,2,\ldots,N$ .
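
The closed-form parts of Algorithm A1 can be illustrated with a minimal Python sketch. It covers only the E-step for labeled subjects (those with $\delta_{i}=1$) and the weighted-frequency M-step updates of $\widetilde{\bm{\lambda}}$ and $\widetilde{\cal S}_{\widehat{\alpha},y}$; the update of $\widetilde{\bm{\xi}}$ and the derivative $\nabla\widetilde{\cal S}$ needed for $\widetilde{Y}_{i1}$ are omitted. The sketch assumes that $g_{1}$ is the logistic function with $g_{0}=1-g_{1}$, which this appendix does not state, and that the levels of $Y^{*}$ are coded as integers $0,\ldots,K$. All function and variable names are hypothetical.

```python
import numpy as np

def expit(u):
    """Logistic function; ASSUMED form of g_1 (g_0 is taken as 1 - g_1)."""
    return 1.0 / (1.0 + np.exp(-u))

def e_step_labeled(y_star, lin_pred, lam):
    """Imputed P(Y=1 | Y*, G) for labeled subjects (delta_i = 1).

    y_star   : integer codes of Y* in {0, ..., K}
    lin_pred : psi(G_i)^T xi for each subject
    lam      : array of shape (2, K+1), lam[y, k] plays the role of lambda_{yk}
    """
    p1 = expit(lin_pred)
    num = lam[1, y_star] * p1
    den = num + lam[0, y_star] * (1.0 - p1)
    return num / den

def m_step_lambda(y_star, post, n_levels):
    """Weighted frequencies of Y* = k, with weights post^y * (1 - post)^(1 - y)."""
    weights = np.stack([1.0 - post, post])   # row y holds the weights for class y
    onehot = np.eye(n_levels)[y_star]        # (n, K+1) indicators I(Y*_i = k)
    return (weights @ onehot) / weights.sum(axis=1, keepdims=True)

def m_step_survival(scores, post, y, c):
    """Weighted empirical survival function of the score at threshold c."""
    w = post if y == 1 else 1.0 - post
    return np.sum(w * (scores > c)) / np.sum(w)

# Toy run with K = 2 and a zero linear predictor (illustrative values only).
y_star = np.array([0, 1, 2, 2, 0])
lam0 = np.array([[0.85, 0.075, 0.075],
                 [0.075, 0.075, 0.85]])
post = e_step_labeled(y_star, np.zeros(5), lam0)
lam1 = m_step_lambda(y_star, post, n_levels=3)
```

Each row of `lam1` sums to one by construction, so the update always returns valid conditional distributions of $Y^{*}$ given $Y=y$.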

Algorithm A2 Initialization of the EM Algorithms.

For Algorithm 1, we define $Y^{\dagger}_{i}=I(Y^{*}_{i}=1)$ for subjects $i=1,2,\ldots,n$ and obtain the initial estimators $\widehat{\bm{\xi}}^{(0)},\widehat{\bm{\zeta}}^{(0)},\widehat{\mu}^{(0)}$ through MLE:

	$\displaystyle\widehat{\mu}^{(0)}=\frac{1}{n}\sum_{i=1}^{n}Y^{\dagger}_{i};\quad\widehat{\bm{\xi}}^{(0)}=\mathop{\mbox{argmax}}_{\bm{\xi}}\sum_{i=1}^{n}\ell\left(Y^{\dagger}_{i},\bm{\psi}^{{\sf\scriptscriptstyle{T}}}({\bf G}_{i})\bm{\xi}\right);\quad\widehat{\bm{\zeta}}_{j}^{(0)}=\mathop{\mbox{argmax}}_{\bm{\zeta}_{j}}\sum_{i=1}^{n}\ell\left(Y^{\dagger}_{i},\bm{\varphi}^{{\sf\scriptscriptstyle{T}}}_{j}(X_{ij})\bm{\zeta}_{j}\right).$

For $\widehat{\bm{\lambda}}^{(0)}$, we set $\widehat{\lambda}_{1K}^{(0)}=0.85$ and $\widehat{\lambda}_{1k}^{(0)}=0.15/K$ for $k=0,1,\ldots,K-1$, and $\widehat{\lambda}_{00}^{(0)}=0.85$ and $\widehat{\lambda}_{0k}^{(0)}=0.15/K$ for $k=1,\ldots,K$, in the belief that $Y^{*}$ is reliable.
For Algorithm A1, we set $\widetilde{\bm{\lambda}}^{(0)}=\widehat{\bm{\lambda}}$ and $\widetilde{\bm{\xi}}^{(0)}=\widehat{\bm{\xi}}$ based on the results in Algorithm 1, and take

	$\displaystyle\widetilde{\cal S}_{\widehat{\alpha},y}^{(0)}(c)=\frac{\sum_{i=1}^{n}I({\widehat{\alpha}}({\bf X}_{i})>c)I(Y^{\dagger}_{i}=y)}{\sum_{i=1}^{n}I(Y^{\dagger}_{i}=y)},\quad y=0,1.$
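
As a rough illustration of the two closed-form initial values above, the following Python sketch builds the initial misclassification matrix (0.85 on the matching extreme level, the remaining 0.15 spread evenly over the other $K$ levels) and the class-specific empirical survival function of the phenotyping score among labeled subjects. The function names, the integer coding of the $Y^{*}$ levels, and the toy inputs are assumptions made for illustration; the MLE-based initial values of $\bm{\xi}$ and $\bm{\zeta}_{j}$ are not covered.

```python
import numpy as np

def init_lambda(K, reliable=0.85):
    """Initial (2, K+1) matrix: row y is the starting distribution of Y* given Y = y."""
    lam = np.full((2, K + 1), (1.0 - reliable) / K)
    lam[0, 0] = reliable   # Y = 0 is mostly labeled at level 0
    lam[1, K] = reliable   # Y = 1 is mostly labeled at level K
    return lam

def init_survival(scores, y_dagger, y):
    """Empirical survival function of the score among labeled subjects with Y-dagger = y."""
    class_scores = np.sort(scores[y_dagger == y])

    def survival(c):
        # searchsorted(..., side="right") counts the scores <= c,
        # so one minus that proportion is the fraction strictly above c.
        n_leq = np.searchsorted(class_scores, c, side="right")
        return 1.0 - n_leq / class_scores.size

    return survival

# Toy inputs (illustrative values only).
lam0 = init_lambda(K=2)
scores = np.array([0.1, 0.4, 0.6, 0.9])
y_dagger = np.array([0, 0, 1, 1])
S1 = init_survival(scores, y_dagger, y=1)
S0 = init_survival(scores, y_dagger, y=0)
```

Returning a closure keeps the sorted class scores in memory, so the survival function can be evaluated at many thresholds without re-sorting.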

Appendix B Additional numerical results

This section reports more complete simulation results, supplementing the main results presented in Section 4.

Table A1: Biases of parameter estimates over 500 simulations for the regression parameters for genetic effects ($\bm{\beta}$), the area under the curve (AUC) for the classification algorithm, and the errors and/or uncertainties in labels ($\bm{\lambda}$) for settings (a) with linear genetic effects, (b) with nonlinear genetic effects, and (c) with nonlinear genetic effects and slight violation of conditional independence between ${\bf G}$ and ${\bf X}$.

Method	$\beta_{0}$ =-4.600	$\beta_{1}$ = 1.600	$\beta_{2}$ = 1.600	$\beta_{3}$ = 1.600	$\beta_{4}$ = 1.600	AUC=0.702	$\lambda_{1}(0)$ =0.320	$\lambda_{1}(0.5)$ =0.490	$\lambda_{1}(1)$ =0.190	$\lambda_{0}(0)$ =0.700	$\lambda_{0}(0.5)$ =0.280	$\lambda_{0}(1)$ =0.030
(a)
Naive-Logistic₁₀₀	2.965	-1.354	-1.351	-1.343	-1.328	-0.088	-	-	-	-	-	-
Hong et al₁₀₀	-0.050	0.013	0.006	0.020	0.014	0.001	0.004	0.000	-0.004	-0.003	-0.002	0.005
TUBE₁₀₀	-0.145	0.040	0.036	0.049	0.044	-0.001	0.001	0.002	-0.003	0.002	-0.005	0.003
Naive-Logistic₅₀₀	3.024	-1.358	-1.348	-1.357	-1.344	-0.103	-	-	-	-	-	-
Hong et al₅₀₀	-0.011	0.004	0.000	0.007	0.004	0.002	-0.001	0.001	0.001	0.001	-0.001	0.000
TUBE₅₀₀	-0.089	0.029	0.025	0.029	0.031	0.000	-0.004	0.002	0.002	0.007	-0.003	-0.003
Naive-Logistic₁₀₀₀	3.019	-1.360	-1.346	-1.347	-1.349	-0.104	-	-	-	-	-	-
Hong et al₁₀₀₀	-0.010	0.000	-0.002	0.005	0.000	0.002	-0.003	0.001	0.002	0.003	-0.002	-0.001
TUBE₁₀₀₀	-0.073	0.020	0.020	0.026	0.022	0.000	-0.006	0.003	0.003	0.008	-0.005	-0.004

Method	$\beta_{0}$ = 1.300	$\beta_{1}$ = 0.700	$\beta_{2}$ =-0.700	$\beta_{3}$ =-0.700	$\beta_{4}$ =-0.700	AUC=0.702	$\lambda_{1}(0)$ =0.320	$\lambda_{1}(0.5)$ =0.490	$\lambda_{1}(1)$ =0.190	$\lambda_{0}(0)$ =0.700	$\lambda_{0}(0.5)$ =0.280	$\lambda_{0}(1)$ =0.030
(b)
Naive-Logistic₁₀₀	-1.879	-0.514	0.472	0.497	0.496	-0.080	-	-	-	-	-	-
Hong et al₁₀₀	-1.255	3.467	-0.642	-0.646	-0.630	-0.027	0.016	-0.011	-0.005	-0.056	0.031	0.025
TUBE₁₀₀	0.020	0.016	-0.024	-0.009	-0.011	-0.003	-0.001	0.003	-0.002	0.003	-0.004	0.001
Naive-Logistic₅₀₀	-1.853	-0.507	0.495	0.490	0.500	-0.097	-	-	-	-	-	-
Hong et al₅₀₀	-1.272	3.513	-0.648	-0.654	-0.644	-0.028	0.010	-0.006	-0.004	-0.059	0.033	0.026
TUBE₅₀₀	0.011	0.013	-0.017	-0.007	-0.004	-0.001	-0.002	0.000	0.002	0.000	0.000	0.000
Naive-Logistic₁₀₀₀	-1.850	-0.509	0.495	0.493	0.500	-0.097	-	-	-	-	-	-
Hong et al₁₀₀₀	-1.281	3.524	-0.650	-0.652	-0.643	-0.028	0.008	-0.002	-0.005	-0.060	0.033	0.027
TUBE₁₀₀₀	0.004	0.008	-0.012	-0.003	-0.002	-0.001	-0.007	0.004	0.003	0.001	-0.001	0.000

Method	$\beta_{0}$ = 1.300	$\beta_{1}$ =-0.300	$\beta_{2}$ =-0.700	$\beta_{3}$ =-0.700	$\beta_{4}$ =-0.800	AUC=0.702	$\lambda_{1}(0)$ =0.320	$\lambda_{1}(0.5)$ =0.490	$\lambda_{1}(1)$ =0.190	$\lambda_{0}(0)$ =0.700	$\lambda_{0}(0.5)$ =0.280	$\lambda_{0}(1)$ =0.030
(c)
Naive-Logistic₁₀₀	-1.925	0.239	0.570	0.550	0.567	-0.083	-	-	-	-	-	-
Hong et al₁₀₀	-0.228	-1.200	-0.380	-0.408	-0.393	-0.025	0.037	-0.038	0.000	-0.039	0.026	0.012
TUBE₁₀₀	0.090	0.046	-0.021	-0.032	-0.029	-0.009	0.017	-0.007	-0.010	-0.007	0.004	0.002
Naive-Logistic₅₀₀	-1.887	0.227	0.564	0.562	0.563	-0.100	-	-	-	-	-	-
Hong et al₅₀₀	-0.366	-1.337	-0.386	-0.391	-0.399	-0.024	0.012	-0.010	-0.002	-0.031	0.017	0.014
TUBE₅₀₀	0.065	0.044	-0.013	-0.023	-0.019	-0.003	-0.008	0.003	0.005	0.005	-0.003	-0.001
Naive-Logistic₁₀₀₀	-1.887	0.226	0.557	0.568	0.571	-0.102	-	-	-	-	-	-
Hong et al₁₀₀₀	-0.340	-1.476	-0.452	-0.446	-0.456	-0.025	0.012	-0.009	-0.003	-0.035	0.020	0.015
TUBE₁₀₀₀	0.060	0.037	-0.017	-0.019	-0.014	-0.003	-0.003	-0.001	0.004	0.002	-0.001	-0.001

Table A2: Mean square errors (MSE) of parameter estimates over 500 simulations for the regression parameters for genetic effects ($\bm{\beta}$), the area under the curve (AUC) for the classification algorithm, and the errors and/or uncertainties in labels ($\bm{\lambda}$) for settings (a) with linear genetic effects, (b) with nonlinear genetic effects, and (c) with nonlinear genetic effects and slight violation of conditional independence between ${\bf G}$ and ${\bf X}$.

Method	$\beta_{0}$ =-4.600	$\beta_{1}$ = 1.600	$\beta_{2}$ = 1.600	$\beta_{3}$ = 1.600	$\beta_{4}$ = 1.600	AUC=0.702	$\lambda_{1}(0)$ =0.320	$\lambda_{1}(0.5)$ =0.490	$\lambda_{1}(1)$ =0.190	$\lambda_{0}(0)$ =0.700	$\lambda_{0}(0.5)$ =0.280	$\lambda_{0}(1)$ =0.030
(a)
Naive-Logistic₁₀₀	9.092	1.863	1.885	1.870	1.826	0.009	-	-	-	-	-	-
Hong et al₁₀₀	0.678	0.070	0.075	0.085	0.072	0.000	0.005	0.005	0.003	0.012	0.011	0.002
TUBE₁₀₀	0.780	0.078	0.090	0.098	0.087	0.001	0.005	0.005	0.003	0.013	0.011	0.002
Naive-Logistic₅₀₀	9.194	1.849	1.830	1.851	1.815	0.011	-	-	-	-	-	-
Hong et al₅₀₀	0.620	0.064	0.070	0.077	0.065	0.000	0.001	0.001	0.001	0.003	0.002	0.000
TUBE₅₀₀	0.670	0.068	0.077	0.082	0.073	0.000	0.001	0.001	0.001	0.003	0.002	0.001
Naive-Logistic₁₀₀₀	9.137	1.852	1.816	1.819	1.825	0.011	-	-	-	-	-	-
Hong et al₁₀₀₀	0.604	0.060	0.065	0.076	0.066	0.000	0.001	0.001	0.000	0.001	0.001	0.000
TUBE₁₀₀₀	0.660	0.064	0.072	0.079	0.076	0.000	0.001	0.001	0.000	0.002	0.001	0.000

Method	$\beta_{0}$ = 1.300	$\beta_{1}$ = 0.700	$\beta_{2}$ =-0.700	$\beta_{3}$ =-0.700	$\beta_{4}$ =-0.700	AUC=0.702	$\lambda_{1}(0)$ =0.320	$\lambda_{1}(0.5)$ =0.490	$\lambda_{1}(1)$ =0.190	$\lambda_{0}(0)$ =0.700	$\lambda_{0}(0.5)$ =0.280	$\lambda_{0}(1)$ =0.030
(b)
Naive-Logistic₁₀₀	3.896	0.311	0.302	0.320	0.321	0.008	-	-	-	-	-	-
Hong et al₁₀₀	2.145	13.794	0.564	0.615	0.557	0.001	0.023	0.024	0.014	0.006	0.004	0.001
TUBE₁₀₀	0.081	0.014	0.015	0.016	0.015	0.001	0.017	0.019	0.009	0.005	0.005	0.001
Naive-Logistic₅₀₀	3.489	0.265	0.259	0.253	0.262	0.010	-	-	-	-	-	-
Hong et al₅₀₀	2.191	13.569	0.533	0.574	0.548	0.001	0.004	0.004	0.002	0.004	0.002	0.001
TUBE₅₀₀	0.075	0.013	0.015	0.014	0.014	0.000	0.004	0.004	0.002	0.001	0.001	0.000
Naive-Logistic₁₀₀₀	3.451	0.264	0.251	0.249	0.256	0.010	-	-	-	-	-	-
Hong et al₁₀₀₀	2.176	13.639	0.539	0.567	0.541	0.001	0.002	0.002	0.001	0.004	0.001	0.001
TUBE₁₀₀₀	0.065	0.011	0.013	0.012	0.012	0.000	0.002	0.002	0.001	0.001	0.000	0.000

Method	$\beta_{0}$ = 1.300	$\beta_{1}$ =-0.300	$\beta_{2}$ =-0.700	$\beta_{3}$ =-0.700	$\beta_{4}$ =-0.800	AUC=0.702	$\lambda_{1}(0)$ =0.320	$\lambda_{1}(0.5)$ =0.490	$\lambda_{1}(1)$ =0.190	$\lambda_{0}(0)$ =0.700	$\lambda_{0}(0.5)$ =0.280	$\lambda_{0}(1)$ =0.030
(c)
Naive-Logistic₁₀₀	4.043	0.105	0.398	0.378	0.390	-	-	-	-	-	-
Hong et al₁₀₀	2.667	4.855	0.360	0.536	0.440	0.001	0.050	0.052	0.022	0.007	0.005	0.001
TUBE₁₀₀	0.135	0.016	0.023	0.023	0.025	0.003	0.028	0.029	0.011	0.005	0.005	0.001
Naive-Logistic₅₀₀	3.629	0.060	0.332	0.330	0.331	0.010	-	-	-	-	-	-
Hong et al₅₀₀	2.297	4.564	0.344	0.335	0.359	0.001	0.012	0.010	0.004	0.003	0.001	0.001
TUBE₅₀₀	0.130	0.013	0.022	0.021	0.023	0.000	0.006	0.006	0.003	0.001	0.001	0.000
Naive-Logistic₁₀₀₀	3.589	0.055	0.317	0.328	0.333	0.011	-	-	-	-	-	-
Hong et al₁₀₀₀	4.793	7.153	0.912	1.001	1.280	0.001	0.008	0.006	0.003	0.003	0.001	0.001
TUBE₁₀₀₀	0.113	0.012	0.019	0.019	0.022	0.000	0.003	0.003	0.002	0.001	0.001	0.000

Table A3: Coverage probabilities (CP) at the 95% nominal level of parameter estimates over 500 simulations for the regression parameters for genetic effects ($\bm{\beta}$), the area under the curve (AUC) for the classification algorithm, and the errors and/or uncertainties in labels ($\bm{\lambda}$) for settings (a) with linear genetic effects, (b) with nonlinear genetic effects, and (c) with nonlinear genetic effects and slight violation of conditional independence between ${\bf G}$ and ${\bf X}$.

Method	$\beta_{0}$ =-4.600	$\beta_{1}$ = 1.600	$\beta_{2}$ = 1.600	$\beta_{3}$ = 1.600	$\beta_{4}$ = 1.600	AUC=0.702	$\lambda_{1}(0)$ =0.320	$\lambda_{1}(0.5)$ =0.490	$\lambda_{1}(1)$ =0.190	$\lambda_{0}(0)$ =0.700	$\lambda_{0}(0.5)$ =0.280	$\lambda_{0}(1)$ =0.030
(a)
Naive-Logistic₁₀₀	0.002	0.000	0.000	0.000	0.002	0.398	-	-	-	-	-	-
Hong et al₁₀₀	0.946	0.954	0.958	0.948	0.950	0.946	0.942	0.954	0.952	0.956	0.960	0.934
TUBE₁₀₀	0.940	0.944	0.944	0.940	0.938	0.998	0.952	0.958	0.950	0.954	0.958	0.936
Naive-Logistic₅₀₀	0.000	0.000	0.000	0.000	0.000	0.000	-	-	-	-	-	-
Hong et al₅₀₀	0.946	0.956	0.952	0.950	0.952	0.948	0.940	0.946	0.952	0.954	0.946	0.960
TUBE₅₀₀	0.942	0.946	0.946	0.948	0.954	0.954	0.948	0.948	0.944	0.948	0.946	0.972
Naive-Logistic₁₀₀₀	0.000	0.000	0.000	0.000	0.000	0.000	-	-	-	-	-	-
Hong et al₁₀₀₀	0.948	0.950	0.946	0.956	0.950	0.954	0.938	0.950	0.948	0.944	0.952	0.968
TUBE₁₀₀₀	0.954	0.952	0.938	0.954	0.940	0.954	0.938	0.950	0.938	0.942	0.952	0.978

Method	$\beta_{0}$ = 1.300	$\beta_{1}$ = 0.700	$\beta_{2}$ =-0.700	$\beta_{3}$ =-0.700	$\beta_{4}$ =-0.700	AUC=0.702	$\lambda_{1}(0)$ =0.320	$\lambda_{1}(0.5)$ =0.490	$\lambda_{1}(1)$ =0.190	$\lambda_{0}(0)$ =0.700	$\lambda_{0}(0.5)$ =0.280	$\lambda_{0}(1)$ =0.030
(b)
Naive-Logistic₁₀₀	0.097	0.333	0.628	0.554	0.547	0.574	-	-	-	-	-	-
Hong et al₁₀₀	0.634	0.261	0.675	0.752	0.697	0.602	0.952	0.956	0.956	0.828	0.903	0.871
TUBE₁₀₀	0.947	0.956	0.929	0.945	0.958	0.992	0.952	0.954	0.941	0.949	0.952	0.947
Naive-Logistic₅₀₀	0.000	0.000	0.008	0.004	0.002	0.000	-	-	-	-	-	-
Hong et al₅₀₀	0.640	0.095	0.554	0.628	0.628	0.554	0.954	0.966	0.956	0.341	0.729	0.408
TUBE₅₀₀	0.958	0.952	0.952	0.958	0.964	0.941	0.954	0.956	0.943	0.943	0.947	0.927
Naive-Logistic₁₀₀₀	0.000	0.000	0.000	0.000	0.000	0.000	-	-	-	-	-	-
Hong et al₁₀₀₀	0.604	0.083	0.566	0.636	0.618	0.543	0.943	0.941	0.947	0.083	0.475	0.121
TUBE₁₀₀₀	0.954	0.960	0.949	0.956	0.958	0.939	0.954	0.943	0.943	0.947	0.945	0.947

Method	$\beta_{0}$ = 1.300	$\beta_{1}$ =-0.300	$\beta_{2}$ =-0.700	$\beta_{3}$ =-0.700	$\beta_{4}$ =-0.800	AUC=0.702	$\lambda_{1}(0)$ =0.320	$\lambda_{1}(0.5)$ =0.490	$\lambda_{1}(1)$ =0.190	$\lambda_{0}(0)$ =0.700	$\lambda_{0}(0.5)$ =0.280	$\lambda_{0}(1)$ =0.030
(c)
Naive-Logistic₁₀₀	0.079	0.797	0.436	0.482	0.428	0.522	-	-	-	-	-	-
Hong et al₁₀₀	0.956	0.937	0.896	0.954	0.927	0.858	0.956	0.927	0.958	0.925	0.929	0.937
TUBE₁₀₀	0.939	0.935	0.960	0.948	0.935	0.985	0.971	0.952	0.969	0.956	0.954	0.952
Naive-Logistic₅₀₀	0.000	0.290	0.002	0.006	0.006	0.002	-	-	-	-	-	-
Hong et al₅₀₀	0.933	0.881	0.862	0.868	0.864	0.839	0.933	0.952	0.952	0.931	0.944	0.937
TUBE₅₀₀	0.950	0.929	0.950	0.939	0.942	0.942	0.946	0.952	0.950	0.946	0.944	0.927
Naive-Logistic₁₀₀₀	0.000	0.077	0.000	0.000	0.000	0.000	-	-	-	-	-	-
Hong et al₁₀₀₀	0.985	0.946	0.979	0.987	0.990	0.843	0.939	0.946	0.948	0.879	0.912	0.894
TUBE₁₀₀₀	0.942	0.937	0.946	0.946	0.948	0.946	0.952	0.958	0.933	0.958	0.954	0.939
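
For readers who wish to reproduce summaries of the kind reported in Tables A1 to A3, the sketch below shows one common way to compute the bias, MSE, and coverage probability of a scalar estimator from simulation replicates. It assumes Wald-type intervals (estimate plus or minus 1.96 standard errors); the interval construction actually used for the tables is not described in this appendix, so this is an illustrative assumption, as are the function name and the toy numbers.

```python
import numpy as np

def summarize_replicates(estimates, std_errors, truth, z=1.959964):
    """Bias, mean squared error, and coverage over simulation replicates.

    estimates  : point estimates, one per replicate
    std_errors : estimated standard errors, one per replicate
    truth      : true parameter value used to generate the data
    z          : normal quantile for the ASSUMED Wald-type 95% interval
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    bias = estimates.mean() - truth
    mse = np.mean((estimates - truth) ** 2)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    coverage = np.mean((lower <= truth) & (truth <= upper))
    return bias, mse, coverage

# Toy replicates (made-up numbers, not taken from the tables).
bias, mse, cp = summarize_replicates(
    estimates=[1.0, 1.2, 0.8, 2.0],
    std_errors=[0.1, 0.1, 0.1, 0.1],
    truth=1.0,
)
```

In this toy run only the first replicate's interval contains the true value, so the coverage is one in four.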
