A Semiparametric Approach for Robust and Efficient Learning with Biobank Data

Molei Liu, Department of Biostatistics, Columbia Mailman School of Public Health
Xinyi Wang, Department of Statistics, University of Chicago
Chuan Hong, Department of Biostatistics and Bioinformatics, Duke University

The first two authors made equal contributions to this paper.

Abstract

With the increasing availability of electronic health records (EHR) linked with biobank data for translational research, a critical step in realizing their potential is to accurately classify phenotypes for patients. Existing approaches to achieve this goal are based on error-prone EHR surrogate outcomes, assisted and validated by a small set of labels obtained via medical chart review, which may also be subject to misclassification. Ignoring the noise in these outcomes can induce severe estimation and validation bias in both EHR phenotyping and risk modeling with biomarkers collected in the biobank. To overcome this challenge, we propose a novel unsupervised and semiparametric approach to jointly model multiple noisy EHR outcomes with their linked biobank features. Our approach primarily aims at disease risk modeling with the baseline biomarkers, and is also able to produce a predictive EHR phenotyping model and validate its performance without observations of the true disease outcome. It consists of composite and nonparametric regression steps free of any parametric model specification, followed by a parametric projection step to reduce the uncertainty and improve the estimation efficiency. We show that our method is robust to violations of the parametric assumptions while attaining the desirable root-$n$ convergence rates on risk modeling. Our developed method outperforms existing methods in extensive simulation studies, as well as a real-world application in phenotyping and genetic risk modeling of type II diabetes.

Keywords: EHR linked biobank data; Surrogates; Measurement errors; Biomarker; Model misspecification; Under-smoothing.

1 Introduction

1.1 Background

With the increasing adoption of electronic health record (EHR) systems in the United States, EHR data are increasingly accessible for research. By linking EHR data with biorepositories, powerful phenome-genome studies can be performed on such large-scale data for discovery and translational research (Kohane, 2011; Denny et al., 2013; Wells et al., 2019). To fully realize the potential of EHR data, a critical step involves accurately and efficiently classifying phenotype status for individual patients to enable association studies and risk modeling. Although simple rule-based classification algorithms leveraging domain knowledge remain useful, they have varying degrees of accuracy and portability (Zhang et al., 2019b). Conversely, data-driven machine learning based classification algorithms have been advocated as a useful alternative with higher accuracy and portability (Shivade et al., 2014; Liao et al., 2015; Banda et al., 2018). Typically, these algorithms undergo training and/or validation with gold standard labels curated via medical chart review. Subsequently, the predicted phenotypes for all patients in the cohort serve as the observed outcomes for downstream association studies (e.g., Liao et al., 2013, 2019).

Historically, most existing phenotyping algorithms have relied on supervised methods, which suffer from scalability issues due to the labor-intensive nature of manually reviewing charts to obtain gold standard labels for the phenotype of interest. In recent years, several unsupervised methods that leverage unlabeled data by using surrogate features as noisy labels (Yu et al., 2017; Banda et al., 2017; Liao et al., 2019) were proposed as promising alternatives. However, these methods can lead to poor accuracy when the surrogate features have limited accuracy, and they do not provide reliable estimates of the classification performance of the trained models.

1.2 Problem setup

Let $Y$ denote the unobserved true binary phenotype status and $\mathbf{G}=(G_1,\ldots,G_q)^{\mathsf{T}}$ be its associated baseline characteristics and genetic markers from the EHR linked biobank, which could be either multi-dimensional single nucleotide polymorphisms (SNPs) or a genetic risk score derived by weighting a number of SNPs. We simultaneously consider two types of error-prone outcomes or surrogates for $Y$ in our setup. First, suppose there are $p$-dimensional EHR surrogate features $\mathbf{X}=(X_1,\ldots,X_p)^{\mathsf{T}}$ such as counts of billing codes related to $Y$ and key laboratory results. Second, let $Y^*$ be the chart review label from experts, taking values of $k/K$ for $k\in\{0,1,\ldots,K\}$, to represent different levels of certainty regarding whether the patient has the condition $Y$. In practice, $K$ is often taken as $2$ with $Y^*\in\{0,0.5,1\}$ representing not a case, a possible case, and a case.

Importantly, we assume the error-prone outcomes $Y^*$ and $\mathbf{X}$ only relate to genetic markers $\mathbf{G}$ through $Y$, i.e., $(Y^*,\mathbf{X})\perp\!\!\!\perp\mathbf{G}\mid Y$. An illustration of this assumption is provided in the directed acyclic graph (DAG) of Figure 1. In this DAG, the baseline biomarkers $\mathbf{G}$ first occur to affect the chance of developing the disease $Y$, then $Y$ causes the downstream hospital visits producing features $\mathbf{X}$ and $\mathbf{U}$ in EHR, where $\mathbf{U}$ may encode unstructured information such as images and narrative clinical notes. Though $\mathbf{U}$ is not directly included as an outcome in our setup, it can affect the medical review result $Y^*$ together with the observed and structured $\mathbf{X}$.

Figure 1: An illustrative directed acyclic graph (DAG) of the data generating mechanism.
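
To make the conditional independence assumption concrete, the following Python sketch simulates toy data along the path $\mathbf{G}\to Y\to\mathbf{X}$ and checks numerically that a surrogate is associated with the biomarker marginally but not within levels of $Y$. The coefficients, the Gaussian surrogate, and the function name `simulate_dag` are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np

def simulate_dag(n_subjects, rng):
    """Simulate G -> Y -> X with one biomarker and one surrogate (toy values)."""
    g = rng.normal(size=n_subjects)                  # baseline biomarker G
    p_y = 1.0 / (1.0 + np.exp(-(-1.0 + 1.0 * g)))    # Pr(Y = 1 | G), assumed for illustration
    y = rng.binomial(1, p_y)                         # latent true phenotype Y
    x = 2.0 * y + rng.normal(size=n_subjects)        # surrogate X depends on G only through Y
    return g, y, x

rng = np.random.default_rng(2024)
g, y, x = simulate_dag(50_000, rng)

# Marginal association versus association within each level of Y
marginal_corr = np.corrcoef(x, g)[0, 1]
within_corr = {k: np.corrcoef(x[y == k], g[y == k])[0, 1] for k in (0, 1)}
print(f"corr(X, G)         = {marginal_corr:.3f}")
print(f"corr(X, G | Y = 0) = {within_corr[0]:.3f}")
print(f"corr(X, G | Y = 1) = {within_corr[1]:.3f}")
```

In real data $Y$ is never observed, so this check is only possible in simulation; it is meant to show what the assumption rules out, not how to test it.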

Suppose there are $N$ patients with independent and identically distributed copies of the complete set of variables $\mathbf{D}=(Y,Y^*,\mathbf{X},\mathbf{G})$ described above, denoted as $\mathscr{D}=\{\mathbf{D}_i:i=1,2,\ldots,N\}$. Since the label $Y^*$ is derived based on expertise and additional information like $\mathbf{U}$, it is usually more accurate than $\mathbf{X}$ in characterizing the true $Y$. However, it may still have a moderate measurement error due to incomplete information collection for medical review or the complexity and ambiguity of certain phenotypes. Thus, we assume that $Y^*$ is only observable in a small set of $n$ subjects indexed by $\delta=1$, and, to account for the error of $Y^*$, it is marginally related to $Y$ through

$$\Pr(Y^*=k/K\mid Y=y)=\lambda_{yk},\quad\text{for}~k=0,\ldots,K,~y=0,1;\qquad\boldsymbol{\lambda}_y=\{\lambda_{y1},\ldots,\lambda_{yK}\}. \tag{1}$$
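
As a small illustration of the misclassification model (1), the Python sketch below draws chart review labels $Y^*$ from a latent binary $Y$ for $K=2$. The matrix `lam`, the prevalence 0.3, and the helper name `draw_chart_labels` are hypothetical; for convenience the matrix stores all $K+1$ columns (including $k=0$), so each row sums to one.

```python
import numpy as np

def draw_chart_labels(y, lam, rng):
    """Draw Y* in {0, 1/K, ..., 1} with lam[y, k] = Pr(Y* = k/K | Y = y)."""
    lam = np.asarray(lam, dtype=float)
    n_levels = lam.shape[1]                  # K + 1 certainty levels
    k = np.array([rng.choice(n_levels, p=lam[yi]) for yi in y])
    return k / (n_levels - 1)

# Hypothetical misclassification matrix for K = 2: rows are y = 0, 1;
# columns are Y* = 0 (not a case), 0.5 (possible case), 1 (case).
lam = np.array([[0.80, 0.15, 0.05],
                [0.05, 0.15, 0.80]])

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=5000)          # latent truth, never observed in practice
y_star = draw_chart_labels(y, lam, rng)

# Empirical Pr(Y* = 1 | Y = 1), which should be close to lam[1, 2]
print(np.mean(y_star[y == 1] == 1.0))
```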

Also, note that $\mathbf{X}$ and $\mathbf{G}$ are observed for all patients and the true outcome $Y$ is not observed for any patient. So the observed data are formed as $\mathscr{O}=\{\mathbf{O}_i=(Y^*_i\delta_i,\delta_i,\mathbf{X}_i,\mathbf{G}_i):i=1,2,\ldots,N\}$, with the labeling indicator $\delta\perp\mathbf{D}$, i.e., labeling is completely at random. Without loss of generality, we let $\delta_i=I(1\leq i\leq n)$ where $n<N$ is the size of the labeled sample and $I(\cdot)$ is the indicator function. Our primary goal is to derive a risk model of $Y$ against $\mathbf{G}$ as well as inference on its encoded genetic associations. Since the genetic effects are usually moderate or small, it is more favorable to model $Y\sim\mathbf{G}$ with a simple parametric form to ensure good interpretability and control the estimation uncertainty. Specifically, we consider a working logistic model:

$$\Pr(Y=1\mid\mathbf{G})=g(\boldsymbol{\beta}^{\mathsf{T}}\mathbf{G}), \tag{2}$$

where $g(x)=e^{x}/(1+e^{x})$ is the expit link. Note that model (2) is allowed to be misspecified, and we define the target model parameter as $\bar{\boldsymbol{\beta}}=\operatorname{argmax}_{\boldsymbol{\beta}}\mathbb{E}\,\ell(Y,\boldsymbol{\beta}^{\mathsf{T}}\mathbf{G})$ where $\ell(y,w)=y\log\{g(w)\}+(1-y)\log\{1-g(w)\}$ is the log-likelihood function of logistic regression. Though (2) may be misspecified, such $\bar{\boldsymbol{\beta}}$ is still identifiable and effective in characterizing the genetic associations. Our secondary goal is EHR phenotyping for the unobserved $Y$ using $\mathbf{X}$ by deriving a risk score $\alpha(\mathbf{X})$, as well as validating its classification performance. Due to the absence of $Y$ in our observation, all above-introduced tasks are unsupervised and, thus, more challenging than the standard supervised or recent semi-supervised scenarios reviewed in Section 1.3.
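
The definition of $\bar{\boldsymbol{\beta}}$ as a maximizer of the expected logistic log-likelihood can be illustrated with a short Python sketch. It fits the working model (2) to simulated data whose true risk is deliberately not logistic-linear, so the fitted vector approximates the projection target rather than any "true" coefficient. The simulated $Y$ is used purely for illustration (it is unobserved in the paper's setting), and the data-generating values and the function names `loglik` and `fit_logistic` are assumptions of this sketch.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def loglik(y, w):
    """l(y, w) = y log g(w) + (1 - y) log{1 - g(w)} = y w - log(1 + e^w)."""
    return y * w - np.logaddexp(0.0, w)

def fit_logistic(G, y, n_iter=25):
    """Maximize the sum of l(y_i, G_i' beta) over beta by Newton-Raphson."""
    beta = np.zeros(G.shape[1])
    for _ in range(n_iter):
        p = expit(G @ beta)
        score = G.T @ (y - p)                          # gradient of the log-likelihood
        info = (G * (p * (1.0 - p))[:, None]).T @ G    # negative Hessian
        beta = beta + np.linalg.solve(info, score)
    return beta

rng = np.random.default_rng(1)
n = 20_000
g = rng.normal(size=n)
G = np.column_stack([np.ones(n), g])                   # intercept plus one biomarker
# The true risk includes a quadratic term, so the working model (2) is misspecified.
y = rng.binomial(1, expit(-1.0 + 0.5 * g + 0.3 * g**2))

beta_bar_hat = fit_logistic(G, y)
print(beta_bar_hat)
print(loglik(y, G @ beta_bar_hat).mean())
```

The function `loglik` also accepts fractional `y`, which is convenient when the outcome is replaced by an imputed probability.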

Remark 1

Though both $Y^*$ and $\mathbf{X}$ are surrogates of the truth $Y$ with errors, we still denote and consider them separately for several reasons. First, $Y^*$ is not accessible for a (large) fraction of subjects, so the phenotyping score of $Y$ can only include the fully observed $\mathbf{X}$ as predictors and is formulated as $\alpha(\mathbf{X})$. Second, although $Y^*$ is neither perfect nor scalable, it is supposed to be more accurate and informative than $\mathbf{X}$. Thus, as will be discussed in Sections 2 and 3, $Y^*$ is important under our framework for stable training and efficient estimation, especially when $\mathbf{X}$ is of poor quality in characterizing $Y$.

1.3 Related literature and our contribution

Surrogate outcomes play an important role in data-driven biomedical research, particularly when obtaining the primary or true outcome of interest is costly or even impossible, e.g., demanding extensive human labor or long periods of follow-up. There is a rich literature on both semi-supervised and unsupervised statistical learning with surrogates. For example, Athey et al. (2019) leveraged surrogates collected in observational studies to assist learning with experimental studies in paucity of the gold standard labels. Kallus and Mao (2020) and Hou et al. (2021) studied how to utilize surrogates to improve the efficiency of causal inference without incurring bias. Hou et al. (2023a) developed a semiparametric transformation approach to incorporate time-to-event surrogates and improve the learning efficiency with the true outcomes.

The aforementioned literature considered a semi-supervised setting with a small sample of the true outcome $Y$. In contrast, our problem setup does not involve any observation of $Y$. For such an unsupervised setting, Huang et al. (2018) and Hong et al. (2019) proposed maximum likelihood approaches based on parametric assumptions on the conditional model of $Y$, which enable the identification and estimation of the model coefficients. Zhang et al. (2019a) developed a method for unsupervised learning and phenotype validation with anchor-positive surrogate outcomes in EHR. All these recent methods largely rely on parametric model assumptions like (2), which is only a working assumption in our setup. Its misspecification could lead to biased estimation for the target parameter $\bar{\boldsymbol{\beta}}=\operatorname{argmax}_{\boldsymbol{\beta}}\mathbb{E}\,\ell(Y,\boldsymbol{\beta}^{\mathsf{T}}\mathbf{G})$ due to the absence of the true label $Y$.

Meanwhile, we notice some fully nonparametric approaches for the so-called latent-structure or mixture models related to our problem setup in recent literature, including Bonhomme et al. (2016), Yu et al. (2019), and Zheng and Wu (2019). For example, Zheng and Wu (2019) proposed a novel tensor approach for learning nonparametric mixtures, with a key idea of introducing basis approximation to the component density functions. This line of work is in general free from the model misspecification issue discussed above but cannot provide desirable $n^{-1/2}$-consistent estimators and may encounter the “curse of dimensionality” for multivariate surrogate outcomes.

To address the above-introduced dilemma between the bias caused by model misspecification and the low efficiency due to the curse of dimensionality, we develop a Three-stage Unsupervised learning approach for Biomarkers linked with Error-prone outcomes, abbreviated as TUBE. Our approach primarily aims at risk modeling with the baseline biomarkers, and is also able to produce and validate a predictive EHR phenotyping score without observations of the true disease outcome. It is a semiparametric method that starts from a composite and nonparametric regression step for $\mathbf{X},Y^*$ against $\mathbf{G}$ that is free of any parametric assumptions. Following this step, TUBE combines multiple surrogates for EHR phenotyping and validation, and then implements a parametric projection step to improve the interpretability and estimation efficiency of the genetic risk model. We will show that our estimator for $\boldsymbol{\beta}$ is $n^{-1/2}$-consistent and asymptotically normal without requiring model (2) to be correctly specified or $Y\sim\mathbf{X}$ to have a parametric form, both of which are imposed by existing methods like Hong et al. (2019) and Zhang et al. (2019a). Also, TUBE demonstrates significantly better performance than existing methods in our simulation and real-world studies.

2 Three-stage unsupervised learning method

2.1 Overview of the modeling strategy

Our proposed TUBE method consists of three main stages. In stage I, we adopt an under-smoothed nonparametric and composite likelihood strategy that is free of any parametric or model structural assumptions on the forms of $Y\sim Y^*$, $Y\sim\mathbf{X}$ and $Y\sim\mathbf{G}$. This is to avoid the potential bias caused by model misspecification on linking the error-prone outcomes $(Y^*,\mathbf{X})$ with $\mathbf{G}$ without the supervision of the true label $Y$. In stage II, we leverage the results from stage I to condense the EHR features $\mathbf{X}$ into a risk score $\widehat{\alpha}(\mathbf{X})$ for more accurate phenotyping of $Y$, and refit the data using nonparametric likelihoods to evaluate its ROC. In stage III, we rely on the imputed outcomes from stage II to derive a parametric logistic model for $Y\sim\mathbf{G}$. Compared to the previous stages, stage III will output a more efficient characterization of the genetic risk or association with good interpretability and desirable convergence rates. Meanwhile, built upon previous stages that are robust to model misspecification, stage III will be valid even when the target genetic model is wrong.

Denote by $\mu=\Pr(Y=1)$ and $m_j(x)=\Pr(Y=1\mid X_j=x)$ for $j=1,2,\ldots,p$. To avoid the curse of dimensionality in modeling $Y$ jointly against $X_1,X_2,\ldots,X_p$ through a multivariate nonparametric model, we consider a working conditional independence assumption across $X_1,X_2,\ldots,X_p$ given $Y$, implying an additive logistic form of their joint model:

$$\Pr(Y=1\mid\mathbf{X})=g\{a+\bar{\alpha}(\mathbf{X})\}\quad\text{with}\quad\bar{\alpha}(\mathbf{X})=\sum_{j=1}^{p}g^{-1}\{m_j(X_j)\}, \tag{3}$$

where $a$ is an intercept term introduced such that $\mathbb{E}\,g\{\bar{\alpha}(\mathbf{X})\}=\mu$. As will be introduced in Section 2.2, under this construction, we can model each $X_j$ with $\mathbf{G}$ separately and combine them with a composite likelihood to estimate the $m_j(\cdot)$'s, as if $X_1\perp X_2\perp\ldots\perp X_p\mid Y$. Then we will ensemble the estimators of $m_j(X_j)$ through (3) to derive an estimate for the phenotyping score $\bar{\alpha}(\mathbf{X})$. As we will discuss later, due to our use of the composite likelihood, violation of the additive model (3) will not invalidate the downstream results.
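
For readers who want to see why the working conditional independence assumption yields the additive form (3), the following LaTeX block gives a short derivation from Bayes' rule. It is our own sketch rather than a quotation from the paper, and the notation $f_j(x\mid y)$ for the conditional density (or mass function) of $X_j$ given $Y=y$ is introduced here only for this purpose.

```latex
% f_j(x | y): conditional density or mass function of X_j given Y = y (notation assumed here).
\begin{align*}
\frac{\Pr(Y=1\mid \mathbf{X})}{\Pr(Y=0\mid \mathbf{X})}
  &= \frac{\mu}{1-\mu}\prod_{j=1}^{p}\frac{f_j(X_j\mid 1)}{f_j(X_j\mid 0)}
  && \text{(Bayes' rule and } X_1\perp\cdots\perp X_p\mid Y\text{)},\\
\frac{m_j(X_j)}{1-m_j(X_j)}
  &= \frac{\mu}{1-\mu}\cdot\frac{f_j(X_j\mid 1)}{f_j(X_j\mid 0)}
  && \text{(Bayes' rule for a single } X_j\text{)},\\
g^{-1}\{\Pr(Y=1\mid \mathbf{X})\}
  &= g^{-1}(\mu)+\sum_{j=1}^{p}\left[g^{-1}\{m_j(X_j)\}-g^{-1}(\mu)\right]
   = (1-p)\,g^{-1}(\mu)+\bar{\alpha}(\mathbf{X}).
\end{align*}
```

Under exact conditional independence, the intercept in (3) would therefore equal $(1-p)\,g^{-1}(\mu)$; when the working assumption fails, (3) is only an approximation to the joint model.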

For the genetic variants $\mathbf{G}$, we will consider two scenarios: (i) $\mathbf{G}$ contains multi-dimensional discrete SNP features ranging over $\{0,1,2\}$; and (ii) $\mathbf{G}$ is a univariate continuous genetic risk score. For (i), we introduce the categorical functions covering all the possible combinations of the discrete SNPs in $\mathbf{G}$, while for (ii), we use the spline (sieve) basis functions of $\mathbf{G}$. In both cases, we specify the nonparametric model of $Y\sim\mathbf{G}$ as

$$\Pr(Y=1\mid\mathbf{G})=g\{\boldsymbol{\xi}^{\mathsf{T}}\boldsymbol{\psi}(\mathbf{G})\}, \tag{4}$$

where $\boldsymbol{\psi}(\mathbf{G})=\{\psi_1(\mathbf{G}),\psi_2(\mathbf{G}),\ldots,\psi_{d_g}(\mathbf{G})\}^{\mathsf{T}}$ is a set of bases with possibly diverging dimensionality, used to approximate any (smooth) function of $\mathbf{G}$. Note that model (4) is a nuisance model introduced to avoid model misspecification in the first stage of our method. Our final goal is to estimate the parametric model (2) with a more desirable convergence rate as well as easier interpretation than (4). This is especially advantageous when the genetic association is mild or small and thus requires small enough estimation uncertainty to detect.
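
The two choices of basis $\boldsymbol{\psi}(\mathbf{G})$ can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the indicator basis covers only genotype combinations that actually occur in the sample, and a truncated power basis with quantile knots stands in for whichever spline family is used in practice; the function names and the default knot count are hypothetical.

```python
import numpy as np

def snp_combination_basis(G):
    """Scenario (i): one indicator column per observed genotype combination."""
    combos, inverse = np.unique(G, axis=0, return_inverse=True)
    inverse = np.ravel(inverse)                      # guard against shape differences across NumPy versions
    psi = np.zeros((G.shape[0], combos.shape[0]))
    psi[np.arange(G.shape[0]), inverse] = 1.0
    return psi, combos

def truncated_power_basis(g, n_knots=4, degree=3):
    """Scenario (ii): polynomial terms plus truncated powers at interior quantile knots."""
    knots = np.quantile(g, np.linspace(0.0, 1.0, n_knots + 2)[1:-1])
    columns = [g**d for d in range(degree + 1)]      # 1, g, g^2, ..., g^degree
    columns += [np.clip(g - k, 0.0, None) ** degree for k in knots]
    return np.column_stack(columns)

rng = np.random.default_rng(3)
snps = rng.integers(0, 3, size=(1000, 2))            # two SNPs coded 0/1/2
psi_snp, combos = snp_combination_basis(snps)

risk_score = rng.normal(size=1000)                   # univariate continuous score
psi_score = truncated_power_basis(risk_score)

print(psi_snp.shape, psi_score.shape)
```

With $q$ SNPs the indicator basis can have up to $3^q$ columns, which is one way to see why the dimension $d_g$ is allowed to diverge.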

2.2 Stage I: sieve-approximated composite likelihood

We first focus on the estimation of the $m_j(\cdot)$'s and $\Pr(Y=1\mid\mathbf{G})$. To ensure validity while incorporating the additional genetic information, we consider a composite log-likelihood formulated under our key assumption that $(Y^*,\mathbf{X})\perp\!\!\!\perp\mathbf{G}\mid Y$ and a working independence condition of $X_1,\ldots,X_p$ given $Y$:

$$\sum_{i=1}^{n}\log\left\{\sum_{y=0}^{1}\Pr(Y^*_i\mid Y_i=y)\Pr(Y_i=y\mid\mathbf{G}_i)\right\}+\sum_{i=1}^{N}\sum_{j=1}^{p}\log\left\{\sum_{y=0}^{1}\frac{\Pr(Y_i=y\mid X_{ij})\Pr(Y_i=y\mid\mathbf{G}_i)}{\Pr(Y_i=y)}\right\},$$

where $X_{ij}$ is the $j$-th EHR outcome of subject $i$. As is outlined in Section 2.1, due to potential misspecification of parametric models like (2), we model $\Pr(Y=y\mid\mathbf{G})$ nonparametrically by (4), and adopt a similar sieve construction on each

$$m_j(X_j)=\Pr(Y=1\mid X_j)=g\{\boldsymbol{\zeta}_j^{\mathsf{T}}\boldsymbol{\varphi}_j(X_j)\},$$

where $\boldsymbol{\varphi}_j(x)$ is a vector of basis functions used to approximate $g^{-1}\{m_j(x)\}$. For discrete $X_j$, we naturally set $\boldsymbol{\varphi}_j(x)$ as its dummy variables. For continuous $X_j$, we again use a sieve basis. Then we can construct the sieve-approximated composite likelihood as:

$$\mathcal{C}(\boldsymbol{\theta})=\sum_{i=1}^{n}\log\left(\sum_{y=0}^{1}\lambda_{yY_i^*}\,g_y\{\boldsymbol{\xi}^{\mathsf{T}}\boldsymbol{\psi}(\mathbf{G}_i)\}\right)+\sum_{i=1}^{N}\sum_{j=1}^{p}\log\left(\sum_{y=0}^{1}\mu_y^{-1}\,g_y\{\boldsymbol{\zeta}_j^{\mathsf{T}}\boldsymbol{\varphi}_j(X_{ij})\}\,g_y\{\boldsymbol{\xi}^{\mathsf{T}}\boldsymbol{\psi}(\mathbf{G}_i)\}\right),$$

where $\boldsymbol{\theta}=\{\boldsymbol{\xi},\boldsymbol{\zeta},\boldsymbol{\lambda},\mu\}$, $\boldsymbol{\lambda}=(\boldsymbol{\lambda}_0^{\mathsf{T}},\boldsymbol{\lambda}_1^{\mathsf{T}})^{\mathsf{T}}$, $\boldsymbol{\zeta}=(\boldsymbol{\zeta}_1^{\mathsf{T}},\ldots,\boldsymbol{\zeta}_p^{\mathsf{T}})^{\mathsf{T}}$, and we denote by $g_y(\cdot)=yg(\cdot)+(1-y)\{1-g(\cdot)\}$ and $\mu_y=\Pr(Y=y)=y\mu+(1-y)(1-\mu)$. To solve for the $\boldsymbol{\theta}$ that maximizes $\mathcal{C}(\boldsymbol{\theta})$, we propose to use an expectation-maximization (EM) algorithm outlined in Algorithm 1.
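
To show how the pieces of $\mathcal{C}(\boldsymbol{\theta})$ fit together, the Python sketch below evaluates the composite log-likelihood for given parameter values. It only evaluates the objective and does not implement the EM updates of Algorithm 1. The function name `composite_loglik`, the argument layout (for instance, `lam` stored as a $2\times(K+1)$ matrix and labels passed as integer levels $KY^*$ for the first $n$ subjects), and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def composite_loglik(xi, zetas, lam, mu, psi_G, phi_X, k_star):
    """
    Evaluate the sieve-approximated composite log-likelihood C(theta).

    xi     : (d_g,) coefficients of psi(G)
    zetas  : list of p coefficient vectors, one per surrogate X_j
    lam    : (2, K + 1) array with lam[y, k] = Pr(Y* = k/K | Y = y)
    mu     : Pr(Y = 1)
    psi_G  : (N, d_g) basis matrix for G
    phi_X  : list of p basis matrices, each (N, d_j)
    k_star : (n,) integer levels K * Y* of the first n (labeled) subjects
    """
    n = len(k_star)
    pg1 = expit(psi_G @ xi)
    pg = np.stack([1.0 - pg1, pg1])                  # pg[y, i] = g_y{xi' psi(G_i)}
    mu_y = np.array([1.0 - mu, mu])

    # First term: chart review labels, labeled subjects only
    term_label = np.log((lam[:, k_star] * pg[:, :n]).sum(axis=0)).sum()

    # Second term: one component per surrogate X_j, all N subjects
    term_surrogate = 0.0
    for zeta_j, phi_j in zip(zetas, phi_X):
        pm1 = expit(phi_j @ zeta_j)
        pm = np.stack([1.0 - pm1, pm1])              # pm[y, i] = g_y{zeta_j' phi_j(X_ij)}
        term_surrogate += np.log((pm * pg / mu_y[:, None]).sum(axis=0)).sum()

    return term_label + term_surrogate

# Toy inputs (hypothetical values) with p = 2 surrogates and K = 2
rng = np.random.default_rng(4)
N, n, mu = 200, 20, 0.3
psi_G = np.column_stack([np.ones(N), rng.normal(size=N)])
phi_X = [np.column_stack([np.ones(N), rng.normal(size=N)]) for _ in range(2)]
lam = np.array([[0.80, 0.15, 0.05],
                [0.05, 0.15, 0.80]])
k_star = rng.integers(0, 3, size=n)
value = composite_loglik(np.array([-0.8, 0.5]), [np.array([-1.0, 1.0])] * 2,
                         lam, mu, psi_G, phi_X, k_star)
print(value)
```

A useful sanity check on the formula: if $g\{\boldsymbol{\xi}^{\mathsf{T}}\boldsymbol{\psi}(\mathbf{G}_i)\}=\mu$ for every subject, the inner sum of each surrogate component equals one, so the second term vanishes and only the label term remains.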

Algorithm 1 EM algorithm for the nonparametric composite log-likelihood.

Input: Observed data $\mathscr{O}=\{\mathbf{O}_{i}=(Y^{*}_{i}\delta_{i},\delta_{i},\mathbf{X}_{i},\mathbf{G}_{i}):i=1,2,\ldots,N\}$.
Initialize with $\widehat{\bm{\theta}}^{(0)}=\{\widehat{\bm{\xi}}^{(0)},\widehat{\bm{\zeta}}^{(0)},\widehat{\bm{\lambda}}^{(0)},\widehat{\mu}^{(0)}\}$ obtained by Algorithm A2. Iterate on the following two steps for $r=0,1,\ldots,R$ until convergence.
E-step. For each subject $i$ and outcome $j$ (or $Y^{*}$ if observed: $\delta_{i}=1$), impute the probability for the unobserved $Y_{i}$ conditional on the covariates in each component of the composite likelihood:

$$\widehat{Y}_{i0}^{(r+1)}=\delta_{i}\times\frac{\widehat{\lambda}_{1Y_{i}^{*}}^{(r)}g_{1}\{\bm{\psi}^{\mathsf{T}}(\mathbf{G}_{i})\widehat{\bm{\xi}}^{(r)}\}}{\sum_{y=0}^{1}\widehat{\lambda}_{yY_{i}^{*}}^{(r)}g_{y}\{\bm{\psi}^{\mathsf{T}}(\mathbf{G}_{i})\widehat{\bm{\xi}}^{(r)}\}};\quad\widehat{Y}_{ij}^{(r+1)}=\frac{g_{1}\{\bm{\varphi}^{\mathsf{T}}_{j}(X_{ij})\widehat{\bm{\zeta}}^{(r)}_{j}\}g_{1}\{\bm{\psi}^{\mathsf{T}}(\mathbf{G}_{i})\widehat{\bm{\xi}}^{(r)}\}/\widehat{\mu}_{1}^{(r)}}{\sum_{y=0}^{1}g_{y}\{\bm{\varphi}^{\mathsf{T}}_{j}(X_{ij})\widehat{\bm{\zeta}}^{(r)}_{j}\}g_{y}\{\bm{\psi}^{\mathsf{T}}(\mathbf{G}_{i})\widehat{\bm{\xi}}^{(r)}\}/\widehat{\mu}_{y}^{(r)}}.$$

M-step. Update $\bm{\theta}$ through the maximum likelihood estimation (MLE) specified with the imputed outcomes from the E-step:

$$\widehat{\mu}^{(r+1)}=\frac{1}{Np+n}\sum_{i=1}^{N}\sum_{j=0}^{p}\widehat{Y}_{ij}^{(r+1)};\quad\widehat{\lambda}_{yk}^{(r+1)}=\frac{\sum_{i=1}^{n}I(Y^{*}_{i}=k)\{\widehat{Y}_{i0}^{(r+1)}\}^{y}\{1-\widehat{Y}_{i0}^{(r+1)}\}^{1-y}}{\sum_{i=1}^{n}\{\widehat{Y}_{i0}^{(r+1)}\}^{y}\{1-\widehat{Y}_{i0}^{(r+1)}\}^{1-y}};$$
$$\widehat{\bm{\xi}}^{(r+1)}=\mathop{\mathrm{argmax}}_{\bm{\xi}}\sum_{i=1}^{n}\ell\left(\widehat{Y}_{i0}^{(r+1)},\bm{\psi}^{\mathsf{T}}(\mathbf{G}_{i})\bm{\xi}\right)+\sum_{i=1}^{N}\sum_{j=1}^{p}\ell\left(\widehat{Y}_{ij}^{(r+1)},\bm{\psi}^{\mathsf{T}}(\mathbf{G}_{i})\bm{\xi}\right);$$
$$\widehat{\bm{\zeta}}_{j}^{(r+1)}=\mathop{\mathrm{argmax}}_{\bm{\zeta}_{j}}\sum_{i=1}^{N}\ell\left(\widehat{Y}_{ij}^{(r+1)},\bm{\varphi}^{\mathsf{T}}_{j}(X_{ij})\bm{\zeta}_{j}\right),\quad\mbox{for }j=1,2,\ldots,p.$$

Output: $\widehat{\bm{\theta}}=\{\widehat{\bm{\xi}},\widehat{\bm{\zeta}},\widehat{\bm{\lambda}},\widehat{\mu}\}=\{\widehat{\bm{\xi}}^{(R)},\widehat{\bm{\zeta}}^{(R)},\widehat{\bm{\lambda}}^{(R)},\widehat{\mu}^{(R)}\}$.
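The two E-step imputations of Algorithm 1 can be sketched for a single subject as follows. This is a minimal illustration rather than the authors' implementation: the function names, the nested-list layout `lam[y][k]` for $\lambda_{yk}$, and the assumption that $g$ is the logistic link are choices made here for concreteness.

```python
import math


def g(t):
    """Logistic link, written to avoid overflow for large |t| (assumed form of g)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def g_y(y, t):
    """g_y(t) = y g(t) + (1 - y){1 - g(t)} for y in {0, 1}."""
    return y * g(t) + (1 - y) * (1.0 - g(t))


def impute_from_label(lam, y_star, delta, eta_g):
    """E-step value Yhat_i0 for the (Y*, G) component.

    lam[y][k] stands for lambda_{yk}; eta_g is the linear predictor psi(G_i)^T xi.
    Returns 0 for an unlabeled subject (delta = 0), matching the delta_i factor.
    """
    if delta == 0:
        return 0.0
    num = lam[1][y_star] * g_y(1, eta_g)
    den = sum(lam[y][y_star] * g_y(y, eta_g) for y in (0, 1))
    return num / den


def impute_from_surrogate(eta_x, eta_g, mu):
    """E-step value Yhat_ij for the (X_j, G) component.

    eta_x is varphi_j(X_ij)^T zeta_j, eta_g is psi(G_i)^T xi, mu is the prevalence.
    """
    mu_y = {0: 1.0 - mu, 1: mu}
    terms = {y: g_y(y, eta_x) * g_y(y, eta_g) / mu_y[y] for y in (0, 1)}
    return terms[1] / (terms[0] + terms[1])
```

Each function uses only the variables of one composite-likelihood component, so no model for the joint distribution of the surrogates is needed.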

Algorithm 1 iterates on two main steps. First, the E-step imputes the unobserved true outcome $Y$ separately conditional on each $(X_{j},\mathbf{G})$ or $(Y^{*},\mathbf{G})$, that is, on the set of features appearing in each component of the composite likelihood. Unlike EM algorithms for joint likelihood objectives, our method does not involve any imputation model of $Y$ using the whole set of observed variables $(\mathbf{X},\mathbf{G},Y^{*})$. This in turn keeps the procedure valid without any assumption on the joint distribution of $(\mathbf{X},Y^{*})$, which is hard to characterize due to the curse of dimensionality. Second, the M-step solves for $\bm{\theta}$ through the MLE constructed using the imputed $Y$'s. Again, corresponding to the composite likelihood construction, $\bm{\lambda}$ and the $\bm{\zeta}_{j}$'s for different error-prone outcomes are solved separately based on their own imputed outcomes.
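The closed-form parts of the M-step (the updates of $\mu$ and $\bm{\lambda}$) can be sketched as below. The function names and the list-based data layout are assumptions made for illustration; the updates of $\bm{\xi}$ and $\bm{\zeta}_j$ require a numerical optimizer and are not covered here.

```python
def update_mu(y_hat_label, y_hat_surr, n_labeled):
    """mu update: (N p + n)^{-1} * sum_i sum_{j=0}^{p} Yhat_ij.

    y_hat_label[i] is Yhat_i0 (zero for unlabeled subjects);
    y_hat_surr[i][j - 1] is Yhat_ij for j = 1, ..., p.
    """
    big_n, p = len(y_hat_surr), len(y_hat_surr[0])
    total = sum(y_hat_label) + sum(sum(row) for row in y_hat_surr)
    return total / (big_n * p + n_labeled)


def update_lambda(y_hat_label, y_star, delta, n_categories):
    """lambda_{yk} update over labeled subjects (delta = 1).

    For y = 1 the weight is Yhat_i0, for y = 0 it is 1 - Yhat_i0;
    lambda_{yk} is the weighted share of labeled subjects with Y* = k.
    """
    lam = [[0.0] * n_categories for _ in (0, 1)]
    for y in (0, 1):
        weights = [d * (yh if y == 1 else 1.0 - yh)
                   for yh, d in zip(y_hat_label, delta)]
        total = sum(weights)
        for k in range(n_categories):
            lam[y][k] = sum(w for w, ys in zip(weights, y_star) if ys == k) / total
    return lam
```

By construction each row `lam[y]` sums to one, and unlabeled subjects contribute nothing to the update of $\bm{\lambda}$.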

In Theorem 1 presented later, we show that Algorithm 1 maintains an ascent property on the objective composite likelihood function, which is desirable for optimization. Nevertheless, it is still practically crucial to have a good initial estimator $\widehat{\bm{\theta}}^{(0)}$ for Algorithm 1 so that it does not get trapped at a poor local optimum. In response to this, we propose in Algorithm A2 of the Appendix to derive $\widehat{\bm{\xi}}^{(0)},\widehat{\bm{\zeta}}^{(0)},\widehat{\mu}^{(0)}$ through the MLE constructed as if $I(Y^{*}=1)$ were the true outcome, i.e., the logistic regression of $I(Y^{*}=1)$ against $\bm{\psi}(\mathbf{G})$ or each $\bm{\varphi}_{j}(X_{j})$. For $\widehat{\bm{\lambda}}^{(0)}$, we set it to a reasonable guess presuming that $Y^{*}$ is informative.
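The initialization idea, regressing $I(Y^{*}=1)$ on a basis expansion by logistic regression, can be sketched with a small Newton-Raphson solver. This is a hypothetical stand-in for Algorithm A2, which is not reproduced in this section; the names `fit_logistic` and `solve`, the iteration count, and the tiny ridge term are assumptions.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for c in reversed(range(n)):
        s = sum(m[c][k] * x[k] for k in range(c + 1, n))
        x[c] = (m[c][n] - s) / m[c][c]
    return x


def fit_logistic(basis, target, n_iter=25, ridge=1e-8):
    """Newton-Raphson for logistic regression of target on the rows of basis.

    target may hold 0/1 values or fractional values in [0, 1].
    The ridge is added to the Hessian only (for invertibility), so the
    fixed point of the iteration is still the unpenalized MLE.
    """
    d = len(basis[0])
    coef = [0.0] * d
    for _ in range(n_iter):
        grad = [0.0] * d
        hess = [[ridge if a == b else 0.0 for b in range(d)] for a in range(d)]
        for row, t in zip(basis, target):
            p = sigmoid(sum(c * v for c, v in zip(coef, row)))
            w = p * (1.0 - p)
            for a in range(d):
                grad[a] += (t - p) * row[a]
                for b in range(d):
                    hess[a][b] += w * row[a] * row[b]
        step = solve(hess, grad)
        coef = [c + s for c, s in zip(coef, step)]
    return coef
```

For example, an initial $\bm{\xi}$ could be obtained as `fit_logistic(psi_rows, [1.0 if ys == 1 else 0.0 for ys in y_star])`, where `psi_rows` holds the basis values $\bm{\psi}(\mathbf{G}_i)$ of the labeled subjects. Because the solver accepts fractional targets, the same routine could in principle serve the $\bm{\xi}$ and $\bm{\zeta}_j$ updates of the M-step, assuming $\ell$ is the Bernoulli log-likelihood.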

2.3 Stage II: condensing EHR features for phenotyping

With the fitted estimator in Stage I, we derive $\widehat{\alpha}(\mathbf{X})=\sum_{j=1}^{p}\bm{\varphi}^{\mathsf{T}}_{j}(X_{j})\widehat{\bm{\zeta}}_{j}$, which serves as a phenotype score condensing the outcomes $X_{1},X_{2},\ldots,X_{p}$. For $\widehat{\alpha}(\mathbf{X})$, we further adopt a nonparametric likelihood approach that combines it with $\mathbf{G}$ to derive an imputation model for $Y$. Since $\widehat{\alpha}(\mathbf{X})$ combines multiple EHR outcomes, it tends to be more predictive of $Y$ than each single $m_{j}(X_{j})$. This procedure can therefore be more efficient than modeling each $X_{j}$ separately as in $\mathcal{C}(\bm{\theta})$, which makes it more favorable for the downstream analysis.
As implied by (3), the optimal ensemble is $\bar{\alpha}(\mathbf{X})=\sum_{j=1}^{p}g^{-1}\{m_{j}(X_{j})\}$ only when the working assumption $X_{1}\perp X_{2}\perp\ldots\perp X_{p}\mid Y$ holds. When there is strong evidence that such conditional independence does not hold, an alternative strategy is to set the phenotyping score $\alpha(\mathbf{X})$ as the first principal component of $g^{-1}\{m_{j}(X_{j})\}$ for $j=1,2,\ldots,p$, to make it representative of the multiple EHR outcomes.
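The principal-component alternative can be sketched with a plain power iteration. The sketch assumes that each row holds the $p$ fitted logit scores of one subject, e.g. $\bm{\varphi}_j^{\mathsf{T}}(X_{ij})\widehat{\bm{\zeta}}_j$ standing in for $g^{-1}\{m_j(X_{ij})\}$; the function name, the iteration count, and the sign convention are illustrative choices.

```python
import math


def first_pc_scores(rows, n_iter=200):
    """Leading principal component of the columns of rows, by power iteration.

    Returns (loadings, scores). The start vector has equal entries, which
    suits positively correlated surrogates; it would fail if the leading
    eigenvector were exactly orthogonal to it.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    if sum(v) < 0:  # fix the sign so that larger scores mean larger surrogates
        v = [-x for x in v]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores
```

The returned `scores` would then play the role of $\alpha(\mathbf{X}_i)$ in the subsequent likelihood construction.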

Again, we will not rely on any parametric or model structural assumptions on the sensitivity function $\mathcal{S}_{\bar{\alpha},y}(c)={\rm Pr}(\bar{\alpha}(\mathbf{X})>c\mid Y=y)$ for $c\in\mathbb{R}$ and $y\in\{0,1\}$ that captures $\bar{\alpha}(\mathbf{X})\mid Y$. In this case, the log-likelihood function can be written as

$$\sum_{i=1}^{n}\log\left\{\sum_{y=0}^{1}\lambda_{yY^{*}_{i}}g_{y}\{\bm{\xi}^{\mathsf{T}}\bm{\psi}(\mathbf{G}_{i})\}\right\}+\sum_{i=1}^{N}\log\left\{-\sum_{y=0}^{1}\dot{\mathcal{S}}_{\widehat{\alpha},y}\{\widehat{\alpha}(\mathbf{X}_{i})\}g_{y}\{\bm{\xi}^{\mathsf{T}}\bm{\psi}(\mathbf{G}_{i})\}\right\}.$$

Without any further constraint on $\mathcal{S}_{\bar{\alpha},y}(c)={\rm Pr}(\bar{\alpha}(\mathbf{X})>c\mid Y=y)$, the above log-likelihood function will not have a unique maximizer. Thus, inspired by the existing literature on nonparametric MLE (e.g., Murphy and Van der Vaart, 2000), we restrict $\mathcal{S}_{\bar{\alpha},y}(c)$ to be a step function that can only jump at the observed data points $\{\widehat{\alpha}(\mathbf{X}_{i}):i=1,2,\ldots,N\}$, and denote its jump size at each $\widehat{\alpha}(\mathbf{X}_{i})$ as $\nabla\mathcal{S}_{\bar{\alpha},y}\{\widehat{\alpha}(\mathbf{X}_{i})\}$. If the true status $Y_{i}$ were observed, the MLE for $\mathcal{S}_{\widehat{\alpha},y}(c)$ under this step-function constraint would be

$$\breve{\mathcal{S}}_{\widehat{\alpha},y}(c)=\frac{\sum_{i=1}^{N}I(\widehat{\alpha}(\mathbf{X}_{i})>c)I(Y_{i}=y)}{\sum_{i=1}^{N}I(Y_{i}=y)}\quad\mbox{for}\quad c=\widehat{\alpha}(\mathbf{X}_{i^{\prime}}).$$
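The empirical sensitivity function in the display above is a weighted tail proportion, which can be sketched as follows. The function name and the use of generic weights are assumptions for illustration: with weights $I(Y_i=y)$ the function reproduces the display, while passing imputed probabilities as weights would be one plausible soft-label analogue when $Y_i$ is unobserved (the exact update used by the authors' EM algorithm is not reproduced in this section).

```python
def weighted_sensitivity(scores, weights, c):
    """Weighted share of subjects whose score strictly exceeds the cutoff c.

    scores[i] plays the role of alpha_hat(X_i); weights[i] is I(Y_i = y)
    for the hard-label estimator, or a soft weight in [0, 1].
    """
    total = sum(weights)
    return sum(w for s, w in zip(scores, weights) if s > c) / total


# Hard-label usage: restrict to the group Y_i = y through indicator weights.
scores = [0.1, 0.4, 0.7, 0.9]
labels = [0, 0, 1, 1]
cases = [1.0 if lab == 1 else 0.0 for lab in labels]
controls = [1.0 if lab == 0 else 0.0 for lab in labels]
sens_at_cut = weighted_sensitivity(scores, cases, 0.7)
false_pos_at_cut = weighted_sensitivity(scores, controls, 0.1)
```

Evaluating the function at each observed score gives the step function whose jumps enter the likelihood.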

Based on this, our objective becomes to maximize

$$\mathcal{L}(\bm{\eta}_{\widehat{\alpha}})=\sum_{i=1}^{n}\log\left\{\sum_{y=0}^{1}\lambda_{yY^{*}_{i}}g_{y}\{\bm{\xi}^{\mathsf{T}}\bm{\psi}(\mathbf{G}_{i})\}\right\}+\sum_{i=1}^{N}\log\left\{\sum_{y=0}^{1}-\nabla\mathcal{S}_{\widehat{\alpha},y}\{\widehat{\alpha}(\mathbf{X}_{i})\}g_{y}\{\bm{\xi}^{\mathsf{T}}\bm{\psi}(\mathbf{G}_{i})\}\right\},\tag{5}$$

where $\bm{\eta}_{\bm{\alpha}}=\{\mathcal{S}_{\alpha,0}(\cdot),\mathcal{S}_{\alpha,1}(\cdot),\bm{\lambda},\bm{\xi}\}$, under the step-function constraints on $\mathcal{S}_{\alpha,0}(\cdot),\mathcal{S}_{\alpha,1}(\cdot)$. Since we do not specify the correlation or dependence between $Y^{*}$ and $\widehat{\alpha}(\mathbf{X})$, we still adopt a composite strategy to model them in (5). Unlike the fully composite $\mathcal{C}(\bm{\theta})$, which also treats each $X_{j}$ separately, we now condense the $X_{j}$'s into a single $\widehat{\alpha}(\mathbf{X})$.

Similar to Algorithm 1, we adopt an EM algorithm to numerically maximize the objective $\mathcal{L}(\bm{\eta}_{\widehat{\alpha}})$ for the solution $\widetilde{\bm{\eta}}_{\widehat{\alpha}}=\{\widetilde{\mathcal{S}}_{\widehat{\alpha},0}(\cdot),\widetilde{\mathcal{S}}_{\widehat{\alpha},1}(\cdot),\widetilde{\bm{\lambda}},\widetilde{\bm{\xi}}\}$; see Algorithm A1 in Appendix A. Finally, we introduce Theorem 1 to establish the ascent properties of our proposed EM algorithms for $\mathcal{C}(\bm{\theta})$ and $\mathcal{L}(\bm{\eta}_{\widehat{\alpha}})$ formulated in Stages I and II respectively.

Theorem 1

Let $\widehat{\bm{\theta}}^{(r)}$ and $\widetilde{\bm{\eta}}^{(r)}$ be the estimators at the $r$-th iteration of the EM Algorithms 1 and A1 respectively. We have $\mathcal{C}(\widehat{\bm{\theta}}^{(r)})\leq\mathcal{C}(\widehat{\bm{\theta}}^{(r+1)})$ and $\mathcal{L}(\widetilde{\bm{\eta}}_{\widehat{\alpha}}^{(r)})\leq\mathcal{L}(\widetilde{\bm{\eta}}_{\widehat{\alpha}}^{(r+1)})$, i.e., each iteration in our EM algorithms is ensured to result in the ascent of the objective log-likelihood functions.
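The intuition behind an ascent property of this kind is the usual minorization argument for EM. The display below is a generic sketch for one likelihood component of the form $\log\sum_{y}a_{y}(\bm{\theta})$, where $a_{y}$ and $w_{y}^{(r)}$ are illustrative symbols introduced here (for the label component, $a_{y}$ would correspond to $\lambda_{yY_i^*}g_{y}\{\cdot\}$ and $w_{1}^{(r)}$ to the imputed $\widehat{Y}_{i0}^{(r+1)}$). It is not the authors' proof, and it presumes that every $a_{y}$ is strictly positive and that the M-step does not decrease the surrogate $Q$.

```latex
% Generic EM minorization for one component; a_y, w_y^{(r)} and Q are illustrative symbols.
\begin{align*}
w_y^{(r)} &= \frac{a_y(\bm{\theta}^{(r)})}{\sum_{y'=0}^{1} a_{y'}(\bm{\theta}^{(r)})}, \\
\log \sum_{y=0}^{1} a_y(\bm{\theta})
  &= \log \sum_{y=0}^{1} w_y^{(r)} \, \frac{a_y(\bm{\theta})}{w_y^{(r)}}
  \;\ge\; \sum_{y=0}^{1} w_y^{(r)} \log \frac{a_y(\bm{\theta})}{w_y^{(r)}}
  \qquad \text{(Jensen's inequality)},
\end{align*}
% with equality at \bm{\theta} = \bm{\theta}^{(r)}. Summing the lower bounds over all
% components gives a surrogate Q(\bm{\theta} \mid \bm{\theta}^{(r)}) satisfying
\begin{equation*}
\mathcal{C}(\bm{\theta}^{(r+1)})
  \;\ge\; Q(\bm{\theta}^{(r+1)} \mid \bm{\theta}^{(r)})
  \;\ge\; Q(\bm{\theta}^{(r)} \mid \bm{\theta}^{(r)})
  \;=\; \mathcal{C}(\bm{\theta}^{(r)}).
\end{equation*}
```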

2.4 Stage III: genetic risk modeling and EHR phenotype validation

In Stages I and II introduced above, we fit nonparametric models for $Y\mid\mathbf{G}$ to make the estimators $\widehat{\alpha}(\cdot)$ and $\widehat{\mathcal{S}}_{\bar{\alpha},y}(\cdot)$ more robust to model misspecification. In practice, directly using such nonparametric models for gene association analysis often results in large variance or even inefficiency due to the curse of dimensionality. Thus, in this stage, we leverage the extracted $\widetilde{\bm{\eta}}_{\widehat{\alpha}}$ to construct a parametric genetic risk model for the true outcome $Y_{i}$ against $\mathbf{G}_{i}$.
Specifically, with $\widetilde{\bm{\eta}}_{\widehat{\alpha}}$, we characterize $\mathbb{E}[Y_{i}\mid\bar{\alpha}(\mathbf{X}_{i}),\mathbf{G}_{i}]$ for all $i=1,2,\ldots,N$, and $\mathbb{E}[Y_{i}\mid Y^{*}_{i},\mathbf{G}_{i}]$ for $i=1,2,\ldots,n$ as

$$
\widetilde{Y}_{i0}=\frac{\widetilde{\lambda}_{1Y_i^*}\,g_1\{\bm{\psi}^{\sf T}({\bf G}_i)\widetilde{\bm{\xi}}\}}{\sum_{y=0}^{1}\widetilde{\lambda}_{yY_i^*}^{(r)}\,g_y\{\bm{\psi}^{\sf T}({\bf G}_i)\widetilde{\bm{\xi}}\}};\quad
\widetilde{Y}_{i1}=\frac{\nabla\widetilde{\mathcal{S}}_{\widehat{\alpha},1}\{\widehat{\alpha}({\bf X}_i)\}\,g_1\{\bm{\psi}^{\sf T}({\bf G}_i)\widetilde{\bm{\xi}}\}}{\sum_{y=0}^{1}\nabla\widetilde{\mathcal{S}}_{\widehat{\alpha},y}\{\widehat{\alpha}({\bf X}_i)\}\,g_y\{\bm{\psi}^{\sf T}({\bf G}_i)\widetilde{\bm{\xi}}\}},
$$

which coincides with the imputation of the unobserved $Y$ in the last E-step of Algorithm A2. Note that $\widetilde{Y}_{i1}$ is not necessarily consistent for $\mathbb{E}[Y_i\mid{\bf X}_i,{\bf G}_i]$ unless the working independence assumption (3) holds and $\mathbb{E}[Y_i\mid{\bf X}_i]=\mathbb{E}[Y_i\mid\bar{\alpha}({\bf X}_i)]$. Then we conduct logistic regression for the imputed outcomes $\widetilde{Y}_{i0}$ and $\widetilde{Y}_{i1}$ separately against ${\bf G}_i$, to obtain estimators

$$
\widetilde{\bm{\beta}}_0=\mathop{\mbox{argmax}}_{\bm{\beta}}\sum_{i=1}^{n}\ell(\widetilde{Y}_{i0},{\bf G}_i^{\sf T}\bm{\beta});\quad
\widetilde{\bm{\beta}}_1=\mathop{\mbox{argmax}}_{\bm{\beta}}\sum_{i=1}^{N}\ell(\widetilde{Y}_{i1},{\bf G}_i^{\sf T}\bm{\beta}).
$$
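The two regressions above take fractional (imputed) outcomes in $[0,1]$ rather than binary labels. A minimal sketch of such a fit is given below, assuming that $\ell(y,\eta)=y\eta-\log(1+e^{\eta})$ is the usual logistic log-likelihood and that the design matrix carries its own intercept column; neither detail is spelled out in this section. The function name `fit_soft_label_logistic` and the toy data are hypothetical and serve only as illustration.

```python
import numpy as np

def fit_soft_label_logistic(G, y_soft, n_iter=50, tol=1e-10):
    """Newton-Raphson maximizer of sum_i l(y_i, G_i^T beta), where
    l(y, eta) = y * eta - log(1 + exp(eta)) and y_i may be fractional."""
    G = np.asarray(G, dtype=float)
    y_soft = np.asarray(y_soft, dtype=float)
    beta = np.zeros(G.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(G @ beta)))          # fitted probabilities
        score = G.T @ (y_soft - mu)                     # gradient of the log-likelihood
        info = (G * (mu * (1.0 - mu))[:, None]).T @ G   # negative Hessian
        step = np.linalg.solve(info, score)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy illustration (simulated, not taken from the paper): an intercept,
# one Gaussian feature and one Binomial(2, 0.6) feature. The soft labels
# are set to the exact conditional means, so the fit should recover beta_true.
rng = np.random.default_rng(0)
G_demo = np.column_stack([
    np.ones(500),
    rng.normal(size=500),
    rng.binomial(2, 0.6, size=500),
])
beta_true = np.array([-0.5, 1.0, 0.3])
y_demo = 1.0 / (1.0 + np.exp(-(G_demo @ beta_true)))
beta_hat = fit_soft_label_logistic(G_demo, y_demo)
```

Under these assumptions the same routine would be called once with the $n$ labeled subjects and $\widetilde{Y}_{i0}$, and once with all $N$ subjects and $\widetilde{Y}_{i1}$.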

Although $N>n$, the standard error of $\widetilde{\bm{\beta}}_0$ may still be smaller than that of $\widetilde{\bm{\beta}}_1$ since $X$ is typically less informative than the chart review labels $Y^*$ in terms of measuring the true $Y$. To derive a more efficient estimator, the final step is to assemble $\widetilde{\bm{\beta}}_0$ and $\widetilde{\bm{\beta}}_1$ as:

$$
\widetilde{\bm{\beta}}=\widehat{\omega}\widetilde{\bm{\beta}}_0+(1-\widehat{\omega})\widetilde{\bm{\beta}}_1;\quad\widehat{\omega}\in[0,1],
$$

where $\widehat{\omega}$ is a weight determined using the data to minimize the variance of $\widetilde{\bm{\beta}}$ among all convex combinations of $\widetilde{\bm{\beta}}_0$ and $\widetilde{\bm{\beta}}_1$. When $N\gg n$, we can show that $\widetilde{\bm{\beta}}_0$ and $\widetilde{\bm{\beta}}_1$ are asymptotically independent, and thus the optimal weight is $\widehat{\omega}=\widehat{\rm SE}_0^{-2}/(\widehat{\rm SE}_0^{-2}+\widehat{\rm SE}_1^{-2})$, where $\widehat{\rm SE}_0$ and $\widehat{\rm SE}_1$ represent the estimated standard errors of $\widetilde{\bm{\beta}}_0$ and $\widetilde{\bm{\beta}}_1$, respectively. In general, we can take

$$
\widehat{\omega}=\arg\min_{\omega\in[0,1]}(\omega,1-\omega)\widehat{\Sigma}_{\widetilde{\bm{\beta}}_0,\widetilde{\bm{\beta}}_1}(\omega,1-\omega)^{\sf T},
$$

where $\widehat{\Sigma}_{\widetilde{\bm{\beta}}_0,\widetilde{\bm{\beta}}_1}$ is the estimated asymptotic covariance matrix of $(\widetilde{\bm{\beta}}_0,\widetilde{\bm{\beta}}_1)$ computed using the bootstrap. Since the true disease status $Y$ is unobserved, the estimators $\widetilde{\bm{\beta}}_0$ and $\widetilde{\bm{\beta}}_1$ are subject to label switching: the labels $Y=0$ and $Y=1$ cannot be distinguished from the observed data alone. To address this, we assume the coefficient for $G_1$ to be greater than zero, with $G_1$ chosen as a feature informative of $Y$. Correspondingly, we flip the sign of the fitted $\widetilde{\bm{\beta}}_0$ or $\widetilde{\bm{\beta}}_1$ if $\widetilde{\beta}_{01}<0$ or $\widetilde{\beta}_{11}<0$, respectively.
Alternatively, one could restrict the prevalence of $Y$ to be smaller than $0.5$, which does not require knowledge of an informative feature $G_1$.
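The weighting and sign-alignment steps can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it treats each coefficient separately (whether $\widehat{\omega}$ is chosen per coordinate or jointly is not stated in this section), uses the closed-form minimizer of the quadratic $\omega^2 v_0+2\omega(1-\omega)c+(1-\omega)^2 v_1$ clipped to $[0,1]$, and assumes the anchor coefficient for $G_1$ sits at a user-supplied position `anchor_index`. All function names are hypothetical.

```python
import numpy as np

def optimal_weight(var0, var1, cov01=0.0):
    """Weight w in [0, 1] minimizing Var(w * b0 + (1 - w) * b1)
    = w^2 * var0 + 2 * w * (1 - w) * cov01 + (1 - w)^2 * var1."""
    denom = var0 + var1 - 2.0 * cov01       # equals Var(b0 - b1)
    if denom <= 0.0:
        # Degenerate case: the objective is linear in w, so pick an endpoint.
        return 1.0 if var0 < var1 else 0.0
    w = (var1 - cov01) / denom              # unconstrained minimizer
    return float(np.clip(w, 0.0, 1.0))      # convex objective, so clipping is optimal

def align_sign(beta, anchor_index=0):
    """Flip the whole coefficient vector if the anchor coefficient is negative."""
    beta = np.asarray(beta, dtype=float)
    return -beta if beta[anchor_index] < 0 else beta

def combine_estimates(beta0, beta1, se0, se1, anchor_index=0):
    """Sign-align both estimators, then combine them coordinate-wise with
    inverse-variance weights (i.e., assuming asymptotic independence)."""
    b0 = align_sign(beta0, anchor_index)
    b1 = align_sign(beta1, anchor_index)
    w = np.array([optimal_weight(s0 ** 2, s1 ** 2) for s0, s1 in zip(se0, se1)])
    return w * b0 + (1.0 - w) * b1
```

With `cov01=0`, `optimal_weight` reduces to $\widehat{\rm SE}_0^{-2}/(\widehat{\rm SE}_0^{-2}+\widehat{\rm SE}_1^{-2})$; a bootstrap estimate of the covariance can be passed as `cov01` when independence is in doubt.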

As a by-product, we are also able to validate the derived phenotyping score $\widehat{\alpha}({\bf X})$ using the fitted sensitivity functional $\widetilde{\mathcal{S}}_{\widehat{\alpha},y}(\cdot)$. Denote the limiting (population-level) function of $\widehat{\alpha}({\bf X})$ as $\bar{\alpha}({\bf X})$. The true positive rate (TPR) and false positive rate (FPR) of the classifier $I(\widehat{\alpha}({\bf X})>c)$ or $I(\bar{\alpha}({\bf X})>c)$ on the true label $Y$ can be naturally estimated using $\widetilde{\mathcal{S}}_{\widehat{\alpha},1}(c)$ and $\widetilde{\mathcal{S}}_{\widehat{\alpha},0}(c)$, respectively.
Furthermore, the receiver operating characteristic (ROC) curve of $\widehat{\alpha}({\bf X})$ or $\bar{\alpha}({\bf X})$ can be estimated by $\widehat{\mbox{ROC}}(u)=\widetilde{\mathcal{S}}_{\widehat{\alpha},1}\{\widetilde{\mathcal{S}}^{-1}_{\widehat{\alpha},0}(u)\}$ for $u\in[0,1]$, and the area under the ROC curve by $\widehat{\mbox{AUC}}=\int_0^1\widehat{\mbox{ROC}}(u)\,du$.
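As a rough numerical illustration of these formulas, the sketch below traces the ROC curve parametrically through the cutoff $c$ instead of inverting $\widetilde{\mathcal{S}}_{\widehat{\alpha},0}$ explicitly, and integrates with the trapezoidal rule. It assumes the two fitted sensitivity functions have already been evaluated on a common grid of cutoffs; the function name `roc_auc_from_sensitivity` is hypothetical.

```python
import numpy as np

def roc_auc_from_sensitivity(s1_vals, s0_vals):
    """Given S_1(c_k) (TPR) and S_0(c_k) (FPR) on a common grid of cutoffs,
    return the ROC points (fpr, tpr) sorted by FPR and the trapezoidal AUC."""
    tpr = np.asarray(s1_vals, dtype=float)
    fpr = np.asarray(s0_vals, dtype=float)
    order = np.lexsort((tpr, fpr))                    # sort by FPR, ties broken by TPR
    fpr = np.concatenate(([0.0], fpr[order], [1.0]))  # anchor the curve at (0, 0) and (1, 1)
    tpr = np.concatenate(([0.0], tpr[order], [1.0]))
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc
```

The parametric form avoids a separate root-finding step for $\widetilde{\mathcal{S}}^{-1}_{\widehat{\alpha},0}(u)$; a finer cutoff grid gives a closer approximation to the integral.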

3 Asymptotic analysis

In this section, we provide asymptotic analysis of the TUBE estimators $\widehat{\alpha}({\bf X})$, $\widetilde{\mathcal{S}}_{\alpha,y}(\cdot)$, and $\widetilde{\bm{\beta}}$ resulting from the steps described in Sections 2.2–2.4. We consider ${\bf G}$ as a continuous univariate gene risk score and $\psi({\bf G})$ as its spline basis function. Let $\bar{\bm{\theta}}=\{\bar{\bm{\xi}},\bar{\bm{\zeta}},\bar{\bm{\lambda}},\bar{\mu}\}$ and $\bar{\bm{\eta}}=\{\bar{\mathcal{S}}_{\bar{\alpha},1},\bar{\mathcal{S}}_{\bar{\alpha},0},\bar{\bm{\lambda}},\bar{\bm{\xi}}\}$ be the population-level (true) parameters.
We define the norm of $\bm{\theta}$ to be $\|\bm{\theta}\|_2=\left\{\mathbb{E}\{\|\bm{\xi}\|_2^2\}+\mathbb{E}\{\|\bm{\zeta}\|_2^2\}+\mathbb{E}\{\|\bm{\lambda}\|_2^2\}+\mathbb{E}\{\mu^2\}\right\}^{1/2}$ and the norm of $\bm{\eta}$ to be $\|\bm{\eta}\|_2=\left\{\sum_{y=0}^{1}\int(\mathcal{S}_{\alpha,y}(c))^2dc+\mathbb{E}\{\|\bm{\lambda}\|_2^2\}+\mathbb{E}\{\|\bm{\xi}\|_2^2\}\right\}^{1/2}$.
We first introduce smoothness and regularity assumptions as follows.

Assumption 1

Covariates $({\bf X},{\bf G})$ have compact domain $\mathcal{X}\times\mathcal{G}$ with their joint probability density function being twice continuously differentiable. For all $j=1,2,\ldots,p$ and $y=0,1$, $m_{jy}(x)$ and $\gamma_y(g)$ are twice continuously differentiable. For $y=0,1$, $\mathcal{S}'_{\alpha,y}(c)$, the derivative of $\mathcal{S}_{\alpha,y}(c)$, is continuously differentiable.

Assumption 2

The parameter spaces of $\bar{\bm{\theta}}$ and $\bar{\bm{\eta}}$ are compact. The Hessian matrix $\mathbb{E}[{\bf G}{\bf G}^{\sf T}g_1'({\bf G}^{\sf T}\bm{\beta}_0)]$ has all its eigenvalues bounded away from $0$ and $\infty$. For any $\bm{\theta}_1,\bm{\theta}_2$ and $\bm{\eta}_{\alpha,1},\bm{\eta}_{\alpha,2}$, $\mathbb{E}[\mathcal{C}(\bm{\theta}_1+\tau(\bm{\theta}_2-\bm{\theta}_1))]$ and $\mathbb{E}[\mathcal{L}(\bm{\eta}_{\alpha,1}+\tau(\bm{\eta}_{\alpha,2}-\bm{\eta}_{\alpha,1}))]$ are twice continuously differentiable with respect to $\tau\in[0,1]$,
$\frac{\partial^2}{\partial\tau^2}\mathbb{E}[\mathcal{C}(\bm{\theta}_1+\tau(\bm{\theta}_2-\bm{\theta}_1))]\asymp-\|\bm{\theta}_2-\bm{\theta}_1\|_2^2$, and $\frac{\partial^2}{\partial\tau^2}\mathbb{E}[\mathcal{L}(\bm{\eta}_{\alpha,1}+\tau(\bm{\eta}_{\alpha,2}-\bm{\eta}_{\alpha,1}))]\asymp-\|\bm{\eta}_{\alpha,2}-\bm{\eta}_{\alpha,1}\|_2^2$.

Remark 2

Assumption 1 consists of mild smoothness conditions commonly used for the asymptotic analysis of M-estimation and sieve-smoothed regression (e.g., Van der Vaart, 2000; Chen, 2007). Assumption 2 requires the non-singularity of the Hessian matrix as well as the strong convexity of the loss functions, which have also been frequently used in the literature.

Remark 3

When ${\bf X}$ and ${\bf G}$ are discrete, e.g., ${\bf G}$ being categorical functions of several SNPs, Assumption 1 is taken as given. In such a situation with discrete ${\bf X}$, the sensitivity function $\mathcal{S}_{\alpha,y}(c)$ has only finitely many distinct cutoffs $c$, and the asymptotic analysis of its estimator degenerates to a simpler form.

Next, we establish the consistency and asymptotic normality of the phenotyping score $\widehat{\alpha}({\bf x})$ in Theorem 2, as well as those of the estimator of its sensitivity function in Theorem 3. Let $J_N$ be the dimensionality of the bases $\bm{\varphi}_j({\bf X})$ and $\bm{\psi}({\bf G})$, which is assumed to increase with $N$.

Theorem 2

Suppose that Assumptions 1 and 2 hold and that $N^{1/4}\ll J_N\ll N^{1/2}$. As $n,N\rightarrow\infty$, $\sup_{{\bf x}\in\mathcal{X}}|\widehat{\alpha}({\bf x})-\bar{\alpha}({\bf x})|$ converges to $0$ in probability. Moreover, for ${\bf x}\in\mathcal{X}$, $\sqrt{N/J_N}\{\widehat{\alpha}({\bf x})-\bar{\alpha}({\bf x})\}$ converges weakly to some zero-mean Gaussian process.

Theorem 3

Under all assumptions in Theorem 2, as $n,N\to\infty$, $\sup_{c\in\mathbb{R}}|\widetilde{\mathcal{S}}_{\widehat{\alpha},0}(c)-\bar{\mathcal{S}}_{\bar{\alpha},0}(c)|+|\widetilde{\mathcal{S}}_{\widehat{\alpha},1}(c)-\bar{\mathcal{S}}_{\bar{\alpha},1}(c)|$ converges to $0$ in probability, and for $c\in\mathbb{R}$, $\sqrt{N/J_N}\{\widetilde{\mathcal{S}}_{\widehat{\alpha},0}(c)-\bar{\mathcal{S}}_{\bar{\alpha},0}(c),\widetilde{\mathcal{S}}_{\widehat{\alpha},1}(c)-\bar{\mathcal{S}}_{\bar{\alpha},1}(c)\}$ converges weakly to some zero-mean Gaussian process.

Considering that our primary goal is the genetic risk estimation with $\widetilde{\bm{\beta}}$, we under-smooth the sieve estimator of $\bar{\alpha}$ by taking $J_N$ slightly larger than $O(N^{1/4})$, to achieve the asymptotic unbiasedness and normality of $\widetilde{\bm{\beta}}$ that will be established in Theorem 4. This choice of $J_N$ does not lead to the optimal convergence rate of the by-products $\widehat{\alpha}({\bf x})$ and $\widetilde{\mathcal{S}}_{\widehat{\alpha},y}(c)$. To further refine these estimators, one just needs to take $J_N\asymp N^{1/5}$ and carry out Steps I and II.
This leads to the $N^{-2/5}$-convergence of $\widehat{\alpha}({\bf x})-\bar{\alpha}({\bf x})$ and $\widetilde{\mathcal{S}}_{\widehat{\alpha},y}(c)-\bar{\mathcal{S}}_{\bar{\alpha},y}(c)$, an improvement compared to the current $N^{-3/8}$-convergence. However, the estimator derived with $J_N\asymp N^{1/5}$ cannot ensure the desirable parametric rate and asymptotic normality of $\widetilde{\bm{\beta}}_0$ and $\widetilde{\bm{\beta}}_1$ obtained in Step III. See, e.g., Chen (2007) for more relevant results.
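The two rates quoted above follow from simple exponent arithmetic. The worked computation below is a sketch that takes the $\sqrt{J_N/N}$ scaling from Theorems 2 and 3 as the stochastic error and, as an assumption not stated explicitly in this section, the usual $O(J_N^{-2})$ sieve approximation bias for twice continuously differentiable univariate components.

```latex
\begin{align*}
J_N \asymp N^{1/4}:&\quad \sqrt{J_N/N} \asymp N^{(1/4-1)/2} = N^{-3/8},
 && \sqrt{N}\,J_N^{-2} \asymp N^{1/2-1/2} = 1,\\
J_N \asymp N^{1/5}:&\quad \sqrt{J_N/N} \asymp N^{(1/5-1)/2} = N^{-2/5},
 && \sqrt{N}\,J_N^{-2} \asymp N^{1/2-2/5} = N^{1/10} \to \infty,\\
\text{balance:}&\quad J_N^{-2} \asymp \sqrt{J_N/N}
 \;\Longleftrightarrow\; J_N^{5} \asymp N
 \;\Longleftrightarrow\; J_N \asymp N^{1/5}.
\end{align*}
```

Under this assumed bias order, $J_N\asymp N^{1/5}$ balances bias and stochastic error for the nonparametric by-products, whereas the bias is negligible at the $\sqrt{N}$ scale only when $J_N\gg N^{1/4}$, which matches the lower bound imposed in Theorem 2.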

Finally, we establish the convergence properties of $\widetilde{\bm{\beta}}_0$ and $\widetilde{\bm{\beta}}_1$, which reveal the $n^{1/2}$-consistency and asymptotic normality of the TUBE estimator $\widetilde{\bm{\beta}}$.

Theorem 4

Under all assumptions in Theorem 2, both $\widetilde{\bm{\beta}}_0$ and $\widetilde{\bm{\beta}}_1$ converge to $\bar{\bm{\beta}}$ in probability, and $\{\sqrt{n}(\widetilde{\bm{\beta}}_0-\bar{\bm{\beta}}),\sqrt{N}(\widetilde{\bm{\beta}}_1-\bar{\bm{\beta}})\}$ converges weakly to a zero-mean Gaussian distribution.

4 Simulation

We conduct comprehensive simulation studies to evaluate the finite-sample performance of the proposed method. Let Binomial$\{n,p\}$ denote the binomial distribution with $n$ trials and a success probability of $p$. To generate the risk factors $\mathbf{G}=(G_1,\ldots,G_q)^{\mathsf{T}}$, we consider $q=4$ with $G_1\sim\mathrm{N}(0,1)$, and $G_2$, $G_3$, $G_4$ generated independently from Binomial$\{2,0.6\}$. For the generation of the unobserved true outcome $Y$ and the EHR surrogates $\mathbf{X}$, we consider the following three settings:

  • (a)

    $Y\sim\textrm{Bernoulli}\{g(\mathbf{G}^{\mathsf{T}}\bm{\beta})\}$ where $\bm{\beta}^{*}=(-4.6,1.6,1.6,1.6,1.6)^{\mathsf{T}}$; and $\mathbf{X}=\{Y+0.5(1-Y)+\epsilon_1,\;Y+0.5(1-Y)+\epsilon_2,\;0.5Y+0.25(1-Y)+\epsilon_3\}^{\mathsf{T}}$, where $\epsilon_1$, $\epsilon_2$, $\epsilon_3$ are independent standard normal noises.

  • (b)

    $Y\sim\textrm{Bernoulli}\{g(G_1+G_1^2-\cos(G_1)-G_2-G_3-G_4+2)\}$, with $\mathbf{X}$ generated given $Y$ in the same way as in (a).

  • (c)

    $Y\sim\textrm{Bernoulli}\{g(-G_1+G_1^2+\sin(G_1)-G_2-G_3-G_4+1)\}$; and $\mathbf{X}=\{Y+0.5(1-Y)+0.005G_1+\epsilon_1,\;Y+0.5(1-Y)+0.005G_1+\epsilon_2,\;0.5Y+0.25(1-Y)+0.005G_1+\epsilon_3\}^{\mathsf{T}}$, where $\epsilon_1$, $\epsilon_2$, $\epsilon_3$ are independent standard normal noises.

In all settings, we set $N=10000$ and generate $Y^{*}$ from $\textrm{Binomial}\{2,\textrm{expit}(-2+4Y+0.1_{3}^{\mathsf{T}}\mathbf{X})\}$. As discussed earlier, $Y^{*}$ is supposed to be an imperfect but more informative outcome compared to $\mathbf{X}$. Our setup mimics this by imposing a much stronger effect of $Y$ on $Y^{*}$. We also let the number of $Y^{*}$ labels, $n$, range from $100$ to $1000$ to investigate its influence on the efficiency of the methods.
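
The data-generating process described above can be sketched in a few lines of NumPy. This is an illustrative reading of the setup rather than the authors' simulation code, and it rests on several assumptions: the link $g$ is taken to be the logistic function, the first entry of $\bm{\beta}^{*}$ in setting (a) is treated as an intercept, the term $0.1_{3}^{\mathsf{T}}\mathbf{X}$ is read as $0.1(X_1+X_2+X_3)$, and the $n$ labeled subjects are drawn as a simple random subsample. The helper name `simulate_setting` is hypothetical.

```python
import numpy as np

def expit(t):
    """Logistic function; assumed here to be the link g(.) in the text."""
    return 1.0 / (1.0 + np.exp(-t))

def simulate_setting(setting, N=10000, n=500, seed=0):
    """Draw one data set under setting 'a', 'b', or 'c' (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    G1 = rng.standard_normal(N)                 # G1 ~ N(0, 1)
    G234 = rng.binomial(2, 0.6, size=(N, 3))    # G2, G3, G4 ~ Binomial{2, 0.6}
    G = np.column_stack([G1, G234])

    if setting == "a":
        # Assumption: the first entry of beta* is an intercept.
        beta_star = np.array([-4.6, 1.6, 1.6, 1.6, 1.6])
        lin = beta_star[0] + G @ beta_star[1:]
    elif setting == "b":
        lin = G1 + G1**2 - np.cos(G1) - G234.sum(axis=1) + 2.0
    elif setting == "c":
        lin = -G1 + G1**2 + np.sin(G1) - G234.sum(axis=1) + 1.0
    else:
        raise ValueError("setting must be 'a', 'b', or 'c'")
    Y = rng.binomial(1, expit(lin))

    # Surrogate means given Y; setting (c) adds the small 0.005 * G1 term.
    mean_X = np.column_stack([Y + 0.5 * (1 - Y),
                              Y + 0.5 * (1 - Y),
                              0.5 * Y + 0.25 * (1 - Y)])
    leak = 0.005 * G1 if setting == "c" else np.zeros(N)
    X = mean_X + leak[:, None] + rng.standard_normal((N, 3))

    # Assumption: 0.1_3^T X is read as 0.1 * (X1 + X2 + X3).
    Y_star = rng.binomial(2, expit(-2.0 + 4.0 * Y + 0.1 * X.sum(axis=1)))

    # Assumption: the n labeled subjects are a simple random subsample.
    delta = np.zeros(N, dtype=int)
    delta[rng.choice(N, size=n, replace=False)] = 1
    return {"G": G, "X": X, "Y": Y, "Y_star": Y_star, "delta": delta}
```

In an analysis mimicking the paper's setup, only `Y_star * delta`, `delta`, `X`, and `G` would be passed to the estimators; `Y` is kept solely for evaluation.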

We consider the following three methods for comparison: (1) the simple approach, referred to as Naive-Logistic, that directly uses the label $Y^{*}$ as the outcome for analysis; (2) our main benchmark, Hong et al. (2019), which uses the composite likelihood approach with parametric modeling on $\mathbf{X}$ and $\mathbf{G}$; (3) the proposed TUBE approach with $\bm{\psi}(\mathbf{G})=(\bm{\psi}_1(G_1),G_2,G_3,G_4)$ and the basis functions $\bm{\varphi}_j$ and $\bm{\psi}_1(G_1)$ specified as natural splines with $4$ degrees of freedom. Note that the method of Hong et al. (2019) is fully parametric and thus incurs model misspecification in settings (b) and (c) due to the non-linearity of $Y\sim\mathbf{G}$. In setting (c), we introduce a small indirect effect of $\mathbf{G}$ on $\mathbf{X}$ given $Y$ that moderately breaks our key independence assumption $\mathbf{X}\perp\mathbf{G}\mid Y$. This is to examine the sensitivity to a (slight) violation of this assumption.
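
For readers unfamiliar with natural-spline bases such as the one used for $\bm{\psi}_1(G_1)$, the sketch below builds a four-column natural cubic spline basis (no intercept column) with the textbook truncated-power construction. The paper does not state which software or knot placement was used, so the construction, the knot quantiles, and the function name `natural_spline_basis` are all assumptions made for illustration.

```python
import numpy as np

def natural_spline_basis(x, knots):
    """Natural cubic spline basis without intercept (truncated-power form).

    With K knots this returns K - 1 columns: x itself plus K - 2 cubic
    pieces constrained to be linear beyond the two boundary knots.
    """
    x = np.asarray(x, dtype=float)
    knots = np.sort(np.asarray(knots, dtype=float))
    K = len(knots)

    def d(k):
        # Scaled difference of truncated cubics anchored at knot k and the last knot.
        num = np.clip(x - knots[k], 0.0, None) ** 3 - np.clip(x - knots[-1], 0.0, None) ** 3
        return num / (knots[-1] - knots[k])

    cols = [x] + [d(k) - d(K - 2) for k in range(K - 2)]
    return np.column_stack(cols)

# Five knots give 4 degrees of freedom; the quantile levels are hypothetical.
g1 = np.random.default_rng(0).standard_normal(1000)
knots = np.quantile(g1, [0.05, 0.25, 0.5, 0.75, 0.95])
psi1 = natural_spline_basis(g1, knots)   # shape (1000, 4)
```

Other parameterizations (for example B-spline based ones) span the same function space, so fitted values would agree even though individual coefficients differ.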

The parameters of interest include $\bm{\beta}$, the logistic model coefficients obtained by regressing $Y$ against $\mathbf{G}$, as well as the accuracy parameter AUC of $Y$ against the phenotyping score obtained in each method. The population-level parameters of $\bm{\beta}$ and $\mathbf{G}$ are computed by generating an extremely large sample. Our evaluation metrics include the mean squared error (MSE) in Figure 2; the percent bias, i.e., the ratio between the absolute bias and the root MSE, in Figure 3; and the coverage probability (CP) of the 95% CI computed using the standard resampling bootstrap procedure in Figure 4. The results in Figures 2-4 are based on $500$ simulation replications. For the multi-dimensional $\bm{\beta}$, we only present the average performance over $\beta_1,\ldots,\beta_4$ in these figures; the element-wise results can be found in the tables of Appendix B.
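
The three metrics above can be computed from the replicated estimates as in the following minimal sketch. The function name `summarize_replicates` and its argument layout are hypothetical, and the confidence limits are assumed to have been produced beforehand (for instance by a bootstrap).

```python
import numpy as np

def summarize_replicates(estimates, ci_lower, ci_upper, truth):
    """MSE, |bias| / RMSE, and CI coverage for one scalar parameter."""
    estimates = np.asarray(estimates, dtype=float)
    ci_lower = np.asarray(ci_lower, dtype=float)
    ci_upper = np.asarray(ci_upper, dtype=float)

    bias = estimates.mean() - truth
    mse = np.mean((estimates - truth) ** 2)
    pct_bias = abs(bias) / np.sqrt(mse)          # "percent bias" as defined in the text
    coverage = np.mean((ci_lower <= truth) & (truth <= ci_upper))
    return {"mse": mse, "pct_bias": pct_bias, "cp": coverage}

# Toy input: four replicates of an estimator whose target is 1.0.
est = np.array([0.9, 1.1, 1.0, 1.2])
summary = summarize_replicates(est, est - 0.15, est + 0.15, truth=1.0)
```

For a vector-valued parameter such as $\bm{\beta}$, the same function would be applied coordinate by coordinate and the results averaged, matching the averaged presentation described above.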

Figure 2: Mean squared error (MSE) for estimators of the genetic effects $\bm{\beta}$ and the AUC of the phenotyping score in different settings introduced in Section 4.
Figure 3: Absolute bias/RMSE for estimators of the genetic effects $\bm{\beta}$ and the AUC of the phenotyping score in different settings introduced in Section 4.
Figure 4: Coverage probabilities (CP) for estimators of the genetic effects $\bm{\beta}$ and the AUC of the phenotyping score in different settings introduced in Section 4.

In all settings, Naive-Logistic shows large MSEs and percent biases due to the errors of $Y^{*}$ in measuring the true $Y$. In setting (a), TUBE attains performance close to that of the benchmark method of Hong et al. (2019), which relies on a fully parametric modeling strategy and does not encounter model misspecification in this setting. Specifically, the percentage difference in MSE between the two methods is smaller than $5\%$ on all parameters when $n\geq 500$ in setting (a). Also, both methods attain sufficiently small percent bias and desirable coverage probability on $\bm{\beta}$ and AUC. Thus, although it may seem redundant to use the more complex semiparametric modeling strategy of TUBE instead of Hong et al. (2019) when the true models are indeed linear and parametric, this complexity does not cost TUBE validity or efficiency. This result is in line with our conclusion in Section 3 that the sieve estimators do not impact the parametric rate of our estimator for $\bm{\beta}$ due to under-smoothing.

In settings (b) and (c), under which the fully parametric method of Hong et al. (2019) suffers from severe model misspecification, TUBE achieves substantially better performance than Hong et al. (2019) and ensures the validity of inference. For example, under setting (b) with $n=500$, the average MSE of TUBE on $\bm{\beta}$ is more than 90% smaller than that of Hong et al. (2019). Also, TUBE successfully maintains a small percent bias (5%–10%) and appropriate coverage probability, while Hong et al. (2019) fails to provide valid inference, with average coverage rates around 30% below the nominal level of 95% in setting (b). This substantial improvement of TUBE results from the nonparametric construction in our Steps I and II, which protects our approach against bias due to the nonlinear effects.

In addition, we notice that as the labeled sample size $n$ increases, the MSEs of TUBE on $\bm{\beta}$ and AUC gradually decrease, as $Y^{*}$ provides additional information over $\mathbf{X}$. For example, when $n$ increases from $100$ to $500$, TUBE's MSE on AUC decreases by more than $50\%$ in all settings. Recall that in practice and in our simulation setup, $Y^{*}$ is usually more informative than $\mathbf{X}$ even though both of them contain errors in measuring the true $Y$. Thus, moderately increasing the number of $Y^{*}$ labels could result in an efficiency gain even with the total sample size $N$ unchanged. Meanwhile, we do not see an improvement of Naive-Logistic and Hong et al. (2019) as $n$ increases in settings (b) and (c), probably because of their large bias.

5 Real Example

The rising incidence of Type II diabetes mellitus (T2D) in recent years has raised great concern in health. Previous genome-wide association studies (GWAS) have identified many genetic variations associated with insulin resistance or inadequate insulin production contributing to T2D (Mahajan et al., 2018). Consequently, polygenic risk scores (GRS) have been developed to predict an individual's genetic risk of developing T2D (He et al., 2021). These advancements provide great potential for precision medicine approaches in the prevention and management of T2D. In this application, we study the Mass General Brigham (MGB) biobank data (Castro et al., 2022) with the primary goal of building a genetic risk prediction model for T2D using its GRS and demographic information.

Our data set includes $N=16,963$ MGB biobank participants up to 2021 with their available EHR features updated for the same year. Their risk factors $\mathbf{G}$ contain $G_1$, a one-dimensional GRS for T2D derived using the reported variants and effect sizes of Mahajan et al. (2018), as well as gender, denoted as $G_2$ ($G_2=1$ for Female). The EHR surrogates $\mathbf{X}$ include $X_1$, the log-transformed total count of the International Classification of Diseases (ICD) codes for T2D, and $X_2$, the value of hemoglobin A1C obtained via laboratory tests. In addition, we have collected $Y^{*}$ on a subset of $n=269$ patients as the manual chart review label for T2D status created by clinicians in 2014. Due to the gap between the time windows of data collection, $Y^{*}$ is an imperfect label for the true T2D status $Y$, with its potential measurement error coming from the missing information between 2014 and 2021, as well as the switch of the ICD system from version 9 to 10 around 2015 at MGB. For the purpose of validation, we also extract the chart review labels created by clinicians according to all information up to 2021 on a random subsample of the data with size $n_v=220$. These labels are closer to (arguably identical to) the true T2D status $Y$ and are only used for validation and evaluation of the estimators trained on the set $\mathscr{O}=\{\mathbf{O}_i=(Y^{*}_i\delta_i,\delta_i,\mathbf{X}_i,\mathbf{G}_i):i=1,2,\ldots,N\}$.

In addition to Hong et al. 2019 and Naive-Logistic studied in Section 4, we also include four simple benchmark estimators, obtained through logistic regression against $\mathbf{G}$ using $I(\mathrm{ICD}\geq 1)$, $I(\mathrm{ICD}\geq 2)$, $I(\mathrm{A1C}\geq 5.7)$, and $I(\mathrm{A1C}\geq 6.4)$, respectively, as the binary outcomes. All of them are common and convenient ways to screen subjects with T2D and are frequently used in existing biomedical studies and practice. As a secondary analysis, we also estimate the AUC of the two important surrogates, ICD and A1C, using the imputation for $Y$ in TUBE and the other methods, except for the aforementioned approaches that directly use ICD or A1C to construct the outcome. This aim is slightly different from evaluating the derived phenotyping score $\widehat{\alpha}(\mathbf{X})$ considered in Sections 2 and 4, but it can be realized using nearly the same strategy and is typically more useful for clinicians and researchers in practice. We use 200 bootstrap replicates to quantify the variance of all the estimators. The resulting estimates and their standard errors are presented in Table 1.

Using the validation set with the true label $Y$, we obtain a validation estimator $\widehat{\bm{\beta}}_v$ and evaluate the AUC of ICD and A1C. Evaluation metrics of the estimators for $\bm{\beta}$ include: (1) mean square prediction error (MSPE), defined as the sample mean of $\{g(\mathbf{G}_i^{\mathsf{T}}\widehat{\bm{\beta}}_v)-g(\mathbf{G}_i^{\mathsf{T}}\widehat{\bm{\beta}})\}^2$; (2) deviance of the logistic model evaluated on the target data; (3) classifier's correlation (Class. Cor) with $\widehat{\bm{\beta}}_v$, i.e., the sample correlation of $I(g(\mathbf{G}_i^{\mathsf{T}}\widehat{\bm{\beta}}_v)>c)$ and $I(g(\mathbf{G}_i^{\mathsf{T}}\widehat{\bm{\beta}})>c)$, where $c$ is the sample mean of $g(\mathbf{G}_i^{\mathsf{T}}\widehat{\bm{\beta}}_v)$; and (4) false classification rate (False Class.) compared to $\widehat{\bm{\beta}}_v$, i.e., the empirical probability of $I(g(\mathbf{G}_i^{\mathsf{T}}\widehat{\bm{\beta}}_v)>c)\neq I(g(\mathbf{G}_i^{\mathsf{T}}\widehat{\bm{\beta}})>c)$. The evaluation results are presented in Table 2.
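
Three of these metrics (MSPE, Class. Cor, and False Class.) follow directly from their definitions and can be sketched as below. The sketch assumes that $g$ is the logistic function and that the design matrix already carries a leading column of ones for the intercept; the function name `compare_to_validation` is hypothetical. The deviance metric is omitted because its exact normalization is not spelled out in the text.

```python
import numpy as np

def expit(t):
    """Logistic function, assumed to be the link g(.)."""
    return 1.0 / (1.0 + np.exp(-t))

def compare_to_validation(design, beta_hat, beta_v):
    """MSPE, classifier correlation, and false classification rate
    of beta_hat relative to the validation estimator beta_v."""
    p_v = expit(design @ beta_v)       # predicted risk under the validation fit
    p_hat = expit(design @ beta_hat)   # predicted risk under the candidate fit
    c = p_v.mean()                     # threshold: sample mean of validation risks
    cls_v = (p_v > c).astype(float)
    cls_hat = (p_hat > c).astype(float)

    mspe = np.mean((p_v - p_hat) ** 2)
    # The correlation is undefined when either classifier is constant.
    if cls_v.std() == 0.0 or cls_hat.std() == 0.0:
        class_cor = float("nan")
    else:
        class_cor = np.corrcoef(cls_v, cls_hat)[0, 1]
    false_class = np.mean(cls_v != cls_hat)
    return {"mspe": mspe, "class_cor": class_cor, "false_class": false_class}
```

By construction, plugging in the validation estimator itself returns an MSPE of 0, a classifier correlation of 1, and a false classification rate of 0, consistent with the Validation row of Table 2.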

| Method | $\beta_0$ (Intercept) | $\beta_1$ (GRS) | $\beta_2$ (Gender) | AUC(ICD) | AUC(A1C) |
|---|---|---|---|---|---|
| ICD $\geq 1$ | $-0.955_{0.028}$ | $0.649_{0.08}$ | $-0.556_{0.036}$ | | |
| ICD $\geq 2$ | $-1.286_{0.031}$ | $0.795_{0.087}$ | $-0.627_{0.04}$ | | |
| A1C $\geq 5.7$ | $-0.737_{0.027}$ | $0.464_{0.076}$ | $-0.461_{0.034}$ | | |
| A1C $\geq 6.5$ | $-2.1_{0.041}$ | $0.818_{0.115}$ | $-0.618_{0.053}$ | | |
| Naive-Logistic | $-1.386_{0.31}$ | $2.221_{0.639}$ | $-1.572_{0.377}$ | $0.949_{0.016}$ | $0.805_{0.023}$ |
| Hong et al. 2019 | $-1.223_{0.136}$ | $1.204_{0.160}$ | $-0.806_{0.107}$ | $0.856_{0.046}$ | $0.787_{0.035}$ |
| TUBE | $-1.352_{0.215}$ | $1.162_{0.200}$ | $-0.844_{0.140}$ | $0.973_{0.016}$ | $0.894_{0.013}$ |
| Validation | $-1.341_{0.263}$ | $1.007_{0.854}$ | $-0.979_{0.387}$ | $0.983_{0.008}$ | $0.872_{0.036}$ |

Table 1: Estimators for the T2D genetic model coefficient $\bm{\beta}$ and the AUCs of ICD and A1C, with their empirical standard errors presented as subscripts.

| Method | MSPE | Deviance | Class. Cor | False Class. |
|---|---|---|---|---|
| ICD $\geq 1$ | $0.0064$ | $0.004$ | $0.20$ | $0.46$ |
| ICD $\geq 2$ | $0.0008$ | $-0.014$ | $0.81$ | $0.10$ |
| A1C $\geq 5.7$ | $0.0156$ | $0.029$ | $0$ | $0.50$ |
| A1C $\geq 6.4$ | $0.0069$ | $0.010$ | $0.12$ | $0.48$ |
| Naive-Logistic | $0.0034$ | $0.000$ | $0.40$ | $0.36$ |
| Hong et al. 2019 | $0.0011$ | $-0.013$ | $0.81$ | $0.10$ |
| TUBE | $\mathbf{0.0002}$ | $\mathbf{-0.017}$ | $\mathbf{0.95}$ | $\mathbf{0.03}$ |
| Validation | $0$ | $-0.017$ | $1$ | $0$ |

Table 2: Estimation performance in the T2D genetic model $\bm{\beta}$ evaluated using the metrics introduced in Section 5.

Among all methods under comparison, TUBE attains the closest point estimates to the validation estimator in terms of both $\bm{\beta}$ and AUC. For example, the AUC of A1C evaluated using TUBE-imputed outcomes differs from the validation estimator by only around $0.02$, while all the other estimators show gaps of more than $0.06$ to the validation estimator. The estimation performance in $\bm{\beta}$ is examined more carefully in Table 2, where TUBE achieves the best performance on all metrics among all estimators except for $\widehat{\bm{\beta}}_v$. For example, compared to the recent method proposed by Hong et al. (2019), our method attains more than a $70\%$ reduction in MSPE and a $0.14$ larger classifier's correlation with the validation estimator. These results illustrate the effectiveness of our semiparametric modeling strategy in reducing potential bias due to misspecification. Meanwhile, although TUBE involves more complicated nonparametric regression, it does not result in significant inflation of the standard errors compared to Hong et al. (2019), which is a benefit of using parametric regression (projection) in Step III.

Our estimator of $\bm{\beta}$ reveals that the GRS has a significant positive effect (log(OR)$=1.16$, 95% CI: $[0.77,1.55]$) on the risk of T2D and that men have a significantly higher risk of developing T2D than women in our study cohort. Interestingly, the effect sizes estimated using the four simple EHR outcomes, i.e., $I(\mathrm{ICD}\geq 1)$, $I(\mathrm{ICD}\geq 2)$, $I(\mathrm{A1C}\geq 5.7)$, and $I(\mathrm{A1C}\geq 6.4)$, are all smaller in magnitude than $\beta_1$ and $\beta_2$ estimated by TUBE. As an explanation of this observation, once the error-prone EHR outcomes are converted to binary variables, they have the same scale as the true outcome $Y$ and thus show weaker association with the risk factors than $Y$ due to their measurement errors. This can be justified under the key assumption that ICD and A1C are independent of the baseline risk factors given the true T2D status.
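
The attenuation argument above can be illustrated with a toy simulation that is unrelated to the MGB data. A binary outcome is generated from a logistic model in one risk factor and then misclassified independently of that risk factor; the logistic slope fitted to the noisy outcome comes out smaller than the slope fitted to the true outcome. The sensitivity, specificity, coefficients, and function names below are hypothetical values chosen only for illustration.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(Z, y, n_iter=25):
    """Logistic regression by Newton-Raphson; Z must include an intercept column."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = expit(Z @ beta)
        w = p * (1.0 - p)
        grad = Z.T @ (y - p)
        hess = Z.T @ (Z * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def attenuation_demo(N=100_000, sens=0.8, spec=0.9, seed=1):
    """Return (slope with true outcome, slope with misclassified outcome)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal(N)
    Z = np.column_stack([np.ones(N), G])
    Y = rng.binomial(1, expit(-1.0 + 1.0 * G))        # true outcome, slope 1
    # Misclassification depends on Y only, mimicking conditional independence from G.
    S = np.where(Y == 1, rng.binomial(1, sens, N), rng.binomial(1, 1.0 - spec, N))
    return fit_logistic(Z, Y)[1], fit_logistic(Z, S)[1]

slope_true, slope_noisy = attenuation_demo()
```

In this toy setting the slope for the noisy outcome is shrunk toward zero, which mirrors the direction of the pattern seen for the four simple EHR outcomes in Table 1.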

6 Discussion

In summary, we propose TUBE, a novel unsupervised method for analyzing multiple error-prone EHR outcomes and noisy labels against baseline risk factors, such as genetic variants extracted from EHR-linked biobanks. TUBE incorporates a nonparametric composite regression step, and then uses it to combine the EHR outcomes for phenotyping and to derive a parametric genetic risk model through projection. Compared to existing methods, our semiparametric strategy has two advantages. First, the nonparametric composite construction at the first stage safeguards the unsupervised learning against potential bias due to model misspecification. Second, the parametric genetic risk model obtained through projection enhances interpretability and achieves significantly reduced variance in comparison to a fully nonparametric approach. These advantages are supported by our comprehensive asymptotic analysis, simulations, and a real-world study.

We acknowledge several limitations and potential extensions of our work. First, the validity of our method is vulnerable to severe violation of the conditional independence assumption between the EHR outcomes and the baseline covariates. This issue can be alleviated by incorporating (small) samples with the true labels to calibrate the unsupervised estimator derived from surrogates. Recent advancements in surrogate-assisted semi-supervised learning (Zhang et al., 2022; Hou et al., 2023b) are particularly relevant to this discussion. Second, our current setup focuses on binary disease status. In current biomedical studies, the time to onset of clinical events (e.g., cancer relapse) is often not readily available, and its EHR surrogates are subject to measurement errors. Simple estimates of the event time based on billing or procedure codes may poorly approximate the true outcome and lead to bias. Therefore, extending TUBE to incorporate multiple sources of imperfect and temporal endpoints under the survival setting is a potential direction for future research. In addition, our current method only accommodates low-dimensional genetic variants and a single disease or phenotype. Recent large-scale genome- and phenome-wide studies (e.g., Huang and Labrecque, 2019; Verma et al., 2023) provide a strong motivation for extensions that accommodate high-dimensional or machine learning estimates of the genetic risk models and multi-phenotype studies.

References

  • Athey et al., (2019) Athey, S., Chetty, R., Imbens, G. W., and Kang, H. (2019). The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Technical report, National Bureau of Economic Research.
  • Banda et al., (2017) Banda, J. M., Halpern, Y., Sontag, D., and Shah, N. H. (2017). Electronic phenotyping with aphrodite and the observational health sciences and informatics (ohdsi) data network. AMIA Summits on Translational Science Proceedings, 2017:48.
  • Banda et al., (2018) Banda, J. M., Seneviratne, M., Hernandez-Boussard, T., and Shah, N. H. (2018). Advances in electronic phenotyping: from rule-based definitions to machine learning models. Annual Review of Biomedical Data Science, 1:53–68.
  • Bonhomme et al., (2016) Bonhomme, S., Jochmans, K., Robin, J.-M., et al. (2016). Estimating multivariate latent-structure models. The Annals of Statistics, 44(2):540–563.
  • Castro et al., (2022) Castro, V. M., Gainer, V., Wattanasin, N., Benoit, B., Cagan, A., Ghosh, B., Goryachev, S., Metta, R., Park, H., Wang, D., et al. (2022). The mass general brigham biobank portal: an i2b2-based data repository linking disparate and high-dimensional patient data to support multimodal analytics. Journal of the American Medical Informatics Association, 29(4):643–651.
  • Chen, (2007) Chen, X. (2007). Chapter 76 large sample sieve estimation of semi-nonparametric models. volume 6 of Handbook of Econometrics, pages 5549–5632. Elsevier.
  • Denny et al., (2013) Denny, J. C., Bastarache, L., Ritchie, M. D., Carroll, R. J., Zink, R., Mosley, J. D., Field, J. R., Pulley, J. M., Ramirez, A. H., Bowton, E., et al. (2013). Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nature biotechnology, 31(12):1102–1111.
  • He et al., (2021) He, Y., Lakhani, C. M., Rasooly, D., Manrai, A. K., Tzoulaki, I., and Patel, C. J. (2021). Comparisons of polyexposure, polygenic, and clinical risk scores in risk prediction of type 2 diabetes. Diabetes Care, 44(4):935–943.
  • Hong et al., (2019) Hong, C., Liao, K. P., and Cai, T. (2019). Semi-supervised validation of multiple surrogate outcomes with application to electronic medical records phenotyping. Biometrics, 75(1):78–89.
  • Hou et al., (2023a) Hou, J., Chan, S. F., Wang, X., and Cai, T. (2023a). Risk prediction with imperfect survival outcome information from electronic health records. Biometrics, 79(1):190–202.
  • Hou et al., (2023b) Hou, J., Guo, Z., and Cai, T. (2023b). Surrogate assisted semi-supervised inference for high dimensional risk prediction. Journal of Machine Learning Research, 24(265):1–58.
  • Hou et al., (2021) Hou, J., Mukherjee, R., and Cai, T. (2021). Efficient and robust semi-supervised estimation of ate with partially annotated treatment and response. arXiv preprint arXiv:2110.12336.
  • Huang et al., (2018) Huang, J., Duan, R., Hubbard, R. A., Wu, Y., Moore, J. H., Xu, H., and Chen, Y. (2018). Pie: A prior knowledge guided integrated likelihood estimation method for bias reduction in association studies using electronic health records data. Journal of the American Medical Informatics Association, 25(3):345–352.
  • Huang and Labrecque, (2019) Huang, J. Y. and Labrecque, J. A. (2019). From gwas to phewas: the search for causality in big data. The Lancet Digital Health, 1(3):e101–e103.
  • Kallus and Mao, (2020) Kallus, N. and Mao, X. (2020). On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. arXiv preprint arXiv:2003.12408.
  • Kohane, (2011) Kohane, I. S. (2011). Using electronic health records to drive discovery in disease genomics. Nature Reviews Genetics, 12(6):417–428.
  • Liao et al., (2015) Liao, K. P., Cai, T., Savova, G. K., Murphy, S. N., Karlson, E. W., Ananthakrishnan, A. N., Gainer, V. S., Shaw, S. Y., Xia, Z., Szolovits, P., et al. (2015). Development of phenotype algorithms using electronic medical records and incorporating natural language processing. bmj, 350:h1885.
  • Liao et al., (2013) Liao, K. P., Kurreeman, F., Li, G., Duclos, G., Murphy, S., Guzman, R., Cai, T., Gupta, N., Gainer, V., Schur, P., et al. (2013). Associations of autoantibodies, autoimmune risk alleles, and clinical diagnoses from the electronic medical records in rheumatoid arthritis cases and non–rheumatoid arthritis controls. Arthritis & Rheumatology, 65(3):571–581.
  • Liao et al., (2019) Liao, K. P., Sun, J., Cai, T. A., Link, N., Hong, C., Huang, J., Huffman, J. E., Gronsbell, J., Zhang, Y., Ho, Y.-L., Castro, V., Gainer, V., Murphy, S. N., O’Donnell, C. J., Gaziano, J. M., Cho, K., Szolovits, P., Kohane, I. S., Yu, S., and Cai, T., with the VA Million Veteran Program (2019). High-throughput multimodal automated phenotyping (MAP) with application to PheWAS. Journal of the American Medical Informatics Association, 26(11):1255–1262.
  • Mahajan et al., (2018) Mahajan, A., Taliun, D., Thurner, M., Robertson, N. R., Torres, J. M., Rayner, N. W., Payne, A. J., Steinthorsdottir, V., Scott, R. A., Grarup, N., et al. (2018). Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nature genetics, 50(11):1505–1513.
  • Murphy and Van der Vaart, (2000) Murphy, S. A. and Van der Vaart, A. W. (2000). On profile likelihood. Journal of the American Statistical Association, 95(450):449–465.
  • Shivade et al., (2014) Shivade, C., Raghavan, P., Fosler-Lussier, E., Embi, P. J., Elhadad, N., Johnson, S. B., and Lai, A. M. (2014). A review of approaches to identifying patient phenotype cohorts using electronic health records. Journal of the American Medical Informatics Association, 21(2):221–230.
  • Van der Vaart, (2000) Van der Vaart, A. W. (2000). Asymptotic statistics, volume 3. Cambridge university press.
  • Verma et al., (2023) Verma, A., Huffman, J. E., Rodriguez, A., Conery, M., Liu, M., Ho, Y.-L., Kim, Y., Heise, D. A., Guare, L., Panickan, V. A., et al. (2023). Diversity and scale: genetic architecture of 2,068 traits in the va million veteran program. medRxiv.
  • Wells et al., (2019) Wells, Q. S., Gupta, D. K., Smith, J. G., Collins, S. P., Storrow, A. B., Ferguson, J., Smith, M. L., Pulley, J. M., Collier, S., Wang, X., et al. (2019). Accelerating biomarker discovery through electronic health records, automated biobanking, and proteomics. Journal of the American College of Cardiology, 73(17):2195–2205.
  • Yu et al., (2017) Yu, S., Ma, Y., Gronsbell, J., Cai, T., Ananthakrishnan, A. N., Gainer, V. S., Churchill, S. E., Szolovits, P., Murphy, S. N., Kohane, I. S., et al. (2017). Enabling phenotypic big data with phenorm. Journal of the American Medical Informatics Association, 25(1):54–60.
  • Yu et al., (2019) Yu, T., Li, P., Qin, J., et al. (2019). Maximum smoothed likelihood component density estimation in mixture models with known mixing proportions. Electronic Journal of Statistics, 13(2):4035–4078.
  • Zhang et al., (2019a) Zhang, L., Ding, X., Ma, Y., Muthu, N., Ajmal, I., Moore, J. H., Herman, D. S., and Chen, J. (2019a). Electronic health record phenotyping with internally assessable performance (phiap) using anchor-positive and unlabeled patients. arXiv preprint arXiv:1902.10060.
  • Zhang et al., (2019b) Zhang, Y., Cai, T., Yu, S., Cho, K., Hong, C., Sun, J., Huang, J., Ho, Y.-L., Ananthakrishnan, A. N., Xia, Z., et al. (2019b). High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (phecap). Nature protocols, 14(12):3426–3444.
  • Zhang et al., (2022) Zhang, Y., Liu, M., Neykov, M., and Cai, T. (2022). Prior adaptive semi-supervised learning with application to ehr phenotyping. The Journal of Machine Learning Research, 23(1):3617–3641.
  • Zheng and Wu, (2019) Zheng, C. and Wu, Y. (2019). Nonparametric estimation of multivariate mixtures. Journal of the American Statistical Association, pages 1–16.

Appendix

Appendix A Additional implementation details

Algorithm A1 EM algorithm for maximizing the non-parametric log-likelihood function (5).

Input: Observed data $\mathscr{O}=\{\mathbf{O}_{i}=(Y^{*}_{i}\delta_{i},\delta_{i},\mathbf{X}_{i},\mathbf{G}_{i}):i=1,2,\ldots,N\}$, and the phenotyping score $\widehat{\alpha}(\mathbf{x})$ derived in Algorithm 1.  
Initialize with $\widetilde{\bm{\eta}}_{\widehat{\alpha}}^{(0)}=\{\widetilde{\mathcal{S}}_{\widehat{\alpha},y}^{(0)}(\cdot),\widetilde{\bm{\lambda}}^{(0)},\widetilde{\bm{\xi}}^{(0)}:y=0,1\}$ introduced in Algorithm A2. Iterate on the following two steps for $r=0,1,\ldots,R$ until convergence.  
E-step. For each subject $i$, impute the probability for $Y_{i}$ conditional on $Y^{*}_{i}$ (if observed) or $\widehat{\alpha}(\mathbf{X}_{i})$:

$$\widetilde{Y}_{i0}^{(r+1)}=\delta_{i}\times\frac{\widetilde{\lambda}_{1Y_{i}^{*}}^{(r)}\,g_{1}\{\bm{\psi}^{\mathsf{T}}(\mathbf{G}_{i})\widetilde{\bm{\xi}}^{(r)}\}}{\sum_{y=0}^{1}\widetilde{\lambda}_{yY_{i}^{*}}^{(r)}\,g_{y}\{\bm{\psi}^{\mathsf{T}}(\mathbf{G}_{i})\widetilde{\bm{\xi}}^{(r)}\}};\quad\widetilde{Y}_{i1}^{(r+1)}=\frac{-\nabla\widetilde{\mathcal{S}}^{(r)}_{\widehat{\alpha},1}\{\widehat{\alpha}(\mathbf{X}_{i})\}\,g_{1}\{\bm{\psi}^{\mathsf{T}}(\mathbf{G}_{i})\widetilde{\bm{\xi}}^{(r)}\}}{-\sum_{y=0}^{1}\nabla\widetilde{\mathcal{S}}^{(r)}_{\widehat{\alpha},y}\{\widehat{\alpha}(\mathbf{X}_{i})\}\,g_{y}\{\bm{\psi}^{\mathsf{T}}(\mathbf{G}_{i})\widetilde{\bm{\xi}}^{(r)}\}}.$$

M-step. Update $\bm{\eta}_{\widehat{\alpha}}$ through the MLE specified with the imputed outcomes from the E-step:

$$\widetilde{\lambda}_{yk}^{(r+1)}=\frac{\sum_{i=1}^{n}I(Y^{*}_{i}=k)\{\widetilde{Y}_{i0}^{(r+1)}\}^{y}\{1-\widetilde{Y}_{i0}^{(r+1)}\}^{1-y}}{\sum_{i=1}^{n}\{\widetilde{Y}_{i0}^{(r+1)}\}^{y}\{1-\widetilde{Y}_{i0}^{(r+1)}\}^{1-y}};\quad k=0,1,\ldots,K$$

$$\widetilde{\bm{\xi}}^{(r+1)}=\operatorname*{argmax}_{\bm{\xi}}\sum_{i=1}^{n}\ell\left(\widetilde{Y}_{i0}^{(r+1)},\bm{\psi}^{\mathsf{T}}(\mathbf{G}_{i})\bm{\xi}\right)+\sum_{i=1}^{N}\ell\left(\widetilde{Y}_{i1}^{(r+1)},\bm{\psi}^{\mathsf{T}}(\mathbf{G}_{i})\bm{\xi}\right);$$

$$\widetilde{\mathcal{S}}^{(r)}_{\widehat{\alpha},y}(c)=\frac{\sum_{i=1}^{N}I(\widehat{\alpha}(\mathbf{X}_{i})>c)\{\widetilde{Y}_{i1}^{(r+1)}\}^{y}\{1-\widetilde{Y}_{i1}^{(r+1)}\}^{1-y}}{\sum_{i=1}^{N}\{\widetilde{Y}_{i1}^{(r+1)}\}^{y}\{1-\widetilde{Y}_{i1}^{(r+1)}\}^{1-y}},\quad y=0,1.$$

Output: The imputed outcomes $\widetilde{Y}_{i0}=\widetilde{Y}_{i0}^{(R)}$ (if $\delta_{i}=1$) and $\widetilde{Y}_{i1}=\widetilde{Y}_{i1}^{(R)}$ for $i=1,2,\ldots,N$.
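As a rough illustration of the labeled-data part of Algorithm A1, the following Python sketch implements the E-step for $\widetilde{Y}_{i0}$ and the matching update of $\widetilde{\lambda}_{yk}$. It is only a sketch under stated assumptions: it takes $g_1$ to be the logistic function and $g_0 = 1 - g_1$, which this appendix does not spell out, and it encodes $Y^*_i$ as a level index in $\{0,\ldots,K\}$. The function names are hypothetical, and the updates of $\bm{\xi}$ and $\widetilde{\mathcal{S}}_{\widehat{\alpha},y}$ are omitted.

```python
import math


def expit(u):
    """Logistic function; stands in for g_1 (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-u))


def e_step_labeled(lam, y_star, lin_pred, delta):
    """Imputed probability that Y_i = 1 given Y*_i and G_i; 0 where delta_i = 0.

    lam[y][k]   : current estimate of lambda_{yk}
    y_star[i]   : level index of Y*_i in {0, ..., K} (ignored when delta_i = 0)
    lin_pred[i] : psi(G_i)^T xi evaluated at the current xi
    delta[i]    : 1 if Y*_i is observed, else 0
    """
    out = []
    for k, u, d in zip(y_star, lin_pred, delta):
        if d == 0:
            out.append(0.0)
            continue
        p1 = expit(u)                        # assumed g_1
        num = lam[1][k] * p1
        den = num + lam[0][k] * (1.0 - p1)   # assumed g_0 = 1 - g_1
        out.append(num / den)
    return out


def m_step_lambda(y_tilde0, y_star, delta, n_levels):
    """Weighted-frequency update of lambda_{yk} over subjects with delta_i = 1."""
    counts = {0: [0.0] * n_levels, 1: [0.0] * n_levels}
    totals = {0: 0.0, 1: 0.0}
    for p, k, d in zip(y_tilde0, y_star, delta):
        if d == 0:
            continue
        # Weight p for y = 1 and (1 - p) for y = 0, as in the lambda update.
        for y, w in ((1, p), (0, 1.0 - p)):
            counts[y][k] += w
            totals[y] += w
    return {y: [c / totals[y] for c in counts[y]] for y in (0, 1)}
```

Each row of the updated $\widetilde{\bm{\lambda}}$ sums to one by construction, since the numerators over $k$ add up to the shared denominator.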

Algorithm A2 Initialization of the EM Algorithms.

For Algorithm 1, we define $Y^{\dagger}_{i}=I(Y^{*}_{i}=1)$ for subjects $i=1,2,\ldots,n$ and obtain the initial estimators $\widehat{\bm{\xi}}^{(0)},\widehat{\bm{\zeta}}^{(0)},\widehat{\mu}^{(0)}$ through MLE:

$$\widehat{\mu}^{(0)}=\frac{1}{n}\sum_{i=1}^{n}Y^{\dagger}_{i};\quad\widehat{\bm{\xi}}^{(0)}=\operatorname*{argmax}_{\bm{\xi}}\sum_{i=1}^{n}\ell\left(Y^{\dagger}_{i},\bm{\psi}^{\mathsf{T}}(\mathbf{G}_{i})\bm{\xi}\right);\quad\widehat{\bm{\zeta}}_{j}^{(0)}=\operatorname*{argmax}_{\bm{\zeta}_{j}}\sum_{i=1}^{n}\ell\left(Y^{\dagger}_{i},\bm{\varphi}^{\mathsf{T}}_{j}(X_{ij})\bm{\zeta}_{j}\right).$$

For $\widehat{\bm{\lambda}}^{(0)}$, we set $\widetilde{\lambda}_{1K}^{(0)}=0.85$; $\widetilde{\lambda}_{1k}^{(0)}=0.15/K$ for $k=0,1,\ldots,K-1$ and $\widetilde{\lambda}_{00}^{(0)}=0.85$; $\widetilde{\lambda}_{0k}^{(0)}=0.15/K$ for $k=1,\ldots,K$, in the belief that $Y^{*}$ is reliable.  
For Algorithm A1, we set $\widetilde{\bm{\lambda}}^{(0)}=\widehat{\bm{\lambda}}$ and $\widetilde{\bm{\xi}}^{(0)}=\widehat{\bm{\xi}}$ based on the results in Algorithm 1, and take

$$\widetilde{\mathcal{S}}_{\widehat{\alpha},y}^{(0)}(c)=\frac{\sum_{i=1}^{n}I(\widehat{\alpha}(\mathbf{X}_{i})>c)\,I(Y^{\dagger}_{i}=y)}{\sum_{i=1}^{n}I(Y^{\dagger}_{i}=y)},\quad y=0,1.$$
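The two closed-form pieces of this initialization can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, $Y^*$ is assumed to be encoded as a level index in $\{0,\ldots,K\}$, and the likelihood-based initial estimators ($\widehat{\bm{\xi}}^{(0)}$, $\widehat{\bm{\zeta}}_j^{(0)}$) are left out.

```python
def init_lambda(K):
    """Initial lambda with 0.85 on the 'agreeing' level and 0.15/K elsewhere.

    Returns {y: [lambda_{y0}, ..., lambda_{yK}]} for y in {0, 1}.
    """
    off = 0.15 / K
    lam1 = [off] * K + [0.85]   # lambda_{1K} = 0.85, lambda_{1k} = 0.15/K for k < K
    lam0 = [0.85] + [off] * K   # lambda_{00} = 0.85, lambda_{0k} = 0.15/K for k >= 1
    return {0: lam0, 1: lam1}


def init_survival(scores, y_dagger, y):
    """Empirical survival function of the score among subjects with Y-dagger = y.

    scores[i]   : phenotyping score alpha-hat(X_i) for labeled subject i
    y_dagger[i] : I(Y*_i = 1)
    """
    group = [s for s, yd in zip(scores, y_dagger) if yd == y]
    if not group:
        raise ValueError("no labeled subjects with Y-dagger equal to %r" % y)

    def surv(c):
        # Fraction of the group whose score strictly exceeds the cutoff c.
        return sum(s > c for s in group) / len(group)

    return surv
```

For example, `init_lambda(2)` puts mass 0.85 on the extreme level and 0.075 on each of the other two levels in both rows, so each row sums to one.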

Appendix B Additional numerical results

In this section, we attach more complete simulation results as a supplement to the main results presented in Section 4.

Table A1: Biases of parameter estimates over 500 simulations for the regression parameters for genetic effects ($\bm{\beta}$), the area under the curve (AUC) for the classification algorithm, and the errors and/or uncertainties in labels ($\bm{\lambda}$) for settings (a) with linear genetic effects, (b) with nonlinear genetic effects, and (c) with nonlinear genetic effects and slight violation of conditional independence between $\mathbf{G}$ and $\mathbf{X}$.

(a)

| Method | $\beta_0=-4.600$ | $\beta_1=1.600$ | $\beta_2=1.600$ | $\beta_3=1.600$ | $\beta_4=1.600$ | AUC$=0.702$ | $\lambda_1(0)=0.320$ | $\lambda_1(0.5)=0.490$ | $\lambda_1(1)=0.190$ | $\lambda_0(0)=0.700$ | $\lambda_0(0.5)=0.280$ | $\lambda_0(1)=0.030$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Naive-Logistic$_{100}$ | 2.965 | -1.354 | -1.351 | -1.343 | -1.328 | -0.088 | - | - | - | - | - | - |
| Hong et al.$_{100}$ | -0.050 | 0.013 | 0.006 | 0.020 | 0.014 | 0.001 | 0.004 | 0.000 | -0.004 | -0.003 | -0.002 | 0.005 |
| TUBE$_{100}$ | -0.145 | 0.040 | 0.036 | 0.049 | 0.044 | -0.001 | 0.001 | 0.002 | -0.003 | 0.002 | -0.005 | 0.003 |
| Naive-Logistic$_{500}$ | 3.024 | -1.358 | -1.348 | -1.357 | -1.344 | -0.103 | - | - | - | - | - | - |
| Hong et al.$_{500}$ | -0.011 | 0.004 | 0.000 | 0.007 | 0.004 | 0.002 | -0.001 | 0.001 | 0.001 | 0.001 | -0.001 | 0.000 |
| TUBE$_{500}$ | -0.089 | 0.029 | 0.025 | 0.029 | 0.031 | 0.000 | -0.004 | 0.002 | 0.002 | 0.007 | -0.003 | -0.003 |
| Naive-Logistic$_{1000}$ | 3.019 | -1.360 | -1.346 | -1.347 | -1.349 | -0.104 | - | - | - | - | - | - |
| Hong et al.$_{1000}$ | -0.010 | 0.000 | -0.002 | 0.005 | 0.000 | 0.002 | -0.003 | 0.001 | 0.002 | 0.003 | -0.002 | -0.001 |
| TUBE$_{1000}$ | -0.073 | 0.020 | 0.020 | 0.026 | 0.022 | 0.000 | -0.006 | 0.003 | 0.003 | 0.008 | -0.005 | -0.004 |

(b)

| Method | $\beta_0=1.300$ | $\beta_1=0.700$ | $\beta_2=-0.700$ | $\beta_3=-0.700$ | $\beta_4=-0.700$ | AUC$=0.702$ | $\lambda_1(0)=0.320$ | $\lambda_1(0.5)=0.490$ | $\lambda_1(1)=0.190$ | $\lambda_0(0)=0.700$ | $\lambda_0(0.5)=0.280$ | $\lambda_0(1)=0.030$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Naive-Logistic$_{100}$ | -1.879 | -0.514 | 0.472 | 0.497 | 0.496 | -0.080 | - | - | - | - | - | - |
| Hong et al.$_{100}$ | -1.255 | 3.467 | -0.642 | -0.646 | -0.630 | -0.027 | 0.016 | -0.011 | -0.005 | -0.056 | 0.031 | 0.025 |
| TUBE$_{100}$ | 0.020 | 0.016 | -0.024 | -0.009 | -0.011 | -0.003 | -0.001 | 0.003 | -0.002 | 0.003 | -0.004 | 0.001 |
| Naive-Logistic$_{500}$ | -1.853 | -0.507 | 0.495 | 0.490 | 0.500 | -0.097 | - | - | - | - | - | - |
| Hong et al.$_{500}$ | -1.272 | 3.513 | -0.648 | -0.654 | -0.644 | -0.028 | 0.010 | -0.006 | -0.004 | -0.059 | 0.033 | 0.026 |
| TUBE$_{500}$ | 0.011 | 0.013 | -0.017 | -0.007 | -0.004 | -0.001 | -0.002 | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 |
| Naive-Logistic$_{1000}$ | -1.850 | -0.509 | 0.495 | 0.493 | 0.500 | -0.097 | - | - | - | - | - | - |
| Hong et al.$_{1000}$ | -1.281 | 3.524 | -0.650 | -0.652 | -0.643 | -0.028 | 0.008 | -0.002 | -0.005 | -0.060 | 0.033 | 0.027 |
| TUBE$_{1000}$ | 0.004 | 0.008 | -0.012 | -0.003 | -0.002 | -0.001 | -0.007 | 0.004 | 0.003 | 0.001 | -0.001 | 0.000 |

(c)

| Method | $\beta_0=1.300$ | $\beta_1=-0.300$ | $\beta_2=-0.700$ | $\beta_3=-0.700$ | $\beta_4=-0.800$ | AUC$=0.702$ | $\lambda_1(0)=0.320$ | $\lambda_1(0.5)=0.490$ | $\lambda_1(1)=0.190$ | $\lambda_0(0)=0.700$ | $\lambda_0(0.5)=0.280$ | $\lambda_0(1)=0.030$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Naive-Logistic$_{100}$ | -1.925 | 0.239 | 0.570 | 0.550 | 0.567 | -0.083 | - | - | - | - | - | - |
| Hong et al.$_{100}$ | -0.228 | -1.200 | -0.380 | -0.408 | -0.393 | -0.025 | 0.037 | -0.038 | 0.000 | -0.039 | 0.026 | 0.012 |
| TUBE$_{100}$ | 0.090 | 0.046 | -0.021 | -0.032 | -0.029 | -0.009 | 0.017 | -0.007 | -0.010 | -0.007 | 0.004 | 0.002 |
| Naive-Logistic$_{500}$ | -1.887 | 0.227 | 0.564 | 0.562 | 0.563 | -0.100 | - | - | - | - | - | - |
| Hong et al.$_{500}$ | -0.366 | -1.337 | -0.386 | -0.391 | -0.399 | -0.024 | 0.012 | -0.010 | -0.002 | -0.031 | 0.017 | 0.014 |
| TUBE$_{500}$ | 0.065 | 0.044 | -0.013 | -0.023 | -0.019 | -0.003 | -0.008 | 0.003 | 0.005 | 0.005 | -0.003 | -0.001 |
| Naive-Logistic$_{1000}$ | -1.887 | 0.226 | 0.557 | 0.568 | 0.571 | -0.102 | - | - | - | - | - | - |
| Hong et al.$_{1000}$ | -0.340 | -1.476 | -0.452 | -0.446 | -0.456 | -0.025 | 0.012 | -0.009 | -0.003 | -0.035 | 0.020 | 0.015 |
| TUBE$_{1000}$ | 0.060 | 0.037 | -0.017 | -0.019 | -0.014 | -0.003 | -0.003 | -0.001 | 0.004 | 0.002 | -0.001 | -0.001 |

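The entries of Table A1 and Table A2 are Monte Carlo summaries over the simulation replicates. The short Python sketch below shows how such summaries are typically computed, assuming the usual definitions (bias as the average of estimate minus truth, MSE as the average squared error); the appendix does not restate these definitions, and the function name is hypothetical.

```python
from statistics import fmean


def bias_and_mse(estimates, truth):
    """Monte Carlo bias and mean squared error of one parameter.

    estimates : one estimate per simulation replicate
    truth     : the true parameter value used to generate the data
    """
    errors = [est - truth for est in estimates]
    bias = fmean(errors)                          # average signed error
    mse = fmean([err * err for err in errors])    # average squared error
    return bias, mse
```

Under these definitions the MSE equals the squared bias plus the Monte Carlo variance of the estimates. This is consistent with the reported values: for Naive-Logistic$_{100}$ in setting (a), the squared bias of $\beta_0$ is $2.965^2 \approx 8.79$, which accounts for most of the corresponding MSE of 9.092 in Table A2.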
Table A2: Mean square errors (MSE) of parameter estimates over 500 simulations for the regression parameters for genetic effects (𝜷𝜷\bm{\beta}bold_italic_β), the area under the curve (AUC) for the classification algorithm, and the errors and/or uncertainties in labels (𝝀𝝀\bm{\lambda}bold_italic_λ) for settings (a) with linear genetic effects, (b) with nonlinear genetic effects, and (c) with nonlinear genetic effects and slight violation of conditional independence between 𝐆𝐆{\bf G}bold_G and 𝐗𝐗{\bf X}bold_X.
(a)
Method $\beta_{0}=-4.600$ $\beta_{1}=1.600$ $\beta_{2}=1.600$ $\beta_{3}=1.600$ $\beta_{4}=1.600$ AUC=0.702 $\lambda_{1}(0)=0.320$ $\lambda_{1}(0.5)=0.490$ $\lambda_{1}(1)=0.190$ $\lambda_{0}(0)=0.700$ $\lambda_{0}(0.5)=0.280$ $\lambda_{0}(1)=0.030$
Naive-Logistic$_{100}$ 9.092 1.863 1.885 1.870 1.826 0.009 - - - - - -
Hong et al.$_{100}$ 0.678 0.070 0.075 0.085 0.072 0.000 0.005 0.005 0.003 0.012 0.011 0.002
TUBE$_{100}$ 0.780 0.078 0.090 0.098 0.087 0.001 0.005 0.005 0.003 0.013 0.011 0.002
Naive-Logistic$_{500}$ 9.194 1.849 1.830 1.851 1.815 0.011 - - - - - -
Hong et al.$_{500}$ 0.620 0.064 0.070 0.077 0.065 0.000 0.001 0.001 0.001 0.003 0.002 0.000
TUBE$_{500}$ 0.670 0.068 0.077 0.082 0.073 0.000 0.001 0.001 0.001 0.003 0.002 0.001
Naive-Logistic$_{1000}$ 9.137 1.852 1.816 1.819 1.825 0.011 - - - - - -
Hong et al.$_{1000}$ 0.604 0.060 0.065 0.076 0.066 0.000 0.001 0.001 0.000 0.001 0.001 0.000
TUBE$_{1000}$ 0.660 0.064 0.072 0.079 0.076 0.000 0.001 0.001 0.000 0.002 0.001 0.000
(b)
Method $\beta_{0}=1.300$ $\beta_{1}=0.700$ $\beta_{2}=-0.700$ $\beta_{3}=-0.700$ $\beta_{4}=-0.700$ AUC=0.702 $\lambda_{1}(0)=0.320$ $\lambda_{1}(0.5)=0.490$ $\lambda_{1}(1)=0.190$ $\lambda_{0}(0)=0.700$ $\lambda_{0}(0.5)=0.280$ $\lambda_{0}(1)=0.030$
Naive-Logistic$_{100}$ 3.896 0.311 0.302 0.320 0.321 0.008 - - - - - -
Hong et al.$_{100}$ 2.145 13.794 0.564 0.615 0.557 0.001 0.023 0.024 0.014 0.006 0.004 0.001
TUBE$_{100}$ 0.081 0.014 0.015 0.016 0.015 0.001 0.017 0.019 0.009 0.005 0.005 0.001
Naive-Logistic$_{500}$ 3.489 0.265 0.259 0.253 0.262 0.010 - - - - - -
Hong et al.$_{500}$ 2.191 13.569 0.533 0.574 0.548 0.001 0.004 0.004 0.002 0.004 0.002 0.001
TUBE$_{500}$ 0.075 0.013 0.015 0.014 0.014 0.000 0.004 0.004 0.002 0.001 0.001 0.000
Naive-Logistic$_{1000}$ 3.451 0.264 0.251 0.249 0.256 0.010 - - - - - -
Hong et al.$_{1000}$ 2.176 13.639 0.539 0.567 0.541 0.001 0.002 0.002 0.001 0.004 0.001 0.001
TUBE$_{1000}$ 0.065 0.011 0.013 0.012 0.012 0.000 0.002 0.002 0.001 0.001 0.000 0.000
(c)
Method $\beta_{0}=1.300$ $\beta_{1}=-0.300$ $\beta_{2}=-0.700$ $\beta_{3}=-0.700$ $\beta_{4}=-0.800$ AUC=0.702 $\lambda_{1}(0)=0.320$ $\lambda_{1}(0.5)=0.490$ $\lambda_{1}(1)=0.190$ $\lambda_{0}(0)=0.700$ $\lambda_{0}(0.5)=0.280$ $\lambda_{0}(1)=0.030$
Naive-Logistic$_{100}$ 4.043 0.105 0.398 0.378 0.390 - - - - - -
Hong et al.$_{100}$ 2.667 4.855 0.360 0.536 0.440 0.001 0.050 0.052 0.022 0.007 0.005 0.001
TUBE$_{100}$ 0.135 0.016 0.023 0.023 0.025 0.003 0.028 0.029 0.011 0.005 0.005 0.001
Naive-Logistic$_{500}$ 3.629 0.060 0.332 0.330 0.331 0.010 - - - - - -
Hong et al.$_{500}$ 2.297 4.564 0.344 0.335 0.359 0.001 0.012 0.010 0.004 0.003 0.001 0.001
TUBE$_{500}$ 0.130 0.013 0.022 0.021 0.023 0.000 0.006 0.006 0.003 0.001 0.001 0.000
Naive-Logistic$_{1000}$ 3.589 0.055 0.317 0.328 0.333 0.011 - - - - - -
Hong et al.$_{1000}$ 4.793 7.153 0.912 1.001 1.280 0.001 0.008 0.006 0.003 0.003 0.001 0.001
TUBE$_{1000}$ 0.113 0.012 0.019 0.019 0.022 0.000 0.003 0.003 0.002 0.001 0.001 0.000
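As a reading aid for Table A2, the MSE of an estimator over simulation replicates is the average squared deviation of the replicate estimates from the true parameter value, taken here to be the value in each column header. The following is a minimal sketch of that computation, not the authors' simulation code; the function name `mse_over_replicates` and the toy data are hypothetical and only illustrate the arithmetic.

```python
import numpy as np


def mse_over_replicates(estimates, truth):
    """Empirical bias, variance, and MSE of each parameter over replicates.

    estimates: array of shape (n_replicates, n_parameters)
    truth:     array of shape (n_parameters,)
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    errors = estimates - truth            # truth is broadcast over replicates
    bias = errors.mean(axis=0)            # average signed error
    variance = estimates.var(axis=0)      # spread across replicates (ddof=0)
    mse = (errors ** 2).mean(axis=0)      # average squared error
    return bias, variance, mse


# Hypothetical toy data: 500 replicates of two parameters, each with a
# small systematic shift (0.05) and random noise (standard deviation 0.1).
rng = np.random.default_rng(0)
truth = np.array([1.3, -0.7])
estimates = truth + 0.05 + rng.normal(scale=0.1, size=(500, 2))
bias, variance, mse = mse_over_replicates(estimates, truth)
```

With the variance computed using `ddof=0`, the decomposition MSE = bias^2 + variance holds exactly, which is why a method with small variance can still show a large MSE when its bias does not shrink.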
Table A3: Coverage probabilities (CP) at the 95% nominal level of parameter estimates over 500 simulations for the regression parameters for genetic effects ($\bm{\beta}$), the area under the curve (AUC) for the classification algorithm, and the errors and/or uncertainties in labels ($\bm{\lambda}$) for settings (a) with linear genetic effects, (b) with nonlinear genetic effects, and (c) with nonlinear genetic effects and slight violation of conditional independence between ${\bf G}$ and ${\bf X}$.
(a)
Method $\beta_{0}=-4.600$ $\beta_{1}=1.600$ $\beta_{2}=1.600$ $\beta_{3}=1.600$ $\beta_{4}=1.600$ AUC=0.702 $\lambda_{1}(0)=0.320$ $\lambda_{1}(0.5)=0.490$ $\lambda_{1}(1)=0.190$ $\lambda_{0}(0)=0.700$ $\lambda_{0}(0.5)=0.280$ $\lambda_{0}(1)=0.030$
Naive-Logistic$_{100}$ 0.002 0.000 0.000 0.000 0.002 0.398 - - - - - -
Hong et al.$_{100}$ 0.946 0.954 0.958 0.948 0.950 0.946 0.942 0.954 0.952 0.956 0.960 0.934
TUBE$_{100}$ 0.940 0.944 0.944 0.940 0.938 0.998 0.952 0.958 0.950 0.954 0.958 0.936
Naive-Logistic$_{500}$ 0.000 0.000 0.000 0.000 0.000 0.000 - - - - - -
Hong et al.$_{500}$ 0.946 0.956 0.952 0.950 0.952 0.948 0.940 0.946 0.952 0.954 0.946 0.960
TUBE$_{500}$ 0.942 0.946 0.946 0.948 0.954 0.954 0.948 0.948 0.944 0.948 0.946 0.972
Naive-Logistic$_{1000}$ 0.000 0.000 0.000 0.000 0.000 0.000 - - - - - -
Hong et al.$_{1000}$ 0.948 0.950 0.946 0.956 0.950 0.954 0.938 0.950 0.948 0.944 0.952 0.968
TUBE$_{1000}$ 0.954 0.952 0.938 0.954 0.940 0.954 0.938 0.950 0.938 0.942 0.952 0.978
(b)
Method $\beta_{0}=1.300$ $\beta_{1}=0.700$ $\beta_{2}=-0.700$ $\beta_{3}=-0.700$ $\beta_{4}=-0.700$ AUC=0.702 $\lambda_{1}(0)=0.320$ $\lambda_{1}(0.5)=0.490$ $\lambda_{1}(1)=0.190$ $\lambda_{0}(0)=0.700$ $\lambda_{0}(0.5)=0.280$ $\lambda_{0}(1)=0.030$
Naive-Logistic$_{100}$ 0.097 0.333 0.628 0.554 0.547 0.574 - - - - - -
Hong et al.$_{100}$ 0.634 0.261 0.675 0.752 0.697 0.602 0.952 0.956 0.956 0.828 0.903 0.871
TUBE$_{100}$ 0.947 0.956 0.929 0.945 0.958 0.992 0.952 0.954 0.941 0.949 0.952 0.947
Naive-Logistic$_{500}$ 0.000 0.000 0.008 0.004 0.002 0.000 - - - - - -
Hong et al.$_{500}$ 0.640 0.095 0.554 0.628 0.628 0.554 0.954 0.966 0.956 0.341 0.729 0.408
TUBE$_{500}$ 0.958 0.952 0.952 0.958 0.964 0.941 0.954 0.956 0.943 0.943 0.947 0.927
Naive-Logistic$_{1000}$ 0.000 0.000 0.000 0.000 0.000 0.000 - - - - - -
Hong et al.$_{1000}$ 0.604 0.083 0.566 0.636 0.618 0.543 0.943 0.941 0.947 0.083 0.475 0.121
TUBE$_{1000}$ 0.954 0.960 0.949 0.956 0.958 0.939 0.954 0.943 0.943 0.947 0.945 0.947
(c)
Method $\beta_{0}=1.300$ $\beta_{1}=-0.300$ $\beta_{2}=-0.700$ $\beta_{3}=-0.700$ $\beta_{4}=-0.800$ AUC=0.702 $\lambda_{1}(0)=0.320$ $\lambda_{1}(0.5)=0.490$ $\lambda_{1}(1)=0.190$ $\lambda_{0}(0)=0.700$ $\lambda_{0}(0.5)=0.280$ $\lambda_{0}(1)=0.030$
Naive-Logistic$_{100}$ 0.079 0.797 0.436 0.482 0.428 0.522 - - - - - -
Hong et al.$_{100}$ 0.956 0.937 0.896 0.954 0.927 0.858 0.956 0.927 0.958 0.925 0.929 0.937
TUBE$_{100}$ 0.939 0.935 0.960 0.948 0.935 0.985 0.971 0.952 0.969 0.956 0.954 0.952
Naive-Logistic$_{500}$ 0.000 0.290 0.002 0.006 0.006 0.002 - - - - - -
Hong et al.$_{500}$ 0.933 0.881 0.862 0.868 0.864 0.839 0.933 0.952 0.952 0.931 0.944 0.937
TUBE$_{500}$ 0.950 0.929 0.950 0.939 0.942 0.942 0.946 0.952 0.950 0.946 0.944 0.927
Naive-Logistic$_{1000}$ 0.000 0.077 0.000 0.000 0.000 0.000 - - - - - -
Hong et al.$_{1000}$ 0.985 0.946 0.979 0.987 0.990 0.843 0.939 0.946 0.948 0.879 0.912 0.894
TUBE$_{1000}$ 0.942 0.937 0.946 0.946 0.948 0.946 0.952 0.958 0.933 0.958 0.954 0.939
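As a reading aid for Table A3, an empirical coverage probability is the fraction of simulation replicates whose confidence interval contains the true parameter value; values near 0.95 indicate well-calibrated 95% intervals. The sketch below assumes Wald-type intervals (estimate plus or minus a normal quantile times the standard error), which is an assumption made for illustration and may differ from how the intervals in the table were constructed. The function name `coverage_probability` and the toy data are hypothetical.

```python
import numpy as np


def coverage_probability(estimates, std_errors, truth, z=1.959963984540054):
    """Fraction of replicates whose Wald interval contains the true value.

    estimates, std_errors: arrays with one entry per replicate
    truth:                 true parameter value
    z:                     normal quantile (default gives 95% intervals)
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    covered = (lower <= truth) & (truth <= upper)
    return covered.mean(axis=0)


# Hypothetical toy data: 500 replicates of an unbiased estimator whose
# reported standard error matches its actual sampling variability.
rng = np.random.default_rng(1)
truth = -0.7
std_error = 0.1
estimates = truth + rng.normal(scale=std_error, size=500)
cp = coverage_probability(estimates, np.full(500, std_error), truth)
```

Under this reading, coverage far below the nominal level, as in the Naive-Logistic rows, arises when the intervals are centered on biased estimates or are too narrow.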