Closing the Gap in High-Risk Pregnancy Care Using Machine Learning and Human-AI Collaboration

Hussein Mozannar: IDSS, Massachusetts Institute of Technology, Cambridge, MA; CSAIL and IMES, Massachusetts Institute of Technology, Cambridge, MA. Equal contribution, co-first author. Corresponding author, [email protected]
Yuria Utsumi: CSAIL and IMES, Massachusetts Institute of Technology, Cambridge, MA. Equal contribution, co-first author.
Irene Y. Chen: Microsoft Research New England, Cambridge, MA.
Stephanie S. Gervasi: Independence Blue Cross, Philadelphia, Pennsylvania.
Michele Ewing: Independence Blue Cross, Philadelphia, Pennsylvania.
Aaron Smith-McLallen: Independence Blue Cross, Philadelphia, Pennsylvania.
David Sontag: CSAIL and IMES, Massachusetts Institute of Technology, Cambridge, MA.
Abstract

A high-risk pregnancy is a pregnancy complicated by factors that can adversely affect the outcomes of the mother or the infant. Health insurers use algorithms to identify members who would benefit from additional clinical support. This work presents the implementation of a real-world ML-based system to assist care managers in identifying pregnant patients at risk of complications. In this retrospective evaluation study, we developed a novel hybrid-ML classifier to predict whether patients are pregnant and trained a standard classifier using claims data from a health insurance company in the US to predict whether a patient will develop pregnancy complications. These models were developed in cooperation with the care management team and integrated into a user interface with explanations for the nurses. The proposed models outperformed commonly used claim codes for the identification of pregnant patients at the expense of a manageable false positive rate. Our risk complication classifier shows that we can accurately triage patients by risk of complication. Our approach and evaluation are guided by human-centric design. In user studies with the nurses, they preferred the proposed models over existing approaches.

1 Introduction

Figure 1: Illustration of our proposed algorithm HAPI for pregnancy identification. We first collect a historical dataset of members that is used to train the Lasso model that predicts the probability of members being pregnant. Then, at each point in time t in the patient’s trajectory (weekly frequency), we pass their claim codes through HAPI which combines the Lasso model and the list of anchor pregnancy codes to obtain a probability of a member being pregnant. We visualize on the rightmost graph the probability of pregnancy during the member’s gestation, where we also show the first instance where there is a code indicating pregnancy start compared to when HAPI predicted pregnancy start.

A high-risk pregnancy is a pregnancy complicated by factors that can adversely affect the health outcomes of the mother, fetus, or infant. Pregnancy complications like gestational diabetes, hypertension, and preeclampsia can lead to childbirth complications such as eclampsia, cardiomyopathy, and embolism and result in adverse pregnancy outcomes, including preterm birth, HELLP syndrome, and intrauterine fetal death. In 2018, pregnancy and childbirth complications affected 19.6% and 1.7% of pregnancies, respectively, in the U.S. [1]. Moreover, systemic disparities in pregnancy and childbirth complications are well-documented. Black women are significantly more likely to develop preeclampsia and more than three times more likely to die from pregnancy-related complications than White women [2].

Fortunately, timely and appropriate clinical intervention can effectively manage complications during pregnancy and reduce maternal, fetal, and neonatal morbidity and mortality [3, 4, 5, 6, 7]. Health plan-operated care management programs for high-risk pregnancies aim to coordinate care for at-risk patients across their clinical care team, educate patients about their conditions and medications, and support them in managing their conditions [8, 9, 10, 11].

Objective.

In this work, we collaborate with the High-Risk Pregnancy (HRP) care management team at an Anonymized Health Insurance Company (AIC) in the US. We aim to improve the member identification process in which nurse case managers review relevant clinical information and make decisions about which members are most appropriate for the HRP program. The process begins with ML algorithms and clinical decision rules that identify pregnant and at-risk members from medical claims; these members are then served to nurse case managers for review and final determination of program eligibility and appropriateness. Automated mechanisms for patient risk identification and stratification are critical to efficiently identify pregnant and at-risk patients from a large patient population. We conducted structured interviews with the care managers to understand the identification and stratification process and discover opportunities to improve it. These conversations highlighted that patients surfaced for evaluation are often no longer pregnant or have a low risk of pregnancy complications, and that nurses lack insight into why patients are being surfaced.

Our first task was to improve the latency with which pregnant patients are identified. Our second task was to accurately identify patients at high risk for pregnancy complications. However, not all complications of pregnancy can be effectively remediated through telephonically delivered care management. Following the care managers’ recommendations, we focus on gestational diabetes and gestational hypertension, the conditions for which the outreach and education delivered in the HRP program would be most impactful.

Contributions.

This paper presents a recipe for developing automated systems for high-risk pregnancy management programs, from dataset creation to model training and evaluation. We first outline how to build datasets from available patient data to train models for pregnancy identification and detection. We developed a novel Hybrid Algorithm for Pregnancy Identification (HAPI) that combines manual code lists with machine learning models. We then train a classifier that predicts the patient’s risk of developing complications at each point in their pregnancy. We integrate these models into a user-friendly interface for nurses to use. We retrospectively evaluate the individual classifiers on over 30k patients, showing we can identify pregnant members earlier on average than predefined code lists and can triage members by risk of complication with an AUC of 0.76. User studies with nurses confirm that the new interface is preferred over existing implementations.

More broadly, we believe our work serves as an important demonstration of human-centric design for ML in healthcare and will be a useful guide for future work in the field.

2 Related Work

Much of the existing literature on pregnancy identification focuses on retrospective identification of pregnancy episodes [12, 13, 14, 15]. Our goal was to identify pregnancy in a near real-time fashion as information about the patient becomes available through medical and pharmacy claims, lab results, authorizations, and admit, discharge, and transfer data. To the best of our knowledge, this is the first work that accomplishes this objective. Although there is extensive literature on predicting pregnancy complications using machine learning [16, 17, 18, 19, 20], we focus specifically on gestational hypertension and diabetes and on making risk predictions as early as possible. While we are aware that certain deep learning architectures perform well for our task, practical considerations limit us to the use of linear classifiers, which perform relatively well. Our approach is to build separate machine learning models for pregnancy start and end identification and for risk of pregnancy complications. When deploying machine learning models in the clinical setting, it is important to provide a rationale for predictions to gain clinicians’ trust and help them make informed decisions [21, 22, 23, 24]. We discuss other relevant prior work in the remaining sections.

3 Methods

3.1 Dataset Creation For Pregnancy Start and End Identification

Our approach for identifying the start and end of a patient’s pregnancy is based on a machine learning predictor. Since there is no publicly available data well suited to this task, we built our own training dataset from AIC’s members only. We construct a cohort of female patients aged between 18 and 48 who had pregnancies with and without complications between 2004 and 2021 that ended in a live birth. We also construct a matching cohort of never-pregnant female patients according to the age distribution of the pregnant sub-cohort.

To identify pregnant patients for training our pregnancy-start model, we use a modified version of the algorithm of Matcho et al. [12] and select only patients who had a live birth. The original algorithm retrospectively infers the start and end of the most recent pregnancy episode and the corresponding pregnancy outcome or complication. In contrast, our approach identifies gestational episodes in real time. We restrict the cohort to live births because this allows us to reliably identify the pregnancy start date. For pregnancies with a live birth without complications, we set the start date to be 40 weeks before the end date of pregnancy. For pregnancies with complications, we set the start date to be the first date of occurrence of a pregnancy start code. The overall dataset consisted of 36,735 patients with an average age of 32.3 years, divided into three subgroups: 22.6% pregnancies without complications, 62.4% pregnancies with complications, and 15.0% never pregnant.
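
The start-date labeling rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, its arguments, and the choice to return `None` when a pregnancy with complications has no start code are assumptions made for the example.

```python
from datetime import date, timedelta
from typing import Optional, Sequence

# Assumed gestation length used for the label, as stated in the text.
GESTATION = timedelta(weeks=40)


def label_pregnancy_start(
    end_date: date,
    has_complication: bool,
    start_code_dates: Sequence[date] = (),
) -> Optional[date]:
    """Return the pregnancy start label for one live-birth episode.

    Without complications: 40 weeks before the end date.
    With complications: first date on which a pregnancy start code occurs.
    Returning None when no start code exists is an assumption of this sketch.
    """
    if not has_complication:
        return end_date - GESTATION
    return min(start_code_dates) if start_code_dates else None
```

For example, an uncomplicated live birth on 2020-10-07 would receive a start label of 2020-01-01 under this rule.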

For pregnant patients, we extract weekly data starting from 20 weeks before the pregnancy starts to 20 weeks after the pregnancy ends, for a total of 80 weekly data points per patient. This allows for early pregnancy and non-pregnancy indicators to be learned while avoiding signals from previous pregnancies. For never-pregnant patients, we sample 80 weeks of data around the midpoint of their medical history. For each data point, we generate non-temporal and temporal features from medical data. For temporal data, we construct windowed features, which aggregate the data within a specified backward time window into a binary feature indicating whether a billing code occurred during that window. Windowed features for 5-day and 10-day windows are generated using omop-learn [25] for the following categories: medical conditions, prescriptions, procedures, specialty visits, and labs. We also include 12 non-temporal features, which include age, race, and gender. This gives us a feature set of 62,734 features. For each subgroup, we split the data by patient into a train set (50%), validation set (25%), and test set (25%), so no patient data is shared across the different splits. We aggregate all three sub-cohorts to construct the train, validation, and test splits. A summary of the dataset is provided in Table 1. Further details about the dataset creation are found in the Appendix.
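
To make the windowed features concrete, the sketch below builds binary indicators from a list of dated billing codes. It does not use or reproduce the omop-learn API; the function name, the `code|Nd` feature naming, and the half-open window boundary (an event exactly `N` days old falls outside the `N`-day window) are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, Sequence, Tuple

Event = Tuple[date, str]  # (service date, billing code)


def windowed_features(
    events: Iterable[Event],
    as_of: date,
    windows: Sequence[int] = (5, 10),
) -> Dict[str, int]:
    """Sparse binary indicators: did `code` occur in the last `w` days?

    Features that are absent from the returned dict are implicitly 0.
    """
    feats: Dict[str, int] = {}
    for when, code in events:
        age_days = (as_of - when).days
        if age_days < 0:
            continue  # never look past the prediction date
        for w in windows:
            if age_days < w:
                feats[f"{code}|{w}d"] = 1
    return feats
```

Keeping the representation sparse matters at this scale, since each data point activates only a small fraction of the tens of thousands of possible features.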

Table 1: Summary of patient characteristics and feature processing for the dataset used to build models to identify if a patient is pregnant (Identification Dataset) and for the dataset used to triage patients by risk of complication (Complications Dataset).
| Characteristics | Identification Dataset | Complications Dataset |
|---|---|---|
| No. of patients | 36,735 | 12,243 |
| Race / Ethnicity (%) | 39.1% White, 5.7% Black, 3.4% Other (rest is unreported) | 43.8% White, 5.7% Black, 3.6% Other (rest is unreported) |
| Average age in years | 32.3 ($\sigma$ = 6.1) | 32.0 ($\sigma$ = 6.1) |
| Pregnancy complication (%) | 22.6% without complication, 62.4% with complications, 15.0% not pregnant | 73.6% without complication, 26.4% with complications, divided into 16.9% with gestational hypertension and 9.4% with gestational diabetes |
| Dataset split | 50% training, 25% validation, 25% testing | 60% training, 20% validation, 20% testing |
| Features generated | $\{5, 10\}$-day windowed features and 12 non-temporal features | $\{30, 180, 365, 730, 10\mathrm{k}\}$-day windowed features and 12 non-temporal features |
| Total number of features per patient data point | 112,322 | 62,734 |

3.2 Algorithm For Pregnancy Start and End Identification

We propose a Hybrid Algorithm for Pregnancy Identification (HAPI) that predicts at each week the probability that a patient is pregnant. The HAPI algorithm predicts a score in $[0,1]$ of the likelihood of the patient being pregnant at each point in time $t$ using their features up to time $t$: $X_t$. HAPI first relies on a set of carefully chosen clinical codes that indicate either the start or end of pregnancy, denoted as ‘anchors’. Starting from each week of the patient’s data, if a code indicating the start of pregnancy is available, we set the start of pregnancy at the first week when the code is available, and similarly for codes indicating the end of pregnancy. Otherwise, we use a Lasso regularized logistic regression model [26] that is trained with the objective of predicting whether the patient is currently pregnant from the features in the dataset. Importantly, we use the Anchor&Learn approach [27], where we remove the anchors from the feature set of the Lasso algorithm so that it focuses on signals not captured by the anchors. After we get the predictions of the Lasso model at time $t$ as $f(X_t) \in [0,1]$, we pass those predictions to an exponential moving average filter to smooth the predictions over time and obtain $\tilde{f}(X_t)$. We then binarize the predictions using a learned threshold to obtain $\hat{q}(X_t) \in \{0,1\}$. We predict that the patient is pregnant at time $t$ if $\hat{q}(X_t) = 1$ and we have two consecutive increases in $\tilde{f}(X_t)$ (similarly for end-of-pregnancy prediction). The Lasso model is learned on the training set of the dataset previously described with hyperparameters chosen on the validation set. A formal description of the algorithm can be found in the Appendix.
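
The decision rule for pregnancy start can be sketched as below. This is one reading of the prose description rather than the formal algorithm from the Appendix. The names and the default smoothing factor `alpha` and `threshold` are placeholders; `lasso_probs` is assumed to hold the weekly outputs $f(X_t)$ of a model trained without anchor features; "two consecutive increases" is interpreted as $\tilde{f}(X_t) > \tilde{f}(X_{t-1}) > \tilde{f}(X_{t-2})$; and the binarization is assumed to be $\tilde{f}(X_t) \geq$ threshold.

```python
from typing import List, Optional, Sequence


def ema(values: Sequence[float], alpha: float) -> List[float]:
    """Exponential moving average; the first value seeds the filter."""
    smoothed: List[float] = []
    prev: Optional[float] = None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def predict_start_week(
    lasso_probs: Sequence[float],
    anchor_start_week: Optional[int] = None,
    alpha: float = 0.5,       # placeholder smoothing factor
    threshold: float = 0.5,   # placeholder for the learned threshold
) -> Optional[int]:
    """First week at which pregnancy is flagged, or None if never flagged."""
    smoothed = ema(lasso_probs, alpha)
    for t, score in enumerate(smoothed):
        # Anchor codes take precedence whenever one is present at week t.
        if anchor_start_week is not None and t == anchor_start_week:
            return t
        # Otherwise fall back to the smoothed, thresholded model score.
        rising = t >= 2 and smoothed[t] > smoothed[t - 1] > smoothed[t - 2]
        if score >= threshold and rising:
            return t
    return None
```

Under these assumptions, the model can only improve on the anchor list when its smoothed score crosses the threshold before the first anchor code appears, which matches the intent of removing anchors from the model's feature set.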

3.3 Dataset Creation for Pregnancy Complication Prediction

After pregnant patients are identified, we have to distinguish between those with a high and low likelihood of developing complications. The case management team identified gestational diabetes (GDB) and gestational hypertension (GHT) as specific complications that could be effectively managed within the HRP program. Our approach is to build a calibrated machine learning classifier that, given a patient’s data, can predict their risk of developing either gestational diabetes or gestational hypertension. Moreover, the classifier can provide a list of the patient features that led to the prediction as a form of explanation. Since no suitable public data exists for training and evaluating the classifier, we construct our own dataset from AIC members.

We first constructed a cohort of pregnant patients using the algorithm of Matcho et al. [12]. We then collaborated with the nurse case managers to compile a list of codes that indicate pregnancy episodes with gestational diabetes and gestational hypertension. We validated the code set with the care management nurses, who hand-labeled outcome codes for a subset of 20 patients, given data up to the end of the pregnancy episode. This allowed us to (1) validate that the existing codes are indicative of the corresponding outcome, and (2) find new codes indicative of an outcome. This enables us to find patients with outcomes of GHT and GDB during their pregnancy.

We select the subset of patients with one of three outcomes: live birth (no complication), gestational hypertension, or gestational diabetes. This yields a set of 12,243 patients, of whom 73.6% had a live birth without complications, 16.9% had gestational hypertension, and 9.4% had gestational diabetes. We divide the patients into 20% for testing and 80% for training and validation. For each patient, we construct 10 data points where each data point is a slice of the patient’s data up to a cutoff date. We choose the cutoff dates to be uniformly distributed from three months before pregnancy starts until the end of pregnancy, thus covering all three trimesters. Each data point has a label in $\mathcal{Y}=\{0,1,2\}$ indicating, respectively, no complication, GHT, and GDB. Each data point has temporal and non-temporal features as in the dataset for pregnancy identification. We fit several standard classification algorithms: Lasso (L1-regularized), ELASTIC-NET (L1 and L2-regularized), and XGBOOST (gradient-boosted trees). We fit each on the training set and use the validation set to pick hyperparameters.
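
As an illustration of how the per-patient cutoff dates might be generated, the sketch below places them at evenly spaced intervals. Two assumptions are made here that the text does not settle: "uniformly distributed" is read as evenly spaced rather than randomly sampled, and "three months" is approximated as 90 days. The function and parameter names are hypothetical.

```python
from datetime import date, timedelta
from typing import List


def cutoff_dates(
    pregnancy_start: date,
    pregnancy_end: date,
    n_points: int = 10,
    lead_days: int = 90,  # assumed approximation of "three months"
) -> List[date]:
    """Evenly spaced cutoffs from `lead_days` before start through the end."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    first = pregnancy_start - timedelta(days=lead_days)
    span_days = (pregnancy_end - first).days
    return [
        first + timedelta(days=round(i * span_days / (n_points - 1)))
        for i in range(n_points)
    ]
```

Each cutoff then defines one data point: only claims dated on or before the cutoff are used to build that data point's features, while the label is the patient's eventual outcome.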

3.4 Extracting Evidence for Predictions

It is essential that we explain why the model made a given prediction for a patient. We provide a list of the claim codes that had the most effect on the model’s prediction, along with the polarity of each effect. For pregnancy identification with HAPI, for each patient, we surface all anchor codes if they are available and then surface the highest weighted codes (by absolute value) according to the Lasso model.

For predicting the risk of pregnancy complications, the classifier’s top codes by model weight include many variants of diabetes and hypertension codes, since a prior history of these conditions is highly predictive of GDB and GHT. However, there is a nontrivial number of patients who have no prior history of these conditions, and they may be affected by a different set of risk factors. To better capture these factors, we partition our dataset conditional on prior history of diabetes and hypertension, and train a separate Lasso model on each of the four subsets: no prior history, history of both conditions, and history of either condition alone. We call these models GROUP-Lasso models. For a given patient, the prediction follows the global Lasso model, but to extract evidence for the prediction, we take the highest weighted features from the GROUP-Lasso model of the subset the patient belongs to, as determined by their prior history of diabetes or hypertension.
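
A minimal sketch of this evidence-extraction step is shown below, assuming each GROUP-Lasso model is available as a mapping from code to coefficient. The group labels, function names, polarity strings, and the default of five surfaced codes are hypothetical choices for the example, not details from the deployed system.

```python
from typing import Dict, Iterable, List, Tuple

Weights = Dict[str, float]  # code -> Lasso coefficient (zeros may be omitted)


def history_group(prior_diabetes: bool, prior_hypertension: bool) -> str:
    """Map a patient's prior history to one of the four subsets."""
    if prior_diabetes and prior_hypertension:
        return "both"
    if prior_diabetes:
        return "diabetes_only"
    if prior_hypertension:
        return "hypertension_only"
    return "none"


def top_evidence(
    weights: Weights, patient_codes: Iterable[str], k: int = 5
) -> List[Tuple[str, float, str]]:
    """Codes present for the patient, ranked by absolute model weight."""
    scored = [
        (code, weights[code])
        for code in set(patient_codes)
        if weights.get(code, 0.0) != 0.0
    ]
    # Largest |weight| first; ties broken alphabetically for determinism.
    scored.sort(key=lambda cw: (-abs(cw[1]), cw[0]))
    return [
        (code, w, "raises risk" if w > 0 else "lowers risk")
        for code, w in scored[:k]
    ]


def explain(
    group_weights: Dict[str, Weights],
    patient_codes: Iterable[str],
    prior_diabetes: bool,
    prior_hypertension: bool,
    k: int = 5,
) -> List[Tuple[str, float, str]]:
    """Evidence comes from the group model; the risk score does not."""
    group = history_group(prior_diabetes, prior_hypertension)
    return top_evidence(group_weights[group], patient_codes, k)
```

Separating the scoring model from the explanation model in this way keeps the risk estimate unchanged while letting the surfaced codes reflect risk factors specific to the patient's history group.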

3.5 User Study Design

Our main evaluation of HAPI and our pregnancy complications classifier is a retrospective evaluation of performance. However, this evaluation does not exactly mirror how the algorithms will be deployed. The algorithms never make the final decision on who is deemed to be pregnant or at risk; rather, the care managers make it after reviewing the algorithms’ predictions. We design two user studies that simulate how HAPI and the complications classifier will be deployed, in which a care manager assesses patients in a simulation environment. All studies in this work were ruled exempt by our IRB.

With input from the nurse care managers, we built a dashboard mimicking the actual dashboard used by the nurses in the HRP program to surface medical history available in insurance claims and other data sources (e.g. visits, diagnosis codes, demographics). For testing HAPI, we perform a study with a single nurse under two conditions A) with the predictions and evidence of HAPI and B) without any algorithmic predictions (control). The nurse makes predictions in each condition for 12 patients. Each trial took up to an hour.

For testing the pregnancy complications classifier, we ran three trials with each of two nurses (referred to as Nurse 1 and Nurse 2) from the pregnancy care management program, for six trials in total. The nurses made decisions on patients under three conditions: A) without predictions or evidence, B) with predictions only, and C) with both predictions and evidence. In each trial, the nurse makes predictions on 18 patients retrospectively. A sketch of the interface is shown in Figure 2. For each patient, we asked the nurse if they would call the patient and which complication the patient would develop.

Figure 2: Patient dashboard sketch for the user study on pregnancy complications classification. The user interface consists of a left panel containing demographic information and two views: Overview and Visits. We show the subtab Diseases/Conditions from the Overview view, where the nurse can find the ICD codes for each condition and disease. The left panel shows patient information, the model prediction, and the history of prior complications. We color ICD codes positively associated with complications in red (intensity varies with correlation) and those negatively associated with complications in green.

3.6 Statistical analysis

To obtain confidence intervals for AUC, we use the distribution-independent method introduced in [28], which is based on the error rate and the number of positive and negative samples. To obtain confidence intervals for accuracy and FNR/FPR metrics, we use the Wilson score interval. We use McNemar’s test to compare ordinal data proportions and paired t-tests to compare numerical data. All analyses were conducted in Python 3.8 using the statsmodels and scipy packages.
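
For reference, the Wilson score interval for a proportion can be computed directly with the standard library, as in the sketch below. The function name is hypothetical and this is not the analysis code used in the study; in practice, statsmodels provides the same interval through `proportion_confint` with `method="wilson"`.

```python
import math
from typing import Tuple


def wilson_interval(
    successes: int, n: int, z: float = 1.959964
) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    `z` is the standard normal quantile; the default gives a 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    )
    return center - half_width, center + half_width
```

Unlike the normal-approximation interval, the Wilson interval stays within [0, 1] and behaves sensibly for small samples or proportions near 0 or 1, which is relevant for subgroup metrics computed on few patients.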

4 Results

4.1 Identifying Pregnancies From Claims Data

We evaluate HAPI on a test set of 9183 patients randomly selected from the dataset. We compare the performance of HAPI against the baseline of only using anchor pregnancy codes for the detection of pregnancy start [14]. We measure for HAPI and the anchor code list the difference between the predicted start date and the actual pregnancy start date. The actual pregnancy start date is obtained by subtracting 40 weeks from the exact date of birth.
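
The delay metric described above amounts to simple date arithmetic, sketched below with hypothetical function names and illustrative dates. A method is counted as earlier for a patient when its delay is strictly smaller.

```python
from datetime import date, timedelta
from typing import Iterable, Tuple


def detection_delay_days(predicted_start: date, birth_date: date) -> int:
    """Days between the predicted start and the reference start date,
    where the reference is the date of birth minus 40 weeks."""
    actual_start = birth_date - timedelta(weeks=40)
    return (predicted_start - actual_start).days


def fraction_earlier(delay_pairs: Iterable[Tuple[int, int]]) -> float:
    """Share of patients where the first method's delay is strictly smaller.

    Each pair is (delay of method A, delay of method B) for one patient.
    """
    pairs = list(delay_pairs)
    if not pairs:
        raise ValueError("need at least one patient")
    return sum(a < b for a, b in pairs) / len(pairs)
```

For a birth on 2020-10-07, a predicted start of 2020-02-24 gives a delay of 54 days, while a first anchor code on 2020-03-16 gives 75 days.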

We show the histogram of the difference between the predicted start date and the actual date for patients with complications in Figure 3 for both the anchor code list and our proposed algorithm. Compared to using the code list alone, HAPI predicts an earlier start date for 3.54% (95% CI 3.05-4.00, z=14.5, p<0.001) of patients with pregnancy complications and for 4.29% (95% CI 3.42-5.16, z=9.6, p<0.001) of patients without complications. For the patients with complications who are predicted earlier by HAPI, the average difference between the predictions and the actual start date is 54.3 days compared to 75.6 days for the code list (t=-10.5, p<0.0001). For patients without complications, the average difference is 66.9 days compared to 102.5 days (t=-6.5, p<0.0001), respectively. However, across the entire test set, the average difference is only 1 day earlier for HAPI compared to the code list on patients with and without complications, which is not statistically significant. The model predicts that 5.58% (95% CI 4.05-6.40) of non-pregnant patients are in fact pregnant (false positive rate). HAPI can be adjusted using the Lasso model threshold to reduce the false positive rate at the expense of detecting pregnancies later in time.

(a) On all the test set.
(b) On subset of data where HAPI outperforms anchor codes
Figure 3: Histogram of pregnancy identification delays for pregnancies with complications for HAPI compared to the anchor codes. We measure the difference in days between the predicted start date and the actual start date for our model HAPI compared to a set of predefined pregnancy start codes (anchor codes). Subfigure (a) shows the histogram of differences over all test patients, where the two distributions overlap. Subfigure (b) shows the subset of test patients for whom HAPI is earlier than the anchor codes (3.54% of the set), where the HAPI distribution is shifted toward shorter delays.

4.2 Predicting Pregnancy Complications

After pregnant patients are identified, we have to distinguish between those with a high and low likelihood of developing complications. In Table 2, we compare the performance of different machine learning classifiers at predicting whether a patient will develop gestational diabetes, gestational hypertension, or neither. We find that the best-performing model in terms of accuracy is a Lasso regularized logistic regression model, which achieves an average accuracy of 76.8% (95% CI 76.2-77.3) at predicting complications across each test patient pregnancy and an AUC of 0.761 (95% CI 0.754-0.767). The Lasso model achieves an accuracy of 73.1% (95% CI 72.9-74.2) and an AUC of 0.722 (95% CI 0.710-0.734) when predicting three months before the start of the patient’s pregnancy. This indicates that there is a signal at the start of the pregnancy to triage patients by risk of complications. We assess model performance using patient data at different stages of pregnancy and find that accuracy and AUC generally increase as we progress to later pregnancy terms. This indicates that the model performs better as we see more data on the patient, although the confidence intervals overlap in some time periods.

To assess model performance at different stages of pregnancy, we evaluate the model when predicting on each patient’s data in each trimester and before gestation. To do this, we trim each member’s data up to the desired date of prediction and then predict using the Lasso model; results are shown in Figure 4. While confidence intervals overlap in some time periods, the metrics generally increase as we progress to later pregnancy terms, indicating that the model performs better as we see more data on the member.

Additionally, we evaluate how early the model catches pregnancy complications. While the Lasso model has a high false negative rate of 57.4% (95% CI 53.5-61.2), of the patients with true positive predictions (37.6%), a majority are caught before gestation (59.6%, 95% CI 53.1-65.5). This is important since early intervention and treatment are important in reducing gestational diabetes and hypertension risk [4, 6, 7].

Table 2: Evaluation metrics for predictors of pregnancy complications on the test set when predicting across four time periods for each patient: before pregnancy and during trimesters 1, 2, and 3; results are aggregated across the four periods. We show three different machine learning models, their accuracy, and their AUC on the test set. We provide 95% confidence intervals obtained for accuracy using the Wilson score interval and for AUC using the method in [28].
| Model | Accuracy (mean) | Accuracy (95% CI) | AUC (mean) | AUC (95% CI) |
|---|---|---|---|---|
| Lasso (L1) | 0.768 | 0.762-0.773 | 0.761 | 0.754-0.767 |
| ELASTIC-NET (L1+L2) | 0.713 | 0.707-0.719 | 0.736 | 0.729-0.742 |
| XGBOOST | 0.687 | 0.681-0.692 | 0.770 | 0.764-0.775 |

4.3 Bias/Fairness Audit For Pregnancy Complication Classifier

Prior work has shown that care management risk algorithms may contain racial bias due to nuances in how outcomes are defined [29]. Moreover, there exist systemic health disparities in maternal and infant mortality rates, e.g. Black people have mortality rates over three times higher than White people during pregnancy (40.8 v. 12.7 per 100,000 live births) [2]. To this end, we audit our algorithm for potential racial bias. We report evaluation metrics in Table 3 for the three most common race groups (White - 43.8%, Black - 5.7%, Other - 3.6%). The Other race category includes races outside of the following: American Indian or Alaska Native, Black or African American, White, Asian, Hispanic or Latino, Native Hawaiian or Other Pacific Islander. We note that accuracy for the White group is 77.4% (95% CI 76.7-78.2) compared to a lower accuracy for the Black group at 68.1% (95% CI 65.6-70.5). However, the AUC for the White group is 0.740 (95% CI 0.730-0.750), which is lower than that of the Black group, 0.787 (95% CI 0.765-0.808). This may be due to differences in class distribution, since the Black subgroup has much higher rates of complication (44.0%) compared to White (24.6%) and Other (25.9%) races. True positive rates of catching complications are 36.6%, 27.1%, and 30.0% for the Black, White, and Other subgroups, respectively. Race data for this analysis comes from electronic medical records with low coverage for race attribution (only approximately 53% of members have some member-level race attributed to examine bias), so true error rates may differ from those reported here. The lower accuracy for Black patients compared to White or Other race patients can potentially be explained by different base rates. When different subgroups have different base rates, competing definitions of algorithm fairness may conflict [30]. It is important to better understand sources of health disparities, potentially through gathering additional information such as social determinants of health [31].
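
A subgroup audit of this kind can be sketched as follows. For simplicity, the example collapses GHT and GDB into a single binary "complication" label, which is an assumption of the sketch rather than a description of how the reported metrics were computed; the function name and record layout are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Record = Tuple[str, int, int]  # (group, true label, predicted label); 1 = complication


def audit_by_group(
    records: Iterable[Record],
) -> Dict[str, Dict[str, Optional[float]]]:
    """Per-group base rate, accuracy, and true positive rate."""
    tallies = defaultdict(lambda: {"n": 0, "correct": 0, "pos": 0, "tp": 0})
    for group, y_true, y_pred in records:
        t = tallies[group]
        t["n"] += 1
        t["correct"] += int(y_true == y_pred)
        t["pos"] += int(y_true == 1)
        t["tp"] += int(y_true == 1 and y_pred == 1)

    report: Dict[str, Dict[str, Optional[float]]] = {}
    for group, t in tallies.items():
        report[group] = {
            "base_rate": t["pos"] / t["n"],
            "accuracy": t["correct"] / t["n"],
            # TPR is undefined when the group has no positive cases.
            "tpr": t["tp"] / t["pos"] if t["pos"] else None,
        }
    return report
```

Reporting the base rate next to accuracy and true positive rate makes it easier to see when a gap in accuracy coincides with a gap in complication prevalence, as it does between the Black and White subgroups here.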

Table 3: Evaluation metrics for the Lasso model on the test set, across different race groups for predicting complications at different points in the pregnancy (averaged from 3 months prior to gestation, trimester 1,2 and 3). Rates of complication in each race group are White - 24.6%, Black - 44.0%, Other - 25.9%. For each race group, we obtain the accuracy and AUC on the subgroup alone with 95% confidence intervals.
| Race group | Accuracy (95% CI) | AUC (95% CI) |
|---|---|---|
| White | 0.774 (0.767, 0.782) | 0.740 (0.730, 0.750) |
| Black | 0.681 (0.656, 0.705) | 0.787 (0.765, 0.808) |
| Other | 0.792 (0.765, 0.819) | 0.826 (0.798, 0.854) |
Figure 4: Accuracy and AUROC of the Lasso pregnancy complication predictor as we predict later during pregnancy. For each time of prediction, we trim patient data up to that time and predict using the trimmed data. We plot the linear trend lines of accuracy and AUROC, which increase over time; error bars represent 95% CIs.

4.4 User Studies

In the user study for pregnancy identification, in both conditions, the nurse correctly identified 5 of the 8 pregnant patients. Notably, in each trial, we introduced a patient who was falsely detected by the model to be pregnant, but the care manager was successfully able to recognize this incorrect prediction.

In the user study for pregnancy complications, we note that the inclusion of the model prediction and prior history seemed to improve the nurses’ accuracy at predicting whether a patient will develop GDB or GHT. Nurse 1 had an accuracy of 56% without the model, 72% with the model prediction only, and 67% with model prediction and evidence. Similarly, Nurse 2 had an accuracy of 33% in condition A, 56% in B, and 67% in C. Note that due to the small sample size of the studies, none of the increases in accuracy are statistically significant. Nurse 1 explained that a prior history of diabetes/hypertension or complications in a previous pregnancy is usually sufficient to make a call, but additional information such as distinct risk factors for complications (e.g. polycystic ovary syndrome) can help them build a better profile of the patient and identify those at risk. Both nurses indicated that highlighted evidence helped with obtaining this information more quickly. The evidence helped them focus on important visits and codes, especially when the visit history was lengthy. Nurse 2 said that although not all evidence was useful or made sense, it is easy to filter out the irrelevant items, i.e. surfacing useful codes should be prioritized over surfacing only a few codes. In follow-up interviews and discussions, the nurses expressed a preference for the dashboard used in the user study compared to their previous systems. They noted that the new interface saves an enormous amount of time, as they no longer need to access the claims system and review several years of claims data to decipher whether the patient is even pregnant, let alone whether they have any potential risk factors.

5 Discussion

In this study, we developed a machine learning system for the early detection of pregnancy and the identification of high-risk members. This system is part of a real-world deployment at IBC. We introduced a novel algorithm that identifies whether a member is pregnant from insurance claims data by combining indicators for pregnancy start and end with machine learning predictors. We found that, for 3.54% of members, it identifies an earlier pregnancy start date than concept codes do, with a false positive rate of only 5.58%. The model identifies members who may have started pregnancy visits later in their term because, for example, they tested for pregnancy using at-home tests. This could be a reason to offer cost-free pregnancy tests at local clinics so that members are incentivized to get tested formally and, in turn, the insurance company obtains data to identify pregnant members earlier. A large proportion of these members also tend to be high risk, which is exactly the group we want to identify early for intervention and treatment. Leveraging this information, we then identified members at the greatest risk for pregnancy complications so that care managers can provide timely and effective support. Using predictors of gestational diabetes and gestational hypertension, our model achieved an AUROC of 0.76.

We followed a human-centered design methodology and showed that it can improve the care management program for high-risk pregnancies at IBC. Because care managers are often faced with limited and fragmented interactions with patients, we conducted extensive discussions and interviews with care managers of the HRP program to identify their current needs and greatest challenges. These insights—combined with insurance claims—can help early detection of pregnancy, accurate identification of impactable high-risk members, and provision of explainable indicators to supplement predictions. We show that when actively engaging critical stakeholders like the care managers, machine learning systems can guide care management to prevent pregnancy complications.

We then set up a mock enrollment dashboard and evaluated these methods in two user studies, which yielded two key findings. First, the pregnancy identification algorithm helps nurses identify pregnancies earlier while correctly filtering out false-positive members. Second, showing the pregnancy complication model’s prediction and prior history of chronic conditions improves nurses’ performance metrics when deciding whom to call. While model explanations adversely affected the nurse’s performance in terms of time per member and how early they identified pregnant members in the pregnancy identification study, we observed that explanations improved notes about the member in the pregnancy risk factor study without much difference in the nurses’ classification performance. The latter study better integrated explanations into the clinical workflow, and nurses appeared to disagree with the explanations less, which emphasizes the importance of the explanation method and of how explanations are presented.

Our study demonstrated that close collaboration with care managers can be used to leverage insurance claims to improve the care of pregnant patients. We hope that our results can serve as a call to action for similar predictive models used to allocate care. In a recent report in the Journal of Biomedical Informatics, researchers advocated for more overlap in human-computer interaction and clinical decision-making tasks to improve precision medicine [32]. Our work expands on those topics to empower the domain experts and primary users of our system. We found that comprehensive needs-finding interviews with the care managers greatly enhanced our targeted ML system. Not only were we able to focus on the most salient problems facing care managers, but the resulting ML system also allocates resources better for pregnant patients.

Our study opens several areas for future work. As with any machine learning system, continual validation of our models across time is key to ensuring robust and generalizable performance. Predictors of early pregnancy detection and predictors of high-risk pregnancy may change over time due to improvements in health technology and patterns of healthcare utilization. Computational work in transfer learning and robustness can help adapt our models over time with minimal adjustment. Additionally, topics of pregnancy may raise questions about patient privacy. Our model keeps patient data completely private except for the minimal set of relevant care managers; however, advances in patient privacy protection may also be relevant.

6 Limitations and Conclusion

Our study has limitations that need to be addressed. In Appendix Figure 8, we stratify the population by when our ML system provided a relevant alert. Unfortunately, 60% of the alerts are never sounded for patients who have complications. This gap in our model performance is likely due to the sparsity of insurance claims and the delay of visits by the patients, both challenges often faced by models working with healthcare data [33]. We are also concerned with the disparate impact of the ML system on different patient subpopulations, particularly historically vulnerable groups. Health insurers are actively creating best practices for auditing and reducing algorithmic bias, with the first step being the measurement of existing bias [34]. In Table 3, we show the performance of the detection algorithm on White, Black, and Other race patients. The lower accuracy for Black patients compared to White or Other race patients can potentially be explained by different base rates. When different subgroups have different base rates, competing definitions of algorithmic fairness may conflict [30]. It is important to better understand the sources of health disparities, potentially by gathering additional information such as social determinants of health [31].

In conclusion, we have developed novel algorithms for the identification of pregnancy and the triage of pregnant members by risk of complication. The development and subsequent evaluation of these algorithms followed a human-centered design methodology with extensive collaboration with the high-risk pregnancy care managers at IBC. We thus demonstrated that the active engagement of key stakeholders like care managers can substantially improve the clinical workflow and the quality of care given by care managers to pregnant patients.

Author Contributions

Conception and design: H.M., Y.U., D.S., S.G., A.S.

Model Development: H.M., Y.U.

User Study Development: Y.U.

Data Collection: H.M., Y.U., M.E.

Data analysis: H.M., Y.U.

Data Interpretation: H.M., Y.U., I.C., D.S., S.G., A.S.

Supervision: D.S.

Manuscript writing: H.M., Y.U., I.C., D.S., S.G., A.S., M.E.

Acknowledgements

H.M., Y.U., I.C. and D.S. were supported by a grant from Independence Blue Cross.

Competing Interests

The research was financially supported by a grant from Independence Blue Cross, which also contributed the data for the study. The sponsor collected the data, reviewed the manuscript, and approved the decision to submit the manuscript for publication. H.M., Y.U., I.C. and D.S. were supported by the grant. M.E., S.G., A.S. are employees of Independence Blue Cross.

Data Availability

The datasets generated and analyzed during the current study are not publicly available because they contain insurance claims data and demographic data (including age and ethnicity) of members insured by Independence Blue Cross. The data is de-identified but not anonymous.

References

  • [1] Trends in Pregnancy and Childbirth Complications in the U.S. https://www.bcbs.com/the-health-of-america/reports/trends-in-pregnancy-and-childbirth-complications-in-the-uspre-ex (2020). Accessed: 2022-5-10.
  • [2] Petersen, E. E. Racial/Ethnic Disparities in Pregnancy-Related Deaths — United States, 2007–2016. MMWR. Morbidity and Mortality Weekly Report 68, DOI: 10.15585/mmwr.mm6835a3 (2019).
  • [3] Lassi, Z. S., Mansoor, T., Salam, R. A., Das, J. K. & Bhutta, Z. A. Essential pre-pregnancy and pregnancy interventions for improved maternal, newborn and child health. Reproductive Health 11, S2, DOI: 10.1186/1742-4755-11-S1-S2 (2014).
  • [4] Teede, H. J., Harrison, C. L., Teh, W. T., Paul, E. & Allan, C. A. Gestational diabetes: Development of an early risk prediction tool to facilitate opportunities for prevention. Australian and New Zealand Journal of Obstetrics and Gynaecology 51, 499–504, DOI: 10.1111/j.1479-828X.2011.01356.x (2011).
  • [5] Gao, Y., Ren, S., Zhou, H. & Xuan, R. Impact of Physical Activity During Pregnancy on Gestational Hypertension. Physical Activity and Health 4, 32–39, DOI: 10.5334/paah.49 (2020).
  • [6] Raets, L., Beunen, K. & Benhalima, K. Screening for Gestational Diabetes Mellitus in Early Pregnancy: What Is the Evidence? Journal of Clinical Medicine 10, 1257, DOI: 10.3390/jcm10061257 (2021).
  • [7] Rowan, J. A., Budden, A., Ivanova, V., Hughes, R. C. & Sadler, L. C. Women with an HbA1c of 41–49 mmol/mol (5.9–6.6%): a higher risk subgroup that may benefit from early pregnancy intervention. Diabetic Medicine 33, 25–31, DOI: 10.1111/dme.12812 (2016).
  • [8] Hong, C. S. H. Caring for High-Need, High-Cost Patients: What Makes for a Successful Care Management Program? Tech. Rep., Commonwealth Fund, New York, NY United States (2014). DOI: 10.15868/socialsector.25007.
  • [9] Alexander, J. W. & Mackey, M. C. Cost Effectiveness of a High-Risk Pregnancy Program. Care Management Journals 1, 170–174, DOI: 10.1891/1521-0987.1.3.170 (1999).
  • [10] Mate, A. et al. Field study in deploying restless multi-armed bandits: Assisting non-profits in improving maternal and child health. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, 12017–12025 (2022).
  • [11] Radin, J. M. et al. The healthy pregnancy research program: transforming pregnancy research through a researchkit app. NPJ digital medicine 1, 45 (2018).
  • [12] Matcho, A. et al. Inferring pregnancy episodes and outcomes within a network of observational databases. PLoS ONE 13, e0192033, DOI: 10.1371/journal.pone.0192033 (2018).
  • [13] Blotière, P.-O. et al. Development of an algorithm to identify pregnancy episodes and related outcomes in health care claims databases: An application to antiepileptic drug use in 4.9 million pregnant women in France. Pharmacoepidemiology and Drug Safety 27, 763–770, DOI: 10.1002/pds.4556 (2018).
  • [14] MacDonald, S. C. et al. Identifying pregnancies in insurance claims data: Methods and application to retinoid teratogenic surveillance. Pharmacoepidemiology and Drug Safety 28, 1211–1221, DOI: 10.1002/pds.4794 (2019).
  • [15] Schink, T., Wentzell, N., Dathe, K., Onken, M. & Haug, U. Estimating the Beginning of Pregnancy in German Claims Data: Development of an Algorithm With a Focus on the Expected Delivery Date. Frontiers in Public Health 8 (2020).
  • [16] Bertini, A., Salas, R., Chabert, S., Sobrevia, L. & Pardo, F. Using machine learning to predict complications in pregnancy: A systematic review. Frontiers in bioengineering and biotechnology 9, 1385 (2022).
  • [17] Espinosa, C. et al. Data-driven modeling of pregnancy-related complications. Trends in molecular medicine 27, 762–776 (2021).
  • [18] Islam, M. N., Mustafina, S. N., Mahmud, T. & Khan, N. I. Machine learning to predict pregnancy outcomes: a systematic review, synthesizing framework and future research agenda. BMC Pregnancy and Childbirth 22, 1–19 (2022).
  • [19] Machado, J. M. et al. Predicting the risk associated to pregnancy using data mining. SCITEPRESS (2015).
  • [20] Li, S. et al. Improving preeclampsia risk prediction by modeling pregnancy trajectories from routinely collected electronic medical record data. NPJ Digital Medicine 5, 68 (2022).
  • [21] Park, S. Y. et al. Identifying challenges and opportunities in human–ai collaboration in healthcare (2019).
  • [22] Asan, O., Bayrak, A. E., Choudhury, A. et al. Artificial intelligence and human trust in healthcare: focus on clinicians. Journal of medical Internet research 22, e15154 (2020).
  • [23] Reverberi, C. et al. Experimental evidence of effective human–ai collaboration in medical decision-making. Scientific reports 12, 14952 (2022).
  • [24] Gaube, S. et al. Do as ai say: susceptibility in deployment of clinical decision-aids. NPJ digital medicine 4, 31 (2021).
  • [25] Kodialam, R. et al. Deep contextual clinical prediction with reverse distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, 249–258 (2021).
  • [26] Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288 (1996).
  • [27] Halpern, Y., Horng, S., Choi, Y. & Sontag, D. Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association 23, 731–740 (2016).
  • [28] Cortes, C. & Mohri, M. Confidence intervals for the area under the roc curve. Advances in neural information processing systems 17 (2004).
  • [29] Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453 (2019).
  • [30] Chouldechova, A. & Roth, A. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810 (2018).
  • [31] McCradden, M. D., Joshi, S., Mazwi, M. & Anderson, J. A. Ethical limitations of algorithmic fairness solutions in health care machine learning. The Lancet Digital Health 2, e221–e223 (2020).
  • [32] Rundo, L., Pirrone, R., Vitabile, S., Sala, E. & Gambino, O. Recent advances of hci in decision-making tasks for optimized clinical workflows and precision medicine. Journal of biomedical informatics 108, 103479 (2020).
  • [33] Chen, I. Y. et al. Ethical machine learning in healthcare. Annual review of biomedical data science 4, 123–144 (2021).
  • [34] Gervasi, S. S. et al. The potential for bias in machine learning and opportunities for health insurers to address it. Health Affairs 41, 212–218 (2022).

Appendix A Supplemental Information

A.1 Ethics

For our human subject experiments described in our results with the care managers, we obtained an exempt evaluation from our IRB. Our IRB judged that our research activities meet the criteria for exemption as defined by Federal regulation 45 CFR 46 under the following:

  • Exempt Category 3 - Benign Behavioral Intervention: Research involving benign behavioral interventions where the study activities are limited to adults only and disclosure of the subjects’ responses outside the research could not reasonably place the subjects at risk for criminal or civil liability or be damaging to the subjects’ financial standing, employability, educational advancement, or reputation. Research does not involve deception or participants prospectively agree to the deception. 45 CFR 46.104(d)(3)

  • Exempt Category 2 - Educational Testing, Surveys, Interviews or Observation: Research involving surveys, interviews, educational tests or observation of public behavior with adults or children and disclosure of the subjects’ responses outside the research could not reasonably place the subjects at risk for criminal or civil liability or be damaging to the subjects’ financial standing, employability, educational advancement, or reputation. Research activities with children must be limited to educational tests or observation of public behavior and cannot include direct intervention by the investigator. 45 CFR 46.104(d)(2)

A.2 Dataset Creation Algorithm for Pregnancy Identification

We build on [12], which presents an algorithm for inferring pregnancy episodes across a set of pregnancy outcomes in the OMOP Common Data Model. Our modified algorithm can handle a larger set of pregnancy outcomes, e.g., neonatal ICU admission, by doing a forward search to update the outcome once the pregnancy episode is identified. We describe our modified version in Algorithm 1. We illustrate the algorithm in Figure 5 and present a subset of target codes for reference in Table 4.

Figure 5: Illustration of the pregnancy cohort selection algorithm (Algorithm 1). First, the most recent pregnancy outcome is detected (red point), referencing outcome codes defined in [12]. Then, we search for pregnancy start code(s) (blue point(s)) within a specified lookback window for the corresponding outcome [12] (blue brackets); the earliest start code marks the start of that pregnancy episode. Finally, we do a forward search for any additional pregnancy outcome or complications, referencing additional outcome codes compiled internally at IBC (orange point); if one exists, the pregnancy outcome is updated.

Member B is excluded from the cohort since no pregnancy start code was detected within the lookback window. Member C is excluded since there was no associated pregnancy outcome code; amenorrhea alone cannot indicate pregnancy has started since it can be caused by non-pregnancy-related factors (e.g. stress, menopause).
Figure 6: Labeling start and end of each pregnancy episode. (a) For pregnancies without complications, we set the start of gestation to be 40 weeks prior to when the outcome code is observed ($t_{start} = t'_{end} - 40\text{ weeks}$), assuming a full term pregnancy. (b) For pregnancies with complications, we set the start of gestation to be the date of pregnancy start code ($t_{start} = t'_{start}$). These points give us a reference frame for data sampling, labeling, and evaluation inside and outside of pregnancy.
| Outcome | Target Codes | Pregnancy ID? | Risk Factors? |
|---|---|---|---|
| Neonatal Intensive Care Unit (NICU) | Newborn light for gestational age; Low birth weight infant; Birth injury to central nervous system; Respiratory distress syndrome in the newborn; Pulmonary hypertension of newborn | X | X |
| Hypertension/Pre-eclampsia (HPPE) | Pre-existing hypertension in obstetric context; Transient hypertension of pregnancy; Renal hypertension complicating pregnancy; Severe pre-eclampsia; Gestational proteinuria | X | X |
| Pre-term birth | Preterm premature rupture of membranes; Fetal or neonatal effect of maternal premature rupture of membrane; Baby premature, 24-26 weeks; Extreme immaturity, 750-999 grams; Metabolic bone disease of prematurity | X | X |
| Gestational Hypertension | Unspecified maternal hypertension; Gestational [pregnancy-induced] hypertension; Hypertension, Pregnancy-Induced; gestational proteinuria; Mild to moderate pre-eclampsia | | X |
| Gestational Diabetes | Gestational diabetes mellitus in childbirth; Diabetes mellitus arising in pregnancy; Gestational diabetes mellitus in the puerperium; Gestational diabetes mellitus complicating pregnancy; Maternal gestational diabetes mellitus | | X |

Table 4: Pregnancy outcomes and examples of corresponding target codes and indicators of whether the outcome was included in the second pass search during cohort creation for pregnancy identification and pregnancy risk factors.
Algorithm 1 Building pregnant cohort.
for $i \in P$ do
     // Detect and classify most recent pregnancy outcome (first pass)
     $t_{out}^{i}, outcome^{i} \leftarrow getPregnancyOutcome(\mathcal{H}^{i})$

     // Backtrack to estimate pregnancy start
     $t_{start,min} \leftarrow t_{out}^{i} - g_{max}^{outcome}$  ▷ Lower bound for pregnancy start
     $t_{start,max} \leftarrow t_{out}^{i} - g_{min}^{outcome}$  ▷ Upper bound for pregnancy start
     $t_{start}^{i} \leftarrow estimatePregnancyStart(t_{start,min}, t_{start,max})$

     // Forward search to update pregnancy outcome (second pass)
     $t_{out}^{i}, outcome^{i} \leftarrow updatePregnancyOutcome(t_{start}^{i}, t_{out}^{i})$
end for
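
The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the `Claim` record, the code sets, the gestation bounds, and the 30-day `FORWARD_WINDOW` are hypothetical placeholders, and taking the earliest start code inside the lookback window follows the description in the Figure 5 caption.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Claim:
    day: date
    code: str


# Hypothetical placeholders: the real outcome codes and gestation bounds come
# from [12] and from internally compiled lists, which are not reproduced here.
FIRST_PASS_OUTCOMES = {"LIVE_BIRTH": (161, 301)}  # code -> (g_min, g_max) in days
START_CODES = {"PRENATAL_VISIT"}
SECOND_PASS_OUTCOMES = {"NICU"}
FORWARD_WINDOW = timedelta(days=30)  # assumed extent of the forward search


def build_episode(history: List[Claim]) -> Optional[dict]:
    """Return start, end and outcome of the most recent episode, or None."""
    # First pass: detect the most recent pregnancy outcome.
    outcomes = [c for c in history if c.code in FIRST_PASS_OUTCOMES]
    if not outcomes:
        return None  # no outcome code: excluded from the cohort
    last = max(outcomes, key=lambda c: c.day)
    g_min, g_max = FIRST_PASS_OUTCOMES[last.code]

    # Backtrack: bounds on the pregnancy start, then the earliest start code.
    t_start_min = last.day - timedelta(days=g_max)
    t_start_max = last.day - timedelta(days=g_min)
    starts = [c.day for c in history
              if c.code in START_CODES and t_start_min <= c.day <= t_start_max]
    if not starts:
        return None  # no start code in the lookback window: excluded
    t_start = min(starts)

    # Second pass: forward search for an additional outcome or complication.
    t_out, outcome = last.day, last.code
    later = [c for c in history
             if c.code in SECOND_PASS_OUTCOMES
             and t_start <= c.day <= last.day + FORWARD_WINDOW]
    if later:
        newest = max(later, key=lambda c: c.day)
        t_out, outcome = max(t_out, newest.day), newest.code
    return {"start": t_start, "end": t_out, "outcome": outcome}
```

Returning `None` in the two exclusion cases mirrors the members that Figure 5 shows being dropped from the cohort.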

We build a cohort of patients who were never pregnant throughout their claims history. We sample these patients according to the age distribution of pregnant members (mean: 31.8 years, standard deviation: 4.8 years) and define “never pregnant” to be any member who does not have any of the pregnancy start or outcome concept codes present in their claims history.

A.3 Dataset Creation Algorithm for Pregnancy Complication Prediction

In Algorithm 1, the first pass, which searches for the most recent pregnancy outcome, references the original pregnancy outcomes and corresponding target codes defined in [12]. In the second pass, which performs a second search to update the previous outcome, we reference target codes for additional outcomes. We present a subset of target codes for these outcomes, and an indicator for when they are used, in Table 4.

We queried for pregnancy episodes with a gestational diabetes ICD-10 code (O24.11-O24.93) using ATLAS [ohdsi-atlas]. We then filtered for unique diagnosis codes within those episodes and selected the most frequently occurring diagnosis codes as the initial set of target codes for gestational diabetes outcomes. The same procedure was repeated for gestational hypertension/pre-eclampsia (ICD-10 codes O10.011-O16.9). We validated the code set with the care management nurses, who hand-labeled outcome codes for a subset of 20 members, given data up to the end of the pregnancy episode. This allowed us to (1) validate that the existing codes are indicative of the corresponding outcome, and (2) find new codes indicative of an outcome. For example, Methyldopa 250 MG Oral Tablet, an anti-hypertensive drug, was added as a code for gestational HT/PE.

As in pregnancy identification, we generate non-temporal and temporal features for each sampled point. For temporal data, we generate windowed features over 30-day, 180-day, 365-day, 730-day, and 10k-day windows using omop-learn for the following categories: medical conditions, prescriptions, procedures, specialty visits, and labs. We also include 12 non-temporal features, among them age, race, and gender. This gives us a feature set of 112,322 features.
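
To make the idea of windowed features concrete, the sketch below counts how often each (category, code) pair occurs within each lookback window before a prediction date. It is written in plain Python and is not the omop-learn API; the function name `windowed_counts`, the event tuple layout, and the use of counts (rather than, say, binary indicators) are assumptions made for illustration.

```python
from collections import Counter
from datetime import date

# Window lengths in days, as listed in the text (10k days covers all history).
WINDOWS_DAYS = (30, 180, 365, 730, 10_000)


def windowed_counts(events, index_day, windows=WINDOWS_DAYS):
    """Count events per (window, category, code) before the prediction date.

    events: iterable of (day, category, code) tuples, e.g.
            (date(2021, 1, 1), "labs", "3023314").
    """
    feats = Counter()
    for day, category, code in events:
        age = (index_day - day).days
        if age < 0:
            continue  # events after the prediction date must not leak in
        for w in windows:
            if age <= w:
                feats[(w, category, code)] += 1
    return feats
```

Because every event that falls inside a short window also falls inside all longer ones, the longer windows summarize overall history while the shorter ones capture recent activity.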

A.4 Hyperparameter Selection for Machine Learning Models

For the pregnancy identification LASSO model, we report the hyperparameter search space in Table 5. We select the model with the highest validation accuracy. The decision threshold is chosen to maximize the geometric mean of sensitivity and specificity on the validation set.

| Hyperparameter | Search Range |
|---|---|
| Regularization strength (C) | 1e-3, 7.5e-4, 5e-4, 2.5e-4*, 1e-4 |
| Tolerance | 1*, 1e-1, 1e-2, 1e-3, 1e-4 |

Table 5: Hyperparameter search range for pregnancy identification model. Asterisk marks the chosen hyperparameters.
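
One common reading of the threshold rule for this model is to pick, on the validation set, the cutoff that maximizes the geometric mean of sensitivity and specificity. The sketch below illustrates that reading in plain Python; the function name `gmean_threshold` and the choice of candidate thresholds (the distinct validation scores) are assumptions, not details taken from the study.

```python
from math import sqrt


def gmean_threshold(y_true, scores):
    """Return (threshold, g_mean) maximizing sqrt(sensitivity * specificity).

    y_true: 0/1 labels; scores: predicted probabilities, same length.
    A score >= threshold is treated as a positive prediction.
    """
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes are needed to compute the G-mean")

    best_tau, best_g = None, -1.0
    for tau in sorted(set(scores)):  # candidate cutoffs, ascending
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= tau)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < tau)
        g = sqrt((tp / pos) * (tn / neg))
        if g > best_g:  # ties keep the lower threshold
            best_tau, best_g = tau, g
    return best_tau, best_g
```

The geometric mean rewards thresholds that balance the two error types, which matters when pregnant members are a small share of the population.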

For the pregnancy complications models, we report the hyperparameter search space in Table 6. Note that we also correct for class imbalance by weighting each class $j$ by $p(y_j)^{-1}$, where $p(y_j)$ is the proportion of outcomes under class $j$ in the training set. We select the model with the highest product of AUROC and accuracy on the validation set.

| Model | Hyperparameter | Search Range |
|---|---|---|
| LASSO | Regularization strength (C) | 1, 1e-1, 1e-2, 1e-3*, 1e-4 |
| LASSO | Tolerance | 1e-1, 1e-2, 1e-3*, 1e-4 |
| ELASTIC-NET | L1-ratio | 0.25*, 0.5, 0.75 |
| ELASTIC-NET | Tolerance | 1e-1*, 5e-2 |
| XGBOOST | Learning rate | 1e-1*, 1e-2, 1e-3, 1e-4 |

Table 6: Hyperparameter search range for pregnancy risk model. Asterisk marks the chosen hyperparameters.
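
The class-imbalance correction used for the pregnancy complications models, weighting each class by the inverse of its proportion in the training set, can be written in a few lines. The sketch below is illustrative only; the function names are hypothetical, and how the weights are passed to a particular learner depends on that library.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class j by 1 / p(y_j), with p(y_j) its training proportion."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / count for cls, count in counts.items()}


def sample_weights(labels):
    """Expand the per-class weights to one weight per training example."""
    w = inverse_frequency_weights(labels)
    return [w[y] for y in labels]
```

With these weights, each class contributes the same total weight to the training loss regardless of how rare it is.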

A.5 Algorithm For Pregnancy Identification

We formally describe our pregnancy identification algorithm continuing on from the Methods section.

We combine the anchors and the Lasso model into a hybrid model $f(X_t)$ that does the following: if there exists a pregnancy start code only, then $f(X_t) = 1$; if there is a pregnancy end code, then $f(X_t) = 0$; otherwise, $f(X_t)$ follows the prediction of the Lasso model. After we obtain the predictions $f(X_t)$ at time $t$, we pass them to an exponential moving average filter, which smooths the predictions over the last 5 time points with a decay factor of $1/3$ to give a result $\hat{q}$. We then binarize $\hat{q}$ with a learned threshold $\tau$, chosen to maximize the geometric mean of the F1-score of the pregnancy predictions, to obtain a binary prediction $\hat{y}$. This process is performed at each time stamp of the member’s data. We predict the pregnancy start date as the first time point at which $\hat{y}$ is $1$ and we have two consecutive increasing scores $\hat{q}$; similarly, we predict the pregnancy end date as the time point at which $\hat{y}$ is $0$ and we have two consecutive decreasing scores $\hat{q}$ (given that we have already predicted the start).
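
The hybrid rule and the smoothing step described above can be sketched as follows. The exact form of the exponential moving average is not spelled out in the text, so the version here is one possible reading: a normalized weighted average over the last 5 time points in which the weight of a point $k$ steps back is $(1/3)^k$. The function names and that weighting are assumptions.

```python
def hybrid_prediction(has_start_code, has_end_code, lasso_prob):
    """Anchor codes override the Lasso model; otherwise use its prediction."""
    if has_end_code:
        return 0.0  # a pregnancy end code forces a negative prediction
    if has_start_code:
        return 1.0  # a start code with no end code forces a positive one
    return lasso_prob


def ema(values, window=5, decay=1 / 3):
    """Smooth a series with exponentially decaying weights (assumed form).

    The point k steps back gets weight decay**k; only the last `window`
    points are used, and the weights are normalized to sum to one.
    """
    smoothed = []
    for t in range(len(values)):
        num = den = 0.0
        for k in range(min(window, t + 1)):
            w = decay ** k
            num += w * values[t - k]
            den += w
        smoothed.append(num / den)
    return smoothed


def binarize(q, tau):
    """Turn smoothed scores into binary predictions with threshold tau."""
    return [1 if x >= tau else 0 for x in q]
```

Smoothing before thresholding keeps a single noisy score from flipping the prediction, at the cost of reacting slightly later to genuine changes.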

Algorithm 2 Inferring pregnancy start and end for each member.
for $i \in P$ do
     $\hat{p} \leftarrow f(X_i)$  ▷ predict probability of pregnancy over time
     $\hat{q} \leftarrow \mathrm{EMA}(\hat{p})$  ▷ smooth with exponential moving average filter
     $\hat{y} \leftarrow \texttt{predict}(\hat{q})$  ▷ returns binary predictions
     $\hat{start}_i, \hat{end}_i \leftarrow \mathrm{InferEpisode}(\hat{q}, \hat{y})$  ▷ infer pregnancy start and end (see Alg. 3)
end for
Algorithm 3 Inferring pregnancy start and end, given smoothed probability and predictions over time ($\hat{q}, \hat{y}$).
isStart = True; $l$ = len($\hat{q}$)
start, end = None, None
for $t = 0 : l-2$ do
     if (isStart) and ($\hat{y}[t] == 1$) and ($\hat{q}[t] < \hat{q}[t+1]$) then
         // set pregnancy start if we have +ve prediction and increasing probability
         start $\leftarrow t+1$; isStart $\leftarrow$ False
     else if (not isStart) and ($\hat{y}[t] == 0$) and ($\hat{q}[t] > \hat{q}[t+1]$) then
         // set pregnancy end if we have -ve prediction and decreasing probability
         end $\leftarrow t+1$
     end if
     // use code-based prediction by default if pregnancy start is before
     // 1 month after true pregnancy start (and we are simulating nurses filtering)
     if nurseFilter and start < trueStart + deltaMonth then
         start $\leftarrow$ codeStart
     end if
     // set start and end to be the earliest value (code-based or model-based)
     start $\leftarrow$ min(codeStart, start)
     end $\leftarrow$ min(codeEnd, end)
end for
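
As a rough illustration of the core loop of Algorithm 3, the following Python sketch infers one episode from the smoothed scores and binary predictions. It deliberately omits the nurse-filter simulation branch. It also makes two assumptions that the pseudocode does not state explicitly: the first qualifying end point is kept rather than overwritten by later ones, and a missing (`None`) code-based or model-based date is ignored when taking the earliest value. The names `infer_episode` and `earliest` are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def infer_episode(q: Sequence[float], y: Sequence[int]) -> Tuple[Optional[int], Optional[int]]:
    """Return (start, end) indices of one pregnancy episode, or None where not found."""
    start, end = None, None
    looking_for_start = True
    for t in range(len(q) - 1):            # t = 0, ..., l - 2
        if looking_for_start and y[t] == 1 and q[t] < q[t + 1]:
            # positive prediction with an increasing score: mark the start
            start = t + 1
            looking_for_start = False
        elif not looking_for_start and y[t] == 0 and q[t] > q[t + 1]:
            # negative prediction with a decreasing score: mark the end
            end = t + 1
            break                          # assumption: keep the first end found
    return start, end


def earliest(code_based: Optional[int], model_based: Optional[int]) -> Optional[int]:
    """Earliest of the code-based and model-based time points, ignoring missing ones."""
    candidates = [v for v in (code_based, model_based) if v is not None]
    return min(candidates) if candidates else None
```

A caller would combine the two sources with, for example, `earliest(code_start, start)` and `earliest(code_end, end)`, where `code_start` and `code_end` are the time indices of the anchor codes, if any.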

A.6 Additional Results for Pregnancy Identification Retrospective Evaluation

We include additional results for our retrospective evaluation of the pregnancy identification algorithm.

Feature name
2213418 - procedure - Immunization administration (includes percutaneous, intradermal, subcutaneous, or intramuscular injections); 1 vaccine (single or combination vaccine/toxoid)
2212167 - labs - Urinalysis, by dip stick or tablet reagent for bilirubin, glucose, hemoglobin, ketones, leukocytes, nitrite, pH, protein, specific gravity, urobilinogen, any number of these constituents; non-automated, without microscopy
2108115 - procedure - Collection of venous blood by venipuncture
3050479 - labs - Immature granulocytes/100 leukocytes in Blood
2212996 - labs - Culture, bacterial; quantitative colony count, urine
3033575 - labs - Monocytes [#/volume] in Blood by Automated count
3023314 - labs - Hematocrit [Volume Fraction] of Blood by Automated count
3014576 - labs - Chloride [Moles/volume] in Serum or Plasma
38004461 - specialty - Obstetrics/Gynecology
3015746 - labs - Specimen source identified
Table 7: Top positive features surfaced for non-pregnant members who were inferred to be pregnant.
Figure 7: Histogram of pregnancy identification delays for pregnancies without complications for HAPI compared to the codes. We measure the difference in days between the predicted start date and the actual start date for our model HAPI compared to a set of predefined pregnancy start codes (anchor codes). In subfigure (a) we show the histogram of differences on the entire test set, where the two distributions overlap. In subfigure (b) we show the subset of the test set where HAPI is earlier than the codes.
(a) On the entire test set.
(b) On the subset of the test set where HAPI outperforms the anchor codes.

A.7 Additional Results for Pregnancy Complication Prediction Retrospective Evaluation

We include additional results for our retrospective evaluation of the pregnancy complications algorithm.

In Table 8 we compare our proposed predictor GROUP-Lasso, which conditions on the patient's prior history of disease and predicts using a separate Lasso model for each sub-group, with the global Lasso model. Modeling outcomes for separate groups increases accuracy, as the predictions become better calibrated, but sacrifices ranking ability in terms of AUC. The advantage of GROUP-Lasso is that the features surfaced as explanations by the sub-group models show information beyond prior history that may be useful for the care managers. Therefore, we use the global Lasso model to make predictions but use GROUP-Lasso to surface features as an explanation.
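
The split between a global model for prediction and sub-group models for explanation can be illustrated with a small Python sketch. This is not the authors' code. It assumes that each Lasso model is a sparse logistic model represented by a dictionary of feature weights, and that explanations are the features with the largest absolute contribution (weight times value) under the sub-group model. All feature names, weights, and function names below are made up for illustration.

```python
import math
from typing import Dict, List, Tuple


def predict_risk(features: Dict[str, float], weights: Dict[str, float], bias: float = 0.0) -> float:
    """Risk score from a sparse linear model passed through a logistic link (assumed form)."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def top_features(features: Dict[str, float], weights: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Features ranked by absolute contribution; zero contributions are dropped."""
    contributions = [(name, weights.get(name, 0.0) * value) for name, value in features.items()]
    nonzero = [c for c in contributions if c[1] != 0.0]
    return sorted(nonzero, key=lambda c: abs(c[1]), reverse=True)[:k]


def predict_with_explanation(features, subgroup, global_weights, group_weights, k=3):
    """Score with the global model, explain with the model of the member's sub-group."""
    risk = predict_risk(features, global_weights)
    evidence = top_features(features, group_weights[subgroup], k)
    return risk, evidence


# Toy example with made-up weights (not taken from the paper).
member = {"obesity": 1.0, "age_over_35": 1.0, "prior_db": 1.0}
global_w = {"prior_db": 2.0}
group_w = {"DB": {"obesity": 0.8, "age_over_35": 0.3}}
risk, evidence = predict_with_explanation(member, "DB", global_w, group_w)
```

In this toy setup the global model relies on prior history alone, while the sub-group model for members with a history of DB surfaces other features, which mirrors the motivation given above.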

Figure 8: Distribution of earliest risk predictions of the Lasso model for members who have pregnancy complications of either gestational hypertension or diabetes. Each member is placed in the bucket corresponding to the time at which the Lasso model first predicts a complication, or in a separate bucket if it never makes such a prediction.
Figure 9: Distribution of earliest risk predictions for members at risk of (a) both gestational DB and HT, (b) only gestational DB, and (c) only gestational HT.
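| Subgroup | Metric | GROUP-Lasso | Lasso |
| --- | --- | --- | --- |
| History of DB | AUROC | 0.675 | 0.706 |
| | Accuracy | 0.622 | 0.570 |
| History of HT | AUROC | 0.6573 | 0.708 |
| | Accuracy | 0.708 | 0.647 |
| History of DB+HT | AUROC | 0.635 | 0.757 |
| | Accuracy | 0.624 | 0.568 |
| No history of DB/HT | AUROC | 0.596 | 0.667 |
| | Accuracy | 0.793 | 0.780 |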
Table 8: Evaluation metrics for each subgroup, comparing the subgroup model GROUP-Lasso to the global model Lasso. We partition the training and test sets based on each member's prior history of gestational diabetes (DB) or gestational hypertension (HT). We train a single Lasso model (Lasso) on all training partitions and evaluate it on each test partition. We also train 4 separate Lasso models (GROUP-Lasso), one on each training partition, and evaluate each on the corresponding test partition.

A.8 Additional Details for User Studies

We include additional details of our user studies.

(a) Pregnancy Identification Interface (b) Pregnancy Complications Interface
Figure 10: Patient dashboard sketch for the user study on (a) pregnancy identification with HAPI and (b) pregnancy complications classification. In sub-figure (a), the user interface consists of a left panel containing demographic information and three views: Overview, Visits, and Model Explanation (evidence). We show the Diseases/Conditions subtab of the Overview view, where the nurse can find the ICD codes for each condition and disease. In sub-figure (b), we update the user interface based on feedback from the care managers by integrating the model predictions and evidence into the Visits and Overview views. The left panel shows patient information as well as the model prediction and history of prior complications. The figure shows the Diseases/Conditions view, in which ICD codes that are positively associated with complications are colored red (intensity varies with correlation) and those negatively associated with complications are colored green.
(a)
| Trial | Members with information beyond prior knowledge | Examples |
| --- | --- | --- |
| A | 55.6% (10 members) | oligohydramnios, premature delivery, blood clot, cervical issues, large baby, fetal hereditary disease, cancer, abnormal heart rate, first pregnancy |
| B | 33.3% (6 members) | mental health / potential for postpartum depression, fetal abnormality, obesity, cervical issue, first pregnancy |
| C | 66.7% (12 members) | elevated glucose during current pregnancy / abnormal glucose code, h/o premature delivery, first pregnancy, twins, polycystic ovary syndrome, cervical incompetence (risk for preterm birth), elevated protein labs in current pregnancy |

(b)
| Trial | Members with information beyond prior knowledge | Examples |
| --- | --- | --- |
| A | 55.6% (10 members) | previous retained placenta, home injections, pulmonary embolism, pre-term delivery, thalassemia, elevated glucose, asthma, hypothyroidism, infertility, uterine leiomyoma, anemia, musculoskeletal disease, polycystic ovary syndrome, methadone |
| B | 27.8% (5 members) | asthma, pre-term delivery, hypothyroidism, obesity, fibroids, infertility |
| C | 61.1% (11 members) | fibroids, previous losses, pre-term delivery, thyroid disease, cardiac murmur, Rhesus negative, obesity, cardiac concern, hypothyroidism, infertility |

Table 9: Summary of members whose nurse notes contain information beyond prior knowledge (age, race, prior history of DB/HT) for nurse 1 (a) and nurse 2 (b).
| Category | Sub-category | # of members | Simulation start date range |
| --- | --- | --- | --- |
| Pregnant members detected by model | Detected early within reasonable time (at least 1 month after $t_{start}$) | 2 | $[t_{start}^{*} - 3\text{ weeks},\ t_{start}^{*}]$ |
| | Detected too early (before 1 month after $t_{start}$) | 2 | $[t_{start}^{*} - 3\text{ weeks},\ t_{start}^{*}]$ |
| Pregnant members detected by code | | 4 | $[t_{start}^{*} - 1\text{ week},\ t_{start}^{*} + 2\text{ weeks}]$ |
| Non-pregnant members | Detected not pregnant | 3 | $[\tau^{0},\ \tau' - 5\text{ weeks}]$ |
| | Detected pregnant | 1 | $[t_{start}^{*} - 3\text{ weeks},\ t_{start}^{*}]$ |

Table 10: Distribution of members and range of simulation start dates for pregnancy identification study. Note that for members detected pregnant, we sample within a fixed window around $t_{start}^{*}$, the pregnancy start time inferred by the algorithm; this ensures that the pregnancy prediction triggers during the 5-week simulation period. For members detected not pregnant, we sample from $\tau^{0}$, the time of the first sampled data point, to $\tau' - 5\text{ weeks}$, 5 weeks prior to the last sampled data point, so there is sufficient data for simulation.
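
The sampling windows in Table 10 can be expressed as a short Python sketch. It is only an illustration under stated assumptions: dates are sampled uniformly at day granularity (the sampling distribution is not specified in the text), and the category labels and function names are hypothetical.

```python
import random
from datetime import date, timedelta
from typing import Optional

WEEK = timedelta(weeks=1)


def sample_between(lo: date, hi: date, rng: random.Random) -> date:
    """Sample a date uniformly from the closed interval [lo, hi] (assumed uniform)."""
    span_days = (hi - lo).days
    if span_days < 0:
        raise ValueError("empty sampling window")
    return lo + timedelta(days=rng.randint(0, span_days))


def simulation_start(category: str, rng: random.Random,
                     t_start: Optional[date] = None,
                     first_point: Optional[date] = None,
                     last_point: Optional[date] = None) -> date:
    """Pick a simulation start date using the window that Table 10 lists for the category."""
    if category in ("model_detected", "nonpregnant_detected_pregnant"):
        # window [t*_start - 3 weeks, t*_start]
        return sample_between(t_start - 3 * WEEK, t_start, rng)
    if category == "code_detected":
        # window [t*_start - 1 week, t*_start + 2 weeks]
        return sample_between(t_start - WEEK, t_start + 2 * WEEK, rng)
    if category == "nonpregnant_detected_not_pregnant":
        # window [tau^0, tau' - 5 weeks], leaving 5 weeks of data to simulate
        return sample_between(first_point, last_point - 5 * WEEK, rng)
    raise ValueError(f"unknown category: {category}")
```

Passing a seeded `random.Random` instance keeps the sampled dates reproducible across runs.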
| Outcome | Correct Prediction? | Prior History? | Number of Members |
| --- | --- | --- | --- |
| Gestational DB | Yes | No DB history | 3 |
| Gestational HT | Yes | No HT history | 3 |
| No complication | Yes | No DB or HT history | 3 |
| Gestational DB | No | No DB history | 1 |
| Gestational HT | No | No HT history | 1 |
| No complication | No | No DB or HT history | 1 |
| Gestational DB | Yes | DB history | 1 |
| Gestational HT | Yes | HT history | 1 |
| No complication | Yes | DB+HT history | 1 |
| Gestational DB | No | DB history | 1 |
| Gestational HT | No | HT history | 1 |
| No complication | No | DB+HT history | 1 |

Table 11: Distribution of members for pregnancy risk factor study. We sample members evenly across the three outcomes, with more members without prior history to reflect the overall distribution. We include both correct and incorrect predictions to evaluate how well nurses filter errors.