Closing the Gap in High-Risk Pregnancy Care Using Machine Learning and Human-AI Collaboration
Abstract
A high-risk pregnancy is a pregnancy complicated by factors that can adversely affect the outcomes of the mother or the infant. Health insurers use algorithms to identify members who would benefit from additional clinical support. This work presents the implementation of a real-world ML-based system to assist care managers in identifying pregnant patients at risk of complications. In this retrospective evaluation study, we developed a novel hybrid-ML classifier to predict whether patients are pregnant and trained a standard classifier using claims data from a health insurance company in the US to predict whether a patient will develop pregnancy complications. These models were developed in cooperation with the care management team and integrated into a user interface with explanations for the nurses. The proposed models outperformed commonly used claim codes for the identification of pregnant patients at the expense of a manageable false positive rate. Our risk complication classifier shows that we can accurately triage patients by risk of complication. Our approach and evaluation are guided by human-centric design. In user studies with the nurses, they preferred the proposed models over existing approaches.
1 Introduction
High-risk pregnancy is a pregnancy complicated by factors that can adversely affect the health outcomes of the mother, fetus, or infant. Pregnancy complications like gestational diabetes, hypertension, and pre-eclampsia can lead to childbirth complications such as eclampsia, cardiomyopathy, and embolism and result in adverse pregnancy outcomes, including preterm birth, HELLP syndrome, and intrauterine fetal death. In 2018, pregnancy and childbirth complications affected 19.6% and 1.7% of pregnancies, respectively, in the U.S. [1]. Moreover, systemic disparities in pregnancy and childbirth complications are well-documented. Black women are significantly more likely to develop preeclampsia and more than three times more likely to die from pregnancy-related complications than White women [2].
Fortunately, timely and appropriate clinical intervention can effectively manage complications during pregnancy and reduce maternal, fetal, and neonatal morbidity and mortality [3, 4, 5, 6, 7]. Health plan-operated care management programs for high-risk pregnancies aim to coordinate care for at-risk patients across their clinical care team, educate patients about their conditions and medications, and support them in managing those conditions [8, 9, 10, 11].
Objective.
In this work, we collaborate with the High-Risk Pregnancy (HRP) care management team at an Anonymized Health Insurance Company (AIC) in the US. We aim to improve the member identification process, in which nurse case managers review relevant clinical information and decide which members are most appropriate for the HRP program. The process begins with ML algorithms and clinical decision rules that identify pregnant and at-risk members from medical claims; these members are then served to nurse case managers for review and final determination of program eligibility and appropriateness. Automated mechanisms for patient risk identification and stratification are critical to efficiently identify pregnant and at-risk patients from a large patient population. We conducted structured interviews with the care managers to understand the identification and stratification process and discover opportunities to improve it. These conversations highlighted three problems: patients surfaced for evaluation are often no longer pregnant, many have a low risk of pregnancy complications, and nurses lack insight into why patients are being surfaced.
Our first task was to improve the latency with which pregnant patients are identified. Our second task was to accurately identify patients at high risk for pregnancy complications. However, not all complications of pregnancy can be effectively remediated through telephonically delivered care management. Following the care managers’ recommendations, we focused on gestational diabetes and gestational hypertension, the conditions for which the outreach and education delivered in the HRP program would be most impactful.
Contributions.
This paper presents a recipe for developing automated systems for high-risk pregnancy management programs, from dataset creation to model training and evaluation. We first outline how to build datasets from available patient data to train models for pregnancy identification and detection. We developed a novel Hybrid Algorithm for Pregnancy Identification (HAPI) that combines manual code lists with machine learning models. We then train a classifier that predicts the patient’s risk for developing complications at each point in their pregnancy. We integrate these models into a user-friendly interface for nurses to use. We retrospectively evaluate the individual classifiers on over 30k patients, showing we can identify pregnant members earlier on average than predefined code lists and can triage members by risk of complication with an AUC of 0.76. User studies with nurses confirm that the new interface is preferred over existing implementations.
More broadly, we believe our work serves as an important demonstration of human-centric design for ML in healthcare and will be a useful guide for future work in the field.
2 Related Work
Much of the existing literature on pregnancy identification focuses on retrospective identification of pregnancy episodes [12, 13, 14, 15]. Our goal was to identify pregnancy in a near real-time fashion as information about the patient becomes available through medical and pharmacy claims, lab results, authorizations, and admit, discharge, and transfer data. To the best of our knowledge, this is the first work that accomplishes this objective. Although there is extensive literature on predicting pregnancy complications using machine learning [16, 17, 18, 19, 20], we focus specifically on gestational hypertension and diabetes and on making risk predictions as early as possible. While we are aware that certain deep learning architectures perform well for our task, practical considerations limit us to the use of linear classifiers, which perform relatively well. Our approach is to build separate machine learning models for pregnancy start and end identification and for risk of pregnancy complications. When deploying machine learning models in the clinical setting, it is important to provide a rationale for predictions to gain clinicians’ trust and help them make informed decisions [21, 22, 23, 24]. We discuss other relevant prior work in the remaining sections.
3 Methods
3.1 Dataset Creation For Pregnancy Start and End Identification
Our approach for identifying the start and end of a patient’s pregnancy is based on a machine learning predictor. Since no publicly available data is well suited to this task, we built our own training dataset from AIC’s members only. We construct a cohort of female patients aged 18 to 48 who had pregnancies with and without complications between 2004 and 2021 but eventually had a live birth. We also construct a matching cohort of never-pregnant female patients according to the age distribution of the pregnant sub-cohort.
To identify pregnant patients for training the pregnancy-start model, we use a modified version of the algorithm of Matcho et al. [12] and select only patients who had a healthy live birth. The original algorithm retrospectively infers the start and end of the most recent pregnancy episode and the corresponding pregnancy outcome or complication. In contrast, our approach identifies gestational episodes in real time. We restrict the cohort to live births because this allows us to reliably identify the pregnancy start date: for pregnancies with a live birth without complications, we set the start date to 40 weeks before the end date of pregnancy. For pregnancies with complications, we set the start date to be the first date of occurrence of a pregnancy start code. The overall dataset consisted of 36,735 patients with an average age of 32.3 years, divided into three subgroups: 22.6% pregnancies without complications, 62.4% pregnancies with complications, and 15.0% never pregnant.
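The start-date labeling rule described above can be written down compactly. The following is a minimal sketch, not the authors' implementation; the function name, its arguments, and the handling of a missing start code are illustrative assumptions.

```python
from datetime import date, timedelta

def pregnancy_start_label(end_date, has_complication, start_code_dates):
    """Derive a pregnancy start label for one episode (hypothetical helper).

    end_date:         date on which the pregnancy ended (live birth).
    has_complication: True if the episode carries a complication outcome.
    start_code_dates: dates on which a pregnancy start code was recorded.
    """
    if not has_complication:
        # Uncomplicated live birth: assume a 40-week gestation.
        return end_date - timedelta(weeks=40)
    if not start_code_dates:
        # Assumption: without any start code the episode cannot be labeled.
        return None
    # Complicated pregnancy: first occurrence of a start code.
    return min(start_code_dates)

uncomplicated_start = pregnancy_start_label(date(2021, 1, 7), False, [])
complicated_start = pregnancy_start_label(
    date(2021, 1, 7), True, [date(2020, 6, 15), date(2020, 5, 20)]
)
```

Under these assumptions, the first call returns a date 280 days before delivery, and the second returns the earlier of the two recorded start-code dates.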
For pregnant patients, we extract weekly data from 20 weeks before the pregnancy starts to 20 weeks after the pregnancy ends, for a total of 80 weeks and therefore 80 data points per patient. This allows for early pregnancy and non-pregnancy indicators to be learned while avoiding signals from previous pregnancies. For never-pregnant patients, we sample 80 weeks of data around the midpoint of their medical history.
For each data point, we generate non-temporal and temporal features from medical data. For temporal data, we construct windowed features, which aggregate the data within a specified backward time window and map them to a binary indicator feature indicating whether the billing codes occurred or not during that time window. Windowed features for 5-day and 10-day windows are generated using omop-learn [25] for the following categories: medical conditions, prescriptions, procedures, specialty visits, and labs. We also include 12 non-temporal features, which include age, race, and gender. This gives us a feature set of 62,734 features.
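To make the windowed-feature construction concrete, the sketch below builds binary indicators over backward windows for a single patient and a single prediction date. It is an illustration in plain Python and does not use or reproduce the omop-learn API; the function name and the example codes are hypothetical.

```python
from datetime import date, timedelta

def windowed_indicators(events, as_of, window_days=(5, 10)):
    """Return {(code, window): 1} for codes seen within each backward window.

    events: iterable of (event_date, code) pairs for one patient.
    as_of:  date of the data point; only events on or before it are counted.
    """
    features = {}
    for days in window_days:
        window_start = as_of - timedelta(days=days)
        for event_date, code in events:
            # The window covers the `days` days ending at `as_of`, inclusive.
            if window_start < event_date <= as_of:
                features[(code, days)] = 1
    return features

# Hypothetical claim history for one patient.
events = [
    (date(2020, 3, 1), "condition:nausea"),
    (date(2020, 3, 7), "lab:hcg_test"),
]
features = windowed_indicators(events, as_of=date(2020, 3, 9))
```

Features that are absent from the returned dictionary are implicitly zero, which matches the sparse representation one would expect for a feature space of this size.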
For each subgroup, we split the data into a train set (50%), validation set (25%), and test set (25%) by patients, so no patient data is shared across the different splits. We aggregate all three sub-cohorts to construct the train, validation, and test splits.
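A patient-level split of this kind can be sketched as follows. The subgroup names, patient identifiers, and the fixed random seed are assumptions made for illustration; only the 50/25/25 proportions and the per-subgroup splitting come from the text.

```python
import random

def split_by_patient(patient_ids, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle unique patient IDs and cut them into train/validation/test lists."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# Hypothetical sub-cohorts keyed by subgroup name.
subgroups = {
    "no_complication": [f"a{i}" for i in range(40)],
    "complication": [f"b{i}" for i in range(100)],
    "never_pregnant": [f"c{i}" for i in range(20)],
}

# Split each subgroup separately, then aggregate the three sub-cohorts.
splits = {"train": [], "val": [], "test": []}
for ids in subgroups.values():
    train, val, test = split_by_patient(ids)
    splits["train"] += train
    splits["val"] += val
    splits["test"] += test
```

Because the split is made on patient identifiers rather than on weekly data points, all 80 data points of a patient land in the same partition.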
A summary of the dataset is provided in Table 1. Further details about the dataset creation are found in the Appendix.
Characteristics | Identification Dataset | Complications Dataset |
---|---|---|
No. of patients | 36,735 | 12,243 |
Race / Ethnicity (%) | 39.1% White, 5.7% Black, 3.4% Other (rest is unreported) | 43.8% White, 5.7% Black, 3.6% Other (rest is unreported) |
Average Age in years | 32.3 (±6.1) | 32.0 (±6.1) |
Pregnancy Complication % | 22.6% without complication, 62.4% with complications, 15.0% not pregnant | 73.6% without complication, 26.4% with complications divided into 16.9% with gestational hypertension and 9.4% gestational diabetes |
Dataset split | 50% training, 25% validation and 25% testing | 60% training, 20% validation and 20% testing |
Features generated | day windowed features and 12 non-temporal features | day windowed features and 12 non-temporal features |
Total number of features per patient data point | 112,322 | 62,734 |
3.2 Algorithm For Pregnancy Start and End Identification
We propose a Hybrid Algorithm for Pregnancy Identification (HAPI) that predicts at each week the probability that a patient is pregnant. The HAPI algorithm predicts a score in $[0, 1]$ of the likelihood of the patient being pregnant at each point in time $t$ using their features up to time $t$: $x_{1:t}$. HAPI first relies on a set of carefully chosen clinical codes that indicate either the start or end of pregnancy, denoted as ‘anchors’. Scanning each week of the patient’s data, if a code indicating the start of pregnancy is available, we set the start of pregnancy at the first week when the code is available; codes indicating the end of pregnancy are handled similarly. Otherwise, we use a Lasso regularized logistic regression model [26] that is trained with the objective of predicting whether the patient is currently pregnant from the features in the dataset. Importantly, we use the Anchor&Learn approach [27], where we remove the anchors from the feature set of the Lasso algorithm so that it focuses on signals not captured by the anchors. After we get the predictions of the Lasso model at time $t$ as $p_t$, we pass those predictions to an exponential moving average filter to smooth the predictions over time and obtain $\tilde{p}_t$. We then binarize the predictions using a learned threshold $\tau$ to obtain $\hat{y}_t$. We predict that the patient is pregnant at time $t$ if $\hat{y}_t = 1$ and we have two consecutive increases in $\tilde{p}_t$ (similarly for end-of-pregnancy prediction). The Lasso model is learned on the training set of the dataset previously described with hyperparameters chosen on the validation set. A formal description of the algorithm can be found in the Appendix.
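The decision logic for pregnancy-start detection can be illustrated with a short sketch. This is an interpretation of the description above and not the formal algorithm from the Appendix: the smoothing factor, the threshold value, and the choice to test for increases in the smoothed score are assumptions, and the weekly scores stand in for the output of a trained Lasso model.

```python
def ema(scores, alpha=0.5):
    """Exponential moving average of a sequence of weekly scores."""
    smoothed, prev = [], None
    for s in scores:
        prev = s if prev is None else alpha * s + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed

def predict_pregnancy_start(weekly_scores, start_anchor_weeks,
                            threshold=0.5, alpha=0.5):
    """Return the first week index flagged as pregnant, or None.

    weekly_scores:      model probabilities per week (anchors are assumed to
                        have been removed from the model's feature set).
    start_anchor_weeks: set of week indices with a start-of-pregnancy code.
    """
    smoothed = ema(weekly_scores, alpha)
    for t in range(len(weekly_scores)):
        if t in start_anchor_weeks:
            return t  # an anchor code takes precedence over the model
        two_increases = t >= 2 and smoothed[t] > smoothed[t - 1] > smoothed[t - 2]
        if smoothed[t] >= threshold and two_increases:
            return t
    return None

scores = [0.1, 0.1, 0.3, 0.7, 0.9, 0.9]
week = predict_pregnancy_start(scores, start_anchor_weeks=set())
```

Raising `threshold` in this sketch delays detection but suppresses spurious alerts, which mirrors the trade-off between false positive rate and detection latency discussed in the results.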
3.3 Dataset Creation for Pregnancy Complication Prediction
After pregnant patients are identified, we have to distinguish between those with a high and low likelihood of developing complications. The case management team identified gestational diabetes (GDB) and gestational hypertension (GHT) as specific complications that could be effectively managed within the HRP program. Our approach is to build a calibrated machine learning classifier that, given a patient’s data, can predict the risk of them developing either gestational diabetes or gestational hypertension. Moreover, the classifier can provide a list of the patient features that led to the prediction as a form of explanation. Since there exists no good public data for evaluating and training the classifier, we construct our own dataset from AIC members.
We first constructed a cohort of pregnant patients using the algorithm of Matcho et al. [12]. We then collaborated with the nurse case managers to compile a list of codes that indicate pregnancy episodes with gestational diabetes and gestational hypertension. We validated the code set with the care management nurses, who hand-labeled outcome codes for a subset of 20 patients, given data up to the end of the pregnancy episode. This allowed us to (1) validate that the existing codes are indicative of the corresponding outcome, and (2) find new codes indicative of an outcome. This enables us to find patients with outcomes of GHT and GDB during their pregnancy.
We select the subset of patients with outcomes: live birth (no complication), gestational hypertension, and gestational diabetes. We then select a set of 12,243 patients where 73.6% are live births, 16.9% have gestational hypertension, and 9.4% have gestational diabetes. We divide the patients into 20% for testing and 80% for training and validation. For each patient, we construct 10 data points where each data point is a slice of the patient’s data up to a cutoff date. We choose the cutoff dates to be uniformly distributed from three months before pregnancy starts until the end of pregnancy, thus covering all three trimesters. Each data point has a label in $\{0, 1, 2\}$ indicating respectively no complication, GHT, and GDB. Each data point has temporal and non-temporal features as in the dataset for pregnancy identification. We fit several standard classification algorithms, namely Lasso (L1-regularized), ELASTIC-NET (L1 and L2-regularized), and XGBOOST (gradient-boosted tree), using the training set to fit the models and the validation set to pick hyperparameters.
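The construction of the ten per-patient cutoff dates can be sketched as below. The sketch reads "uniformly distributed" as evenly spaced, which is an assumption (random uniform draws would be another reading), and it approximates three months as 90 days; the function name is hypothetical.

```python
from datetime import date, timedelta

def cutoff_dates(pregnancy_start, pregnancy_end, n_points=10, lead_days=90):
    """Evenly spaced cutoffs from `lead_days` before the start to the end date."""
    first = pregnancy_start - timedelta(days=lead_days)
    span = (pregnancy_end - first).days
    step = span / (n_points - 1)
    return [first + timedelta(days=round(i * step)) for i in range(n_points)]

end = date(2020, 10, 1)
start = end - timedelta(weeks=40)
cutoffs = cutoff_dates(start, end)
```

Each cutoff then defines one data point: the patient's features are computed only from records dated on or before that cutoff, while the label is the eventual outcome of the pregnancy.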
3.4 Extracting Evidence for Predictions
It is essential that we provide information as to why a certain patient had the given predictions. We provide a list of claim codes that had the most effect on the model making its prediction and the polarity of the effect. For pregnancy identification with HAPI, for each patient, we surface all anchor codes if they are available and we then surface the highest weighted codes (by absolute value) according to the Lasso model.
For predicting the risk of pregnancy complications, the classifier’s top codes by model weight include many variants of diabetes and hypertension codes since the prior history of these conditions is highly predictive of GDB and GHT. However, there is a nontrivial number of patients who have no prior history of these conditions, and they may be affected by a different set of risk factors. To better capture these factors, we partition our dataset conditional on prior history of diabetes and hypertension, and train a separate Lasso model on each of the four subsets: no prior history, history of both conditions, and history of either condition alone. We call these models GROUP-Lasso models. For a given patient, the prediction follows the global Lasso model but to extract evidence for the prediction, we extract the highest weighted features from the GROUP-Lasso model that the patient belongs to depending on their prior history of DB or HT.
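The evidence-extraction step can be illustrated with a small sketch: the patient is routed to one of the four history groups, and the features with the largest absolute weights in that group's model are surfaced together with their sign. The group names, feature names, and weights are hypothetical, and restricting the ranking to features present in the patient's record is an assumption of this sketch.

```python
def history_group(has_diabetes_history, has_hypertension_history):
    """Map prior history to one of the four partitions."""
    if has_diabetes_history and has_hypertension_history:
        return "both"
    if has_diabetes_history:
        return "diabetes_only"
    if has_hypertension_history:
        return "hypertension_only"
    return "none"

def top_evidence(weights, patient_features, k=3):
    """Return up to k (feature, weight) pairs present for the patient,
    ranked by absolute weight; the sign of the weight gives the polarity."""
    present = [(f, w) for f, w in weights.items()
               if f in patient_features and w != 0.0]
    return sorted(present, key=lambda fw: abs(fw[1]), reverse=True)[:k]

# Hypothetical per-group weights; real models would have thousands of features.
group_weights = {
    "none": {"obesity": 0.8, "pcos": 0.6, "age_over_35": 0.3, "routine_visit": -0.4},
    "diabetes_only": {"insulin_rx": 1.1, "obesity": 0.2},
    "hypertension_only": {"beta_blocker_rx": 0.9, "obesity": 0.3},
    "both": {"insulin_rx": 0.7, "beta_blocker_rx": 0.7},
}

patient_features = {"pcos", "routine_visit", "age_over_35"}
group = history_group(False, False)
evidence = top_evidence(group_weights[group], patient_features, k=2)
```

In this arrangement the risk score itself would still come from the global model; the group-specific weights are used only to choose which codes to show the nurse.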
3.5 User Study Design
Our main evaluation of HAPI and our pregnancy complications classifier is a retrospective evaluation of performance. However, this evaluation does not exactly mirror how the algorithms will be deployed. The algorithms never make the final decision on who is deemed to be pregnant or at risk; rather, the care managers do so after reviewing the algorithms’ predictions. We design two user studies that simulate how HAPI and the complications classifier will be deployed, in which a care manager assesses patients in a simulation environment. All studies in this work were ruled exempt by our IRB.
With input from the nurse care managers, we built a dashboard mimicking the actual dashboard used by the nurses in the HRP program to surface medical history available in insurance claims and other data sources (e.g. visits, diagnosis codes, demographics). For testing HAPI, we perform a study with a single nurse under two conditions A) with the predictions and evidence of HAPI and B) without any algorithmic predictions (control). The nurse makes predictions in each condition for 12 patients. Each trial took up to an hour.
For testing the pregnancy complications classifier, we ran three trials where the nurses made decisions on patients: A) one without predictions or evidence, B) one with predictions, and C) one with both predictions and evidence, with each of two nurses (referred to as Nurse 1 and Nurse 2) from the pregnancy care management program (six trials total). In each trial, the nurse makes predictions on 18 patients retrospectively. A sketch of the interface is shown in Figure 10. For each patient, we asked the nurse if they would call the patient and which complication the patient would develop.
3.6 Statistical analysis
To obtain confidence intervals for AUC, we use the distribution-independent method introduced in [28], which is based on the error rate and the number of positive and negative samples. To obtain confidence intervals for accuracy and FNR/FPR metrics, we use the Wilson score interval. We use McNemar’s test to compare ordinal data proportions and paired t-tests to compare numerical data. All analysis was conducted in Python 3.8 using the statsmodels and scipy packages.
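For reference, the Wilson score interval for a proportion can be computed directly from its closed form. The sketch below is a plain-Python illustration rather than the analysis code used in the study; in practice, `proportion_confint` from statsmodels with `method="wilson"` provides the same interval. The example counts are illustrative.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Illustrative counts: 13 correct decisions out of 18.
low, high = wilson_interval(13, 18)
```

With a sample of this size the interval spans roughly 0.49 to 0.88, which illustrates why differences observed in small user studies are hard to establish statistically.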
4 Results
4.1 Identifying Pregnancies From Claims Data
We evaluate HAPI on a test set of 9183 patients randomly selected from the dataset. We compare the performance of HAPI against the baseline of only using anchor pregnancy codes for the detection of pregnancy start [14]. We measure for HAPI and the anchor code list the difference between the predicted start date and the actual pregnancy start date. The actual pregnancy start date is obtained by subtracting 40 weeks from the exact date of birth.
We show the histogram of the difference between the predicted start date and the actual date for patients with complications in Figure 3 for both the anchor code list and our proposed algorithm. Compared to using the code list alone, HAPI predicts an earlier start date for 3.54% (95% CI 3.05-4.00, z=14.5, p<0.001) of patients with pregnancy complications and for 4.29% (95% CI 3.42-5.16, z=9.6, p<0.001) of patients without complications. For the patients with complications whose pregnancies are identified earlier by HAPI, the average difference between the predicted and the actual start date is 54.3 days compared to 75.6 days for the code list. For patients without complications, the average difference is 66.9 days compared to 102.5 days, respectively. However, across the entire test set, HAPI is on average only 1 day earlier than the code list for patients with and without complications, a difference that is not statistically significant. The model predicts that 5.58% (95% CI 4.05-6.40) of non-pregnant patients are pregnant (false positive rate). HAPI can be adjusted using the Lasso model threshold to reduce the false positive rate at the expense of detecting pregnancies later in time.
4.2 Predicting Pregnancy Complications
After pregnant patients are identified, we have to distinguish between those with a high and low likelihood of developing complications. In Table 2, we compare the performance of different machine learning classifiers at predicting whether a patient will develop gestational diabetes, gestational hypertension, or neither. We find that the best-performing model in terms of accuracy is a Lasso regularized logistic regression model, which achieves an average accuracy of 76.8% (95% CI 76.2-77.3) at predicting complications across each test patient pregnancy and an AUC of 0.761 (95% CI 0.754-0.767). The Lasso model is able to achieve an accuracy of 73.1% (95% CI 72.9-74.2) and an AUC of 0.722 (95% CI 0.710-0.734) when predicting three months before the start of the patient’s pregnancy. This indicates that there is a signal at the start of the pregnancy to triage patients by risk of complications. We assess model performance using data of the patients at different stages of pregnancy and find that accuracy and AUC generally increase as we progress to later pregnancy terms. This indicates that the model performs better as we see more data on the patient, but the confidence intervals overlap in some time periods.
To assess model performance at different stages of pregnancy, we evaluate the model when predicting on each patient’s data in each trimester and before gestation. To do this, we trim each member’s data to the desired date of prediction and then predict using the Lasso model; results are shown in Figure 4. While confidence intervals do not overlap consistently across time periods, the metrics generally increase as we progress to later pregnancy terms, indicating that the model performs better as we see more data on the member.
Additionally, we evaluate how early the model is catching pregnancy complications. While the Lasso model has a high false negative rate of 57.4% (95% CI 53.5-61.2), of the patients with true positive predictions (37.6%), a majority are caught before gestation (59.6%, 95% CI 53.1-65.5). This matters because early intervention and treatment are important in reducing gestational diabetes and hypertension risk [4, 6, 7].
Model | Accuracy (Mean) | Accuracy (95% CI) | AUC (Mean) | AUC (95% CI) |
---|---|---|---|---|
Lasso (L1) | 0.768 | 0.762-0.773 | 0.761 | 0.754-0.767 |
ELASTIC-NET (L1+L2) | 0.713 | 0.707-0.719 | 0.736 | 0.729-0.742 |
XGBOOST | 0.687 | 0.681-0.692 | 0.770 | 0.764-0.775 |
4.3 Bias/Fairness Audit For Pregnancy Complication Classifier
Prior work has shown that care management risk algorithms may contain racial bias due to nuances in how outcomes are defined [29]. Moreover, there exist systemic health disparities in maternal and infant mortality rates, e.g. Black people have mortality rates over three times higher than White people during pregnancy (40.8 v. 12.7 per 100,000 live births) [2]. To this end, we audit our algorithm for potential racial bias. We report evaluation metrics in Table 3 for the three most common race groups (White - 43.8%, Black - 5.7%, Other - 3.6%). The Other race category includes races outside of the following: American Indian or Alaska Native, Black or African American, White, Asian, Hispanic or Latino, Native Hawaiian or Other Pacific Islander. We note that accuracy for the White group is 77.4% (95% CI 76.7-78.2) compared to a lower accuracy for the Black group at 68.1% (95% CI 65.6-70.5). However, the AUC for the White group is 0.740 (95% CI 0.730-0.750), which is lower than that of the Black group at 0.787 (95% CI 0.765-0.808). This may be due to differences in class distribution, since the Black subgroup has a much higher rate of complication (44.0%) compared to the White (24.6%) and Other (25.9%) subgroups. True positive rates of catching complications are 36.6%, 27.1%, and 30.0% for the Black, White, and Other subgroups, respectively. Race data for this analysis comes from electronic medical records with low coverage for race attribution (only a subset of members have member-level race attributed), so true error rates may differ from those reported here.
Race | Accuracy (95% CI) | AUC (95% CI) |
---|---|---|
White | 0.774 (0.767, 0.782) | 0.740 (0.730, 0.750) |
Black | 0.681 (0.656, 0.705) | 0.787 (0.765, 0.808) |
Other | 0.792 (0.765, 0.819) | 0.826 (0.798, 0.854) |
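A subgroup audit of the kind summarized in Table 3 can be sketched as follows. For simplicity the sketch treats the task as binary (complication versus no complication), which is an assumption, since the classifier in this work distinguishes three outcomes; the group labels, scores, and decision threshold are toy values and do not reproduce the reported results.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def audit_by_group(records, threshold=0.5):
    """records: iterable of (group, label, score). Returns per-group metrics."""
    by_group = {}
    for group, label, score in records:
        by_group.setdefault(group, []).append((label, score))
    report = {}
    for group, rows in by_group.items():
        labels = [y for y, _ in rows]
        scores = [s for _, s in rows]
        group_auc = auc(labels, scores)  # raises if a class is missing
        preds = [int(s >= threshold) for s in scores]
        positives = sum(labels)
        report[group] = {
            "base_rate": positives / len(rows),
            "accuracy": sum(p == y for p, y in zip(preds, labels)) / len(rows),
            "tpr": sum(p == 1 and y == 1 for p, y in zip(preds, labels)) / positives,
            "auc": group_auc,
        }
    return report

# Toy records: (group, true label, predicted score).
records = [
    ("A", 1, 0.9), ("A", 0, 0.2), ("A", 0, 0.4), ("A", 1, 0.3),
    ("B", 1, 0.8), ("B", 1, 0.6), ("B", 0, 0.7), ("B", 0, 0.1),
]
report = audit_by_group(records)
```

Reporting the base rate next to accuracy and AUC is a deliberate choice in this sketch: accuracy at a fixed threshold depends on class balance, whereas AUC does not, so the two metrics can rank subgroups differently when complication rates differ.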
4.4 User Studies
In the user study for pregnancy identification, in both conditions, the nurse correctly identified 5 of the 8 pregnant patients. Notably, in each trial, we introduced a patient who was falsely detected by the model to be pregnant, but the care manager was successfully able to recognize this incorrect prediction.
In the user study for pregnancy complications, we note that the inclusion of the model prediction and prior history seemed to improve the nurses’ accuracy at predicting whether a patient will develop GDB or GHT. Nurse 1 had an accuracy of 56% without the model, 72% with the model prediction only, and 67% with model prediction and evidence. Similarly, Nurse 2 had an accuracy of 33% in condition A, 56% in B, and 67% in C. Note that due to the small sample size of the studies, none of the increases in accuracy is statistically significant. Nurse 1 explained that a prior history of diabetes/hypertension or complications in a previous pregnancy is usually sufficient to make a call, but additional information such as distinct risk factors for complications (e.g. polycystic ovary syndrome) can help them build a better profile of the patient and identify those at risk. Both nurses indicated that highlighted evidence helped with obtaining this information more quickly. The evidence helped them focus on important visits and codes, especially when the visit history was lengthy. Nurse 2 said that although not all evidence was useful or made sense, it is easy to filter out the irrelevant items, i.e. surfacing useful codes should be prioritized over surfacing only a few codes. In follow-up interviews and discussions, the nurses expressed a preference for the dashboard used in the user study compared to their previous systems. They noted that the new interface saves an enormous amount of time, as they no longer need to access the claims system and review several years of claims data to decipher whether the patient is even pregnant, let alone whether they have any potential risk factors.
5 Discussion
In this study, we developed a machine learning system for the early detection of pregnancy and the identification of high-risk members. This system is part of a real-world deployment at AIC. We introduced a novel algorithm that identifies whether a member is pregnant from insurance claims data by combining indicators for pregnancy start and end with machine learning predictors. We found that it identifies an earlier pregnancy start date than concept codes for 3.54% of members and has only a 5.58% false positive rate. The model identifies members who may have started pregnancy visits later in their term because, for example, they tested for pregnancy using at-home tests. This could be a reason to offer cost-free pregnancy tests at local clinics so members are incentivized to get tested formally, and in turn, the insurance company obtains data to identify pregnant members earlier. A large proportion of these members also tend to be high risk, which is exactly who we want to identify early for early intervention and treatment. Leveraging this information, we then identified members at the greatest risk for pregnancy complications so that care managers can provide timely and effective support. Using predictors of gestational diabetes and gestational hypertension, our model achieved an AUROC of 0.76.
We followed a human-centered design methodology and showed that it can improve the care management program for high-risk pregnancies at AIC. Because care managers are often faced with limited and fragmented interactions with patients, we conducted extensive discussions and interviews with care managers of the HRP program to identify their current needs and greatest challenges. These insights, combined with insurance claims, can support early detection of pregnancy, accurate identification of impactable high-risk members, and provision of explainable indicators to supplement predictions. We show that when critical stakeholders like the care managers are actively engaged, machine learning systems can guide care management to prevent pregnancy complications.
We then set up a mock enrollment dashboard and evaluated these methods across two user studies, arriving at two key findings. First, the pregnancy identification algorithm helps nurses identify pregnancies earlier while correctly filtering out false-positive members. Second, showing the pregnancy complication model’s prediction and prior history of chronic conditions improves nurses’ performance metrics when deciding whom to call. While model explanations adversely affected the nurse’s performance in terms of time per member and how early they identify pregnant members in the pregnancy identification study, we observed that explanations improved notes about the member in the pregnancy risk factor study without much difference in the nurses’ classification performance. The latter study better integrated explanations into the clinical workflow, and nurses appeared to disagree with the explanations less, which emphasizes the importance of the explanation method and of how explanations are presented. Our study demonstrated that close collaboration with care managers can be used to leverage insurance claims to improve the care of pregnant patients. We hope that our results can serve as a call to action for similar predictive models used to allocate care. In a recent report in the Journal of Biomedical Informatics, researchers advocated for more overlap in human-computer interaction and clinical decision-making tasks to improve precision medicine [32]. Our work expands on those topics to empower the domain experts and primary users of our system. We found that comprehensive needs-finding interviews with the care managers greatly enhanced our targeted ML system. Not only were we able to focus on the most salient problems facing care managers, but the resulting ML system also allocates resources for pregnant patients more effectively.
Our study opens several areas for future work. As with any machine learning system, continual validation of our models across time is key to ensuring robust and generalizable performance. Predictors of early pregnancy detection and predictors of high-risk pregnancy may change over time due to improvements in health technology and patterns of healthcare utilization. Computational work in transfer learning and robustness can help adapt our models over time with minimal adjustment. Additionally, topics of pregnancy may raise questions about patient privacy. Our model keeps patient data completely private except for the minimal set of relevant care managers; however, advances in patient privacy protection may also be relevant.
6 Limitations and Conclusion
There are limitations to our study that need to be addressed. In Appendix Figure 8, we stratify the population by when our ML system provided a relevant alert. Unfortunately, for 60% of the patients who have complications, an alert is never sounded. This gap in our model performance is likely due to the sparsity of insurance claims and the delay of visits by the patients, both challenges often faced by models working with healthcare data [33]. We are also concerned with the disparate impact of the ML system on different patient subpopulations, particularly historically vulnerable groups. Health insurers are actively creating best practices for auditing and improving algorithmic bias, with the first step being the measurement of existing bias [34]. In Table 3, we show the performance of the detection algorithm on White, Black, and Other race patients. The lower accuracy for Black patients compared to White or Other race patients can potentially be explained by different base rates. When different subgroups have different base rates, competing definitions of algorithmic fairness may conflict [30]. It is important to better understand sources of health disparities, potentially through gathering additional information such as social determinants of health [31].
In conclusion, we have developed novel algorithms for the identification of pregnancy and triage of pregnant members by risk of complication. These algorithms’ development and subsequent evaluations followed a human-centered design methodology with extensive collaboration with the high-risk pregnancy care managers at AIC. Thus, we demonstrated that the active engagement of key stakeholders like care managers can substantially improve the clinical workflow and quality of care given by care managers for pregnant patients.
Author Contributions
Conception and design: H.M., Y.U., D.S., S.G., A.S.
Model Development: H.M., Y.U.
User Study Development: Y.U.
Data Collection: H.M., Y.U., M.E.
Data analysis: H.M., Y.U.
Data Interpretation: H.M., Y.U., I.C., D.S., S.G., A.S.
Supervision: D.S.
Manuscript writing: H.M., Y.U., I.C., D.S., S.G., A.S., M.E.
Acknowledgements
H.M., Y.U., I.C. and D.S. were supported by a grant from Independence Blue Cross.
Competing Interests
The research was financially supported by a grant from Independence Blue Cross, which also contributed the data for the study. The sponsor collected the data, reviewed the manuscript, and approved the decision to submit the manuscript for publication. H.M., Y.U., I.C. and D.S. were supported by the grant. M.E., S.G., A.S. are employees of Independence Blue Cross.
Data Availability
The datasets generated and analyzed during the current study are not publicly available because they contain insurance claims data and demographic data (including age and ethnicity) of members insured by Independence Blue Cross; the data are de-identified but not anonymous.
References
- [1] Trends in Pregnancy and Childbirth Complications in the U.S. https://www.bcbs.com/the-health-of-america/reports/trends-in-pregnancy-and-childbirth-complications-in-the-uspre-ex (2020). Accessed: 2022-05-10.
- [2] Petersen, E. E. Racial/Ethnic Disparities in Pregnancy-Related Deaths — United States, 2007–2016. MMWR. Morbidity and Mortality Weekly Report 68, DOI: 10.15585/mmwr.mm6835a3 (2019).
- [3] Lassi, Z. S., Mansoor, T., Salam, R. A., Das, J. K. & Bhutta, Z. A. Essential pre-pregnancy and pregnancy interventions for improved maternal, newborn and child health. Reproductive Health 11, S2, DOI: 10.1186/1742-4755-11-S1-S2 (2014).
- [4] Teede, H. J., Harrison, C. L., Teh, W. T., Paul, E. & Allan, C. A. Gestational diabetes: Development of an early risk prediction tool to facilitate opportunities for prevention. Australian and New Zealand Journal of Obstetrics and Gynaecology 51, 499–504, DOI: 10.1111/j.1479-828X.2011.01356.x (2011).
- [5] Gao, Y., Ren, S., Zhou, H. & Xuan, R. Impact of Physical Activity During Pregnancy on Gestational Hypertension. Physical Activity and Health 4, 32–39, DOI: 10.5334/paah.49 (2020).
- [6] Raets, L., Beunen, K. & Benhalima, K. Screening for Gestational Diabetes Mellitus in Early Pregnancy: What Is the Evidence? Journal of Clinical Medicine 10, 1257, DOI: 10.3390/jcm10061257 (2021).
- [7] Rowan, J. A., Budden, A., Ivanova, V., Hughes, R. C. & Sadler, L. C. Women with an HbA1c of 41–49 mmol/mol (5.9–6.6%): a higher risk subgroup that may benefit from early pregnancy intervention. Diabetic Medicine 33, 25–31, DOI: 10.1111/dme.12812 (2016).
- [8] Hong, C. S. H. Caring for High-Need, High-Cost Patients: What Makes for a Successful Care Management Program? Tech. Rep., Commonwealth Fund, New York, NY United States (2014). DOI: 10.15868/socialsector.25007.
- [9] Alexander, J. W. & Mackey, M. C. Cost Effectiveness of a High-Risk Pregnancy Program. Care Management Journals 1, 170–174, DOI: 10.1891/1521-0987.1.3.170 (1999).
- [10] Mate, A. et al. Field study in deploying restless multi-armed bandits: Assisting non-profits in improving maternal and child health. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, 12017–12025 (2022).
- [11] Radin, J. M. et al. The healthy pregnancy research program: transforming pregnancy research through a researchkit app. NPJ digital medicine 1, 45 (2018).
- [12] Matcho, A. et al. Inferring pregnancy episodes and outcomes within a network of observational databases. PLoS ONE 13, e0192033, DOI: 10.1371/journal.pone.0192033 (2018).
- [13] Blotière, P.-O. et al. Development of an algorithm to identify pregnancy episodes and related outcomes in health care claims databases: An application to antiepileptic drug use in 4.9 million pregnant women in France. Pharmacoepidemiology and Drug Safety 27, 763–770, DOI: 10.1002/pds.4556 (2018).
- [14] MacDonald, S. C. et al. Identifying pregnancies in insurance claims data: Methods and application to retinoid teratogenic surveillance. Pharmacoepidemiology and Drug Safety 28, 1211–1221, DOI: 10.1002/pds.4794 (2019).
- [15] Schink, T., Wentzell, N., Dathe, K., Onken, M. & Haug, U. Estimating the Beginning of Pregnancy in German Claims Data: Development of an Algorithm With a Focus on the Expected Delivery Date. Frontiers in Public Health 8 (2020).
- [16] Bertini, A., Salas, R., Chabert, S., Sobrevia, L. & Pardo, F. Using machine learning to predict complications in pregnancy: A systematic review. Frontiers in bioengineering and biotechnology 9, 1385 (2022).
- [17] Espinosa, C. et al. Data-driven modeling of pregnancy-related complications. Trends in molecular medicine 27, 762–776 (2021).
- [18] Islam, M. N., Mustafina, S. N., Mahmud, T. & Khan, N. I. Machine learning to predict pregnancy outcomes: a systematic review, synthesizing framework and future research agenda. BMC Pregnancy and Childbirth 22, 1–19 (2022).
- [19] Machado, J. M. et al. Predicting the risk associated to pregnancy using data mining. SCITEPRESS (2015).
- [20] Li, S. et al. Improving preeclampsia risk prediction by modeling pregnancy trajectories from routinely collected electronic medical record data. NPJ Digital Medicine 5, 68 (2022).
- [21] Park, S. Y. et al. Identifying challenges and opportunities in human–ai collaboration in healthcare (2019).
- [22] Asan, O., Bayrak, A. E., Choudhury, A. et al. Artificial intelligence and human trust in healthcare: focus on clinicians. Journal of medical Internet research 22, e15154 (2020).
- [23] Reverberi, C. et al. Experimental evidence of effective human–ai collaboration in medical decision-making. Scientific reports 12, 14952 (2022).
- [24] Gaube, S. et al. Do as ai say: susceptibility in deployment of clinical decision-aids. NPJ digital medicine 4, 31 (2021).
- [25] Kodialam, R. et al. Deep contextual clinical prediction with reverse distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, 249–258 (2021).
- [26] Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288 (1996).
- [27] Halpern, Y., Horng, S., Choi, Y. & Sontag, D. Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association 23, 731–740 (2016).
- [28] Cortes, C. & Mohri, M. Confidence intervals for the area under the roc curve. Advances in neural information processing systems 17 (2004).
- [29] Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453 (2019).
- [30] Chouldechova, A. & Roth, A. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810 (2018).
- [31] McCradden, M. D., Joshi, S., Mazwi, M. & Anderson, J. A. Ethical limitations of algorithmic fairness solutions in health care machine learning. The Lancet Digital Health 2, e221–e223 (2020).
- [32] Rundo, L., Pirrone, R., Vitabile, S., Sala, E. & Gambino, O. Recent advances of hci in decision-making tasks for optimized clinical workflows and precision medicine. Journal of biomedical informatics 108, 103479 (2020).
- [33] Chen, I. Y. et al. Ethical machine learning in healthcare. Annual review of biomedical data science 4, 123–144 (2021).
- [34] Gervasi, S. S. et al. The potential for bias in machine learning and opportunities for health insurers to address it: Article examines the potential for bias in machine learning and opportunities for health insurers to address it. Health Affairs 41, 212–218 (2022).
Appendix A Supplemental Information
A.1 Ethics
For our human subject experiments described in our results with the care managers, we obtained an exempt evaluation from our IRB. Our IRB judged that our research activities meet the criteria for exemption as defined by Federal regulation 45 CFR 46 under the following:
- Exempt Category 3 - Benign Behavioral Intervention. Research involving benign behavioral interventions where the study activities are limited to adults only and disclosure of the subjects’ responses outside the research could not reasonably place the subjects at risk for criminal or civil liability or be damaging to the subjects’ financial standing, employability, educational advancement, or reputation. Research does not involve deception or participants prospectively agree to the deception. 45 CFR 46.104(d)(3)
- Exempt Category 2 - Educational Testing, Surveys, Interviews or Observation. Research involving surveys, interviews, educational tests or observation of public behavior with adults or children and disclosure of the subjects’ responses outside the research could not reasonably place the subjects at risk for criminal or civil liability or be damaging to the subjects’ financial standing, employability, educational advancement, or reputation. Research activities with children must be limited to educational tests or observation of public behavior and cannot include direct intervention by the investigator. 45 CFR 46.104(d)(2)
A.2 Dataset Creation Algorithm for Pregnancy Identification
We build on [12], which presents an algorithm for inferring pregnancy episodes across a set of pregnancy outcomes in the OMOP Common Data Model. Our modified algorithm can handle a larger set of pregnancy outcomes, e.g. neonatal ICU admission, by doing a forward search to update the outcome once the pregnancy episode is identified. We describe our modified version in Algorithm 1. We illustrate the algorithm in Figure 5 and present a subset of target codes for reference in Table 4.
![Refer to caption](extracted/2305.17261v3/figures/pregnancy_cohort_selection.png)
Member B is excluded from the cohort since no pregnancy start code was detected within the lookback window. Member C is excluded since there was no associated pregnancy outcome code; amenorrhea alone cannot indicate pregnancy has started since it can be caused by non-pregnancy-related factors (e.g. stress, menopause).
Outcome | Target Codes | Pregnancy ID? | Risk Factors? |
---|---|---|---|
Neonatal Intensive Care Unit (NICU) | Newborn light for gestational age; Low birth weight infant; Birth injury to central nervous system; Respiratory distress syndrome in the newborn; Pulmonary hypertension of newborn | X | X |
Hypertension/Pre-eclampsia (HPPE) | Pre-existing hypertension in obstetric context; Transient hypertension of pregnancy; Renal hypertension complicating pregnancy; Severe pre-eclampsia; Gestational proteinuria | X | X |
Pre-term birth | Preterm premature rupture of membranes; Fetal or neonatal effect of maternal premature rupture of membrane; Baby premature, 24-26 weeks; Extreme immaturity, 750-999 grams; Metabolic bone disease of prematurity | X | X |
Gestational Hypertension | Unspecified maternal hypertension; Gestational [pregnancy-induced] hypertension; Hypertension, Pregnancy-Induced; gestational proteinuria; Mild to moderate pre-eclampsia | X | |
Gestational Diabetes | Gestational diabetes mellitus in childbirth; Diabetes mellitus arising in pregnancy; Gestational diabetes mellitus in the puerperium; Gestational diabetes mellitus complicating pregnancy; Maternal gestational diabetes mellitus | X | |
We build a cohort of patients who were never pregnant throughout their claims history. We sample these patients according to the age distribution of pregnant members (mean: 31.8 years, standard deviation: 4.8 years) and define “never pregnant” to be any member who does not have any of the pregnancy start or outcome concept codes present in their claims history.
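The sampling step above can be illustrated with a minimal sketch. It assumes the age distribution of pregnant members is approximated by a normal distribution with the reported mean and standard deviation, and it picks the closest-age unused candidate for each drawn age. The function name `sample_age_matched` and the `(member_id, age)` layout are hypothetical, and the actual sampling procedure may differ.

```python
import random

def sample_age_matched(candidates, n, mean=31.8, sd=4.8, seed=0):
    """Draw n never-pregnant members whose ages roughly follow N(mean, sd).

    candidates: list of (member_id, age) pairs (hypothetical layout),
                already filtered to members with no pregnancy start or
                outcome codes. Requires n <= len(candidates).
    """
    rng = random.Random(seed)
    pool = list(candidates)
    chosen = []
    for _ in range(n):
        target_age = rng.gauss(mean, sd)
        # Take the remaining candidate whose age is closest to the drawn target.
        best = min(pool, key=lambda member: abs(member[1] - target_age))
        pool.remove(best)  # sample without replacement
        chosen.append(best)
    return chosen

# Toy pool: 400 members with ages 18..57 (10 members per age).
candidates = [(f"member_{i}", 18 + i % 40) for i in range(400)]
controls = sample_age_matched(candidates, n=50)
mean_age = sum(age for _, age in controls) / len(controls)
```

Sampling without replacement keeps each control member unique; a fixed seed makes the cohort reproducible.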
A.3 Dataset Creation Algorithm for Pregnancy Complication Prediction
In Algorithm 1, the first pass phase, which searches for the most recent pregnancy outcome, references the original pregnancy outcomes and corresponding target codes defined in [12]. In the second pass phase, which performs a second search to update the previous outcome, we reference target codes for additional outcomes. We present a subset of target codes for these outcomes and an indicator for when they are used in Table 4.
We queried for pregnancy episodes with a gestational diabetes ICD 10 code (O24.11-O24.93) using ATLAS [ohdsi-atlas]. We then filtered for unique diagnosis codes within those episodes and selected the most frequently occurring diagnosis codes as the initial set of target codes for gestational diabetes outcomes. The same procedure was repeated for gestational hypertension/pre-eclampsia (ICD 10 code O10.011-O16.9). We validated the code set with the care management nurses, who hand-labeled outcome codes for a subset of 20 members, given data up to the end of the pregnancy episode. This allowed us to (1) validate that the existing codes are indicative of the corresponding outcome, and (2) find new codes indicative of an outcome. For example, Methyldopa 250 MG Oral Tablet, an anti-hypertensive drug, was added as a code for gestational HT/PE.
Similar to pregnancy identification, we generate non-temporal and temporal features for each sampled point. For temporal data, we generate windowed features for 30-day, 180-day, 365-day, 730-day, and 10k-day windows using omop-learn for the following categories: medical conditions, prescriptions, procedures, specialty visits, and labs. We also include 12 non-temporal features, which include age, race, and gender. This gives us a feature set of 112,322 features.
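As a rough illustration of windowed features, the sketch below counts how often each concept appears within each lookback window before a prediction date. It does not use the omop-learn API; the function name `windowed_counts`, the event tuple layout, and the feature-key format are hypothetical choices made for this example.

```python
from collections import Counter
from datetime import date

# Lookback windows in days, as listed in the text (10k days ~ full history).
WINDOWS = (30, 180, 365, 730, 10_000)

def windowed_counts(events, prediction_date, windows=WINDOWS):
    """Count concept occurrences inside each lookback window.

    events: iterable of (event_date, category, concept_id) tuples
            (hypothetical layout, not the omop-learn schema).
    Returns a Counter keyed by "<category>:<concept_id>:<window>d".
    """
    features = Counter()
    for event_date, category, concept_id in events:
        age_days = (prediction_date - event_date).days
        if age_days < 0:
            continue  # never use events dated after the prediction date
        for window in windows:
            if age_days <= window:
                features[f"{category}:{concept_id}:{window}d"] += 1
    return features

events = [
    (date(2021, 5, 20), "labs", 3023314),        # 12 days before prediction
    (date(2021, 1, 10), "labs", 3023314),        # 142 days before prediction
    (date(2021, 6, 15), "specialty", 38004461),  # after prediction: ignored
]
feats = windowed_counts(events, prediction_date=date(2021, 6, 1))
```

Because the windows are nested, an event that falls in the 30-day window also contributes to every longer window.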
A.4 Hyperparameter Selection for Machine Learning Models
For the pregnancy identification LASSO model, we report the hyperparameter search space in Table 5. We select the model with the highest validation accuracy. The decision threshold is chosen to maximize the geometric mean of sensitivity and specificity on the validation set.
Hyperparameters | Search Range |
---|---|
Regularization strength (C) | 1e-3, 7.5e-4, 5e-4, 2.5e-4*, 1e-4 |
Tolerance | 1*, 1e-1, 1e-2, 1e-3, 1e-4 |
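One plausible reading of the threshold rule for the pregnancy identification model is to pick the cut-off that maximizes the geometric mean of sensitivity and specificity on validation data. The sketch below shows that reading; the function name `gmean_threshold` and the toy labels and scores are illustrative, not the study's data.

```python
import math

def gmean_threshold(y_true, scores):
    """Return (threshold, g-mean) maximizing sqrt(sensitivity * specificity).

    Assumes both classes are present in y_true (labels are 0 or 1).
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_gmean = None, -1.0
    for threshold in sorted(set(scores)):
        # Predict positive when the score is at or above the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
        gmean = math.sqrt((tp / positives) * (tn / negatives))
        if gmean > best_gmean:  # strict '>' keeps the lowest tied threshold
            best_threshold, best_gmean = threshold, gmean
    return best_threshold, best_gmean

y_val = [0, 0, 0, 1, 0, 1, 1, 1]
s_val = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
threshold, gmean = gmean_threshold(y_val, s_val)
```

The geometric mean penalizes thresholds that trade away one of sensitivity or specificity entirely, which is useful when classes are imbalanced.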
For the pregnancy complications models, we report the hyperparameter search space in Table 6. Note that we also correct for class imbalance by weighting each class $c$ by a factor based on $p_c$, where $p_c$ is the proportion of outcomes under class $c$ in the training set. We select the model with the highest product of AUROC and accuracy on the validation set.
Model | Hyperparameters | Search Range |
---|---|---|
LASSO | Regularization strength (C) | 1, 1e-1, 1e-2, 1e-3*, 1e-4 |
LASSO | Tolerance | 1e-1, 1e-2, 1e-3*, 1e-4 |
ELASTIC-NET | L1-ratio | 0.25*, 0.5, 0.75 |
ELASTIC-NET | Tolerance | 1e-1*, 5e-2 |
XGBOOST | Learning rate | 1e-1*, 1e-2, 1e-3, 1e-4 |
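The class weighting and model selection rule for the complication models can be sketched as follows. The exact weighting formula is not recoverable from the text, so inverse class proportion is used here purely as an assumption. The helper names, the 0.5 accuracy threshold, and the candidate labels are likewise hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by 1 / (its proportion in the training labels).

    ASSUMPTION: inverse class proportion is a common choice used here for
    illustration; the study's exact weighting formula may differ.
    """
    counts = Counter(labels)
    return {cls: len(labels) / n for cls, n in counts.items()}

def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(y_true, scores, threshold=0.5):
    """Fraction of correct predictions at a fixed (assumed) threshold."""
    return sum((s >= threshold) == bool(y) for y, s in zip(y_true, scores)) / len(y_true)

def select_model(candidates, y_val):
    """Pick the candidate with the highest AUROC * accuracy on validation data."""
    return max(
        candidates,
        key=lambda name: auroc(y_val, candidates[name]) * accuracy(y_val, candidates[name]),
    )

weights = inverse_frequency_weights([0, 0, 0, 1])  # {0: 4/3, 1: 4.0}
y_val = [0, 0, 0, 1, 1]
candidates = {  # validation scores from two hypothetical hyperparameter settings
    "C=1e-3": [0.2, 0.3, 0.6, 0.7, 0.8],
    "C=1e-1": [0.4, 0.6, 0.7, 0.3, 0.9],
}
best = select_model(candidates, y_val)
```

Multiplying AUROC by accuracy rewards settings that both rank patients well and classify them correctly at the operating threshold.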
A.5 Algorithm For Pregnancy Identification
We formally describe our pregnancy identification algorithm continuing on from the Methods section.
We combine the anchors and the Lasso model into a hybrid model that does the following: if only a pregnancy start code exists, then the prediction is $p_t = 1$; if there is a pregnancy end code, then $p_t = 0$; otherwise, $p_t$ follows the prediction of the Lasso model. After we obtain the predictions $p_t$ at time $t$, we pass them to an exponential moving average filter, which smooths the predictions over the last 5 time points with a decay factor $\alpha$ to give a result $s_t$. We then binarize $s_t$ with a learned threshold, chosen to maximize the geometric mean of the F1-score of the pregnancy predictions, to obtain a binary prediction $\hat{y}_t$. This process is performed at each time stamp of the member’s data. We predict the pregnancy start date as the first time $t$ at which $\hat{y}_t$ is 1 and the scores $s_t$ have increased twice consecutively; similarly, we predict the pregnancy end date as the time at which $\hat{y}_t$ is 0 and the scores have decreased twice consecutively (given that we have already predicted the start).
- EMA: smooth with exponential moving average filter
- InferEpisode: infer pregnancy start and end (see Alg. 3)
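The hybrid rule, smoothing step, and episode inference can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation. The decay value of 0.5, the threshold of 0.5, the normalized weighted-average form of the moving average, and the reading of "two consecutive increases" are all assumptions, and the function names are hypothetical.

```python
def hybrid_score(has_start_code, has_end_code, lasso_prob):
    """Anchor codes override the Lasso probability when present."""
    if has_end_code:
        return 0.0  # an end code means the pregnancy is over
    if has_start_code:
        return 1.0  # a start code alone means the member is pregnant
    return lasso_prob

def ema(values, decay=0.5, window=5):
    """Smooth each point over the last `window` values with weights decay**k.

    ASSUMPTION: decay=0.5 and the normalized form are illustrative choices.
    """
    smoothed = []
    for t in range(len(values)):
        recent = values[max(0, t - window + 1): t + 1][::-1]  # newest first
        weights = [decay ** k for k in range(len(recent))]
        smoothed.append(sum(w * v for w, v in zip(weights, recent)) / sum(weights))
    return smoothed

def infer_episode(smoothed, threshold=0.5):
    """Return (start_index, end_index); either may be None."""
    start = end = None
    for t in range(2, len(smoothed)):
        rising = smoothed[t - 2] < smoothed[t - 1] < smoothed[t]
        falling = smoothed[t - 2] > smoothed[t - 1] > smoothed[t]
        positive = smoothed[t] >= threshold
        if start is None and positive and rising:
            start = t
        elif start is not None and not positive and falling:
            end = t
            break
    return start, end

# One (has_start_code, has_end_code, lasso_prob) triple per time point.
observations = [
    (False, False, 0.1), (False, False, 0.2), (False, False, 0.6),
    (True, False, 0.7), (True, False, 0.8),
    (False, True, 0.9), (False, True, 0.3), (False, True, 0.1),
]
raw = [hybrid_score(*obs) for obs in observations]
smoothed = ema(raw)
start, end = infer_episode(smoothed)
```

Requiring two consecutive moves in the same direction guards against a single noisy score triggering a start or end.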
A.6 Additional Results for Pregnancy Identification Retrospective Evaluation
We include additional results for our retrospective evaluation of the pregnancy identification algorithm.
Feature name |
---|
2213418 - procedure - Immunization administration (includes percutaneous, intradermal, subcutaneous, or intramuscular injections); 1 vaccine (single or combination vaccine/toxoid) |
2212167 - labs - Urinalysis, by dip stick or tablet reagent for bilirubin, glucose, hemoglobin, ketones, leukocytes, nitrite, pH, protein, specific gravity, urobilinogen, any number of these constituents; non-automated, without microscopy |
2108115 - procedure - Collection of venous blood by venipuncture |
3050479 - labs - Immature granulocytes/100 leukocytes in Blood |
2212996 - labs - Culture, bacterial; quantitative colony count, urine |
3033575 - labs - Monocytes [#/volume] in Blood by Automated count |
3023314 - labs - Hematocrit [Volume Fraction] of Blood by Automated count |
3014576 - labs - Chloride [Moles/volume] in Serum or Plasma |
38004461 - specialty - Obstetrics/Gynecology |
3015746 - labs - Specimen source identified |
A.7 Additional Results for Pregnancy Complication Prediction Retrospective Evaluation
We include additional results for our retrospective evaluation of the pregnancy complications algorithm.
In Table 8, we compare our proposed predictor GROUP-Lasso, which conditions on the patient’s prior history of disease and predicts using a separate Lasso model for each sub-group, against the global Lasso model. Modeling outcomes for separate groups increases accuracy as the predictions become better calibrated but sacrifices ranking ability in terms of AUROC. The advantage of GROUP-Lasso is that the features surfaced as explanations by the sub-group models show information beyond prior history that may be useful for the care managers. Therefore, we use the global Lasso model to make predictions but use GROUP-Lasso to surface features as an explanation.
Sub-group | Metric | GROUP-Lasso | Lasso |
---|---|---|---|
History of DB | AUROC | 0.675 | 0.706 |
History of DB | Accuracy | 0.622 | 0.570 |
History of HT | AUROC | 0.6573 | 0.708 |
History of HT | Accuracy | 0.708 | 0.647 |
History of DB+HT | AUROC | 0.635 | 0.757 |
History of DB+HT | Accuracy | 0.624 | 0.568 |
No history of DB/HT | AUROC | 0.596 | 0.667 |
No history of DB/HT | Accuracy | 0.793 | 0.780 |
A.8 Additional Details for User Studies
We include additional details of our user studies.
(a) Pregnancy Identification Interface | (b) Pregnancy Complications Interface |
Category | Sub-category | # of members | Simulation start date range |
---|---|---|---|
Pregnant members detected by model | Detected early within reasonable time (at least 1 month after ) | 2 | , |
Pregnant members detected by model | Detected too early (before 1 month after ) | 2 | |
Pregnant members detected by code | – | 4 | |
Non-pregnant members | Detected not pregnant | 3 | |
Non-pregnant members | Detected pregnant | 1 | |
Outcome | Correct Prediction? | Prior History? | Number of Members |
---|---|---|---|
Gestational DB | Yes | No DB history | 3 |
Gestational HT | Yes | No HT history | 3 |
No complication | Yes | No DB or HT history | 3 |
Gestational DB | No | No DB history | 1 |
Gestational HT | No | No HT history | 1 |
No complication | No | No DB or HT history | 1 |
Gestational DB | Yes | DB history | 1 |
Gestational HT | Yes | HT history | 1 |
No complication | Yes | DB+HT history | 1 |
Gestational DB | No | DB history | 1 |
Gestational HT | No | HT history | 1 |
No complication | No | DB+HT history | 1 |