-
A method for comparing multiple imputation techniques: a case study on the U.S. National COVID Cohort Collaborative
Authors:
Elena Casiraghi,
Rachel Wong,
Margaret Hall,
Ben Coleman,
Marco Notaro,
Michael D. Evans,
Jena S. Tronieri,
Hannah Blau,
Bryan Laraway,
Tiffany J. Callahan,
Lauren E. Chan,
Carolyn T. Bramante,
John B. Buse,
Richard A. Moffitt,
Til Sturmer,
Steven G. Johnson,
Yu Raymond Shao,
Justin Reese,
Peter N. Robinson,
Alberto Paccanaro,
Giorgio Valentini,
Jared D. Huling,
Kenneth Wilkins,
:,
Tell Bennet
, et al. (12 additional authors not shown)
Abstract:
Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful to assess associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases and the simple removal of these cases may introduce severe bias. For these reasons, several multiple imputation algorithms have been propose…
▽ More
Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful to assess associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases and the simple removal of these cases may introduce severe bias. For these reasons, several multiple imputation algorithms have been proposed to attempt to recover the missing information. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithms works best in a given scenario. Furthermore, the selection of each algorithm parameters and data-related modelling choices are also both crucial and challenging. In this paper, we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. The experiments presented here show that our approach could effectively highlight the most valid and performant missing-data handling strategy for our case study. Moreover, our methodology allowed us to gain an understanding of the behavior of the different models and of how it changed as we modified their parameters. Our method is general and can be applied to different research fields and on datasets containing heterogeneous types.
△ Less
Submitted 25 September, 2022; v1 submitted 13 June, 2022;
originally announced June 2022.
-
An Open Natural Language Processing Development Framework for EHR-based Clinical Research: A case demonstration using the National COVID Cohort Collaborative (N3C)
Authors:
Sijia Liu,
Andrew Wen,
Liwei Wang,
Huan He,
Sunyang Fu,
Robert Miller,
Andrew Williams,
Daniel Harris,
Ramakanth Kavuluru,
Mei Liu,
Noor Abu-el-rub,
Dalton Schutte,
Rui Zhang,
Masoud Rouhizadeh,
John D. Osborne,
Yongqun He,
Umit Topaloglu,
Stephanie S Hong,
Joel H Saltz,
Thomas Schaffter,
Emily Pfaff,
Christopher G. Chute,
Tim Duong,
Melissa A. Haendel,
Rafael Fuentes
, et al. (7 additional authors not shown)
Abstract:
While we pay attention to the latest advances in clinical natural language processing (NLP), we can notice some resistance in the clinical and translational research community to adopt NLP models due to limited transparency, interpretability, and usability. In this study, we proposed an open natural language processing development framework. We evaluated it through the implementation of NLP algori…
▽ More
While we pay attention to the latest advances in clinical natural language processing (NLP), we can notice some resistance in the clinical and translational research community to adopt NLP models due to limited transparency, interpretability, and usability. In this study, we proposed an open natural language processing development framework. We evaluated it through the implementation of NLP algorithms for the National COVID Cohort Collaborative (N3C). Based on the interests in information extraction from COVID-19 related clinical notes, our work includes 1) an open data annotation process using COVID-19 signs and symptoms as the use case, 2) a community-driven ruleset composing platform, and 3) a synthetic text data generation workflow to generate texts for information extraction tasks without involving human subjects. The corpora were derived from texts from three different institutions (Mayo Clinic, University of Kentucky, University of Minnesota). The gold standard annotations were tested with a single institution's (Mayo) ruleset. This resulted in performances of 0.876, 0.706, and 0.694 in F-scores for Mayo, Minnesota, and Kentucky test datasets, respectively. The study as a consortium effort of the N3C NLP subgroup demonstrates the feasibility of creating a federated NLP algorithm development and benchmarking platform to enhance multi-institution clinical NLP study and adoption. Although we use COVID-19 as a use case in this effort, our framework is general enough to be applied to other domains of interest in clinical NLP.
△ Less
Submitted 21 March, 2022; v1 submitted 20 October, 2021;
originally announced October 2021.
-
Enabling Longitudinal Exploratory Analysis of Clinical COVID Data
Authors:
David Borland,
Irena Brain,
Karamarie Fecho,
Emily Pfaff,
Hao Xu,
James Champion,
Chris Bizon,
David Gotz
Abstract:
As the COVID-19 pandemic continues to impact the world, data is being gathered and analyzed to better understand the disease. Recognizing the potential for visual analytics technologies to support exploratory analysis and hypothesis generation from longitudinal clinical data, a team of collaborators worked to apply existing event sequence visual analytics technologies to a longitudinal clinical da…
▽ More
As the COVID-19 pandemic continues to impact the world, data is being gathered and analyzed to better understand the disease. Recognizing the potential for visual analytics technologies to support exploratory analysis and hypothesis generation from longitudinal clinical data, a team of collaborators worked to apply existing event sequence visual analytics technologies to a longitudinal clinical data from a cohort of 998 patients with high rates of COVID-19 infection. This paper describes the initial steps toward this goal, including: (1) the data transformation and processing work required to prepare the data for visual analysis, (2) initial findings and observations, and (3) qualitative feedback and lessons learned which highlight key features as well as limitations to address in future work.
△ Less
Submitted 25 August, 2021;
originally announced August 2021.