-
Diffsurv: Differentiable sorting for censored time-to-event data
Authors:
Andre Vauvelle,
Benjamin Wild,
Aylin Cakiroglu,
Roland Eils,
Spiros Denaxas
Abstract:
Survival analysis is a crucial semi-supervised task in machine learning with numerous real-world applications, particularly in healthcare. Currently, the most common approach to survival analysis is based on Cox's partial likelihood, which can be interpreted as a ranking model optimized on a lower bound of the concordance index. This relation between ranking models and Cox's partial likelihood considers only pairwise comparisons. Recent work has developed differentiable sorting methods which relax this pairwise independence assumption, enabling the ranking of sets of samples. However, current differentiable sorting methods cannot account for censoring, a key factor in many real-world datasets. To address this limitation, we propose a novel method called Diffsurv. We extend differentiable sorting methods to handle censored tasks by predicting matrices of possible permutations that take into account the label uncertainty introduced by censored samples. We contrast this approach with methods derived from partial likelihood and ranking losses. Our experiments show that Diffsurv outperforms established baselines in various simulated and real-world risk prediction scenarios. Additionally, we demonstrate the benefits of the algorithmic supervision enabled by Diffsurv by presenting a novel method for top-k risk prediction that outperforms current methods.
Submitted 26 April, 2023;
originally announced April 2023.
-
Phenotyping with Positive Unlabelled Learning for Genome-Wide Association Studies
Authors:
Andre Vauvelle,
Hamish Tomlinson,
Aaron Sim,
Spiros Denaxas
Abstract:
Identifying phenotypes plays an important role in furthering our understanding of disease biology through practical applications within healthcare and the life sciences. The challenge of dealing with the complexities and noise within electronic health records (EHRs) has motivated applications of machine learning in phenotypic discovery. While recent research has focused on finding predictive subtypes for clinical decision support, here we instead focus on the noise that results in phenotypic misclassification, which can reduce a phenotype's ability to detect associations in genome-wide association studies (GWAS). We show that by combining anchor learning and transformer architectures into our proposed model, AnchorBERT, we are able to detect genomic associations previously found only in large consortium studies with 5× more cases. When reducing the number of controls available by 50%, we find our model is able to maintain 40% more significant genomic associations from the GWAS catalog compared to standard phenotype definitions. Keywords: Phenotyping, Machine Learning, Semi-Supervised, Genetic Association Studies, Biological Discovery.
Submitted 15 February, 2022;
originally announced February 2022.
-
How to estimate the association between change in a risk factor and a health outcome?
Authors:
Michail Katsoulis,
Alvina G Lai,
Dimitra-Kleio Kipourou,
Reecha Sofat,
Manuel Gomes,
Amitava Banerjee,
Spiros Denaxas,
Thomas R Lumbers,
Kostas Tsilidis,
Harry Hemingway,
Karla Diaz-Ordaz
Abstract:
Estimating the effect of a change in a particular risk factor on a chronic disease requires information on the risk factor from two time points: enrolment and the first follow-up. When using observational data to study the effect of such an exposure (change in risk factor), extra complications arise, namely (i) when is time zero? and (ii) which information on confounders should we account for in this type of analysis: from enrolment, from the first follow-up, or from both? The combination of these questions has proven to be very challenging. Researchers have applied different methodologies with mixed success, because the different choices made when answering these questions induce systematic bias. Here we review these methodologies and highlight the sources of bias in each type of analysis. We discuss the advantages and limitations of each method and end with our recommendations on the analysis plan.
Submitted 21 December, 2020;
originally announced December 2020.
-
Selective recruitment designs for improving observational studies using electronic health records
Authors:
James E. Barrett,
Aylin Cakiroglu,
Catey Bunce,
Anoop Shah,
Spiros Denaxas
Abstract:
Large-scale electronic health records (EHRs) present an opportunity to quickly identify suitable individuals and invite them directly to participate in an observational study. EHRs can contain data from millions of individuals, raising the question of how to optimally select a cohort of size n from a larger pool of size N. In this paper we propose a simple selective recruitment protocol that selects a cohort in which covariates of interest tend to have a uniform distribution. We show that selectively recruited cohorts potentially offer greater statistical power and more accurate parameter estimates than randomly selected cohorts. Our protocol can be applied to studies with multiple categorical and continuous covariates. We apply our protocol to a numerically simulated prospective observational study using an EHR database of stable acute coronary disease patients from 82,089 individuals in the U.K. Selective recruitment designs require a smaller sample size, leading to more efficient and cost-effective studies.
Submitted 13 February, 2019;
originally announced March 2019.
-
Application of Clinical Concept Embeddings for Heart Failure Prediction in UK EHR data
Authors:
Spiros Denaxas,
Pontus Stenetorp,
Sebastian Riedel,
Maria Pikoula,
Richard Dobson,
Harry Hemingway
Abstract:
Electronic health records (EHR) are increasingly being used for constructing disease risk prediction models. Feature engineering in EHR data, however, is challenging due to their high-dimensional and heterogeneous nature. Low-dimensional representations of EHR data can potentially mitigate these challenges. In this paper, we use global vectors (GloVe) to learn word embeddings for diagnoses and procedures recorded using 13 million ontology terms across 2.7 million hospitalisations in national UK EHR. We demonstrate the utility of these embeddings by evaluating their performance in identifying patients who are at higher risk of being hospitalised for congestive heart failure. Our findings indicate that embeddings can enable the creation of robust EHR-derived disease risk prediction models and address some of the limitations associated with manual clinical feature engineering.
Submitted 28 November, 2018; v1 submitted 23 November, 2018;
originally announced November 2018.
-
Evaluation of Semantic Web Technologies for Storing Computable Definitions of Electronic Health Records Phenotyping Algorithms
Authors:
Vaclav Papez,
Spiros Denaxas,
Harry Hemingway
Abstract:
Electronic Health Records are electronic data generated during or as a byproduct of routine patient care. Structured, semi-structured and unstructured EHR offer researchers unprecedented phenotypic breadth and depth and have the potential to accelerate the development of precision medicine approaches at scale. A main EHR use-case is defining phenotyping algorithms that identify disease status, onset and severity. Phenotyping algorithms utilize diagnoses, prescriptions, laboratory tests, symptoms and other elements in order to identify patients with or without a specific trait. No common standardized, structured, computable format exists for storing phenotyping algorithms. The majority of algorithms are stored as human-readable descriptive text documents, which makes their translation to code challenging due to their inherent complexity and hinders their sharing and re-use across the community. In this paper, we evaluate the two key Semantic Web Technologies, the Web Ontology Language and the Resource Description Framework, for enabling computable representations of EHR-driven phenotyping algorithms.
Submitted 24 July, 2017;
originally announced July 2017.
-
Evaluating openEHR for storing computable representations of electronic health record phenotyping algorithms
Authors:
Vaclav Papez,
Spiros Denaxas,
Harry Hemingway
Abstract:
Electronic Health Records (EHR) are data generated during routine clinical care. EHR offer researchers unprecedented phenotypic breadth and depth and have the potential to accelerate the pace of precision medicine at scale. A main EHR use-case is creating phenotyping algorithms to define disease status, onset and severity. Currently, no common machine-readable standard exists for defining phenotyping algorithms, which are often stored in human-readable formats. As a result, the translation of algorithms to implementation code is challenging and sharing across the scientific community is problematic. In this paper, we evaluate openEHR, a formal EHR data specification, for computable representations of EHR phenotyping algorithms.
Submitted 27 April, 2017; v1 submitted 20 April, 2017;
originally announced April 2017.
-
Evaluation of Machine Learning Methods to Predict Coronary Artery Disease Using Metabolomic Data
Authors:
Henrietta Forssen,
Riyaz S. Patel,
Natalie Fitzpatrick,
Aroon Hingorani,
Adam Timmis,
Harry Hemingway,
Spiros C. Denaxas
Abstract:
Metabolomic data can potentially enable accurate, non-invasive and low-cost prediction of coronary artery disease. Regression-based analytical approaches, however, might fail to fully account for interactions between metabolites and rely on a priori selected input features, and thus might suffer from poorer accuracy. Supervised machine learning methods can potentially be used to fully exploit the dimensionality and richness of the data. In this paper, we systematically implement and evaluate a set of supervised learning methods (L1 regression, random forest classifier) and compare them to traditional regression-based approaches for disease prediction using metabolomic data.
Submitted 28 February, 2017;
originally announced March 2017.
-
A novel framework for assessing metadata quality in epidemiological and public health research settings
Authors:
Christiana McMahon,
Spiros Denaxas
Abstract:
Metadata are critical in epidemiological and public health research. However, a lack of biomedical metadata quality frameworks and limited awareness of the implications of poor quality metadata render data analyses problematic. In this study, we created and evaluated a novel framework to assess metadata quality of epidemiological and public health research datasets. We performed a literature review and surveyed stakeholders to enhance our understanding of biomedical metadata quality assessment. The review identified 11 studies and nine quality dimensions, none of which were specifically aimed at biomedical metadata. Ninety-six individuals completed the survey; of those who submitted data, most only assessed metadata quality sometimes, and eight did not at all. Our framework has four sections: a) general information; b) tools and technologies; c) usability; and d) management and curation. We evaluated the framework using three test cases and sought expert feedback. The framework can assess biomedical metadata quality systematically and robustly.
Submitted 22 August, 2016;
originally announced August 2016.