doi 10.1109/BIBM55620.2022.9995229

Unsupervised extraction, labelling and clustering of segments from clinical notes

Authors: Petr Zelina, Jana Halámková, Vít Nováček

Abstract: This work is motivated by the scarcity of tools for accurate, unsupervised information extraction from unstructured clinical notes in computationally underrepresented languages, such as Czech. We introduce a stepping stone to a broad array of downstream tasks such as summarisation or integration of individual patient records, extraction of structured information for national cancer registry reporting or building of semi-structured semantic patient representations for computing patient embeddings. More specifically, we present a method for unsupervised extraction of semantically-labelled textual segments from clinical notes and test it out on a dataset of Czech breast cancer patients, provided by Masaryk Memorial Cancer Institute (the largest Czech hospital specialising in oncology). Our goal was to extract, classify (i.e. label) and cluster segments of the free-text notes that correspond to specific clinical features (e.g., family background, comorbidities or toxicities). The presented results demonstrate the practical relevance of the proposed approach for building more sophisticated extraction and analytical pipelines deployed on Czech clinical notes.

Submitted 21 November, 2022; originally announced November 2022.

Comments: To be published at the IEEE BIBM 2022 conference

Journal ref: IEEE BIBM; 2022; pages 1362-1368

arXiv:2211.09856 [pdf, other]

Machine Learning-Assisted Recurrence Prediction for Early-Stage Non-Small-Cell Lung Cancer Patients

Authors: Adrianna Janik, Maria Torrente, Luca Costabello, Virginia Calvo, Brian Walsh, Carlos Camps, Sameh K. Mohamed, Ana L. Ortega, Vít Nováček, Bartomeu Massutí, Pasquale Minervini, M. Rosario Garcia Campelo, Edel del Barco, Joaquim Bosch-Barrera, Ernestina Menasalvas, Mohan Timilsina, Mariano Provencio

Abstract: Background: Stratifying cancer patients according to risk of relapse can personalize their care. In this work, we provide an answer to the following research question: How to utilize machine learning to estimate probability of relapse in early-stage non-small-cell lung cancer patients? Methods: For predicting relapse in 1,387 early-stage (I-II), non-small-cell lung cancer (NSCLC) patients from the Spanish Lung Cancer Group data (65.7 average age, 24.8% females, 75.2% males) we train tabular and graph machine learning models. We generate automatic explanations for the predictions of such models. For models trained on tabular data, we adopt SHAP local explanations to gauge how each patient feature contributes to the predicted outcome. We explain graph machine learning predictions with an example-based method that highlights influential past patients. Results: Machine learning models trained on tabular data exhibit a 76% accuracy for the Random Forest model at predicting relapse evaluated with a 10-fold cross-validation (model was trained 10 times with different independent sets of patients in test, train and validation sets, the reported metrics are averaged over these 10 test sets). Graph machine learning reaches 68% accuracy over a 200-patient, held-out test set, calibrated on a held-out set of 100 patients. Conclusions: Our results show that machine learning models trained on tabular and graph data can enable objective, personalised and reproducible prediction of relapse and therefore, disease outcome in patients with early-stage NSCLC.
With further prospective and multisite validation, and additional radiological and molecular data, this prognostic model could potentially serve as a predictive decision support tool for deciding the use of adjuvant treatments in early-stage lung cancer. Keywords: Non-Small-Cell Lung Cancer, Tumor Recurrence Prediction, Machine Learning

Submitted 17 November, 2022; originally announced November 2022.

arXiv:1809.07685 [pdf, ps, other]

Finding Explanations of Entity Relatedness in Graphs: A Survey

Authors: Raoul Biagioni, Pierre-Yves Vandenbussche, Vit Novacek

Abstract: Analysing and explaining relationships between entities in a graph is a fundamental problem associated with many practical applications. For example, a graph of biological pathways can be used for discovering a previously unknown relationship between two proteins. Domain experts, however, may be reluctant to trust such a discovery without a detailed explanation as to why exactly the two proteins are deemed related in the graph. This paper provides an overview of the types of solutions, their associated methods and strategies, that have been proposed for finding entity relatedness explanations in graphs. The first type of solution relies on information inherent to the paths connecting the entities. This type of solution provides entity relatedness explanations in the form of a list of ranked paths. The rank of a path is measured in terms of importance, uniqueness, novelty and informativeness. The second type of solution relies on measures of node relevance. In this case, the relevance of nodes is measured w.r.t. the entities of interest, and relatedness explanations are provided in the form of a subgraph that maximises node relevance scores. This paper uses this classification of approaches to discuss and contrast some of the key concepts that guide different solutions to the problem of entity relatedness explanation in graphs.

Submitted 9 August, 2018; originally announced September 2018.

Comments: 10 pages, 9 Equations, Survey Paper

arXiv:1503.09137 [pdf, other]

Formalising Hypothesis Virtues in Knowledge Graphs: A General Theoretical Framework and its Validation in Literature-Based Discovery Experiments

Authors: Vit Novacek

Abstract: We introduce an approach to discovery informatics that uses so-called knowledge graphs as the essential representation structure. Knowledge graph is an umbrella term that subsumes various approaches to tractable representation of large volumes of loosely structured knowledge in a graph form. It has been used primarily in the Web and Linked Open Data contexts, but is applicable to any other area dealing with knowledge representation. In the perspective of our approach motivated by the challenges of discovery informatics, knowledge graphs correspond to hypotheses. We present a framework for formalising so-called hypothesis virtues within knowledge graphs. The framework is based on a classic work in philosophy of science, and naturally progresses from mostly informative foundational notions to actionable specifications of measures corresponding to particular virtues. These measures can consequently be used to determine refined sub-sets of knowledge graphs that have large relative potential for making discoveries. We validate the proposed framework by experiments in literature-based discovery. The experiments have demonstrated the utility of our work and its superiority w.r.t. related approaches.

Submitted 28 April, 2015; v1 submitted 31 March, 2015; originally announced March 2015.

Comments: Pre-print of an article submitted to Artificial Intelligence Journal (after the manuscript has been refused by the editors of Journal of Web Semantics before the peer review process due to being out of scope for that journal)

arXiv:1406.1061 [pdf, other]

A Methodology for Empirical Analysis of LOD Datasets

Authors: Vit Novacek

Abstract: CoCoE stands for Complexity, Coherence and Entropy, and presents an extensible methodology for empirical analysis of Linked Open Data (i.e., RDF graphs). CoCoE can offer answers to questions like: Is dataset A better than B for knowledge discovery since it is more complex and informative?, Is dataset X better than Y for simple value lookups due to its flatter structure?, etc. In order to address such questions, we introduce a set of well-founded measures based on complementary notions from distributional semantics, network analysis and information theory. These measures are part of a specific implementation of the CoCoE methodology that is available for download. Last but not least, we illustrate CoCoE by its application to selected biomedical RDF datasets.

Submitted 4 June, 2014; originally announced June 2014.

Comments: A current working draft of the paper submitted to the ISWC'14 conference (track information available here: http://iswc2014.semanticweb.org/call-replication-benchmark-data-software-papers)

arXiv:1304.6473 [pdf, other]

Technical report: Linking the scientific and clinical data with KI2NA-LHC

Authors: Vit Novacek, Aisha Naseer

Abstract: We introduce a use case and propose a system for data and knowledge integration in life sciences. In particular, we focus on linking clinical resources (electronic patient records) with scientific documents and data (research articles, biomedical ontologies and databases). Our motivation is two-fold. Firstly, we aim to instantly provide scientific context of particular patient cases for clinicians in order for them to propose treatments in a more informed way. Secondly, we want to build a technical infrastructure for researchers that will allow them to semi-automatically formulate and evaluate their hypothesis against longitudinal patient data. This paper describes the proposed system and its typical usage in a broader context of KI2NA, an ongoing collaboration between the DERI research institute and Fujitsu Laboratories. We introduce an architecture of the proposed framework called KI2NA-LHC (for Linked Health Care) and outline the details of its implementation. We also describe typical usage scenarios and propose a methodology for evaluation of the whole framework. The main goal of this paper is to introduce our ongoing work to a broader expert audience. By doing so, we aim to establish an early-adopter community for our work and elicit feedback we could reflect in the development of the prototype so that it is better tailored to the requirements of target users.

Submitted 23 April, 2013; originally announced April 2013.

Comments: A longer version of a paper originally published at the IEEE conference on Computer-Based Medical Systems (CBMS'13), under the name: Linking the Scientific and Clinical Data with KI2NA-LHC - An Outline (authors are the same)

arXiv:1210.3241 [pdf, ps, other]

Distributional Framework for Emergent Knowledge Acquisition and its Application to Automated Document Annotation

Authors: Vit Novacek

Abstract: The paper introduces a framework for representation and acquisition of knowledge emerging from large samples of textual data. We utilise a tensor-based, distributional representation of simple statements extracted from text, and show how one can use the representation to infer emergent knowledge patterns from the textual data in an unsupervised manner. Examples of the patterns we investigate in the paper are implicit term relationships or conjunctive IF-THEN rules. To evaluate the practical relevance of our approach, we apply it to annotation of life science articles with terms from MeSH (a controlled biomedical vocabulary and thesaurus).

Submitted 11 October, 2012; originally announced October 2012.

ACM Class: I.2.6; I.2.7; H.2.8

Showing 1–7 of 7 results for author: Nováček, V