Search | arXiv e-print repository

Infusing clinical knowledge into tokenisers for language models

Authors: Abul Hasan, Jinge Wu, Quang Ngoc Nguyen, Salomé Andres, Imane Guellil, Huayu Zhang, Arlene Casey, Beatrice Alex, Bruce Guthrie, Honghan Wu

Abstract: This study introduces a novel knowledge enhanced tokenisation mechanism, K-Tokeniser, for clinical text processing. Technically, at the initialisation stage, K-Tokeniser populates global representations of tokens based on semantic types of domain concepts (such as drugs or diseases) from either a domain ontology like the Unified Medical Language System or the training data of the task related corpus. At the training or inference stage, sentence level localised context is used to choose the optimal global token representation to realise the semantic-based tokenisation. To avoid pretraining using the new tokeniser, an embedding initialisation approach is proposed to generate representations for new tokens. Using three transformer-based language models, a comprehensive set of experiments is conducted on four real-world datasets for evaluating K-Tokeniser in a wide range of clinical text analytics tasks including clinical concept and relation extraction, automated clinical coding, clinical phenotype identification, and clinical research article classification. Overall, our models demonstrate consistent improvements over their counterparts in all tasks. In particular, substantial improvements are observed in the automated clinical coding task with a 13% increase in Micro $F_1$ score. Furthermore, K-Tokeniser also shows significant capacity for facilitating quicker convergence of language models. Specifically, using K-Tokeniser, the language models would only require 50% of the training data to achieve the best performance of the baseline tokeniser using all training data in the concept extraction task and less than 20% of the data for the automated coding task. It is worth mentioning that all these improvements require no pre-training process, making the approach generalisable.

Submitted 20 June, 2024; originally announced June 2024.

Comments: 18 pages, 6 figures

arXiv:2309.02593 [pdf, other]

Quantum Voting and Violation of Gibbard-Satterthwaite's Impossibility Theorem

Authors: Ethan Dickey, Aidan Casey

Abstract: In the realm of algorithmic economics, voting systems are evaluated and compared by examining the properties or axioms they satisfy. While this pursuit has yielded valuable insights, it has also led to seminal impossibility results such as Arrow's and Gibbard-Satterthwaite's Impossibility Theorems, which pose challenges in designing ideal voting systems. Enter the domain of quantum computing: recent advancements have introduced the concept of quantum voting systems, which have many potential applications including in security and blockchain. Building on recent works that bypass Arrow's Impossibility Theorem using quantum voting systems, our research extends Quantum Condorcet Voting (QCV) to counter the Gibbard-Satterthwaite Impossibility Theorem in a quantum setting. To show this, we introduce a quantum-specific notion of truthfulness, extend ideas like incentive compatibility and the purpose of onto to the quantum domain, and introduce new tools to map social welfare functions to social choice functions in this domain.

Submitted 5 September, 2023; originally announced September 2023.

Comments: 35 pages, 1 figure, 2 tables

arXiv:2205.05656 [pdf, other]

doi 10.1186/s12911-023-02181-9

Ontology-Driven and Weakly Supervised Rare Disease Identification from Clinical Notes

Authors: Hang Dong, Víctor Suárez-Paniagua, Huayu Zhang, Minhong Wang, Arlene Casey, Emma Davidson, Jiaoyan Chen, Beatrice Alex, William Whiteley, Honghan Wu

Abstract: Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to identify because few cases are available for machine learning and data annotation requires domain experts. We propose a method using ontologies and weak supervision, with recent pre-trained contextual representations from Bi-directional Transformers (e.g. BERT). The ontology-based framework includes two steps: (i) Text-to-UMLS, extracting phenotypes by contextually linking mentions to concepts in Unified Medical Language System (UMLS), with a Named Entity Recognition and Linking (NER+L) tool, SemEHR, and weak supervision with customised rules and contextual mention representation; (ii) UMLS-to-ORDO, matching UMLS concepts to rare diseases in Orphanet Rare Disease Ontology (ORDO). The weakly supervised approach is proposed to learn a phenotype confirmation model to improve Text-to-UMLS linking, without annotated data from domain experts. We evaluated the approach on three clinical datasets, MIMIC-III discharge summaries, MIMIC-III radiology reports, and NHS Tayside brain imaging reports from two institutions in the US and the UK, with annotations. The improvements in the precision were pronounced (by over 30% to 50% absolute score for Text-to-UMLS linking), with almost no loss of recall compared to the existing NER+L tool, SemEHR. Results on radiology reports from MIMIC-III and NHS Tayside were consistent with the discharge summaries. The overall pipeline processing clinical notes can extract rare disease cases, mostly uncaptured in structured data (manually assigned ICD codes). We discuss the usefulness of the weak supervision approach and propose directions for future studies.

Submitted 3 May, 2023; v1 submitted 11 May, 2022; originally announced May 2022.

Comments: Accepted for BMC Medical Informatics and Decision Making, structured abstract in full text, 16 pages, 4 figures (and extra 7 pages, 1 figure in the supplementary material)

MSC Class: 68T50 (Primary); 68T30 (Secondary) ACM Class: I.2.7; J.3

arXiv:2109.06591 [pdf, other]

The impact of the COVID-19 pandemic on academic productivity

Authors: Andrew R. Casey, Ilya Mandel, Prasun K. Ray

Abstract: 'Publish or perish' is an expression describing the pressure on academics to consistently publish research to ensure a successful career in academia. With a global pandemic that has changed the world, how has it changed academic productivity? Here we show that academics are posting just as many publications on the arXiv pre-print server as if there were no pandemic: 168,630 were posted in 2020, a +12.6% change from 2019 and $+1.4σ$ deviation above the predicted 162,577 $\pm$ 4,393. However, some immediate impacts are visible in individual research fields. Conference cancellations have led to sharp drops in pre-prints, but laboratory closures have had mixed effects. Only some experimental fields show mild declines in outputs, with most being consistent with previous years or even increasing above model expectations. The most significant change is a 50% increase ($+8σ$) in quantitative biology research, all related to the COVID-19 pandemic. Some of these publications are by biologists using arXiv for the first time, and some are written by researchers from other fields (e.g., physicists, mathematicians). While quantitative biology pre-prints have returned to pre-pandemic levels, 20% of the research in this field is now focussed on the COVID-19 pandemic, demonstrating a strong shift in research focus.

Submitted 14 September, 2021; originally announced September 2021.

Comments: Submitted to RSOS
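
Note: the $+1.4σ$ figure quoted in the abstract above can be reproduced from the three numbers it gives. The sketch below is illustrative only. It assumes the quoted $\pm$ 4,393 is a one-sigma uncertainty on the prediction, and the function name `sigma_deviation` is hypothetical, not taken from the paper.

```python
# Figures quoted in the abstract.
observed = 168_630     # arXiv pre-prints posted in 2020
predicted = 162_577    # model prediction for 2020
uncertainty = 4_393    # assumed to be a one-sigma uncertainty on the prediction


def sigma_deviation(observed: float, predicted: float, uncertainty: float) -> float:
    """Return how many standard deviations `observed` lies above `predicted`."""
    return (observed - predicted) / uncertainty


z = sigma_deviation(observed, predicted, uncertainty)
print(f"excess = {observed - predicted}, deviation = {z:+.1f} sigma")
```

The excess of 6,053 pre-prints divided by 4,393 rounds to the $+1.4σ$ stated in the abstract.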

arXiv:2102.09553 [pdf, other]

doi 10.1186/s12911-021-01533-7

A Systematic Review of Natural Language Processing Applied to Radiology Reports

Authors: Arlene Casey, Emma Davidson, Michael Poon, Hang Dong, Daniel Duma, Andreas Grivas, Claire Grover, Víctor Suárez-Paniagua, Richard Tobin, William Whiteley, Honghan Wu, Beatrice Alex

Abstract: NLP has a significant role in advancing healthcare and has been found to be key in extracting structured information from radiology reports. Understanding recent developments in NLP application to radiology is of significance but recent reviews on this are limited. This study systematically assesses recent literature in NLP applied to radiology reports. Our automated literature search yields 4,799 results using automated filtering, metadata enriching steps and citation search combined with manual review. Our analysis is based on 21 variables including radiology characteristics, NLP methodology, performance, study, and clinical application characteristics. We present a comprehensive analysis of the 164 publications retrieved with each categorised into one of 6 clinical application categories. Deep learning use increases but conventional machine learning approaches are still prevalent. Deep learning remains challenged when data is scarce and there is little evidence of adoption into clinical practice. Despite 17% of studies reporting greater than 0.85 F1 scores, it is hard to comparatively evaluate these approaches given that most of them use different datasets. Only 14 studies made their data and 15 their code available with 10 externally validating results. Automated understanding of clinical narratives of the radiology reports has the potential to enhance the healthcare process but reproducibility and explainability of models are important if the domain is to move applications into clinical use. More could be done to share code enabling validation of methods on different institutional data and to reduce heterogeneity in reporting of study properties allowing inter-study comparisons. Our results have significance for researchers providing a systematic synthesis of existing work to build on, identify gaps, opportunities for collaboration and avoid duplication.

Submitted 18 February, 2021; originally announced February 2021.

Journal ref: BMC Medical Informatics and Decision Making 2021

arXiv:2008.10401 [pdf, ps, other]

Combinatorial diversity metrics for the analysis of policy processes

Authors: Mark Dukes, Anthony A. Casey

Abstract: We present several completely general diversity metrics to quantify the problem-solving capacity of any public policy decision making process. This is performed by modelling the policy process using a declarative process paradigm in conjunction with constraints modelled by expressions in linear temporal logic. We introduce a class of traces, called first-passage traces, to represent the different executions of the declarative processes. Heuristics of what properties a diversity measure of such processes ought to satisfy are used to derive two different metrics for these processes in terms of the set of first-passage traces. These metrics turn out to have formulations in terms of the entropies of two different random variables on the set of traces of the processes. In addition, we introduce a measure of 'goodness' whereby a trace is termed good if it satisfies some prescribed linear temporal logic expression. This allows for comparisons of policy processes with respect to the prescribed notion of 'goodness'.

Submitted 19 August, 2020; originally announced August 2020.

arXiv:2002.01415 [pdf, other]

doi 10.46298/jdmdh.6071

Plague Dot Text: Text mining and annotation of outbreak reports of the Third Plague Pandemic (1894-1952)

Authors: Arlene Casey, Mike Bennett, Richard Tobin, Claire Grover, Iona Walker, Lukas Engelmann, Beatrice Alex

Abstract: The design of models that govern diseases in populations is commonly built on information and data gathered from past outbreaks. However, epidemic outbreaks are never captured in statistical data alone but are communicated by narratives, supported by empirical observations. Outbreak reports discuss correlations between populations, locations and the disease to infer insights into causes, vectors and potential interventions. The problem with these narratives is usually the lack of consistent structure or strong conventions, which prohibits their formal analysis in larger corpora. Our interdisciplinary research investigates more than 100 reports from the third plague pandemic (1894-1952) evaluating ways of building a corpus to extract and structure this narrative information through text mining and manual annotation. In this paper we discuss the progress of our ongoing exploratory project, how we enhance optical character recognition (OCR) methods to improve text capture, our approach to structure the narratives and identify relevant entities in the reports. The structured corpus is made available via Solr enabling search and analysis across the whole collection for future research dedicated, for example, to the identification of concepts. We show preliminary visualisations of the characteristics of causation and differences with respect to gender as a result of syntactic-category-dependent corpus statistics. Our goal is to develop structured accounts of some of the most significant concepts that were used to understand the epidemiology of the third plague pandemic around the globe. The corpus enables researchers to analyse the reports collectively allowing for deep insights into the global epidemiological consideration of plague in the early twentieth century.

Submitted 11 January, 2021; v1 submitted 4 February, 2020; originally announced February 2020.

Comments: Journal of Data Mining & Digital Humanities 2021

Journal ref: Journal of Data Mining & Digital Humanities, HistoInformatics, HistoInformatics (January 20, 2021) jdmdh:6071

arXiv:1511.06351 [pdf, other]

Learning Representations Using Complex-Valued Nets

Authors: Andy M. Sarroff, Victor Shepardson, Michael A. Casey

Abstract: Complex-valued neural networks (CVNNs) are an emerging field of research in neural networks due to their potential representational properties for audio, image, and physiological signals. It is common in signal processing to transform sequences of real values to the complex domain via a set of complex basis functions, such as the Fourier transform. We show how CVNNs can be used to learn complex representations of real-valued time-series data. We present methods and results using a framework that can compose holomorphic and non-holomorphic functions in a multi-layer network using a theoretical result called the Wirtinger derivative. We test our methods on a representation learning task for real-valued signals, recurrent complex-valued networks and their real-valued counterparts. Our results show that recurrent complex-valued networks can perform as well as their real-valued counterparts while learning filters that are representative of the domain of the data.

Submitted 19 November, 2015; originally announced November 2015.
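
Note: for readers unfamiliar with the Wirtinger derivative named in the abstract above, the standard textbook definition is reproduced below. This is general background, not the paper's own formulation, and the worked example $f(z) = |z|^2$ is chosen here purely for illustration.

```latex
% Standard definition for f(z) with z = x + iy (background, not taken from the paper).
\[
\frac{\partial f}{\partial z}
  = \frac{1}{2}\left(\frac{\partial f}{\partial x} - i\,\frac{\partial f}{\partial y}\right),
\qquad
\frac{\partial f}{\partial \bar{z}}
  = \frac{1}{2}\left(\frac{\partial f}{\partial x} + i\,\frac{\partial f}{\partial y}\right)
\]
% A function is holomorphic exactly when \partial f / \partial \bar{z} = 0.
% Illustrative example: f(z) = |z|^2 = z\bar{z} = x^2 + y^2 is not holomorphic, yet
\[
\frac{\partial f}{\partial z} = \frac{1}{2}(2x - 2iy) = \bar{z},
\qquad
\frac{\partial f}{\partial \bar{z}} = \frac{1}{2}(2x + 2iy) = z .
\]
```

Both partial derivatives exist even though $f$ is not complex-differentiable, which is why this calculus is useful for real-valued losses of complex parameters.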

Showing 1–8 of 8 results for author: Casey, A