Sample Size in Natural Language Processing within Healthcare Research

Authors: Jaya Chaturvedi, Diana Shamsutdinova, Felix Zimmer, Sumithra Velupillai, Daniel Stahl, Robert Stewart, Angus Roberts

Abstract: Sample size calculation is an essential step in most data-based disciplines. Large enough samples ensure representativeness of the population and determine the precision of estimates. This is true for most quantitative studies, including those that employ machine learning methods, such as natural language processing, where free-text is used to generate predictions and classify instances of text. Within the healthcare domain, the lack of sufficient corpora of previously collected data can be a limiting factor when determining sample sizes for new studies. This paper tries to address the issue by making recommendations on sample sizes for text classification tasks in the healthcare domain. Models trained on the MIMIC-III database of critical care records from Beth Israel Deaconess Medical Center were used to classify documents as having or not having Unspecified Essential Hypertension, the most common diagnosis code in the database. Simulations were performed using various classifiers on different sample sizes and class proportions. This was repeated for a comparatively less common diagnosis code within the database of diabetes mellitus without mention of complication. Smaller sample sizes resulted in better results when using a K-nearest neighbours classifier, whereas larger sample sizes provided better results with support vector machines and BERT models. Overall, a sample size larger than 1000 was sufficient to provide decent performance metrics.
The simulations conducted within this study provide guidelines that can be used as recommendations for selecting appropriate sample sizes and class proportions, and for predicting expected performance, when building classifiers for textual healthcare data. The methodology used here can be modified for sample size estimates calculations with other datasets.

Submitted 5 September, 2023; originally announced September 2023.

Comments: Submitted to Journal of Biomedical Informatics

arXiv:2308.08904 [pdf]

Development of a Knowledge Graph Embeddings Model for Pain

Authors: Jaya Chaturvedi, Tao Wang, Sumithra Velupillai, Robert Stewart, Angus Roberts

Abstract: Pain is a complex concept that can interconnect with other concepts such as a disorder that might cause pain, a medication that might relieve pain, and so on. To fully understand the context of pain experienced by either an individual or across a population, we may need to examine all concepts related to pain and the relationships between them. This is especially useful when modeling pain that has been recorded in electronic health records. Knowledge graphs represent concepts and their relations by an interlinked network, enabling semantic and context-based reasoning in a computationally tractable form. These graphs can, however, be too large for efficient computation. Knowledge graph embeddings help to resolve this by representing the graphs in a low-dimensional vector space. These embeddings can then be used in various downstream tasks such as classification and link prediction. The various relations associated with pain which are required to construct such a knowledge graph can be obtained from external medical knowledge bases such as SNOMED CT, a hierarchical systematic nomenclature of medical terms. A knowledge graph built in this way could be further enriched with real-world examples of pain and its relations extracted from electronic health records.
This paper describes the construction of such knowledge graph embedding models of pain concepts, extracted from the unstructured text of mental health electronic health records, combined with external knowledge created from relations described in SNOMED CT, and their evaluation on a subject-object link prediction task. The performance of the models was compared with other baseline models.

Submitted 17 August, 2023; originally announced August 2023.

Comments: Accepted at AMIA 2023, New Orleans

arXiv:2304.01240 [pdf]

Identifying Mentions of Pain in Mental Health Records Text: A Natural Language Processing Approach

Authors: Jaya Chaturvedi, Sumithra Velupillai, Robert Stewart, Angus Roberts

Abstract: Pain is a common reason for accessing healthcare resources and is a growing area of research, especially in its overlap with mental health. Mental health electronic health records are a good data source to study this overlap. However, much information on pain is held in the free text of these records, where mentions of pain present a unique natural language processing problem due to its ambiguous nature. This project uses data from an anonymised mental health electronic health records database. The data are used to train a machine learning based classification algorithm to classify sentences as discussing patient pain or not. This will facilitate the extraction of relevant pain information from large databases, and the use of such outputs for further studies on pain and mental health. 1,985 documents were manually triple-annotated for creation of gold standard training data, which was used to train three commonly used classification algorithms. The best performing model achieved an F1-score of 0.98 (95% CI 0.98-0.99).

Submitted 5 April, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

Comments: 5 pages, 2 tables, submitted to MEDINFO 2023 conference

arXiv:2007.10159 [pdf, other]

Analysing Meso and Macro conversation structures in an online suicide support forum

Authors: Sagar Joglekar, Sumithra Velupillai, Rina Dutta, Nishanth Sastry

Abstract: Platforms like Reddit and Twitter offer internet users an opportunity to talk about diverse issues, including those pertaining to physical and mental health. Some of these forums also function as a safe space for severely distressed mental health patients to get social support from peers. The online community platform Reddit's SuicideWatch is one example of an online forum dedicated specifically to people who suffer from suicidal thoughts, or who are concerned about people who might be at risk. It remains to be seen if these forums can be used to understand and model the nature of online social support, not least because of the noisy and informal nature of conversations. Moreover, understanding how a community of volunteering peers react to calls for help in cases of suicidal posts, would help to devise better tools for online mitigation of such episodes. In this paper, we propose an approach to characterise conversations in online forums. Using data from the SuicideWatch subreddit as a case study, we propose metrics at a macroscopic level -- measuring the structure of the entire conversation as a whole. We also develop a framework to measure structures in supportive conversations at a mesoscopic level -- measuring interactions with the immediate neighbours of the person in distress. We statistically show through comparison with baseline conversations from random Reddit threads that certain macro and meso-scale structures in an online conversation exhibit signatures of social support, and are particularly over-expressed in SuicideWatch conversations.

Submitted 20 July, 2020; originally announced July 2020.

arXiv:1907.01055 [pdf, other]

Is artificial data useful for biomedical Natural Language Processing algorithms?

Authors: Zixu Wang, Julia Ive, Sumithra Velupillai, Lucia Specia

Abstract: A major obstacle to the development of Natural Language Processing (NLP) methods in the biomedical domain is data accessibility. This problem can be addressed by generating medical data artificially. Most previous studies have focused on the generation of short clinical text, and evaluation of the data utility has been limited. We propose a generic methodology to guide the generation of clinical text with key phrases. We use the artificial data as additional training data in two key biomedical NLP tasks: text classification and temporal relation extraction. We show that artificially generated training data used in conjunction with real training data can lead to performance boosts for data-greedy neural network algorithms. We also demonstrate the usefulness of the generated data for NLP setups where it fully replaces real training data.

Submitted 7 August, 2019; v1 submitted 1 July, 2019; originally announced July 2019.

Comments: BioNLP 2019

arXiv:1508.02079 [pdf, ps, other]

Facts and Fabrications about Ebola: A Twitter Based Study

Authors: Janani Kalyanam, Sumithra Velupillai, Son Doan, Mike Conway, Gert Lanckriet

Abstract: Microblogging websites like Twitter have been shown to be immensely useful for spreading information on a global scale within seconds. The detrimental effect, however, of such platforms is that misinformation and rumors are also as likely to spread on the network as credible, verified information. From a public health standpoint, the spread of misinformation creates unnecessary panic for the public. We recently witnessed several such scenarios during the outbreak of Ebola in 2014 [14, 1]. In order to effectively counter the medical misinformation in a timely manner, our goal here is to study the nature of such misinformation and rumors in the United States during fall 2014 when a handful of Ebola cases were confirmed in North America. It is a well known convention on Twitter to use hashtags to give context to a Twitter message (a tweet). In this study, we collected approximately 47M tweets from the Twitter streaming API related to Ebola. Based on hashtags, we propose a method to classify the tweets into two sets: credible and speculative. We analyze these two sets and study how they differ in terms of a number of features extracted from the Twitter API. In conclusion, we infer several interesting differences between the two sets. We outline further potential directions to using this material for monitoring and separating speculative tweets from credible ones, to enable improved public health information.

Submitted 9 August, 2015; originally announced August 2015.

Comments: Appears in SIGKDD BigCHat Workshop 2015

Showing 1–6 of 6 results for author: Velupillai, S