Search | arXiv e-print repository

sc-OTGM: Single-Cell Perturbation Modeling by Solving Optimal Mass Transport on the Manifold of Gaussian Mixtures

Authors: Andac Demir, Elizaveta Solovyeva, James Boylan, Mei Xiao, Fabrizio Serluca, Sebastian Hoersch, Jeremy Jenkins, Murthy Devarakonda, Bulent Kiziltan

Abstract: Influenced by breakthroughs in LLMs, single-cell foundation models are emerging. While these models show successful performance in cell type clustering, phenotype classification, and gene perturbation response prediction, it remains to be seen if a simpler model could achieve comparable or better results, especially with limited data. This is important, as the quantity and quality of single-cell d… ▽ More Influenced by breakthroughs in LLMs, single-cell foundation models are emerging. While these models show successful performance in cell type clustering, phenotype classification, and gene perturbation response prediction, it remains to be seen if a simpler model could achieve comparable or better results, especially with limited data. This is important, as the quantity and quality of single-cell data typically fall short of the standards in textual data used for training LLMs. Single-cell sequencing often suffers from technical artifacts, dropout events, and batch effects. These challenges are compounded in a weakly supervised setting, where the labels of cell states can be noisy, further complicating the analysis. To tackle these challenges, we present sc-OTGM, streamlined with less than 500K parameters, making it approximately 100x more compact than the foundation models, offering an efficient alternative. sc-OTGM is an unsupervised model grounded in the inductive bias that the scRNAseq data can be generated from a combination of the finite multivariate Gaussian distributions. The core function of sc-OTGM is to create a probabilistic latent space utilizing a GMM as its prior distribution and distinguish between distinct cell populations by learning their respective marginal PDFs. It uses a Hit-and-Run Markov chain sampler to determine the OT plan across these PDFs within the GMM framework. We evaluated our model against a CRISPR-mediated perturbation dataset, called CROP-seq, consisting of 57 one-gene perturbations. Our results demonstrate that sc-OTGM is effective in cell state classification, aids in the analysis of differential gene expression, and ranks genes for target identification through a recommender system. It also predicts the effects of single-gene perturbations on downstream gene regulation and generates synthetic scRNA-seq data conditioned on specific cell states. △ Less

Submitted 6 May, 2024; originally announced May 2024.

Comments: ICLR 2024, Machine Learning for Genomics Explorations Workshop

arXiv:2309.15979 [pdf]

Clinical Trial Recommendations Using Semantics-Based Inductive Inference and Knowledge Graph Embeddings

Authors: Murthy V. Devarakonda, Smita Mohanty, Raja Rao Sunkishala, Nag Mallampalli, Xiong Liu

Abstract: Designing a new clinical trial entails many decisions, such as defining a cohort and setting the study objectives to name a few, and therefore can benefit from recommendations based on exhaustive mining of past clinical trial records. Here, we propose a novel recommendation methodology, based on neural embeddings trained on a first-of-a-kind knowledge graph of clinical trials. We addressed several… ▽ More Designing a new clinical trial entails many decisions, such as defining a cohort and setting the study objectives to name a few, and therefore can benefit from recommendations based on exhaustive mining of past clinical trial records. Here, we propose a novel recommendation methodology, based on neural embeddings trained on a first-of-a-kind knowledge graph of clinical trials. We addressed several important research questions in this context, including designing a knowledge graph (KG) for clinical trial data, effectiveness of various KG embedding (KGE) methods for it, a novel inductive inference using KGE, and its use in generating recommendations for clinical trial design. We used publicly available data from clinicaltrials.gov for the study. Results show that our recommendations approach achieves relevance scores of 70%-83%, measured as the text similarity to actual clinical trial elements, and the most relevant recommendation can be found near the top of list. Our study also suggests potential improvement in training KGE using node semantics. △ Less

Submitted 27 September, 2023; originally announced September 2023.

Comments: 13 pages (w/o bibliography), 4 Figures, 6 Tables

arXiv:2212.14102 [pdf, other]

Customizing Knowledge Graph Embedding to Improve Clinical Study Recommendation

Authors: Xiong Liu, Iya Khalil, Murthy Devarakonda

Abstract: Inferring knowledge from clinical trials using knowledge graph embedding is an emerging area. However, customizing graph embeddings for different use cases remains a significant challenge. We propose custom2vec, an algorithmic framework to customize graph embeddings by incorporating user preferences in training the embeddings. It captures user preferences by adding custom nodes and links derived f… ▽ More Inferring knowledge from clinical trials using knowledge graph embedding is an emerging area. However, customizing graph embeddings for different use cases remains a significant challenge. We propose custom2vec, an algorithmic framework to customize graph embeddings by incorporating user preferences in training the embeddings. It captures user preferences by adding custom nodes and links derived from manually vetted results of a separate information retrieval method. We propose a joint learning objective to preserve the original network structure while incorporating the user's custom annotations. We hypothesize that the custom training improves user-expected predictions, for example, in link prediction tasks. We demonstrate the effectiveness of custom2vec for clinical trials related to non-small cell lung cancer (NSCLC) with two customization scenarios: recommending immuno-oncology trials evaluating PD-1 inhibitors and exploring similar trials that compare new therapies with a standard of care. The results show that custom2vec training achieves better performance than the conventional training methods. Our approach is a novel way to customize knowledge graph embeddings and enable more accurate recommendations and predictions. △ Less

Submitted 28 December, 2022; originally announced December 2022.

arXiv:2110.10027 [pdf]

Clinical Trial Information Extraction with BERT

Authors: Xiong Liu, Greg L. Hersch, Iya Khalil, Murthy Devarakonda

Abstract: Natural language processing (NLP) of clinical trial documents can be useful in new trial design. Here we identify entity types relevant to clinical trial design and propose a framework called CT-BERT for information extraction from clinical trial text. We trained named entity recognition (NER) models to extract eligibility criteria entities by fine-tuning a set of pre-trained BERT models. We then… ▽ More Natural language processing (NLP) of clinical trial documents can be useful in new trial design. Here we identify entity types relevant to clinical trial design and propose a framework called CT-BERT for information extraction from clinical trial text. We trained named entity recognition (NER) models to extract eligibility criteria entities by fine-tuning a set of pre-trained BERT models. We then compared the performance of CT-BERT with recent baseline methods including attention-based BiLSTM and Criteria2Query. The results demonstrate the superiority of CT-BERT in clinical trial NLP. △ Less

Submitted 11 September, 2021; originally announced October 2021.

Comments: HealthNLP 2021, IEEE International Conference on Healthcare Informatics (ICHI 2021)

arXiv:2109.02808 [pdf, other]

A Scalable AI Approach for Clinical Trial Cohort Optimization

Authors: Xiong Liu, Cheng Shi, Uday Deore, Yingbo Wang, Myah Tran, Iya Khalil, Murthy Devarakonda

Abstract: FDA has been promoting enrollment practices that could enhance the diversity of clinical trial populations, through broadening eligibility criteria. However, how to broaden eligibility remains a significant challenge. We propose an AI approach to Cohort Optimization (AICO) through transformer-based natural language processing of the eligibility criteria and evaluation of the criteria using real-wo… ▽ More FDA has been promoting enrollment practices that could enhance the diversity of clinical trial populations, through broadening eligibility criteria. However, how to broaden eligibility remains a significant challenge. We propose an AI approach to Cohort Optimization (AICO) through transformer-based natural language processing of the eligibility criteria and evaluation of the criteria using real-world data. The method can extract common eligibility criteria variables from a large set of relevant trials and measure the generalizability of trial designs to real-world patients. It overcomes the scalability limits of existing manual methods and enables rapid simulation of eligibility criteria design for a disease of interest. A case study on breast cancer trial design demonstrates the utility of the method in improving trial generalizability. △ Less

Submitted 6 September, 2021; originally announced September 2021.

Comments: PharML 2021 (Machine Learning for Pharma and Healthcare Applications) at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021)

arXiv:2101.02017 [pdf, ps, other]

COVID-19: Comparative Analysis of Methods for Identifying Articles Related to Therapeutics and Vaccines without Using Labeled Data

Authors: Mihir Parmar, Ashwin Karthik Ambalavanan, Hong Guan, Rishab Banerjee, Jitesh Pabla, Murthy Devarakonda

Abstract: Here we proposed an approach to analyze text classification methods based on the presence or absence of task-specific terms (and their synonyms) in the text. We applied this approach to study six different transfer-learning and unsupervised methods for screening articles relevant to COVID-19 vaccines and therapeutics. The analysis revealed that while a BERT model trained on search-engine results g… ▽ More Here we proposed an approach to analyze text classification methods based on the presence or absence of task-specific terms (and their synonyms) in the text. We applied this approach to study six different transfer-learning and unsupervised methods for screening articles relevant to COVID-19 vaccines and therapeutics. The analysis revealed that while a BERT model trained on search-engine results generally performed well, it miss-classified relevant abstracts that did not contain task-specific terms. We used this insight to create a more effective unsupervised ensemble. △ Less

Submitted 5 January, 2021; originally announced January 2021.

Comments: 6 pages, 3 Tables, Appendix

arXiv:2004.06222 [pdf]

Cascade Neural Ensemble for Identifying Scientifically Sound Articles

Authors: Ashwin Karthik Ambalavanan, Murthy Devarakonda

Abstract: Background: A significant barrier to conducting systematic reviews and meta-analysis is efficiently finding scientifically sound relevant articles. Typically, less than 1% of articles match this requirement which leads to a highly imbalanced task. Although feature-engineered and early neural networks models were studied for this task, there is an opportunity to improve the results. Methods: We f… ▽ More Background: A significant barrier to conducting systematic reviews and meta-analysis is efficiently finding scientifically sound relevant articles. Typically, less than 1% of articles match this requirement which leads to a highly imbalanced task. Although feature-engineered and early neural networks models were studied for this task, there is an opportunity to improve the results. Methods: We framed the problem of filtering articles as a classification task, and trained and tested several ensemble architectures of SciBERT, a variant of BERT pre-trained on scientific articles, on a manually annotated dataset of about 50K articles from MEDLINE. Since scientifically sound articles are identified through a multi-step process we proposed a novel cascade ensemble analogous to the selection process. We compared the performance of the cascade ensemble with a single integrated model and other types of ensembles as well as with results from previous studies. Results: The cascade ensemble architecture achieved 0.7505 F measure, an impressive 49.1% error rate reduction, compared to a CNN model that was previously proposed and evaluated on a selected subset of the 50K articles. On the full dataset, the cascade ensemble achieved 0.7639 F measure, resulting in an error rate reduction of 19.7% compared to the best performance reported in a previous study that used the full dataset. Conclusion: Pre-trained contextual encoder neural networks (e.g. SciBERT) perform better than the models studied previously and manually created search filters in filtering for scientifically sound relevant articles. The superior performance achieved by the cascade ensemble is a significant result that generalizes beyond this task and the dataset, and is analogous to query optimization in IR and databases. △ Less

Submitted 13 April, 2020; originally announced April 2020.

Comments: 11 pages, 4 figures, and 9 tables

arXiv:2004.06216 [pdf]

Robustly Pre-trained Neural Model for Direct Temporal Relation Extraction

Authors: Hong Guan, Jianfu Li, Hua Xu, Murthy Devarakonda

Abstract: Background: Identifying relationships between clinical events and temporal expressions is a key challenge in meaningfully analyzing clinical text for use in advanced AI applications. While previous studies exist, the state-of-the-art performance has significant room for improvement. Methods: We studied several variants of BERT (Bidirectional Encoder Representations using Transformers) some invol… ▽ More Background: Identifying relationships between clinical events and temporal expressions is a key challenge in meaningfully analyzing clinical text for use in advanced AI applications. While previous studies exist, the state-of-the-art performance has significant room for improvement. Methods: We studied several variants of BERT (Bidirectional Encoder Representations using Transformers) some involving clinical domain customization and the others involving improved architecture and/or training strategies. We evaluated these methods using a direct temporal relations dataset which is a semantically focused subset of the 2012 i2b2 temporal relations challenge dataset. Results: Our results show that RoBERTa, which employs better pre-training strategies including using 10x larger corpus, has improved overall F measure by 0.0864 absolute score (on the 1.00 scale) and thus reducing the error rate by 24% relative to the previous state-of-the-art performance achieved with an SVM (support vector machine) model. Conclusion: Modern contextual language modeling neural networks, pre-trained on a large corpus, achieve impressive performance even on highly-nuanced clinical temporal relation tasks. △ Less

Submitted 13 April, 2020; originally announced April 2020.

Comments: 10 pages, 1 Figure, 7 Tables

arXiv:1911.03869 [pdf, other]

Knowledge Guided Named Entity Recognition for BioMedical Text

Authors: Pratyay Banerjee, Kuntal Kumar Pal, Murthy Devarakonda, Chitta Baral

Abstract: In this work, we formulate the NER task as a multi-answer knowledge guided QA task (KGQA) which helps to predict entities only by assigning B, I and O tags without associating entity types with the tags. We provide different knowledge contexts, such as, entity types, questions, definitions and examples along with the text and train on a combined dataset of 18 biomedical corpora. This formulation (… ▽ More In this work, we formulate the NER task as a multi-answer knowledge guided QA task (KGQA) which helps to predict entities only by assigning B, I and O tags without associating entity types with the tags. We provide different knowledge contexts, such as, entity types, questions, definitions and examples along with the text and train on a combined dataset of 18 biomedical corpora. This formulation (a) enables systems to jointly learn NER specific features from varied NER datasets, (b) can use knowledge-text attention to identify words having higher similarity to provided knowledge, improving performance, (c) reduces system confusion by reducing the prediction classes to B, I, O only, and (d) makes detection of nested entities easier. We perform extensive experiments of this KGQA formulation on 18 biomedical NER datasets, and through experiments we note that knowledge helps in achieving better performance. Our problem formulation is able to achieve state-of-the-art results in 12 datasets. △ Less

Submitted 18 September, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

Comments: 6 pages, 2 figures, 5 tables, WIP

arXiv:1906.11930 [pdf]

Training Models to Extract Treatment Plans from Clinical Notes Using Contents of Sections with Headings

Authors: Ananya Poddar, Bharath Dandala, Murthy Devarakonda

Abstract: Objective: Using natural language processing (NLP) to find sentences that state treatment plans in a clinical note, would automate plan extraction and would further enable their use in tools that help providers and care managers. However, as in the most NLP tasks on clinical text, creating gold standard to train and test NLP models is tedious and expensive. Fortuitously, sometimes but not always c… ▽ More Objective: Using natural language processing (NLP) to find sentences that state treatment plans in a clinical note, would automate plan extraction and would further enable their use in tools that help providers and care managers. However, as in the most NLP tasks on clinical text, creating gold standard to train and test NLP models is tedious and expensive. Fortuitously, sometimes but not always clinical notes contain sections with a heading that identifies the section as a plan. Leveraging contents of such labeled sections as a noisy training data, we assessed accuracy of NLP models trained with the data. Methods: We used common variations of plan headings and rule-based heuristics to find plan sections with headings in clinical notes, and we extracted sentences from them and formed a noisy training data of plan sentences. We trained Support Vector Machine (SVM) and Convolutional Neural Network (CNN) models with the data. We measured accuracy of the trained models on the noisy dataset using ten-fold cross validation and separately on a set-aside manually annotated dataset. Results: About 13% of 117,730 clinical notes contained treatment plans sections with recognizable headings in the 1001 longitudinal patient records that were obtained from Cleveland Clinic under an IRB approval. We were able to extract and create a noisy training data of 13,492 plan sentences from the clinical notes. CNN achieved best F measures, 0.91 and 0.97 in the cross-validation and set-aside evaluation experiments respectively. SVM slightly underperformed with F measures of 0.89 and 0.96 in the same experiments. Conclusion: Our study showed that the training supervised learning models using noisy plan sentences was effective in identifying them in all clinical notes. More broadly, sections with informal headings in clinical notes can be a good source for generating effective training data. △ Less

Submitted 27 June, 2019; originally announced June 2019.

Comments: 15 pages, 4 Figures, and 4 Tables

arXiv:1902.09674 [pdf]

Develo** and Using Special-Purpose Lexicons for Cohort Selection from Clinical Notes

Authors: Samarth Rawal, Ashok Prakash, Soumya Adhya, Sidharth Kulkarni, Saadat Anwar, Chitta Baral, Murthy Devarakonda

Abstract: Background and Significance: Selecting cohorts for a clinical trial typically requires costly and time-consuming manual chart reviews resulting in poor participation. To help automate the process, National NLP Clinical Challenges (N2C2) conducted a shared challenge by defining 13 criteria for clinical trial cohort selection and by providing training and test datasets. This research was motivated b… ▽ More Background and Significance: Selecting cohorts for a clinical trial typically requires costly and time-consuming manual chart reviews resulting in poor participation. To help automate the process, National NLP Clinical Challenges (N2C2) conducted a shared challenge by defining 13 criteria for clinical trial cohort selection and by providing training and test datasets. This research was motivated by the N2C2 challenge. Methods: We broke down the task into 13 independent subtasks corresponding to each criterion and implemented subtasks using rules or a supervised machine learning model. Each task critically depended on knowledge resources in the form of task-specific lexicons, for which we developed a novel model-driven approach. The approach allowed us to first expand the lexicon from a seed set and then remove noise from the list, thus improving the accuracy. Results: Our system achieved an overall F measure of 0.9003 at the challenge, and was statistically tied for the first place out of 45 participants. The model-driven lexicon development and further debugging the rules/code on the training set improved overall F measure to 0.9140, overtaking the best numerical result at the challenge. Discussion: Cohort selection, like phenotype extraction and classification, is amenable to rule-based or simple machine learning methods, however, the lexicons involved, such as medication names or medical terms referring to a medical problem, critically determine the overall accuracy. Automated lexicon development has the potential for scalability and accuracy. △ Less

Submitted 25 February, 2019; originally announced February 2019.

Comments: 13 pages, paper describing the NLP system built for N2C2 Task 1 2018 shared challenge in biomedical NLP

arXiv:1805.06816 [pdf]

Annotating Electronic Medical Records for Question Answering

Authors: Preethi Raghavan, Siddharth Patwardhan, Jennifer J. Liang, Murthy V. Devarakonda

Abstract: Our research is in the relatively unexplored area of question answering technologies for patient-specific questions over their electronic health records. A large dataset of human expert curated question and answer pairs is an important pre-requisite for develo**, training and evaluating any question answering system that is powered by machine learning. In this paper, we describe a process for cr… ▽ More Our research is in the relatively unexplored area of question answering technologies for patient-specific questions over their electronic health records. A large dataset of human expert curated question and answer pairs is an important pre-requisite for develo**, training and evaluating any question answering system that is powered by machine learning. In this paper, we describe a process for creating such a dataset of questions and answers. Our methodology is replicable, can be conducted by medical students as annotators, and results in high inter-annotator agreement (0.71 Cohen's kappa). Over the course of 11 months, 11 medical students followed our annotation methodology, resulting in a question answering dataset of 5696 questions over 71 patient records, of which 1747 questions have corresponding answers generated by the medical students. △ Less

Submitted 17 May, 2018; originally announced May 2018.

Comments: 10 pages, 2016

Showing 1–12 of 12 results for author: Devarakonda, M