Search | arXiv e-print repository

arXiv:2406.18792 [pdf, other]

A data-driven assessment of biomedical terminology evolution using information theoretical and network analysis approaches

Authors: Jenny Copara, Nona Naderi, Gilles Falquet, Douglas Teodoro

Abstract: The Medical Subject Headings (MeSH), one of the main knowledge organization systems in the biomedical domain, is constantly evolving following the latest scientific discoveries in health and life sciences. Previous research focused on quantifying information in MeSH using its hierarchical structure. In this work, we propose a data-driven approach based on information theory and network analyses to quantify the knowledge evolution in MeSH and the relevance of its individual concepts. Our approach leverages article annotations and their citation networks to compute the level of informativeness, usefulness, disruptiveness, and influence of MeSH concepts over time. The citation network includes the instances of MeSH concepts or MeSH headings, and the concept relevance is calculated individually. Then, this computation is propagated to the hierarchy to establish the relevance of a concept. We quantitatively evaluated our approach using changes in the MeSH terminology and showed that it effectively captures the evolution of the terminology. Moreover, we validated the ability of our framework to characterize retracted articles and showed that concepts used to annotate retracted articles differ substantially from those used to annotate non-retracted articles. The proposed framework provides an effective method to rank concept relevance and can be useful in maintaining evolving knowledge organization systems.

Submitted 26 June, 2024; originally announced June 2024.

Comments: 19 pages, 7 figures, 4 tables

arXiv:2404.12827 [pdf]

CT-ADE: An Evaluation Benchmark for Adverse Drug Event Prediction from Clinical Trial Results

Authors: Anthony Yazdani, Alban Bornet, Boya Zhang, Philipp Khlebnikov, Poorya Amini, Douglas Teodoro

Abstract: Adverse drug events (ADEs) significantly impact clinical research and public health, contributing to failures in clinical trials and leading to increased healthcare costs. The accurate prediction and management of ADEs are crucial for improving the development of safer, more effective medications, and enhancing patient outcomes. To support this effort, we introduce CT-ADE, a novel dataset compiled to enhance the predictive modeling of ADEs. Encompassing over 12,000 instances extracted from clinical trial results, the CT-ADE dataset integrates drug, patient population, and contextual information for multilabel ADE classification tasks in monopharmacy treatments, providing a comprehensive resource for developing advanced predictive models. To mirror the complex nature of ADEs, annotations are standardized at the system organ class level of the Medical Dictionary for Regulatory Activities (MedDRA) ontology. Preliminary analyses using baseline models have demonstrated promising results, achieving 73.33% F1 score and 81.54% balanced accuracy, highlighting CT-ADE's potential to advance ADE prediction. CT-ADE provides an essential tool for researchers aiming to leverage the power of artificial intelligence and machine learning to enhance patient safety and minimize the impact of ADEs on pharmaceutical research and development. Researchers interested in using the CT-ADE dataset can find all necessary resources at https://github.com/xxxx/xxxx.

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2308.12877 [pdf]

DS4DH at #SMM4H 2023: Zero-Shot Adverse Drug Events Normalization using Sentence Transformers and Reciprocal-Rank Fusion

Authors: Anthony Yazdani, Hossein Rouhizadeh, David Vicente Alvarez, Douglas Teodoro

Abstract: This paper outlines the performance evaluation of a system for adverse drug event normalization, developed by the Data Science for Digital Health (DS4DH) group for the Social Media Mining for Health Applications (SMM4H) 2023 shared task 5. Shared task 5 targeted the normalization of adverse drug event mentions in Twitter to standard concepts of the Medical Dictionary for Regulatory Activities terminology. Our system hinges on a two-stage approach: BERT fine-tuning for entity recognition, followed by zero-shot normalization using sentence transformers and reciprocal-rank fusion. The approach yielded a precision of 44.9%, recall of 40.5%, and an F1-score of 42.6%. It outperformed the median performance in shared task 5 by 10% and demonstrated the highest performance among all participants. These results substantiate the effectiveness of our approach and its potential application for adverse drug event normalization in the realm of social media text mining.

Submitted 6 November, 2023; v1 submitted 15 August, 2023; originally announced August 2023.

Comments: Peer-reviewed and accepted for presentation at the #SMM4H 2023 Workshop

arXiv:2302.04185 [pdf, other]

Efficient Joint Learning for Clinical Named Entity Recognition and Relation Extraction Using Fourier Networks: A Use Case in Adverse Drug Events

Authors: Anthony Yazdani, Dimitrios Proios, Hossein Rouhizadeh, Douglas Teodoro

Abstract: Current approaches for clinical information extraction are inefficient in terms of computational costs and memory consumption, hindering their application to process large-scale electronic health records (EHRs). We propose an efficient end-to-end model, the Joint-NER-RE-Fourier (JNRF), to jointly learn the tasks of named entity recognition and relation extraction for documents of variable length. The architecture uses positional encoding and unitary batch sizes to process variable length documents and uses a weight-shared Fourier network layer for low-complexity token mixing. Finally, we reach the theoretical computational complexity lower bound for relation extraction using a selective pooling strategy and distance-aware attention weights with trainable polynomial distance functions. We evaluated the JNRF architecture using the 2018 N2C2 ADE benchmark to jointly extract medication-related entities and relations in variable-length EHR summaries. JNRF outperforms rolling window BERT with selective pooling by 0.42%, while being twice as fast to train. Compared to state-of-the-art BiLSTM-CRF architectures on the N2C2 ADE benchmark, results show that the proposed approach trains 22 times faster and reduces GPU memory consumption by 1.75 folds, with a reasonable performance tradeoff of 90%, without the use of external tools, hand-crafted rules or post-processing. Given the significant carbon footprint of deep learning models and the current energy crises, these methods could support efficient and cleaner information extraction in EHRs and other types of large-scale document databases.

Submitted 8 February, 2023; originally announced February 2023.

Comments: International Conference on Natural Language Processing (ICON 2022)

arXiv:2202.06771 [pdf, other]

DS4DH at TREC Health Misinformation 2021: Multi-Dimensional Ranking Models with Transfer Learning and Rank Fusion

Authors: Boya Zhang, Nona Naderi, Fernando Jaume-Santero, Douglas Teodoro

Abstract: This paper describes the work of the Data Science for Digital Health (DS4DH) group at the TREC Health Misinformation Track 2021. The TREC Health Misinformation track focused on the development of retrieval methods that provide relevant, correct and credible information for health related searches on the Web. In our methodology, we used a two-step ranking approach that includes i) a standard retrieval phase, based on BM25 model, and ii) a re-ranking phase, with a pipeline of models focused on the usefulness, supportiveness and credibility dimensions of the retrieved documents. To estimate the usefulness, we classified the initial rank list using pre-trained language models based on the transformers architecture fine-tuned on the MS MARCO corpus. To assess the supportiveness, we utilized BERT-based models fine-tuned on scientific and Wikipedia corpora. Finally, to evaluate the credibility of the documents, we employed a random forest model trained on the Microsoft Credibility dataset combined with a list of credible sites. The resulting ranked lists were then combined using the Reciprocal Rank Fusion algorithm to obtain the final list of useful, supporting and credible documents. Our approach achieved competitive results, being top-2 in the compatibility measurement for the automatic runs. Our findings suggest that integrating automatic ranking models created for each information quality dimension with transfer learning can increase the effectiveness of health-related information retrieval.

Submitted 14 February, 2022; originally announced February 2022.
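
The rank-fusion step named in the abstract above can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the authors' implementation: the function name, the document IDs, the three example lists, and the smoothing constant k=60 (a commonly used default) are all assumptions made for illustration. Each document scores the sum of 1/(k + rank) over the lists in which it appears.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered from best to worst.
    k: smoothing constant (60 is an assumed, commonly used default).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document absent from a list simply contributes nothing for it.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical per-dimension rankings (IDs invented for the example).
usefulness = ["d1", "d2", "d3"]
supportiveness = ["d2", "d1", "d4"]
credibility = ["d2", "d3", "d1"]

fused = reciprocal_rank_fusion([usefulness, supportiveness, credibility])
print(fused)
```

Because only ranks enter the score, the fused lists do not need comparable or calibrated scores, which is one reason the method suits combining heterogeneous rankers.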

arXiv:2110.15710 [pdf, other]

Classification of hierarchical text using geometric deep learning: the case of clinical trials corpus

Authors: Sohrab Ferdowsi, Nikolay Borissov, Julien Knafou, Poorya Amini, Douglas Teodoro

Abstract: We consider the hierarchical representation of documents as graphs and use geometric deep learning to classify them into different categories. While graph neural networks can efficiently handle the variable structure of hierarchical documents using the permutation invariant message passing operations, we show that we can gain extra performance improvements using our proposed selective graph pooling operation that arises from the fact that some parts of the hierarchy are invariable across different documents. We applied our model to classify clinical trial (CT) protocols into completed and terminated categories. We use bag-of-words based, as well as pre-trained transformer-based embeddings to featurize the graph nodes, achieving f1-scores around 0.85 on a publicly available large scale CT registry of around 360K protocols. We further demonstrate how the selective pooling can add insights into the CT termination status prediction. We make the source code and dataset splits accessible.

Submitted 4 October, 2021; originally announced October 2021.

Comments: Accepted as a long paper in EMNLP 2021 - Oral presentation to the Machine Learning track

arXiv:2007.12569 [pdf, other]

Named entity recognition in chemical patents using ensemble of contextual language models

Authors: Jenny Copara, Nona Naderi, Julien Knafou, Patrick Ruch, Douglas Teodoro

Abstract: Chemical patent documents describe a broad range of applications holding key reaction and compound information, such as chemical structure, reaction formulas, and molecular properties. These informational entities should be first identified in text passages to be utilized in downstream tasks. Text mining provides means to extract relevant information from chemical patents through information extraction techniques. As part of the Information Extraction task of the Cheminformatics Elsevier Melbourne University challenge, in this work we study the effectiveness of contextualized language models to extract reaction information in chemical patents. We assess transformer architectures trained on generic and specialised corpora to propose a new ensemble model. Our best model, based on a majority ensemble approach, achieves an exact F1-score of 92.30% and a relaxed F1-score of 96.24%. The results show that an ensemble of contextualized language models can provide an effective method to extract information from chemical patents.

Submitted 17 September, 2020; v1 submitted 24 July, 2020; originally announced July 2020.

arXiv:1711.09731 [pdf]

Archetypes for Representing Data about the Brazilian Public Hospital Information System and Outpatient High Complexity Procedures System

Authors: Sergio Miranda Freire, Luciana Tricai Cavalini, Douglas Teodoro, Erik Sundvall

Abstract: The Brazilian Ministry of Health has selected the openEHR model as a standard for electronic health record systems. This paper presents a set of archetypes to represent the main data from the Brazilian Public Hospital Information System and the High Complexity Procedures Module of the Brazilian public Outpatient Health Information System. The archetypes from the public openEHR Clinical Knowledge Manager (CKM) were examined in order to select archetypes that could be used to represent the data of the above-mentioned systems. For several concepts, it was necessary to specialize the CKM archetypes, or design new ones. A total of 22 archetypes were used: 8 new, 5 specialized and 9 reused from CKM. This set of archetypes can be used not only for information exchange, but also for generating a big anonymized dataset for testing openEHR-based systems.

Submitted 14 November, 2017; originally announced November 2017.

arXiv:1711.09729 [pdf]

Design of an Integrated Analytics Platform for Healthcare Assessment Centered on the Episode of Care

Authors: Douglas Teodoro, Nils Rotgans, Lucas Oliveira, Lilian Correia

Abstract: Assessing care quality and performance is essential to improve healthcare processes and population health management. However, due to bad system design and lack of access to required data, this assessment is often delayed or not done at all. The goal of our research is to investigate an advanced analytics platform that enables healthcare quality and performance assessment. We used a user-centered design approach to identify the system requirements and have the concept of episode of care as the building block of information for a key performance indicator analytics system. We implemented architecture and interface prototypes, and performed a usability test with hospital users with managerial roles. The results show that by using user-centered design we created an analytical platform that provides a holistic and integrated view of the clinical, financial and operational aspects of the institution. Our encouraging results warrant further studies to understand other aspects of usability.

Submitted 14 November, 2017; originally announced November 2017.

arXiv:1709.03061 [pdf]

Improving average ranking precision in user searches for biomedical research datasets

Authors: Douglas Teodoro, Luc Mottin, Julien Gobeill, Arnaud Gaudinat, Thérèse Vachon, Patrick Ruch

Abstract: Availability of research datasets is keystone for health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorisation method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries. Our system provides competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP among the participants, being +22.3% higher than the median infAP of the participants' best submissions. Overall, it is ranked at top 2 if an aggregated metric using the best official measures per participant is considered. The query expansion method showed positive impact on the system's performance, increasing our baseline up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. Our similarity measure algorithm seems to be robust, in particular compared to the Divergence From Randomness framework, having smaller performance variations under different training conditions. Finally, the result categorization did not have significant impact on the system's performance. We believe that our solution could be used to enhance biomedical dataset management systems. In particular, the use of data driven query expansion methods could be an alternative to the complexity of biomedical terminologies.

Submitted 10 September, 2017; originally announced September 2017.

Showing 1–10 of 10 results for author: Teodoro, D