
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation

Authors: Zexue He, Yu Wang, An Yan, Yao Liu, Eric Y. Chang, Amilcare Gentili, Julian McAuley, Chun-Nan Hsu

Abstract: Curated datasets for healthcare are often limited due to the need of human annotations from experts. In this paper, we present MedEval, a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare. MedEval is comprehensive and consists of data from several healthcare systems and spans 35 human body regions from 8 examination modalities. With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering a granular potential usage of the data and supporting a wide range of tasks. Moreover, we systematically evaluated 10 generic and domain-specific language models under zero-shot and finetuning settings, from domain-adapted baselines in healthcare to general-purposed state-of-the-art large language models (e.g., ChatGPT). Our evaluations reveal varying effectiveness of the two categories of language models across different tasks, from which we notice the importance of instruction tuning for few-shot usage of large language models. Our investigation paves the way toward benchmarking language models for healthcare and provides valuable insights into the strengths and limitations of adopting large language models in medical domains, informing their practical applications and future advancements.

Submitted 14 November, 2023; v1 submitted 21 October, 2023; originally announced October 2023.

Comments: Accepted to EMNLP 2023. Camera-ready version: updated IRB, added more evaluation results on LLMs such as GPT4, LLaMa2, and LLaMa2-chat

arXiv:2310.03182

Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models

Authors: An Yan, Yu Wang, Yiwu Zhong, Zexue He, Petros Karypis, Zihan Wang, Chengyu Dong, Amilcare Gentili, Chun-Nan Hsu, Jingbo Shang, Julian McAuley

Abstract: Medical image classification is a critical problem for healthcare, with the potential to alleviate the workload of doctors and facilitate diagnoses of patients. However, two challenges arise when deploying deep learning models to real-world healthcare applications. First, neural models tend to learn spurious correlations instead of desired features, which could fall short when generalizing to new domains (e.g., patients with different ages). Second, these black-box models lack interpretability. When making diagnostic predictions, it is important to understand why a model makes a decision for trustworthy and safety considerations. In this paper, to address these two limitations, we propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts. Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model. We systematically evaluate our method on eight medical image classification datasets to verify its effectiveness. On challenging datasets with strong confounding factors, our method can mitigate spurious correlations thus substantially outperform standard visual encoders and other baselines. Finally, we show how classification with a small number of concepts brings a level of interpretability for understanding model decisions through case studies in real medical data.

Submitted 4 October, 2023; originally announced October 2023.

Comments: 18 pages, 12 figures

arXiv:2306.10723

Fine-tuning Large Enterprise Language Models via Ontological Reasoning

Authors: Teodoro Baldazzi, Luigi Bellomarini, Stefano Ceri, Andrea Colombo, Andrea Gentili, Emanuel Sallinger

Abstract: Large Language Models (LLMs) exploit fine-tuning as a technique to adapt to diverse goals, thanks to task-specific training data. Task specificity should go hand in hand with domain orientation, that is, the specialization of an LLM to accurately address the tasks of a given realm of interest. However, models are usually fine-tuned over publicly available data or, at most, over ground data from databases, ignoring business-level definitions and domain experience. On the other hand, Enterprise Knowledge Graphs (EKGs) are able to capture and augment such domain knowledge via ontological reasoning. With the goal of combining LLM flexibility with the domain orientation of EKGs, we propose a novel neurosymbolic architecture that leverages the power of ontological reasoning to build task- and domain-specific corpora for LLM fine-tuning.

Submitted 18 September, 2023; v1 submitted 19 June, 2023; originally announced June 2023.

Comments: Accepted at RuleML 2023

arXiv:2305.08300

"Nothing Abnormal": Disambiguating Medical Reports via Contrastive Knowledge Infusion

Authors: Zexue He, An Yan, Amilcare Gentili, Julian McAuley, Chun-Nan Hsu

Abstract: Sharing medical reports is essential for patient-centered care. A recent line of work has focused on automatically generating reports with NLP methods. However, different audiences have different purposes when writing/reading medical reports -- for example, healthcare professionals care more about pathology, whereas patients are more concerned with the diagnosis ("Is there any abnormality?"). The expectation gap results in a common situation where patients find their medical reports to be ambiguous and therefore unsure about the next steps. In this work, we explore the audience expectation gap in healthcare and summarize common ambiguities that lead patients to be confused about their diagnosis into three categories: medical jargon, contradictory findings, and misleading grammatical errors. Based on our analysis, we define a disambiguation rewriting task to regenerate an input to be unambiguous while preserving information about the original content. We further propose a rewriting algorithm based on contrastive pretraining and perturbation-based rewriting. In addition, we create two datasets, OpenI-Annotated based on chest reports and VA-Annotated based on general medical reports, with available binary labels for ambiguity and abnormality presence annotated by radiology specialists. Experimental results on these datasets show that our proposed algorithm effectively rewrites input sentences in a less ambiguous way with high content fidelity. Our code and annotated data are released to facilitate future research.

Submitted 14 May, 2023; originally announced May 2023.

Comments: Accepted to AAAI 2023. 13 pages including 4-page supplementary materials

arXiv:2109.12242

Weakly Supervised Contrastive Learning for Chest X-Ray Report Generation

Authors: An Yan, Zexue He, Xing Lu, Jiang Du, Eric Chang, Amilcare Gentili, Julian McAuley, Chun-Nan Hsu

Abstract: Radiology report generation aims at generating descriptive text from radiology images automatically, which may present an opportunity to improve radiology reporting and interpretation. A typical setting consists of training encoder-decoder models on image-report pairs with a cross entropy loss, which struggles to generate informative sentences for clinical diagnoses since normal findings dominate the datasets. To tackle this challenge and encourage more clinically-accurate text outputs, we propose a novel weakly supervised contrastive loss for medical report generation. Experimental results demonstrate that our method benefits from contrasting target reports with incorrect but semantically-close ones. It outperforms previous work on both clinical correctness and text generation metrics for two public benchmarks.

Submitted 24 September, 2021; originally announced September 2021.

Comments: Findings of EMNLP 2021

arXiv:2010.02467

Learning Visual-Semantic Embeddings for Reporting Abnormal Findings on Chest X-rays

Authors: Jianmo Ni, Chun-Nan Hsu, Amilcare Gentili, Julian McAuley

Abstract: Automatic medical image report generation has drawn growing attention due to its potential to alleviate radiologists' workload. Existing work on report generation often trains encoder-decoder networks to generate complete reports. However, such models are affected by data bias (e.g., label imbalance) and face common issues inherent in text generation models (e.g., repetition). In this work, we focus on reporting abnormal findings on radiology images; instead of training on complete radiology reports, we propose a method to identify abnormal findings from the reports in addition to grouping them with unsupervised clustering and minimal rules. We formulate the task as cross-modal retrieval and propose Conditional Visual-Semantic Embeddings to align images and fine-grained abnormal findings in a joint embedding space. We demonstrate that our method is able to retrieve abnormal findings and outperforms existing generation models on both clinical correctness and text generation metrics.

Submitted 6 October, 2020; originally announced October 2020.

Comments: 7 pages, 2 figures, to be published in Findings of EMNLP 2020

arXiv:2004.10119

COVID-19 and Company Knowledge Graphs: Assessing Golden Powers and Economic Impact of Selective Lockdown via AI Reasoning

Authors: Luigi Bellomarini, Marco Benedetti, Andrea Gentili, Rosario Laurendi, Davide Magnanimi, Antonio Muci, Emanuel Sallinger

Abstract: In the COVID-19 outbreak, governments have applied progressive restrictions to production activities, permitting only those that are considered strategic or that provide essential services. This is particularly apparent in countries that have been stricken hard by the virus, with Italy being a major example. Yet we know that companies are not just isolated entities: They organize themselves into intricate shareholding structures --- forming company networks --- distributing decision power and dividends in sophisticated schemes for various purposes. One tool from the Artificial Intelligence (AI) toolbox that is particularly effective to perform reasoning tasks on domains characterized by many entities highly interconnected with one another is Knowledge Graphs (KG). In this work, we present a visionary opinion and report on ongoing work about the application of Automated Reasoning and Knowledge Graph technology to address the impact of the COVID-19 outbreak on the network of Italian companies and support the application of legal instruments for the protection of strategic companies from takeovers.

Submitted 21 April, 2020; originally announced April 2020.

Showing 1–7 of 7 results for author: Gentili, A