
CLUE: A Clinical Language Understanding Evaluation for LLMs

Authors: Amin Dada, Marie Bauer, Amanda Butler Contreras, Osman Alperen Koraş, Constantin Marc Seibold, Kaleb E Smith, Jens Kleesiek

Abstract: Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, evaluation has primarily been limited to non-clinical tasks, which do not reflect the complexity of practical clinical applications. To fill this gap, we present the Clinical Language Understanding Evaluation (CLUE), a benchmark tailored to evaluate LLMs on clinical tasks. CLUE includes six tasks to test the practical applicability of LLMs in complex healthcare settings. Our evaluation includes a total of 25 LLMs. In contrast to previous evaluations, CLUE shows a decrease in performance for nine out of twelve biomedical models. Our benchmark represents a step towards a standardized approach to evaluating and developing LLMs in healthcare to align future model development with the real-world needs of clinical application. We open-source all evaluation scripts and datasets for future research at https://github.com/TIO-IKIM/CLUE.

Submitted 24 June, 2024; v1 submitted 5 April, 2024; originally announced April 2024.

arXiv:2403.02930

A Second Look on BASS -- Boosting Abstractive Summarization with Unified Semantic Graphs -- A Replication Study

Authors: Osman Alperen Koraş, Jörg Schlötterer, Christin Seifert

Abstract: We present a detailed replication study of the BASS framework, an abstractive summarization system based on the notion of Unified Semantic Graphs. Our investigation includes challenges in replicating key components and an ablation study to systematically isolate error sources rooted in replicating novel components. Our findings reveal discrepancies in performance compared to the original work. We highlight the significance of paying careful attention even to reasonably omitted details for replicating advanced frameworks like BASS, and emphasize key practices for writing replicable papers.

Submitted 25 March, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in Advances in Information Retrieval, 46th European Conference on Information Retrieval, ECIR 2024. 16 pages, 4 figures

arXiv:2310.16570

Give Me the Facts! A Survey on Factual Knowledge Probing in Pre-trained Language Models

Authors: Paul Youssef, Osman Alperen Koraş, Meijie Li, Jörg Schlötterer, Christin Seifert

Abstract: Pre-trained Language Models (PLMs) are trained on vast unlabeled data, rich in world knowledge. This fact has sparked the interest of the community in quantifying the amount of factual knowledge present in PLMs, as this explains their performance on downstream tasks, and potentially justifies their use as knowledge bases. In this work, we survey methods and datasets that are used to probe PLMs for factual knowledge. Our contributions are: (1) We propose a categorization scheme for factual probing methods that is based on how their inputs, outputs and the probed PLMs are adapted; (2) We provide an overview of the datasets used for factual probing; (3) We synthesize insights about knowledge retention and prompt optimization in PLMs, analyze obstacles to adopting PLMs as knowledge bases and outline directions for future work.

Submitted 4 December, 2023; v1 submitted 25 October, 2023; originally announced October 2023.

Comments: Accepted at EMNLP Findings 2023

Showing 1–3 of 3 results for author: Koraş, O A