¹ Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
{mai.kassem, mohammad.yaqub}@mbzuai.ac.ae
² School of Computer Science, Carleton University, Ottawa, CA
[email protected]

MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis

Mai A. Shaaban¹, Adnan Khan², Mohammad Yaqub¹
Abstract

Chest X-ray images are commonly used for predicting acute and chronic cardiopulmonary conditions, but efforts to integrate them with structured clinical data face challenges due to incomplete electronic health records (EHR). This paper introduces MedPromptX, the first model to integrate multimodal large language models (MLLMs), few-shot prompting (FP) and visual grounding (VG) to combine imagery with EHR data for chest X-ray diagnosis. A pre-trained MLLM is utilized to complement the missing EHR information, providing a comprehensive understanding of patients’ medical history. Additionally, FP reduces the necessity for extensive training of MLLMs while effectively tackling the issue of hallucination. Nevertheless, the process of determining the optimal number of few-shot examples and selecting high-quality candidates can be burdensome, yet it profoundly influences model performance. Hence, we propose a new technique that dynamically refines few-shot data for real-time adjustment to new patient scenarios. Moreover, VG aids in focusing the model’s attention on relevant regions of interest in X-ray images, enhancing the identification of abnormalities. We release MedPromptX-VQA, a new in-context visual question answering dataset encompassing interleaved image and EHR data derived from MIMIC-IV and MIMIC-CXR databases. Results demonstrate the SOTA performance of MedPromptX, achieving an 11% improvement in F1-score compared to the baselines. Code and data are available at https://github.com/BioMedIA-MBZUAI/MedPromptX.

Keywords: Medical Diagnosis · Multimodal Large Language Models · Few-shot Learning · Visual Grounding · Visual Question Answering

1 Introduction

Emerging machine learning and deep learning advancements are assisting radiologists in detecting chest X-ray abnormalities, streamlining diagnostic processes [16, 9]. While traditional diagnosis based solely on imaging data can be effective, incorporating patients’ clinical history can significantly improve diagnostic outcomes, underscoring the importance of multimodal approaches [19, 22]. The integration of electronic health records (EHR) has been challenged by its inherent incompleteness [18]. Missing values and lack of normal ranges for laboratory tests complicate the interpretation of medical datasets like MIMIC-IV [8]. To this end, large language models (LLMs), as in [19], have shown promise in clinical prediction by fine-tuning with prompts leveraging structured EHR data. The emergence of generalist models like BiomedGPT [26] represents a major advancement in biomedical AI, handling various tasks across modalities and surpassing SOTA results. Additionally, visual grounding techniques, as explored in [6], further exemplify progress in medical imaging, particularly in automating associations between image features and descriptive reports in CT imaging.

Despite these advancements, there remains a gap in the integration of multimodal data for enhancing diagnostic accuracy in chest X-ray analysis. Li et al. [11] evaluate GPT-4V’s multimodal capabilities, indicating both the potential and the limitations of current models in medical imaging tasks. Training LLMs, or even fine-tuning them, can be computationally expensive [27]. Therefore, a crucial breakthrough lies in few-shot prompting [4], which enables rapid adaptation to new diagnostic tasks with minimal labeled data and without parameter updates. This empowers medical professionals to efficiently use accurate diagnostic solutions tailored to specific patient cases [15]. In addition, few-shot prompting addresses the challenge of hallucination in LLMs, guiding the output and ensuring the reliability of diagnostic results [24]. Nevertheless, the quality and the quantity of the few-shot data play a pivotal role in influencing performance [1, 23]. While these models have made strides in report generation and visual question answering (VQA), their precision in identifying specific medical conditions and integrating multimodal information effectively remains an area for improvement.

To this end, we introduce MedPromptX and a new multimodal in-context learning dataset. To the best of our knowledge, MedPromptX is the first model to integrate multimodal LLMs, few-shot prompting, and visual grounding for chest X-ray diagnosis. MedPromptX addresses the challenge of incomplete EHR by complementing missing information through a pre-trained multimodal LLM and focusing on relevant image regions through visual grounding. Additionally, we propose a dynamic proximity selection (DPS) technique that refines few-shot data in real-time. DPS involves analyzing a few examples of positively and negatively diagnosed patients. This technique allows the model to capture the nuanced relationships between patient history and patient outcomes, enhancing diagnostic accuracy while reducing the dependency on extensive labeled datasets, positioning our framework as a significant advancement in the field. Our main contributions are as follows:

  • Introducing MedPromptX, a novel diagnostic model for chest X-ray images that harnesses multimodal LLMs (MLLMs), few-shot prompting (FP) and visual grounding (VG), enabling more accurate prediction of abnormalities.

  • Mitigating the incompleteness in EHR data by transforming inputs into a textual form, adopting pre-trained MLLMs.

  • Extracting the logical patterns discerned from the few-shot data efficiently by implementing DPS, allowing for the capture of the underlying semantics.

  • Constructing MedPromptX-VQA, a new in-context learning dataset tailored for VQA with interleaved chest X-ray image and structured medical data.

2 Methodology

Figure 1: MedPromptX: each input sample consists of an image $I'$ and corresponding text $T'$ containing tabular features. (1) The VG model takes $I'$ and generates a grounded image $G'$ by prompting the desired output. (2) The grounded image embeddings $G$ and text embeddings $T$ of the candidates are processed by the DPS technique to calculate their relevancy scores to a query sample $q$. (3) MLLM ingests a few-shot prompt and predicts whether a patient is likely to have a targeted disease.

2.1 MedPromptX for Diagnosis

The workflow of MedPromptX in Figure 1 can be conceptualized as a four-phase process. Let $\mathcal{C}=\{(I'_1,T'_1),\ldots,(I'_n,T'_n)\}$ denote a set of $n$ candidates and $q=(I'_q,T'_q)$ denote the query sample. $I'$ is a chest X-ray image and $T'$ is the corresponding text containing EHR data. First, the visual grounding (VG) module detects regions of interest (ROI) for each sample and generates a grounded image $G'$ by prompting the class. Then, image and text encoders generate image embeddings $G$ and text embeddings $T$, respectively. Next, the dynamic proximity selection (DPS) module refines the candidates, resulting in $\mathcal{E}$, where $\mathcal{E}\subseteq\mathcal{C}$. Finally, the multimodal large language model (MLLM) ingests the final prompt containing a reordered subset $\mathcal{E}$ to predict the abnormality in query patient $q$.

2.1.1 Visual Grounding (VG)

For the object detection task, conventional methods are often limited to recognizing a predefined set of object classes [10, 12]. Integrating new classes into these models necessitates an exhaustive process of data collection, annotation and model retraining. We use Grounding DINO (GDINO) [12] (the VG component in Figure 1) to address this challenge by detecting arbitrary objects delineated through human language inputs, a concept commonly referred to as zero-shot detection.

GDINO uses DINO [25], a SOTA transformer-based object detection algorithm, with GLIP [10] pre-training that focuses on grounding textual descriptions to visual elements in a given image. GDINO is a two-stream framework where multi-scale image and text features are extracted separately using backbone architectures such as Swin Transformer [13] and BERT [5], respectively. These features are then transformed into a unified representation space through multiple layers of feature enhancers, incorporating deformable self-attention for image features and regular self-attention for text features.

To detect visual evidence (i.e., the grounded image), denoted as $G'$, we pass a textual input $e$ naming a pathological condition (e.g., Pneumonia) along with an X-ray image $I'$ to the VG model. The model assigns scores to particular regions based on their prominence in the image: $VG(I',e)=\{p(G'_1),\dots,p(G'_k)\}$, where $k$ is the total number of detected regions and $p$ is the score. We then consider the $G'$ with the highest score for the subsequent phases of our model.
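The selection of the highest-scoring region can be illustrated with a minimal sketch. It assumes the detector has already returned a list of boxes with scores; the `Region` type, the `crop` helper, the toy image and the fallback to the original image when nothing is detected are all hypothetical. Whether the grounded image is a crop or an annotated overlay is not specified above, so cropping is an assumption made only for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


class Region(NamedTuple):
    """One detected region G'_i together with its score p(G'_i)."""
    box: Box
    score: float


def select_top_region(regions: List[Region]) -> Optional[Region]:
    """Return the region with the highest score, or None if nothing was detected."""
    if not regions:
        return None
    return max(regions, key=lambda r: r.score)


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Crop a row-major grayscale image to the given box (assumed grounding step)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


# Hypothetical detector output for one X-ray and the text prompt "Pneumonia".
detections = [
    Region(box=(0, 0, 2, 2), score=0.31),
    Region(box=(1, 1, 4, 3), score=0.62),
    Region(box=(2, 0, 4, 2), score=0.45),
]
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"

best = select_top_region(detections)
# Assumption: fall back to the original image when no region is detected.
grounded = crop(image, best.box) if best is not None else image
```

Only the arg-max step mirrors the text; everything around it is scaffolding for the example.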

2.1.2 Dynamic Proximity Selection (DPS)

The performance of FP is highly sensitive to the design of the prompt. This includes the choice of examples, their order, and how well they align with the desired task. Misleading, ambiguous, or poorly chosen examples can lead to suboptimal or entirely incorrect outputs [23, 1]. The DPS method leverages a proximity function $d$, such as cosine similarity, to order candidate instances $\mathcal{C}=\{(G'_1,T'_1),\ldots,(G'_n,T'_n)\}$ based on their proximity to a query instance $q$. Applying a similarity threshold dynamically filters out noisy candidates, enhancing the robustness and adaptability of the FP technique. Thus, the number of candidate samples can be reduced from $n$ to $n-1$, $n-2$, …, $1$. Mathematically, the approach can be represented as:

$$\text{DPS}(\mathcal{C},q)=\left\{\frac{d(G_c,G_q)+d(T_c,T_q)}{2}\geq th\right\}_{c\in\mathcal{C}} \tag{1}$$

The result is a refined subset $\mathcal{E}$ in which each candidate has a similarity score greater than or equal to a threshold $th$. In this method, an instance $c\in\mathcal{C}$ is decomposed into grounded image embeddings $G_c$ and text embeddings $T_c$ containing the laboratory test results of a patient. After computing the similarity scores for text and images separately, the final score is obtained by averaging the scores from both modalities. Motivated by [23, 1], DPS positions the most closely related candidate directly before the query instance, rather than allocating it at a greater distance. This order enhances the precision of the FP process.
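A minimal sketch of the filtering and ordering described by Eq. (1) follows, assuming embeddings are plain lists of floats. The function names and the toy vectors are hypothetical. Sorting the surviving candidates in ascending order of similarity is one way to place the closest candidate directly before the query; the ordering of the remaining candidates is an assumption.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dps(
    candidates: List[Tuple[Vector, Vector]],
    query: Tuple[Vector, Vector],
    th: float = 0.7,
) -> List[Tuple[int, float]]:
    """Keep candidates whose averaged image/text similarity is >= th.

    Returns (candidate index, score) pairs sorted in ascending order of score,
    so the most similar candidate comes last, i.e. right before the query.
    """
    g_q, t_q = query
    kept = []
    for idx, (g_c, t_c) in enumerate(candidates):
        score = (cosine_similarity(g_c, g_q) + cosine_similarity(t_c, t_q)) / 2
        if score >= th:
            kept.append((idx, score))
    kept.sort(key=lambda item: item[1])
    return kept


# Toy 2-D embeddings (G_c, T_c); real embeddings would come from the encoders.
query = ([1.0, 0.0], [0.0, 1.0])
candidates = [
    ([1.0, 0.0], [0.0, 1.0]),  # identical to the query
    ([0.0, 1.0], [1.0, 0.0]),  # orthogonal in both modalities, filtered out
    ([1.0, 1.0], [0.0, 1.0]),  # image partly similar, text identical
    ([1.0, 1.0], [1.0, 1.0]),  # partly similar in both modalities
]
selected = dps(candidates, query, th=0.7)
order = [idx for idx, _ in selected]
```

In this toy case the second candidate is dropped and the exact match ends up last, next to the query.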

2.1.3 Multimodal LLM (MLLM)

Incorporating descriptive information about clinical events can provide valuable context for understanding the reasoning behind model predictions, unlike classical machine learning algorithms, which treat input as numerical attributes without considering the semantic meaning. There are limited examples of open-source models that can ingest FP with interleaved modalities. One notable model is Med-Flamingo [15], which has undergone pre-training on a vast array of medical data. Therefore, Med-Flamingo, which is based on Flamingo [1], serves as the MLLM component in Figure 1. The Flamingo [1] framework can process inputs consisting of both textual and visual content and produce coherent textual output. Flamingo adopts a strategy of freezing the language model and vision encoder weights and establishing connections through learnable architectures. The key component is the perceiver resampler module, introduced in Flamingo to convert spatiotemporal features from the vision encoder into a fixed-size set of visual tokens, facilitating their integration into the language model’s processing pipeline. Additionally, cross-attention layers are inserted between pre-trained language model layers, enabling the model to incorporate visual cues for tasks such as next-token prediction. The pivotal aspect of Flamingo is that it predicts the likelihood of text sequences $y$ when conditioned on accompanying images $x$ as follows:

$$p(y|x)=\prod_{\ell=1}^{L}p(y_{\ell}|y_{<\ell},x_{\leq\ell}). \tag{2}$$

The notation $y_{\ell}$ represents the $\ell$-th token in the sequence of $L$ language tokens constituting our input text, while $y_{<\ell}$ denotes all preceding language tokens, and $x_{\leq\ell}$ symbolizes the corresponding sequence of images.
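As a small worked example of the factorization in Eq. (2), the sketch below accumulates per-token conditional probabilities in log space, which is the usual way to avoid numerical underflow for long sequences. The probabilities are made-up numbers, and the function name is hypothetical; no model is called.

```python
import math
from typing import Sequence


def sequence_log_likelihood(token_probs: Sequence[float]) -> float:
    """Sum of log p(y_l | y_<l, x_<=l) over the L tokens, i.e. log of Eq. (2)."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical conditional probabilities for a three-token output sequence.
probs = [0.9, 0.8, 0.5]
log_p = sequence_log_likelihood(probs)
p = math.exp(log_p)  # same value as the product 0.9 * 0.8 * 0.5
```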

2.2 MedPromptX-VQA Dataset

Figure 2: The “Positive” and “Negative” representations for 12 pathological conditions.

Our methodology involves constructing the MedPromptX-VQA dataset derived from a unified multimodal dataset, denoted as HAIM-MIMIC-MM [20]. This dataset is a fusion of information sourced from MIMIC-IV [8] and MIMIC-CXR [7] databases, meticulously curated to focus solely on patients with at least one chest X-ray procedure. The resultant HAIM-MIMIC-MM dataset encapsulates records from 7,279 hospitalization stays, involving 6,485 distinct patients, thereby establishing a multimodal link encompassing tabular, textual and visual representations of patient health data.

Within our dataset, patients are labeled with 12 pathological conditions: Atelectasis, Cardiomegaly, Consolidation, Edema, Enlarged Cardiomediastinum, Fracture, Lung Lesion, Lung Opacity, Pleural Effusion, Pleural Other, Pneumonia and Pneumothorax. To alleviate the challenges of limited context length and hallucination in LLMs [24], we transformed these labels into a binary single-label classification framework to ensure that the input fits the context length and to acquire a controlled output, rather than acquiring an open set of possible diagnoses. For each label, if a patient exhibits the condition, the corresponding label is assigned the value 1; otherwise, it is given the value 0. For this study, we specifically selected patients diagnosed with the aforementioned conditions, resulting in 968 records split into 501 positive and 467 negative samples. Figure 2 shows the representations of the labels in the final dataset.

The creation of MedPromptX-VQA involves three steps: (1) extraction of laboratory test results from the chartevents table within the MIMIC-IV dataset [8], resulting in 357 features in total, (2) feature engineering, which identifies the features most strongly correlated with the label using the Pearson correlation coefficient, and (3) transformation of clinical charts into textual representations using comma-separated values (i.e., serialization). Finally, the dataset is structured to support the in-context learning task, where each record has interleaved image and text, encompassing both positive and negative samples of patients. The motivation for feature selection is to maintain input consistency between the few-shot data and the query sample. This means that the features present in the query sample should already be represented by the candidates, while also adhering to the context length. Hence, we set a maximum of 10 features per label. The selected features are presented in Table 0.A.3 in the Appendix.
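Steps (2) and (3) can be sketched as follows, under several assumptions that are not stated above: features are ranked by the absolute value of the Pearson coefficient, missing values are handled by pairwise deletion, and each serialized item follows the "value unit name" pattern seen in the prompt example "0.52 sec QTc". The function names, the toy table and the units mapping are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def top_features(
    table: Dict[str, List[Optional[float]]], labels: List[int], k: int = 10
) -> List[str]:
    """Rank features by |r| with the binary label and keep at most k of them."""
    ranked = []
    for name, values in table.items():
        # Assumption: ignore patients for whom this feature is missing.
        pairs = [(v, y) for v, y in zip(values, labels) if v is not None]
        if len(pairs) < 2:
            continue
        r = pearson([v for v, _ in pairs], [y for _, y in pairs])
        ranked.append((abs(r), name))
    ranked.sort(reverse=True)
    return [name for _, name in ranked[:k]]


def serialize(
    record: Dict[str, Optional[float]], features: List[str], units: Dict[str, str]
) -> str:
    """Turn the available selected features into comma-separated text."""
    items = []
    for name in features:
        value = record.get(name)
        if value is None:
            continue  # missing measurements are simply left out of the text
        items.append(" ".join(p for p in (str(value), units.get(name, ""), name) if p))
    return ", ".join(items)


# Toy data: four patients, binary label for one condition.
table = {
    "QTc": [0.52, 0.40, 0.50, 0.41],
    "Cholesterol": [180.0, None, 200.0, 190.0],
    "Constant": [1.0, 1.0, 1.0, 1.0],
}
labels = [1, 0, 1, 0]
selected = top_features(table, labels, k=1)
text = serialize({"QTc": 0.52, "Cholesterol": None}, ["QTc", "Cholesterol"], {"QTc": "sec"})
```

Capping `k` keeps the serialized text short, which matches the stated motivation of fitting the context length.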

3 Experimental Setup

We employed a randomized order strategy for the input sequences across several SOTA models, namely Med-Flamingo [15], OpenFlamingo [2], BioMedLM [3] and Clinical-T5-Large [14]. Moreover, the number of few-shot samples was fixed at 6 across all models, and the samples were chosen randomly. For MedPromptX, the number of few-shot candidates is dynamically reduced by the DPS technique, resulting in a unique configuration for each query instance. A candidate is excluded when its cosine similarity to the query falls below a threshold of 0.7. Furthermore, all experiments were conducted on an NVIDIA A100-SXM GPU with 40GB of dedicated memory. For MedPromptX, the frozen language model is LLaMA-7B [21], while the frozen visual encoder is CLIP ViT-L-14 [17]. Table 0.A.1 in the Appendix describes the models used in detail.

The prompt design for each model differs based on its capability. Accordingly, Med-Flamingo and OpenFlamingo ingest interleaved image and text, excluding EHR data, whereas BioMedLM and Clinical-T5 use text, including EHR data. MedPromptX stands out as the sole model that processes interleaved grounded or original image and EHR text prompt. Below are examples for each type:

  • Image, Text: “<image>Question: Is the patient likely to have Cardiomegaly?”

  • EHR Text: “Question: Is the patient likely to have Cardiomegaly, given the following laboratory test results: 0.52 sec QTc?”

  • Image, EHR Text: “<image>Question: Is the patient likely to have Cardiomegaly, given the following laboratory test results: 0.52 sec QTc?”
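The prompt types above can be assembled into a few-shot prompt along the following lines. This is only a sketch: the question templates are taken from the examples above, but the "Answer: Yes/No" suffix, the `<|endofchunk|>` separator between shots (a convention used by Flamingo-style open-source models) and the function names are assumptions rather than the exact format used in the experiments.

```python
from typing import List, Optional, Tuple


def format_question(
    condition: str, ehr_text: Optional[str] = None, with_image: bool = True
) -> str:
    """Build one question in the style of the three prompt types."""
    prefix = "<image>" if with_image else ""
    if ehr_text:
        return (
            f"{prefix}Question: Is the patient likely to have {condition}, "
            f"given the following laboratory test results: {ehr_text}?"
        )
    return f"{prefix}Question: Is the patient likely to have {condition}?"


def build_few_shot_prompt(
    condition: str,
    shots: List[Tuple[Optional[str], bool]],
    query_ehr: Optional[str],
    sep: str = "<|endofchunk|>",  # assumed separator between examples
) -> str:
    """Concatenate labeled examples, then the unanswered query question."""
    blocks = []
    for ehr_text, is_positive in shots:
        answer = "Yes" if is_positive else "No"  # assumed answer vocabulary
        blocks.append(f"{format_question(condition, ehr_text)} Answer: {answer}{sep}")
    blocks.append(f"{format_question(condition, query_ehr)} Answer:")
    return "".join(blocks)


# Two hypothetical candidates (serialized EHR text, label) and one query.
shots = [("0.41 sec QTc", False), ("0.50 sec QTc", True)]
prompt = build_few_shot_prompt("Cardiomegaly", shots, query_ehr="0.52 sec QTc")
single = format_question("Cardiomegaly", "0.52 sec QTc")
```

Each `<image>` placeholder is expected to be paired, in order, with one image (grounded or original) supplied to the model alongside the text.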

4 Results and Discussion

The results in Table 1 emphasize the complex nature of medical diagnosis, wherein multiple data modalities can provide complementary information leading to better model performance. The combination of imaging data with clinical text via MedPromptX seems significant in providing the model with a richer context, leading to more informed predictions. However, initial attempts yielded lower results, emphasizing the challenges in effectively integrating diverse data sources. With the implementation of DPS and VG, subsequent improvements were observed, suggesting that these strategies are crucial in overcoming the obstacles encountered when processing complex prompts.

Table 1: Performance of MedPromptX against SOTA baselines. Without DPS, candidate prompts in the 6-shot setting are randomly ordered. In contrast, with DPS, the ordering is determined by cosine similarity scores between the embeddings of each candidate and the test prompt, potentially reducing the number of candidates per record given a threshold. When VG is activated, the model processes images with contextual grounding. Conversely, when VG is deactivated, the model ingests original images.
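| Model | DPS Setting | VG Setting | Precision | Recall | F1 Score | Accuracy |
|---|---|---|---|---|---|---|
| BioMedLM | Disabled | N/A | 0.665 | 0.210 | 0.484 | 0.536 |
| Clinical-T5-Large | Disabled | N/A | 0.707 | 0.371 | 0.576 | 0.595 |
| Med-Flamingo | Disabled | Disabled | 0.545 | 0.220 | 0.461 | 0.501 |
| OpenFlamingo | Disabled | Disabled | 0.523 | 0.291 | 0.476 | 0.496 |
| MedPromptX (ours) | Disabled | Disabled | 0.520 | 0.381 | 0.493 | 0.498 |
| MedPromptX (ours) | Disabled | Enabled | 0.511 | 0.379 | 0.486 | 0.491 |
| MedPromptX (ours) | Enabled | Disabled | 0.708 | 0.581 | 0.658 | 0.659 |
| MedPromptX (ours) | Enabled | Enabled | 0.773 | 0.565 | 0.686 | 0.689 |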

DPS enhances the model’s ability to learn from limited data by reducing the number of ambiguous examples to 4 on average, contributing to better understanding. On the other hand, a random configuration of FP may introduce unintended biases or result in irrelevant guidance for the model. Moreover, the activation of VG empowers the model to focus its attention on pertinent regions within an image by generating output embeddings that encode semantic information instead of dealing with raw pixel data.

The performance gap observed when using VG solely may be attributed to training the VG model on general domain data rather than on chest X-ray images, particularly in handling the complexity inherent in cases where abnormalities are present in small regions. Providing additional context could bridge the gap, which was achieved by refining EHR data with DPS alongside VG.

4.1 Ablations

Initializing DPS with an increased number of shots provides the model with a broader range of context and examples to learn from, enabling it to generalize more effectively, as shown in Table 2. However, the 6-shot setting strikes a balance between performance and ensuring the inclusion of all classes. In contrast, using a higher number of examples would necessitate dropping classes with insufficient positive or negative examples. Moreover, zero-shot assessment was unattainable due to the hallucination of the models, giving entirely incorrect output for some patient cases. This underscores the necessity for employing FP.

Table 2: Comparing model performance using different number of instances for DPS initialization. The threshold is set at 0.7, and VG is enabled.
| Prompt Setting | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|
| 4-shot | 0.640 | 0.565 | 0.609 | 0.609 |
| 6-shot | 0.773 | 0.565 | 0.686 | 0.689 |
| 8-shot | 0.789 | 0.556 | 0.689 | 0.693 |
| 10-shot | 0.732 | 0.541 | 0.690 | 0.705 |
| 12-shot | 0.735 | 0.654 | 0.733 | 0.740 |

Adjusting the threshold for DPS can significantly affect performance; an extremely high threshold restricts the model from including meaningful examples, while an extremely low threshold retains nearly the same examples (Figure 0.A.1 in Appendix). Moreover, the utilization of multimodal similarity (Table 0.A.2 in Appendix) enhances the instance selection process by capturing a more comprehensive representation of the data compared to single-modality approaches.

5 Conclusion

This paper introduced MedPromptX, a novel model that integrates clinical history with imaging data for accurate chest X-ray diagnosis. MedPromptX addressed the challenges associated with medical data incompleteness, adaptability to new patient cases with limited labeled data and abnormality detection in X-ray images. Nevertheless, further improvements could be obtained by using fine-tuned backbones, which is beyond the scope of this study. Future work can include the accessibility of diverse and well-annotated datasets. Additionally, rigorous clinical trials and real-world deployment are necessary to validate the framework’s real-world effectiveness and clinical utility.

References

  • [1] Alayrac, J.B., Donahue, J., Luc, P., et al.: Flamingo: a Visual Language Model for Few-Shot Learning. Advances in Neural Information Processing Systems 35 (apr 2022), https://arxiv.org/abs/2204.14198v2
  • [2] Awadalla, A., Gao, I., Gardner, J., et al.: Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390 (2023)
  • [3] Bolton, E., Hall, D., Yasunaga, M., et al.: Stanford crfm: Biomedlm (2022), https://crfm.stanford.edu/2022/12/15/biomedlm.html
  • [4] Brown, T.B., Mann, B., Ryder, N., et al.: Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 2020-Decem (may 2020), https://arxiv.org/abs/2005.14165v4
  • [5] Devlin, J., Chang, M.W., Lee, K., et al.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • [6] Ichinose, A., Hatsutani, T., Nakamura, K., et al.: Visual Grounding of Whole Radiology Reports for 3D CT Images, p. 611–621. Springer Nature Switzerland (2023). https://doi.org/10.1007/978-3-031-43904-9_59
  • [7] Johnson, A.E.W., Pollard, T.J., Greenbaum, N.R., et al.: MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs (jan 2019), https://arxiv.org/abs/1901.07042v5
  • [8] Johnson, A.E., Bulgarelli, L., Shen, L., et al.: MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data 10(1), 1–9 (jan 2023). https://doi.org/10.1038/s41597-022-01899-x
  • [9] van Leeuwen, K.G., de Rooij, M., Schalekamp, S., et al.: How does artificial intelligence in radiology improve efficiency and health outcomes? Pediatric Radiology pp. 1–7 (2021)
  • [10] Li, L.H., Zhang, P., Zhang, H., et al.: Grounded language-image pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10965–10975 (2022)
  • [11] Li, Y., Liu, Y., Wang, Z., et al.: A comprehensive study of gpt-4v’s multimodal capabilities in medical imaging. medRxiv pp. 2023–11 (2023)
  • [12] Liu, S., Zeng, Z., Ren, T., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023)
  • [13] Liu, Z., Lin, Y., Cao, Y., et al.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)
  • [14] Lu, Q., Dou, D., Nguyen, T.H.: ClinicalT5: A Generative Language Model for Clinical Text. Findings of the Association for Computational Linguistics: EMNLP 2022 pp. 5436–5443 (2022). https://doi.org/10.18653/V1/2022.FINDINGS-EMNLP.398
  • [15] Moor, M., Huang, Q., Wu, S., et al.: Med-Flamingo: a Multimodal Medical Few-shot Learner (jul 2023), https://arxiv.org/abs/2307.15189v1
  • [16] Najjar, R.: Redefining radiology: a review of artificial intelligence integration in medical imaging. Diagnostics 13(17), 2760 (2023)
  • [17] Radford, A., Kim, J.W., Hallacy, C., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [18] Shah, S.M., Khan, R.A.: Secondary use of electronic health record: Opportunities and challenges. IEEE access 8, 136947–136965 (2020)
  • [19] Shoham, O.B., Rappoport, N.: Cpllm: Clinical prediction with large language models (2023). https://doi.org/10.48550/ARXIV.2309.11295
  • [20] Soenksen, L.R., Ma, Y., Zeng, C., et al.: Integrated multimodal artificial intelligence framework for healthcare applications. npj Digital Medicine 5(1), 1–10 (sep 2022). https://doi.org/10.1038/s41746-022-00689-4
  • [21] Touvron, H., Lavril, T., Izacard, G., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  • [22] Tu, T., Azizi, S., Driess, D., et al.: Towards Generalist Biomedical AI (jul 2023), https://arxiv.org/abs/2307.14334v1
  • [23] Yang, Z., Gan, Z., Wang, J., et al.: An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
  • [24] Yin, S., Fu, C., Zhao, S., et al.: Woodpecker: Hallucination correction for multimodal large language models (2023). https://doi.org/10.48550/ARXIV.2310.16045
  • [25] Zhang, H., Li, F., Liu, S., et al.: Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022)
  • [26] Zhang, K., Yu, J., Adhikarla, E., et al.: Biomedgpt: A unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks (2024)
  • [27] Zhou, H., Liu, F., Gu, B., et al.: A Survey of Large Language Models in Medicine: Principles, Applications, and Challenges (nov 2023), https://arxiv.org/abs/2311.05112v2

Appendix 0.A Appendix

Table 0.A.1: Overview of large language models and visual-language models.
| Model | Pre-training Data | Visual Encoder | Language Model | Size |
|---|---|---|---|---|
| BioMedLM | The Pile | N/A | Standard GPT-2 | 2.7B |
| Clinical-T5-Large | MIMIC-III and MIMIC-IV | N/A | T5-Large | 0.8B |
| Med-Flamingo | MTB and PMC-OA | CLIP ViT-L-14 | LLaMA-7B | 8.3B |
| OpenFlamingo | LAION-2B and Multimodal C4 | CLIP ViT-L-14 | MPT-1B | 3.0B |
Figure 0.A.1: Comparison of MedPromptX under different DPS thresholds.
Table 0.A.2: Comparison of employing DPS with averaged similarity scores from two modalities versus employing similarity scores based on a single modality.
| DPS Modality | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|
| Text | 0.558 | 0.391 | 0.518 | 0.525 |
| Image | 0.748 | 0.463 | 0.632 | 0.642 |
| Multimodal | 0.773 | 0.565 | 0.686 | 0.689 |
Table 0.A.3: Summary of the top correlated features that contribute to each label’s prediction, providing a clear understanding of the significant variables driving our model’s performance.
| Label | No. Features | Top Features |
|---|---|---|
| Atelectasis | 10 | CO (Arterial), HDL, Cholesterol, ELWI (PiCCO), T Low (APRV), GEDI (PiCCO), LDL measured, T High (APRV), LDL calculated, Serum Osmolality |
| Cardiomegaly | 10 | BiPap bpm (S/T -Back up), LDL measured, ELWI (PiCCO), D-Dimer, Impaired Skin Length #5, Impaired Skin Width #5, Uric Acid, GEDI (PiCCO), QTc, Cholesterol |
| Consolidation | 9 | Manual Blood Pressure Diastolic Right, Manual Blood Pressure Systolic Right, ELWI (PiCCO), GEDI (PiCCO), CFI (PiCCO), Manual Blood Pressure Diastolic Left, Negative Insp. Force, Cholesterol, PCA basal rate (mL/hour) |
| Edema | 10 | SV (Arterial), CO (Arterial), ELWI (PiCCO), CFI (PiCCO), GEDI (PiCCO), Bladder Scan Estimate, SVV (Arterial), Gentamicin (Random), LDL measured, BiPap bpm (S/T -Back up) |
| Enlarged Cardiomediastinum | 5 | ELWI (PiCCO), GEDI (PiCCO), SVV (Arterial), D-Dimer, RCexp (Measured Time Constant) |
| Fracture | 10 | Absolute Count - Monos, CK-MB, Absolute Count - Neuts, Troponin-T, CO2 production, Differential-Bands, Vti High, Absolute Count - Lymphs, Chloride (whole blood), Total Bilirubin |
| Lung Lesion | 10 | Temporary Ventricular Sens Setting mV, Temporary Venticular Sens Threshold mV, PCV Level, Absolute Count - Neuts, GI #1 Tube Mark (CM), Temporary Pacemaker Rate, Glucose (whole blood), Ionized Calcium, Total Bilirubin, Absolute Count - Eos |
| Lung Opacity | 6 | Cardiac Output (thermodilution), Bladder Scan Estimate, Ammonia, Serum Osmolality, PBP (Prefilter) Replacement Rate, Current Goal |
| Pleural Effusion | 10 | SV (Arterial), CO (Arterial), ELWI (PiCCO), Permanent Pacemaker Rate, GEDI (PiCCO), Gentamicin (Random), SVV (Arterial), Arctic Sun/Alsius Temp #2 C, Feeding Weight, Arctic Sun/Alsius Temp #1 C |
| Pleural Other | 10 | PCV Level, Impaired Skin Length #2, Temporary Ventricular Sens Setting mV, Temporary Venticular Stim Threshold mA, Temperature Celsius, Temporary Venticular Sens Threshold mV, Troponin-T, Total Bilirubin, Mixed Venous O2% Sat, PeCO2 |
| Pneumonia | 9 | ELWI (PiCCO), Recruitment Duration, T Low (APRV), CO (PiCCO), HDL, Cholesterol, SV (Arterial), Impaired Skin Width #5, LDL measured |
| Pneumothorax | 10 | HDL, Impaired Skin Width #3, Cardiac Output (thermodilution), Temporary Venticular Stim Threshold mA, TCO2 (calc) Venous, Total Bilirubin, Tidal Volume (set), Venous CO2 Pressure, Differential-Monos, Absolute Count - Eos |