SERPENT-VLM: Self-Refining Radiology Report Generation Using Vision Language Models

Manav Nitin Kapadnis*     Sohan Patnaik*     Abhilash Nandy     Sourjyadip Ray
Pawan Goyal     Debdoot Sheet
[email protected]     [email protected]
Indian Institute of Technology Kharagpur
India
* Equal contribution.
Abstract

Radiology Report Generation (R2Gen) demonstrates how Multi-modal Large Language Models (MLLMs) can automate the creation of accurate and coherent radiological reports. Existing methods often hallucinate details in text-based reports that don’t accurately reflect the image content. To mitigate this, we introduce a novel strategy, SERPENT-VLM (SElf Refining Radiology RePort GENeraTion using Vision Language Models), which improves the R2Gen task by integrating a self-refining mechanism into the MLLM framework. We employ a unique self-supervised loss that leverages similarity between pooled image representations and the contextual representations of the generated radiological text, alongside the standard Causal Language Modeling objective, to refine image-text representations. This allows the model to scrutinize and align the generated text through dynamic interaction between a given image and the generated text, therefore reducing hallucination and continuously enhancing nuanced report generation. SERPENT-VLM outperforms existing baselines such as LlaVA-Med, BiomedGPT, etc., achieving SoTA performance on the IU X-ray and Radiology Objects in COntext (ROCO) datasets, and also proves to be robust against noisy images. A qualitative case study emphasizes the significant advancements towards more sophisticated MLLM frameworks for R2Gen, opening paths for further research into self-supervised refinement in the medical imaging domain.



1 Introduction

Radiology Report Generation (R2Gen) serves as a crucial link between medical imaging and natural language processing, to automate the interpretation of radiological images into comprehensive text reports. This task requires models to learn long-range dependencies effectively while generating the report, a challenge that remains largely unmet in current systems. The primary goal of R2Gen is to generate accurate and comprehensive medical reports from radiological imagery, an essential step toward enhancing diagnostic accuracy and efficiency.

Figure 1: Generated report samples on the IU-Xray dataset. We qualitatively compare reports generated by the medical pre-trained LLMs LlaVA-Med and BiomedGPT with those generated by SERPENT-VLM. Hallucinated information in the reports is highlighted in yellow.

Prevailing methods Vinyals et al. (2015); Xu et al. (2015); Tang et al. (2023); You et al. (2016); Tang et al. (2021) in R2Gen often (1) rely on large datasets for pre-training to impart domain-specific knowledge, and (2) use compute-intensive encoder-decoder architectures for fine-tuning. These approaches have notable drawbacks, such as the omission of minor yet clinically significant details Wang et al. (2022b); You et al. (2021); Wang et al. (2021) and the persistent issue of hallucination, as seen in Fig. 1, where reports generated by LlaVA-Med and BiomedGPT wrongly include details not present in the images. Minimizing hallucinations in radiology report generation is crucial, since these inaccuracies can lead to misdiagnoses that directly affect patient treatment plans and outcomes. Moreover, reducing hallucinations improves the reliability and trustworthiness of automated reports, which is vital for maintaining clinical credibility and facilitating effective patient care. The limitations of existing approaches therefore underscore the need for a more refined approach to accurate medical diagnosis that addresses these critical gaps in R2Gen.

In this paper, we introduce a streamlined pipeline, SERPENT-VLM, which begins by passing a given X-ray image through a visual encoder and mapping it to a vector representation in a high-dimensional space. This process facilitates a nuanced understanding of the medical imagery. The encoded image, alongside a report generation prompt, is then passed as input to a Large Language Model (LLM) for text generation. We employ a cross-entropy loss for the causal language modeling objective and introduce a novel self-refining objective that leverages the pooled image representation and the generated report's contextual representation. This allows for tuning the network without compromising inference latency, while significantly improving performance as measured by metrics such as $Bleu$, $Rouge_L$, and $BertScore$.

The contributions of our work are summarized as follows:

  1. Our approach does not compromise on inference latency, adopting a refining strategy through a novel loss function used only for fine-tuning.

  2. The introduction of a self-refining loss ensures the generation of nuanced, hallucination-free radiology reports.

  3. Our system not only matches but surpasses the performance of leading generalistic pre-trained medical LLMs.

  4. Our approach demonstrates robustness against noisy image inputs, maintaining the generation of comprehensive reports.

This marks a substantial advancement in the field of R2Gen, setting new benchmarks for accuracy, efficiency, and robustness.

The remainder of the paper is organized as follows. Section 2 reviews the literature, focusing on current and past state-of-the-art (SoTA) methodologies in radiological report generation. Section 3 presents the proposed self-refining fine-tuning strategy. The datasets, baselines, experimental setups, and ablation studies are detailed in Section 4. Finally, we conclude with a summary of our findings in Section 5.

2 Related Work

Medical Report Generation (MRG): Medical Report Generation has been extensively studied through ML models. Jing et al. (2018) proposed a co-attention network that aligns visual and textual information to generate comprehensive radiology reports. Further enhancing the capabilities, a memory-driven transformer Chen et al. (2020) integrates memory modules for encoding and decoding processes, allowing for more sophisticated report generation Chen et al. (2020, 2021). Cross-modal learning Wang et al. (2022a) utilizes prototype matrices and contrastive losses to refine the learning of visual-textual correlations, complemented by a self-boosting framework to align image features with report text Wang et al. (2021). Liu et al. (2021) addressed the problem of mitigating inherent biases through a data-driven method, introducing a prior-posterior knowledge-based report generation. Nooralahzadeh et al. (2021) leveraged curriculum learning to extract global concepts to create a bridge between images and text. Task-specific architecture with sentence-level attention mechanism across visual features Yuan et al. (2019) allows the model to capture key medical concepts from images. A weakly supervised paradigm to amplify hard negative samples Yan et al. (2021) addresses the medical data scarcity challenge.

Large Language Models and Vision Language Models: Large Language Models (LLMs) such as GPT-4, Claude, and BARD show excellent zero-shot language understanding bro (2020); Li et al. (2021); Liu et al. (2021); Irvin et al. (2019), as well as image understanding and visual question answering Team et al. (2023) capabilities. Open-source LLMs, like LLaMA and BLOOM, and Multi-modal LLMs such as LlaVA Liu et al. (2024) and Open Flamingo Awadalla et al. (2023) have also democratized access to cutting-edge generative technology Ouyang et al. (2022); Pan et al. (2020). Furthermore, the domain-specific models LlaVA-Med Li et al. (2023) and BiomedGPT Zhang et al. (2024) have shown promising results in pathology and radiology-related tasks. However, knowledge grounding for medical reports Hyland et al. (2023), and thereby reducing the hallucination produced by these models, remains a challenge.

Source & Representation of Feedback: Iterative refinement in MRG has traditionally relied on human feedback to achieve high-quality outputs Tandon et al. (2022). Scalar reward functions and domain-specific feedback tools, such as compilers, were proposed as cost-effective alternatives to human feedback Le et al. (2022); Yasunaga and Liang (2020). Recent developments show that Large Language Models (LLMs) can self-evaluate their responses. However, applying this to Multi-modal Large Language Models remains largely unexplored in terms of generating grounded and hallucination-free responses.

We now discuss the proposed methodology in the subsequent section.

3 Methodology

Figure 2: Overview of the SERPENT-VLM pipeline. The X-ray image is processed using a visual encoder (step 1) and projected onto a high-dimensional space using a visual mapper (step 2). The encoded image with the report generation prompt is fed into the LLM (step 3). Cross-entropy loss is employed (step 4) for the causal language modeling objective. The pooled image representation and the contextual representation of the generated report are used to compute the self-refining loss (step 5). A weighted combination of both objectives is used to train the network (step 6).

3.1 Overview of SERPENT-VLM

We summarize the pipeline of SERPENT-VLM in Figure 2. It consists of two branches that together establish the learning optimization criterion. 1) The Causal Language Modeling Objective enforces the standard cross-entropy loss (step 4 in Fig. 2) for supervised radiology report generation. Our approach consists of a visual encoder that extracts information from chest X-ray images (step 1 in Fig. 2), a visual mapper that projects low-dimensional image features onto a high-dimensional feature space (step 2 in Fig. 2), and a Large Language Model that autoregressively generates the diagnostic radiological report (step 3 in Fig. 2). To further reduce hallucination, we construct a pooled representation of the given X-ray image and a contextual representation of the generated report that leverages the attention weights and last hidden states, and enforce 2) the Self-Refining Objective, which maximizes the similarity between the pooled image representation and the contextual representation of the generated report through a self-supervised loss criterion (step 5 in Fig. 2). We train the network through a weighted combination of both losses (step 6 in Fig. 2), thereby enabling SERPENT-VLM to continuously refine itself by aligning the generated text with the input image. We now discuss the details of each component.

3.2 SERPENT-VLM Framework

The architecture of SERPENT-VLM can be partitioned into three modules: a visual encoder, a visual mapper, and a large language model (LLM). Formally, consider a chest X-ray image $I_v \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of input channels, and $H$ and $W$ are the height and width of the image, respectively. $I_v = [I_{v_1}, I_{v_2}, \cdots, I_{v_k}]$ comprises a sequence of $k$ patches, with $I_{v_i} \in \mathbb{R}^{C \times P \times P}$ being the $i^{th}$ patch and $P$ the patch size.
We leverage a transformer-based visual encoder $V_{enc}$ to encode each patch and obtain its contextual representation $\tilde{e}_{v_i} \in \mathbb{R}^{d_v}$, as denoted by Eq. 1, and aggregate the encoded patches to obtain a global image representation $\tilde{e}_v$, as depicted by Eq. 2.

$$\tilde{e}_{v_1}, \tilde{e}_{v_2}, \cdots, \tilde{e}_{v_k} = V_{enc}(I_{v_1}, I_{v_2}, \cdots, I_{v_k}) \tag{1}$$

$$\tilde{e}_v = V_{pooler}(\tilde{e}_{v_1}, \tilde{e}_{v_2}, \cdots, \tilde{e}_{v_k}) \tag{2}$$
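The patch decomposition and pooling in Eqs. 1 and 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `split_into_patches` and `mean_pool` are hypothetical, the transformer encoder $V_{enc}$ is omitted (raw flattened patches stand in for encoded patches), and element-wise averaging is only one assumed choice for $V_{pooler}$.

```python
from typing import List

Image = List[List[List[float]]]  # indexed as [channel][row][column]


def split_into_patches(image: Image, patch: int) -> List[List[float]]:
    """Split a C x H x W image into k = (H/P) * (W/P) flattened patches."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    if height % patch or width % patch:
        raise ValueError("H and W must be divisible by the patch size P")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one C x P x P patch into a vector of length C * P * P.
            flat = [
                image[c][top + dy][left + dx]
                for c in range(channels)
                for dy in range(patch)
                for dx in range(patch)
            ]
            patches.append(flat)
    return patches


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Average k vectors element-wise (an assumed stand-in for V_pooler)."""
    k = len(vectors)
    return [sum(column) / k for column in zip(*vectors)]


# Toy 1 x 4 x 4 image with P = 2, which gives k = 4 patches of length 4.
toy_image = [[[float(r * 4 + c) for c in range(4)] for r in range(4)]]
toy_patches = split_into_patches(toy_image, patch=2)
global_vector = mean_pool(toy_patches)
```

In the actual pipeline each patch would first pass through $V_{enc}$, so the pooled vector would live in $\mathbb{R}^{d_v}$ rather than in raw pixel space.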

The encoded image features inherently reside in a visual feature space, which is distinct from and not directly compatible with the textual feature space, and hence need to be aligned with the word embedding space of the LLM. To ensure this, we use a learnable visual mapper $V_{map}$ to project the patch embeddings $\tilde{e}_{v_i}$ onto the word embedding space. Formally, $e_{v_i} = V_{map}(\tilde{e}_{v_i})$. We construct a seed prompt $T$ instructing the LLM to generate a report conditioned on the image $I_v$, and obtain the corresponding tokens $\mathcal{T}_{tokens} = [t_1, t_2, \cdots, t_{|\mathcal{T}_{tokens}|}]$, which are given as input to the $Embedding$ module of the LLM to construct the token embeddings (refer to Eq. 3),

$$e_{t_1}, e_{t_2}, \cdots, e_{t_{|\mathcal{T}_{tokens}|}} = Embedding(t_1, t_2, \cdots, t_{|\mathcal{T}_{tokens}|}) \tag{3}$$

We concatenate the sequence of projected image patch embeddings $e_{v_i}$ with the seed prompt text embeddings $e_{t_j}$ to obtain a sequence of input embeddings $e_{\mathcal{I}} = [e_v; e_t]$, which are given as input to the decoder-only LLM, denoted by $TD$, for generating the logits of the response tokens in an auto-regressive fashion. $V_{enc}$, $V_{pooler}$, $V_{map}$ and $TD$ are trained through the cross-entropy loss $\mathcal{L}_{report}$ enforced between the generated logits and the actual responses. To further guide the report generation process by aligning the generated response with the input image, we enforce a self-supervised refining loss.
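The projection and concatenation steps described above can be sketched as follows. This is an illustrative toy under stated assumptions: `make_linear` and `build_llm_input` are hypothetical helpers, the mapper is a single randomly initialised affine layer rather than the trained feed-forward network, and the dimensions are tiny placeholders.

```python
import random
from typing import Callable, List

Vector = List[float]


def make_linear(d_in: int, d_out: int, seed: int = 0) -> Callable[[Vector], Vector]:
    """Return a hypothetical one-layer mapper x -> W x + b with random weights."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out

    def apply(x: Vector) -> Vector:
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(weight, bias)]

    return apply


def build_llm_input(patch_features: List[Vector],
                    prompt_embeddings: List[Vector],
                    mapper: Callable[[Vector], Vector]) -> List[Vector]:
    """Project each patch feature, then form e_I = [e_v; e_t]."""
    projected = [mapper(p) for p in patch_features]   # e_{v_i} = V_map(e~_{v_i})
    return projected + prompt_embeddings              # image tokens first, then prompt


# Toy sizes: k = 2 patches with d_v = 3, and 3 prompt tokens with d_t = 4.
patch_features = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
prompt_embeddings = [[0.1] * 4, [0.2] * 4, [0.3] * 4]
mapper = make_linear(d_in=3, d_out=4)
llm_inputs = build_llm_input(patch_features, prompt_embeddings, mapper)
```

The resulting sequence has $k + |\mathcal{T}_{tokens}|$ entries, all in the word embedding space, which is what allows a decoder-only LLM to consume image and text tokens uniformly.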

3.3 Self-refining Strategy

We construct an aggregated representation of the generated text by utilizing the attention weights of the last layer of $TD$. Consider the logit distribution for each generated token as $l_i \in \mathbb{R}^{d}$, where $d$ is the vocabulary size of $TD$. To encode the representation of each generated token, which is further used to compute the self-refining loss in a differentiable fashion, we leverage Gumbel-Softmax on the logit distribution to obtain $\hat{l}_i$ for each predicted token. We construct the aggregated representation $\hat{e}_i^p = \sum_{j=1}^{d} e_j \hat{l}_{ij}$ of each predicted token by taking a weighted sum of the embedding matrix $E = e_1, e_2, \cdots, e_d$, with $\hat{l}_i$ being the corresponding weights. Formally,

$$\hat{l}_{ij} = \frac{e^{(\log(l_{ij}) + g_{ij})/\tau}}{\sum_{j=1}^{d} e^{(\log(l_{ij}) + g_{ij})/\tau}} \tag{4}$$
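Eq. 4 and the weighted embedding sum can be sketched in plain Python. The sketch assumes the standard Gumbel-Softmax reading, in which $g_{ij}$ is Gumbel(0, 1) noise and $\tau$ is a temperature; the function names, the toy vocabulary, and the value of `tau` are illustrative assumptions, and a real implementation would operate on tensors with automatic differentiation.

```python
import math
import random
from typing import List


def gumbel_softmax(probs: List[float], tau: float, rng: random.Random) -> List[float]:
    """Relaxed one-hot sample: softmax((log p_j + g_j) / tau) over the vocabulary."""
    scores = []
    for p in probs:
        # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)), u in (0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        scores.append((math.log(p) + g) / tau)
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def soft_embedding(weights: List[float], embedding_matrix: List[List[float]]) -> List[float]:
    """Weighted sum of embedding rows: e^p = sum_j e_j * l_hat_j."""
    dim = len(embedding_matrix[0])
    return [sum(w * row[d] for w, row in zip(weights, embedding_matrix))
            for d in range(dim)]


# Toy vocabulary of size d = 3 with 2-dimensional embeddings (hypothetical values).
vocab_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_probs = [0.7, 0.2, 0.1]
weights = gumbel_softmax(token_probs, tau=0.5, rng=random.Random(0))
token_embedding = soft_embedding(weights, vocab_embeddings)
```

Lower values of `tau` push `weights` toward a one-hot vector, so the weighted sum approaches the embedding of a single sampled token while remaining differentiable with respect to the input distribution.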

Since the Gumbel-Softmax operator makes the logit distribution peaky, taking a weighted sum effectively yields the predicted token embeddings. Further, we construct an aggregated representation $h_t \in \mathbb{R}^{d_t}$ of the predicted token embeddings by leveraging the attention weights from the last layer of $TD$. We hypothesize that aligning the aggregated representation of the generated report with the pooled input image representation would reduce hallucination and ground the report generation task. For this, we enforce a self-refining loss between $h_t$ and $e_v$, depicted by Eq. 5,

$$\mathcal{L}_{refine} = \frac{1}{b} \sum_{i}^{b} e^{-h_t^T e_v}, \tag{5}$$

where $b$ is the batch size.
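A minimal sketch of Eq. 5 follows. It assumes that the text representation $h_t$ and the image representation $e_v$ of each batch item have already been brought to the same dimensionality; the function name `refine_loss` and the toy vectors are illustrative only.

```python
import math
from typing import List


def refine_loss(text_reprs: List[List[float]], image_reprs: List[List[float]]) -> float:
    """Batch mean of exp(-h_t . e_v), following Eq. 5."""
    if len(text_reprs) != len(image_reprs) or not text_reprs:
        raise ValueError("need one text and one image representation per batch item")
    total = 0.0
    for h_t, e_v in zip(text_reprs, image_reprs):
        similarity = sum(h * e for h, e in zip(h_t, e_v))  # dot product h_t^T e_v
        total += math.exp(-similarity)
    return total / len(text_reprs)


# Aligned pair (dot product 1) versus opposed pair (dot product -1).
aligned = refine_loss([[1.0, 0.0]], [[1.0, 0.0]])
opposed = refine_loss([[1.0, 0.0]], [[-1.0, 0.0]])
```

Because the exponential is monotonic, a larger dot product always yields a smaller loss: the aligned pair evaluates to $e^{-1}$ and the opposed pair to $e^{1}$.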

Minimizing the exponential of the negative similarity between the image representation and the generated text representation pushes the two representations closer, thus further grounding the report generation process. We optimize our network with a weighted combination of the causal language modeling objective and the self-refining objective. The total loss is denoted by Eq. 6,

$$\mathcal{L}_{total} = \lambda_{report}\, \mathcal{L}_{report} + \lambda_{refine}\, \mathcal{L}_{refine} \tag{6}$$

$\mathcal{L}_{report}$ depicts the standard causal language modeling objective that ensures the conditional generation of radiological report text based on the input image, whereas $\mathcal{L}_{refine}$ ensures that the generated report is grounded in the context of the input image, thereby establishing a robust pipeline for radiology report generation.
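The weighted combination in Eq. 6 can be sketched as below. The cross-entropy helper is a plain-Python stand-in for the causal language modeling loss, and the weights `lambda_report=1.0` and `lambda_refine=0.5`, as well as the refine-loss value of 0.5, are hypothetical placeholders rather than the values used in the experiments.

```python
import math
from typing import List


def cross_entropy(logits_per_step: List[List[float]], target_ids: List[int]) -> float:
    """Mean negative log-likelihood of the reference token at each decoding step."""
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_norm - logits[target]
    return total / len(target_ids)


def total_loss(report_loss: float, refine_loss_value: float,
               lambda_report: float, lambda_refine: float) -> float:
    """L_total = lambda_report * L_report + lambda_refine * L_refine (Eq. 6)."""
    return lambda_report * report_loss + lambda_refine * refine_loss_value


# Two decoding steps over a 4-token vocabulary with uniform logits.
uniform_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
report = cross_entropy(uniform_logits, target_ids=[1, 3])
combined = total_loss(report, 0.5, lambda_report=1.0, lambda_refine=0.5)
```

Setting `lambda_refine` to zero recovers plain supervised fine-tuning, while setting `lambda_report` to zero leaves only the self-supervised signal; these are the two extremes examined in the ablation on loss weights.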

4 Experiments and Evaluation

We now discuss the details corresponding to the experiments and ablation studies carried out and enumerate the observations.

4.1 Implementation Details

We discuss the technical details and hyper-parameter settings for all the experiments. For the visual encoder $V_{enc}$, we employed the base version of Swin-Transformer-V2 (https://huggingface.co/microsoft/swinv2-base-patch4-window12-192-22k) and a feed-forward neural network for $V_{map}$. We leverage LLaMA2-7B (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as our primary LLM. Further, the hidden dimensions $d_v$ of $V_{enc}$ and $d_t$ of $TD$ are 768 and 1024, respectively. We freeze the weights of $V_{enc}$ but keep $V_{map}$ trainable. We employ LoRA with a rank and $\alpha$-scaling factor of 16 each to fine-tune the underlying LLM $TD$. We train SERPENT-VLM for 15 epochs on the IU-Xray dataset and 20 epochs on the ROCO dataset with mixed precision and an effective batch size (BS) of 6 on one NVIDIA A40 48GB GPU, using the AdamW optimizer with a learning rate of $1 \times 10^{-4}$ and a linear learning-rate scheduler. For inference, we leverage beam search decoding with the beam size set to 3.
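For reference, the reported hyper-parameters can be collected in a small configuration sketch. The `TrainConfig` class and `linear_lr` function are hypothetical; in particular, the schedule below assumes a linear decay to zero with no warmup, and the total step count is a placeholder, since neither detail is stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyper-parameters as reported in Section 4.1."""
    learning_rate: float = 1e-4
    effective_batch_size: int = 6
    lora_rank: int = 16
    lora_alpha: int = 16
    beam_size: int = 3
    epochs_iu_xray: int = 15
    epochs_roco: int = 20


def linear_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Linearly decay base_lr to 0 over total_steps (assumed: no warmup)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining


config = TrainConfig()
# 1000 total steps is an arbitrary placeholder for illustration.
halfway_lr = linear_lr(step=500, total_steps=1000, base_lr=config.learning_rate)
```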

4.2 Datasets and Evaluation Metrics

We evaluate SERPENT-VLM on two commonly used datasets of diverse modalities:

  1. IU X-Ray, which is a widely used publicly available dataset for medical report generation tasks containing 3,955 fully de-identified radiology reports with sections such as Impression, Findings, Indication, etc., each associated with frontal and/or lateral chest X-rays, totaling 7,470 images;

  2. ROCO, which has 'radiology' and 'out-of-class' subsets (synthetic radiology images, clinical photos, portraits, compound radiology images, and digital art) of roughly 65,460 and 8,182 'radiology', and 4,902 and 613 'out-of-class' images in the train and test set respectively.

Since the reports are verbose and need to be accurately measured with word-level precision, we compute overlap-based metrics like BLEU and Rouge-L, and a semantic similarity-based metric BertScore for evaluating the efficacy of our approach.
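To make the overlap-based metrics concrete, the sketch below computes a clipped n-gram precision (the core quantity inside BLEU) and a ROUGE-L F1 score based on the longest common subsequence. It is a simplified illustration: whitespace tokenization, a single reference, no brevity penalty, and an unweighted F1 are all assumptions, and the function names are hypothetical rather than those of any evaluation toolkit.

```python
from collections import Counter
from typing import List


def clipped_ngram_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Fraction of candidate n-grams also found in the reference, with clipped counts."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_f1(candidate: List[str], reference: List[str]) -> float:
    """Harmonic mean of LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


reference = "the heart size is normal".split()
candidate = "the heart is normal".split()
unigram_precision = clipped_ngram_precision(candidate, reference, n=1)
bigram_precision = clipped_ngram_precision(candidate, reference, n=2)
rouge_l = rouge_l_f1(candidate, reference)
```

In this toy pair every candidate word appears in the reference, so unigram precision is 1, while the missing word "size" breaks one bigram and lowers both the bigram precision (2/3) and the LCS-based recall (4/5).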

| Dataset | Train | Val | Test | Image Views |
|---|---|---|---|---|
| IU X-Ray | 2769 | 791 | 395 | Frontal and Lateral |
| ROCO | 65460 | 8183 | 8182 | Frontal |

Table 1: Statistics of Evaluation Datasets

4.3 Performance of SERPENT-VLM on Radiology Report Generation

| Methods | IU-Xray $Bleu_1$ | IU-Xray $Bleu_2$ | IU-Xray $Bleu_3$ | IU-Xray $Bleu_4$ | IU-Xray $Rouge_L$ | IU-Xray $BertScore$ | ROCO $Bleu_1$ | ROCO $Bleu_2$ | ROCO $Bleu_3$ | ROCO $Bleu_4$ | ROCO $Rouge_L$ | ROCO $BertScore$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Show-Tell | 0.243 | 0.13 | 0.108 | 0.078 | 0.307 | 0.378 | 0.104 | 0.076 | 0.051 | 0.027 | 0.089 | 0.34 |
| Att2in | 0.248 | 0.134 | 0.116 | 0.091 | 0.309 | 0.386 | 0.106 | 0.077 | 0.052 | 0.027 | 0.091 | 0.347 |
| AdaAtt | 0.284 | 0.207 | 0.15 | 0.126 | 0.311 | 0.442 | 0.122 | 0.089 | 0.060 | 0.031 | 0.104 | 0.397 |
| Transformer | 0.372 | 0.251 | 0.147 | 0.136 | 0.317 | 0.579 | 0.159 | 0.116 | 0.079 | 0.041 | 0.137 | 0.521 |
| M2transformer | 0.402 | 0.284 | 0.168 | 0.143 | 0.328 | 0.626 | 0.172 | 0.125 | 0.085 | 0.044 | 0.148 | 0.563 |
| R2Gen | 0.47 | 0.304 | 0.219 | 0.165 | 0.371 | 0.732 | 0.201 | 0.147 | 0.099 | 0.052 | 0.173 | 0.658 |
| R2GenCMN | 0.475 | 0.309 | 0.222 | 0.17 | 0.375 | 0.74 | 0.169 | 0.148 | 0.100 | 0.052 | 0.175 | 0.665 |
| MSAT | 0.481 | 0.316 | 0.226 | 0.171 | 0.372 | 0.749 | 0.212 | 0.150 | 0.102 | 0.053 | 0.177 | 0.673 |
| METransformer | 0.483 | 0.322 | 0.228 | 0.172 | 0.38 | 0.752 | 0.211 | 0.151 | 0.102 | 0.053 | 0.178 | 0.676 |
| R2GenGPT (Deep) | 0.480 | 0.316 | 0.216 | 0.169 | 0.377 | 0.748 | 0.213 | 0.150 | 0.101 | 0.053 | 0.177 | 0.672 |
| MiniGPT4 | 0.494 | 0.329 | 0.220 | 0.179 | 0.390 | 0.767 | 0.219 | 0.156 | 0.103 | 0.056 | 0.183 | 0.689 |
| BiomedGPT | 0.516 | 0.343 | 0.233 | 0.183 | 0.403 | 0.793 | 0.229 | 0.163 | 0.109 | 0.058 | 0.189 | 0.712 |
| LlaVA-Med | 0.528 | 0.346 | 0.237 | 0.186 | 0.422 | 0.845 | 0.234 | 0.164 | 0.111 | 0.061 | 0.198 | 0.759 |
| SERPENT-VLM | 0.547 | 0.356 | 0.242 | 0.190 | 0.452 | 0.935 | 0.243 | 0.169 | 0.108 | 0.057 | 0.212 | 0.84 |

Table 2: Results of SERPENT-VLM on Benchmark datasets

Table 2 compares SERPENT-VLM against various state-of-the-art baselines on the IU-Xray and ROCO datasets. In comparison with traditional non-LLM approaches such as Show-Tell Vinyals et al. (2015), Att2in Xu et al. (2015), and R2Gen Chen et al. (2020), SERPENT-VLM exhibits significant improvements. For instance, on the IU-Xray dataset, SERPENT-VLM achieves a $Bleu_4$ score of 0.190, surpassing Show-Tell's 0.078 and R2Gen's 0.165, and even outperforming the more advanced R2GenCMN, which scores 0.170. This suggests not only an improvement in capturing long-range dependencies but also a reduction in detail hallucination, a common issue in earlier models. Furthermore, when compared to Medical LLMs and generalistic Vision-Language Models such as LlaVA-Med Li et al. (2023), BiomedGPT Zhang et al. (2024), and MiniGPT4 Zhu et al. (2023), SERPENT-VLM demonstrates superior performance on most metrics, with $Bleu_3$ and $Bleu_4$ on ROCO being the exceptions, where LlaVA-Med scores higher. For example, against LlaVA-Med, which records a $Bleu_4$ of 0.186 on IU-Xray, SERPENT-VLM shows an improvement with a score of 0.190. Similarly, in terms of $BertScore$ on IU-Xray, SERPENT-VLM achieves 0.935 compared to LlaVA-Med's 0.845 and BiomedGPT's 0.793, indicating closer semantic agreement with the reference reports.
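The size of these gains can be checked with simple arithmetic on the IU-Xray scores quoted from Table 2. The sketch below computes the absolute and relative improvement of SERPENT-VLM over LlaVA-Med; the dictionary and function names are illustrative only.

```python
# IU-Xray scores copied from Table 2.
llava_med = {"Bleu4": 0.186, "RougeL": 0.422, "BertScore": 0.845}
serpent_vlm = {"Bleu4": 0.190, "RougeL": 0.452, "BertScore": 0.935}


def relative_gain_percent(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


absolute_gain = {m: round(serpent_vlm[m] - llava_med[m], 3) for m in llava_med}
relative_gain = {m: round(relative_gain_percent(serpent_vlm[m], llava_med[m]), 2)
                 for m in llava_med}

for metric in llava_med:
    print(f"{metric}: +{absolute_gain[metric]:.3f} absolute, "
          f"+{relative_gain[metric]:.2f}% relative")
```

The relative gain is smallest for $Bleu_4$ (about 2%) and largest for $BertScore$ (about 11%), which shows that the improvement over LlaVA-Med is concentrated in the semantic-similarity metric rather than in exact n-gram overlap.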

4.4 Discussion on the Impact of different Design Choices for SERPENT-VLM

We carry out experiments on two design choices for SERPENT-VLM and establish the efficacy of the proposed architecture through a comparative analysis across these experiments.

  1.

    Effect of relative importance of two losses: We vary the relative importance of the self-refining loss ($\lambda_{refine}$) and the report-generation loss ($\lambda_{report}$) in Eq. 6. Table 3 shows that combining the two losses yields much better performance on IU X-ray and ROCO than using only the report-generation loss (row 2 vs. row 5). This highlights that the self-refining loss complements the report-generation loss by grounding the generated report in the input image, thereby reducing hallucination. Further, using only the self-refining loss (row 1) degrades performance, because SERPENT-VLM is then trained purely through a self-supervised paradigm without any kind of supervision. The balance between the two losses is therefore not merely about avoiding hallucinations: each loss component reinforces the other, improving the overall quality and reliability of the generated radiology reports. Among the settings tested, $\lambda_{report}=0.3$ and $\lambda_{refine}=0.7$ perform best on both datasets, which supports the importance of balancing the loss weights in SERPENT-VLM.

    | Dataset | $\lambda_{report}$ | $\lambda_{refine}$ | $Bleu_{1}$ | $Bleu_{2}$ | $Bleu_{3}$ | $Bleu_{4}$ | $Rouge_{L}$ | $BertScore$ |
    |---|---|---|---|---|---|---|---|---|
    | IU-Xray | 0.0 | 1.0 | 0.416 | 0.270 | 0.184 | 0.144 | 0.344 | 0.711 |
    | IU-Xray | 0.3 | 0.7 | 0.547 | 0.356 | 0.242 | 0.190 | 0.452 | 0.935 |
    | IU-Xray | 0.5 | 0.5 | 0.492 | 0.320 | 0.218 | 0.171 | 0.407 | 0.842 |
    | IU-Xray | 0.7 | 0.3 | 0.479 | 0.311 | 0.212 | 0.166 | 0.396 | 0.818 |
    | IU-Xray | 1.0 | 0.0 | 0.451 | 0.311 | 0.200 | 0.157 | 0.373 | 0.771 |
    | ROCO | 0.0 | 1.0 | 0.187 | 0.130 | 0.083 | 0.044 | 0.163 | 0.647 |
    | ROCO | 0.3 | 0.7 | 0.243 | 0.169 | 0.108 | 0.057 | 0.212 | 0.840 |
    | ROCO | 0.5 | 0.5 | 0.214 | 0.149 | 0.095 | 0.050 | 0.187 | 0.739 |
    | ROCO | 0.7 | 0.3 | 0.207 | 0.144 | 0.092 | 0.048 | 0.180 | 0.714 |
    | ROCO | 1.0 | 0.0 | 0.194 | 0.135 | 0.086 | 0.046 | 0.170 | 0.672 |

    Table 3: Impact of combining self-refining loss (weight $\lambda_{refine}$) with report-generation loss (weight $\lambda_{report}$). Fusing both the loss components gives optimal performance.
  2.

    Effect of contextual representation design strategy: We explore different aggregation strategies for obtaining the contextual representation of the generated report. As shown in Table 4, attention-based aggregation outperforms the other strategies by a significant margin, obtaining a BertScore of 0.935 and 0.840 and a BLEU1 score of 0.547 and 0.243 on IU X-ray and ROCO, respectively. Average pooling (the average of token representations), max pooling (the token representation with the maximum L2-norm), and top-k average pooling (the average of the top $k=5$ token representations ranked by attention weights) give sub-optimal performance on both benchmarks. This indicates that the way token representations are aggregated plays a pivotal role in the model’s ability to generate coherent and contextually relevant radiology reports.

    | Dataset | Design Strategy | $Bleu_{1}$ | $Bleu_{2}$ | $Bleu_{3}$ | $Bleu_{4}$ | $Rouge_{L}$ | $BertScore$ |
    |---|---|---|---|---|---|---|---|
    | IU-Xray | Attention based aggregation | 0.547 | 0.356 | 0.242 | 0.190 | 0.452 | 0.935 |
    | IU-Xray | Average pooling | 0.410 | 0.267 | 0.182 | 0.143 | 0.339 | 0.701 |
    | IU-Xray | Top k average pooling | 0.465 | 0.303 | 0.206 | 0.162 | 0.384 | 0.795 |
    | IU-Xray | Max pooling | 0.383 | 0.249 | 0.169 | 0.133 | 0.316 | 0.655 |
    | ROCO | Attention based aggregation | 0.243 | 0.169 | 0.108 | 0.057 | 0.212 | 0.840 |
    | ROCO | Average pooling | 0.190 | 0.132 | 0.084 | 0.044 | 0.165 | 0.655 |
    | ROCO | Top k average pooling | 0.199 | 0.139 | 0.089 | 0.047 | 0.174 | 0.689 |
    | ROCO | Max pooling | 0.170 | 0.118 | 0.076 | 0.040 | 0.148 | 0.588 |

    Table 4: Performance comparison of different design strategies for contextual representation. Attention weights-based aggregation displays superior performance.
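The two design choices discussed in this section can be illustrated with a minimal NumPy sketch. It is not the implementation used in our experiments. The function names `aggregate` and `combined_loss` are hypothetical, attention-based aggregation is assumed to be a sum of token representations weighted by normalised attention weights, the self-refining loss is stood in for by one minus the cosine similarity between the pooled image vector and the aggregated text vector, and the total objective is assumed to be a weighted sum of the two losses; Eq. 6 gives the actual formulation.

```python
import numpy as np


def aggregate(tokens: np.ndarray, attn: np.ndarray, strategy: str, k: int = 5) -> np.ndarray:
    """Collapse token representations of shape (T, d) into one (d,) vector.

    `attn` holds one non-negative attention weight per token, shape (T,).
    """
    if strategy == "attention":
        # Assumed form: attention-weighted sum with weights normalised to 1.
        weights = attn / attn.sum()
        return weights @ tokens
    if strategy == "average":
        return tokens.mean(axis=0)
    if strategy == "max":
        # Token representation with the largest L2 norm.
        return tokens[np.argmax(np.linalg.norm(tokens, axis=1))]
    if strategy == "topk":
        # Average of the k tokens with the highest attention weights.
        top = np.argsort(attn)[-k:]
        return tokens[top].mean(axis=0)
    raise ValueError(f"unknown strategy: {strategy}")


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def combined_loss(report_loss: float, image_vec: np.ndarray, text_vec: np.ndarray,
                  lam_report: float = 0.3, lam_refine: float = 0.7) -> float:
    """Weighted sum of the report loss and a similarity-based stand-in refining loss."""
    refine_loss = 1.0 - cosine_similarity(image_vec, text_vec)  # assumed form
    return lam_report * report_loss + lam_refine * refine_loss


# Toy usage with random stand-ins for model outputs.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 8))   # 12 generated tokens, hidden size 8
attn = rng.random(12)               # one attention weight per token
image_vec = rng.normal(size=8)      # pooled image representation
for strategy in ("attention", "average", "topk", "max"):
    text_vec = aggregate(tokens, attn, strategy)
    print(strategy, round(combined_loss(2.0, image_vec, text_vec), 4))
```

The default weights mirror the best-performing row of Table 3 ($\lambda_{report}=0.3$, $\lambda_{refine}=0.7$); setting either weight to zero recovers the single-loss settings in rows 1 and 5.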

4.5 How robust is SERPENT-VLM to noisy images?

We compare the robustness of the SoTA methods LlaVA-Med and BiomedGPT with that of SERPENT-VLM by adding Gaussian noise to the radiological images. Fig. 3 demonstrates that SERPENT-VLM outperforms LlaVA-Med and BiomedGPT across all Gaussian noise scales, maintaining higher BLEU1 (about 5-6% higher) and BertScore (about 9-10% higher), thus showing greater robustness in report generation under noisy and corrupted images. This also suggests that SERPENT-VLM focuses on the relevant parts of the image, mitigating the effect of the added noise and grounding the generated report, which is an indication of reduced hallucination. Integrating SERPENT-VLM into clinical workflows could therefore help radiologists deliver faster and more accurate patient care.
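The perturbation itself is simple to reproduce. The sketch below adds zero-mean Gaussian noise at several scales to an image normalised to [0, 1]. The helper name `add_gaussian_noise`, the random stand-in image, and the noise scales are illustrative assumptions rather than the exact settings used for Fig. 3.

```python
import numpy as np


def add_gaussian_noise(image: np.ndarray, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation `sigma`.

    `image` is a float array with values in [0, 1]; the result is clipped
    back to that range so it remains a valid image.
    """
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


rng = np.random.default_rng(0)
image = rng.random((224, 224))  # stand-in for a normalised radiograph

# Hypothetical noise scales; each noisy copy would be fed to the model
# and scored against the reference report.
for sigma in (0.0, 0.1, 0.2, 0.3):
    noisy = add_gaussian_noise(image, sigma, rng)
    rmse = float(np.sqrt(np.mean((noisy - image) ** 2)))
    print(f"sigma={sigma:.1f}  rmse={rmse:.3f}")
```

Evaluating every model on the same noisy copies (same seed, same scales) keeps the comparison across methods fair.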

(a) Performance metrics for ROCO dataset with varying levels of Gaussian noise added to input radiological images.
(b) Performance metrics for IU-Xray dataset with varying levels of Gaussian noise added to input radiological images.
Figure 3: Comparative performance metrics for ROCO and IU-Xray datasets.

5 Summary and Conclusion

In this paper, we propose SERPENT-VLM, a method for producing detailed and accurate radiology reports from chest X-rays with reduced hallucination. The method uses a frozen visual encoder to map X-ray images into a high-dimensional space, from which a Large Language Model (LLM) generates initial reports. These reports are further refined through a novel combination of a self-refining loss and the Causal Language Modeling loss, and the resulting model significantly surpasses existing methods, as detailed in Section 4. Our experiments in Section 4 and the supplementary materials confirm the effectiveness of our self-refining approach, even with noisy, distorted images. Future work includes extending our method to other medical imaging types, such as MRIs and CT scans, and incorporating diagnostic RADreports to further enhance report accuracy.

Limitations

SERPENT-VLM has shown significant advancements in generating radiology reports from chest X-rays, reducing inaccuracies and matching the image content more closely than earlier models. However, this research has its limitations. The model’s performance and adaptability have been tested only on particular datasets (IU X-Ray and ROCO), which do not cover the broad spectrum of radiological images or health conditions. It remains unclear how well the model would work in actual clinical settings. Furthermore, although we emphasize the model’s ability to handle low-quality images, the wide range of image quality in real-life scenarios could pose challenges that have yet to be evaluated.

Ethics Statement

The deployment of SERPENT-VLM in clinical settings involves significant ethical considerations. The model’s potential to generate erroneous interpretations from radiological images, despite reduced hallucinations, necessitates cautious application, especially since incorrect reports could lead to misdiagnoses or inappropriate treatments. The use of large datasets for training also raises privacy concerns, requiring stringent data handling and patient consent protocols.

References

  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  • Awadalla et al. (2023) Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. 2023. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390.
  • Chen et al. (2021) Zhihong Chen, Yaling Shen, Yan Song, and Xiang Wan. 2021. Cross-modal memory networks for radiology report generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5904–5914, Online. Association for Computational Linguistics.
  • Chen et al. (2020) Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. 2020. Generating radiology reports via memory-driven transformer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1439–1449, Online. Association for Computational Linguistics.
  • Hyland et al. (2023) Stephanie L Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C Castro, Mercy Ranjit, Anton Schwaighofer, Fernando Pérez-García, Valentina Salvatelli, Shaury Srivastav, Anja Thieme, et al. 2023. Maira-1: A specialised large multimodal model for radiology report generation. arXiv preprint arXiv:2311.13668.
  • Irvin et al. (2019) Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. 2019. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’19/IAAI’19/EAAI’19. AAAI Press.
  • Jing et al. (2018) Baoyu Jing, Pengtao Xie, and Eric Xing. 2018. On the automatic generation of medical imaging reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
  • Le et al. (2022) Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Hoi. 2022. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. In Advances in Neural Information Processing Systems.
  • Li et al. (2023) Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. 2023. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.
  • Li et al. (2021) Yehao Li, Yingwei Pan, Jingwen Chen, Ting Yao, and Tao Mei. 2021. X-modaler: A versatile and high-performance codebase for cross-modal analytics. In Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, page 3799–3802, New York, NY, USA. Association for Computing Machinery.
  • Liu et al. (2021) Fenglin Liu, Xian Wu, Shen Ge, Wei Fan, and Yuexian Zou. 2021. Exploring and distilling posterior and prior knowledge for radiology report generation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
  • Liu et al. (2024) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. Advances in neural information processing systems, 36.
  • Nooralahzadeh et al. (2021) Farhad Nooralahzadeh, Nicolas Perez Gonzalez, Thomas Frauenfelder, Koji Fujimoto, and Michael Krauthammer. 2021. Progressive transformer-based generation of radiology reports. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2824–2832, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc.
  • Pan et al. (2020) Yingwei Pan, Ting Yao, Yehao Li, and Tao Mei. 2020. X-linear attention networks for image captioning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
  • Tandon et al. (2022) Niket Tandon, Aman Madaan, Peter Clark, and Yiming Yang. 2022. Learning to repair: Repairing model output errors after deployment using a dynamic memory of feedback. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 339–352, Seattle, United States. Association for Computational Linguistics.
  • Tang et al. (2021) Mingkang Tang, Zhanyu Wang, Zhenhua LIU, Fengyun Rao, Dian Li, and Xiu Li. 2021. Clip4caption: Clip for video caption. In Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, page 4858–4862, New York, NY, USA. Association for Computing Machinery.
  • Tang et al. (2023) Mingkang Tang, Zhanyu Wang, Zhaoyang Zeng, Xiu Li, and Luping Zhou. 2023. Stay in grid: Improving video captioning via fully grid-level representation. IEEE Transactions on Circuits and Systems for Video Technology, 33(7):3319–3332.
  • Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  • Vinyals et al. (2015) O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2015. Show and tell: A neural image caption generator. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156–3164, Los Alamitos, CA, USA. IEEE Computer Society.
  • Wang et al. (2022a) Jun Wang, Abhir Bhalerao, and Yulan He. 2022a. Cross-modal prototype driven network for radiology report generation. In Computer Vision – ECCV 2022, pages 563–579, Cham. Springer Nature Switzerland.
  • Wang et al. (2022b) Zhanyu Wang, Hongwei Han, Lei Wang, Xiu Li, and Luping Zhou. 2022b. Automated radiographic report generation purely on transformer: A multicriteria supervised approach. IEEE Transactions on Medical Imaging, 41(10):2803–2813.
  • Wang et al. (2021) Zhanyu Wang, Luping Zhou, Lei Wang, and Xiu Li. 2021. A self-boosting framework for automated radiographic report generation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2433–2442.
  • Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2048–2057, Lille, France. PMLR.
  • Yan et al. (2021) An Yan, Zexue He, Xing Lu, Jiang Du, Eric Chang, Amilcare Gentili, Julian McAuley, and Chun-Nan Hsu. 2021. Weakly supervised contrastive learning for chest X-ray report generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4009–4015, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Yasunaga and Liang (2020) Michihiro Yasunaga and Percy Liang. 2020. Graph-based, self-supervised program repair from diagnostic feedback. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org.
  • You et al. (2021) Di You, Fenglin Liu, Shen Ge, Xiaoxia Xie, Jing Zhang, and Xian Wu. 2021. AlignTransformer: Hierarchical alignment of visual regions and disease tags for medical report generation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Lecture notes in computer science, pages 72–82. Springer International Publishing, Cham.
  • You et al. (2016) Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4651–4659.
  • Yuan et al. (2019) Jianbo Yuan, Haofu Liao, Rui Luo, and Jiebo Luo. 2019. Automatic radiology report generation based on multi-view image fusion and medical concept enrichment. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, pages 721–729, Cham. Springer International Publishing.
  • Zhang et al. (2024) Kai Zhang, Jun Yu, Eashan Adhikarla, Rong Zhou, Zhiling Yan, Yixin Liu, Zhengliang Liu, Lifang He, Brian Davison, Xiang Li, Hui Ren, Sunyang Fu, James Zou, Wei Liu, Jing Huang, Chen Chen, Yuyin Zhou, Tianming Liu, Xun Chen, Yong Chen, Quanzheng Li, Hongfang Liu, and Lichao Sun. 2024. Biomedgpt: A unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks.
  • Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models.