Unlocking the Power of Spatial and Temporal Information
in Medical Multimodal Pre-training

**xia Yang    Bing Su    Wayne Xin Zhao    Ji-Rong Wen
Abstract

Medical vision-language pre-training methods mainly leverage the correspondence between paired medical images and radiological reports. Although multi-view spatial images and temporal sequences of image-report pairs are available in off-the-shelf multi-modal medical datasets, most existing methods have not thoroughly tapped into such extensive supervision signals. In this paper, we introduce the Med-ST framework for fine-grained spatial and temporal modeling to exploit information from multiple spatial views of chest radiographs and temporal historical records. For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views. To achieve a more comprehensive alignment, Med-ST not only establishes the global alignment between whole images and texts but also introduces modality-weighted local alignment between text tokens and spatial regions of images. For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward map** classification (FMC) and reverse map** regression (RMR). By perceiving temporal information from simple to complex, Med-ST can learn temporal semantics. Experimental results across four distinct tasks demonstrate the effectiveness of Med-ST, especially in temporal classification tasks. Our code and model are available at https://github.com/SVT-Yang/MedST.

Machine Learning, ICML

1 Introduction

Refer to caption
Figure 1: The motivation and our framework: (a) The practice of physicians in a clinical setting. (b) We propose the Med-ST framework, which explicitly designs spatial and temporal modeling by mining spatio-temporal supervision signals from the dataset.

Medical vision-language pre-training aims to learn visual and textual representations from paired radiological reports and images. The learned representations can be adapted to various downstream tasks, which have great potential value for clinical auxiliary diagnosis and so on. Existing medical vision language pre-training methods generally focus on utilizing the correspondences between paired images and texts (Huang et al., 2021; Wang et al., 2022a; Zhou et al., 2022; Wan et al., 2023; Zhou et al., 2023). The radiological report is aligned with the frontal medical image through contrastive learning or masked modality reconstruction (Cheng et al., 2023; Zhou et al., 2023; Moon et al., 2022). Although the learned visual representations can encode medical semantic information, they cannot fully capture fine-grained spatial information in different views, nor can they distinguish fine-grained temporal differences between image-text pairs of the same patient at different times.

The applicability of medical vision-language pre-training models significantly depends on their capacity to mimic human cognitive process in interpreting radiological reports and images. Typically, in a clinical setting as shown in Figure 1(a), the doctor’s diagnosis relies on the comprehensive examination results, considering frontal views and lateral views. This process also involves reviewing the patient’s historical medical records to comprehend past symptoms and trends. Fortunately, such comprehensive spatio-temporal information is available in off-the-shelf medical multimodal datasets. Besides image-text pairs, these datasets naturally embody parts of two types of supervision signals: spatial information (frontal and lateral views) and temporal information (chronological order of diagnoses).

However, these supervision signals have not been fully explored by existing medical pre-training methods. For spatial information, images of lateral views are either ignored or treated identically to those of frontal views  (Huang et al., 2021; Wang et al., 2022a; Zhou et al., 2023; Dawidowicz et al., 2023; Shu et al., 2024). Regarding temporal information, existing work only utilizes a single prior image and does not design objectives specifically for this self-supervised signal (Bannur et al., 2023). The temporal information in historical sequences of image-report pairs has not yet been explored.

In this work, we introduce the Med-ST framework, which jointly exploits the comprehensive spatial and temporal information within off-the-shelf medical datasets to supervise the pre-training of visual and textual representations. As visualized in Fig. 1(b), Med-ST comprises two main components: spatial modeling and temporal modeling. In spatial modeling, we introduce the Mixture of View Expert (MoVE) architecture to construct the multi-view image encoder. We feed both frontal and lateral views into the encoder, which utilizes two experts to extract complementary information from different spatial perspectives. Features generated by both experts are integrated to form a joint visual representation of these varied spatial angles. Visual representations are then aligned with textual representations through contrastive learning (Oord et al., 2018). Recognizing that pathological areas occupy only a part of the images or texts, we propose modality-weighted local alignment, which assigns different weights to different local image patches and text tokens pairs based on the information they contain, achieving a fine-grained local alignment. For temporal modeling, we encourage the learned image-text feature sequences to express the same semantic changes, allowing the pre-training model to gain more supervision signals. Motivated by Dwibedi et al. (2019), we perform bidirectional cycle consistency between sequences of different modalities. Since the model has never explicitly perceived temporal sequence information, directly learning such information is challenging. Therefore, we adopt a novel bidirectional learning approach from simple to complex. In the forward process, we design a relatively simple classification loss to initially perceive information in the sequence. In the reverse process, we add a Gaussian prior for regression. We penalize inconsistencies by transforming the mismatch measurements into classification and regression targets. Through a bidirectional process from simple to complex, our model perceives the information of the sequence context, thus being able to capture changes. By incorporating spatial and temporal modeling, the global multi-view visual representation and textual representation interact in both spatial and temporal dimensions. In this way, our Med-ST gains the ability to jointly encode fine-grained spatiotemporal information into multi-modal representations.

The contributions of our study are outlined as follows:

  • We thoroughly explore the information in medical multimodal datasets without additional manual labeling. Beyond text-image pairing, we leverage multi-view spatial data and historical temporal data, yielding a richer set of supervision signals.

  • Our spatial modeling utilizes the MoVE architecture to tackle both frontal and lateral views with specialized experts, and introduces modality-weighted local alignment to establish fine-grained contrastive learning between spatial image regions and semantic tokens.

  • For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective that progresses from simple to complex. By forward map** classification and reverse map** regression, our model becomes capable of perceiving the context of sequences.

  • We evaluate the performance of our method in temporal tasks and medical image classification tasks. The results demonstrate the effectiveness of our method.

2 Related Work

Vision-Language Pre-training Models. Multimodal vision-language pre-training (VLP) is a field of research that aims to jointly learn representations of both visual and textual data. It trains deep neural networks on large-scale datasets that contain both image and text data, mainly image-text pairs (Radford et al., 2021; Li et al., 2021, 2023; Wang et al., 2022b; Alayrac et al., 2022; Chen et al., 2022; Bao et al., 2022; Zhang et al., 2023; Lu et al., 2023). By learning to represent images and text in a shared embedding space, multimodal pre-training can generate cross-modal representations that capture the semantic meaning of both modalities. Due to the impressive performance of VLP in the natural domain, it has also gained significant popularity in other fields, including the medical domain. It is important to note that directly transferring VLP models to the medical domain may not be feasible due to differences between the two domains (Chambon et al., 2022a, b).

Representation Learning in Medical Domain. The research on vision-language pre-training in medical domain can be divided into two branches: one is leveraging external knowledge bases (Wu et al., 2023; Wang et al., 2022c; Qin et al., 2022) and the other is fully utilizing existing datasets through specific model design (Huang et al., 2021; Wang et al., 2022a; Zhou et al., 2022, 2023; Cheng et al., 2023). In the first branch, MedKLIP and MedCLIP (Wu et al., 2023; Wang et al., 2022c) leverage entity extraction or descriptive information to enhance the model. Currently, several large language models are being applied in medical domain (Wang et al., 2023; Singhal et al., 2022; Shaib et al., 2023; Yunxiang et al., 2023), primarily by using medical instruction datasets to generate more accurate answers. However, utilizing external knowledge may be limited by the availability and completeness of the external knowledge base. In the other branch, MGCA (Wang et al., 2022a) learns generalized medical visual representations through leveraging the inherent semantic correlations and MRM (Zhou et al., 2023) learns knowledge-enhanced semantic representations by reconstructing both masked image patches and masked report tokens. However, these works only utilize the frontal view of the chest X-ray, discarding the lateral views (Huang et al., 2021; Wang et al., 2022a), or treat them the same as frontal views without exploring the relationship between them (Zhou et al., 2023; Dawidowicz et al., 2023). We believe there is an association between the frontal and lateral views that could be helpful for downstream tasks, and medical evidence has shown the value of lateral views in diagnosis (Raoof et al., 2012). Therefore, we attempt to better utilize the lateral view through spatial modeling.

Temporal Self-Supervised Learning in Medical Domain. Effective representations can be learned from time series data through self-supervision, eliminating the need for manual annotation. Specifically, medical time series data often contain abundant semantic information. BioViL-T (Bannur et al., 2023) exploits time-related information by using prior images, thereby gaining additional supervision. However, BioViL-T only utilizes a single prior image and lacks a temporal optimization objective. Our approach considers both text and image pairs in a temporal context and designs a time-related objective, i.e., cross-modality bidirectional cycle consistency, to better leverage the information provided by time series.

Refer to caption
Figure 2: The framework of Med-ST. For spatial modeling, Med-ST extracts integrated multi-view visual representations, performs global alignment between paired global features, and introduces modality-weighted local alignment between paired local features. For temporal modeling, global image and text features in all pairs form two sequences respectively, and Med-ST imposes cross-modality bidirectional cycle consistency constraints between them.

3 Method

3.1 Overview

Existing medical datasets contain some inherent spatial and temporal supervision signals that have been previously overlooked, obviating the need for supplementary manual annotations. A collection of images associated with a single report is referred to as a study111https://physionet.org/content/mimic-cxr/2.0.0/. Given a dataset comprising multiple studies, each study typically includes a frontal view image along with its corresponding radiological report. In some cases, lateral views may be present. All studies of a patient have a chronological order. As a result, we can extract matched multi-view data and time-series data from the dataset, which provide spatial and temporal supervision signals for pre-training multi-modal representations.

Formally, we reorganize the data in a medical dataset 𝒟𝒟\mathcal{D}caligraphic_D into a number of sequences, where each sequence contains a series of image-text pairs and typically corresponds to the historical diagnosis records of a patient. Some image-text pairs do not belong to any sequences, e.g., patients have only one diagnosis record. We regard such individual image-text pairs as sequences with a length of 1111. Therefore, 𝒟={𝑺1,𝑺2,,𝑺|𝒟|}𝒟subscript𝑺1subscript𝑺2subscript𝑺𝒟\mathcal{D}=\{\bm{S}_{1},\bm{S}_{2},\cdots,\bm{S}_{|\mathcal{D}|}\}caligraphic_D = { bold_italic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , bold_italic_S start_POSTSUBSCRIPT | caligraphic_D | end_POSTSUBSCRIPT }, where |𝒟|𝒟|\mathcal{D}|| caligraphic_D | is the number of sequences in the dataset. For the i𝑖iitalic_i-th sequence, 𝑺i={(𝑰i,tf,𝑰i,tl,𝑻i,t)}t=1|𝑺i|subscript𝑺𝑖superscriptsubscriptsubscriptsuperscript𝑰𝑓𝑖𝑡subscriptsuperscript𝑰𝑙𝑖𝑡subscript𝑻𝑖𝑡𝑡1subscript𝑺𝑖\bm{S}_{i}=\{(\bm{I}^{f}_{i,t},\bm{I}^{l}_{i,t},\bm{T}_{i,t})\}_{t=1}^{|\bm{S}% _{i}|}bold_italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { ( bold_italic_I start_POSTSUPERSCRIPT italic_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT , bold_italic_I start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT , bold_italic_T start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_POSTSUPERSCRIPT, where 𝑰i,tfsubscriptsuperscript𝑰𝑓𝑖𝑡\bm{I}^{f}_{i,t}bold_italic_I start_POSTSUPERSCRIPT italic_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT, 𝑰i,tlsubscriptsuperscript𝑰𝑙𝑖𝑡\bm{I}^{l}_{i,t}bold_italic_I start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT, and 𝑻i,tsubscript𝑻𝑖𝑡\bm{T}_{i,t}bold_italic_T start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT are the frontal image, the corresponding lateral image, and the related text at the t𝑡titalic_t-th timestep, and |𝑺i|subscript𝑺𝑖|\bm{S}_{i}|| bold_italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | is the number of image-text pairs in this sequence. Not every image-text pair has a lateral view, i.e., for some i𝑖iitalic_i and t𝑡titalic_t, 𝑰i,tlsubscriptsuperscript𝑰𝑙𝑖𝑡\bm{I}^{l}_{i,t}bold_italic_I start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT does not exist, and we set it to a tensor with all zeros in this case.

Our proposed model Med-ST leverages both spatial and temporal modeling to learn multi-modal representations from such sequence data of multi-view image-text pairs. Med-ST consists of a multi-view image encoder fIsubscript𝑓𝐼f_{I}italic_f start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT and a text encoder fTsubscript𝑓𝑇f_{T}italic_f start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT. The image encoder employs the proposed MoVE architecture to extract integrated representations from frontal and lateral images, i.e., [𝒙i,tg,𝑿i,t]=fI([𝑰i,tf,𝑰i,tl])subscriptsuperscript𝒙𝑔𝑖𝑡subscript𝑿𝑖𝑡subscript𝑓𝐼subscriptsuperscript𝑰𝑓𝑖𝑡subscriptsuperscript𝑰𝑙𝑖𝑡[\bm{x}^{g}_{i,t},\bm{X}_{i,t}]=f_{I}([\bm{I}^{f}_{i,t},\bm{I}^{l}_{i,t}])[ bold_italic_x start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT , bold_italic_X start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ] = italic_f start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ( [ bold_italic_I start_POSTSUPERSCRIPT italic_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT , bold_italic_I start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ] ), where 𝒙i,tgdsubscriptsuperscript𝒙𝑔𝑖𝑡superscript𝑑\bm{x}^{g}_{i,t}\in\mathbb{R}^{d}bold_italic_x start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT is the global image feature, 𝑿i,tM×dsubscript𝑿𝑖𝑡superscript𝑀𝑑\bm{X}_{i,t}\in\mathbb{R}^{M\times d}bold_italic_X start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_M × italic_d end_POSTSUPERSCRIPT is the sequence of patch-wise features and M𝑀Mitalic_M is the number of tokens. The text encoder extracts the text representation, i.e., [𝒓i,tg,𝑹i,t]=fT(𝑻i,tf)subscriptsuperscript𝒓𝑔𝑖𝑡subscript𝑹𝑖𝑡subscript𝑓𝑇subscriptsuperscript𝑻𝑓𝑖𝑡[\bm{r}^{g}_{i,t},\bm{R}_{i,t}]=f_{T}(\bm{T}^{f}_{i,t})[ bold_italic_r start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT , bold_italic_R start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ] = italic_f start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ( bold_italic_T start_POSTSUPERSCRIPT italic_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ), where 𝒓i,tgdsubscriptsuperscript𝒓𝑔𝑖𝑡superscript𝑑\bm{r}^{g}_{i,t}\in\mathbb{R}^{d}bold_italic_r start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT is the global text feature and 𝑹i,tW×dsubscript𝑹𝑖𝑡superscript𝑊𝑑\bm{R}_{i,t}\in\mathbb{R}^{W\times d}bold_italic_R start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_W × italic_d end_POSTSUPERSCRIPT is the sequence of token-wise features and W𝑊Witalic_W is the number of tokens. As shown in Fig. 2, in the training phase, for each input sequence 𝑺𝑺\bm{S}bold_italic_S, Med-ST first extracts the image and text representations for all time steps: {𝒙tg,𝑿t,𝒓tg,𝑹t}t=1|𝑺|superscriptsubscriptsubscriptsuperscript𝒙𝑔𝑡subscript𝑿𝑡subscriptsuperscript𝒓𝑔𝑡subscript𝑹𝑡𝑡1𝑺\{\bm{x}^{g}_{t},\bm{X}_{t},\bm{r}^{g}_{t},\bm{R}_{t}\}_{t=1}^{|\bm{S}|}{ bold_italic_x start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_r start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_italic_S | end_POSTSUPERSCRIPT. By viewing image-text pairs in all sequences as independent, Med-ST performs global alignment between matched global image and text features {𝒙tg,𝒓tg}subscriptsuperscript𝒙𝑔𝑡subscriptsuperscript𝒓𝑔𝑡\{\bm{x}^{g}_{t},\bm{r}^{g}_{t}\}{ bold_italic_x start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_r start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT }, and introduces modality-weighted local alignment between matched sequences of patch-wise and token-wise features {𝑿t,𝑹t}subscript𝑿𝑡subscript𝑹𝑡\{\bm{X}_{t},\bm{R}_{t}\}{ bold_italic_X start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT }. For temporal modeling, the global image and text features form two feature sequences, 𝑿g=[𝒙1g,𝒙2g,,𝒙|𝑺|g]superscript𝑿𝑔subscriptsuperscript𝒙𝑔1subscriptsuperscript𝒙𝑔2subscriptsuperscript𝒙𝑔𝑺\bm{X}^{g}=[\bm{x}^{g}_{1},\bm{x}^{g}_{2},\cdots,\bm{x}^{g}_{|\bm{S}|}]bold_italic_X start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT = [ bold_italic_x start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , bold_italic_x start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT | bold_italic_S | end_POSTSUBSCRIPT ] and 𝑹g=[𝒓1g,𝒓2g,,𝒓|𝑺|g]superscript𝑹𝑔subscriptsuperscript𝒓𝑔1subscriptsuperscript𝒓𝑔2subscriptsuperscript𝒓𝑔𝑺\bm{R}^{g}=[\bm{r}^{g}_{1},\bm{r}^{g}_{2},\cdots,\bm{r}^{g}_{|\bm{S}|}]bold_italic_R start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT = [ bold_italic_r start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_r start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , bold_italic_r start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT | bold_italic_S | end_POSTSUBSCRIPT ], respectively, and Med-ST imposes the cross-modality cycle consistency constraints to align 𝑿gsuperscript𝑿𝑔\bm{X}^{g}bold_italic_X start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT and 𝑹gsuperscript𝑹𝑔\bm{R}^{g}bold_italic_R start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT.

3.2 Spatial Modeling

3.2.1 Image Feature Extraction

Based on Vision Transformer (ViT) (Dosovitskiy et al., 2020), we develop a Mixture of View Experts (MoVE) architecture as fIsubscript𝑓𝐼f_{I}italic_f start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT to encode the multi-view images. As shown in Fig. 3, for a pair of frontal and lateral images (𝑰f,𝑰l)superscript𝑰𝑓superscript𝑰𝑙(\bm{I}^{f},\bm{I}^{l})( bold_italic_I start_POSTSUPERSCRIPT italic_f end_POSTSUPERSCRIPT , bold_italic_I start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ) in a sequence sample 𝑺𝑺\bm{S}bold_italic_S, we initially divide them into patches, respectively, which are then flattened through a projector to obtain a series of patch embeddings {𝒑jf}j=1npsuperscriptsubscriptsuperscriptsubscript𝒑𝑗𝑓𝑗1subscript𝑛𝑝\{\bm{p}_{j}^{f}\}_{j=1}^{n_{p}}{ bold_italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_f end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, {𝒑jl}j=1npsuperscriptsubscriptsuperscriptsubscript𝒑𝑗𝑙𝑗1subscript𝑛𝑝\{\bm{p}_{j}^{l}\}_{j=1}^{n_{p}}{ bold_italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_POSTSUPERSCRIPT with dimension D𝐷Ditalic_D. We add a learnable embedding 𝒑CLSsubscript𝒑CLS\bm{p}_{\text{CLS}}bold_italic_p start_POSTSUBSCRIPT CLS end_POSTSUBSCRIPT. To preserve the positional information of patches in each image, we introduce position encoding 𝑬pos(np+1)×Dsubscript𝑬𝑝𝑜𝑠superscriptsubscript𝑛𝑝1𝐷\bm{E}_{pos}\in\mathbb{R}^{({n_{p}}+1)\times D}bold_italic_E start_POSTSUBSCRIPT italic_p italic_o italic_s end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT ( italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT + 1 ) × italic_D end_POSTSUPERSCRIPT to both frontal and lateral patches, respectively. All patch embeddings from both views are concatenated to form the embedding sequence 𝑯0vsuperscriptsubscript𝑯0𝑣\bm{H}_{0}^{v}bold_italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_v end_POSTSUPERSCRIPT of the image encoder as follows, which serves as the input of the image encoder.

𝑯0v=(𝒑CLS+𝑬pos[0,:])({𝒑jf}j=1np+𝑬pos[1:,:])({𝒑jl}j=1np+𝑬pos[1:,:])\begin{split}\bm{H}_{0}^{v}=&\left(\bm{p}_{\text{CLS}}+\bm{E}_{pos}[0,:]\right% )\oplus(\{\bm{p}_{j}^{f}\}_{j=1}^{n_{p}}+\bm{E}_{pos}[1:,:])\\ &\oplus(\{\bm{p}_{j}^{l}\}_{j=1}^{n_{p}}+\bm{E}_{pos}[1:,:])\end{split}start_ROW start_CELL bold_italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_v end_POSTSUPERSCRIPT = end_CELL start_CELL ( bold_italic_p start_POSTSUBSCRIPT CLS end_POSTSUBSCRIPT + bold_italic_E start_POSTSUBSCRIPT italic_p italic_o italic_s end_POSTSUBSCRIPT [ 0 , : ] ) ⊕ ( { bold_italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_f end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + bold_italic_E start_POSTSUBSCRIPT italic_p italic_o italic_s end_POSTSUBSCRIPT [ 1 : , : ] ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ⊕ ( { bold_italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_POSTSUPERSCRIPT + bold_italic_E start_POSTSUBSCRIPT italic_p italic_o italic_s end_POSTSUBSCRIPT [ 1 : , : ] ) end_CELL end_ROW (1)

where direct-sum\oplus is the concatenation operation and 𝑯0v(2np+1)×Dsuperscriptsubscript𝑯0𝑣superscript2subscript𝑛𝑝1𝐷\bm{H}_{0}^{v}\in\mathbb{R}^{(2{n_{p}}+1)\times D}bold_italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_v end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT ( 2 italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT + 1 ) × italic_D end_POSTSUPERSCRIPT.

To enable the model to perceive distinctions between the two views, we employ our MoVE architecture to replace the standard transformer’s block in ViT. Specifically, after obtaining the output 𝑯l1subscript𝑯𝑙1\bm{H}_{l-1}bold_italic_H start_POSTSUBSCRIPT italic_l - 1 end_POSTSUBSCRIPT from the previous block, multi-head self-attention layers are performed to capture shared spatial dependencies among all patch embeddings. Then the first np+1subscript𝑛𝑝1{n_{p}}+1italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT + 1 vectors are processed through a Frontal Feedforward Network (F-FFN), while the latter npsubscript𝑛𝑝n_{p}italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT vectors go through a Lateral Feedforward Network (L-FFN). These two FFNs act as distinct experts for capturing complementary information from different perspectives. The outputs of the two experts are concatenated to form the output 𝑯lsubscript𝑯𝑙\bm{H}_{l}bold_italic_H start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT of this block. The image encoder consists of L𝐿Litalic_L MoVE-based blocks. For the output 𝑯Lsubscript𝑯𝐿\bm{H}_{L}bold_italic_H start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT of the final block, 𝒙g=𝑯L[0,:]superscript𝒙𝑔subscript𝑯𝐿0:\bm{x}^{g}=\bm{H}_{L}[0,:]bold_italic_x start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT = bold_italic_H start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT [ 0 , : ] is the global multi-view image feature and 𝑿=𝑯L[0,:]𝑿subscript𝑯𝐿0:\bm{X}=\bm{H}_{L}[0,:]bold_italic_X = bold_italic_H start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT [ 0 , : ] is the sequence of patch-wise features.

3.2.2 Cross Modal alignment

Global Alignment. For the t𝑡titalic_t-th pair (𝑰i,tf,𝑰i,tl,𝑻i,t)subscriptsuperscript𝑰𝑓𝑖𝑡subscriptsuperscript𝑰𝑙𝑖𝑡subscript𝑻𝑖𝑡(\bm{I}^{f}_{i,t},\bm{I}^{l}_{i,t},\bm{T}_{i,t})( bold_italic_I start_POSTSUPERSCRIPT italic_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT , bold_italic_I start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT , bold_italic_T start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ) of the i𝑖iitalic_i-th sequence in the batch, we obtain [𝒙i,tg,𝑿i,t]=fI([𝑰i,tf,𝑰i,tl])subscriptsuperscript𝒙𝑔𝑖𝑡subscript𝑿𝑖𝑡subscript𝑓𝐼subscriptsuperscript𝑰𝑓𝑖𝑡subscriptsuperscript𝑰𝑙𝑖𝑡[\bm{x}^{g}_{i,t},\bm{X}_{i,t}]=f_{I}([\bm{I}^{f}_{i,t},\bm{I}^{l}_{i,t}])[ bold_italic_x start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT , bold_italic_X start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ] = italic_f start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ( [ bold_italic_I start_POSTSUPERSCRIPT italic_f end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT , bold_italic_I start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT ] ). We also use the text encoder to extract the global text feature 𝒓i,tgsubscriptsuperscript𝒓𝑔𝑖𝑡\bm{r}^{g}_{i,t}bold_italic_r start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT and the sequence of token features 𝑹i,tsubscript𝑹𝑖𝑡\bm{R}_{i,t}bold_italic_R start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT. Following CLIP (Radford et al., 2021), we employ contrastive learning to bring global features of matching images and radiological reports closer while pushing global features of non-matching pairs apart.

li,tv2tsuperscriptsubscript𝑙𝑖𝑡𝑣2𝑡\displaystyle l_{i,t}^{v2t}italic_l start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_v 2 italic_t end_POSTSUPERSCRIPT =logexp(sim(𝒙i,tg,𝒓i,tg)/τ1)b=1Bk=1|𝑺k|exp(sim(𝒙i,tg,𝒓b,kg)/τ1),absent𝑠𝑖𝑚superscriptsubscript𝒙𝑖𝑡𝑔superscriptsubscript𝒓𝑖𝑡𝑔subscript𝜏1superscriptsubscript𝑏1𝐵superscriptsubscript𝑘1subscript𝑺𝑘𝑠𝑖𝑚superscriptsubscript𝒙𝑖𝑡𝑔superscriptsubscript𝒓𝑏𝑘𝑔subscript𝜏1\displaystyle=-\log\frac{\exp(sim(\bm{x}_{i,t}^{g},\bm{r}_{i,t}^{g})/\tau_{1})% }{\sum_{b=1}^{B}\sum_{k=1}^{|\bm{S}_{k}|}\exp(sim(\bm{x}_{i,t}^{g},\bm{r}_{b,k% }^{g})/\tau_{1})},= - roman_log divide start_ARG roman_exp ( italic_s italic_i italic_m ( bold_italic_x start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT , bold_italic_r start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) / italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_b = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | end_POSTSUPERSCRIPT roman_exp ( italic_s italic_i italic_m ( bold_italic_x start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT , bold_italic_r start_POSTSUBSCRIPT italic_b , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) / italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) end_ARG , (2)
li,tt2vsuperscriptsubscript𝑙𝑖𝑡𝑡2𝑣\displaystyle l_{i,t}^{t2v}italic_l start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t 2 italic_v end_POSTSUPERSCRIPT =logexp(sim(𝒓i,tg,𝒙i,tg)/τ1)b=1Bk=1|𝑺k|exp(sim(𝒓i,tg,𝒙b,kg)/τ1).absent𝑠𝑖𝑚superscriptsubscript𝒓𝑖𝑡𝑔superscriptsubscript𝒙𝑖𝑡𝑔subscript𝜏1superscriptsubscript𝑏1𝐵superscriptsubscript𝑘1subscript𝑺𝑘𝑠𝑖𝑚superscriptsubscript𝒓𝑖𝑡𝑔superscriptsubscript𝒙𝑏𝑘𝑔subscript𝜏1\displaystyle=-\log\frac{\exp(sim(\bm{r}_{i,t}^{g},\bm{x}_{i,t}^{g})/\tau_{1})% }{\sum_{b=1}^{B}\sum_{k=1}^{|\bm{S}_{k}|}\exp(sim(\bm{r}_{i,t}^{g},\bm{x}_{b,k% }^{g})/\tau_{1})}.= - roman_log divide start_ARG roman_exp ( italic_s italic_i italic_m ( bold_italic_r start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT , bold_italic_x start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) / italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_b = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | end_POSTSUPERSCRIPT roman_exp ( italic_s italic_i italic_m ( bold_italic_r start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT , bold_italic_x start_POSTSUBSCRIPT italic_b , italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) / italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) end_ARG . (3)

Here, B𝐵Bitalic_B is the batch size, sim(,)𝑠𝑖𝑚sim(\cdot,\cdot)italic_s italic_i italic_m ( ⋅ , ⋅ ) is the similarity function and is measured by dot product. The temperature variable τ1subscript𝜏1\tau_{1}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is used to scale the logits. So the objective of Global Alignment is the average of the two losses:

global=12|𝒟|i=1|𝒟|1|𝑺i|t=1|𝑺i|(li,tv2t+li,tt2v).subscript𝑔𝑙𝑜𝑏𝑎𝑙12𝒟superscriptsubscript𝑖1𝒟1subscript𝑺𝑖superscriptsubscript𝑡1subscript𝑺𝑖superscriptsubscript𝑙𝑖𝑡𝑣2𝑡superscriptsubscript𝑙𝑖𝑡𝑡2𝑣\mathcal{L}_{global}=\frac{1}{2|\mathcal{D}|}\sum_{i=1}^{|\mathcal{D}|}\frac{1% }{|\bm{S}_{i}|}\sum_{t=1}^{|\bm{S}_{i}|}(l_{i,t}^{v2t}+l_{i,t}^{t2v}).caligraphic_L start_POSTSUBSCRIPT italic_g italic_l italic_o italic_b italic_a italic_l end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG 2 | caligraphic_D | end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | caligraphic_D | end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG | bold_italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_POSTSUPERSCRIPT ( italic_l start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_v 2 italic_t end_POSTSUPERSCRIPT + italic_l start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t 2 italic_v end_POSTSUPERSCRIPT ) . (4)
Refer to caption
Figure 3: The multi-view image encoder. We input both the frontal and lateral views into MoVE blocks. The right side shows the structure of MoVE.

Modality-Weighted Local Alignment. Considering that pathological regions may only occupy specific portions of both the image and text, we introduce a novel modality-weighted local alignment between 𝑿i,tsubscript𝑿𝑖𝑡\bm{X}_{i,t}bold_italic_X start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT and 𝑹i,tsubscript𝑹𝑖𝑡\bm{R}_{i,t}bold_italic_R start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT, i.e., paired sequences of patch and token features, to perform fine-grained spatial modeling.

For the j𝑗jitalic_j-th text token feature 𝒓i,t,jsubscript𝒓𝑖𝑡𝑗\bm{r}_{i,t,j}bold_italic_r start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT in 𝑹i,tsubscript𝑹𝑖𝑡\bm{R}_{i,t}bold_italic_R start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT, we compute the cosine similarity between 𝒓i,t,jsubscript𝒓𝑖𝑡𝑗\bm{r}_{i,t,j}bold_italic_r start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT and all image patch features in 𝑿i,tsubscript𝑿𝑖𝑡\bm{X}_{i,t}bold_italic_X start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT, then we apply softmax function to the similarity scores. Thus we get s=softmax(𝒓i,t,j𝑿i,tT)𝑠softmaxsubscript𝒓𝑖𝑡𝑗superscriptsubscript𝑿superscript𝑖,𝑡𝑇s=\text{softmax}(\bm{r}_{i,t,j}\bm{X}_{i^{,}t}^{T})italic_s = softmax ( bold_italic_r start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT bold_italic_X start_POSTSUBSCRIPT italic_i start_POSTSUPERSCRIPT , end_POSTSUPERSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ). These scores are then multiplied by the corresponding image patch features and aggregated. As a result, we generate a new textual-attended visual representation 𝒌i,t,jsubscript𝒌𝑖𝑡𝑗\bm{k}_{i,t,j}bold_italic_k start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT. Our goal of local alignment is to align 𝒓i,t,jsubscript𝒓𝑖𝑡𝑗\bm{r}_{i,t,j}bold_italic_r start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT and its corresponding textual-attended visual representation 𝒌i,t,jsubscript𝒌𝑖𝑡𝑗\bm{k}_{i,t,j}bold_italic_k start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT.

li,t,jv2t=logexp(sim(𝒌i,t,j,𝒓i,t,j)/τ2)c=1Wexp(sim(𝒌i,t,j,𝒓i,t,c)/τ2),superscriptsubscript𝑙𝑖𝑡𝑗𝑣2𝑡𝑠𝑖𝑚subscript𝒌𝑖𝑡𝑗subscript𝒓𝑖𝑡𝑗subscript𝜏2superscriptsubscript𝑐1𝑊𝑠𝑖𝑚subscript𝒌𝑖𝑡𝑗subscript𝒓𝑖𝑡𝑐subscript𝜏2l_{i,t,j}^{v2t}=-\log\frac{\exp(sim(\bm{k}_{i,t,j},\bm{r}_{i,t,j})/\tau_{2})}{% \sum_{c=1}^{W}\exp(sim(\bm{k}_{i,t,j},\bm{r}_{i,t,c})/\tau_{2})},italic_l start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_v 2 italic_t end_POSTSUPERSCRIPT = - roman_log divide start_ARG roman_exp ( italic_s italic_i italic_m ( bold_italic_k start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT , bold_italic_r start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT ) / italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_c = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W end_POSTSUPERSCRIPT roman_exp ( italic_s italic_i italic_m ( bold_italic_k start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT , bold_italic_r start_POSTSUBSCRIPT italic_i , italic_t , italic_c end_POSTSUBSCRIPT ) / italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_ARG , (5)
li,t,jt2v=logexp(sim(𝒓i,t,j,𝒌i,t,j)/τ2)c=1Wexp(sim(𝒓i,t,j,𝒌i,t,c/τ2).l_{i,t,j}^{t2v}=-\log\frac{\exp(sim(\bm{r}_{i,t,j},\bm{k}_{i,t,j})/\tau_{2})}{% \sum_{c=1}^{W}\exp(sim(\bm{r}_{i,t,j},\bm{k}_{i,t,c}/\tau_{2})}.italic_l start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t 2 italic_v end_POSTSUPERSCRIPT = - roman_log divide start_ARG roman_exp ( italic_s italic_i italic_m ( bold_italic_r start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT , bold_italic_k start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT ) / italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_c = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W end_POSTSUPERSCRIPT roman_exp ( italic_s italic_i italic_m ( bold_italic_r start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT , bold_italic_k start_POSTSUBSCRIPT italic_i , italic_t , italic_c end_POSTSUBSCRIPT / italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_ARG . (6)

Acknowledging that different local pairs contain varying amounts of information, those representing pathological semantics should be assigned greater importance. Different from Wang et al. (2022a) using averaged last-layer attention weight in a single modality, we considered the information contained in both visual and textual modalities. For the pair (𝒌i,t,j,𝒓i,t,j)subscript𝒌𝑖𝑡𝑗subscript𝒓𝑖𝑡𝑗(\bm{k}_{i,t,j},\bm{r}_{i,t,j})( bold_italic_k start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT , bold_italic_r start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT ), the weights are determined by comparing the value of this pair against the average value of all local pairs in 𝑿i,t,𝑹i,tsubscript𝑿𝑖𝑡subscript𝑹𝑖𝑡\bm{X}_{i,t},\bm{R}_{i,t}bold_italic_X start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT , bold_italic_R start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT. Firstly, we perform an element-wise multiplication of 𝒓i,t,jsubscript𝒓𝑖𝑡𝑗\bm{r}_{i,t,j}bold_italic_r start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT and 𝒌i,t,jsubscript𝒌𝑖𝑡𝑗\bm{k}_{i,t,j}bold_italic_k start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT, i.e., 𝒐i,t,j=𝒓i,t,j𝒌i,t,jsubscript𝒐𝑖𝑡𝑗direct-productsubscript𝒓𝑖𝑡𝑗subscript𝒌𝑖𝑡𝑗\bm{o}_{i,t,j}=\bm{r}_{i,t,j}\odot\bm{k}_{i,t,j}bold_italic_o start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT = bold_italic_r start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT ⊙ bold_italic_k start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT. These vectors capture relationships and importance across different dimensions and we form them into a matrix 𝓞i,t=[𝒐i,t,1,𝒐i,t,2,,𝒐i,t,W]subscript𝓞𝑖𝑡subscript𝒐𝑖𝑡1subscript𝒐𝑖𝑡2subscript𝒐𝑖𝑡𝑊\bm{\mathcal{O}}_{i,t}=[\bm{o}_{i,t,1},\bm{o}_{i,t,2},\dots,\bm{o}_{i,t,W}]bold_caligraphic_O start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT = [ bold_italic_o start_POSTSUBSCRIPT italic_i , italic_t , 1 end_POSTSUBSCRIPT , bold_italic_o start_POSTSUBSCRIPT italic_i , italic_t , 2 end_POSTSUBSCRIPT , … , bold_italic_o start_POSTSUBSCRIPT italic_i , italic_t , italic_W end_POSTSUBSCRIPT ]. Then we calculate the mean: o¯i,t=1Wj=1W𝒐i,t,jsubscript¯𝑜𝑖𝑡1𝑊superscriptsubscript𝑗1𝑊subscript𝒐𝑖𝑡𝑗\bar{o}_{i,t}=\frac{1}{W}\sum_{j=1}^{W}\bm{o}_{i,t,j}over¯ start_ARG italic_o end_ARG start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_W end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W end_POSTSUPERSCRIPT bold_italic_o start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT, from which we derive the weight through softmax attention score.

wi,t,j=(softmax(o¯i,t𝑸(𝓞i,t𝑲)Td))j,subscript𝑤𝑖𝑡𝑗subscriptsoftmaxsubscript¯𝑜𝑖𝑡𝑸superscriptsubscript𝓞𝑖𝑡𝑲𝑇𝑑𝑗w_{i,t,j}=\left(\text{softmax}\left(\frac{\bar{o}_{i,t}\bm{Q}{\left(\bm{% \mathcal{O}}_{i,t}\bm{K}\right)}^{T}}{\sqrt{d}}\right)\right)_{j},italic_w start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT = ( softmax ( divide start_ARG over¯ start_ARG italic_o end_ARG start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT bold_italic_Q ( bold_caligraphic_O start_POSTSUBSCRIPT italic_i , italic_t end_POSTSUBSCRIPT bold_italic_K ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG ) ) start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , (7)

where 𝑸d×d,𝑲d×dformulae-sequence𝑸superscript𝑑𝑑𝑲superscript𝑑𝑑\bm{Q}\in\mathbb{R}^{d\times d},\bm{K}\in\mathbb{R}^{d\times d}bold_italic_Q ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_d end_POSTSUPERSCRIPT , bold_italic_K ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_d end_POSTSUPERSCRIPT are learnable matrices. We apply wi,t,jsubscript𝑤𝑖𝑡𝑗w_{i,t,j}italic_w start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT as the weight of li,t,jsubscript𝑙𝑖𝑡𝑗l_{i,t,j}italic_l start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT and derive the Visual modality-weighted Local Alignment loss VLAsubscriptVLA\mathcal{L}_{\text{VLA}}caligraphic_L start_POSTSUBSCRIPT VLA end_POSTSUBSCRIPT as:

VLA=1|𝒟|Wi=1|𝒟|1|𝑺t|t=1|𝑺t|j=1Wwi,t,j(li,t,jv2t+li,t,jt2v).subscriptVLA1𝒟𝑊superscriptsubscript𝑖1𝒟1subscript𝑺𝑡superscriptsubscript𝑡1subscript𝑺𝑡superscriptsubscript𝑗1𝑊subscript𝑤𝑖𝑡𝑗superscriptsubscript𝑙𝑖𝑡𝑗𝑣2𝑡superscriptsubscript𝑙𝑖𝑡𝑗𝑡2𝑣\mathcal{L}_{\text{VLA}}=\frac{1}{|\mathcal{D}|W}\sum_{i=1}^{|\mathcal{D}|}% \frac{1}{|\bm{S}_{t}|}\sum_{t=1}^{|\bm{S}_{t}|}\sum_{j=1}^{W}w_{i,t,j}(l_{i,t,% j}^{v2t}+l_{i,t,j}^{t2v}).caligraphic_L start_POSTSUBSCRIPT VLA end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | caligraphic_D | italic_W end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | caligraphic_D | end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG | bold_italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_ARG ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT ( italic_l start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_v 2 italic_t end_POSTSUPERSCRIPT + italic_l start_POSTSUBSCRIPT italic_i , italic_t , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t 2 italic_v end_POSTSUPERSCRIPT ) . (8)

Correspondingly, we can derive the Texual Modality-Weighted Local Alignment Loss TLAsubscriptTLA\mathcal{L}_{\text{TLA}}caligraphic_L start_POSTSUBSCRIPT TLA end_POSTSUBSCRIPT. The objective of Local Alignment is:

local=VLA+TLA.subscript𝑙𝑜𝑐𝑎𝑙subscriptVLAsubscriptTLA\mathcal{L}_{local}=\mathcal{L}_{\text{VLA}}+\mathcal{L}_{\text{TLA}}.caligraphic_L start_POSTSUBSCRIPT italic_l italic_o italic_c italic_a italic_l end_POSTSUBSCRIPT = caligraphic_L start_POSTSUBSCRIPT VLA end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT TLA end_POSTSUBSCRIPT . (9)

3.3 Temporal Modeling

We have observed that there is inherent temporal information in the dataset, which can provide additional supervision signals without the need for extra manual labeling. The objective of our temporal modeling is to use this information to enhance the alignment of image and text sequences that convey the same temporal semantics. As mentioned in Sec. 3.1, for sequence 𝑺𝑺\bm{S}bold_italic_S, the global image and text features form two feature sequences, 𝑿g=[𝒙1g,𝒙2g,,𝒙|𝑺|g]superscript𝑿𝑔subscriptsuperscript𝒙𝑔1subscriptsuperscript𝒙𝑔2subscriptsuperscript𝒙𝑔𝑺\bm{X}^{g}=[\bm{x}^{g}_{1},\bm{x}^{g}_{2},\cdots,\bm{x}^{g}_{|\bm{S}|}]bold_italic_X start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT = [ bold_italic_x start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_x start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , bold_italic_x start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT | bold_italic_S | end_POSTSUBSCRIPT ] and 𝑹g=[𝒓1g,𝒓2g,,𝒓|𝑺|g]superscript𝑹𝑔subscriptsuperscript𝒓𝑔1subscriptsuperscript𝒓𝑔2subscriptsuperscript𝒓𝑔𝑺\bm{R}^{g}=[\bm{r}^{g}_{1},\bm{r}^{g}_{2},\cdots,\bm{r}^{g}_{|\bm{S}|}]bold_italic_R start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT = [ bold_italic_r start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_r start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , bold_italic_r start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT start_POSTSUBSCRIPT | bold_italic_S | end_POSTSUBSCRIPT ], respectively, and Med-ST imposes the cross-modality bidirectional cycle consistency constraints to align 𝑿gsuperscript𝑿𝑔\bm{X}^{g}bold_italic_X start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT and 𝑹gsuperscript𝑹𝑔\bm{R}^{g}bold_italic_R start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT.

Cycle Consistency. As illustrated in Figure 2, we want to establish bidirectional perception in the sequence. Given an image point 𝒙tgsuperscriptsubscript𝒙𝑡𝑔\bm{x}_{t}^{g}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT from 𝑿gsuperscript𝑿𝑔\bm{X}^{g}bold_italic_X start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT, we first map it to a textual point using function F𝐹Fitalic_F. Subsequently, the result of this map** is transformed back to an image point through function G𝐺Gitalic_G, thus we get 𝒙tg=G(F(𝒙tg))superscriptsubscript𝒙superscript𝑡𝑔𝐺𝐹superscriptsubscript𝒙𝑡𝑔\bm{x}_{t^{\prime}}^{g}=G(F(\bm{x}_{t}^{g}))bold_italic_x start_POSTSUBSCRIPT italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT = italic_G ( italic_F ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) ). If t=t𝑡superscript𝑡t=t^{\prime}italic_t = italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, the point 𝒙tgsuperscriptsubscript𝒙𝑡𝑔\bm{x}_{t}^{g}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT is cycle-consistent. The bidirectional matching mechanism of cycle consistency can perceive fine-grained contextual differences, thereby establishing a cross-modal match. Since our model has never explicitly perceived temporal sequence information, directly learning such information is challenging. Therefore, we adopt a novel bidirectional learning approach from simple to complex.

Forward Map** Classification (FMC). We first calculate the distance between 𝒙tgsuperscriptsubscript𝒙𝑡𝑔\bm{x}_{t}^{g}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT and all points in 𝑹gsuperscript𝑹𝑔\bm{R}^{g}bold_italic_R start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT and then we get the soft nearest neighbor F(𝒙tg)𝐹superscriptsubscript𝒙𝑡𝑔F(\bm{x}_{t}^{g})italic_F ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) of 𝒙tgsuperscriptsubscript𝒙𝑡𝑔\bm{x}_{t}^{g}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT:

F(𝒙tg)=z=1|𝑺|αz𝒓zg,αz=exp(𝒙tg,𝒓zg)k=1|𝑺|exp(𝒙tg,𝒓kg).formulae-sequence𝐹superscriptsubscript𝒙𝑡𝑔superscriptsubscript𝑧1𝑺subscript𝛼𝑧superscriptsubscript𝒓𝑧𝑔subscript𝛼𝑧superscriptsubscript𝒙𝑡𝑔superscriptsubscript𝒓𝑧𝑔superscriptsubscript𝑘1𝑺superscriptsubscript𝒙𝑡𝑔superscriptsubscript𝒓𝑘𝑔F(\bm{x}_{t}^{g})=\sum_{z=1}^{|\bm{S}|}\alpha_{z}\bm{r}_{z}^{g},~{}\alpha_{z}=% \frac{\exp(-\langle\bm{x}_{t}^{g},\bm{r}_{z}^{g}\rangle)}{\sum_{k=1}^{|\bm{S}|% }\exp(-\langle\bm{x}_{t}^{g},\bm{r}_{k}^{g}\rangle)}.italic_F ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_z = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_italic_S | end_POSTSUPERSCRIPT italic_α start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT bold_italic_r start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT , italic_α start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT = divide start_ARG roman_exp ( - ⟨ bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT , bold_italic_r start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ⟩ ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_italic_S | end_POSTSUPERSCRIPT roman_exp ( - ⟨ bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT , bold_italic_r start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ⟩ ) end_ARG . (10)

Since our sequence is paired, unlike Dwibedi et al. (2019), we can introduce constraints to the forward process. As our current model lacks temporal awareness, we incorporate a relatively simple classification loss in the forward process. We treat all features in sequence 𝑹gsuperscript𝑹𝑔\bm{R}^{g}bold_italic_R start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT as different classes and classify 𝒙tgsuperscriptsubscript𝒙𝑡𝑔\bm{x}_{t}^{g}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT accordingly. So the predicted output y^t=αtsubscript^𝑦𝑡subscript𝛼𝑡\hat{y}_{t}=\alpha_{t}over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_α start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. The ground truth 𝒚𝒚\bm{y}bold_italic_y is represented by a one-hot vector, in which the t𝑡titalic_t-th position is set to 1.

tcls=z=1|𝑺|yzlog(yz^).subscriptsuperscript𝑐𝑙𝑠𝑡superscriptsubscript𝑧1𝑺subscript𝑦𝑧^subscript𝑦𝑧\mathcal{L}^{cls}_{t}=-\sum_{z=1}^{|\bm{S}|}y_{z}\log(\hat{y_{z}}).caligraphic_L start_POSTSUPERSCRIPT italic_c italic_l italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = - ∑ start_POSTSUBSCRIPT italic_z = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_italic_S | end_POSTSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT roman_log ( over^ start_ARG italic_y start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG ) . (11)

As for sequence 𝑺𝑺\bm{S}bold_italic_S, the objective of FMC is as follows:

FMCI=t|𝑺|tcls.superscriptsubscriptFMC𝐼superscriptsubscript𝑡𝑺subscriptsuperscript𝑐𝑙𝑠𝑡\mathcal{L}_{\text{FMC}}^{I}=\sum_{t}^{|\bm{S}|}\mathcal{L}^{cls}_{t}.caligraphic_L start_POSTSUBSCRIPT FMC end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_italic_S | end_POSTSUPERSCRIPT caligraphic_L start_POSTSUPERSCRIPT italic_c italic_l italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT . (12)

Reverse Map** Regression (RMR). Since FMC enables cycle consistency from a relatively easy classification perspective, we want to incorporate a more complex objective in the process of reverse map**. We impose greater penalties on predictions that are temporally distant according to the distance between F(𝒙tg)𝐹superscriptsubscript𝒙𝑡𝑔F(\bm{x}_{t}^{g})italic_F ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) and 𝒙tgsuperscriptsubscript𝒙𝑡𝑔\bm{x}_{t}^{g}bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT. To address this, we propose RMR. Initially, we obtain the distribution of similarities {βz}subscript𝛽𝑧\{\beta_{z}\}{ italic_β start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT } between F(𝒙tg)𝐹superscriptsubscript𝒙𝑡𝑔F(\bm{x}_{t}^{g})italic_F ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) and all points in 𝑿gsuperscript𝑿𝑔\bm{X}^{g}bold_italic_X start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT.

βz=exp(F(𝒙tg),𝒙zg)k=1|𝑺|exp(F(𝒙tg),𝒙kg).subscript𝛽𝑧𝐹superscriptsubscript𝒙𝑡𝑔superscriptsubscript𝒙𝑧𝑔superscriptsubscript𝑘1𝑺𝐹superscriptsubscript𝒙𝑡𝑔superscriptsubscript𝒙𝑘𝑔\beta_{z}=\frac{\exp{(-\langle F(\bm{x}_{t}^{g}),\bm{x}_{z}^{g}\rangle)}}{\sum% _{k=1}^{|\bm{S}|}\exp{(-\langle F(\bm{x}_{t}^{g}),\bm{x}_{k}^{g}\rangle})}.italic_β start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT = divide start_ARG roman_exp ( - ⟨ italic_F ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) , bold_italic_x start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ⟩ ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_italic_S | end_POSTSUPERSCRIPT roman_exp ( - ⟨ italic_F ( bold_italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ) , bold_italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_g end_POSTSUPERSCRIPT ⟩ ) end_ARG . (13)

Representations that are temporally closer tend to be more similar. Thus the further the distance from timestep t𝑡titalic_t, the smaller the similarity should be. So we impose a Gaussian prior on β𝛽\betaitalic_β. Our goal is to maximize βtsubscript𝛽𝑡\beta_{t}italic_β start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and ensure that the similarity distribution peaks at t𝑡titalic_t. Therefore, we optimize |tμ|2σ2superscript𝑡𝜇2superscript𝜎2\frac{|t-\mu|^{2}}{\sigma^{2}}divide start_ARG | italic_t - italic_μ | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG. A regularization term log(σ)𝜎\log(\sigma)roman_log ( italic_σ ) is added to ensure the peak of the distribution remains intact and avoid a reduction in variance resulting from this optimization. Since the temporal information contained in the sequence may have a large span, for cases with larger errors, we optimize |tμ|𝑡𝜇|t-\mu|| italic_t - italic_μ |. So the objective of RMR for sequence 𝑺𝑺\bm{S}bold_italic_S is as follows:

RMRI={|tμ|2σ2+λlog(σ),|tμ|δ,δ|tμ|δ22,otherwise,\mathcal{L}_{\text{RMR}}^{I}=\begin{cases}\displaystyle\frac{|t-\mu|^{2}}{% \sigma^{2}}+\lambda\log(\sigma)&,|t-\mu|\leq\delta,\\ \delta|t-\mu|-\displaystyle\frac{\delta^{2}}{2}&,\text{otherwise,}\end{cases}caligraphic_L start_POSTSUBSCRIPT RMR end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT = { start_ROW start_CELL divide start_ARG | italic_t - italic_μ | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + italic_λ roman_log ( italic_σ ) end_CELL start_CELL , | italic_t - italic_μ | ≤ italic_δ , end_CELL end_ROW start_ROW start_CELL italic_δ | italic_t - italic_μ | - divide start_ARG italic_δ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 end_ARG end_CELL start_CELL , otherwise, end_CELL end_ROW (14)

where μ=z=1|𝑺|βzz𝜇superscriptsubscript𝑧1𝑺subscript𝛽𝑧𝑧\mu=\sum_{z=1}^{|\bm{S}|}\beta_{z}\cdot zitalic_μ = ∑ start_POSTSUBSCRIPT italic_z = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_italic_S | end_POSTSUPERSCRIPT italic_β start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ⋅ italic_z and σ2=z=1|𝑺|βz(zμ)2superscript𝜎2superscriptsubscript𝑧1𝑺subscript𝛽𝑧superscript𝑧𝜇2\sigma^{2}=\sum_{z=1}^{|\bm{S}|}\beta_{z}\cdot(z-\mu)^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_z = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_italic_S | end_POSTSUPERSCRIPT italic_β start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ⋅ ( italic_z - italic_μ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, λ𝜆\lambdaitalic_λ denotes the regularization weight, and δ𝛿\deltaitalic_δ is a hyperparameter.

Therefore, for all sequences in 𝒟𝒟\mathcal{D}caligraphic_D, we sum these two objectives to obtain temporalI=RMRI+FMCIsuperscriptsubscript𝑡𝑒𝑚𝑝𝑜𝑟𝑎𝑙𝐼superscriptsubscriptRMR𝐼superscriptsubscriptFMC𝐼\mathcal{L}_{temporal}^{I}=\mathcal{L}_{\text{RMR}}^{I}+\mathcal{L}_{\text{FMC% }}^{I}caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_m italic_p italic_o italic_r italic_a italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT = caligraphic_L start_POSTSUBSCRIPT RMR end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT + caligraphic_L start_POSTSUBSCRIPT FMC end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT. Similarly, starting from a point in the text sequence, we can get temporalTsuperscriptsubscript𝑡𝑒𝑚𝑝𝑜𝑟𝑎𝑙𝑇\mathcal{L}_{temporal}^{T}caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_m italic_p italic_o italic_r italic_a italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT. So the objective of temporal modeling is:

temporal=temporalI+temporalT.subscript𝑡𝑒𝑚𝑝𝑜𝑟𝑎𝑙superscriptsubscript𝑡𝑒𝑚𝑝𝑜𝑟𝑎𝑙𝐼superscriptsubscript𝑡𝑒𝑚𝑝𝑜𝑟𝑎𝑙𝑇\mathcal{L}_{temporal}=\mathcal{L}_{temporal}^{I}+\mathcal{L}_{temporal}^{T}.caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_m italic_p italic_o italic_r italic_a italic_l end_POSTSUBSCRIPT = caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_m italic_p italic_o italic_r italic_a italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_I end_POSTSUPERSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_m italic_p italic_o italic_r italic_a italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT . (15)

3.4 Overall Objective

Therefore, our overall objective is denoted as \mathcal{L}caligraphic_L in Eq. 16. By optimizing this objective, our model can perceive spatial and temporal information with greater granularity.

=global+λ1local+λ2temporal,subscript𝑔𝑙𝑜𝑏𝑎𝑙subscript𝜆1subscript𝑙𝑜𝑐𝑎𝑙subscript𝜆2subscript𝑡𝑒𝑚𝑝𝑜𝑟𝑎𝑙\mathcal{L}=\mathcal{L}_{global}+\lambda_{1}\mathcal{L}_{local}+\lambda_{2}% \mathcal{L}_{temporal},caligraphic_L = caligraphic_L start_POSTSUBSCRIPT italic_g italic_l italic_o italic_b italic_a italic_l end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_l italic_o italic_c italic_a italic_l end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_m italic_p italic_o italic_r italic_a italic_l end_POSTSUBSCRIPT , (16)

where λ1subscript𝜆1\lambda_{1}italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and λ2subscript𝜆2\lambda_{2}italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are hyperparameters that control the weights of the loss functions.

4 Experiments

Table 1: Temporal image classification results on the MS-CXR-T benchmark across five findings under 10-fold cross-validation setting. We run with different seeds three times and calculate the mean and standard deviations. Avg. Acc. stands for average accuracy. PL.effusion denotes pleural effusion. We report the accuracy [%]. Best results are in boldface.
Model Consolidation Edema Pl.effusion Pneumonia Pneumothorax Avg. Acc.
ConVIRT-MIMIC (Zhang et al., 2022) 46.62±plus-or-minus\pm± 1.03 57.04±plus-or-minus\pm± 1.00 54.50±plus-or-minus\pm± 0.19 61.09±plus-or-minus\pm± 1.54 48.49±plus-or-minus\pm± 0.22 53.55±plus-or-minus\pm± 0.36
GLoRIA-CheXpert (Huang et al., 2021) 49.13±plus-or-minus\pm± 1.52 53.67±plus-or-minus\pm± 0.61 49.56±plus-or-minus\pm± 1.84 58.11±plus-or-minus\pm± 2.21 45.64±plus-or-minus\pm± 2.49 51.22±plus-or-minus\pm± 1.35
GLoRIA-MIMIC (Huang et al., 2021) 44.99±plus-or-minus\pm± 0.47 49.13±plus-or-minus\pm± 1.54 47.85±plus-or-minus\pm± 3.08 60.18±plus-or-minus\pm± 1.11 41.53±plus-or-minus\pm± 0.92 48.74±plus-or-minus\pm± 0.84
MGCA (Wang et al., 2022a) 50.79±plus-or-minus\pm± 0.44 62.71±plus-or-minus\pm± 0.24 57.17±plus-or-minus\pm± 0.69 63.73±plus-or-minus\pm± 0.89 52.16±plus-or-minus\pm± 1.02 57.31±plus-or-minus\pm± 0.34
MedCLIP (Wang et al., 2022c) 50.32±plus-or-minus\pm± 0.65 54.53±plus-or-minus\pm± 2.82 56.61±plus-or-minus\pm± 0.65 58.94±plus-or-minus\pm± 2.62 45.19±plus-or-minus\pm± 1.47 53.12±plus-or-minus\pm± 0.13
MRM (Wang et al., 2022c) 46.33±plus-or-minus\pm± 2.89 49.09±plus-or-minus\pm± 5.70 49.13±plus-or-minus\pm± 0.69 55.14±plus-or-minus\pm± 1.42 45.48±plus-or-minus\pm± 2.97 49.03±plus-or-minus\pm± 0.80
BioViL (Boecking et al., 2022) 56.40±plus-or-minus\pm± 0.24 57.26±plus-or-minus\pm± 0.77 54.51±plus-or-minus\pm± 0.39 67.10±plus-or-minus\pm± 0.01 55.30±plus-or-minus\pm± 0.21 58.11±plus-or-minus\pm± 0.02
Temporal-based
BioViL-T (Bannur et al., 2023) 56.93±plus-or-minus\pm± 1.77 61.55±plus-or-minus\pm± 0.95 53.94±plus-or-minus\pm± 0.89 67.24±plus-or-minus\pm± 0.20 55.46±plus-or-minus\pm± 0.01 59.02±plus-or-minus\pm± 0.34
Med-ST 60.57±plus-or-minus\pm± 1.18 67.35±plus-or-minus\pm± 0.32 58.47±plus-or-minus\pm± 1.50 65.00±plus-or-minus\pm± 0.34 54.18±plus-or-minus\pm± 0.81 61.12±plus-or-minus\pm± 0.34
Table 2: Temporal sentence similarity classification results on RadGraph subset of MS-CXR-T sentence similarity benchmark. Accuracy and AUROC scores are reported. Best results are in boldface.
Model MLM Acc. AUROC
MedCLIP ×\times× 66.41 52.66
MGCA ×\times× 75.42 76.38
BioViL 69.49 68.92
BioViL-T 78.81 81.39
Med-ST ×\times× 83.76 84.60
Table 3: Zero-shot classification results on RSNA. We report Accuracy, F1 and AUROC scores. Best results are highlighted in bold.
Model Acc. F1 AUROC
MGCA 67.25 54.87 73.97
BioViL 64.10 55.19 75.27
BioViL-T 63.23 54.90 75.12
Med-ST 68.37 57.63 77.14
Table 4: Medical image classification results on COVIDx datasets with 1%, 10% and 100% training data.
Method 1% 10% 100%
GLoRIA 67.3 77.8 89.0
ConVIRT 72.5 82.5 92.0
GLoRIA-MIMIC 66.5 80.5 88.8
MedKLIP 74.5 85.2 90.3
MGCA 74.8 84.8 92.3
Med-ST 71.2 87.7 93.0
Table 5: Ablation study on objectives and lateral view use.
globalsubscript𝑔𝑙𝑜𝑏𝑎𝑙\mathcal{L}_{global}caligraphic_L start_POSTSUBSCRIPT italic_g italic_l italic_o italic_b italic_a italic_l end_POSTSUBSCRIPT temporalsubscript𝑡𝑒𝑚𝑝𝑜𝑟𝑎𝑙\mathcal{L}_{temporal}caligraphic_L start_POSTSUBSCRIPT italic_t italic_e italic_m italic_p italic_o italic_r italic_a italic_l end_POSTSUBSCRIPT localsubscript𝑙𝑜𝑐𝑎𝑙\mathcal{L}_{local}caligraphic_L start_POSTSUBSCRIPT italic_l italic_o italic_c italic_a italic_l end_POSTSUBSCRIPT Lateral Consolidation Edema Pl.effusion Pneumonia Pneumothorax Avg. Acc.
55.75±plus-or-minus\pm± 2.52 62.20±plus-or-minus\pm± 1.00 57.91±plus-or-minus\pm± 1.20 63.88±plus-or-minus\pm± 0.79 54.53±plus-or-minus\pm± 0.99 58.85±plus-or-minus\pm± 0.61
54.89±plus-or-minus\pm± 0.51 65.30±plus-or-minus\pm± 0.91 57.42±plus-or-minus\pm± 0.39 66.40±plus-or-minus\pm± 0.19 53.40±plus-or-minus\pm± 0.96 59.49±plus-or-minus\pm± 0.30
55.09±plus-or-minus\pm± 1.34 63.32±plus-or-minus\pm± 0.78 58.06±plus-or-minus\pm± 1.40 53.58±plus-or-minus\pm± 0.40 50.25±plus-or-minus\pm± 0.99 58.06±plus-or-minus\pm± 0.29
60.57±plus-or-minus\pm± 1.18 67.35±plus-or-minus\pm± 0.32 58.47±plus-or-minus\pm± 1.50 65.00±plus-or-minus\pm± 0.34 54.18±plus-or-minus\pm± 0.81 61.12±plus-or-minus\pm± 0.34

4.1 Pre-training Setup

Dataset. We pretrain our Med-ST framework on MIMIC-CXR dataset (Johnson et al., 2019). It is a publicly available dataset that includes paired chest X-ray images and radiology reports. Sometimes images incorporate both frontal and lateral views. All pairs have attributes of studydate and studytime, which allow us to construct sequences. We pre-process the dataset following Wang et al. (2022a), including image transformation and text tokenization, resulting in 232k𝑘kitalic_k image-text pairs. We preserve lateral views and thus 114k114𝑘114k114 italic_k pairs include lateral images.

Implementation Details. Our code is implemented using PyTorch (Paszke et al., 2019). The pre-training is performed using two GeForce RTX 3090 GPUs. Due to the effectiveness of pre-trained language models, we utilized BioClinicalBERT (Alsentzer et al., 2019) as the text encoder. For unified modal architecture design, we defaulted to using ViT-B/16 (Dosovitskiy et al., 2020) as the image encoder backbone. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 4e-5 and weight decay of 0.05. We employed a linear warmup with cosine annealing scheduler (Loshchilov & Hutter, 2016), initializing the learning rate to 1e-8 and the warmup epoch to 20. The τ1subscript𝜏1\tau_{1}italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is 0.07 and τ2subscript𝜏2\tau_{2}italic_τ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is 0.1. We set λ1subscript𝜆1\lambda_{1}italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, λ2subscript𝜆2\lambda_{2}italic_λ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT to 1, 1 in Eq. 16 and δ𝛿\deltaitalic_δ=2, λ𝜆\lambdaitalic_λ=0.001 in Eq. 14. For more details, please refer to Appendix A.

4.2 Experimental Setup

4.2.1 Dataset

MS-CXR-T benchmark (Bannur et al., 2023) is a multi-modal benchmark dataset aimed at evaluating biomedical vision-language processing models in radiology, specifically through two temporal tasks: image classification of chest X-rays and analysis of sentence similarity.

RSNA dataset (Shih et al., 2019) is a publicly available medical imaging dataset. It contains chest radiographs with annotations for the presence or absence of pneumonia.

COVIDx (Wang et al., 2020) is a dataset with over 30,000 chest X-ray (CXR) images from over 16,600 patients, including 16,490 positive COVID-19 cases. It’s used for a three-class classification task: categorizing radiographs as COVID-19, non-COVID pneumonia, or normal.

4.2.2 Downstream Tasks

Temporal Image Classification. It includes 1,326 labeled chest X-rays across five findings (Consolidation, Edema, Pleural Effusion, Pneumonia, and Pneumothorax), categorized into three stages of disease progression: Improving, Stable, Worsening. Due to the limited data size, we choose an SVM (Boser et al., 1992) classifier and employ cross-validation. We report the accuracy for each findings and the average classification accuracy under the setting of 5-fold and 10-fold.

Temporal Sentence Similarity. It includes 361 sentence pairs for evaluating temporal-semantic similarity. The task focuses on binary classification of pairs as paraphrases or contradictions regarding disease progression. This dataset includes two subsets: RadGraph and Swaps.We report our results on the RadGraph subset since it is more challenging. For more experimental settings and results on the Swaps subset, please refer to the Appendix B.2.

Zero-shot Image Classification. We test our zero-shot classification results on RSNA. We use the dataset division from previous work (Wang et al., 2022a; Huang et al., 2021) and also employ the prompt construction of BioViL-T. We report our classification results on the test set.

Medical Image Classification. Our model is tested on the COVIDx dataset to evaluate its efficacy in medical image classification. As with prior methods (Wang et al., 2022a), we fine-tune a classification head using different proportions of training data and then evaluate the classification performance on the test set.

4.3 Results

Temporal Image Classification Results. In Table 1, we report our performance on the task of temporal image classification under a 10-fold cross-validation setup. Our method achieves the highest average accuracy, surpassing BioViL-T (Bannur et al., 2023), which also utilizes temporal information, by 2.10%. Across five different findings, our method achieves the highest scores in three categories, particularly in the edema category, where our approach exceedes BioViL-T by 5.80%. Table 7 also validates our performance. We also achieve the best average accuracy in the experimental setup with a fold number of 5 in Table 7. Our method consistently achieves the best results in different cross-validation setups. This demonstrates that through our temporal modeling, the model is able to capture the semantics of temporal changes and integrate this information into the model, making it more suitable for temporal tasks.

Temporal Sentence Similarity Classification Results. In Table 2, we report our performance on the temporal text classification task. Our method achieves the best results in both accuracy and AUROC metrics. It surpasses BioViL-T by 4.95% in accuracy and by 3.21% in the AUROC metric. The BioViL-T method not only utilizes temporal information but also performs specialized mask language modeling. Additionally, the RadGraph dataset is a relatively complex text classification task, hence our results in this experiment thoroughly demonstrate the effectiveness of our approach. Through temporal modeling, our method acquires the capability to perceive changes, and with fine-grained local alignment, it more comprehensively understands the semantic context of the text.

Zero-shot Image Classification Results. Table 3 shows our results of zero-shot medical image classification task on RSNA dataset. Our method also achieves the best performance. This indicates that our performance improvement is not limited to temporal tasks but also benefits static classification tasks. Our model outperforms previous methods by 1%, 2%, and 1% in the accuracy, F1 score, and AUROC metrics, respectively. These consistent improvements across metrics suggest that spatial and temporal modeling can enhance the model’s representational capabilities. Since the logits in zero-shot classification are based on the similarity between image features and text prompts, it also indicates that our image and text encoders have learned more fine-grained and consistent representations.

Medical Image Classification Results. We present our classification results on COVIDx in Table 4. Our method achieved commendable results using both 10% and 100% of the training data. The outcomes in the classification task also demonstrate that the gains from our approach are evident in static tasks. Because our model absorbs a wider range of supervision signals, it may fail to capture the relevant information with only 1% of the training data available.

Refer to caption
Figure 4: Attention visualization with reference patch (red box) in the bounding box (black box). LA stands for local alignment. The bounding box represents the main pathological area annotated manually. It can be observed that our method achieves better correspondence with the bounding box after local alignment.

4.4 Analysis of Our Framework

Ablation Study. Table 5 displays the ablation results of our various components on the temporal image classification task. It can be observed that the best results are obtained when all objectives are combined with the use of lateral views. The first two rows indicate the effectiveness of our temporal modeling and local alignment objectives. The comparison between the third and fourth rows highlights the efficacy of using lateral views, and when all objectives are combined, we can achieve further improvements.

Visualization. Figure 4 presents the attention visualization of our model based on the reference patch (red box). This patch, contained within the bounding box (black box), encapsulates the corresponding textual information. The text corresponding to the first row is “Left basilar consolidation seen” and the second row corresponds to “patchy consolidation in the mid left lung”. It can be observed that our method, without local alignment, can only perceive certain areas. However, after applying modality-weighted local alignment, our method is able to recognize the pathological areas in the context, which correspond well with the bounding box. This demonstrates that our fine-grained modeling can indeed learn local alignment based on the amount of information contained in local pairs.

5 Conclusion

In this study, we present the Med-ST framework, which employs explicit spatial and temporal modeling along with comprehensive strategies for global and local alignment, and cross-modal bidirectional cycle consistency. This approach not only achieves superior performance in tasks including temporal sequences but also enhances the representation learning for medical image classification tasks. Med-ST emerges as a promising approach for medical multimodal pretraining, capitalizing on the extensive information present in multimodal datasets to acquire nuanced spatial and temporal insights.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China No. 62376277, No. 61976206 and No. 62222215, and Bei**g Outstanding Young Scientist Program NO. BJJWZYJH012019100020098.

Impact Statement

Although the dataset we used contains temporal information, the specific dates are anonymized, eliminating the risk of privacy infringement. The Med-ST framework introduced in our paper models from both spatial and temporal perspectives, effectively simulating the behavior of clinicians in practice. We offer a promising approach that not only diagnoses abnormalities in chest radiographs but also analyzes historical data to make decisions. This can help reduce the workload on doctors and advance the development of medical diagnostics, making diagnosis more intelligent and accessible, especially in underdeveloped areas.

References

  • Alayrac et al. (2022) Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  • Alsentzer et al. (2019) Alsentzer, E., Murphy, J. R., Boag, W., Weng, W.-H., **, D., Naumann, T., and McDermott, M. Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323, 2019.
  • Bannur et al. (2023) Bannur, S., Hyland, S., Liu, Q., Perez-Garcia, F., Ilse, M., Castro, D. C., Boecking, B., Sharma, H., Bouzid, K., Thieme, A., et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  15016–15027, 2023.
  • Bao et al. (2022) Bao, H., Wang, W., Dong, L., Liu, Q., Mohammed, O. K., Aggarwal, K., Som, S., Piao, S., and Wei, F. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 35:32897–32912, 2022.
  • Boecking et al. (2022) Boecking, B., Usuyama, N., Bannur, S., Castro, D. C., Schwaighofer, A., Hyland, S., Wetscherek, M., Naumann, T., Nori, A., Alvarez-Valle, J., et al. Making the most of text semantics to improve biomedical vision–language processing. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI, pp.  1–21. Springer, 2022.
  • Boser et al. (1992) Boser, B. E., Guyon, I. M., and Vapnik, V. N. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pp.  144–152, 1992.
  • Chambon et al. (2022a) Chambon, P., Bluethgen, C., Delbrouck, J.-B., Van der Sluijs, R., Połacin, M., Chaves, J. M. Z., Abraham, T. M., Purohit, S., Langlotz, C. P., and Chaudhari, A. Roentgen: Vision-language foundation model for chest x-ray generation. arXiv preprint arXiv:2211.12737, 2022a.
  • Chambon et al. (2022b) Chambon, P., Bluethgen, C., Langlotz, C. P., and Chaudhari, A. Adapting pretrained vision-language foundational models to medical imaging domains. arXiv preprint arXiv:2210.04133, 2022b.
  • Chen et al. (2022) Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
  • Cheng et al. (2023) Cheng, P., Lin, L., Lyu, J., Huang, Y., Luo, W., and Tang, X. Prior: Prototype representation joint learning from medical images and reports. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  21361–21371, 2023.
  • Dawidowicz et al. (2023) Dawidowicz, G., Hirsch, E., and Tal, A. Limitr: Leveraging local information for medical image-text representation. arXiv preprint arXiv:2303.11755, 2023.
  • Dosovitskiy et al. (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Dwibedi et al. (2019) Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P., and Zisserman, A. Temporal cycle-consistency learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  1801–1810, 2019.
  • Huang et al. (2021) Huang, S.-C., Shen, L., Lungren, M. P., and Yeung, S. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  3942–3951, 2021.
  • Johnson et al. (2019) Johnson, A. E., Pollard, T. J., Greenbaum, N. R., Lungren, M. P., Deng, C.-y., Peng, Y., Lu, Z., Mark, R. G., Berkowitz, S. J., and Horng, S. Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042, 2019.
  • Li et al. (2021) Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., and Hoi, S. C. H. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021.
  • Li et al. (2023) Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrap** language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  • Loshchilov & Hutter (2016) Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • Loshchilov & Hutter (2017) Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Lu et al. (2023) Lu, H., Huo, Y., Ding, M., Fei, N., and Lu, Z. Cross-modal contrastive learning for generalizable and efficient image-text retrieval. Machine Intelligence Research, 20(4):569–582, 2023.
  • Moon et al. (2022) Moon, J. H., Lee, H., Shin, W., Kim, Y.-H., and Choi, E. Multi-modal understanding and generation for medical images and text via vision-language pre-training. IEEE Journal of Biomedical and Health Informatics, 26(12):6070–6080, 2022.
  • Oord et al. (2018) Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Qin et al. (2022) Qin, Z., Yi, H., Lao, Q., and Li, K. Medical image understanding with pretrained vision language models: A comprehensive study. arXiv preprint arXiv:2209.15517, 2022.
  • Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  • Raoof et al. (2012) Raoof, S., Feigin, D., Sung, A., Raoof, S., Irugulpati, L., and Rosenow III, E. C. Interpretation of plain chest roentgenogram. Chest, 141(2):545–558, 2012.
  • Shaib et al. (2023) Shaib, C., Li, M. L., Joseph, S., Marshall, I. J., Li, J. J., and Wallace, B. C. Summarizing, simplifying, and synthesizing medical evidence using gpt-3 (with varying success). arXiv preprint arXiv:2305.06299, 2023.
  • Shih et al. (2019) Shih, G., Wu, C. C., Halabi, S. S., Kohli, M. D., Prevedello, L. M., Cook, T. S., Sharma, A., Amorosa, J. K., Arteaga, V., Galperin-Aizenberg, M., et al. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia. Radiology: Artificial Intelligence, 1(1):e180041, 2019.
  • Shu et al. (2024) Shu, C., Zhu, Y., Tang, X., Xiao, J., Chen, Y., Li, X., Zhang, Q., and Lu, Z. Miter: Medical image–text joint adaptive pretraining with multi-level contrastive learning. Expert Systems with Applications, 238:121526, 2024.
  • Singhal et al. (2022) Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., et al. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138, 2022.
  • Wan et al. (2023) Wan, Z., Liu, C., Zhang, M., Fu, J., Wang, B., Cheng, S., Ma, L., Quilodrán-Casas, C., and Arcucci, R. Med-unic: Unifying cross-lingual medical vision-language pre-training by diminishing bias. arXiv preprint arXiv:2305.19894, 2023.
  • Wang et al. (2022a) Wang, F., Zhou, Y., Wang, S., Vardhanabhuti, V., and Yu, L. Multi-granularity cross-modal alignment for generalized medical visual representation learning. arXiv preprint arXiv:2210.06044, 2022a.
  • Wang et al. (2020) Wang, L., Lin, Z. Q., and Wong, A. Covid-net: A tailored deep convolutional neural network design for detection of covid-19 cases from chest x-ray images. Scientific reports, 10(1):19549, 2020.
  • Wang et al. (2022b) Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O. K., Singhal, S., Som, S., et al. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022b.
  • Wang et al. (2023) Wang, Y., Zhao, Y., and Petzold, L. Are large language models ready for healthcare? a comparative study on clinical language understanding. arXiv preprint arXiv:2304.05368, 2023.
  • Wang et al. (2022c) Wang, Z., Wu, Z., Agarwal, D., and Sun, J. Medclip: Contrastive learning from unpaired medical images and text. arXiv preprint arXiv:2210.10163, 2022c.
  • Wu et al. (2023) Wu, C., Zhang, X., Zhang, Y., Wang, Y., and Xie, W. Medklip: Medical knowledge enhanced language-image pre-training. medRxiv, pp.  2023–01, 2023.
  • Yunxiang et al. (2023) Yunxiang, L., Zihan, L., Kai, Z., Ruilong, D., and You, Z. Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge. arXiv preprint arXiv:2303.14070, 2023.
  • Zhang et al. (2023) Zhang, L., Ruan, L., Hu, A., and **, Q. Multimodal pretraining from monolingual to multilingual. Machine Intelligence Research, 20(2):220–232, 2023.
  • Zhang et al. (2022) Zhang, Y., Jiang, H., Miura, Y., Manning, C. D., and Langlotz, C. P. Contrastive learning of medical visual representations from paired images and text. In Machine Learning for Healthcare Conference, pp.  2–25. PMLR, 2022.
  • Zhou et al. (2022) Zhou, H.-Y., Chen, X., Zhang, Y., Luo, R., Wang, L., and Yu, Y. Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports. Nature Machine Intelligence, 4(1):32–40, 2022.
  • Zhou et al. (2023) Zhou, H.-Y., Lian, C., Wang, L., and Yu, Y. Advancing radiograph representation learning with masked record modeling. arXiv preprint arXiv:2301.13155, 2023.

Appendix A Pre-training Details

We sort all cases of the same patient according to the studyDate attribute in chronological order. Then we divide them into sequences of length 4, treating any remaining part that is less than 4 as a sequence of length 1. In other words, our dataset contains sequences of either four image-text pairs or a single image-text pair. During training, a batch randomly includes sequences of length 4 or 1. Global alignment and local alignment use all image-text pairs in all sequences, while sequence modeling employs all sequences of length 4.

Table 6 outlines the hyperparameters used during pre-training. We opt for MGCA weights initialization, leveraging their high performance to ensure faster convergence, particularly aiming for time efficiency. We use PyTorch Lightning and an early stop** mechanism to optimize training duration.

Table 6: Hyperparameters of Pre-training
Hyperparameters
Pre-training epochs 10
Batch size per GPU 10
Number of GPUs 2
Learning Rate 4e-5
Learning Rate Optimizer CosineAnnealing
Weight Decay 0.05

Appendix B Downstream Tasks

B.1 Temporal Image classification Similarity

B.1.1 Experimental Details

We split the dataset according to the annotated disease types, i.e., Consolidation, Edema, Pleural Effusion, Pneumonia, and Pneumothorax. And then, for each subtask, we first obtain the features of two images, concatenate them, and use an SVM for classification, employing cross-validation to get the accuracy for each category. The task is to categorize the concatenated feature into three stages of disease progression: Improving, Stable, Worsening. Subsequently, we calculate the average accuracy across five types of findings. We use publicly available pre-trained models from others and determine their accuracy on this task in the same manner.

Table 7: Temporal image classification results on the MS-CXR-T benchmark under 5-fold cross-validation setting. MLM stands for masked language modeling objective. Best results are highlighted in bold.
Model Consolidation Edema Pl.effusion Pneumonia Pneumothorax Avg. Acc.
ConVIRT-MIMIC 46.97±plus-or-minus\pm± 1.68 56.03±plus-or-minus\pm± 1.07 54.98±plus-or-minus\pm± 2.75 60.62±plus-or-minus\pm± 2.51 46.61±plus-or-minus\pm± 2.51 53.04±plus-or-minus\pm± 0.62
GLoRIA-CheXpert 50.10±plus-or-minus\pm± 1.94 51.63±plus-or-minus\pm± 0.46 51.09±plus-or-minus\pm± 0.72 60.49±plus-or-minus\pm± 1.05 44.10±plus-or-minus\pm± 1.18 51.48±plus-or-minus\pm± 0.84
GLoRIA-MIMIC 44.32±plus-or-minus\pm± 2.51 49.64±plus-or-minus\pm± 1.74 48.74±plus-or-minus\pm± 1.69 58.63±plus-or-minus\pm± 2.52 42.17±plus-or-minus\pm± 1.66 48.70±plus-or-minus\pm± 0.86
MGCA 52.43±plus-or-minus\pm± 1.82 63.30±plus-or-minus\pm± 1.07 56.13±plus-or-minus\pm± 1.97 64.83±plus-or-minus\pm± 0.21 52.78±plus-or-minus\pm± 1.37 57.90±plus-or-minus\pm± 0.30
MedCLIP 49.44±plus-or-minus\pm± 2.51 52.27±plus-or-minus\pm± 3.98 53.79±plus-or-minus\pm± 0.59 58.92±plus-or-minus\pm± 1.43 41.87±plus-or-minus\pm± 2.38 51.25±plus-or-minus\pm± 0.65
MRM 50.10±plus-or-minus\pm± 1.16 49.37±plus-or-minus\pm± 3.84 48.97±plus-or-minus\pm± 0.61 55.12±plus-or-minus\pm± 1.09 48.23±plus-or-minus\pm± 0.47 50.36±plus-or-minus\pm± 1.01
BioViL 57.56±plus-or-minus\pm± 1.17 57.27±plus-or-minus\pm± 0.72 54.10±plus-or-minus\pm± 0.65 66.95±plus-or-minus\pm± 0.53 54.82±plus-or-minus\pm± 0.22 58.14±plus-or-minus\pm± 0.41
Temporal-based
BioViL-T 56.73±plus-or-minus\pm± 1.76 59.52±plus-or-minus\pm± 1.09 52.80±plus-or-minus\pm± 0.90 66.67±plus-or-minus\pm± 0.60 55.61±plus-or-minus\pm± 0.23 58.26±plus-or-minus\pm± 0.42
Med-ST 60.72±plus-or-minus\pm± 2.02 65.56±plus-or-minus\pm± 1.24 57.99±plus-or-minus\pm± 0.30 64.98±plus-or-minus\pm± 0.60 53.71±plus-or-minus\pm± 1.83 60.59±plus-or-minus\pm± 0.72
Table 8: Temporal sentence similarity classification results on Swaps subset.
Model Accuracy ROC-AUC
MedCLIP 59.59 48.33
MGCA 68.16 71.97
BioViL 81.63 90.00
BioViL-T 94.29 97.59
Med-ST 76.73 86.12

B.2 Temporal Sentence Similarity

B.2.1 Experimental Details

The task has two subsets. The first subset, generated by RadGraph, identifies paraphrase and contradiction sentence pairs through analyzing graph representations of sentences. The second subset is created using the Swaps method, which involves changing temporal keywords within sentences to produce paraphrases and contradictions. Swaps approach creates pairs that differ subtly in their temporal semantics while pairs obtained through the RadGraph method exhibit greater variation. Table B.2 are two examples of sentence pairs from each subset. We divide the dataset into two subsets based on the construction of text pairs into RadGraph and Swaps and conduct tests on each subset. Then, firstly, we randomize the dataset and then obtain the AUROC results based on the similarity between text pairs. Subsequently, through ten-fold cross-validation, we select the similarity threshold corresponding to the highest accuracy on the validation set. We then test on the entire set to obtain the accuracy. We use publicly available pre-trained models from others and follow the same process to determine their performance on this task.

B.2.2 Results

Tables 2 and Table 8 respectively show the experimental results on the two subsets. Our method achieves better results on the more complex dataset. For the Swaps dataset, methods BioViL and BioViL-T achieve better results due to the use of masked language modeling.

Table 9: Examples of sentence pairs from the MS-CXR-T temporal sentence similarity benchmark.
Label Sentence 1 Sentence 2
Swaps Paraphrase “Nearly resolved bilateral pleural effusions.” “Nearly cleared bilateral pleural effusions.”
Contradiction “Bibasal atelectasis is unchanged.” “Bibasal atelectasis is new.”
RadGraph Paraphrase “Moderate pulmonary edema is exaggerated by low lung volumes, but also worsened.” “Lung volumes are lower exaggerating what is at least worsened moderate pulmonary edema.”
Contradiction “Small right pneumothorax has developed at the base of the right lung.” “Small pneumothorax at the base of the right lung is unchanged.”

B.3 Zero-shot Image Classification

B.3.1 Text Prompt

We use the same text prompt as BioViL-T for pneumonia.

pos_query = [ ’Findings consistent with pneumonia’, ’Findings suggesting pneumonia’, ’This opacity can represent pneumonia’, ’Findings are most compatible with pneumonia’, ],
neg_query = [ ’There is no pneumonia’, ’No evidence of pneumonia’, ’No evidence of acute pneumonia’, ’No signs of pneumonia’, ]

B.3.2 Experimental Details

We use the dataset split from previous work (Huang et al., 2021; Wang et al., 2022a) and test on the test set. The text embedding is the average of four prompt embeddings, thus we obtain two text features representing the presence and absence of pneumonia. After obtaining the image features, we determine the predicted category based on the similarity between the image features and the two text features.

Appendix C More Visualization

C.1 t-SNE visualizations results of image features

Figure 5 shows the visualization results of image features at different time steps for 200 sequences. The left side is from MGCA, and the right side is ours. Each category represents different time points in the sequence. From the figure, it can be seen that the feature distribution of MGCA is relatively uniform, with no discernible trend of change. In contrast, the trends between the four time points in our diagram are consistent, indicating that our model has learned time-related information.

C.2 Visualization of the learned distribution of β𝛽\betaitalic_β

In Figure 6, we randomly select 800 sample points and obtained their β𝛽\betaitalic_β distributions before training (first row) and after training (second row). The first column represents the distribution of samples with a ground truth index of 1, while the second column represents those with a ground truth index of 3. The red dashed line indicates the ground truth, and the blue dashed line represents the mean of β𝛽\betaitalic_β of the samples. Our objective is to bring the blue dashed line closer to the red dashed line.

It can be observed that, compared to the row above, the β𝛽\betaitalic_β distributions obtained after training exhibit closer proximity to the ground truth, and they also demonstrate better fitting to Gaussian distributions. This suggests that our model has learned temporal semantics to a certain extent, as the features of samples closer in time are more similar.

Appendix D More Ablation

D.1 Ablation study on the proportion of lateral views

Table 10: Ablation study on the proportion of lateral views.
Lateral proportion (%) Consolidation Edema Pl.effusion Pneumonia Pneumothorax Avg. Acc.
0 55.09±plus-or-minus\pm± 1.34 63.32±plus-or-minus\pm± 0.78 58.06±plus-or-minus\pm± 1.40 53.58±plus-or-minus\pm± 0.40 50.25±plus-or-minus\pm± 0.99 58.06±plus-or-minus\pm± 0.29
60 55.39±plus-or-minus\pm± 1.67 67.32±plus-or-minus\pm± 0.88 58.63±plus-or-minus\pm± 1.00 65.98±plus-or-minus\pm± 0.91 51.04±plus-or-minus\pm± 0.87 59.67±plus-or-minus\pm± 0.41
100 60.57±plus-or-minus\pm± 1.18 67.35±plus-or-minus\pm± 0.32 58.47±plus-or-minus\pm± 1.50 65.00±plus-or-minus\pm± 0.34 54.18±plus-or-minus\pm± 0.81 61.12±plus-or-minus\pm± 0.34

Table 10 illustrates the results of ablation experiments using different proportions of lateral views. It can be observed that even without the use of lateral views, our mean performance is 58.06, surpassing nearly all baselines. This underscores the effectiveness of local alignment objectives and temporal modeling. As the proportion of lateral views increases, performance also improves, indicating that lateral views provide an additional boost to performance.

D.2 Ablation study on FMC and RMR

Table 11: Ablation study on FMC and RMR.
FMCsubscriptFMC\mathcal{L}_{\text{FMC}}caligraphic_L start_POSTSUBSCRIPT FMC end_POSTSUBSCRIPT MRMsubscriptMRM\mathcal{L}_{\text{MRM}}caligraphic_L start_POSTSUBSCRIPT MRM end_POSTSUBSCRIPT Consolidation Edema Pl.effusion Pneumonia Pneumothorax Avg. Acc.
53.72 66.46 56.93 63.27 51.66 58.33
53.26 64.30 55.02 65.02 55.04 58.53
60.71 67.31 58.89 65.02 55.48 61.48

Table 11 presents the ablation experiments conducted on FMC and MRM. The experimental results indicate that MRM contributes to greater benefits, suggesting that more complex regression objectives enable the model to better capture contextual information within sequences. When both are applied, the performance on sequential tasks is optimal, indicating that combining these two objectives, from simple to challenging, allows the model to maximize its understanding of sequence semantics.

Refer to caption
Figure 5: t-SNE visualizations results of image features at different time steps for 200 sequences. The left side is from MGCA, and the right side is ours.
Refer to caption
Figure 6: Visualization of the learned distribution of β𝛽\betaitalic_β.