Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training

Yang, Jinxia; Su, Bing; Zhao, Wayne Xin

Abstract: Medical vision-language pre-training methods mainly leverage the correspondence between paired medical images and radiological reports. Although multi-view spatial images and temporal sequences of image-report pairs are available in off-the-shelf multi-modal medical datasets, most existing methods have not thoroughly tapped into such extensive supervision signals. In this paper, we introduce the Med-ST framework for fine-grained spatial and temporal modeling to exploit information from multiple spatial views of chest radiographs and temporal historical records. For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views. To achieve a more comprehensive alignment, Med-ST not only establishes the global alignment between whole images and texts but also introduces modality-weighted local alignment between text tokens and spatial regions of images. For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR). By perceiving temporal information from simple to complex, Med-ST can learn temporal semantics. Experimental results across four distinct tasks demonstrate the effectiveness of Med-ST, especially in temporal classification tasks. Our code and model are available at https://github.com/SVT-Yang/MedST.
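The abstract names a cross-modal cycle consistency objective with a classification term and a regression term but does not give the formulas. As a rough illustration only, the sketch below uses a generic soft-nearest-neighbour cycle-back formulation on toy embeddings. It is not the authors' implementation: the function names (`cycle_back`, `classification_loss`, `regression_loss`), the squared-distance similarity, and the pairing of each loss with a mapping direction are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cycle_back(query, source_seq, target_seq):
    """Map `query` into `target_seq` through a soft nearest neighbour, then
    return a probability distribution over the positions of `source_seq`
    describing where the mapped point lands on the way back."""
    alpha = softmax([-sq_dist(query, t) for t in target_seq])
    dim = len(query)
    soft_nn = [sum(a * t[d] for a, t in zip(alpha, target_seq)) for d in range(dim)]
    return softmax([-sq_dist(soft_nn, s) for s in source_seq])


def classification_loss(beta, index):
    """Cross-entropy: the cycle should return to the starting position."""
    return -math.log(beta[index])


def regression_loss(beta, index):
    """Squared error between the expected return position and the start."""
    expected = sum(k * b for k, b in enumerate(beta))
    return (expected - index) ** 2


# Toy data (hypothetical): three time steps of 2-D image and report embeddings.
image_seq = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
text_seq = [[0.1, 0.0], [1.1, 0.0], [2.1, 0.0]]

# Start at image step 1, map to the report sequence, and cycle back.
beta = cycle_back(image_seq[1], image_seq, text_seq)
cls = classification_loss(beta, 1)
reg = regression_loss(beta, 1)
```

In this toy setup the classification term only asks which time step the cycle returns to, while the regression term also penalises how far the expected return position drifts from the start, which is one way to read the abstract's "from simple to complex" description.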

Comments:	Accepted at ICML 2024
Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2405.19654 [cs.AI]
	(or arXiv:2405.19654v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2405.19654
