1 Department of Automation, BNRist, Tsinghua University, Beijing, China
2 Beijing Academy of Artificial Intelligence, Beijing, China
Email: [email protected], [email protected], [email protected], [email protected]

Structure-aware World Model for Probe Guidance via Large-scale Self-supervised Pre-train

Haojun Jiang* (1,2), Meng Li* (2), Zhenguo Sun* (2), Ning Jia (2), Yu Sun (2), Shaqi Luo (2), Shiji Song (1), Gao Huang ✉ (1,2)
Abstract

The complex structure of the heart leads to significant challenges in echocardiography, especially in acquiring cardiac ultrasound images. Successful echocardiography requires a thorough understanding of the structures on the two-dimensional plane and the spatial relationships between planes in three-dimensional space. In this paper, we propose a large-scale self-supervised pre-training method to acquire a cardiac structure-aware world model. The core innovation lies in constructing a self-supervised task that requires structural inference by predicting masked structures on a 2D plane and imagining another plane based on pose transformation in 3D space. To support large-scale pre-training, we collected over 1.36 million echocardiograms from ten standard views, along with their 3D spatial poses. In the downstream probe guidance task, we demonstrate that our pre-trained model consistently reduces guidance errors across the ten most common standard views on a test set of 0.29 million samples from 74 routine clinical scans, indicating that structure-aware pre-training benefits scanning.

Keywords: Echocardiography · World Model · Structural Understanding · Self-supervised Pre-train · Probe Guidance

* These authors contributed equally to this work. This work was done while Haojun Jiang was an intern at Beijing Academy of Artificial Intelligence.
✉ Corresponding author.

1 Introduction

Cardiovascular diseases are the leading cause of death worldwide [18, 19]. Echocardiography is the most commonly used method in clinical practice to assess heart conditions. However, the structure of the heart is extremely complex. According to [12], up to seven anatomical structures need to be identified in a single plane, such as the Parasternal Long-axis Plane (Fig. 1 Left). Recognizing these anatomical structures is crucial for diagnosis, and understanding their spatial relationships in a two-dimensional plane also helps the sonographer fine-tune the probe to obtain the best quality images. Additionally, up to 27 different planes need to be examined, requiring the sonographer to understand the spatial relationships between these planes in three-dimensional space and to finely adjust the ultrasound probe's position to reach the target location (Fig. 1 Right). For these reasons, cardiac ultrasound examinations are extremely challenging. This also results in a long training period for ultrasound medical personnel, as they need to spend a lot of time familiarizing themselves with both 2D and 3D structures. Consequently, there is a significant talent shortage in this field, especially in regions with scarce medical resources, such as Africa.

Figure 1: Diagram illustrating the capabilities of a cardiac world model. We aim to develop a cardiac world model that can understand both two-dimensional and three-dimensional structures. (Left) The world model needs to recognize various structures in two-dimensional planes and understand their spatial relationships for in-plane probe adjustment. (Right) Understanding the three-dimensional structure of the heart, specifically the spatial relationships between different planes, is crucial for out-of-plane probe adjustment. The images used in the diagram are sourced from [12].

With the rise of deep learning [4, 6, 7, 21], AI technology has shown great potential in improving the efficiency of echocardiography analysis. For example, Ouyang et al. [14] proposed a video-based deep learning algorithm, i.e., EchoNet-Dynamic, that can automatically make accurate assessments of cardiac function. More importantly, with the rise of large language models [2, 16, 17, 20] and multi-modal learning [8, 10, 11, 15], the interpretation of echocardiograms [3] has shown increasing improvement, demonstrating excellent performance in tasks such as pulmonary artery pressure estimation, left ventricular hypertrophy, heart failure, and left atrial enlargement. These AI-assisted diagnostic tools have demonstrated performance almost comparable to human experts. However, this comes with a prerequisite: the acquisition of high-quality echocardiograms. In regions with scarce medical resources, there are often no sonographers available to obtain high-quality echocardiograms. In such cases, the powerful capabilities of these AI-assisted diagnostic tools cannot be fully utilized. Few works [5, 9, 13] have focused on how to use AI technology to assist inexperienced sonographers in accurately acquiring target planes. Recently, Jiang et al. [9] proposed a cardiac dreamer, which only focuses on supervised learning of the 3D-structure of the heart and serves as a "heart map" for the probe guidance task. This work demonstrates a potential AI-assisted scanning method that is expected to improve the scanning skills of novices.

According to the clinical experience of doctors, understanding both the 2D and 3D structures is crucial for efficient scanning. For example, when you need to adjust the probe’s viewing angle on a particular plane to capture specific anatomical structures, you must have a good understanding of the spatial positions of those anatomical structures on that plane (Fig.1 Left). Thus, in this paper, we propose a 2D-3D joint structure-aware pre-training framework to obtain a data-driven cardiac world model that benefits ultrasound scanning. Specifically, we require the world model to learn important spatial relationships in the following ways: (1) for understanding two-dimensional structures, we use a masking approach that requires the world model to predict features at adjacent positions in the two-dimensional plane; (2) for understanding three-dimensional structures, we provide information on the positional changes of two planes in 3D space, requiring the world model to predict the features of the target plane after the positional change. We further collected expert operational data on acquiring the most common ten standard planes from 364 routine clinical scans, resulting in 1.36 million sample pairs gathered by three certified sonographers, to enhance the model’s learning of generalizable cardiac structure knowledge. Results on the downstream probe guidance tasks indicate that the proposed structure-aware pre-training method learns useful knowledge for assisting in the acquisition of echocardiograms.

2 Method

In this section, we describe the proposed structure-aware pre-training framework, illustrated in Fig. 2. In Section 2.1, we first introduce the prior I-JEPA work [1], on which our method is based. In Section 2.2, we then discuss how we construct a pre-training framework that simultaneously learns 2D and 3D structural information by introducing three-dimensional spatial information.

2.1 Preliminary

In the context of echocardiogram analysis, accurately understanding 2D structural information is fundamental for making correct diagnoses and conducting efficient scans. Recently, Assran et al. proposed a Joint-Embedding Predictive Architecture (I-JEPA) [1] to learn highly semantic image representations. The key idea of I-JEPA is to learn the representation of images by predicting features of target blocks based on the non-overlapping context block and the positional embedding in the same image. This paradigm requires the model to understand the spatial relationships of different semantic structures in the 2D plane. For example, if a context block contains the head of a dog, it is likely that the body of the dog will be located below the context block. For echocardiogram analysis, this paradigm also enables the model to learn the spatial relationships of fine structures in the two-dimensional plane. For instance, the left ventricle (LV) is located below the interventricular septum (IVS), as shown on the left side of Fig. 1. Therefore, it is highly suitable for modeling two-dimensional structural information in cardiac ultrasound images. Next, we briefly introduce the modeling and training method.

Targets. Specifically, the I-JEPA model employs a Vision Transformer [4]. Firstly, the input image $\mathbf{I}\in\mathbb{R}^{H\times W}$ is divided into $N$ non-overlapping patches, which are further processed by the target encoder $F'_{\theta}$. Then, we select a portion of the patches to form $M$ target blocks, which may have overlapping regions. We denote the ground-truth target blocks' features as $\mathbf{Y}^{t}_{i}\in\mathbb{R}^{L_{t}\times C}, i\in\{1,2,\cdots,M\}$.

Context and Prediction. For the context features, patches not belonging to the target blocks are randomly selected and input into the context encoder to obtain features $\mathbf{Z}_{c}\in\mathbb{R}^{L_{c}\times C}$. Subsequently, the predictor $W_{\theta}$ (world model) generates the features of target blocks based on the context features and positional embedding which indicate the location of context and target blocks. We denote the predicted target blocks' feature as $\hat{\mathbf{Y}}^{t}_{i}\in\mathbb{R}^{L_{t}\times C}, i\in\{1,2,\cdots,M\}$.
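The block sampling described in the two paragraphs above can be illustrated with a minimal sketch. It works on patch indices only. The grid size (14 × 14), the block sizes, and the function names are illustrative assumptions, not values taken from the paper or from the I-JEPA code base.

```python
import random


def sample_block(grid_h, grid_w, block_h, block_w, rng):
    """Return the set of patch indices covered by one random rectangular block."""
    top = rng.randint(0, grid_h - block_h)
    left = rng.randint(0, grid_w - block_w)
    return {r * grid_w + c
            for r in range(top, top + block_h)
            for c in range(left, left + block_w)}


def sample_context_and_targets(grid_h=14, grid_w=14, num_targets=4, seed=0):
    """Sample M target blocks and a context block that excludes all target patches."""
    rng = random.Random(seed)
    # M target blocks; they are allowed to overlap each other.
    targets = [sample_block(grid_h, grid_w, 4, 4, rng) for _ in range(num_targets)]
    # One large context block (size is an assumption), from which every target
    # patch is removed so the context never reveals what has to be predicted.
    context = sample_block(grid_h, grid_w, 12, 12, rng)
    for block in targets:
        context -= block
    return sorted(context), [sorted(block) for block in targets]


context_idx, target_idx = sample_context_and_targets()
```

Removing the target patches from the context is what forces the predictor to infer structure from spatial relationships rather than copy visible features.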

Loss. With the ground-truth and predicted features of target blocks, the model is optimized using the following loss:

$$\mathcal{L}_{total}=\frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{L_{t}}\mathcal{L}_{\mathrm{SmoothL1}}(\mathbf{Y}^{t}_{i,j},\hat{\mathbf{Y}}^{t}_{i,j}). \tag{1}$$
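As a worked illustration of Eq. (1), the following sketch evaluates the loss on plain Python lists. Two details are assumptions made here because the text does not specify them: the Smooth L1 threshold `beta = 1.0`, and averaging the per-channel Smooth L1 values over the $C$ channels of each patch feature.

```python
def smooth_l1(x, y, beta=1.0):
    """Smooth L1 between two scalars: quadratic below beta, linear above."""
    d = abs(x - y)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


def total_loss(targets, predictions):
    """Eq. (1): targets and predictions are M blocks of L_t feature vectors each."""
    m = len(targets)
    loss = 0.0
    for y_block, p_block in zip(targets, predictions):
        for y_vec, p_vec in zip(y_block, p_block):
            # Assumption: the channel dimension C is reduced by a mean.
            loss += sum(smooth_l1(a, b) for a, b in zip(y_vec, p_vec)) / len(y_vec)
    return loss / m


# One block (M = 1) with one patch (L_t = 1) and two channels (C = 2).
example = total_loss([[[0.0, 0.0]]], [[[0.5, 2.0]]])
```

In the example, the first channel falls in the quadratic regime (0.5 · 0.5² = 0.125) and the second in the linear regime (2.0 − 0.5 = 1.5), so the loss is their mean, 0.8125.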
Figure 2: Diagram illustrating the pre-training method and downstream task. (a) The world model and encoder are required to predict features on the target plane based on the spatial relationships in both two-dimensional and three-dimensional spaces. (b) Following cardiac dreamer [9], we choose the probe guidance task to evaluate our method. The cardiac images used in the diagram are sourced from [12].

2.2 2D-3D Joint Structure-aware Pre-training

Echocardiography involves acquiring high-quality echocardiograms and subsequently conducting analysis and diagnosis based on these images. However, previous studies [3, 14] have primarily focused on understanding and analyzing the two-dimensional plane, neglecting how AI can assist in acquiring high-quality echocardiograms. While understanding the 2D structure of a plane can aid in scanning, especially when some features of the standard view are already visible, this alone is insufficient. Particularly when transitioning from one standard view to another, a deep understanding of the heart's 3D structure is essential. Therefore, we propose a 2D-3D joint structure-aware pre-training framework, as illustrated in Fig. 2 (a). The core insight is to predict the visual features of structures at target locations based on given 2D and 3D positional conditions, thereby learning the mapping between spatial positions and visual features. Next, we provide details of the modeling and training method.

Input. Given a source image $\mathbf{I}^{s}\in\mathbb{R}^{H\times W}$, we select a target image $\mathbf{I}^{t}\in\mathbb{R}^{H\times W}$ from the same individual's images. To ensure sufficient spatial pose variation between the two images, there must be an interval of at least 150 frames between their sequence numbers. Then, both images are divided into $N$ non-overlapping patches $\mathbf{I}^{s}_{p},\mathbf{I}^{t}_{p}\in\mathbb{R}^{N\times h\times w}$.
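The pair selection rule above can be sketched as follows. The sampling strategy (draw the earlier frame first, then the later one, then randomly swap roles) and the function names are assumptions for illustration. The text only states the 150-frame minimum interval.

```python
import random

MIN_FRAME_GAP = 150  # minimum interval between sequence numbers, as stated in the text


def sample_pair(num_frames, rng, min_gap=MIN_FRAME_GAP):
    """Pick (source, target) frame indices from one scan, at least min_gap apart."""
    if num_frames <= min_gap:
        raise ValueError("scan is too short to satisfy the frame gap")
    first = rng.randrange(0, num_frames - min_gap)
    second = rng.randrange(first + min_gap, num_frames)
    # Either frame may act as the source; swap roles with probability 0.5.
    return (first, second) if rng.random() < 0.5 else (second, first)


def num_patches(height, width, patch):
    """Number N of non-overlapping patch x patch tiles in an H x W image."""
    return (height // patch) * (width // patch)


rng = random.Random(0)
source_idx, target_idx = sample_pair(400, rng)
```

For a 224 × 224 input and 16 × 16 patches (an assumed resolution), `num_patches` gives N = 196.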

Context and Target. First, we randomly select a rectangular region from the patches of the source image $\mathbf{I}^{s}$ as the context information. These patches are then fed through the context encoder $F_{\theta}$, resulting in the context feature $\mathbf{Z}^{s}$:

$$\mathbf{Z}^{s}=F_{\theta}(\mathbf{I}^{s}_{p}\odot\mathbf{B}^{s}),\ \ \mathbf{Z}^{s}\in\mathbb{R}^{L_{s}\times C}, \tag{2}$$

where $\mathbf{B}^{s}\in\mathbb{R}^{N\times 1\times 1}$ is a binary mask indicating the selected patches, $L_{s}$ is the number of context patches, and $C$ is the hidden dimension. Next, we select $M$ non-overlapping regions on the target image as the target blocks, requiring the model to understand both the three-dimensional structure of the heart and the spatial relationships of heart structures in the two-dimensional plane. If we choose regions on the target image that have the same spatial position as the context block, the model only needs to understand the three-dimensional spatial relationships. The target features are obtained as follows:

$$\mathbf{Y}^{t}_{i}=F_{\theta}(\mathbf{I}^{t}_{p})\odot\mathbf{B}^{t}_{i},\ \ \mathbf{Y}^{t}_{i}\in\mathbb{R}^{L_{t}\times C},\ i\in\{1,2,\cdots,M\}, \tag{3}$$

where $\mathbf{B}^{t}_{i}\in\mathbb{R}^{N\times 1}$ is a binary mask indicating the selected patches and $L_{t}$ is the number of target patches.
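One way to read the masks in Eqs. (2) and (3) is as a selection of $L_s$ (or $L_t$) tokens out of $N$. The sketch below builds a rectangular binary mask over the patch grid and keeps the selected tokens. The helper names and the 4 × 4 grid are hypothetical, and integers stand in for patch tokens.

```python
def block_mask(grid_h, grid_w, top, left, block_h, block_w):
    """Binary mask of length N = grid_h * grid_w with ones inside the rectangle."""
    return [1 if top <= r < top + block_h and left <= c < left + block_w else 0
            for r in range(grid_h) for c in range(grid_w)]


def select(tokens, mask):
    """Keep only the tokens whose mask entry is 1."""
    return [tok for tok, keep in zip(tokens, mask) if keep]


tokens = list(range(16))                      # stand-ins for N = 16 patch tokens
context_mask = block_mask(4, 4, 0, 0, 2, 2)   # B^s: top-left 2 x 2 region
target_mask = block_mask(4, 4, 2, 2, 2, 2)    # B^t_i: bottom-right 2 x 2 region

context_tokens = select(tokens, context_mask)  # L_s = 4 tokens
target_tokens = select(tokens, target_mask)    # L_t = 4 tokens
```

Setting `target_mask` equal to `context_mask` corresponds to the special case noted above, where the target block shares the context block's 2D position and only the 3D relationship has to be inferred.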

Condition and Prediction. The three-dimensional spatial relationship between the source image and the target image is denoted as $\mathbf{a}\in\mathbb{R}^{6}$, which encapsulates the translation and rotation changes in the x, y, and z directions. This vector $\mathbf{a}$ is encoded by $P_{\theta}$ to obtain the pose embedding, denoted as $\mathbf{P}\in\mathbb{R}^{1\times C}$. Then, the 2D positional embeddings indicating the location of the context and target block are denoted as $\mathbf{Q}^{s}\in\mathbb{R}^{L_{s}\times C}$ and $\mathbf{Q}^{t}_{i}\in\mathbb{R}^{L_{t}\times C}, i\in\{1,2,\cdots,M\}$. Finally, the predicted features from the world model $W_{\theta}$ are obtained as follows:

\begin{align}
\mathbf{Z}^{s'} &= \mathbf{Z}^{s}+\mathbf{Q}^{s}, \tag{4}\\
\mathbf{Z}^{t'}_{i} &= \mathbf{Z}^{t}_{i}+\mathbf{Q}^{t}_{i}, \tag{5}\\
\hat{\mathbf{Y}}^{t}_{i} &= W_{\theta}(\text{concat}[\mathbf{Z}^{s'},\mathbf{P},\mathbf{Z}^{t'}_{i}]), \tag{6}
\end{align}

where $\mathbf{Z}^{t}_{i}\in\mathbb{R}^{L_{t}\times C}, i\in\{1,2,\cdots,M\}$ is a set of learnable parameters representing the features to be predicted. These parameters interact with the context feature and condition information to generate the target feature. Then, the models are optimized according to the loss function defined in Eq. (1).
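To make the shapes in Eqs. (4) to (6) concrete, the sketch below assembles the token sequence that is passed to the world model. It uses nested Python lists in place of tensors and does not implement $W_{\theta}$ itself. The function names are hypothetical, and the token order follows the concat in Eq. (6).

```python
def add(a, b):
    """Element-wise sum of two L x C matrices stored as nested lists."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def build_predictor_input(z_s, q_s, pose_emb, z_t, q_t):
    """Return the (L_s + 1 + L_t) x C sequence fed to the world model."""
    z_s_prime = add(z_s, q_s)                   # Eq. (4): context features + 2D positions
    z_t_prime = add(z_t, q_t)                   # Eq. (5): learnable queries + 2D positions
    return z_s_prime + [pose_emb] + z_t_prime   # argument of W_theta in Eq. (6)


# Toy sizes: L_s = 3 context tokens, L_t = 2 target tokens, C = 4 channels.
z_s = [[1.0] * 4 for _ in range(3)]
q_s = [[0.5] * 4 for _ in range(3)]
pose_emb = [9.0] * 4                            # P, the encoded 6-DoF pose change
z_t = [[0.0] * 4 for _ in range(2)]
q_t = [[0.1] * 4, [0.2] * 4]

sequence = build_predictor_input(z_s, q_s, pose_emb, z_t, q_t)
```

The single pose token sits between the context and target tokens, so a transformer-style predictor can attend to the 2D positions and the 3D pose change jointly.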

3 Experiments

Figure 3: Anatomic and 2D ultrasound images of ten standard planes. The cardiac images used in the diagram are sourced from [12].

3.1 Implementation Details

Dataset. In this paper, we collected data and conducted experiments on the ten most common standard planes [12], as shown in Fig. 3. The ultrasound images and scan data were acquired following the procedure described in [9]. Ultimately, we amassed data from 364 routine clinical scans, resulting in a total of 1.36 million image and 3D pose data pairs. The whole data collection process was approved and supervised by The University Science and Technology Ethics Committee. We split the dataset into 290 scans (1.07 million samples) for training and 74 scans (0.29 million samples) for testing. For pre-training, only the training set was utilized. Both the training and test sets were employed for the downstream probe guidance task. It is important to note that the individuals in the training and test sets are different to avoid information leakage and fairly validate the model’s generalization performance.
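A split that keeps individuals disjoint between training and testing can be sketched as below. It assigns whole scans to one side or the other, which assumes that each scan corresponds to a distinct individual. The function name, the seed, and the synthetic scan ids are illustrative only.

```python
import random


def split_by_scan(scan_ids, num_test, seed=0):
    """Split sample indices so that no scan contributes to both sets."""
    unique = sorted(set(scan_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    test_scans = set(unique[:num_test])
    train_idx = [i for i, s in enumerate(scan_ids) if s not in test_scans]
    test_idx = [i for i, s in enumerate(scan_ids) if s in test_scans]
    return train_idx, test_idx


# Synthetic example: 364 scans with 10 samples each, 74 scans held out for testing.
scan_ids = [i % 364 for i in range(3640)]
train_idx, test_idx = split_by_scan(scan_ids, num_test=74)
```

Splitting at the sample level instead would let frames from the same scan appear on both sides, which is the information leakage the text warns against.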

Pre-training. The context and target encoders were implemented using ViT-Small/16, and the world model utilized a custom vision transformer with a depth of 6 layers and a hidden dimension of 384. The entire model was trained for 50 epochs with a batch size of 1024 on 8 Nvidia RTX-4090 GPUs. The learning rate was warmed up over 7 epochs from 1e-4 to 5e-4 and then decayed with a cosine scheduler to a final learning rate of 5e-7. The implementation details for generating the context block and target blocks, as well as the remaining hyper-parameters, followed the procedures described in [1].
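The learning-rate schedule described above can be written as a small function. The linear shape of the warmup and the per-epoch granularity are assumptions. The text specifies only the warmup length, the three learning-rate values, and the cosine decay.

```python
import math


def learning_rate(epoch, total_epochs=50, warmup_epochs=7,
                  start_lr=1e-4, peak_lr=5e-4, final_lr=5e-7):
    """Warmup (assumed linear) from start_lr to peak_lr, then cosine decay to final_lr."""
    if epoch < warmup_epochs:
        return start_lr + (peak_lr - start_lr) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(51)]
```

Under these assumptions the schedule starts at 1e-4, peaks at 5e-4 once the warmup ends, and reaches 5e-7 at the end of training.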

Downstream task. For the probe guidance task, we adopted the framework and procedure proposed in [9], as shown in Fig. 2 (b). The input for this task is an ultrasound image, and the output is the probe position adjustment needed to achieve a specific standard view. This task aims to assist junior ultrasound medical personnel in scanning, enhancing the success rate and quality of view acquisition. During fine-tuning, our pre-trained world model $W_{\theta}$ was loaded and optimized for 5 epochs with a batch size of 1024 on 8 Nvidia RTX-4090 GPUs. The learning rate was set to 1e-4, using a cosine scheduler with a final learning rate of 1e-6. The optimizer was set to AdamW.

Evaluation Metrics. The metric used for the probe guidance task is the Mean Absolute Error (MAE) between the predicted and ground-truth probe poses, reported separately for each of the six degrees of freedom.
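A per-dimension MAE over the six degrees of freedom (x, y, z, rx, ry, rz) can be computed as in this sketch. The function name and the toy values are illustrative. Rotation errors are treated as plain differences here, without any angle wrap-around handling, which is an assumption.

```python
def mae_per_dof(predicted, ground_truth):
    """Mean absolute error for each pose dimension over a list of samples."""
    n = len(predicted)
    dims = len(predicted[0])
    return [sum(abs(p[d] - g[d]) for p, g in zip(predicted, ground_truth)) / n
            for d in range(dims)]


# Two toy samples, each a 6-DoF adjustment (x, y, z in mm; rx, ry, rz in degrees).
predicted = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
             [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]]
ground_truth = [[0.0] * 6, [0.0] * 6]

errors = mae_per_dof(predicted, ground_truth)
```

Keeping the dimensions separate matches the layout of the results table, where translation errors in millimetres and rotation errors in degrees are not averaged together.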

3.2 Results

Table 1: Evaluation of the probe guidance task. We report MAE results that represent probe guidance errors (lower is better) across ten standard planes.
| Plane | Model | Translation x (mm) | Translation y (mm) | Translation z (mm) | Rotation rx (degree) | Rotation ry (degree) | Rotation rz (degree) |
|---|---|---|---|---|---|---|---|
| PLAX | Cardiac Dreamer [9] | 8.66 | 8.14 | 5.63 | 6.60 | 5.42 | 8.23 |
| PLAX | + Our Pre-train | 8.39 (-3.15%) | 8.02 (-1.53%) | 5.53 (-1.80%) | 6.46 (-2.12%) | 5.34 (-1.42%) | 7.89 (-4.07%) |
| PSAX-AV | Cardiac Dreamer | 7.26 | 6.63 | 4.51 | 5.28 | 6.26 | 7.43 |
| PSAX-AV | + Our Pre-train | 7.06 (-2.73%) | 6.57 (-0.97%) | 4.34 (-3.85%) | 5.28 (-0.11%) | 6.03 (-3.63%) | 7.18 (-3.25%) |
| PSAX-PV | Cardiac Dreamer | 7.59 | 6.58 | 4.71 | 5.47 | 5.67 | 8.54 |
| PSAX-PV | + Our Pre-train | 7.49 (-1.38%) | 6.56 (-0.28%) | 4.66 (-1.08%) | 5.46 (-0.21%) | 5.52 (-2.73%) | 8.21 (-3.88%) |
| PSAX-MV | Cardiac Dreamer | 7.81 | 6.42 | 4.89 | 6.68 | 5.87 | 9.11 |
| PSAX-MV | + Our Pre-train | 7.51 (-3.89%) | 6.42 (0.04%) | 4.77 (-2.53%) | 6.55 (-1.92%) | 5.78 (-1.63%) | 8.78 (-3.67%) |
| PSAX-PAP | Cardiac Dreamer | 7.18 | 6.07 | 4.43 | 6.35 | 5.39 | 8.53 |
| PSAX-PAP | + Our Pre-train | 6.92 (-3.57%) | 5.97 (-1.63%) | 4.35 (-1.78%) | 6.11 (-3.88%) | 5.33 (-1.04%) | 8.42 (-1.31%) |
| PSAX-APEX | Cardiac Dreamer | 6.98 | 5.98 | 4.25 | 5.69 | 4.89 | 7.33 |
| PSAX-APEX | + Our Pre-train | 6.85 (-1.88%) | 5.77 (-3.41%) | 4.11 (-3.28%) | 5.55 (-2.43%) | 4.90 (0.07%) | 7.25 (-1.11%) |
| A4C | Cardiac Dreamer | 7.72 | 7.17 | 5.45 | 5.64 | 4.89 | 8.91 |
| A4C | + Our Pre-train | 7.55 (-2.14%) | 7.00 (-2.49%) | 5.36 (-1.70%) | 5.48 (-2.84%) | 4.86 (-0.55%) | 8.60 (-3.43%) |
| A5C | Cardiac Dreamer | 7.38 | 6.65 | 5.54 | 5.83 | 5.91 | 12.03 |
| A5C | + Our Pre-train | 7.15 (-3.19%) | 6.46 (-2.85%) | 5.40 (-2.41%) | 5.80 (-0.49%) | 5.88 (-0.54%) | 11.88 (-1.18%) |
| A3C | Cardiac Dreamer | 7.21 | 6.52 | 5.21 | 5.92 | 6.29 | 9.81 |
| A3C | + Our Pre-train | 6.90 (-4.34%) | 6.34 (-2.81%) | 5.08 (-2.44%) | 5.77 (-2.63%) | 6.12 (-2.74%) | 9.52 (-2.98%) |
| A2C | Cardiac Dreamer | 7.28 | 6.89 | 4.95 | 8.47 | 5.46 | 14.51 |
| A2C | + Our Pre-train | 7.04 (-3.32%) | 6.70 (-2.84%) | 4.83 (-2.35%) | 8.37 (-1.20%) | 5.33 (-2.45%) | 13.98 (-3.66%) |
Figure 4: Ablation of the pre-training objectives. The figure shows the relative change in MAE across six degrees of freedom for ten standard views, comparing different pre-training objectives with Cardiac Dreamer [9]. Smaller values indicate better performance. (a) Our proposed 2D-3D Joint Structure-aware pre-training. (b, c) Pre-training focused only on 2D or 3D structures.

Comparison with SOTA. To validate that the proposed structure-aware pre-training method benefits the acquisition of echocardiograms, we conducted comprehensive evaluations on the downstream probe guidance task. As shown in Tab. 1, the pre-trained model consistently achieved better or comparable results across all dimensions in the ten standard views. Notably, the largest observed improvement was 4.34% (x translation on the A3C plane). The pre-trained model was slightly weaker than Cardiac Dreamer in only one dimension each for the PSAX-MV and PSAX-APEX planes (by 0.04% and 0.07%, respectively). Despite this minor shortfall, the overall performance indicates that the model has effectively learned valuable information about the 2D and 3D structures of the heart. This acquired knowledge enhances the precision of probe guidance during cardiac ultrasound scanning tasks, thereby potentially supporting less experienced sonographers in acquiring high-quality echocardiograms in the future.
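The relative changes reported in parentheses in Tab. 1 follow the usual definition (ours − baseline) / baseline. The sketch below applies it to the rounded A3C x-translation entries. Because the inputs are the rounded table values, the result is close to but not identical to the reported -4.34%, which was presumably computed from unrounded errors.

```python
def relative_change(baseline, ours):
    """Relative change in percent; negative values mean a lower (better) MAE."""
    return 100.0 * (ours - baseline) / baseline


# A3C, translation x: Cardiac Dreamer 7.21 mm vs. pre-trained model 6.90 mm.
a3c_x = relative_change(7.21, 6.90)
```

With the rounded inputs this evaluates to roughly -4.30%.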

Ablations. To demonstrate the importance of each component in our proposed joint 2D-3D modeling approach, we decoupled the 2D and 3D modeling, focusing solely on either 2D or 3D structure modeling. As shown in Fig. 4, our method achieves improvements in almost all dimensions, whereas using only 2D or 3D modeling results in poorer performance in the rotation dimensions. In summary, while either 2D or 3D modeling alone enhances the model's performance to some extent, combining both in joint pre-training achieves the best results. This conclusion aligns with practical experience as well: 3D modeling helps in understanding the 3D structure of the heart, while 2D modeling enables the model to more accurately identify anatomical landmarks on standard views, thereby providing crucial guidance for probe positioning adjustments.

4 Conclusion and Discussion

In this work, we propose a 2D-3D joint structure-aware pre-training framework to enhance the cardiac world model's understanding of spatial relationships within two-dimensional structures on a single view and the three-dimensional spatial relationships between different views. We designed a self-supervised learning task that predicts visual features based on both two-dimensional and three-dimensional spatial information. To support large-scale self-supervised pre-training, we collected over a million ultrasound image and 3D pose data pairs. After pre-training on the large-scale dataset, consistent improvements were observed in the downstream probe guidance task across the ten standard views. In the future, we will attempt to validate our probe guidance model in real clinical settings, aiming to directly translate algorithmic improvements into enhanced medical outcomes or increased efficiency. Additionally, this algorithm has the potential to serve as the decision-making brain for autonomous ultrasound scanning robots, promoting the realization of fully autonomous cardiac scanning.

Acknowledgement.

This work was supported in part by the National Key R&D Program of China (2021ZD0140407) and the NSFC (62321005).

References

  • [1] Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., Ballas, N.: Self-supervised learning from images with a joint-embedding predictive architecture. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15619–15629 (2023)
  • [2] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
  • [3] Christensen, M., Vukadinovic, M., Yuan, N., Ouyang, D.: Vision–language foundation model for echocardiogram interpretation. Nature Medicine pp. 1–8 (2024)
  • [4] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [5] Droste, R., Drukker, L., Papageorghiou, A.T., Noble, J.A.: Automatic probe movement guidance for freehand obstetric ultrasound. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23. pp. 583–592. Springer (2020)
  • [6] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [7] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4700–4708 (2017)
  • [8] Jiang, H., Lin, Y., Han, D., Song, S., Huang, G.: Pseudo-q: Generating pseudo language queries for visual grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15513–15523 (2022)
  • [9] Jiang, H., Sun, Z., Jia, N., Li, M., Sun, Y., Luo, S., Song, S., Huang, G.: Cardiac copilot: Automatic probe guidance for echocardiography with world model. arXiv preprint arXiv:2406.13165 (2024)
  • [10] Jiang, H., Zhang, J., Huang, R., Ge, C., Ni, Z., Lu, J., Zhou, J., Song, S., Huang, G.: Cross-modal adapter for text-video retrieval. arXiv preprint arXiv:2211.09623 (2022)
  • [11] Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34, 9694–9705 (2021)
  • [12] Mitchell, C., Rahko, P.S., Blauwet, L.A., Canaday, B., Finstuen, J.A., Foster, M.C., Horton, K., Ogunyankin, K.O., Palma, R.A., Velazquez, E.J.: Guidelines for performing a comprehensive transthoracic echocardiographic examination in adults: recommendations from the american society of echocardiography. Journal of the American Society of Echocardiography 32(1), 1–64 (2019)
  • [13] Narang, A., Bae, R., Hong, H., Thomas, Y., Surette, S., Cadieu, C., Chaudhry, A., Martin, R.P., McCarthy, P.M., Rubenson, D.S., et al.: Utility of a deep-learning algorithm to guide novices to acquire echocardiograms for limited diagnostic use. JAMA cardiology 6(6), 624–632 (2021)
  • [14] Ouyang, D., He, B., Ghorbani, A., Yuan, N., Ebinger, J., Langlotz, C.P., Heidenreich, P.A., Harrington, R.A., Liang, D.H., Ashley, E.A., et al.: Video-based ai for beat-to-beat assessment of cardiac function. Nature 580(7802), 252–256 (2020)
  • [15] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [16] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)
  • [17] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8),  9 (2019)
  • [18] Roth, G.A., Johnson, C., Abajobir, A., Abd-Allah, F., Abera, S.F., Abyu, G., Ahmed, M., Aksut, B., Alam, T., Alam, K., et al.: Global, regional, and national burden of cardiovascular diseases for 10 causes, 1990 to 2015. Journal of the American college of cardiology 70(1), 1–25 (2017)
  • [19] Song, P., Fang, Z., Wang, H., Cai, Y., Rahimi, K., Zhu, Y., Fowkes, F.G.R., Fowkes, F.J., Rudan, I.: Global and regional prevalence, burden, and risk factors for carotid atherosclerosis: a systematic review, meta-analysis, and modelling study. The Lancet Global Health 8(5), e721–e729 (2020)
  • [20] Thirunavukarasu, A.J., Ting, D.S.J., Elangovan, K., Gutierrez, L., Tan, T.F., Ting, D.S.W.: Large language models in medicine. Nature medicine 29(8), 1930–1940 (2023)
  • [21] Yang, L., Jiang, H., Cai, R., Wang, Y., Song, S., Huang, G., Tian, Q.: Condensenet v2: Sparse feature reactivation for deep networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3569–3578 (2021)