Beijing Academy of Artificial Intelligence, Beijing, China
Structure-aware World Model for Probe Guidance via Large-scale Self-supervised Pre-train
Abstract
The complex structure of the heart leads to significant challenges in echocardiography, especially in acquiring cardiac ultrasound images. Successful echocardiography requires a thorough understanding of the structures on the two-dimensional plane and the spatial relationships between planes in three-dimensional space. In this paper, we propose a large-scale self-supervised pre-training method to acquire a cardiac structure-aware world model. The core innovation lies in constructing a self-supervised task that requires structural inference by predicting masked structures on a 2D plane and imagining another plane based on pose transformation in 3D space. To support large-scale pre-training, we collected over 1.36 million echocardiograms from ten standard views, along with their 3D spatial poses. In the downstream probe guidance task, we demonstrate that our pre-trained model consistently reduces guidance errors across the ten most common standard views on a test set of 0.29 million samples from 74 routine clinical scans, indicating that structure-aware pre-training benefits scanning.
Keywords: Echocardiography · World Model · Structural Understanding · Self-supervised Pre-train · Probe Guidance
1 Introduction
Cardiovascular diseases are the leading cause of death worldwide [18, 19]. Echocardiography is the most commonly used method in clinical practice to assess heart conditions. However, the structure of the heart is extremely complex. According to [12], up to seven anatomical structures need to be identified in a single plane, such as the Parasternal Long-axis Plane (Fig. 1 Left). Recognizing these anatomical structures is crucial for diagnosis, and understanding their spatial relationships in a two-dimensional plane also helps the sonographer fine-tune the probe to obtain the best quality images. Additionally, up to 27 different planes need to be examined, requiring the sonographer to understand the spatial relationships between these planes in three-dimensional space and to finely adjust the ultrasound probe’s position to reach the target location (Fig. 1 Right). For these reasons, cardiac ultrasound examinations are extremely challenging, and sonographers need a long training period to become familiar with both the 2D and 3D structures. Consequently, there is a significant talent shortage in this field, especially in regions with scarce medical resources, such as Africa.
With the rise of deep learning [4, 6, 7, 21], AI technology has shown great potential in improving the efficiency of echocardiography analysis. For example, Ouyang et al. [14] proposed a video-based deep learning algorithm, EchoNet-Dynamic, that can automatically make accurate assessments of cardiac function. More importantly, with the rise of large language models [2, 16, 17, 20] and multi-modal learning [8, 10, 11, 15], the interpretation of echocardiograms [3] has steadily improved, with excellent performance in tasks involving pulmonary artery pressure estimation, left ventricular hypertrophy, heart failure, and left atrial enlargement. These AI-assisted diagnostic tools have demonstrated performance almost comparable to human experts. However, this comes with a prerequisite: the acquisition of high-quality echocardiograms. In regions with scarce medical resources, there are often no sonographers available to obtain high-quality echocardiograms, so the capabilities of these AI-assisted diagnostic tools cannot be fully utilized. Few works [5, 9, 13] have focused on how to use AI technology to assist inexperienced sonographers in accurately acquiring target planes. Recently, Jiang et al. [9] proposed Cardiac Dreamer, which focuses only on supervised learning of the 3D structure of the heart and serves as a "heart map" for the probe guidance task. This work demonstrates a potential AI-assisted scanning method that is expected to improve the scanning skills of novices.
According to the clinical experience of doctors, understanding both the 2D and 3D structures is crucial for efficient scanning. For example, adjusting the probe’s viewing angle on a particular plane to capture specific anatomical structures requires a good understanding of the spatial positions of those structures on that plane (Fig. 1 Left). Thus, in this paper, we propose a 2D-3D joint structure-aware pre-training framework to obtain a data-driven cardiac world model that benefits ultrasound scanning. Specifically, we require the world model to learn important spatial relationships in the following ways: (1) for understanding two-dimensional structures, we use a masking approach that requires the world model to predict features at adjacent positions in the two-dimensional plane; (2) for understanding three-dimensional structures, we provide information on the positional change between two planes in 3D space, requiring the world model to predict the features of the target plane after the positional change. We further collected expert operational data on acquiring the ten most common standard planes from 364 routine clinical scans, resulting in 1.36 million sample pairs gathered by three certified sonographers, to enhance the model’s learning of generalizable cardiac structure knowledge. Results on the downstream probe guidance task indicate that the proposed structure-aware pre-training method learns useful knowledge for assisting in the acquisition of echocardiograms.
2 Method
In this section, we describe the proposed structure-aware pre-training framework, illustrated in Fig. 2. We first introduce the prior I-JEPA work [1], on which we based our method, in Section 2.1. Next, we discuss how we construct a pre-training framework that simultaneously learns 2D and 3D structural information by introducing three-dimensional spatial information in Section 2.2.
2.1 Preliminary
In the context of echocardiogram analysis, accurately understanding 2D structural information is fundamental for making correct diagnoses and conducting efficient scans. Recently, Assran et al. proposed the Image-based Joint-Embedding Predictive Architecture (I-JEPA) [1] to learn highly semantic image representations. The key idea of I-JEPA is to learn image representations by predicting the features of target blocks from a non-overlapping context block and the positional embeddings in the same image. This paradigm requires the model to understand the spatial relationships of different semantic structures in the 2D plane. For example, if a context block contains the head of a dog, it is likely that the body of the dog will be located below the context block. For echocardiogram analysis, this paradigm also enables the model to learn the spatial relationships of fine structures in the two-dimensional plane. For instance, the left ventricle (LV) is located below the interventricular septum (IVS), as shown on the left side of Fig. 1. Therefore, it is highly suitable for modeling two-dimensional structural information in cardiac ultrasound images. Next, we briefly introduce the modeling and training method.
Targets. Specifically, the I-JEPA model employs a Vision Transformer [4]. First, the input image is divided into non-overlapping patches, which are processed by the target encoder $f_{\bar{\theta}}$. Then, we select a portion of the patches to form $K$ target blocks, which may have overlapping regions. We denote the ground-truth features of the target blocks as $z_t$.
Context and Prediction. For the context features, patches not belonging to the target blocks are randomly selected and input into the context encoder $f_{\theta}$ to obtain features $z_c$. Subsequently, the predictor (world model) $g_{\phi}$ generates the features of the target blocks based on the context features and the positional embeddings, which indicate the locations of the context and target blocks. We denote the predicted features of the target blocks as $\hat{z}_t$.
Loss. With the ground-truth and predicted features of the target blocks, the model is optimized using the following loss, i.e., the average squared L2 distance used in I-JEPA [1]:
$$\mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in B_k} \left\lVert \hat{z}_t^{(j)} - z_t^{(j)} \right\rVert_2^2 \tag{1}$$
where $B_k$ is the set of patch indices belonging to the $k$-th target block.
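The objective can be illustrated with a minimal sketch, assuming Eq. (1) is the average squared L2 distance between predicted and ground-truth patch features used by I-JEPA [1]. The function name and the toy feature values are hypothetical and chosen only for illustration.

```python
def block_l2_loss(pred_blocks, target_blocks):
    """Average over target blocks of the summed squared L2 distance per patch.

    Each argument is a list of blocks; each block is a list of patch
    feature vectors (lists of floats) with matching shapes.
    """
    if len(pred_blocks) != len(target_blocks) or not pred_blocks:
        raise ValueError("need the same, non-zero number of blocks")
    total = 0.0
    for pred, target in zip(pred_blocks, target_blocks):
        for p_vec, t_vec in zip(pred, target):
            total += sum((p - t) ** 2 for p, t in zip(p_vec, t_vec))
    return total / len(pred_blocks)


# Toy example: two target blocks with 2-dimensional patch features.
pred = [[[1.0, 2.0], [0.0, 0.0]], [[1.0, 1.0]]]
target = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
loss = block_l2_loss(pred, target)  # block 1 contributes 5.0, block 2 contributes 0.0
```

In a real training loop the same quantity would be computed on tensors with gradients flowing only through the predictor and context encoder; that detail follows [1] and is not shown here.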
2.2 2D-3D Joint Structure-aware Pre-training
Echocardiography involves acquiring high-quality echocardiograms and subsequently conducting analysis and diagnosis based on these images. However, previous studies [3, 14] have primarily focused on understanding and analyzing the two-dimensional plane, neglecting how AI can assist in acquiring high-quality echocardiograms. While understanding the 2D structure of a plane can aid in scanning, especially when some features of the standard view are already visible, this alone is insufficient. Particularly when transitioning from one standard view to another, a deep understanding of the heart’s 3D structure is essential. Therefore, we propose a 2D-3D joint structure-aware pre-training framework, as illustrated in Fig. 2 (a). The core insight is to predict the visual features of structures at target locations based on given 2D and 3D positional conditions, thereby learning the mapping between spatial positions and visual features. Next, we provide details of the modeling and training method.
Input. Given a source image $x_s$, we select a target image $x_t$ from the same individual’s images. To ensure sufficient spatial pose variation between the two images, there must be an interval of at least 150 frames between their sequence numbers. Then, both images are divided into non-overlapping patches $\{x^{i}\}_{i=1}^{N}$.
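The pair-selection rule can be sketched as follows. This is a minimal illustration under the assumption that frames of one scan are indexed by consecutive sequence numbers; the function name and the uniform sampling strategy are hypothetical, and only the 150-frame minimum interval comes from the text.

```python
import random

MIN_FRAME_GAP = 150  # minimum sequence-number interval stated in the text


def sample_pair(num_frames, rng, min_gap=MIN_FRAME_GAP):
    """Return (source_idx, target_idx) from one scan with |src - tgt| >= min_gap."""
    if num_frames <= min_gap:
        raise ValueError("scan too short for the requested frame gap")
    # Keep only sources that have at least one valid partner in the scan.
    sources = [i for i in range(num_frames)
               if i >= min_gap or i + min_gap < num_frames]
    src = rng.choice(sources)
    # Valid targets lie at least `min_gap` frames before or after the source.
    candidates = (list(range(0, src - min_gap + 1))
                  + list(range(src + min_gap, num_frames)))
    return src, rng.choice(candidates)


rng = random.Random(0)
pairs = [sample_pair(400, rng) for _ in range(1000)]
```

Sampling both indices within a single scan keeps the pair inside one individual’s data, which matches the requirement that the target image comes from the same individual.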
Context and Target. First, we randomly select a rectangular region from the patches of the source image $x_s$ as the context information. These patches are then fed through the context encoder $f_{\theta}$, resulting in the context feature $z_c$:
$$z_c = f_{\theta}(M_c \odot x_s) \in \mathbb{R}^{N_c \times D} \tag{2}$$
where $M_c$ is a binary mask indicating the selected patches, $\odot$ denotes keeping the patches selected by the mask, $N_c$ is the number of context patches, and $D$ is the hidden dimension. Next, we select non-overlapping regions on the target image $x_t$ as the target blocks, requiring the model to understand both the three-dimensional structure of the heart and the spatial relationships of heart structures in the two-dimensional plane. If we instead choose regions on the target image that have the same spatial position as the context block, the model only needs to understand the three-dimensional spatial relationships. The target features $z_t$ are obtained with the target encoder $f_{\bar{\theta}}$ as follows:
$$z_t = M_t \odot f_{\bar{\theta}}(x_t) \in \mathbb{R}^{N_t \times D} \tag{3}$$
where $M_t$ is a binary mask indicating the selected patches and $N_t$ is the number of target patches.
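The binary masks can be illustrated with a short sketch. It assumes a 14x14 patch grid (for example a 224x224 input with 16x16 patches) and fixed block sizes; both are hypothetical values for illustration, since the paper takes its block-sampling hyper-parameters from [1].

```python
import random


def sample_block(grid_h, grid_w, block_h, block_w, rng):
    """Return the set of (row, col) patch indices of a random rectangle."""
    top = rng.randrange(grid_h - block_h + 1)
    left = rng.randrange(grid_w - block_w + 1)
    return {(r, c)
            for r in range(top, top + block_h)
            for c in range(left, left + block_w)}


def to_binary_mask(indices, grid_h, grid_w):
    """Flatten selected patch indices into a row-major 0/1 mask."""
    return [1 if (r, c) in indices else 0
            for r in range(grid_h) for c in range(grid_w)]


rng = random.Random(0)
GRID = 14  # hypothetical patch grid size
context = sample_block(GRID, GRID, 10, 10, rng)  # context block on the source image
target = sample_block(GRID, GRID, 4, 4, rng)     # one target block on the target image
context_mask = to_binary_mask(context, GRID, GRID)
target_mask = to_binary_mask(target, GRID, GRID)
num_context_patches = sum(context_mask)  # number of selected context patches
num_target_patches = sum(target_mask)    # number of selected target patches
```

The sketch samples the two blocks independently and does not enforce any relation between the context position and the target position; a 3D-only variant would simply reuse the context indices for the target.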
Condition and Prediction. The three-dimensional spatial relationship between the source image and the target image is denoted as $\Delta \in \mathbb{R}^{6}$, which encapsulates the translation and rotation changes in the x, y, and z directions. This vector is encoded by a pose encoder $h_{\psi}$ to obtain the pose embedding, denoted as $e_{\Delta}$. Then, the 2D positional embeddings indicating the locations of the context and target blocks are denoted as $p_c$ and $p_t$. Finally, the predicted features from the world model $g_{\phi}$ are obtained as follows:
$$\tilde{z}_c = z_c + p_c \tag{4}$$
$$\tilde{m} = m + p_t + e_{\Delta} \tag{5}$$
$$\hat{z}_t = g_{\phi}\left([\tilde{z}_c ; \tilde{m}]\right) \tag{6}$$
where $z_c$ is the context feature from Eq. (2), $[\cdot\,;\cdot]$ denotes concatenation along the token dimension, and $m$ is a set of learnable parameters representing the features to be predicted. These parameters interact with the context feature and the condition information to generate the target feature $\hat{z}_t$. Then, the models are optimized according to the loss function defined in Eq. (1).
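How the learnable tokens might meet the 2D and 3D conditions can be pictured with the toy sketch below. It assumes additive conditioning in the style of a standard ViT predictor and a linear pose encoder; these choices, together with every name and number in the code, are hypothetical illustrations rather than the authors' confirmed design.

```python
def linear(vec, weight, bias):
    """Apply y = W v + b, with the weight given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def build_predictor_inputs(context_feats, context_pos, target_pos,
                           mask_token, pose_emb):
    """Assemble predictor tokens: conditioned context tokens, then target queries."""
    context_tokens = [add(z, p) for z, p in zip(context_feats, context_pos)]
    target_queries = [add(mask_token, p, pose_emb) for p in target_pos]
    return context_tokens + target_queries


# Toy setup with hidden dimension 2 and a 6-D pose change (tx, ty, tz, rx, ry, rz).
pose_delta = [1.0, 0.0, 0.0, 0.0, 0.0, 0.5]
pose_weight = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 0.0, 0.0, 2.0]]
pose_emb = linear(pose_delta, pose_weight, [0.0, 0.0])

tokens = build_predictor_inputs(
    context_feats=[[0.5, 0.5]],
    context_pos=[[0.25, 0.5]],
    target_pos=[[1.0, 0.0], [0.0, 1.0]],
    mask_token=[0.0, 0.0],
    pose_emb=pose_emb,
)
```

The world model would then process `tokens` with transformer layers and read the predictions from the positions of the target queries.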
3 Experiments
3.1 Implementation Details
Dataset. In this paper, we collected data and conducted experiments on the ten most common standard planes [12], as shown in Fig. 3. The ultrasound images and scan data were acquired following the procedure described in [9]. Ultimately, we amassed data from 364 routine clinical scans, resulting in a total of 1.36 million image and 3D pose data pairs. The whole data collection process was approved and supervised by The University Science and Technology Ethics Committee. We split the dataset into 290 scans (1.07 million samples) for training and 74 scans (0.29 million samples) for testing. For pre-training, only the training set was utilized. Both the training and test sets were employed for the downstream probe guidance task. It is important to note that the individuals in the training and test sets are different to avoid information leakage and fairly validate the model’s generalization performance.
Pre-training. The context and target encoders were implemented using ViT-Small/16, and the world model utilized a custom vision transformer with a depth of 6 layers and a hidden dimension of 384. The entire model was trained for 50 epochs with a batch size of 1024 on 8 Nvidia RTX-4090 GPUs. The learning rate was warmed up over 7 epochs from a starting value of 1e-4 to 5e-4 and then decayed with a cosine scheduler to a final learning rate of 5e-7. The implementation details for generating the context block and target blocks, as well as the hyper-parameters, followed the procedures described in [1].
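The learning-rate schedule described above can be sketched as below. The numbers come from the text; the linear shape of the warmup and the per-epoch (rather than per-iteration) granularity are assumptions made for illustration.

```python
import math


def learning_rate(epoch, total_epochs=50, warmup_epochs=7,
                  start_lr=1e-4, peak_lr=5e-4, final_lr=5e-7):
    """Linear warmup from start_lr to peak_lr, then cosine decay to final_lr."""
    if epoch < warmup_epochs:
        return start_lr + (peak_lr - start_lr) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(51)]
```

Under these assumptions the rate starts at 1e-4, reaches its peak of 5e-4 at epoch 7, and falls monotonically to 5e-7 at epoch 50.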
Downstream task. For the probe guidance task, we adopted the framework and procedure proposed in [9], as shown in Fig.2 (b). The input for this task is an ultrasound image, and the output is the probe position adjustment needed to achieve a specific standard view. This task aims to assist junior ultrasound medical personnel in scanning, enhancing the success rate and quality of view acquisition. During fine-tuning, our pre-trained world model was loaded and optimized for 5 epochs with a batch size of 1024 on 8 Nvidia RTX-4090 GPUs. The learning rate was set to 1e-4, using a cosine scheduler with a final learning rate of 1e-6. The optimizer was set to AdamW.
Evaluation Metrics. The metric used for the probe guidance task is the Mean Absolute Error (MAE) between the predicted and ground-truth probe poses, reported separately for each translation and rotation dimension.
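A minimal sketch of the metric, assuming each pose is stored as a 6-tuple (x, y, z, rx, ry, rz) and that the error is averaged over samples independently per dimension; the function name and the toy values are hypothetical.

```python
def mean_absolute_error(pred_poses, true_poses):
    """Per-dimension MAE over samples; each pose is (x, y, z, rx, ry, rz)."""
    if len(pred_poses) != len(true_poses) or not pred_poses:
        raise ValueError("need the same, non-zero number of poses")
    n = len(true_poses)
    dims = len(true_poses[0])
    return [sum(abs(p[d] - t[d]) for p, t in zip(pred_poses, true_poses)) / n
            for d in range(dims)]


# Toy example with two samples (translations in mm, rotations in degrees).
pred = [(1.0, 2.0, 3.0, 4.0, 5.0, 6.0), (2.0, 2.0, 2.0, 2.0, 2.0, 2.0)]
true = [(0.0, 2.0, 5.0, 4.0, 4.0, 6.0), (2.0, 0.0, 2.0, 3.0, 2.0, 0.0)]
mae = mean_absolute_error(pred, true)
```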
3.2 Results
| Plane | Model | Translation x (mm) | Translation y (mm) | Translation z (mm) | Rotation rx (degree) | Rotation ry (degree) | Rotation rz (degree) |
|---|---|---|---|---|---|---|---|
| PLAX | Cardiac Dreamer [9] | 8.66 | 8.14 | 5.63 | 6.60 | 5.42 | 8.23 |
| PLAX | Our Pre-train | 8.39 (-3.15%) | 8.02 (-1.53%) | 5.53 (-1.80%) | 6.46 (-2.12%) | 5.34 (-1.42%) | 7.89 (-4.07%) |
| PSAX-AV | Cardiac Dreamer | 7.26 | 6.63 | 4.51 | 5.28 | 6.26 | 7.43 |
| PSAX-AV | Our Pre-train | 7.06 (-2.73%) | 6.57 (-0.97%) | 4.34 (-3.85%) | 5.28 (-0.11%) | 6.03 (-3.63%) | 7.18 (-3.25%) |
| PSAX-PV | Cardiac Dreamer | 7.59 | 6.58 | 4.71 | 5.47 | 5.67 | 8.54 |
| PSAX-PV | Our Pre-train | 7.49 (-1.38%) | 6.56 (-0.28%) | 4.66 (-1.08%) | 5.46 (-0.21%) | 5.52 (-2.73%) | 8.21 (-3.88%) |
| PSAX-MV | Cardiac Dreamer | 7.81 | 6.42 | 4.89 | 6.68 | 5.87 | 9.11 |
| PSAX-MV | Our Pre-train | 7.51 (-3.89%) | 6.42 (0.04%) | 4.77 (-2.53%) | 6.55 (-1.92%) | 5.78 (-1.63%) | 8.78 (-3.67%) |
| PSAX-PAP | Cardiac Dreamer | 7.18 | 6.07 | 4.43 | 6.35 | 5.39 | 8.53 |
| PSAX-PAP | Our Pre-train | 6.92 (-3.57%) | 5.97 (-1.63%) | 4.35 (-1.78%) | 6.11 (-3.88%) | 5.33 (-1.04%) | 8.42 (-1.31%) |
| PSAX-APEX | Cardiac Dreamer | 6.98 | 5.98 | 4.25 | 5.69 | 4.89 | 7.33 |
| PSAX-APEX | Our Pre-train | 6.85 (-1.88%) | 5.77 (-3.41%) | 4.11 (-3.28%) | 5.55 (-2.43%) | 4.90 (0.07%) | 7.25 (-1.11%) |
| A4C | Cardiac Dreamer | 7.72 | 7.17 | 5.45 | 5.64 | 4.89 | 8.91 |
| A4C | Our Pre-train | 7.55 (-2.14%) | 7.00 (-2.49%) | 5.36 (-1.70%) | 5.48 (-2.84%) | 4.86 (-0.55%) | 8.60 (-3.43%) |
| A5C | Cardiac Dreamer | 7.38 | 6.65 | 5.54 | 5.83 | 5.91 | 12.03 |
| A5C | Our Pre-train | 7.15 (-3.19%) | 6.46 (-2.85%) | 5.40 (-2.41%) | 5.80 (-0.49%) | 5.88 (-0.54%) | 11.88 (-1.18%) |
| A3C | Cardiac Dreamer | 7.21 | 6.52 | 5.21 | 5.92 | 6.29 | 9.81 |
| A3C | Our Pre-train | 6.90 (-4.34%) | 6.34 (-2.81%) | 5.08 (-2.44%) | 5.77 (-2.63%) | 6.12 (-2.74%) | 9.52 (-2.98%) |
| A2C | Cardiac Dreamer | 7.28 | 6.89 | 4.95 | 8.47 | 5.46 | 14.51 |
| A2C | Our Pre-train | 7.04 (-3.32%) | 6.70 (-2.84%) | 4.83 (-2.35%) | 8.37 (-1.20%) | 5.33 (-2.45%) | 13.98 (-3.66%) |
Comparison with SOTA. To validate that the proposed structure-aware pre-training method benefits the acquisition of echocardiograms, we conducted comprehensive evaluations on the downstream probe guidance task. As shown in Tab. 1, the pre-trained model achieved better or comparable results across all dimensions in the ten standard views. Notably, the largest observed improvement was 4.34% (A3C, x translation). The pre-trained model was slightly weaker than Cardiac Dreamer in only one dimension each for the PSAX-MV plane (y translation, 0.04%) and the PSAX-APEX plane (ry rotation, 0.07%). Despite this minor shortfall, the overall performance indicates that the model has effectively learned valuable information about the 2D and 3D structures of the heart. This acquired knowledge enhances the precision of probe guidance during cardiac ultrasound scanning tasks, thereby potentially supporting less experienced sonographers in acquiring high-quality echocardiograms in the future.
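The percentages reported next to the pre-trained results appear to be relative changes in MAE with respect to Cardiac Dreamer. The sketch below recomputes one of them from the rounded table entries; the function name is hypothetical, and the explanation for the small mismatch (percentages computed from unrounded errors) is an assumption rather than something the source states.

```python
def relative_change(baseline_mae, new_mae):
    """Relative change in percent; negative values mean a lower error."""
    return 100.0 * (new_mae - baseline_mae) / baseline_mae


# A3C, x translation: Cardiac Dreamer 7.21 mm vs. pre-trained model 6.90 mm.
a3c_x = relative_change(7.21, 6.90)
```

From the rounded entries this gives roughly -4.30%, close to the reported -4.34%.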
Ablations. To demonstrate the importance of each component in our proposed joint 2D-3D modeling approach, we decoupled the 2D and 3D modeling, focusing solely on either 2D or 3D structure modeling. As shown in Fig. 4, our method achieves improvements in almost all dimensions, whereas using only 2D or only 3D modeling results in poorer performance in the rotation dimensions. In summary, while either 2D or 3D modeling alone enhances the model’s performance to some extent, combining both in joint pre-training achieves the best results. This conclusion aligns with practical experience as well: 3D modeling helps in understanding the 3D structure of the heart, while 2D modeling enables the model to more accurately identify anatomical landmarks on standard views, thereby providing crucial guidance for probe positioning adjustments.
4 Conclusion and Discussion
In this work, we propose a 2D-3D joint structure-aware pre-training framework to enhance the cardiac world model’s understanding of spatial relationships within two-dimensional structures on a single view and the three-dimensional spatial relationships between different views. We designed a self-supervised learning task that predicts visual features based on both two-dimensional and three-dimensional spatial information. To support large-scale self-supervised pre-training, we collected over a million ultrasound image and 3D pose data pairs. After pre-training on the large-scale dataset, consistent improvements of up to 4.34% were observed in the downstream probe guidance task across the ten standard views. In the future, we will attempt to validate our probe guidance model in real clinical settings, aiming to directly translate algorithmic improvements into enhanced medical outcomes or increased efficiency. Additionally, this algorithm has the potential to serve as the decision-making brain for autonomous ultrasound scanning robots, promoting the realization of fully autonomous cardiac scanning.
Acknowledgement.
This work was supported in part by the National Key R&D Program of China (2021ZD0140407) and the NSFC (62321005).
References
- [1] Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., Ballas, N.: Self-supervised learning from images with a joint-embedding predictive architecture. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15619–15629 (2023)
- [2] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
- [3] Christensen, M., Vukadinovic, M., Yuan, N., Ouyang, D.: Vision–language foundation model for echocardiogram interpretation. Nature Medicine pp. 1–8 (2024)
- [4] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
- [5] Droste, R., Drukker, L., Papageorghiou, A.T., Noble, J.A.: Automatic probe movement guidance for freehand obstetric ultrasound. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23. pp. 583–592. Springer (2020)
- [6] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
- [7] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4700–4708 (2017)
- [8] Jiang, H., Lin, Y., Han, D., Song, S., Huang, G.: Pseudo-q: Generating pseudo language queries for visual grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15513–15523 (2022)
- [9] Jiang, H., Sun, Z., Jia, N., Li, M., Sun, Y., Luo, S., Song, S., Huang, G.: Cardiac copilot: Automatic probe guidance for echocardiography with world model. arXiv preprint arXiv:2406.13165 (2024)
- [10] Jiang, H., Zhang, J., Huang, R., Ge, C., Ni, Z., Lu, J., Zhou, J., Song, S., Huang, G.: Cross-modal adapter for text-video retrieval. arXiv preprint arXiv:2211.09623 (2022)
- [11] Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34, 9694–9705 (2021)
- [12] Mitchell, C., Rahko, P.S., Blauwet, L.A., Canaday, B., Finstuen, J.A., Foster, M.C., Horton, K., Ogunyankin, K.O., Palma, R.A., Velazquez, E.J.: Guidelines for performing a comprehensive transthoracic echocardiographic examination in adults: recommendations from the american society of echocardiography. Journal of the American Society of Echocardiography 32(1), 1–64 (2019)
- [13] Narang, A., Bae, R., Hong, H., Thomas, Y., Surette, S., Cadieu, C., Chaudhry, A., Martin, R.P., McCarthy, P.M., Rubenson, D.S., et al.: Utility of a deep-learning algorithm to guide novices to acquire echocardiograms for limited diagnostic use. JAMA cardiology 6(6), 624–632 (2021)
- [14] Ouyang, D., He, B., Ghorbani, A., Yuan, N., Ebinger, J., Langlotz, C.P., Heidenreich, P.A., Harrington, R.A., Liang, D.H., Ashley, E.A., et al.: Video-based ai for beat-to-beat assessment of cardiac function. Nature 580(7802), 252–256 (2020)
- [15] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
- [16] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)
- [17] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
- [18] Roth, G.A., Johnson, C., Abajobir, A., Abd-Allah, F., Abera, S.F., Abyu, G., Ahmed, M., Aksut, B., Alam, T., Alam, K., et al.: Global, regional, and national burden of cardiovascular diseases for 10 causes, 1990 to 2015. Journal of the American college of cardiology 70(1), 1–25 (2017)
- [19] Song, P., Fang, Z., Wang, H., Cai, Y., Rahimi, K., Zhu, Y., Fowkes, F.G.R., Fowkes, F.J., Rudan, I.: Global and regional prevalence, burden, and risk factors for carotid atherosclerosis: a systematic review, meta-analysis, and modelling study. The Lancet Global Health 8(5), e721–e729 (2020)
- [20] Thirunavukarasu, A.J., Ting, D.S.J., Elangovan, K., Gutierrez, L., Tan, T.F., Ting, D.S.W.: Large language models in medicine. Nature medicine 29(8), 1930–1940 (2023)
- [21] Yang, L., Jiang, H., Cai, R., Wang, Y., Song, S., Huang, G., Tian, Q.: Condensenet v2: Sparse feature reactivation for deep networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3569–3578 (2021)