License: CC BY-NC-ND 4.0
arXiv:2403.16499v2 [cs.CV] 07 Apr 2024

Self-Supervised Learning for Medical Image Data with Anatomy-Oriented Imaging Planes

Tianwei Zhang*, Dong Wei*, Mengmeng Zhu*, Shi Gu [email protected], Yefeng Zheng
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
Jarvis Research Center, Tencent YouTu Lab, Shenzhen 518057, China
* Tianwei Zhang, Dong Wei, and Mengmeng Zhu contributed equally.
Abstract

Self-supervised learning has emerged as a powerful tool for pretraining deep networks on unlabeled data, prior to transfer learning of target tasks with limited annotation. The relevance between the pretraining pretext and target tasks is crucial to the success of transfer learning. Various pretext tasks have been proposed to utilize properties of medical image data (e.g., three dimensionality), which are more relevant to medical image analysis than generic ones for natural images. However, previous work rarely paid attention to data with anatomy-oriented imaging planes, e.g., standard cardiac magnetic resonance imaging views. As these imaging planes are defined according to the anatomy of the imaged organ, pretext tasks effectively exploiting this information can pretrain the networks to gain knowledge on the organ of interest. In this work, we propose two complementary pretext tasks for this group of medical image data based on the spatial relationship of the imaging planes. The first is to learn the relative orientation between the imaging planes, implemented as regressing their intersecting lines. The second exploits parallel imaging planes to regress their relative slice locations within a stack. Both pretext tasks are conceptually straightforward and easy to implement, and can be combined in multitask learning for better representation learning. Thorough experiments on two anatomical structures (heart and knee) and representative target tasks (semantic segmentation and classification) demonstrate that the proposed pretext tasks are effective in pretraining deep networks for remarkably boosted performance on the target tasks, and are superior to other recent approaches.

Keywords: Anatomy-oriented imaging plane, transfer learning, self-supervised pretraining
Journal: Medical Image Analysis

1 Introduction

Medical image analysis (MedIA) has important clinical applications, including diagnosis (Silveira et al., 2009), quantitative analysis (Wei et al., 2013), prognosis (González et al., 2018), therapy planning (Jackson et al., 2018), and risk assessment (Klifa et al., 2010). Since manual analysis of medical image data in big amounts can be labor-intensive, time-consuming, and subjective, computer-aided automated methods are of great value. Benefiting from the progress of deep learning techniques, especially the deep neural networks (DNNs), automated MedIA methods have advanced remarkably in recent years (Litjens et al., 2017). However, to achieve satisfactory performance, DNNs often need large amounts of labeled data for effective training, which can be difficult and costly to obtain in practice.

Transfer learning is an effective technique when the data is insufficient for training DNNs from scratch (Tan et al., 2018), where the model is first pretrained on tasks with sufficient data and then fine-tuned on the target task with limited data and annotations. For transfer learning on natural images, pretrained models on the ImageNet (Russakovsky et al., 2015) are available for a variety of popular DNN structures and are routinely used nowadays. However, as the distance between the pretraining and target tasks plays a crucial role in transfer learning (Zhang et al., 2017), these models may become less effective when transferred to MedIA tasks due to the large gap between the two types of images. Therefore, researchers are confronted with a dilemma. On one hand, for effective transfer learning on medical images, models pretrained with medical images of highly relevant tasks are preferred. On the other hand, it is difficult to obtain large quantities of annotations for medical images for the pretraining. Fortunately, an emerging subfield of deep learning known as self-supervised learning (SSL) suggests a way out.

In SSL, the training tasks and supervision signals are defined by inherent properties of the data without any manual annotation (Jing and Tian, 2020). Therefore, SSL can pretrain the models on large amounts of unlabeled data with proper pretext tasks. To exploit unique properties of medical image data, various pretext tasks have been proposed. Jamaludin et al. (2017) proposed to train a Siamese network based on patient identity. More recently, Models Genesis (Zhou et al., 2019b) and the Rubik’s cube series (Zhu et al., 2020) presented generic pretext tasks for medical image data, especially volumetric images. However, previous works rarely paid attention to medical image data with anatomy-oriented imaging planes, e.g., magnetic resonance imaging (MRI) of various organs and body parts such as heart, knee, and shoulder (readers are referred to https://mrimaster.com/ for more details as well as more anatomical structures adopting anatomy-oriented imaging planes), which constitute a large portion of medical image data besides full 3D volumetric (e.g., CT) and 2D (e.g., X-ray) images. As such imaging planes are defined with respect to the anatomy of the imaged organ, pretext tasks that can effectively utilize this information are expected to be more relevant to potential target tasks (also known as downstream tasks) on the organ of interest than the generic ones.

In this work, we propose two pretext tasks for medical image data with anatomy-oriented imaging planes based on the spatial alignment relationship among multiple imaging planes. The first is to learn the relative orientation between the imaging planes. In clinical imaging, it is common to define anatomy-oriented view planes for obtaining comparable biometric measurements across populations. These are called standard view planes for a specific organ. For example, the neuro-imaging community defines the mid-sagittal plane in evaluation of pathological brains by estimating the departures from bilateral symmetry in the cerebrum (Stegmann et al., 2005). Similarly, standard views are used in cardiac magnetic resonance (CMR) imaging for quantification of cardiac volumetry, function, and blood flow (Kramer et al., 2020). A key component of acquiring these standard views is the identification of specific anatomical landmarks to prescribe the imaging planes. As a result, these imaging planes often intersect with each other at anatomically meaningful landmarks, and the intersecting lines can provide strong cues about the imaged organ. Therefore, we propose to predict the intersecting lines between the imaging planes by regressing a distance-based heatmap.

Our second pretext task is complementary to the first: it exploits the spatial relationship among parallel imaging planes (in contrast with that between intersecting ones). Specifically, we propose to regress the relative locations of the slices within a stack. To solve this task, the network must gain an understanding of the within-slice content (focused on the imaged organ) and the cross-slice context, and is thus better prepared for potential downstream tasks on the specific organ. Closely related to our work, Zhang et al. (2017) proposed pair-wise ordering of slices extracted from volumetric scans, for the downstream task of fine-grained body part recognition. However, their pretext task may encounter difficulty in handling objects with symmetrical structures, whereas we solve this problem with a small yet effective alteration, i.e., centrosymmetric mapping, for wider applicability. Another difference is that we directly regress the relative slice locations with a single network, instead of ordering paired slices with a Siamese architecture. In addition, we further investigate multi-task SSL combining both of the proposed pretext tasks, to fully exploit the two types of complementary spatial relationships among the imaging planes.

In summary, the prominent contribution of this work is the proposal of two complementary pretext tasks for self-supervised learning of medical image data with anatomy-oriented imaging planes. We hypothesize that these tasks are more relevant to potential downstream tasks than general transformation-and-recovery based pretext tasks (Zhou et al., 2019b; Noroozi and Favaro, 2016) on such data, thus expected to lead to better transfer learning. Besides, they are conceptually straightforward and easy to implement. For evaluation, we conduct thorough experiments for imaging-based analysis of two different anatomical structures (heart and knee) and two representative downstream tasks (semantic segmentation and classification). We not only investigate the impact of the proposed pretext tasks on the downstream tasks but also study their learnability. The results indicate that the proposed pretext tasks lead to better transfer learning than other recently proposed competitors, empirically confirming our hypothesis.

2 Related work

2.1 Transfer Learning

Transfer learning (Tan et al., 2018) is a powerful technique against shortage of training data, where the model parameters are first pretrained on a data-rich task before being fine-tuned on a downstream task with limited data. Its effectiveness has been demonstrated on diverse tasks (Yosinski et al., 2014; Oquab et al., 2014; Huh et al., 2016), typically with a large labeled dataset like the ImageNet (Russakovsky et al., 2015) for pretraining. For medical images, however, it is difficult to annotate data on such a large scale for supervised pretraining. Although it is possible to transfer models pretrained on natural images to medical images (Anthimopoulos et al., 2016; Shin et al., 2016), the huge domain gap between them may compromise the transfer efficacy (Chen et al., 2019). Therefore, pretraining methodologies that allow full exploitation of the large archive of medical image data in the absence of annotation are desirable. Self-supervised learning (SSL), an emerging deep learning paradigm that has recently thrived, appears to be a promising solution.

2.2 SSL for Natural Images

In SSL, the supervision signal is defined by intrinsic properties of the data, eliminating the need for extrinsic annotation (Jing and Tian, 2020). Therefore, SSL is suitable for pretraining a model on large amounts of unlabeled data (Chen et al., 2020b). Design of pretext tasks is crucial in SSL. An effective pretext task should make good use of a certain data property, while being learnable yet non-trivial. Various pretext tasks have been proposed and proved effective for natural images, including predicting relative positions of two image patches (Doersch et al., 2015), solving image jigsaw puzzles (Noroozi and Favaro, 2016), colorizing grayscale images (Zhang et al., 2016), and predicting image rotations (Hendrycks et al., 2019). Recently, we have witnessed a surge of SSL approaches based on contrastive/similarity learning of different views augmented from the same images (Chen et al., 2020a; Chaitanya et al., 2020; Grill et al., 2020). In addition, the masked autoencoder (MA) was proven successful as a scalable self-supervised vision learner, where the pretext task was to reconstruct the original image given its partial observation (He et al., 2022). Despite their success, these pretext tasks were primarily designed for general-purpose transfer learning, especially of natural images, and did not consider the unique characteristics of anatomy-oriented medical image data.

2.3 SSL for Medical Images

Among the differences between natural images and many medical images, one of the most notable is the 3D spatial property of the latter. Various pretext tasks have been proposed to exploit 3D information of volumetric medical images. A series of pretext tasks of solving a Rubik’s cube was constructed from volumetric input in (Zhu et al., 2020; Tao et al., 2020) for generic 3D medical image analysis. Models Genesis (Zhou et al., 2019b) is another example of pretext tasks designed for generic 3D medical image analysis, where the input underwent a series of random transformations and the task was to restore the original input. In contrast, Spitzer et al. (2018) proposed to predict the geodesic distance along the brain surface between two patches randomly sampled from the cortex of the same brain using a Siamese network, for improving cytoarchitectonic segmentation of human brain areas, which is an example of pretext tasks exclusively tailored for target downstream tasks.

However, none of the above-described pretext tasks utilized the spatial relationship of a set of intersecting slices with anatomy-oriented imaging planes, instead of a full 3D volume. Since the intersecting imaging planes are prescribed concerning the structural landmarks of the imaged organ, pretext tasks based on their spatial relationship are expected to teach the networks about the anatomy of the organ. In a pioneering work—in fact, the only one that we are aware of—along this line, Bai et al. (2019) proposed to define a segmentation pretext task based on the relative orientation between intersecting CMR planes and achieved encouraging results on the downstream task of accurate ventricle segmentation. Concretely, they defined square boxes along the intersecting lines as the segmentation targets. We share the same motivation of utilizing the spatial relationship between anatomy-oriented imaging planes for SSL but build upon the recent progress in DNN-based keypoint detection to regress a heatmap defined by the distance to the intersecting line, which can be considered a continuous and softened version of Bai et al. (2019)’s pretext task.

2.4 Keypoint Detection

In keypoint detection (Zhou et al., 2018), the input image is fed to a fully convolutional network (Long et al., 2015) to generate a multi-channel heatmap, where each channel stands for a keypoint and the peaks indicate keypoint locations. The network is trained in a fully supervised way by a Gaussian heatmap defined with ground truth keypoint locations, where each keypoint defines the mean of a Gaussian kernel. Besides the prevalent application to well-defined semantic keypoints, e.g., human joints (Newell et al., 2016, 2017), there is a recent trend of general implicit keypoint detection applied to the task of object detection (Zhou et al., 2018; Law and Deng, 2018; Zhou et al., 2019a). Keypoint-based object detection eliminated two drawbacks of the anchor-box-based methods (Law and Deng, 2018): (i) the imbalance between positive and negative anchor boxes and the resulting slow convergence, and (ii) the complicated hyperparameters and design choices of the anchor boxes, including number, size, and aspect ratio. In this work, we extend the concept of DNN-based keypoint detection to line detection, i.e., detection of the intersecting lines between two imaging planes.

2.5 Multi-Task SSL

Joint learning of multiple related tasks has proven effective in learning more robust feature representations for better generalization, thus leading to improved performance (Ruder, 2017). Therefore, it is no surprise that multi-task learning (MTL) has also been actively explored in SSL. Doersch and Zisserman (2017) explored multi-task SSL for natural images with four pretext tasks, and showed that even a naive multi-head architecture could achieve consistent improvement in performance. As to medical image analysis, Li et al. (2020) proposed ColorMe, a framework combining spatial context and color distribution of scopy images for SSL. However, ColorMe was proposed for color images and inapplicable to grayscale images such as magnetic resonance imaging (MRI). In this work, we also explore multi-task SSL where the pretext task of regressing relative slice orientations is combined with a second pretext task of regressing relative slice locations.

Fig. 1: Network architecture for SSL of the proposed pretext tasks (FC: fully connected layer). Top: relative orientation regression; bottom: relative location regression.
Fig. 2: Standard CMR views. (a) A mid-ventricular SAX view: the straight lines are the intersecting lines with the 2C and 4C views in (b) and (c), respectively. (b)–(c) Standard 2C and 4C views: the parallel lines indicate intersecting lines with the stack of SAX views (with normalized relative locations marked), in which the white line indicates the SAX view in (a). (d) 3D visualization of the images in (a)–(c).

3 Preliminaries

As introduced earlier, anatomy-oriented imaging planes are commonly used in clinical practice for obtaining comparable biometric measurements across populations, which are referred to as the standard views for a specific organ. Take CMR for example, which is the standard for quantification of cardiac volumetry, function, and blood flow (Gerche et al., 2013). Cardiac pathologies are often best evaluated along the principal axes of the heart in the long-axis (LAX) and short-axis (SAX) views, rather than the axial, coronal, or sagittal plane defined with respect to the body axes. The most commonly used standard CMR views include a stack of SAX views (Fig. 2(a)), a two-chamber (2C; Fig. 2(b)) LAX view, and a four-chamber (4C; Fig. 2(c)) LAX view. The SAX views are perpendicular to the long axis of the left ventricle (LV), whereas the LAX views are along the long axis. These views provide complementary information for a comprehensive evaluation of the heart. Although different protocols are used for prescribing the standard SAX and LAX view planes by different vendors and institutions, the general consensus is that both the SAX and LAX views should be tailored to the unique individual anatomy (Kramer et al., 2020). For example, the 4C plane should pass through the center of the LV and the right ventricle (RV) apex in the SAX views, whereas the 2C plane should bisect the LV while in parallel to the ventricular septum (see the intersecting lines in Fig. 2(a)). In addition, the stack of SAX views should cover the LV from the base through the apex with an even spacing (Figs. 2(b) and (c)).

The spatial information of a modern medical image is fully recorded when it is saved properly, e.g., in the Neuroimaging Informatics Technology Initiative (NIfTI; https://nifti.nimh.nih.gov/) or Digital Imaging and Communications in Medicine (DICOM; https://www.dicomstandard.org/) format. Particularly, DICOM is the international standard for medical images and related information, and is used by almost all radiographs nowadays. It defines the formats for medical images that can be exchanged with the data and quality necessary for clinical use. The DICOM header contains two attributes that record the location and orientation of a medical image: (i) Image Position Patient (IPP): the $x$, $y$, and $z$ coordinates of the upper left corner (center of the first voxel transmitted) of the image with respect to the reference coordinate system (RCS), a patient-based coordinate system; and (ii) Image Orientation Patient (IOP): the direction cosines of the first row and the first column of the image with respect to the RCS. Using IPP and IOP, the spatial relationship between any two radiographs (of the same exam) can be readily computed.
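As an illustration of how IPP and IOP determine the spatial relationship between two images, the following minimal sketch computes the line along which the plane of one image cuts another image, expressed in the pixel coordinates of the latter. It is not the authors' code. The function name `intersection_line_coeffs`, the use of the DICOM Pixel Spacing attribute to convert millimetres to pixels, and the convention that x is the column index and y the row index are all assumptions made for this example.

```python
import numpy as np


def intersection_line_coeffs(ipp_a, iop_a, spacing_a, ipp_b, iop_b):
    """Coefficients (A, B, C) of the line where plane b cuts image a.

    The line satisfies A*x + B*y + C = 0, with x the column index and
    y the row index of image a (an assumed pixel convention).
    """
    ipp_a = np.asarray(ipp_a, dtype=float)
    ipp_b = np.asarray(ipp_b, dtype=float)
    iop_a = np.asarray(iop_a, dtype=float)
    iop_b = np.asarray(iop_b, dtype=float)

    row_dir_a, col_dir_a = iop_a[:3], iop_a[3:]  # direction cosines of image a
    normal_b = np.cross(iop_b[:3], iop_b[3:])    # unit normal of plane b
    row_spacing, col_spacing = spacing_a         # DICOM Pixel Spacing order (assumed input)

    # A pixel (x, y) of image a sits at
    #   P = ipp_a + x * col_spacing * row_dir_a + y * row_spacing * col_dir_a.
    # Substituting P into the plane equation normal_b . (P - ipp_b) = 0 gives:
    A = col_spacing * float(normal_b @ row_dir_a)
    B = row_spacing * float(normal_b @ col_dir_a)
    C = float(normal_b @ (ipp_a - ipp_b))
    if np.isclose(A, 0.0) and np.isclose(B, 0.0):
        raise ValueError("planes are parallel; no intersecting line")
    return A, B, C


# Toy check: image a is the plane z = 0, plane b is the plane x = 5 mm.
coeffs = intersection_line_coeffs(
    ipp_a=(0, 0, 0), iop_a=(1, 0, 0, 0, 1, 0), spacing_a=(1.0, 1.0),
    ipp_b=(5, 0, 0), iop_b=(0, 1, 0, 0, 0, 1),
)
```

In the toy check, the returned coefficients describe the vertical line x = 5 in image a, as expected for 1 mm pixels.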

4 Methods

In this section, we propose two novel pretext tasks—regressing relative orientations and relative locations—for medical images with anatomy-oriented view planes (Fig. 1). Both tasks are based on the spatial information self-contained in the radiographs, thus needing no manual annotation. Meanwhile, both tasks require an understanding of the image contents to accomplish. Therefore, learning them can prepare the networks for relevant downstream target tasks, e.g., semantic segmentation and classification as demonstrated in this work, leading to better transfer learning. In addition, the two tasks can be combined for MTL, further boosting the SSL. For consistency with the previous section, we continue to use CMR as the example for a description of the proposed pretext tasks.

4.1 Relative Orientation Regression

The first pretext task is to predict the relative orientation of an imaging plane within another intersecting one. As illustrated in Fig. 2(a), the 2C LAX view bisects the LV while in parallel to the ventricular septum in SAX views, and the 4C LAX view bisects the LV while passing through the RV apex. Given a SAX image as input, we propose to train the networks to predict its intersecting lines with the 2C and 4C LAX views, which can be readily computed using their IPPs and IOPs. The intuition is that, to correctly predict the intersecting lines, the networks have to gain an understanding of the cardiac structures based on which the anatomy-oriented view planes are prescribed, such as the LV, RV, ventricular septum, and the myocardium. Once well trained by the pretext task, the networks are expected to be better prepared (pretrained) for potential downstream tasks, e.g., multi-structural segmentation.

To define the regression ground truth for each input SAX image, we first compute its intersection lines with both the 2C and 4C LAX view planes, obtaining two straight lines across the SAX image. Then, a heatmap is constructed from each of these two lines (see Fig. 4 for examples), based on the distance to the line and a Gaussian kernel:

$$G(x,y)=\exp\big[-(Ax+By+C)^{2}/\big(2\sigma^{2}(A^{2}+B^{2})\big)\big], \tag{1}$$

where $(x,y)$ denotes the coordinate of a pixel, $Ax+By+C=0$ is the equation of the intersection line in the coordinate system of the SAX image, $A$, $B$, and $C$ are real-number coefficients in the standard form of the equation of a line, and $\sigma$ is the standard deviation of the Gaussian kernel. Through preliminary experiments, we find that the transfer learning performance on our downstream tasks is not sensitive to the exact value of $\sigma$, and simply fix it to 6 pixels in this work. Similar to the Gaussian-based heatmaps commonly used in the keypoint detection literature (Pfister et al., 2015), using the “softened” ground truth as the training target imposes less penalty when the prediction gets closer to the exact location, thus encouraging the prediction to gradually approach the desired status. A fully convolutional network (Long et al., 2015) with the standard encoder-decoder architecture (e.g., the commonly used U-Net (Ronneberger et al., 2015)) can be employed to regress the two heatmaps (Fig. 1 top). The output has two channels, one for each heatmap. Similar to (Pfister et al., 2015), an L2 loss is employed to train the network:

$$\mathcal{L}_{\mathrm{ori}}=\frac{1}{NK|\Omega|}\sum_{i=1}^{N}\sum_{k=1}^{K}\sum_{(x,y)\in\Omega}\big\|G_{i,k}(x,y)-\hat{G}_{i,k}(x,y)\big\|^{2}, \tag{2}$$

where $N$ is the total number of slices, $K$ is the total number of heatmaps to predict (also the number of output channels), $(x,y)$ iterates over all the pixels in the input image domain $\Omega$, and $\hat{G}$ is the predicted heatmap.
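The heatmap of Eq. (1) and the loss of Eq. (2) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names `line_heatmap` and `orientation_loss`, the (N, K, H, W) array layout, and the convention that x indexes columns and y indexes rows are assumptions.

```python
import numpy as np


def line_heatmap(A, B, C, height, width, sigma=6.0):
    """Gaussian heatmap of Eq. (1) for the line A*x + B*y + C = 0."""
    ys, xs = np.mgrid[0:height, 0:width]  # y = row index, x = column index
    # Squared point-to-line distance for every pixel.
    sq_dist = (A * xs + B * ys + C) ** 2 / (A ** 2 + B ** 2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def orientation_loss(target, pred):
    """L2 loss of Eq. (2); both arrays have shape (N, K, H, W)."""
    # The mean over all axes equals the 1 / (N * K * |Omega|) normalization.
    return float(np.mean((target - pred) ** 2))


# Heatmap for the vertical line x = 5 on an 8 x 12 image.
heat = line_heatmap(1.0, 0.0, -5.0, height=8, width=12)
```

Pixels on the line receive the value 1, and the value decays with the distance to the line at a rate set by `sigma` (6 pixels in this work).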

4.2 Relative Location Regression

For medical imaging employing anatomy-oriented view planes, it is common that a stack of parallel slices is prescribed to fully cover the structure of interest. For example, in CMR, a stack of SAX views is prescribed to cover the LV from the base to the apex (Figs. 2(b) and (c)). Naturally, the consecutive views reflect the gradual anatomical changes in sequence (Fig. 3), and a trained observer can speculate about the relative location of a specific slice by its appearance, with respect to the entire structure of interest. Driven by this observation, the second pretext task we propose is to predict the relative location of an input slice within the parallel stack, which is complementary to the first pretext task.

Fig. 3: Left to right: four SAX CMR slices from the base (relative location = 0.0) to the apex (relative location = 1.0) of the LV.

Practically, we consider two situations when defining the relative location. The first situation is just like the CMR illustrated above, in which we define the relative location (denoted by $l$) of a specific slice to be its normalized location within the stack:

$$l_{s_i}=d(s_i,s_1)/d(s_N,s_1), \tag{3}$$

where $s_i$ is the $i$th slice, $N$ is the total number of slices of the stack, and $d(\cdot,\cdot)$ is the Euclidean distance between two parallel slices. Hence, the relative locations of the entire stack of slices are normalized to span the range of $[0,1]$ for different subjects. In the second situation, the slices may be symmetrically similar about a “mirror” point. For example, in sagittal knee MRI, the slices on both sides of the middle slice can be difficult to differentiate. In this case, we use the sine function to map the raw distance ratio by:

$$l_{s_i}=\sin\big[\pi\cdot d(s_i,s_1)/d(s_N,s_1)\big]. \tag{4}$$

In this way, $l_{s_i}$ becomes symmetric about the central slice while still spanning the full range of $[0,1]$, with symmetric slices on different sides of the mirror point mapped to similar values, and the pretext task is made more generically applicable. Again, the relative slice locations are defined by mining the spatial attributes recorded in the DICOM header. Therefore, no manual annotation is needed.
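A minimal sketch of Eqs. (3) and (4) follows. It assumes that the slices are passed in stack order (from $s_1$ to $s_N$), that the stack holds at least two slices at distinct positions, and that the distance between two parallel slices is measured along their common normal, computed from IOP. The function name `relative_locations` and its `symmetric` flag are hypothetical.

```python
import numpy as np


def relative_locations(ipps, iop, symmetric=False):
    """Relative slice locations of a parallel stack, per Eq. (3) or Eq. (4).

    ipps: (N, 3) Image Position Patient values, ordered from s_1 to s_N.
    iop:  the six direction cosines shared by all slices of the stack.
    """
    ipps = np.asarray(ipps, dtype=float)
    iop = np.asarray(iop, dtype=float)
    normal = np.cross(iop[:3], iop[3:])            # common slice normal
    # Distance of every slice to the first one, measured along the normal.
    dist_to_first = np.abs((ipps - ipps[0]) @ normal)
    ratio = dist_to_first / dist_to_first[-1]      # Eq. (3)
    if symmetric:
        return np.sin(np.pi * ratio)               # Eq. (4)
    return ratio


# Five axial toy slices spaced 2 mm apart.
toy_ipps = [(0, 0, 0), (0, 0, 2), (0, 0, 4), (0, 0, 6), (0, 0, 8)]
toy_iop = (1, 0, 0, 0, 1, 0)
linear = relative_locations(toy_ipps, toy_iop)
mirrored = relative_locations(toy_ipps, toy_iop, symmetric=True)
```

With the sine mapping, the second and fourth toy slices receive the same target value, which is the intended behavior for structures that look alike on both sides of the middle slice.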

Given an input slice $s_i$, we let the network regress its relative location and define the loss function as:

$$\mathcal{L}_{\mathrm{loc}}=\frac{1}{N}\sum_{i=1}^{N}\big\|l_{s_i}-\hat{l}_{s_i}\big\|^{2}, \tag{5}$$

where $\hat{l}_{s_i}$ is the network’s prediction. Implementation-wise, we pass the feature map obtained at the end of the encoder through a pooling layer to convert it to a feature vector, followed by a fully connected layer to regress the relative location (Fig. 1 bottom). We expect that training the network with this pretext task should also prepare it for relevant downstream tasks, since visually speculating about the relative slice locations requires knowledge about the organ of interest and its surroundings.

4.3 Multitask SSL

We further explore integration of the two complementary pretext tasks for multitask SSL, considering that both tasks require certain understanding of the organ of interest. With better pretraining via MTL, the network is expected to yield better performance when transferred and fine-tuned on potential downstream tasks. The loss function for the MTL is straightforward, i.e., the summation of the losses defined individually for the two pretext tasks:

$$\mathcal{L}_{\mathrm{MTL}}=\mathcal{L}_{\mathrm{ori}}+\mathcal{L}_{\mathrm{loc}}. \tag{6}$$

In this work, an equal weight of 1 is used for $\mathcal{L}_{\mathrm{ori}}$ and $\mathcal{L}_{\mathrm{loc}}$ as we notice that they are generally of the same magnitude. It is also worth mentioning that the pretext tasks are not dependent on any specific scanner manufacturer or model, as the relative orientation and location used as supervision signals are computed only between slices scanned in the same study by the same machine.
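The combined objective of Eq. (6) can be sketched as below, with both terms computed as mean squared errors following Eqs. (2) and (5). The function name `multitask_loss` and the weight arguments `w_ori` and `w_loc` are hypothetical; the weights are exposed only to make the equal weighting of 1 explicit.

```python
import numpy as np


def multitask_loss(heat_true, heat_pred, loc_true, loc_pred, w_ori=1.0, w_loc=1.0):
    """Sum of the orientation loss (Eq. 2) and the location loss (Eq. 5).

    heat_true, heat_pred: heatmaps of shape (N, K, H, W).
    loc_true, loc_pred:   relative slice locations of shape (N,).
    """
    heat_true, heat_pred = np.asarray(heat_true, float), np.asarray(heat_pred, float)
    loc_true, loc_pred = np.asarray(loc_true, float), np.asarray(loc_pred, float)
    loss_ori = np.mean((heat_true - heat_pred) ** 2)  # Eq. (2)
    loss_loc = np.mean((loc_true - loc_pred) ** 2)    # Eq. (5)
    return float(w_ori * loss_ori + w_loc * loss_loc)  # Eq. (6) when both weights are 1


# Toy batch of two slices with two heatmap channels each.
toy_loss = multitask_loss(
    heat_true=np.zeros((2, 2, 4, 4)), heat_pred=np.full((2, 2, 4, 4), 0.5),
    loc_true=[0.0, 1.0], loc_pred=[0.5, 0.5],
)
```

In the toy batch, each term contributes 0.25, which mirrors the observation that the two losses are of comparable magnitude.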

4.4 Fine-Tuning on Target Tasks

After the networks are well trained with the proposed pretext tasks, they can be fine-tuned with limited annotations for better performance on target downstream tasks. In this work, we evaluate the efficacy of the proposed pretext tasks for transfer learning on two fundamental downstream tasks in medical image analysis: semantic segmentation and classification (i.e., imaging-based diagnosis). For the former, both the pretrained encoder and decoder are used, with the regression head replaced by a multiclass segmentor consisting of a 1×1 convolution layer followed by softmax. We use the cross-entropy loss for training. For the latter, only the pretrained encoder is needed. We reinitialize the fully connected layer and affix a sigmoid function to convert the output to classification probabilities. A weighted binary cross-entropy loss is employed to account for imbalanced class sizes, in which the loss for a sample is scaled inversely proportional to the prevalence of its class in the dataset. The encoder and decoder parameters are adjusted during fine-tuning instead of fixed.
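One possible reading of the weighted binary cross-entropy loss is sketched below. The exact normalization of the class weights is not specified in the text, so the choice here (weight = 1 / prevalence of the sample's class) is an assumption, as are the function name `weighted_bce` and the `pos_prevalence` argument, which stands for the fraction of positive samples in the training set.

```python
import numpy as np


def weighted_bce(labels, probs, pos_prevalence, eps=1e-7):
    """Binary cross-entropy with per-sample weights inverse to class prevalence.

    labels:         binary ground truth of shape (N,).
    probs:          sigmoid outputs of shape (N,).
    pos_prevalence: fraction of positive samples in the dataset, in (0, 1).
    """
    labels = np.asarray(labels, dtype=float)
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    # Rare classes get larger weights (assumed normalization: 1 / prevalence).
    weights = np.where(labels == 1.0, 1.0 / pos_prevalence, 1.0 / (1.0 - pos_prevalence))
    bce = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
    return float(np.mean(weights * bce))


# One positive among four samples, with an uninformative prediction of 0.5.
toy_bce = weighted_bce(labels=[1, 0, 0, 0], probs=[0.5, 0.5, 0.5, 0.5], pos_prevalence=0.25)
```

With this weighting, the single positive sample contributes as much to the loss as the three negative samples together.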

5 Experiments

5.1 Materials

To evaluate the efficacy of the proposed self-supervising pretext tasks on transfer learning of medical image analysis of different body parts, we use cardiac and knee MRI for thorough experiments, which are representative of medical image data with anatomy-oriented imaging planes.

5.1.1 CMR

For self-supervised pretraining, the Data Science Bowl Cardiac Challenge (DSBCC; https://www.kaggle.com/c/second-annual-data-science-bowl/overview) dataset is used. The dataset comprises cine images of more than 1,000 subjects in DICOM format, with a diverse representation of individual variations, including subjects from young to old, images from numerous health centers, and hearts of normal to abnormal cardiac function. The images were acquired with two scanners (Siemens Area 1.5 T and Siemens Skyra 3.0 T). The official training, validation, and test sets include 500, 200, and 440 CMR exams, respectively. Our pretext tasks here are to regress the relative orientations of the 2C and 4C views in the SAX views, and regress the relative locations of the SAX views. Accordingly, the inclusion criterion is that the exam should include all of the SAX, 2C, and 4C views. Thus, 451, 176, and 381 exams are included in this work from the official training, validation, and test sets, respectively. In addition, we combine the official training and validation data for training, and use the official test data for validation of whether the pretext tasks can be successfully learned by the networks. As no segmentation annotation is provided, the DSBCC dataset is only used for self-supervised pretraining.

For the downstream target tasks (multi-structural segmentation and abnormality diagnosis in SAX CMR images), we use the Automated Cardiac Diagnosis Challenge (ACDC) dataset (Bernard et al., 2018). The dataset consists of 100 CMR exams acquired with two scanners (Siemens Area 1.5 T and Siemens Trio Tim 3.0 T) at the University Hospital of Dijon, including five evenly distributed subgroups: normal, previous myocardial infarction, dilated cardiomyopathy, hypertrophic cardiomyopathy, and abnormal right ventricle. The cine images include a series of SAX views covering the LV from the base to the apex, but no LAX views. Images at the end-systolic (ES) and end-diastolic (ED) phases are provided with manual annotations of three cardiac structures (LV, RV, and myocardium). Note that the ACDC dataset does not record the spatial information and thus cannot be used for the proposed pretraining. As the official test set of the ACDC challenge is not publicly available, we divide the official training set into training, validation, and test sets comprising 64, 16, and 20 subjects, respectively.

Table 1: Acquisition protocols of the CMR and knee MRI datasets: DSBCC, ACDC (Bernard et al., 2018), fastMRI (Zbontar et al., 2018), and Stanford (Bien et al., 2018).

| | CMR: DSBCC | CMR: ACDC | Knee: fastMRI | Knee: Stanford |
|---|---|---|---|---|
| Width (pixel) | 166–704 | 154–512 | 182–1024 | 192–224 |
| Height (pixel) | 166–704 | 154–428 | 192–1024 | 384–512 |
| Field strength (tesla) | 1.5 or 3.0 | 1.5 or 3.0 | 1.5 or 3.0 | 1.5 or 3.0 |
| Pixel spacing (mm) | 0.60–1.80 | 0.70–1.92 | 0.146–1.237 | 0.293–0.313 |
| Slice thickness (mm) | 4–11 | 5–10 | 2.0–5.0 | 2.5–3.5 |
| Slice gap (mm) | 7.98, 8, 10 | 0 or 5 | 2.20–6.25 | 0, 0.5, 1 |
| No. frames per cycle | 9–30 | 28–40 | - | - |
| No. slices per stack | 8–18 | 6–18 | 15–62 | 24–42 |

More details about the image acquisition protocols are presented in Table 1. We unify the in-plane resolutions of all SAX images in the two datasets via cubic interpolation to $1.260\times 1.260$ mm and $1.367\times 1.367$ mm, respectively, according to the modes of the image resolution distributions. We then unify the size of the resampled images to $224\times 224$ pixels, either by central cropping or zero padding. Again, the image size is determined according to the mode of the image size distribution, after the resolution is unified. Lastly, the z-score standardization is conducted on each image to make the pixel intensities have a zero mean and unit standard deviation. We do not unify the slice thickness.
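The size unification and standardization steps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the cubic resampling step is omitted, images are plain nested lists, and the helper names `crop_or_pad` and `zscore` are hypothetical. The order (crop or pad first, then standardize) follows the description in the text.

```python
import math

def crop_or_pad(image, size):
    """Center-crop or zero-pad a 2D list of lists to size x size."""
    def fit(length):
        # Returns (source start, destination start, number of elements copied).
        if length >= size:
            return (length - size) // 2, 0, size
        return 0, (size - length) // 2, length

    src_r, dst_r, n_r = fit(len(image))
    src_c, dst_c, n_c = fit(len(image[0]))
    out = [[0.0] * size for _ in range(size)]
    for i in range(n_r):
        for j in range(n_c):
            out[dst_r + i][dst_c + j] = float(image[src_r + i][src_c + j])
    return out

def zscore(image, eps=1e-8):
    """Standardize one image to zero mean and (approximately) unit standard deviation."""
    values = [v for row in image for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [[(v - mean) / (std + eps) for v in row] for row in image]

# A 2 x 6 toy image is padded along rows and cropped along columns to 4 x 4.
toy = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
resized = crop_or_pad(toy, 4)
standardized = zscore(resized)
```

For the CMR data the target size would be 224; a tiny size is used here only to keep the example readable.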

5.1.2 Knee MRI

Knee MRI is the standard-of-care imaging modality for diagnosis of knee injuries (Naraghi and White, 2016). In routine protocols, the axial and coronal planes should be perpendicular and parallel to the middle line of the femur and tibia in the sagittal views, respectively (see Fig. 5 for examples). Accordingly, our pretext tasks are to regress the orientations of the central coronal and axial slices within the sagittal images, and the relative locations of the stack of sagittal images.

For self-supervised pretraining, we use the data released as part of the fastMRI dataset by Zbontar et al. (2018). This dataset includes DICOM data from 10,000 clinical knee MRI exams, representing a wide variety of scanners and pulse sequences. Each exam typically contains several sequences. We select the sagittal T2W sequence with fat suppression for our pretext tasks, as it is the only one also included in the protocol of the dataset for the target task (described next). The official Batch 1 and Batch 2 are used for training and validation, respectively. Only the exams that include sagittal T2W images accompanied by both axial and coronal views are included. The fastMRI dataset has no label for training or evaluation of any target task of diagnosis.

On the Stanford dataset (Bien et al., 2018), we target the downstream task of detecting general abnormalities and specific diagnoses (anterior cruciate ligament (ACL) and meniscal tears) in knee MRI. The examinations were performed at the Stanford University Medical Center with GE scanners (GE Discovery, GE Healthcare, Waukesha, WI) and a routine non-contrast knee MRI protocol, including the sagittal T2W with fat saturation sequence. We combine the original training (1,130 exams) and tuning (120 exams) sets (the test set is not publicly available), and randomly divide them into our training, validation, and test sets of 800, 200, and 250 exams, respectively, while maintaining the ratios of disorders (80.6% abnormal, including 21.0% with ACL tears and 35.9% with meniscal tears). Note that the Stanford dataset does not record the spatial information and thus cannot be used for the proposed pretraining.

More details about the image acquisition protocols are presented in Table 1. Similar to the CMR data, the two knee datasets are unified via interpolation (to $0.5\times 0.5$ mm pixel spacing), cropping or zero-padding (to $256\times 256$ pixels), and z-score standardization. Again, the slice thickness is not unified.

5.2 Evaluation Metrics

5.2.1 Pretext Tasks

For the task of relative location regression, the mean squared error (MSE), mean absolute error (MAE), Pearson correlation coefficient (denoted by $r$), coefficient of determination (denoted by $R^{2}$), and explained variance (DeMaris, 2002) are used to evaluate the learning outcome. In addition, linear regression analysis between the predicted and ground truth values is performed. For the task of regressing relative orientation, we resort to visual inspection of the predicted heatmaps in comparison with the ground truth.
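For concreteness, the five regression metrics can be computed as in the sketch below, which uses their standard textbook definitions; the function name `regression_metrics` is hypothetical and the code is not taken from the authors' evaluation scripts. The toy example has a constant bias in the predictions, which shows how $R^{2}$ and explained variance can differ: the bias lowers $R^{2}$ but not the explained variance.

```python
import math

def regression_metrics(y_true, y_pred):
    """MSE, MAE, Pearson r, R^2, and explained variance from standard definitions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in residuals)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    ss_pred = sum((p - mean_p) ** 2 for p in y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    mean_e = sum(residuals) / n
    return {
        "mse": ss_res / n,
        "mae": sum(abs(e) for e in residuals) / n,
        "r": cov / math.sqrt(ss_tot * ss_pred),
        "r2": 1.0 - ss_res / ss_tot,
        # Explained variance removes the mean residual (bias) before comparing.
        "explained_var": 1.0 - sum((e - mean_e) ** 2 for e in residuals) / ss_tot,
    }

# Predictions that are perfectly correlated but shifted by a constant 0.1.
metrics = regression_metrics([0.0, 0.5, 1.0], [0.1, 0.6, 1.1])
```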

5.2.2 Target Tasks

For multi-structural segmentation in CMR, the Dice coefficients and average symmetric surface distance (ASSD) are computed volume-wise for the LV, RV, and myocardium, in accordance with the ACDC challenge (Bernard et al., 2018). In addition, mean values averaged across the three structures are computed for straightforward comparison among methods. For diagnosis of cardiac and knee MRI, the area under the curve (AUC) of the receiver operating characteristic curve is employed, which can better evaluate classification performance on imbalanced data. Specifically, for knee MRI, we treat the diagnosis of each of the three abnormalities (general abnormalities, ACL tears, and meniscal tears) as a binary classification task following Bien et al. (2018) and report the mean AUCs across the three tasks.
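As a reference for two of the metrics above, the sketch below computes the Dice coefficient for a pair of binary masks and the AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. Both follow standard definitions; the function names and the convention of returning a Dice of 1 for two empty masks are assumptions of this sketch, and ASSD is omitted because it requires surface extraction.

```python
def dice(pred, truth):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

def roc_auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

d = dice([1, 1, 0, 0], [1, 0, 0, 0])
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise form is quadratic in the number of samples, which is acceptable for test sets of the size used here; a rank-based computation would be preferable for large datasets.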

5.3 Implementation

We use the PyTorch framework (Paszke et al., 2019) and Adam (Kingma and Ba, 2014) optimizer for all experiments. A single NVIDIA TITAN Xp GPU is used for training and inference. Our code will be available.

For CMR, we employ the U-Net architecture (Ronneberger et al., 2015) as the backbone for consistency with Bai et al. (2019), a closely related work that also experimented on the ACDC dataset and to which we compare. Mini-batches of 20 and 10 images are used for the pretext and target tasks, respectively. The initial learning rate is set to 0.001. For the pretext tasks, it is halved every 100 epochs, and we train for 500 epochs in total. For the target task of multi-structural segmentation, the learning rate is halved every 50 epochs, and we train for a total of 200 epochs with an L1 regularization of $5\times 10^{-5}$. Online data augmentation, including random rotation and scaling, is performed for both the pretext and target tasks. For the target task of abnormality diagnosis, we adopt the solution of the competition champion (Khened et al., 2017) to construct a 100-tree random forest classifier using ten features: ejection fractions of the left and right ventricles, volumes of the LV at ES and ED, volumes of the RV at ES and ED, masses of the myocardium at ES and ED, and patient height and weight. The first eight features were computed with the ground truth segmentation maps to train the classifier. Then, for testing, these features were calculated from the automatic segmentation results of the first target task.
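Two quantities in the paragraph above are simple enough to sketch directly: the step-wise learning rate schedule and the ejection fraction used as a classifier feature. The sketch assumes 0-indexed epochs and returns the ejection fraction as a fraction rather than a percentage; both choices, and the function names, are assumptions made for illustration and are not confirmed by the source.

```python
def step_lr(epoch, base_lr=1e-3, step=100, gamma=0.5):
    """Learning rate at a 0-indexed epoch when it is multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

def ejection_fraction(edv, esv):
    """Ejection fraction (as a fraction) from end-diastolic and end-systolic volumes."""
    if edv <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return (edv - esv) / edv

pretext_final_lr = step_lr(499)          # last of 500 pretext epochs, halved every 100
target_final_lr = step_lr(199, step=50)  # last of 200 segmentation epochs, halved every 50
ef = ejection_fraction(120.0, 50.0)      # toy volumes in mL
```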

For diagnosis of knee MRI, we use the work of Bien et al. (2018), which published the Stanford dataset, as our baseline, and follow it in using AlexNet (Krizhevsky et al., 2012) as the backbone. Considering the availability of ImageNet pretraining (Russakovsky et al., 2015) for AlexNet, we stack the T2W images three times as the input to make use of the ImageNet pretrained parameters, and further pretrain (fine-tune) the network with the SSL pretext tasks. Following Bien et al. (2018), slices of a T2W series are used as a mini-batch. The total number of training epochs is set to 100 for both the pretext and target tasks. For the pretext tasks, the learning rate is initially set to $1\times 10^{-4}$ and decreased by a factor of 0.1 at the 50th epoch. A weight decay of $5\times 10^{-5}$ is used. For the target task, the learning rate is initially set to $1\times 10^{-5}$ and reduced by a factor of 0.3 when the validation loss fails to improve for five epochs in a row, and the weight decay is set to $1\times 10^{-2}$. Online data augmentation, including random rotation, shift, and horizontal flip, is performed for both the pretext and target tasks.
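The plateau-based learning rate rule for the target task can be sketched as below. This is a hand-written illustration, not the authors' code or a library scheduler: it assumes that "fails to improve" means the validation loss is not strictly lower than the best value seen so far, and that the counter resets after each reduction. The class name `PlateauScheduler` is hypothetical.

```python
class PlateauScheduler:
    """Multiply the learning rate by `factor` after `patience` consecutive epochs without improvement."""

    def __init__(self, lr=1e-5, factor=0.3, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Update the state with one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0  # assumed: start counting afresh after a reduction
        return self.lr

scheduler = PlateauScheduler()
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95]]
```

In this toy run the loss improves twice and then stalls for five epochs, so the rate stays at $1\times 10^{-5}$ until the last entry, where it drops to $3\times 10^{-6}$. Library schedulers may count the patience window slightly differently, so the exact epoch of reduction in the original experiments may differ from this sketch.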

Fig. 4: Visualization of the predicted heatmaps for the pretext task of relative orientation regression on the DSBCC CMR validation set, along with the ground truth. ED: end-diastolic, ES: end-systolic. Best viewed when digitally zoomed in.

5.4 Experimental Settings

The effectiveness of self-supervising pretext tasks is often evaluated by the transfer learning performance on downstream target tasks (Zhou et al., 2019b; Zhu et al., 2020), and we follow this paradigm. Specifically, we use different portions of the training data of the target tasks for training and compare the performance of (i) training from scratch and (ii) fine-tuning the parameters pretrained with the pretext tasks. For the proposed pretext tasks, we study three pretraining settings: regressing relative orientations (Eqn. (2)), regressing relative locations (Eqn. (5)), and regressing both together (Eqn. (6)). We compare our pretext tasks with several established ones in the literature, including the jigsaw puzzle (Noroozi and Favaro, 2016), pairwise slice ordering (Zhang et al., 2017), Models Genesis (Zhou et al., 2019b), anatomical position prediction (APP) (Bai et al., 2019), SimCLR (Chen et al., 2020a), BYOL (Grill et al., 2020), and masked autoencoder (MA; He et al., 2022). For a fair comparison, we adopt the same backbone networks, pretraining principles, and fine-tuning protocols for all the compared methods. Precisely, for pretraining, we follow the recipes described in the original papers and pretrain the networks until convergence. For fine-tuning, the protocols described in the previous section are applied to all methods.

In addition, given that the pretrained weights on large-scale natural image datasets are widely available for the more recent SimCLR, BYOL, and MA approaches, we also evaluate the performance of transferring the pretrained weights of these methods with large-scale natural image pretraining for reference. Specifically, we adopt the ImageNet (Russakovsky et al., 2015) pretrained ResNet-50 (He et al., 2016) model for SimCLR and BYOL, and the ImageNet pretrained ViT-Base (Dosovitskiy et al., 2021) model for MA. For CMR segmentation, a U-Net-like segmentation network is built with the pretrained ResNet-50 and ViT-Base backbones; whereas for knee MRI diagnosis, the pretrained backbones are directly used. These ImageNet pretrained models will be marked with $\dagger$ signs when presenting experimental results later. Again, the same protocols described in the previous section are employed for fine-tuning.

Fig. 5: Visualization of the predicted heatmaps for the pretext task of relative orientation regression on the knee MRI validation data (Zbontar et al., 2018), along with the ground truth. Note that symmetric slices on different sides of the central slice are mapped to the same relative locations ($l_{s_{i}}$) by Eqn. (4). ACL: anterior cruciate ligament. Best viewed when digitally zoomed in.

5.5 Performance on Proposed Pretext Tasks

5.5.1 Relative Orientation Regression

We first visually inspect the predicted heatmaps for the intersecting lines between imaging view planes for both the cardiac and knee MRI datasets, in comparison with the ground truth. Fig. 4 displays SAX CMR slices of different cardiac phases, relative locations, and health status (normal and abnormal). Fig. 5 displays sagittal knee MRI slices of different locations (on both sides of the knees) and health status. As we can see, the models predict heatmaps consistently similar to the ground truth on both datasets, irrespective of the varying conditions. The results suggest that the models have gained knowledge about the important anatomical landmarks that define these imaging planes via training with the proposed pretext task.

Table 2: Validation results of the pretext task of regressing relative slice locations with single-task learning (Loc.) and multitask learning (MTL). $\uparrow$: higher is better; $\downarrow$: lower is better. MSE: mean squared error, MAE: mean absolute error, $r$: Pearson correlation coefficient, $R^{2}$: coefficient of determination, var.: variance.

| Dataset | Task | MSE $\downarrow$ | MAE $\downarrow$ | $r$ $\uparrow$ | $R^{2}$ $\uparrow$ | Explained var. $\uparrow$ |
|---|---|---|---|---|---|---|
| DSBCC | Loc. | 0.0187 | 0.1030 | 0.9187 | 0.8266 | 0.8304 |
| DSBCC | MTL | 0.0128 | 0.0826 | 0.9415 | 0.8822 | 0.8847 |
| fastMRI | Loc. | 0.0146 | 0.0914 | 0.9254 | 0.8555 | 0.8558 |
| fastMRI | MTL | 0.0142 | 0.0896 | 0.9284 | 0.8596 | 0.8598 |

Fig. 6: Linear regression analyses between the predicted relative slice locations and ground truth. Left: Cardiac MTL, and right: knee MTL.
Fig. 7: Visualization of the multi-structural segmentation results (red: RV, blue: LV, and green: myocardium) in SAX CMR images, corresponding to the results presented in Table 4. Note that to highlight the differences between the segmentation results with different pretext tasks, more challenging slices are used for visualization when more subjects are available for fine-tuning. Rows 1 and 2: ES, and rows 3–5: ED.

5.5.2 Relative Location Regression

We then quantitatively validate the results of predicting relative slice locations in two settings: single-task and multitask learning (MTL). The results are presented in Table 2. As we can see, the task is learned well on its own. All the evaluated metrics are fairly good on both datasets: the MSEs and MAEs are at the levels of 0.01 and 0.1, respectively, the Pearson coefficients are above 0.9, and both the coefficients of determination and explained variances are above 0.8. These results suggest that the second pretext task can also be effectively learned by the network, thus embedding knowledge about the imaged structures of interest. When it is further combined with the other proposed pretext task, all the metrics improve on both datasets. The linear regression analyses are plotted in Fig. 6 and are consistent with the results in Table 2.

5.6 Transfer Learning Performance on Target Tasks

5.6.1 CMR Semantic Segmentation and Diagnosis

Table 4 presents the test performance for the segmentation of the LV, RV, and myocardium on the ACDC dataset, as well as the cross-structure mean results. From the table, we have the following observations. First, the performances achieved by fine-tuning the networks pretrained with different pretext tasks are almost always better than those by training from scratch (except for the contrastive learning based approaches with extremely low data for fine-tuning, i.e., one subject), demonstrating the effectiveness of the self-supervising pretext tasks in transfer learning. Second, the performance improvements are more obvious when fewer data are available for fine-tuning, emphasizing the prominent efficacy of self-supervised pretraining in the low-data regime. Third, our multitask learning (MTL) integrating the relative orientation and location regression tasks achieves the greatest improvements in the majority of settings for both metrics and different anatomic structures. When only four subjects are used for fine-tuning, our MTL achieves a reasonable performance of mean Dice at 0.814 and mean ASSD at 1.917 mm, which are comparable to training from scratch with 16 subjects (mean Dice at 0.815 and mean ASSD at 2.064 mm). When 32 subjects are used, the MTL achieves considerably better performance than training from scratch with twice as much data. Lastly, of the two proposed pretext tasks, relative orientation regression seems to be slightly more effective than relative location regression for the specific downstream task.

Meanwhile, we note that the performances of directly transferring and fine-tuning ImageNet pretrained SimCLR$^{\dagger}$, BYOL$^{\dagger}$, and MA$^{\dagger}$ models are generally worse than those of the same methods pretrained with in-domain medical image data, with the due caution that different backbones are used. Also, the pretraining cannot consistently outperform training from scratch. These may indicate that for transfer learning of medical images of modalities with significant domain gaps from natural images, pretraining on domain-specific data is critical and more effective than on large-scale natural image datasets.

Fig. 7 visualizes several representative examples corresponding to the results of different pretext tasks in Table 4. As we can see, our MTL produces segmentations closest to the ground truth, consistent with the quantitative evaluation.

Table 3: Classification performance (AUC) on the ACDC CMR dataset (Bernard et al., 2018), and comparison of transfer learning performance with different pretraining methods: training from scratch (None), slice ordering (Slice ord.; Zhang et al., 2017), jigsaw puzzle (Jig. puzz.; Noroozi and Favaro, 2016), Models Genesis (Mod. Gen.; Zhou et al., 2019b), anatomical position prediction (APP; Bai et al., 2019), SimCLR (Chen et al., 2020a), BYOL (Grill et al., 2020), masked autoencoder (MA) (He et al., 2022), our relative location regression (Rel. loc.), relative orientation regression (Rel. ori.), and multitask learning (MTL). $\dagger$: directly fine-tuned from ImageNet pretrained models with different backbones from other methods.

| Pretrain method | 1 (1.56%) | 4 (6.25%) | 16 (25%) | 32 (50%) | 64 (100%) |
|---|---|---|---|---|---|
| None | 0.608 | 0.825 | 0.861 | 0.909 | 0.920 |
| Slice ord. | 0.750 | 0.884 | 0.912 | 0.934 | 0.918 |
| Jig. puzz. | 0.835 | 0.881 | 0.929 | 0.904 | 0.968 |
| Mod. Gen. | 0.768 | 0.893 | 0.937 | 0.928 | 0.928 |
| APP | 0.686 | 0.815 | 0.922 | 0.921 | 0.956 |
| SimCLR | 0.567 | 0.821 | 0.921 | 0.924 | 0.925 |
| BYOL | 0.454 | 0.754 | 0.912 | 0.943 | 0.943 |
| MA | 0.598 | 0.868 | 0.931 | 0.937 | 0.935 |
| Rel. loc. | 0.734 | 0.880 | 0.925 | 0.927 | 0.962 |
| Rel. ori. | 0.795 | 0.868 | 0.903 | 0.925 | 0.934 |
| MTL | 0.839 | 0.903 | 0.959 | 0.959 | 0.972 |
| SimCLR$^{\dagger}$ | 0.484 | 0.717 | 0.841 | 0.866 | 0.873 |
| BYOL$^{\dagger}$ | 0.392 | 0.735 | 0.893 | 0.925 | 0.935 |
| MA$^{\dagger}$ | 0.489 | 0.811 | 0.818 | 0.827 | 0.840 |

Column headers give the number of subjects used for fine-tuning on the task of CMR semantic segmentation, with the percentage with respect to the entire training dataset in parentheses.

Table 4: Results of multi-structural segmentation of SAX CMR images of the ACDC dataset (Bernard et al., 2018) ($\uparrow$: higher is better; $\downarrow$: lower is better), and comparison of transfer learning performance with different pretraining methods: training from scratch (None), slice ordering (Slice ord.; Zhang et al., 2017), jigsaw puzzle (Jig. puzz.; Noroozi and Favaro, 2016), Models Genesis (Mod. Gen.; Zhou et al., 2019b), anatomical position prediction (APP; Bai et al., 2019), SimCLR (Chen et al., 2020a), BYOL (Grill et al., 2020), masked autoencoder (MA) (He et al., 2022), our relative location regression (Rel. loc.), relative orientation regression (Rel. ori.), and multitask learning (MTL). $\dagger$: directly fine-tuned from ImageNet pretrained models with different backbones from other methods. Format: mean (standard deviation).

| Amount of data | Pretrain method | Mean Dice $\uparrow$ | Mean ASSD (mm) $\downarrow$ | LV Dice $\uparrow$ | LV ASSD (mm) $\downarrow$ | RV Dice $\uparrow$ | RV ASSD (mm) $\downarrow$ | Myocardium Dice $\uparrow$ | Myocardium ASSD (mm) $\downarrow$ |
|---|---|---|---|---|---|---|---|---|---|
| 1 subject (1.56%): 20 slices | None | 0.328 (0.286) | 20.223 (28.729) | 0.456 (0.338) | 27.463 (39.943) | 0.227 (0.216) | 21.054 (21.337) | 0.299 (0.238) | 12.151 (17.525) |
| | Slice ord. | 0.437 (0.318) | 20.425 (28.849) | 0.507 (0.357) | 19.558 (29.433) | 0.341 (0.288) | 24.468 (28.677) | 0.462 (0.278) | 17.249 (27.946) |
| | Jig. puzz. | 0.410 (0.286) | 21.778 (29.537) | 0.486 (0.297) | 25.064 (35.956) | 0.370 (0.266) | 18.757 (20.099) | 0.372 (0.279) | 21.513 (30.008) |
| | Mod. Gen. | 0.470 (0.275) | 14.713 (22.634) | 0.569 (0.294) | 14.676 (24.579) | 0.374 (0.237) | 16.884 (23.920) | 0.466 (0.254) | 12.578 (18.742) |
| | APP | 0.461 (0.290) | 17.151 (31.190) | 0.547 (0.323) | 20.156 (39.117) | 0.375 (0.265) | 17.364 (23.287) | 0.461 (0.249) | 13.932 (28.748) |
| | SimCLR | 0.278 (0.288) | 33.262 (44.645) | 0.328 (0.344) | 33.503 (47.469) | 0.264 (0.249) | 31.741 (40.985) | 0.243 (0.255) | 33.262 (44.645) |
| | BYOL | 0.230 (0.252) | 33.929 (43.196) | 0.232 (0.293) | 33.140 (43.360) | 0.242 (0.216) | 28.948 (38.385) | 0.216 (0.241) | 39.699 (46.749) |
| | MA | 0.326 (0.282) | 26.841 (42.818) | 0.414 (0.336) | 18.023 (30.621) | 0.256 (0.236) | 40.991 (53.739) | 0.307 (0.239) | 21.506 (42.817) |
| | Rel. loc. | 0.433 (0.306) | 18.333 (25.977) | 0.481 (0.360) | 20.638 (30.416) | 0.380 (0.266) | 17.671 (20.710) | 0.440 (0.275) | 16.691 (25.729) |
| | Rel. ori. | 0.471 (0.297) | 14.258 (24.692) | 0.567 (0.336) | 10.486 (18.307) | 0.394 (0.253) | 13.876 (16.098) | 0.453 (0.269) | 18.412 (34.688) |
| | MTL | 0.511 (0.289) | 13.900 (29.809) | 0.619 (0.317) | 14.665 (33.594) | 0.438 (0.241) | 17.034 (35.445) | 0.475 (0.270) | 10.001 (15.974) |
| 4 subjects (6.25%): 80 slices | None | 0.702 (0.228) | 4.754 (7.705) | 0.761 (0.258) | 5.138 (8.782) | 0.631 (0.241) | 6.506 (9.409) | 0.715 (0.151) | 2.618 (2.157) |
| | Slice ord. | 0.745 (0.205) | 3.316 (4.150) | 0.826 (0.185) | 2.791 (4.512) | 0.659 (0.230) | 4.918 (4.420) | 0.750 (0.156) | 2.241 (2.788) |
| | Jig. puzz. | 0.783 (0.156) | 2.273 (3.290) | 0.856 (0.164) | 1.995 (4.794) | 0.721 (0.152) | 3.503 (2.491) | 0.771 (0.115) | 1.323 (0.891) |
| | Mod. Gen. | 0.798 (0.138) | 2.117 (2.095) | 0.875 (0.108) | 1.635 (2.287) | 0.736 (0.159) | 3.015 (1.950) | 0.782 (0.101) | 1.701 (1.710) |
| | APP | 0.766 (0.175) | 2.528 (2.768) | 0.828 (0.153) | 2.372 (2.727) | 0.711 (0.219) | 3.552 (3.416) | 0.758 (0.115) | 1.659 (1.431) |
| | SimCLR | 0.750 (0.192) | 3.348 (5.569) | 0.802 (0.243) | 4.145 (8.937) | 0.698 (0.179) | 3.946 (2.786) | 0.749 (0.115) | 1.953 (1.575) |
| | BYOL | 0.762 (0.161) | 2.592 (2.382) | 0.801 (0.184) | 3.062 (3.105) | 0.764 (0.121) | 2.873 (1.939) | 0.722 (0.163) | 1.841 (1.661) |
| | MA | 0.788 (0.167) | 2.259 (3.091) | 0.842 (0.175) | 2.429 (4.516) | 0.732 (0.197) | 2.944 (2.492) | 0.789 (0.089) | 1.404 (0.907) |
| | Rel. loc. | 0.794 (0.151) | 2.585 (2.915) | 0.882 (0.080) | 1.556 (1.350) | 0.733 (0.188) | 3.431 (3.102) | 0.766 (0.118) | 2.766 (3.499) |
| | Rel. ori. | 0.813 (0.136) | 2.226 (4.959) | 0.871 (0.154) | 2.639 (8.266) | 0.784 (0.139) | 2.562 (1.932) | 0.789 (0.083) | 1.476 (0.946) |
| | MTL | 0.814 (0.118) | 1.917 (1.668) | 0.869 (0.107) | 1.853 (1.967) | 0.777 (0.132) | 2.628 (1.721) | 0.795 (0.091) | 1.271 (0.770) |
| 16 subjects (25%): 318 slices | None | 0.815 (0.129) | 2.064 (1.943) | 0.849 (0.135) | 2.326 (2.511) | 0.785 (0.144) | 2.611 (1.813) | 0.809 (0.093) | 1.255 (0.845) |
| | Slice ord. | 0.813 (0.133) | 2.070 (2.121) | 0.837 (0.150) | 2.573 (2.966) | 0.794 (0.141) | 2.317 (1.670) | 0.806 (0.101) | 1.319 (1.016) |
| | Jig. puzz. | 0.815 (0.116) | 2.001 (1.869) | 0.845 (0.133) | 2.341 (2.581) | 0.809 (0.104) | 2.272 (1.502) | 0.790 (0.102) | 1.391 (1.001) |
| | Mod. Gen. | 0.819 (0.144) | 2.078 (2.263) | 0.856 (0.146) | 2.222 (2.935) | 0.792 (0.176) | 2.415 (2.149) | 0.810 (0.087) | 1.596 (1.325) |
| | APP | 0.815 (0.141) | 2.144 (2.676) | 0.847 (0.142) | 2.458 (3.102) | 0.785 (0.177) | 2.625 (3.030) | 0.810 (0.079) | 1.348 (1.311) |
| | SimCLR | 0.824 (0.129) | 1.992 (3.011) | 0.852 (0.168) | 2.581 (4.877) | 0.815 (0.112) | 1.969 (1.263) | 0.807 (0.093) | 1.429 (1.073) |
| | BYOL | 0.823 (0.132) | 1.838 (2.791) | 0.844 (0.165) | 2.301 (4.348) | 0.825 (0.126) | 2.039 (1.799) | 0.799 (0.089) | 1.173 (0.726) |
| | MA | 0.818 (0.158) | 2.031 (2.641) | 0.861 (0.142) | 2.004 (2.724) | 0.766 (0.205) | 2.947 (3.324) | 0.826 (0.086) | 1.143 (0.914) |
| | Rel. loc. | 0.831 (0.148) | 1.938 (3.411) | 0.873 (0.167) | 1.944 (4.873) | 0.794 (0.169) | 2.654 (2.981) | 0.826 (0.148) | 1.215 (1.187) |
| | Rel. ori. | 0.834 (0.115) | 1.599 (1.622) | 0.866 (0.112) | 1.683 (2.016) | 0.837 (0.135) | 1.782 (1.683) | 0.801 (0.084) | 1.331 (0.940) |
| | MTL | 0.840 (0.136) | 1.602 (2.699) | 0.868 (0.177) | 1.922 (4.263) | 0.825 (0.131) | 1.865 (1.661) | 0.826 (0.078) | 1.019 (0.641) |
| 32 subjects (50%): 618 slices | None | 0.851 (0.112) | 1.552 (2.779) | 0.887 (0.153) | 1.671 (4.508) | 0.824 (0.089) | 2.126 (1.343) | 0.844 (0.061) | 0.858 (0.465) |
| | Slice ord. | 0.875 (0.107) | 1.400 (2.569) | 0.880 (0.160) | 1.893 (4.142) | 0.877 (0.076) | 1.515 (1.317) | 0.867 (0.053) | 0.792 (0.531) |
| | Jig. puzz. | 0.880 (0.104) | 1.187 (2.568) | 0.905 (0.152) | 1.351 (4.241) | 0.859 (0.078) | 1.565 (1.118) | 0.875 (0.046) | 0.645 (0.296) |
| | Mod. Gen. | 0.884 (0.066) | 1.075 (1.079) | 0.912 (0.072) | 1.095 (1.468) | 0.875 (0.058) | 1.427 (0.986) | 0.866 (0.058) | 0.702 (0.324) |
| | APP | 0.871 (0.095) | 1.339 (1.616) | 0.913 (0.088) | 1.046 (1.739) | 0.832 (0.114) | 2.232 (1.846) | 0.869 (0.055) | 0.738 (0.406) |
| | SimCLR | 0.869 (0.108) | 1.430 (2.364) | 0.884 (0.156) | 1.766 (3.861) | 0.857 (0.089) | 1.695 (1.170) | 0.867 (0.052) | 0.830 (0.539) |
| | BYOL | 0.897 (0.062) | 0.873 (0.866) | 0.929 (0.047) | 0.687 (0.775) | 0.885 (0.067) | 1.259 (1.124) | 0.877 (0.056) | 0.672 (0.404) |
| | MA | 0.886 (0.071) | 1.089 (1.183) | 0.915 (0.080) | 0.997 (1.482) | 0.864 (0.076) | 1.622 (1.179) | 0.879 (0.040) | 0.647 (0.353) |
| | Rel. loc. | 0.884 (0.075) | 1.149 (1.289) | 0.913 (0.082) | 1.070 (1.640) | 0.865 (0.082) | 1.671 (1.288) | 0.874 (0.049) | 0.705 (0.405) |
| | Rel. ori. | 0.885 (0.073) | 1.087 (1.286) | 0.908 (0.085) | 1.034 (1.630) | 0.872 (0.079) | 1.546 (1.336) | 0.875 (0.043) | 0.680 (0.376) |
| | MTL | 0.900 (0.055) | 0.850 (0.815) | 0.926 (0.051) | 0.735 (0.945) | 0.891 (0.058) | 1.177 (0.908) | 0.882 (0.047) | 0.639 (0.331) |
| 64 subjects (100%): 1188 slices | None | 0.868 (0.081) | 1.302 (1.358) | 0.920 (0.049) | 0.727 (0.701) | 0.828 (0.106) | 2.317 (1.809) | 0.855 (0.041) | 0.863 (0.467) |
| | Slice ord. | 0.887 (0.067) | 1.069 (1.150) | 0.920 (0.077) | 1.003 (1.613) | 0.879 (0.058) | 1.376 (0.909) | 0.862 (0.050) | 0.828 (0.617) |
| | Jig. puzz. | 0.886 (0.067) | 1.037 (1.116) | 0.913 (0.072) | 1.021 (1.504) | 0.877 (0.071) | 1.418 (1.054) | 0.867 (0.045) | 0.671 (0.298) |
| | Mod. Gen. | 0.898 (0.061) | 0.883 (0.962) | 0.915 (0.073) | 0.866 (1.177) | 0.893 (0.067) | 1.197 (1.058) | 0.887 (0.033) | 0.586 (0.290) |
| | APP | 0.903 (0.056) | 0.799 (0.908) | 0.930 (0.067) | 0.714 (1.253) | 0.891 (0.050) | 1.127 (0.806) | 0.889 (0.036) | 0.558 (0.281) |
| | SimCLR | 0.897 (0.054) | 0.921 (0.878) | 0.928 (0.053) | 0.790 (1.047) | 0.882 (0.048) | 1.306 (0.925) | 0.882 (0.046) | 0.668 (0.389) |
| | BYOL | 0.901 (0.055) | 0.841 (0.813) | 0.925 (0.059) | 0.843 (1.044) | 0.897 (0.048) | 1.048 (0.811) | 0.882 (0.048) | 0.632 (0.379) |
| | MA | 0.902 (0.058) | 0.833 (0.798) | 0.935 (0.053) | 0.584 (0.723) | 0.884 (0.065) | 1.334 (0.964) | 0.886 (0.039) | 0.583 (0.289) |
| | Rel. loc. | 0.907 (0.054) | 0.793 (0.769) | 0.939 (0.040) | 0.570 (0.679) | 0.896 (0.060) | 1.167 (0.934) | 0.886 (0.046) | 0.641 (0.480) |
| | Rel. ori. | 0.906 (0.053) | 0.789 (0.882) | 0.934 (0.047) | 0.634 (0.821) | 0.894 (0.061) | 1.190 (1.168) | 0.890 (0.038) | 0.544 (0.228) |
| | MTL | 0.910 (0.058) | 0.719 (0.932) | 0.941 (0.038) | 0.526 (0.528) | 0.897 (0.075) | 1.119 (1.425) | 0.893 (0.042) | 0.510 (0.233) |
| | SimCLR$^{\dagger}$ | 0.884 (0.113) | 1.159 (2.867) | 0.903 (0.153) | 1.149 (4.608) | 0.877 (0.109) | 1.319 (1.708) | 0.871 (0.052) | 0.661 (0.113) |
| | BYOL$^{\dagger}$ | 0.872 (0.077) | 1.202 (1.216) | 0.883 (0.091) | 1.484 (1.447) | 0.881 (0.082) | 1.286 (1.374) | 0.851 (0.046) | 0.836 (0.486) |
| | MA$^{\dagger}$ | 0.885 (0.069) | 1.035 (1.154) | 0.912 (0.075) | 1.083 (1.529) | 0.887 (0.066) | 1.216 (1.133) | 0.855 (0.052) | 0.807 (0.538) |

Amount of data: the data used for fine-tuning on the target task; format: number of subjects (percentage with respect to the entire training dataset): number of slices.

Table 3 shows the AUCs for diagnosing previous myocardial infarction, dilated cardiomyopathy, hypertrophic cardiomyopathy, abnormal right ventricle, and normal cases on the ACDC dataset. Almost all the compared pretext tasks improve upon the train-from-scratch baseline across the different amounts of data used for fine-tuning the segmentation model (except for the contrastive learning based approaches in few-subject settings). Meanwhile, our MTL pretraining achieves the highest AUCs for all amounts of fine-tuning data. These results display similar trends to Table 4, which is expected as the classifier depends on the segmentation results, and the segmentation model fine-tuned from our MTL pretraining performs the best overall.

5.6.2 Knee MRI Diagnosis

Table 5 presents the mean AUCs for diagnosing abnormality, ACL tear, and meniscal tear based on sagittal T2W fat-suppressed knee MRI. First, it is noted that for training from scratch, increasing the number of subjects from 80 to 400 brings no substantial change in the mean AUC (0.433, 0.429, and 0.474 for 80, 200, and 400 subjects, respectively), and further increasing to 800 subjects improves the AUC only marginally to 0.562, suggesting the challenge of the downstream task. Second, although transfer learning consistently outperforms training from scratch, as expected, it is to our surprise that fine-tuning ImageNet pretrained models with other pretext tasks actually diminishes the transfer learning efficacy of the ImageNet pretraining in most settings. In contrast, our MTL substantially improves upon the ImageNet pretraining when the number of subjects is small (by $\sim$0.07 and $\sim$0.05 with 80 and 200 subjects, respectively), and maintains comparable performance to the ImageNet pretraining when the number of subjects is 400 or larger. In fact, with 80 and 200 subjects, our relative location or orientation regression task alone leads to substantial improvement upon the ImageNet pretraining. Lastly, in contrast with the downstream task of CMR segmentation, the pretext task of relative location regression consistently yields superior performance to the relative orientation regression on knee MRI diagnosis.

Table 5: Classification performance (mean AUC) on sagittal T2W fat-saturated knee MRI of the Stanford dataset (Bien et al., 2018), and comparison of transfer learning performance with different pretraining methods: training from scratch (None), ImageNet (Russakovsky et al., 2015), Models Genesis (Mod. Gen.; Zhou et al., 2019b), anatomical position prediction (APP; Bai et al., 2019), jigsaw puzzle (Jig. puzz.; Noroozi and Favaro, 2016), slice ordering (Slice ord.; Zhang et al., 2017), masked autoencoder (MA; He et al., 2022), BYOL (Grill et al., 2020), SimCLR (Chen et al., 2020a), our relative location regression (Rel. loc.), relative orientation regression (Rel. ori.), and multitask learning (MTL). †: directly fine-tuned from ImageNet pretrained models with different backbones from other methods.

Pretrain      No. subjects (a)
method        80 (10%)    200 (25%)   400 (50%)   800 (100%)
None          0.433       0.429       0.474       0.562
ImageNet      0.682       0.737       0.799       0.824
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Mod. Gen.     0.631       0.722       0.707       0.699
APP           0.654       0.713       0.719       0.723
Jig. puzz.    0.632       0.703       0.770       0.793
Slice ord.    0.625       0.740       0.773       0.792
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
MA            0.548       0.583       0.618       0.624
BYOL          0.577       0.641       0.690       0.753
SimCLR        0.698       0.761       0.774       0.800
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Rel. loc.     0.740       0.781       0.791       0.815
Rel. ori.     0.703       0.774       0.787       0.803
MTL           0.755       0.783       0.801       0.819
MA†           0.421       0.462       0.519       0.536
BYOL†         0.440       0.531       0.533       0.552
SimCLR†       0.432       0.474       0.496       0.530

(a) The number of subjects used for fine-tuning on the target task, with percentage with respect to the entire training dataset in parentheses.
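As a quick check of the gains quoted in the text, the following minimal sketch recomputes the difference between the MTL and ImageNet rows of Table 5 for each fine-tuning set size. The AUC values are copied from the table; the names `auc_gain` and `gains` are illustrative only and do not come from the authors' code.

```python
# Mean AUCs copied from Table 5 (sagittal T2W knee MRI, Stanford dataset).
SUBJECTS = (80, 200, 400, 800)
IMAGENET = (0.682, 0.737, 0.799, 0.824)
MTL = (0.755, 0.783, 0.801, 0.819)


def auc_gain(method, baseline):
    """Per-setting AUC difference (method minus baseline), rounded to 3 decimals."""
    return tuple(round(m - b, 3) for m, b in zip(method, baseline))


# Map each number of fine-tuning subjects to the MTL gain over ImageNet.
gains = dict(zip(SUBJECTS, auc_gain(MTL, IMAGENET)))

for n, g in gains.items():
    print(f"{n:>3} subjects: {g:+.3f}")
```

The differences for 80 and 200 subjects (0.073 and 0.046) match the rounded gains of about 0.07 and 0.05 reported above, while those for 400 and 800 subjects are within 0.005 of zero, consistent with "comparable performance".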

6 Discussion and conclusion

In this work, we proposed two complementary self-supervising pretext tasks for transfer learning of medical image data with anatomy-oriented imaging planes: regressing relative orientations and locations of a set of views. Both pretext tasks were based on the spatial relationship among the views and were simple and straightforward. We conducted thorough experiments with two representative types of such medical image data (cardiac and knee MRI) and two typical downstream tasks (semantic segmentation and computer-aided diagnosis) in medical image analysis. The results showed that the proposed pretext tasks were not only feasible for pretraining (i.e., with tractable learnability), but also effective in transfer learning for the typical downstream tasks. In addition, our proposed pretext tasks led to superior performance on the downstream tasks to existing self-supervising pretext tasks.

Both of our proposed pretext tasks were conceptually simple and easy to implement. Their learnability was validated in Section 5.5. On the one hand, all the evaluated metrics were quite reasonable for the task of relative location regression on both datasets, as presented in Table 2. In addition, after combining with the task of relative orientation regression, all the evaluated metrics were improved. This suggested that the other task provided positive feedback for predicting the slice locations. On the other hand, the models learned to predict stable intersecting lines for the task of relative orientation regression on both datasets, as shown in Figs. 4 and 5, across different slice locations, cardiac phases (CMR only), and pathological status. These results suggested that the models had acquired knowledge about the crucial landmarks that defined the set of imaging planes, and that this knowledge was robust against varying conditions.

We demonstrated transfer learning with the proposed pretext tasks on two typical downstream tasks in medical image analysis: semantic segmentation and classification, and on two representative datasets: cardiac and knee MRI. The experimental results showed that the proposed pretext tasks could effectively improve downstream task performance upon training from scratch, especially in the low-data regime. We conjecture that the performance gain could result from (1) robust representations learned from the pretext tasks (Tian et al., 2020) and (2) the regularizing effect of the pretraining against overfitting (Gidaris et al., 2019). When combining both pretext tasks, the multitask learning achieved further improvements upon pretraining with either of them, suggesting that the two tasks could boost each other without conflict and thus were truly complementary. Compared to the other self-supervising pretext tasks, ours yielded better performance on the downstream tasks, indicating superiority in transfer learning for medical image data with anatomy-oriented imaging planes. Particularly, when used to fine-tune the ImageNet pretrained models, our pretext tasks can further improve the transfer learning performance in the low-data regime. This is desirable when ImageNet pretrained models are available. We attribute the improvements to the smaller distances between our pretext tasks and potential downstream tasks for the targeted data group. More specifically, the proposed pretext tasks were intensively focused on learning the anatomy of the structure of interest being imaged, which was helpful to target tasks on that specific structure. In addition, understanding the images as a whole in the pretext tasks might also help downstream tasks that require whole images as input, compared with patch-based pretraining. In the future, we plan to also apply our pretext tasks to other popular tasks in medical image analysis, e.g., lesion detection.

When employed individually, relative orientation regression was more effective than relative location regression on the downstream task of CMR segmentation, whereas the opposite was observed on the downstream task of knee MRI classification. We conjecture that this might be a manifestation of the different distances between the pretext and downstream tasks, i.e., relative orientation regression was closer to semantic segmentation than relative location regression, and the opposite for classification. Meanwhile, in prescribing knee MRI, the axial and coronal planes were (roughly) perpendicular and parallel to the middle line of the femur and tibia in the sagittal views, respectively (Fig. 5). This may have simplified the pretext task of relative orientation regression for knee MRI since locating the knee cap and the generally upright femur and tibia might be easier than prescribing CMR (although we implemented random rotation as part of the online data augmentation), thus reducing its difficulty and efficacy in pretraining the networks.

We notice that the pairwise slice ordering pretext task proposed by Zhang et al. (2017) generally deteriorated the transfer performance of the ImageNet pretraining (Table 5). This was because the network had difficulty in ordering similar slices symmetric about a center. In contrast, with the center-symmetric mapping (Eqn. (4)), our relative location regression substantially improves upon the ImageNet pretraining. In this work, we employ the sine function to map similar slices in symmetrical cases to similar regression values. There is a potential to let the network learn a more general mapping from the image data, like the relation network, which learned the distance function for low-shot recognition (Sung et al., 2018).
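To illustrate the property described above, the sketch below shows one way a sine function can assign similar regression targets to slices that are symmetric about the center of a stack. It is not a reproduction of Eqn. (4): the normalization of the slice index to [0, 1], the form sin(pi * p), and the function name `center_symmetric_target` are all assumptions made for illustration.

```python
import math


def center_symmetric_target(slice_index, num_slices):
    """Hypothetical center-symmetric regression target.

    Assumption (not taken from Eqn. (4)): the slice index is normalized to
    p in [0, 1] and mapped to sin(pi * p), so slices i and (N - 1 - i),
    which are mirrored about the stack center, receive the same target.
    """
    if num_slices < 2:
        raise ValueError("num_slices must be at least 2")
    if not 0 <= slice_index < num_slices:
        raise ValueError("slice_index must lie in [0, num_slices)")
    p = slice_index / (num_slices - 1)  # normalized position within the stack
    return math.sin(math.pi * p)


# Targets for an 11-slice stack: values rise to 1 at the center, then fall.
targets = [center_symmetric_target(i, 11) for i in range(11)]
```

Under this assumed mapping, the two ends of the stack map to values near 0 and the central slice maps to 1, so a network is not asked to tell apart similar-looking slices on opposite sides of the center, which is the difficulty the text attributes to pairwise slice ordering.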

A limitation of this work was that the effectiveness of the proposed self-supervising pretext tasks was only evaluated in the scenario of transfer learning, where the networks were pretrained on a different dataset before being transferred to the target dataset for the downstream task. This was because we had no access to a suitable dataset that had both spatial information properly recorded for pretraining and annotations available for fine-tuning and evaluation. Despite the potential domain gap (Tsai et al., 2018) in our experimental setting, transfer learning with our pretext tasks turned out to be effective. In practice, it is very likely that the dataset for the target task can also be used for self-supervised pretraining with the pretext tasks, provided that the spatial information is properly recorded for valid volumetric analysis. In such practical scenarios, we expect the impact of our pretext tasks to be more significant considering the elimination of the domain gap. (Footnote 10: After this work was completed, we became aware of a concurrent work that published a dataset suitable for pretraining by our proposed pretext tasks and evaluation on downstream tasks (Martín-Isla et al., 2023). We hope to evaluate our method on this candidate dataset in the future and also encourage the research community to do so with our published code.)

Also, we note that our proposed pretext tasks were exclusive to multi-view medical image data with anatomy-oriented imaging planes. Yet, this data group comprises a significant subset of clinical imaging data, e.g., MRI of various organs and body parts and the standard mammography views. Given the abundance of such data in clinics, we believe our methodology contributed significantly to the medical image analysis community by highlighting the underexplored potential of exploiting anatomy position information in medical image pretraining.

Lastly, this work was focused on effective network pretraining by self-supervised learning for medical image data with anatomy-oriented imaging planes. Accordingly, we employed straightforward network and training configurations for fine-tuning on target tasks to emphasize the effect of pretraining. We expect more advanced deep pipelines for medical image analysis, e.g., the nnU-Net (Isensee et al., 2021), would benefit from incorporating weights pretrained by our proposed pretext tasks on applicable data.

Acknowledgments

The authors gratefully acknowledge the support of the National Natural Science Foundation of China under Key Program 62236009, General Program 61876032 (to S.G.), and the Shenzhen Science and Technology Program under JCYJ20210324140807019 (to S.G.).

References

  • Anthimopoulos et al. (2016) Anthimopoulos, M., Christodoulidis, S., Ebner, L., Christe, A., Mougiakakou, S., 2016. Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Trans. Med. Imag. 35, 1207–1216.
  • Bai et al. (2019) Bai, W., Chen, C., Tarroni, G., Duan, J., Guitton, F., Petersen, S.E., Guo, Y., Matthews, P.M., Rueckert, D., 2019. Self-supervised learning for cardiac MR image segmentation by anatomical position prediction, in: Int. Conf. Med. Image Comput. Comput. Assist. Interv., Springer. pp. 541–549.
  • Bernard et al. (2018) Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.A., Cetin, I., Lekadir, K., Camara, O., Ballester, M.A.G., et al., 2018. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Trans. Med. Imag. 37, 2514–2525.
  • Bien et al. (2018) Bien, N., Rajpurkar, P., Ball, R., Irvin, J., Park, A., Jones, E., Bereket, M., Patel, B., Yeom, K., Shpanskaya, K., Halabi, S., Zucker, E., Fanton, G., Amanatullah, D., Beaulieu, C., Riley, G., Stewart, R., Blankenberg, F., Larson, D., Jones, R., Langlotz, C., Ng, A., Lungren, M., 2018. Deep-learning-assisted diagnosis for knee magnetic resonance imaging: development and retrospective validation of MRNet. PLoS Medicine 15, e1002699.
  • Chaitanya et al. (2020) Chaitanya, K., Erdil, E., Karani, N., Konukoglu, E., 2020. Contrastive learning of global and local features for medical image segmentation with limited annotations. Adv. Neural Inform. Process. Syst. 33, 12546–12558.
  • Chen et al. (2019) Chen, L., Bentley, P., Mori, K., Misawa, K., Fujiwara, M., Rueckert, D., 2019. Self-supervised learning for medical image analysis using image context restoration. Med. Image Anal. 58, 101539.
  • Chen et al. (2020a) Chen, T., Kornblith, S., Norouzi, M., Hinton, G., 2020a. A simple framework for contrastive learning of visual representations, in: Int. Conf. Mach. Learn., PMLR. pp. 1597–1607.
  • Chen et al. (2020b) Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G., 2020b. Big self-supervised models are strong semi-supervised learners, in: Adv. Neural Inform. Process. Syst., pp. 22243–22255.
  • DeMaris (2002) DeMaris, A., 2002. Explained variance in logistic regression: A Monte Carlo study of proposed measures. Sociol. Methods Res. 31, 27–74.
  • Doersch et al. (2015) Doersch, C., Gupta, A., Efros, A.A., 2015. Unsupervised visual representation learning by context prediction, in: Int. Conf. Comput. Vis., pp. 1422–1430.
  • Doersch and Zisserman (2017) Doersch, C., Zisserman, A., 2017. Multi-task self-supervised visual learning, in: Int. Conf. Comput. Vis., pp. 2051–2060.
  • Dosovitskiy et al. (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021. An image is worth 16x16 words: Transformers for image recognition at scale, in: Int. Conf. Learn. Represent.
  • Gerche et al. (2013) Gerche, A.L., Claessen, G., Van de Bruaene, A., Pattyn, N., Van Cleemput, J., Gewillig, M., Bogaert, J., Dymarkowski, S., Claus, P., Heidbuchel, H., 2013. Cardiac MRI: a new gold standard for ventricular volume quantification during high-intensity exercise. Circ. Cardiovasc. Imaging 6, 329–338.
  • Gidaris et al. (2019) Gidaris, S., Bursuc, A., Komodakis, N., Pérez, P., Cord, M., 2019. Boosting few-shot visual learning with self-supervision, in: Int. Conf. Comput. Vis., pp. 8059–8068.
  • González et al. (2018) González, G., Ash, S.Y., Vegas-Sánchez-Ferrero, G., Onieva Onieva, J., Rahaghi, F.N., Ross, J.C., Díaz, A., San José Estépar, R., Washko, G.R., 2018. Disease staging and prognosis in smokers using deep learning in chest computed tomography. Am. J. Resp. Crit. Care Med. 197, 193–203.
  • Grill et al. (2020) Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al., 2020. Bootstrap your own latent: A new approach to self-supervised learning. Adv. Neural Inform. Process. Syst. 33, 21271–21284.
  • He et al. (2022) He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R., 2022. Masked autoencoders are scalable vision learners, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 16000–16009.
  • He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 770–778.
  • Hendrycks et al. (2019) Hendrycks, D., Mazeika, M., Kadavath, S., Song, D., 2019. Using self-supervised learning can improve model robustness and uncertainty, in: Adv. Neural Inform. Process. Syst., pp. 15663–15674.
  • Huh et al. (2016) Huh, M., Agrawal, P., Efros, A.A., 2016. What makes ImageNet good for transfer learning? arXiv preprint arXiv:1608.08614.
  • Isensee et al. (2021) Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H., 2021. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods 18, 203–211.
  • Jackson et al. (2018) Jackson, P., Hardcastle, N., Dawe, N., Kron, T., Hofman, M.S., Hicks, R.J., 2018. Deep learning renal segmentation for fully automated radiation dose estimation in unsealed source therapy. Frontiers Oncol. 8, 215.
  • Jamaludin et al. (2017) Jamaludin, A., Kadir, T., Zisserman, A., 2017. Self-supervised learning for spinal MRIs, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, pp. 294–302.
  • Jing and Tian (2020) Jing, L., Tian, Y., 2020. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 43, 4037–4058.
  • Khened et al. (2017) Khened, M., Alex, V., Krishnamurthi, G., 2017. Densely connected fully convolutional network for short-axis cardiac cine MR image segmentation and heart diagnosis using random forest, in: Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges: Int. Workshop STACOM, Springer. pp. 140–151.
  • Kingma and Ba (2014) Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Klifa et al. (2010) Klifa, C., Carballido-Gamio, J., Wilmes, L., Laprie, A., Shepherd, J., Gibbs, J., Fan, B., Noworolski, S., Hylton, N., 2010. Magnetic resonance imaging for secondary assessment of breast density in a high-risk cohort. Magn. Reson. Imaging 28, 8–15.
  • Kramer et al. (2020) Kramer, C.M., Barkhausen, J., Bucciarelli-Ducci, C., Flamm, S.D., Kim, R.J., Nagel, E., 2020. Standardized cardiovascular magnetic resonance imaging (CMR) protocols: 2020 update. J. Cardiovasc. Magn. Reson. 22, 1–18.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. Adv. Neural Inform. Process. Syst. 25, 1097–1105.
  • Law and Deng (2018) Law, H., Deng, J., 2018. CornerNet: Detecting objects as paired keypoints, in: Eur. Conf. Comput. Vis., pp. 734–750.
  • Li et al. (2020) Li, Y., Chen, J., Zheng, Y., 2020. A multi-task self-supervised learning framework for scopy images, in: IEEE Int. Symp. Biomed. Imaging, IEEE. pp. 2005–2009.
  • Litjens et al. (2017) Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sánchez, C.I., 2017. A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88.
  • Long et al. (2015) Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 3431–3440.
  • Martín-Isla et al. (2023) Martín-Isla, C., Campello, V.M., Izquierdo, C., Kushibar, K., Sendra-Balcells, C., Gkontra, P., Sojoudi, A., Fulton, M.J., Arega, T.W., Punithakumar, K., et al., 2023. Deep learning segmentation of the right ventricle in cardiac MRI: The M&Ms challenge. IEEE J. Biomed. Health Inform. 27, 3302–3313.
  • Naraghi and White (2016) Naraghi, A.M., White, L.M., 2016. Imaging of athletic injuries of knee ligaments and menisci: sports imaging series. Radiology 281, 23–40.
  • Newell et al. (2017) Newell, A., Huang, Z., Deng, J., 2017. Associative embedding: End-to-end learning for joint detection and grouping, in: Adv. Neural Inform. Process. Syst., pp. 2274–2284.
  • Newell et al. (2016) Newell, A., Yang, K., Deng, J., 2016. Stacked hourglass networks for human pose estimation, in: Eur. Conf. Comput. Vis., Springer. pp. 483–499.
  • Noroozi and Favaro (2016) Noroozi, M., Favaro, P., 2016. Unsupervised learning of visual representations by solving jigsaw puzzles, in: Eur. Conf. Comput. Vis., Springer. pp. 69–84.
  • Oquab et al. (2014) Oquab, M., Bottou, L., Laptev, I., Sivic, J., 2014. Learning and transferring mid-level image representations using convolutional neural networks, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 1717–1724.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: An imperative style, high-performance deep learning library, in: Adv. Neural Inform. Process. Syst., pp. 8024–8035.
  • Pfister et al. (2015) Pfister, T., Charles, J., Zisserman, A., 2015. Flowing ConvNets for human pose estimation in videos, in: Int. Conf. Comput. Vis., pp. 1913–1921.
  • Ronneberger et al. (2015) Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation, in: Int. Conf. Med. Image Comput. Comput. Assist. Interv., Springer. pp. 234–241.
  • Ruder (2017) Ruder, S., 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., 2015. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252.
  • Shin et al. (2016) Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R.M., 2016. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imag. 35, 1285–1298.
  • Silveira et al. (2009) Silveira, M., Nascimento, J.C., Marques, J.S., Marçal, A.R., Mendonça, T., Yamauchi, S., Maeda, J., Rozeira, J., 2009. Comparison of segmentation methods for melanoma diagnosis in dermoscopy images. IEEE J. Sel. Topics Signal Process. 3, 35–45.
  • Spitzer et al. (2018) Spitzer, H., Kiwitz, K., Amunts, K., Harmeling, S., Dickscheid, T., 2018. Improving cytoarchitectonic segmentation of human brain areas with self-supervised Siamese networks, in: Int. Conf. Med. Image Comput. Comput. Assist. Interv., Springer. pp. 663–671.
  • Stegmann et al. (2005) Stegmann, M.B., Skoglund, K., Ryberg, C., 2005. Mid-sagittal plane and mid-sagittal surface optimization in brain MRI using a local symmetry measure, in: Medical Imaging 2005: Image Processing, International Society for Optics and Photonics. pp. 568–579.
  • Sung et al. (2018) Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M., 2018. Learning to compare: Relation network for few-shot learning, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 1199–1208.
  • Tan et al. (2018) Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., Liu, C., 2018. A survey on deep transfer learning, in: Int. Conf. Artif. Neural Netw., Springer. pp. 270–279.
  • Tao et al. (2020) Tao, X., Li, Y., Zhou, W., Ma, K., Zheng, Y., 2020. Revisiting Rubik’s cube: Self-supervised learning with volume-wise transformation for 3D medical image segmentation, in: Int. Conf. Med. Image Comput. Comput. Assist. Interv., Springer. pp. 238–248.
  • Tian et al. (2020) Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J.B., Isola, P., 2020. Rethinking few-shot image classification: a good embedding is all you need?, in: Eur. Conf. Comput. Vis., Springer. pp. 266–282.
  • Tsai et al. (2018) Tsai, Y.H., Hung, W.C., Schulter, S., Sohn, K., Yang, M.H., Chandraker, M., 2018. Learning to adapt structured output space for semantic segmentation, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 7472–7481.
  • Wei et al. (2013) Wei, D., Sun, Y., Ong, S.H., Chai, P., Teo, L.L., Low, A.F., 2013. A comprehensive 3-D framework for automatic quantification of late gadolinium enhanced cardiac magnetic resonance images. IEEE Trans. Biomed. Eng. 60, 1499–1508.
  • Yosinski et al. (2014) Yosinski, J., Clune, J., Bengio, Y., Lipson, H., 2014. How transferable are features in deep neural networks?, in: Adv. Neural Inform. Process. Syst., pp. 3320–3328.
  • Zbontar et al. (2018) Zbontar, J., Knoll, F., Sriram, A., Murrell, T., Huang, Z., Muckley, M.J., Defazio, A., Stern, R., Johnson, P., Bruno, M., Parente, M., Geras, K.J., Katsnelson, J., Chandarana, H., Zhang, Z., Drozdzal, M., Romero, A., Rabbat, M., Vincent, P., Yakubova, N., Pinkerton, J., Wang, D., Owens, E., Zitnick, C.L., Recht, M.P., Sodickson, D.K., Lui, Y.W., 2018. fastMRI: An open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839 .
  • Zhang et al. (2017) Zhang, P., Wang, F., Zheng, Y., 2017. Self supervised deep representation learning for fine-grained body part recognition, in: IEEE Int. Symp. Biomed. Imaging, pp. 578–582.
  • Zhang et al. (2016) Zhang, R., Isola, P., Efros, A.A., 2016. Colorful image colorization, in: Eur. Conf. Comput. Vis., Springer. pp. 649–666.
  • Zhou et al. (2018) Zhou, X., Karpur, A., Luo, L., Huang, Q., 2018. Starmap for category-agnostic keypoint and viewpoint estimation, in: Eur. Conf. Comput. Vis., pp. 318–334.
  • Zhou et al. (2019a) Zhou, X., Zhuo, J., Krahenbuhl, P., 2019a. Bottom-up object detection by grouping extreme and center points, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 850–859.
  • Zhou et al. (2019b) Zhou, Z., Sodha, V., Siddiquee, M.M.R., Feng, R., Tajbakhsh, N., Gotway, M.B., Liang, J., 2019b. Models Genesis: Generic autodidactic models for 3D medical image analysis, in: Int. Conf. Med. Image Comput. Comput. Assist. Interv., Springer. pp. 384–393.
  • Zhu et al. (2020) Zhu, J., Li, Y., Hu, Y., Ma, K., Zhou, S.K., Zheng, Y., 2020. Rubik’s cube+: A self-supervised feature learning framework for 3D medical image analysis. Med. Image Anal. , 101746.