Demystifying the Effect of Receptive Field Size in U-Net Models for Medical Image Segmentation
Abstract
Purpose: Medical image segmentation is a critical task in healthcare applications, and U-Nets have demonstrated promising results in this domain. This work delves into the understudied aspect of receptive field (RF) size and its impact on the U-Net and Attention U-Net architectures used for medical imaging segmentation.
Approach: This work explores several critical elements, including the relationship between RF size, characteristics of the region of interest, and model performance, as well as the balance between RF size and computational cost for U-Net and Attention U-Net methods on different datasets. This work also proposes a mathematical notation for representing the theoretical receptive field (TRF) of a given layer in a network and proposes two new metrics: the effective receptive field (ERF) rate, which quantifies the fraction of significantly contributing pixels within the ERF against the TRF area, and the Object rate, which assesses the relative size of the segmentation object compared to the TRF size.
Results: The results demonstrate that there exists an optimal TRF size that successfully strikes a balance between capturing a wider global context and maintaining computational efficiency, thereby optimizing model performance. Interestingly, a distinct correlation is observed between the data complexity and the required TRF size; segmentation based solely on contrast achieved peak performance even with smaller TRF sizes, whereas more complex segmentation tasks necessitated larger TRFs. Attention U-Net models consistently outperformed their U-Net counterparts, highlighting the value of attention mechanisms regardless of TRF size.
Conclusions: These novel insights present an invaluable resource for developing more efficient U-Net-based architectures for medical imaging and pave the way for future exploration of other segmentation architectures. A tool is also developed that calculates the TRF for a U-Net (and Attention U-Net) model and suggests an appropriate TRF size for a given model and dataset.
Keywords: Effective Receptive Field, Receptive Field, Segmentation, Theoretical Receptive Field, U-Net
*Navchetan Awasthi, [email protected]
1 Introduction
Medical imaging, a cornerstone of modern healthcare, provides non-invasive means for diagnosing and monitoring a wide range of diseases. However, the interpretation of medical images often requires expert knowledge and can be time-consuming, leading to a growing interest in automated analysis methods [1].
Semantic segmentation, a key task in computer vision, plays a crucial role in this context. It involves the categorization of pixels in an image into predefined classes, enabling the delineation of anatomical structures and pathological regions in medical images [2]. The U-Net architecture, a convolutional neural network (CNN) designed specifically for biomedical image segmentation, has emerged as a popular choice for semantic segmentation tasks in medical imaging [3]. As illustrated in Figure 1, it employs an encoder-decoder structure. The encoder progressively reduces the spatial dimensionality while increasing the feature representation, capturing the global context of the image. The decoder, on the other hand, gradually recovers the spatial information, enabling precise localization [4]. The U-Net is renowned for its accuracy in semantic segmentation tasks [3]. An extended version, Attention U-Net, integrates an attention mechanism to enhance feature capturing to improve overall performance [5].
Within these networks, the concept of the receptive field (RF) is crucial. It refers to the region in the input space that affects a feature in a CNN [6]. There are two kinds of receptive fields: the theoretical receptive field (TRF) and the effective receptive field (ERF). The TRF is defined as the maximum region of the input image that influences a specific pixel of the output, considering only the receptive field from the preceding layers that are relevant to the current layer [7]. This is in contrast to the ERF, which is the actual region of the input image that contributes to the activation of a particular neuron in the network, taking into account the impact of operations such as pooling [6]. An example of the TRF and ERF is illustrated in Figure 2.
Previous studies have started investigating the role of RF size on U-Net performance for image segmentation tasks, but not all aspects have been explored. In one such study, [8] focused on ultrasound image segmentation, demonstrating that the RF size has a more critical role than the network’s depth or the number of parameters. They suggested that a computationally efficient shallow network could replace a deep one without performance loss by manipulating the RF size. However, their study was limited to the U-Net architecture and a single dataset, comparing only a deep and a shallow network, leaving room for a more exhaustive investigation.
In another study, [9] delved into the influence of RF size and network complexity on a CNN’s performance for transmission electron microscopy (TEM) image analysis. They found that the RF size’s influence varied with TEM image resolution and contrast characteristics. For low-resolution TEM images, where contrast is crucial, RF size had a minimal influence. But for high-resolution TEM images, where identification is less dependent on contrast changes, the RF size was vital, especially for low contrast images. However, they only considered TEM images and regular U-Nets, leaving the applicability of these findings to other medical imaging tasks unexplored.
Our study builds upon these insights by examining the influence of RF size on U-Net and Attention U-Net architectures across multiple medical image segmentation datasets with certain characteristics in the region of interest (RoI). We compare the ten U-Net configurations shown in Table 1, which have varied RF sizes and approximately equal total parameters, thus isolating RF size’s impact on performance while taking into consideration specific dataset characteristics. Moreover, we repeat the experiments on eight different synthetic datasets and six real-world medical datasets. We also extend our investigation to the Attention U-Net architecture, thereby expanding the study beyond the regular U-Net architecture. Our aim is to offer critical insights for the design of U-Net-based architectures, considering RF size as a key parameter.
Specifically, this paper makes the following contributions to the field of medical image segmentation with U-Nets:
1. We provide a comprehensive analysis of the role of the RF size in the performance of U-Net and Attention U-Net architectures, demonstrating its significance in capturing the necessary context for accurate segmentation.
2. We propose a mathematical notation to represent the TRF at a given layer within a network, utilizing a 4-dimensional tensor, and provide the mathematical framework for calculating the TRF for the different deep learning layers.
3. We propose two new metrics, the ERF rate and the Object rate, which respectively quantify the fraction of significantly contributing pixels within the ERF against the TRF area and assess the relative size of the segmentation object compared to the TRF size.
4. We explore the trade-off between RF size and computational cost for a variety of medical imaging datasets and synthetic datasets.
5. We compare the performance of U-Net and Attention U-Net architectures for the same RF size, highlighting the effectiveness of the attention mechanism in improving the model’s overall performance.
6. We present a nuanced analysis of the performance trends across datasets with different characteristics in the RoI, particularly its size and contrast to the surrounding area.
7. We provide a tool that calculates the TRF for a U-Net (and Attention U-Net) model and suggests an appropriate TRF size for a given model and dataset.
2 Methodology
This study explores the role of the receptive field in the performance of U-Net and Attention U-Net models in semantic segmentation tasks. Through a series of experiments with varying TRF sizes, we evaluated these models on a diverse range of datasets. In this section, we provide comprehensive descriptions of the model architectures, as well as of the TRF and ERF computation.
2.1 U-Net Design and TRF Tuning
The configuration of the hyper-parameters of a U-Net model significantly impacts the size of its TRF. As established by [10], the TRF size is primarily determined by the number of pooling layers and the convolutional kernel sizes. To elaborate, Figure 1 illustrates a variable Attention U-Net diagram, where the TRF size can be adjusted in two different ways. Firstly, when the vertical depth of the network is increased, one encoder block and one decoder block are added before and after the bottleneck, respectively. This increases the number of pooling layers and therefore the TRF size. Changing the network depth on its own does not significantly affect the model’s performance [8]. Secondly, the TRF size can be varied by changing the kernel sizes of the convolutional layers within the network. The mathematical details of the effects of various layers on the TRF size are provided in Section 2.2.
It should be noted that adjusting these hyper-parameters also impacts the total number of parameters in the model. To ensure a fair comparison between the performance of various configurations, the total number of parameters must remain approximately equal. According to [11], this can be achieved by modifying the number of output channels in each convolution layer within the network blocks. Table 1 provides an overview of all configurations utilized in this study. The parameter counts in Table 1 are based on the standard U-Net architecture. The Attention U-Net introduces additional parameters due to the inclusion of an attention block at each layer. However, this increase is on the order of 100,000 parameters, which is negligible relative to the roughly 31 million parameters of each configuration.
TRF size | Depth, kernel size | Out channels per layer | # Parameters |
---|---|---|---|
54 | 3, 2 | [230, 456, 765, 1245] | 31,013,720 |
100 | 3, 3 | [145, 256, 512, 1024] | 31,012,268 |
146 | 3, 4 | [133, 244, 355, 791] | 31,032,960 |
204 | 4, 3 | [64, 128, 256, 512, 1024] | 31,042,369 |
230 | 3, 6 | [63, 170, 256, 512] | 31,031,345 |
298 | 4, 4 | [25, 44, 110, 451, 756] | 31,043,816 |
360 | 3, 8 | [47, 83, 180, 360] | 31,062,482 |
412 | 5, 3 | [63, 64, 115, 255, 512, 1024] | 31,043,945 |
486 | 4, 6 | [28, 58, 146, 270, 510] | 31,027,119 |
570 | 4, 7 | [24, 55, 101, 223, 481] | 31,041,124 |
2.2 Computing the TRF
Formally, the TRF refers to the maximum region of the input image that potentially influences a specific pixel in the output layer. To represent the TRF at layer $l$ in a U-Net architecture of depth $L$, we introduce a 4-dimensional tensor $T^{l}$. Here, the first two dimensions correspond to the $x$ and $y$ axes of the input image, respectively, while the third and fourth dimensions represent the top-left (t-l) and bottom-right (b-r) coordinates of the TRF at layer $l$. For a given pixel located at position $(i, j)$ in the output layer $L$, the TRF can be expressed as a matrix in which the first row corresponds to the top-left corner of the TRF, and the second row corresponds to the bottom-right corner of the TRF:
$$T^{L}_{i,j} = \begin{bmatrix} x^{\text{t-l}}_{i,j} & y^{\text{t-l}}_{i,j} \\ x^{\text{b-r}}_{i,j} & y^{\text{b-r}}_{i,j} \end{bmatrix} \tag{1}$$
Empirically, it has been observed that all pixels have an equal TRF size in the output layer, except those located around the border because of the padded zeroes. Based on this observation, a single (maximum) TRF value can be assigned to the entire U-Net. In the remainder of this paper, we define the TRF size of a U-Net as the size of the TRF of the center pixel in the output layer:
(2) |
To compute the values of the TRF matrix in Equation 1, we traverse the network from the first to the final layer, tracking the TRF of each pixel at every layer based on the previous layer’s pixels until reaching the output layer [7]. Therefore, the TRF of the pixel at position $(i, j)$ in a layer at depth $l$ can be expressed as:
(3) |
In the input layer ($l = 0$), the TRF of each pixel corresponds to the pixel itself:
$$T^{0}_{i,j} = \begin{bmatrix} i & j \\ i & j \end{bmatrix} \tag{4}$$
The computation of the TRF in subsequent layers depends on the U-Net’s configuration. The following subsections cover each type of layer that a U-Net may include.
2.2.1 Convolution
In a 2D convolution layer, a filter or kernel is applied to a 2D image, performing a dot product at each position [12]. The kernel size ($k$) impacts the detail level captured, while the stride ($s$) affects the kernel shift amount. Padding ($p$), set to ‘same’ in this study, ensures the output feature map matches the input image dimensions, permitting edge-based convolution operations.
If the padding is set to ‘same’, the number of arrays that must be padded on every side of the tensor to simulate a convolution while maintaining the previous layer’s dimensions can be calculated. For a layer $l$, the padding values along the $x$ and $y$ axes are computed as follows:
(5) | ||||
(6) |
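Therefore, along the first and second axes of the 4-dimensional tensor $T^{l-1}$, the tensor is padded with $p_x$ and $p_y$ 2-dimensional tensors that contain the same values as the edges along the first and second axes of $T^{l-1}$. Let $\hat{T}^{l-1}$ denote this padded tensor. For each position $(i, j)$, the top-left and bottom-right pixels from the previous layer’s TRF can be fetched from $\hat{T}^{l-1}$ at the indices $(s i, s j)$ and $(s i + k - 1, s j + k - 1)$, respectively. Thus, with the third index of $\hat{T}^{l-1}$ selecting the top-left (1) or bottom-right (2) corner, the TRF at position $(i, j)$ for a convolutional layer at depth $l$ can be denoted as:
$$T^{l}_{i,j} = \begin{bmatrix} \hat{T}^{l-1}_{s i,\; s j,\; 1} \\ \hat{T}^{l-1}_{s i + k - 1,\; s j + k - 1,\; 2} \end{bmatrix} \tag{7}$$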
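The convolution rule above can be sketched in NumPy. This is a minimal illustration rather than the tool described in this paper: it assumes stride 1, stores each corner as (row, column), and, for even kernel sizes, places the extra padding row and column at the end. The function names `identity_trf` and `conv_trf` are hypothetical.

```python
import numpy as np

def identity_trf(height, width):
    """Input layer: each pixel's TRF is the pixel itself (t-l == b-r)."""
    corner = np.stack(np.indices((height, width)), axis=-1)  # (H, W, 2) as (row, col)
    return np.stack([corner, corner], axis=2)                # (H, W, 2, 2)

def conv_trf(prev, k):
    """TRF after a k x k, stride-1, 'same'-padded convolution."""
    h, w = prev.shape[:2]
    before = (k - 1) // 2      # assumption: any extra padding row/column
    after = k - 1 - before     # is placed at the end of each axis
    padded = np.pad(prev, ((before, after), (before, after), (0, 0), (0, 0)),
                    mode="edge")                             # replicate edge TRFs
    out = np.empty_like(prev)
    out[:, :, 0] = padded[:h, :w, 0]                             # top-left corners
    out[:, :, 1] = padded[k - 1:k - 1 + h, k - 1:k - 1 + w, 1]   # bottom-right corners
    return out

# Two stacked 3 x 3 convolutions on a 9 x 9 input.
trf = conv_trf(conv_trf(identity_trf(9, 9), 3), 3)
center = trf[4, 4]                    # corners of the center pixel's TRF
trf_size = center[1] - center[0] + 1  # extent along each axis
```

For the center pixel, the two stacked convolutions yield a 5 × 5 TRF, while pixels near the border receive a smaller TRF because the padded entries repeat the edge coordinates.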
2.2.2 Max pooling
2D max pooling is a feature map reduction method where a rectangular kernel selects maximum values within regions, creating a smaller feature map [12]. The kernel size ($k$) defines the sliding window size over the input, and the stride ($s$), which in our study equals $k$ to simplify the computation, controls the window’s movement.
For a given position $(i, j)$, the topmost and leftmost pixels from the previous layer’s TRF can be accessed from the tensor $T^{l-1}$ at the index $(k i, k j)$, while the bottom-most and rightmost pixels can be accessed at the index $(k i + k - 1, k j + k - 1)$. As such, with the third index of $T^{l-1}$ selecting the top-left (1) or bottom-right (2) corner, the TRF at position $(i, j)$ for a max pooling layer at depth $l$ can be expressed as follows:
$$T^{l}_{i,j} = \begin{bmatrix} T^{l-1}_{k i,\; k j,\; 1} \\ T^{l-1}_{k i + k - 1,\; k j + k - 1,\; 2} \end{bmatrix} \tag{8}$$
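A short NumPy sketch of the max pooling rule follows, under the stated condition that the stride equals the kernel size. The tensor layout (height, width, corner, coordinate) and the name `maxpool_trf` are assumptions made for illustration.

```python
import numpy as np

def maxpool_trf(prev, k):
    """TRF after k x k max pooling with stride k (non-overlapping windows)."""
    h, w = prev.shape[0] // k, prev.shape[1] // k
    out = np.empty((h, w, 2, 2), dtype=prev.dtype)
    # Top-left corner comes from the first pixel of each pooling window.
    out[:, :, 0] = prev[0:h * k:k, 0:w * k:k, 0]
    # Bottom-right corner comes from the last pixel of each pooling window.
    out[:, :, 1] = prev[k - 1:h * k:k, k - 1:w * k:k, 1]
    return out

# Identity TRF of a 4 x 4 input, then one 2 x 2 pooling step.
corner = np.stack(np.indices((4, 4)), axis=-1)
pooled = maxpool_trf(np.stack([corner, corner], axis=2), k=2)
```

Each pooled pixel then covers exactly one 2 × 2 window of the input, for example rows and columns 2 to 3 for `pooled[1, 1]`.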
2.2.3 Upsampling
Upsampling is a technique used to increase the spatial resolution of feature maps. In particular, it is implemented through transposed convolution or deconvolution, which is the reverse operation of convolution. During the transposed convolution operation, a kernel of size $k$ is applied to the input feature map to generate an output feature map with a higher spatial resolution. The stride $s$ determines the amount of shift in the output feature map for each kernel application [12]. When the stride is set to $k$, the size of the output feature map is equal to the size of the input feature map multiplied by the stride.
However, when the stride $s$ is different from the kernel size $k$, there may be overlaps in the values of the output feature map. Therefore, an iterative method is required to identify the corners of the TRF for each pixel in the output feature map. Specifically, Algorithm 1 is applied to each pixel in the input map, computing the range in which the pixel is copied to the output feature map by multiplying the top and left indices with the stride and the bottom and right indices with the stride and then adding the kernel size. The algorithm then iterates over the pixels in the output feature map within this range. If there is no overlap, the indices from the previous layer at $(i, j)$ are simply copied. Otherwise, for the top and left of the TRF, the algorithm takes the minimum of the current index and a potentially overlapping index, while for the bottom and right of the TRF, it takes the maximum.
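The iterative procedure described above can be sketched as follows. This is an illustrative reading of the text, not a reproduction of Algorithm 1: it assumes an unpadded transposed convolution, so the output has $(n - 1)s + k$ pixels per axis, and the name `upsample_trf` is hypothetical.

```python
import numpy as np

def upsample_trf(prev, k, s):
    """TRF after a transposed convolution with kernel size k and stride s."""
    h, w = prev.shape[:2]
    out_h, out_w = (h - 1) * s + k, (w - 1) * s + k   # assumption: no padding
    out = np.empty((out_h, out_w, 2, 2), dtype=prev.dtype)
    filled = np.zeros((out_h, out_w), dtype=bool)
    for i in range(h):
        for j in range(w):
            # Input pixel (i, j) is spread over a k x k block of the output.
            for u in range(i * s, i * s + k):
                for v in range(j * s, j * s + k):
                    if not filled[u, v]:
                        out[u, v] = prev[i, j]        # no overlap yet: copy
                        filled[u, v] = True
                    else:                             # overlap: widen the TRF
                        out[u, v, 0] = np.minimum(out[u, v, 0], prev[i, j, 0])
                        out[u, v, 1] = np.maximum(out[u, v, 1], prev[i, j, 1])
    return out

corner = np.stack(np.indices((2, 2)), axis=-1)
identity = np.stack([corner, corner], axis=2)
no_overlap = upsample_trf(identity, k=2, s=2)    # 4 x 4 output
with_overlap = upsample_trf(identity, k=3, s=2)  # 5 x 5 output
```

With $k = s$ every output pixel inherits the TRF of a single input pixel, whereas with $k = 3$ and $s = 2$ the central output pixel is reached by all four input pixels and its TRF spans their union.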
2.2.4 Concatenations
Within the U-Net architecture, skip connections from the corresponding encoder layer are integrated into the decoder blocks by concatenating them with the output of the upsampling layer [13]. To achieve this, the TRFs of the two tensors being concatenated, denoted as $T^{a}$ and $T^{b}$, must first be determined. The TRF of each pixel after concatenation, denoted as $T^{l}_{i,j}$, is obtained by selecting the lowest indices for the left and top of both TRFs, and the highest indices for the right and bottom of both TRFs. This approach ensures that the largest possible TRF is obtained.
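As a brief sketch, the merge rule for concatenation reduces to an element-wise minimum over the top-left corners and an element-wise maximum over the bottom-right corners. The tensor layout (height, width, corner, coordinate) and the name `merge_trf` are assumptions for illustration.

```python
import numpy as np

def merge_trf(a, b):
    """Union of two TRF tensors of equal shape (H, W, 2, 2)."""
    out = np.empty_like(a)
    out[:, :, 0] = np.minimum(a[:, :, 0], b[:, :, 0])  # lowest top/left indices
    out[:, :, 1] = np.maximum(a[:, :, 1], b[:, :, 1])  # highest bottom/right indices
    return out

skip = np.array([[[[2, 3], [4, 5]]]])       # one pixel: t-l (2, 3), b-r (4, 5)
upsampled = np.array([[[[1, 4], [6, 4]]]])  # one pixel: t-l (1, 4), b-r (6, 4)
merged = merge_trf(skip, upsampled)
```

The same element-wise union would also cover any other layer whose output pixel depends on two aligned inputs.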
2.2.5 Activation functions
While nonlinear activation functions like ReLU and sigmoid do affect the ERF by potentially reducing its size when certain parameters are set to zero [14], they have no effect on the TRF, as these functions act element-wise on the previous layer. Therefore, in a layer with an activation function, it can be concluded that $T^{l} = T^{l-1}$.
2.2.6 Attention gates
Attention gates, a key component of the Attention U-Net architecture (illustrated in Figure 1), receive input features from one layer and a gating signal from another layer [5]. The inputs are then subjected to $1 \times 1$ convolutions, followed by element-wise addition. At this point, the TRF is equivalent to the maximum range of the TRF of either input, as the TRF is not modified by a $1 \times 1$ convolution. Next, a ReLU and a sigmoid function are applied, which leave the TRF unchanged, as described in the previous section. Finally, element-wise multiplication is performed on the output, which again results in the TRF being equivalent to the maximum range of the TRF of either input.
Therefore, similar to concatenations, the TRF of an attention gate is the maximal range of the TRFs of its input features and its gating signal. As a result, the TRF sizes of a U-Net and an Attention U-Net with the same depth and convolution kernel sizes are equivalent.
2.3 Computing the ERF
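For each pixel $x_{i,j}$ in the input image $X$, its impact on the center pixel $y_c$ of the output image $Y$ is measured by computing the partial derivative of the center output pixel with respect to each input pixel, $\partial y_c / \partial x_{i,j}$. This method quantifies how much $y_c$ changes if $x_{i,j}$ is changed by a small amount [6]. For a TRF of $n \times m$ pixels, the corresponding ERF can be expressed as a matrix as shown in equation 9.
$$\mathrm{ERF} = \begin{bmatrix} \dfrac{\partial y_c}{\partial x_{1,1}} & \cdots & \dfrac{\partial y_c}{\partial x_{1,m}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_c}{\partial x_{n,1}} & \cdots & \dfrac{\partial y_c}{\partial x_{n,m}} \end{bmatrix} \tag{9}$$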
The actual computation of the ERF can be done easily with most deep learning frameworks by back-propagating the value of one certain output pixel to the entire input, and taking the slice of the input at the position of the TRF.
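A minimal PyTorch sketch of this back-propagation step is given below. The function name `center_pixel_erf`, the inclusive `trf_box` argument, and the single-convolution toy model standing in for a trained U-Net are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

def center_pixel_erf(model, image, trf_box=None):
    """Gradient of the center output pixel with respect to every input pixel.

    image: tensor of shape (1, C, H, W).
    trf_box: optional (top, left, bottom, right), inclusive, to crop to the TRF.
    """
    model.eval()
    image = image.clone().detach().requires_grad_(True)
    output = model(image)                     # (1, C_out, H, W)
    h, w = output.shape[-2:]
    output[0, 0, h // 2, w // 2].backward()   # back-propagate one output pixel
    erf = image.grad[0]                       # (C, H, W), signed gradients
    if trf_box is not None:
        top, left, bottom, right = trf_box
        erf = erf[:, top:bottom + 1, left:right + 1]
    return erf

# Toy stand-in for a network: a single 3 x 3 convolution with all-ones weights.
toy = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    toy.weight.fill_(1.0)
erf = center_pixel_erf(toy, torch.rand(1, 1, 9, 9))
```

For the toy model the gradient is non-zero only in the 3 × 3 neighbourhood of the center pixel, which coincides with its TRF. The gradients are returned with their sign, so the absolute value can be taken afterwards where a metric requires it.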
3 Experiment
3.1 Training protocol
All the models were trained on a high-performance computing node featuring two Intel Xeon Platinum 8360Y CPUs and an NVIDIA A100 GPU with 40 GB of HBM2 memory. We used the PyTorch framework [15] and employed Binary Cross-Entropy with Logits Loss as our loss function, with the Adam optimizer facilitating training due to its efficiency and minimal memory requirements [16].
The initial learning rate was set at , and a learning rate scheduling strategy was implemented to optimize learning. This strategy multiplies the learning rate by 0.1 when the validation loss plateaus for four epochs, enabling more substantial updates in early training phases and smaller updates as the model nears convergence. Training lasted up to 200 epochs, with early stopping [17] implemented to prevent overfitting. If the validation loss showed no improvement over 20 consecutive epochs, training was stopped, and the parameters that achieved the lowest validation loss were saved.
Figure 3: Example images and corresponding masks from each dataset. Panels: A, A large, A contour, A large contour, B, B large, B contour, B large contour, Fetal head, Fetal head 2, Kidneys, Lungs, Thyroid, Nerve.
3.2 Datasets
Our study utilized a wide array of datasets, both synthetic and real-world medical images. The synthetic datasets were specifically designed to evaluate certain hypotheses under controlled conditions. Following this, we applied our hypotheses to medical imaging datasets, which encompassed a variety of imaging techniques and anatomical structures, adding a layer of complexity and realism to our evaluations. Illustrative examples of images and corresponding masks from each dataset can be found in Figure 3.
3.2.1 Synthetic Datasets
The synthetic shape datasets are designed to provide a controlled environment for investigating the impact of the TRF on the performance of the models. The datasets consist of synthetic images with predefined shapes and configurations, allowing for a systematic exploration of the models’ behaviour under different conditions.
There are a total of 8 datasets with generated images. These are of two types, referred to as Type A and Type B. Both types include three non-overlapping shapes – a circle, a triangle, and a square – that are randomly placed and rotated, with a random gray value assigned to each shape. For Type A images, the masks are identical to the shapes in the images. For Type B images, the masks are the same, but the mask of the square is omitted, adding an additional level of complexity to the segmentation task.
For each type, four datasets are created. Two of them contain small shapes placed on an invisible grid, and two of them contain large shapes placed on an invisible grid. For both the small and large datasets one of them contains images with filled shapes and filled masks, and the other one contains images with only the contours of the shapes with filled masks. Each dataset contains a total of 1000 images of which 700 are used for training, 150 for validation, and 150 for testing.
These synthetic shape datasets offer valuable insights into the role of the TRF on the performance of the models. By comparing the performance of the models on images with small shapes versus large shapes, we can assess how the TRF size affects the model’s ability to capture features of different scales. Specifically, it allows us to determine to what degree it matters if the TRF is smaller than the shape, or if the shape fits into the TRF.
The comparison between images with filled shapes and those with contour shapes allows us to determine what happens if the TRF does not capture the entire shape, but only a part of it, such as the part which is completely black in the image but is filled in the mask because it is within the contours. This is particularly relevant for real-world applications, where the images often contain complex structures that the model needs to accurately segment.
Furthermore, the use of Type B images, where the mask of the square is omitted, enables us to examine how the models handle irrelevant features in the images. This is particularly relevant for real-world applications, where the images often contain irrelevant or distracting features that the model needs to ignore to perform the task effectively.
3.2.2 Medical Datasets
The experiments were carried out using the six medical datasets listed below. The datasets are classified into two categories: high-contrast, where the RoI can be visually distinguished solely based on its contrast with the background, and low-contrast, which requires additional details like the RoI’s contour or shape to distinguish it from the background.
1. Fetal Head – This low-contrast dataset consists of 2D ultrasound images of fetal heads [dataset] [18]. It includes 350 training images, 74 validation images, and 76 test images. The images were obtained using a standard clinical ultrasound system, and the fetal head circumference was manually annotated by expert sonographers.
2. Fetal Head 2 – This low-contrast dataset is another set of 2D ultrasound images of fetal heads, with a larger number of images [dataset] [19, 20]. It includes 14560 training images, 3240 validation images, and 2875 test images. The images in this dataset were collected from multiple hospitals and were annotated by experienced radiologists.
3. Kidneys – This low-contrast dataset consists of 3D MRI images of kidneys [dataset] [21, 22]. It includes 454 training images, 91 validation images, and 104 test images. The images were acquired using a 3T MRI scanner and the kidney regions were manually segmented by radiologists.
4. Lungs – This high-contrast dataset consists of 2D X-Ray images of lungs [dataset] [23, 24]. It includes 396 training images, 84 validation images, and 86 test images. The images were collected from a variety of patients with different lung conditions, providing a diverse dataset for training and testing.
5. Thyroid – This low-contrast dataset consists of 3D ultrasound images of the thyroid [dataset] [25]. It includes 3160 training images, 439 validation images, and 510 test images. The images were acquired using a high-frequency linear array transducer and the thyroid regions were manually segmented by experienced clinicians.
6. Nerve – This low-contrast dataset consists of 2D ultrasound images of nerves [dataset] [26]. It includes 1610 training images, 364 validation images, and 349 test images. The images were collected from a variety of patients and the nerve structures were manually annotated by expert radiologists.
3.3 Data Pre-processing
All images in the datasets were pre-processed to ensure consistency and optimal performance of the models. The pre-processing steps included resizing all images to a uniform size of 576 × 576 pixels. For the 3D datasets, all 2D slices were extracted and used as separate images.
The datasets were split into training, validation, and test sets, with approximately 70% of the images used for training, 15% for validation, and 15% for testing. However, to prevent overfitting, slices from one 3D volume or 2D images from the same patient were included in only one of the train, validation, or test sets. This means that the split is not always exactly in these ratios.
Finally, on some of the smaller datasets random data augmentation was applied in order to improve the absolute results. On each sample, four random combinations of a horizontal flip, vertical flip and rotation with 90, 180 or 270 degrees were applied, where each part of the combination is applied with a probability of 0.5.
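The augmentation scheme can be sketched as below, assuming square 2D arrays and that the image and its mask receive the same transformation. The function names and the seeding interface are hypothetical.

```python
import numpy as np

def random_augment(image, mask, rng):
    """Apply each of flip-lr, flip-ud and a 90/180/270 rotation with p = 0.5."""
    if rng.random() < 0.5:
        image, mask = np.fliplr(image), np.fliplr(mask)
    if rng.random() < 0.5:
        image, mask = np.flipud(image), np.flipud(mask)
    if rng.random() < 0.5:
        turns = int(rng.integers(1, 4))   # 1, 2 or 3 quarter turns
        image, mask = np.rot90(image, turns), np.rot90(mask, turns)
    return image.copy(), mask.copy()

def augment_sample(image, mask, n_copies=4, seed=None):
    """Return n_copies randomly transformed (image, mask) pairs."""
    rng = np.random.default_rng(seed)
    return [random_augment(image, mask, rng) for _ in range(n_copies)]

image = np.arange(16).reshape(4, 4)
mask = image > 7
samples = augment_sample(image, mask, n_copies=4, seed=0)
```

Because flips and quarter-turn rotations only permute pixels, each augmented mask stays aligned with its augmented image.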
3.4 Evaluation Measures
In the realm of image segmentation, five principal metrics are typically utilized to assess performance [27, 28]. The Dice Similarity Coefficient (DSC) serves as a statistical metric, measuring the similarity between two sets by calculating the ratio of twice the intersection area to the total size of both sets. Sensitivity, or recall, appraises the model’s ability to accurately identify positive instances, hence providing insight into the model’s efficacy in segmenting intended areas. Specificity evaluates the model’s proficiency in correctly recognizing negative instances, or in other words, its capability to exclude regions not meant to be segmented. Accuracy gauges the model’s overall correctness in assigning classifications. Lastly, the Jaccard Index (JI) is an intersection-over-union measure that quantifies the similarity between the predicted and actual segmentations, providing a rigorous assessment of model performance in segmenting images.
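For reference, the five metrics can be computed from the confusion counts of a binary prediction and its ground-truth mask. The sketch below uses the standard definitions; the function name and the small `eps` term guarding against division by zero are illustrative choices.

```python
import numpy as np

def segmentation_metrics(pred, target, eps=1e-8):
    """Dice, sensitivity, specificity, accuracy and Jaccard index for binary masks."""
    pred, target = np.asarray(pred, dtype=bool), np.asarray(target, dtype=bool)
    tp = np.sum(pred & target)     # foreground predicted as foreground
    fp = np.sum(pred & ~target)    # background predicted as foreground
    fn = np.sum(~pred & target)    # foreground predicted as background
    tn = np.sum(~pred & ~target)   # background predicted as background
    return {
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "sensitivity": tp / (tp + fn + eps),
        "specificity": tn / (tn + fp + eps),
        "accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "jaccard": tp / (tp + fp + fn + eps),
    }

scores = segmentation_metrics([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

In this toy case one of the two predicted foreground pixels is correct, giving a Dice score of 2/3 and a Jaccard index of 1/2.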
Moreover, to understand fully the impact of TRF and ERF on model performance, two additional metrics are proposed in this work: ERF rate and object rate. We also factor in the training time (epochs) as a metric, quantifying the epochs needed to attain the lowest validation loss. This allows us to comparatively analyze the training cost across various models.
3.4.1 ERF rate
We propose a new metric called ERF rate to measure the ERF distribution. It quantifies the fraction of significantly contributing pixels within the ERF against the TRF area, utilizing the absolute value of the ERF pixels. The ERF rate (equation 10) accounts for all the meaningful pixels above a certain threshold in the ERF, giving more weight to pixels with higher values and normalizing the result with the TRF area. The metric is computed for each test image, reporting the mean ERF rate as the overall score.
(10) |
We use Kernel Density Estimation (KDE) to find the threshold for key contributing pixels, estimating the Probability Density Function (PDF) of a continuous variable based on observed samples [29].
The density function of ERF values can be calculated using the formula in equation 11, where $\hat{f}(x)$ is the estimated PDF, $K$ is a kernel function with bandwidth $h$, and $n$ is the number of observations in the sample. The kernel is centered at each observation $x_i$.
$$\hat{f}(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \tag{11}$$
To identify the ideal parameters for KDE, we examined the ERF absolute value histogram for a large dataset sample. It reveals two different types of distributions: (i) ERFs with both contributing and non-contributing pixels have a bimodal distribution with a left peak representing non-contributing pixels and a right peak representing contributing pixels, and (ii) ERFs with mostly non-contributing pixels have a highly positively skewed distribution. The first parameter, the bandwidth ($h$), controls the kernel width and PDF smoothing level. Silverman’s rule-of-thumb [30], which derives the bandwidth from the standard deviation and the size of the sample, was used to determine $h$ automatically, because it performs well on both bimodal and skewed distributions [31].
Finally, the threshold was selected based on the trough in bimodal distributions or the point at which the density stops decreasing in skewed distributions (Figure 4). To reduce the number of troughs, thus making it easier to find the optimal threshold, a Gaussian kernel function was used to smooth the estimated PDF [29].
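The threshold search for the bimodal case can be sketched in NumPy as follows. Several details are assumptions rather than the procedure used in this work: the constant 1.06 in the bandwidth is one common form of Silverman's rule, the density is evaluated on a fixed grid, and the skewed case is not handled (the function returns `None` when no trough is found). All function names are hypothetical.

```python
import numpy as np

def silverman_bandwidth(x):
    """Assumed form of Silverman's rule of thumb: 1.06 * sigma * n^(-1/5)."""
    return 1.06 * np.std(x, ddof=1) * len(x) ** (-1 / 5)

def gaussian_kde(x, grid, h):
    """Kernel density estimate with a Gaussian kernel, evaluated on grid."""
    z = (grid[:, None] - x[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))

def kde_threshold(erf_values, n_grid=512):
    """First trough of the estimated density of |ERF| values, or None."""
    x = np.abs(np.asarray(erf_values, dtype=float)).ravel()
    grid = np.linspace(x.min(), x.max(), n_grid)
    density = gaussian_kde(x, grid, silverman_bandwidth(x))
    # An interior grid point lower than its left neighbour and not higher
    # than its right neighbour is treated as a trough.
    is_trough = (density[1:-1] < density[:-2]) & (density[1:-1] <= density[2:])
    troughs = np.flatnonzero(is_trough) + 1
    return float(grid[troughs[0]]) if troughs.size else None

# Synthetic bimodal sample: non-contributing pixels near 0.1, contributing near 0.9.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0.1, 0.02, 500), rng.normal(0.9, 0.02, 500)])
threshold = kde_threshold(sample)
```

The pairwise evaluation costs memory proportional to the grid size times the number of samples, so a full-resolution ERF would likely need to be subsampled first.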
3.4.2 Object rate
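In order to assess the relative size of the object to be segmented in comparison to the TRF size, a new metric denoted as object rate is proposed. This metric is computed by dividing the total area of a rectangle encompassing the edges of the object by the total area of the TRF, that is, the square of the TRF size $S_{\mathrm{TRF}}$ defined in equation 2. Thus, for an object with its highest point at $y_{\text{top}}$, lowest at $y_{\text{bottom}}$, leftmost at $x_{\text{left}}$, and rightmost at $x_{\text{right}}$, the object rate can be calculated as follows:
$$\text{Object rate} = \frac{(y_{\text{bottom}} - y_{\text{top}})\,(x_{\text{right}} - x_{\text{left}})}{S_{\mathrm{TRF}}^{2}} \tag{12}$$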
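A small sketch of the object rate for a binary mask is shown below. It assumes that the bounding-box extent is counted in whole pixels (hence the `+ 1`), which is a convention chosen here and may differ from how the corner coordinates are measured in this work; the function name is hypothetical.

```python
import numpy as np

def object_rate(mask, trf_size):
    """Bounding-box area of the object divided by the squared TRF size."""
    mask = np.asarray(mask, dtype=bool)
    rows = np.flatnonzero(mask.any(axis=1))   # rows that contain the object
    cols = np.flatnonzero(mask.any(axis=0))   # columns that contain the object
    if rows.size == 0:
        return 0.0                            # empty mask: no object
    height = rows[-1] - rows[0] + 1           # assumption: inclusive pixel count
    width = cols[-1] - cols[0] + 1
    return float(height * width) / trf_size ** 2

mask = np.zeros((10, 10), dtype=bool)
mask[2:6, 3:8] = True                         # 4 x 5 object
rate = object_rate(mask, trf_size=10)
```

A value above 1 indicates that the object's bounding box is larger than the TRF, and a value below 1 that the object fits within it.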
4 Results & Discussion
Detailed results of the performance of the U-Net model for the different metrics on all medical datasets can be found in Table 2. Results of the Attention U-Net on the medical datasets and of the U-Net on the synthetic datasets of Type A and Type B can be found in Appendices A, B, and C, respectively. In the following sections we present different plots to interpret and discuss these results.
TRF size | trf54 | trf100 | trf146 | trf204 | trf230 | trf298 | trf360 | trf412 | trf486 | trf570 |
---|---|---|---|---|---|---|---|---|---|---|
Fetal head | ||||||||||
Training time (epochs) | 38 | 31 | 21 | 26 | 34 | 21 | 24 | 26 | 28 | 30 |
ERF rate before training | 0.0135 | 0.0097 | 0.0047 | 0.0011 | 0.0046 | 0.0014 | 0.0044 | 0.0005 | 0.0012 | 0.0009 |
ERF rate | 0.8898 | 0.9380 | 0.9153 | 0.8614 | 0.8196 | 0.5970 | 0.5785 | 0.4618 | 0.4175 | 0.2309 |
Dice score | 0.7752 | 0.8866 | 0.9224 | 0.9527 | 0.9506 | 0.9526 | 0.9623 | 0.9614 | 0.9650 | 0.9665 |
Object rate | 67.3558 | 16.8389 | 7.4840 | 4.0325 | 2.6942 | 1.7922 | 1.0997 | 0.9870 | 0.6452 | 0.4481 |
Accuracy | 0.8687 | 0.9300 | 0.9518 | 0.9690 | 0.9693 | 0.9704 | 0.9749 | 0.9735 | 0.9761 | 0.9773 |
Sensitivity | 0.8458 | 0.8914 | 0.9395 | 0.9508 | 0.9707 | 0.9709 | 0.9675 | 0.9570 | 0.9627 | 0.9680 |
Specificity | 0.8831 | 0.9530 | 0.9612 | 0.9828 | 0.9739 | 0.9745 | 0.9844 | 0.9878 | 0.9887 | 0.9873 |
Jaccard index | 0.6577 | 0.8152 | 0.8723 | 0.9212 | 0.9210 | 0.9248 | 0.9390 | 0.9371 | 0.9435 | 0.9466 |
Fetal head 2 | ||||||||||
Training time (epochs) | 1 | 1 | 1 | 6 | 1 | 8 | 7 | 7 | 25 | 14 |
ERF rate before training | 0.1531 | 0.0330 | 0.0300 | 0.0097 | 0.0245 | 0.0073 | 0.0163 | 0.0019 | 0.0051 | 0.0054 |
ERF rate | 1.0005 | 0.0944 | 0.2275 | 0.0095 | 0.0168 | 0.0051 | 0.2761 | 0.0010 | 0.3159 | 0.1735 |
Dice score | 0.6009 | 0.6261 | 0.7582 | 0.8745 | 0.7950 | 0.8588 | 0.9028 | 0.9071 | 0.9116 | 0.9214 |
Object rate | 35.5599 | 8.8900 | 3.9511 | 2.1289 | 1.4224 | 0.9462 | 0.5806 | 0.5211 | 0.3406 | 0.2365 |
Accuracy | 0.9889 | 0.8888 | 0.9207 | 0.9557 | 0.9315 | 0.9525 | 0.9654 | 0.9675 | 0.9686 | 0.9725 |
Sensitivity | 0.7058 | 0.7818 | 0.8512 | 0.9133 | 0.9284 | 0.9502 | 0.9567 | 0.9524 | 0.9479 | 0.9616 |
Specificity | 0.9913 | 0.9004 | 0.9303 | 0.9627 | 0.9316 | 0.9514 | 0.9645 | 0.9679 | 0.9708 | 0.9723 |
Jaccard index | 0.5154 | 0.4764 | 0.6265 | 0.7876 | 0.6841 | 0.7805 | 0.8365 | 0.8445 | 0.8513 | 0.8656 |
Kidneys | ||||||||||
Training time (epochs) | 21 | 34 | 31 | 32 | 38 | 47 | 54 | 47 | 48 | 54 |
ERF rate before training | 0.1715 | 0.0496 | 0.0303 | 0.0126 | 0.0287 | 0.0058 | 0.0217 | 0.0024 | 0.0059 | 0.0065 |
ERF rate | 0.0123 | 0.0341 | 0.0227 | 0.0057 | 0.0162 | 0.0035 | 0.0088 | 0.0012 | 0.0038 | 0.0038 |
Dice score | 0.7560 | 0.8367 | 0.8477 | 0.8524 | 0.8617 | 0.8364 | 0.8865 | 0.8657 | 0.8439 | 0.8802 |
Object rate | 27.4954 | 6.8738 | 3.0550 | 1.6461 | 1.0998 | 0.7316 | 0.4489 | 0.4029 | 0.2634 | 0.1829 |
Accuracy | 0.9832 | 0.9904 | 0.9911 | 0.9911 | 0.9917 | 0.9889 | 0.9923 | 0.9902 | 0.9900 | 0.9918 |
Sensitivity | 0.7808 | 0.8856 | 0.8762 | 0.8783 | 0.8645 | 0.8814 | 0.8993 | 0.8914 | 0.8647 | 0.8884 |
Specificity | 0.9892 | 0.9935 | 0.9945 | 0.9948 | 0.9963 | 0.9916 | 0.9961 | 0.9935 | 0.9942 | 0.9954 |
Jaccard index | 0.6320 | 0.7509 | 0.7679 | 0.7717 | 0.7853 | 0.7490 | 0.8119 | 0.7836 | 0.7607 | 0.8055 |
Lungs | ||||||||||
Training time (epochs) | 15 | 20 | 29 | 26 | 36 | 26 | 51 | 28 | 33 | 40 |
ERF rate before training | 0.0417 | 0.0419 | 0.0216 | 0.0127 | 0.0122 | 0.0069 | 0.0134 | 0.0009 | 0.0036 | 0.0061 |
ERF rate | 0.0614 | 0.1298 | 0.0315 | 0.0040 | 0.0137 | 0.0031 | 0.0245 | 0.0012 | 0.0011 | 0.0039 |
Dice score | 0.9601 | 0.9673 | 0.9687 | 0.9686 | 0.9683 | 0.9666 | 0.9689 | 0.9683 | 0.9662 | 0.9673 |
Object rate | 84.4219 | 21.1055 | 9.3802 | 5.0542 | 3.3769 | 2.2463 | 1.3784 | 1.2371 | 0.8087 | 0.5616 |
Accuracy | 0.9784 | 0.9823 | 0.9830 | 0.9829 | 0.9829 | 0.9820 | 0.9830 | 0.9827 | 0.9818 | 0.9824 |
Sensitivity | 0.9650 | 0.9677 | 0.9697 | 0.9681 | 0.9695 | 0.9654 | 0.9694 | 0.9654 | 0.9746 | 0.9776 |
Specificity | 0.9825 | 0.9870 | 0.9872 | 0.9878 | 0.9869 | 0.9873 | 0.9875 | 0.9886 | 0.9834 | 0.9834 |
Jaccard index | 0.9240 | 0.9371 | 0.9398 | 0.9396 | 0.9391 | 0.9361 | 0.9402 | 0.9389 | 0.9353 | 0.9373 |
Nerve | ||||||||||
Training time (epochs) | 7 | 13 | 15 | 8 | 14 | 17 | 10 | 8 | 10 | 12 |
ERF rate before training | 0.1425 | 0.0444 | 0.0381 | 0.0107 | 0.0287 | 0.0096 | 0.0213 | 0.0019 | 0.0104 | 0.0078 |
ERF rate | 0.9312 | 0.7345 | 0.6953 | 0.0057 | 0.3348 | 0.0224 | 0.1363 | 0.0008 | 0.1244 | 0.0399 |
Dice score | 0.4685 | 0.7329 | 0.7531 | 0.7745 | 0.7792 | 0.7863 | 0.7965 | 0.7951 | 0.7960 | 0.7947 |
Object rate | 7.3183 | 1.8296 | 0.8131 | 0.4381 | 0.2927 | 0.1947 | 0.1195 | 0.1072 | 0.0701 | 0.0487 |
Accuracy | 0.9758 | 0.9848 | 0.9859 | 0.9868 | 0.9873 | 0.9872 | 0.9876 | 0.9878 | 0.9881 | 0.9880 |
Sensitivity | 0.6442 | 0.7519 | 0.7637 | 0.7803 | 0.7978 | 0.7990 | 0.7901 | 0.8068 | 0.8301 | 0.8289 |
Specificity | 0.9808 | 0.9914 | 0.9923 | 0.9930 | 0.9928 | 0.9930 | 0.9940 | 0.9934 | 0.9927 | 0.9927 |
Jaccard index | 0.3281 | 0.6030 | 0.6321 | 0.6572 | 0.6614 | 0.6701 | 0.6827 | 0.6800 | 0.6786 | 0.6785 |
Thyroid | ||||||||||
Training time (epochs) | 1 | 1 | 3 | 2 | 2 | 5 | 3 | 2 | 7 | 4 |
ERF rate before training | 0.1652 | 0.0434 | 0.0280 | 0.0094 | 0.0227 | 0.0093 | 0.0180 | 0.0021 | 0.0089 | 0.0054 |
ERF rate | 0.1041 | 0.1439 | 0.1823 | 0.0124 | 0.0482 | 0.0240 | 0.0269 | 0.0038 | 0.0152 | 0.0054 |
Dice score | 0.5155 | 0.5829 | 0.6456 | 0.7043 | 0.7124 | 0.6907 | 0.6680 | 0.6667 | 0.7457 | 0.7284 |
Object rate | 14.8609 | 3.7152 | 1.6512 | 0.8897 | 0.5944 | 0.3954 | 0.2426 | 0.2178 | 0.1424 | 0.0989 |
Accuracy | 0.9718 | 0.9807 | 0.9840 | 0.9860 | 0.9822 | 0.9864 | 0.9854 | 0.9859 | 0.9871 | 0.9837 |
Sensitivity | 0.6309 | 0.7427 | 0.7374 | 0.7516 | 0.7021 | 0.7563 | 0.7449 | 0.7705 | 0.7602 | 0.7268 |
Specificity | 0.9788 | 0.9827 | 0.9879 | 0.9912 | 0.9928 | 0.9905 | 0.9884 | 0.9879 | 0.9935 | 0.9933 |
Jaccard index | 0.3881 | 0.4779 | 0.5481 | 0.6107 | 0.6168 | 0.6036 | 0.5785 | 0.5746 | 0.6526 | 0.6246 |
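The overlap metrics reported in Table 2 have standard confusion-matrix definitions. As a minimal sketch, assuming binary masks given as nested lists of 0/1 and metrics computed on a single image (how the scores are aggregated over a test set is not stated here, and the helper names are hypothetical), they could be computed as follows:

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN for two equally sized binary masks (nested lists of 0/1)."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred, target):
    """Return Dice, Jaccard, sensitivity, specificity, and accuracy for one mask pair."""
    tp, fp, fn, tn = confusion_counts(pred, target)

    def ratio(num, den):
        # Assumption: an empty denominator (e.g. two empty masks) counts as a perfect score.
        return num / den if den else 1.0

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "jaccard": ratio(tp, tp + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Toy 4x4 example: the prediction misses one foreground pixel.
pred = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
target = [[1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
metrics = segmentation_metrics(pred, target)
```

Averaging per-image scores over a test set generally gives different values than pooling all pixels first, so a dataset-level Jaccard index need not equal DSC / (2 - DSC).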
Figs. 7 and 8 (images omitted): panels shown for TRF = 54, TRF = 230, and TRF = 486.
4.1 The Role of Contrast
Figure 5 shows the relationship between TRF size and DSC for the U-Net model on all synthetic shape datasets, covering both Type A and Type B. For the datasets that can be segmented solely based on contrast (A, A large), the model attains perfect performance even at the smallest TRF size (Figure 5 (a)). Datasets that add a layer of complexity, either by representing only the contours of the RoI in the input images (A contour, A large contour) or by excluding the square from the mask (Type B), require a larger TRF to reach peak performance (Figure 5 (a), Figure 5 (b)). For these datasets, the DSC starts lower at a small TRF and reaches its peak only as the TRF grows, in contrast to the consistently perfect performance on the contrast-based datasets.
The same pattern is present in the medical datasets: all datasets with a low-contrast RoI show an increasing DSC as the TRF size grows, whereas the high-contrast lung dataset attains peak performance starting at the lowest TRF (Figs. 6(a) and 6(e)). The segmentation outputs for the fetal head, kidneys, nerve, and thyroid datasets at different TRFs are shown in Figs. 7 (a), 7 (b), 7 (c), and 7 (d), respectively, and the combined results for the U-Net model are given in Table 2. Fig. 8 shows the results for the lung dataset at different TRFs. Because the RoI in the lung dataset can be identified visually from contrast alone, the DSC is close to its peak value even for a very small TRF. The predicted segmentation improves markedly with increasing TRF for all datasets except the lung dataset.
This pattern is also visible in Table 2, where the low-contrast datasets show a DSC that increases with TRF, whereas the high-contrast lung dataset does not.
Fig. 9 (image omitted): panels shown for TRF = 54, 100, 146, 204, 230, 298, 360, 412, 486, and 570.
4.2 Optimal TRF Size
Figs. 6(b) and 6(f) show that the ERF rate shrinks as the TRF size grows, for the U-Net and the Attention U-Net respectively. This suggests that as the TRF size increases, a smaller proportion of pixels actually contributes to the predicted segmentation. Moreover, as the TRF size increases, the training time (epochs) also tends to increase, as displayed in Figs. 6(c) and 6(g) for the U-Net and the Attention U-Net respectively. This finding implies that an excessively large TRF size may lead to unnecessary computations, potentially explaining the observed increase in training time (epochs) with the expansion of the TRF size.
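To make the ERF rate concrete, the sketch below counts the ERF pixels whose magnitude exceeds a threshold and divides by the TRF area, following the description of the metric as the fraction of significantly contributing pixels relative to the TRF area. It assumes a square TRF and takes the threshold as a plain argument; in this work the threshold is derived from a KDE of the ERF pixel values, which is not reproduced here. The function name and toy values are hypothetical.

```python
def erf_rate(erf_map, trf_size, threshold):
    """Fraction of a square TRF area covered by significantly contributing ERF pixels.

    erf_map   -- nested list of ERF pixel values (e.g. input-gradient magnitudes)
    trf_size  -- side length of the TRF in pixels (square TRF is an assumption)
    threshold -- significance cut-off; passed in directly in this sketch
    """
    significant = sum(1 for row in erf_map for value in row if abs(value) > threshold)
    return significant / (trf_size * trf_size)


# Toy 4x4 ERF in which only the four central pixels contribute strongly.
toy_erf = [
    [0.0, 0.1, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.9, 0.1],
    [0.0, 0.1, 0.1, 0.0],
]
rate = erf_rate(toy_erf, trf_size=4, threshold=0.5)
```

With a fixed number of significant pixels, the rate falls as the TRF area grows, which is the behaviour described above.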
In this context, the object rate, plotted against the DSC in Figs. 6(d) (U-Net) and 6(h) (Attention U-Net), also seems to play a role. When the object rate, i.e. the size of the RoI relative to the TRF, increases, the DSC degrades. This is corroborated by the two rightmost columns of Table 3: for most low-contrast datasets, where the TRF size plays a major role, the optimal TRF size, i.e. the TRF size at which the DSC saturates, is usually only slightly smaller than the size of the RoI.
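One way to read a saturation interval off a DSC curve is sketched below. The rule used here (the first TRF whose DSC lies within a fixed tolerance of the best observed DSC) and the tolerance of 0.015 are assumptions made for illustration; the criterion actually used for Table 3 is not specified in this section. The DSC values are the U-Net rows for fetal head and lungs from Table 2.

```python
TRF_SIZES = [54, 100, 146, 204, 230, 298, 360, 412, 486, 570]

# U-Net Dice scores copied from Table 2.
DSC_FETAL_HEAD = [0.7752, 0.8866, 0.9224, 0.9527, 0.9506,
                  0.9526, 0.9623, 0.9614, 0.9650, 0.9665]
DSC_LUNGS = [0.9601, 0.9673, 0.9687, 0.9686, 0.9683,
             0.9666, 0.9689, 0.9683, 0.9662, 0.9673]


def saturation_interval(trf_sizes, dsc_values, tolerance=0.015):
    """Return (previous_trf, trf) for the first TRF whose DSC is within
    `tolerance` of the best DSC; previous_trf is 0 for the smallest TRF.
    The tolerance-based rule is a hypothetical criterion."""
    best = max(dsc_values)
    previous = 0
    for trf, dsc in zip(trf_sizes, dsc_values):
        if best - dsc <= tolerance:
            return previous, trf
        previous = trf
    raise ValueError("dsc_values must not be empty")


fetal_head_interval = saturation_interval(TRF_SIZES, DSC_FETAL_HEAD)
lungs_interval = saturation_interval(TRF_SIZES, DSC_LUNGS)
```

Under this assumed rule, the fetal head curve saturates between TRF 146 and 204 and the lung curve between 0 and 54. Other datasets may land in a different interval than the one listed in Table 3, since the result depends on the chosen tolerance.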
Despite the overall trend of increasing DSC with expanding TRF size, we observe slight drops at some TRF sizes. These can be interpreted in light of the concept of variability in neural networks, as discussed in [32]. Variability, as defined there, represents the richness of landscape patterns in the data space with respect to well-scaled random weights. As the TRF size increases, the model incorporates more global context into its predictions. While this can be beneficial for capturing larger-scale structures in the image, it may also introduce more noise into the model’s predictions, especially if the larger TRF includes irrelevant or distracting features. This could result in slight decreases in the DSC.
4.3 Attention Mechanism and TRF Size
In Figs. 6(a) and 6(e), the DSC is plotted against the TRF size for the U-Net and Attention U-Net, respectively, for all the medical datasets. In both cases, the DSC increases as the TRF grows. Fig. 9 shows the segmentation for the fetal head 2 dataset for all TRFs, together with the corresponding TRF and ERF. As the TRF size increases, the segmentation accuracy increases; the overall trend can be seen in Table 2. The same trend is visible for the Attention U-Net model, the results of which are given in Appendix A. However, all absolute scores are higher for the Attention U-Net. Column 4 of Table 3 summarizes, for each dataset, whether this DSC pattern is retained with the Attention U-Net model. All the medical imaging datasets except the lung dataset follow the pattern for both the U-Net and the Attention U-Net models, with higher absolute scores for the attention counterpart. Hence, the attention mechanism appears to improve performance consistently, regardless of TRF size. Even with the attention mechanism, the TRF plays an important role, and a larger TRF might further improve the performance of Attention U-Net models.
4.4 Designing Efficient Architectures
In this work, we ran the experiments for different TRFs with the same total number of parameters on the different datasets, as can be seen in Table 1. Detailed results of the U-Net model for all metrics on the medical datasets are given in Table 2. Results of the Attention U-Net on the medical datasets and of the U-Net on the synthetic datasets of Type A and Type B are given in Appendices A, B, and C, respectively. These results show that, even for the same number of parameters, changing the TRF has a strong effect on the performance of the network. Including the TRF size as a model parameter can therefore lead to a fairer comparison of model performance. It will also help in designing efficient architectures, i.e. ones with an optimal trade-off between performance and number of parameters.
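To illustrate how the TRF can be treated as a design parameter, the following sketch applies the standard closed-form recurrence for a plain chain of convolution and pooling layers, as described in [7]. It is not the notation or the tool proposed in this work: it ignores upsampling and skip connections, so it covers only an encoder-like path and will not reproduce the TRF sizes in Table 2. The function name and the example layer list are hypothetical.

```python
def theoretical_receptive_field(layers):
    """TRF (in input pixels, along one axis) of a sequential chain of layers.

    layers -- iterable of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf = 1    # a single output position initially sees one input pixel
    jump = 1  # spacing, in input pixels, between neighbouring positions of the current feature map
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical encoder fragment: two 3x3 convolutions, a 2x2 max pool with stride 2,
# followed by two more 3x3 convolutions.
encoder_path = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]
trf = theoretical_receptive_field(encoder_path)
```

Because each layer contributes (kernel_size - 1) times the accumulated stride, depth (number of pooling steps) and kernel size change the TRF without necessarily changing the parameter count in the same proportion, which is why two models of equal size can have very different TRFs.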
Dataset | Dataset Type | RoI can be identified visually only by contrast | Pattern of increasing DSC with TRF | Pattern retained with Attention U-Net, but higher absolute score | Average dimension of RoI | DSC saturates between TRF sizes |
---|---|---|---|---|---|---|
Nerve | Clinical | No | Yes | Yes | 159 | 298-360 |
B contour | Synthetic | No | Yes | N/A | 168 | 100-146 |
A contour | Synthetic | No | Yes | N/A | 169 | 100-146 |
Thyroid | Clinical | No | Yes | Yes | 187 | 146-204 |
B large contour | Synthetic | No | Yes | N/A | 237 | 146-204 |
A large contour | Synthetic | No | Yes | N/A | 242 | 146-204 |
Fetal head 2 | Clinical | No | Yes | Yes | 255 | 146-204 |
Fetal head | Clinical | No | Yes | Yes | 260 | 146-204 |
Kidneys | Clinical | No* | Yes | Yes | 101 | 298-360 |
B | Synthetic | No* | Yes | N/A | 168 | 54-100 |
B large | Synthetic | No* | Yes | N/A | 238 | 100-146 |
Lungs | Clinical | Yes | No | N/A | 329 | 0-54 |
A | Synthetic | Yes | No | N/A | 168 | 0-54 |
A large | Synthetic | Yes | No | N/A | 244 | 0-54 |
5 Conclusion
This work highlights the essential role of the TRF size in semantic segmentation tasks with U-Net and Attention U-Net architectures across datasets of various modalities. We found that an optimal TRF size, one that balances the capture of global context against computational efficiency, can significantly enhance model performance. This implies that an excessively large TRF size may lead to unnecessary computational costs without corresponding improvements in performance. Additionally, our results emphasize the added value of the attention mechanism in boosting segmentation accuracy, irrespective of the TRF size.
Our findings suggest that datasets in which the RoI can be visually identified by contrast alone typically attain peak performance even with a small TRF. Conversely, this is not the case when additional complexities are present, such as contrast not being the only criterion for identifying the RoI, or contours demarcating the RoI. This implies that the model’s performance also depends on factors like the complexity of the task and the size of the RoI relative to the TRF size.
Furthermore, our study indicates that the DSC tends to plateau at a certain TRF size depending on the dataset. This suggests that there exists an optimal TRF size for each dataset, beyond which further expansion of the TRF size does not significantly improve the DSC. These findings can have practical implications for the design of segmentation models, suggesting that increasing TRF size may not always be necessary or beneficial.
These insights provide a valuable reference for designing and optimizing U-Net-based architectures for various tasks and datasets in medical imaging. While our study focused on the U-Net and Attention U-Net architectures, there are many other architectures used for semantic segmentation tasks, such as SegNet [33], PSPNet [34], and DeepLab [35]. Future research could investigate the impact of the TRF size on the performance of these architectures.
Acknowledgements
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Appendices
Appendix A All results for the different evaluation measures on the medical datasets (fetal head, fetal head 2, kidneys, lungs, nerve, and thyroid) for the Attention U-Net.
TRF size | trf54 | trf100 | trf146 | trf204 | trf230 | trf298 | trf360 | trf412 | trf486 | trf570 |
---|---|---|---|---|---|---|---|---|---|---|
Fetal head | ||||||||||
Training time (epochs) | 36 | 28 | 26 | 34 | 35 | 39 | 39 | 34 | 32 | 34 |
ERF rate before training | 0.8256 | 0.5080 | 0.4090 | 0.4592 | 0.3215 | 0.3601 | 0.2955 | 0.4388 | 0.2972 | 0.2690 |
ERF rate | 0.9854 | 0.9548 | 0.9732 | 0.1771 | 0.8143 | 0.0024 | 0.5025 | 0.0013 | 0.0016 | 0.0025 |
Dice score | 0.8307 | 0.9213 | 0.9524 | 0.9538 | 0.9625 | 0.9640 | 0.9642 | 0.9655 | 0.9675 | 0.9667 |
Object rate | 67.3558 | 16.8389 | 7.4840 | 4.0325 | 2.6942 | 1.7922 | 1.0997 | 0.9870 | 0.6452 | 0.4481 |
Accuracy | 0.8984 | 0.9512 | 0.9708 | 0.9715 | 0.9769 | 0.9781 | 0.9780 | 0.9791 | 0.9807 | 0.9803 |
Sensitivity | 0.9085 | 0.9553 | 0.9528 | 0.9727 | 0.9591 | 0.9665 | 0.9637 | 0.9621 | 0.9638 | 0.9671 |
Specificity | 0.8920 | 0.9476 | 0.9799 | 0.9711 | 0.9869 | 0.9848 | 0.9865 | 0.9891 | 0.9905 | 0.9879 |
Jaccard index | 0.7200 | 0.8656 | 0.9206 | 0.9231 | 0.9388 | 0.9417 | 0.9420 | 0.9445 | 0.9482 | 0.9469 |
Fetal head 2 | ||||||||||
Training time (epochs) | 6 | 3 | 4 | 2 | 4 | 8 | 4 | 6 | 4 | 12 |
ERF rate before training | 0.8459 | 0.5388 | 0.4327 | 0.5030 | 0.3511 | 0.4051 | 0.3217 | 0.4826 | 0.3293 | 0.3049 |
ERF rate | 1.0000 | 0.9070 | 0.8755 | 0.0325 | 0.7660 | 0.0845 | 0.3467 | 0.0030 | 0.1717 | 0.1735 |
Dice score | 0.7094 | 0.8058 | 0.8403 | 0.8574 | 0.8895 | 0.9025 | 0.9082 | 0.9135 | 0.9106 | 0.9224 |
Object rate | 35.5599 | 8.8900 | 3.9511 | 2.1289 | 1.4224 | 0.9462 | 0.5806 | 0.5211 | 0.3406 | 0.2850 |
Accuracy | 0.9155 | 0.9380 | 0.9476 | 0.9508 | 0.9621 | 0.9658 | 0.9673 | 0.9699 | 0.9698 | 0.9518 |
Sensitivity | 0.8614 | 0.8797 | 0.9172 | 0.9281 | 0.9419 | 0.9521 | 0.9465 | 0.9424 | 0.9513 | 0.9395 |
Specificity | 0.9232 | 0.9479 | 0.9516 | 0.9529 | 0.9648 | 0.9668 | 0.9688 | 0.9739 | 0.9705 | 0.9612 |
Jaccard index | 0.5716 | 0.6892 | 0.7341 | 0.7616 | 0.8130 | 0.8350 | 0.8441 | 0.8542 | 0.8517 | 0.8723 |
Kidneys | ||||||||||
Training time (epochs) | 27 | 29 | 25 | 48 | 52 | 50 | 57 | 57 | 65 | 55 |
ERF rate before training | 0.8694 | 0.5124 | 0.4018 | 0.4715 | 0.3262 | 0.3916 | 0.2979 | 0.4270 | 0.3122 | 0.2975 |
ERF rate | 0.1832 | 0.1251 | 0.0902 | 0.0406 | 0.1925 | 0.0388 | 0.0122 | 0.0015 | 0.0049 | 0.0019 |
Dice score | 0.7481 | 0.8529 | 0.8410 | 0.8542 | 0.8709 | 0.8484 | 0.8703 | 0.8558 | 0.8979 | 0.8586 |
Object rate | 27.4954 | 6.8738 | 3.0550 | 1.6461 | 1.0998 | 0.7316 | 0.4489 | 0.4029 | 0.2634 | 0.1829 |
Accuracy | 0.9829 | 0.9911 | 0.9896 | 0.9911 | 0.9917 | 0.9901 | 0.9918 | 0.9913 | 0.9930 | 0.9903 |
Sensitivity | 0.8210 | 0.8484 | 0.8500 | 0.8813 | 0.8878 | 0.8360 | 0.8813 | 0.8849 | 0.8920 | 0.8742 |
Specificity | 0.9867 | 0.9959 | 0.9951 | 0.9942 | 0.9951 | 0.9952 | 0.9954 | 0.9952 | 0.9965 | 0.9936 |
Jaccard index | 0.6197 | 0.7720 | 0.7505 | 0.7747 | 0.7967 | 0.7687 | 0.7990 | 0.7798 | 0.8300 | 0.7759 |
Lungs | ||||||||||
Training time (epochs) | 21 | 22 | 30 | 16 | 31 | 23 | 29 | 17 | 30 | 30 |
ERF rate before training | 0.8377 | 0.5446 | 0.4295 | 0.5266 | 0.3453 | 0.3745 | 0.3063 | 0.4558 | 0.3053 | 0.3143 |
ERF rate | 0.8615 | 0.6073 | 0.4391 | 0.0071 | 0.0240 | 0.1075 | 0.0245 | 0.0044 | 0.0084 | 0.0045 |
Dice score | 0.9574 | 0.9677 | 0.9673 | 0.9672 | 0.9668 | 0.9662 | 0.9671 | 0.9649 | 0.9666 | 0.9681 |
Object rate | 84.4219 | 21.1055 | 9.3802 | 5.0542 | 3.3769 | 2.2463 | 1.3784 | 1.2371 | 0.8087 | 0.5616 |
Accuracy | 0.9769 | 0.9824 | 0.9822 | 0.9823 | 0.9819 | 0.9817 | 0.9822 | 0.9810 | 0.9819 | 0.9827 |
Sensitivity | 0.9488 | 0.9738 | 0.9629 | 0.9665 | 0.9728 | 0.9636 | 0.9711 | 0.9618 | 0.9685 | 0.9684 |
Specificity | 0.9867 | 0.9850 | 0.9886 | 0.9872 | 0.9846 | 0.9875 | 0.9856 | 0.9871 | 0.9862 | 0.9874 |
Jaccard index | 0.9191 | 0.9378 | 0.9371 | 0.9370 | 0.9363 | 0.9352 | 0.9370 | 0.9331 | 0.9362 | 0.9388 |
Nerve | ||||||||||
Training time (epochs) | 10 | 12 | 12 | 13 | 10 | 10 | 14 | 9 | 14 | 11 |
ERF rate before training | 0.8478 | 0.5326 | 0.4261 | 0.4985 | 0.3459 | 0.3965 | 0.3144 | 0.4639 | 0.3263 | 0.3110 |
ERF rate | 0.9638 | 0.8183 | 0.8422 | 0.4236 | 0.5661 | 0.2254 | 0.4018 | 0.0762 | 0.2615 | 0.1329 |
Dice score | 0.4801 | 0.7014 | 0.7428 | 0.7708 | 0.7631 | 0.7689 | 0.7911 | 0.7911 | 0.7941 | 0.7964 |
Object rate | 7.3183 | 1.8296 | 0.8131 | 0.4381 | 0.2927 | 0.1947 | 0.1195 | 0.1072 | 0.0701 | 0.0487 |
Accuracy | 0.9746 | 0.9849 | 0.9860 | 0.9872 | 0.9869 | 0.9867 | 0.9881 | 0.9875 | 0.9880 | 0.9881 |
Sensitivity | 0.5738 | 0.8122 | 0.7848 | 0.7959 | 0.8150 | 0.7957 | 0.8210 | 0.8139 | 0.8343 | 0.8228 |
Specificity | 0.9824 | 0.9885 | 0.9911 | 0.9924 | 0.9914 | 0.9922 | 0.9927 | 0.9929 | 0.9924 | 0.9930 |
Jaccard index | 0.3427 | 0.5687 | 0.6203 | 0.6541 | 0.6437 | 0.6490 | 0.6761 | 0.6765 | 0.6770 | 0.6805 |
Thyroid | ||||||||||
Training time (epochs) | 2 | 2 | 2 | 3 | 4 | 3 | 3 | 2 | 3 | 5 |
ERF rate before training | 0.8760 | 0.5268 | 0.4179 | 0.4842 | 0.3426 | 0.4035 | 0.3087 | 0.4485 | 0.2973 | 0.3025 |
ERF rate | 0.8913 | 0.6137 | 0.3449 | 0.0786 | 0.1057 | 0.0504 | 0.1024 | 0.0125 | 0.0468 | 0.0253 |
Dice score | 0.5706 | 0.6773 | 0.6638 | 0.7464 | 0.7142 | 0.7181 | 0.7060 | 0.7455 | 0.7219 | 0.7420 |
Object rate | 14.8609 | 3.7152 | 1.6512 | 0.8897 | 0.5944 | 0.3954 | 0.2426 | 0.2178 | 0.1424 | 0.0989 |
Accuracy | 0.9704 | 0.9811 | 0.9819 | 0.9844 | 0.9818 | 0.9835 | 0.9839 | 0.9832 | 0.9802 | 0.9836 |
Sensitivity | 0.6093 | 0.7146 | 0.7767 | 0.7933 | 0.7155 | 0.7677 | 0.7581 | 0.7960 | 0.7262 | 0.7863 |
Specificity | 0.9832 | 0.9897 | 0.9871 | 0.9904 | 0.9914 | 0.9903 | 0.9899 | 0.9902 | 0.9922 | 0.9901 |
Jaccard index | 0.4402 | 0.5736 | 0.5609 | 0.6372 | 0.6085 | 0.6137 | 0.6084 | 0.6343 | 0.6114 | 0.6333 |
Appendix B Results of the regular U-Net model on the Type A shapes datasets
TRF size | 54 | 100 | 146 | 204 | 230 | 298 | 360 | 412 | 486 | 570 |
---|---|---|---|---|---|---|---|---|---|---|
A | ||||||||||
Training time (epochs) | 63 | 66 | 111 | 71 | 69 | 71 | 200 | 74 | 192 | 72 |
ERF rate before training | 0.0514 | 0.0131 | 0.0104 | 0.0033 | 0.0078 | 0.0043 | 0.0080 | 0.0005 | 0.0015 | 0.0044 |
ERF rate | 0.0028 | 0.0527 | 0.0009 | 0.0008 | 0.0008 | 0.0006 | 0.0002 | 0.0002 | 0.0002 | 0.0002 |
Dice score | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Object rate | 100.5928 | 25.1482 | 11.1770 | 6.0224 | 4.0237 | 2.6766 | 1.6424 | 1.4741 | 0.9636 | 0.6692 |
Accuracy | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Sensitivity | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Specificity | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Jaccard index | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
A contour | ||||||||||
Training time (epochs) | 14 | 17 | 87 | 18 | 200 | 47 | 200 | 39 | 49 | 117 |
ERF rate before training | 0.0353 | 0.0138 | 0.0088 | 0.0038 | 0.0059 | 0.0020 | 0.0040 | 0.0004 | 0.0010 | 0.0011 |
ERF rate | 0.4710 | 0.0502 | 0.0392 | 0.0048 | 0.0047 | 0.0029 | 0.0036 | 0.0005 | 0.0013 | 0.0017 |
Dice score | 0.8219 | 0.9791 | 0.9998 | 0.9996 | 0.9996 | 0.9997 | 0.9998 | 0.9996 | 0.9997 | 0.9999 |
Object rate | 98.8970 | 24.7242 | 10.9886 | 5.9208 | 3.9559 | 2.6315 | 1.6147 | 1.4492 | 0.9473 | 0.6579 |
Accuracy | 0.9469 | 0.9930 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 1.0000 |
Sensitivity | 0.9505 | 0.9933 | 0.9999 | 0.9997 | 0.9996 | 0.9998 | 0.9998 | 0.9997 | 0.9997 | 0.9999 |
Specificity | 0.9465 | 0.9929 | 1.0000 | 0.9999 | 0.9999 | 0.9999 | 1.0000 | 0.9999 | 0.9999 | 1.0000 |
Jaccard index | 0.6983 | 0.9591 | 0.9996 | 0.9992 | 0.9992 | 0.9995 | 0.9997 | 0.9992 | 0.9994 | 0.9997 |
A large | ||||||||||
Training time (epochs) | 50 | 64 | 57 | 70 | 97 | 68 | 105 | 67 | 99 | 69 |
ERF rate before training | 0.0326 | 0.0097 | 0.0115 | 0.0019 | 0.0058 | 0.0008 | 0.0047 | 0.0008 | 0.0020 | 0.0015 |
ERF rate | 0.0028 | 0.0007 | 0.0004 | 0.0003 | 0.0002 | 0.0002 | 0.0001 | 0.0001 | 0.0001 | 0.0000 |
Dice score | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Object rate | 137.8993 | 34.4748 | 15.3221 | 8.2559 | 5.5160 | 3.6693 | 2.2515 | 2.0207 | 1.3209 | 0.9173 |
Accuracy | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Sensitivity | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Specificity | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Jaccard index | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
A large contour | ||||||||||
Training time (epochs) | 25 | 31 | 56 | 13 | 24 | 199 | 19 | 12 | 25 | 33 |
ERF rate before training | 0.0290 | 0.0148 | 0.0081 | 0.0037 | 0.0055 | 0.0031 | 0.0052 | 0.0006 | 0.0013 | 0.0014 |
ERF rate | 0.4730 | 0.2601 | 0.8689 | 0.0028 | 0.0034 | 0.0015 | 0.0025 | 0.0003 | 0.0006 | 0.0007 |
Dice score | 0.6708 | 0.8646 | 0.9841 | 0.9995 | 0.9997 | 0.9993 | 0.9998 | 0.9995 | 0.9994 | 0.9994 |
Object rate | 138.7451 | 34.6863 | 15.4161 | 8.3065 | 5.5498 | 3.6918 | 2.2653 | 2.0331 | 1.3290 | 0.9229 |
Accuracy | 0.8197 | 0.9119 | 0.9892 | 0.9997 | 0.9998 | 0.9995 | 0.9998 | 0.9997 | 0.9996 | 0.9996 |
Sensitivity | 0.8949 | 0.9112 | 0.9912 | 0.9995 | 0.9997 | 0.9993 | 0.9998 | 0.9995 | 0.9995 | 0.9993 |
Specificity | 0.8005 | 0.9123 | 0.9881 | 0.9998 | 0.9998 | 0.9996 | 0.9999 | 0.9998 | 0.9996 | 0.9998 |
Jaccard index | 0.5056 | 0.7626 | 0.9688 | 0.9991 | 0.9994 | 0.9986 | 0.9995 | 0.9990 | 0.9987 | 0.9989 |
Appendix C Results of the regular U-Net model on the Type B shapes datasets
TRF size | 54 | 100 | 146 | 204 | 230 | 298 | 360 | 412 | 486 | 570 |
---|---|---|---|---|---|---|---|---|---|---|
B | ||||||||||
Training time (epochs) | 46 | 52 | 173 | 190 | 123 | 47 | 45 | 44 | 45 | 198 |
ERF rate before training | 0.0387 | 0.0113 | 0.0128 | 0.0022 | 0.0105 | 0.0022 | 0.0054 | 0.0007 | 0.0019 | 0.0019 |
ERF rate | 0.0918 | 0.0556 | 0.0016 | 0.0010 | 0.0005 | 0.0006 | 0.0002 | 0.0002 | 0.0001 | 0.0001 |
Dice score | 0.9345 | 0.9959 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9998 | 1.0000 | 0.9999 | 1.0000 |
Object rate | 57.3475 | 14.3369 | 6.3719 | 3.4333 | 2.2939 | 1.5259 | 0.9363 | 0.8404 | 0.5493 | 0.3815 |
Accuracy | 0.9843 | 0.9991 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Sensitivity | 0.8810 | 0.9921 | 1.0000 | 1.0000 | 1.0000 | 0.9999 | 0.9998 | 1.0000 | 0.9999 | 1.0000 |
Specificity | 0.9994 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Jaccard index | 0.8772 | 0.9920 | 1.0000 | 1.0000 | 1.0000 | 0.9999 | 0.9996 | 1.0000 | 0.9999 | 1.0000 |
B contour | ||||||||||
Training time (epochs) | 20 | 12 | 81 | 125 | 137 | 42 | 194 | 139 | 144 | 182 |
ERF rate before training | 0.0277 | 0.0063 | 0.0059 | 0.0018 | 0.0061 | 0.0019 | 0.0046 | 0.0005 | 0.0013 | 0.0012 |
ERF rate | 0.5588 | 0.3507 | 0.0383 | 0.0018 | 0.0042 | 0.0012 | 0.0033 | 0.0006 | 0.0022 | 0.0009 |
Dice score | 0.8067 | 0.9751 | 0.9997 | 0.9998 | 0.9998 | 0.9995 | 0.9998 | 0.9997 | 0.9998 | 0.9998 |
Object rate | 62.1263 | 15.5316 | 6.9029 | 3.7194 | 2.4851 | 1.6531 | 1.0143 | 0.9104 | 0.5951 | 0.4133 |
Accuracy | 0.9620 | 0.9945 | 0.9999 | 1.0000 | 1.0000 | 0.9999 | 1.0000 | 0.9999 | 1.0000 | 1.0000 |
Sensitivity | 0.9490 | 0.9961 | 0.9998 | 0.9998 | 0.9998 | 0.9994 | 0.9998 | 0.9997 | 0.9998 | 0.9998 |
Specificity | 0.9633 | 0.9943 | 1.0000 | 1.0000 | 1.0000 | 0.9999 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Jaccard index | 0.6771 | 0.9515 | 0.9995 | 0.9996 | 0.9996 | 0.9989 | 0.9997 | 0.9995 | 0.9996 | 0.9997 |
B large | ||||||||||
Training time (epochs) | 34 | 67 | 63 | 184 | 28 | 59 | 143 | 35 | 110 | 199 |
ERF rate before training | 0.0231 | 0.0077 | 0.0087 | 0.0023 | 0.0048 | 0.0030 | 0.0049 | 0.0005 | 0.0019 | 0.0011 |
ERF rate | 0.0021 | 0.0016 | 0.0006 | 0.0002 | 0.0002 | 0.0002 | 0.0001 | 0.0001 | 0.0001 | 0.0001 |
Dice score | 0.8703 | 0.9454 | 0.9933 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Object rate | 85.1332 | 21.2833 | 9.4592 | 5.0968 | 3.4053 | 2.2653 | 1.3900 | 1.2475 | 0.8155 | 0.5663 |
Accuracy | 0.9368 | 0.9756 | 0.9971 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Sensitivity | 0.7825 | 0.9161 | 0.9933 | 1.0000 | 0.9999 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Specificity | 0.9943 | 0.9937 | 0.9982 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Jaccard index | 0.7706 | 0.8974 | 0.9868 | 1.0000 | 0.9999 | 0.9999 | 1.0000 | 0.9999 | 1.0000 | 1.0000 |
B large contour | ||||||||||
Training time (epochs) | 27 | 55 | 59 | 135 | 32 | 180 | 198 | 30 | 200 | 35 |
ERF rate before training | 0.0119 | 0.0088 | 0.0064 | 0.0040 | 0.0049 | 0.0019 | 0.0040 | 0.0006 | 0.0016 | 0.0019 |
ERF rate | 0.6198 | 0.2014 | 0.6863 | 0.0011 | 0.0062 | 0.0013 | 0.0027 | 0.0004 | 0.0008 | 0.0008 |
Dice score | 0.6593 | 0.8793 | 0.9815 | 0.9998 | 0.9996 | 0.9998 | 0.9998 | 0.9994 | 0.9998 | 0.9993 |
Object rate | 89.1812 | 22.2953 | 9.9090 | 5.3392 | 3.5672 | 2.3730 | 1.4561 | 1.3068 | 0.8543 | 0.5932 |
Accuracy | 0.8838 | 0.9513 | 0.9919 | 0.9999 | 0.9998 | 0.9999 | 0.9999 | 0.9998 | 0.9999 | 0.9997 |
Sensitivity | 0.9156 | 0.9584 | 0.9922 | 0.9998 | 0.9996 | 0.9998 | 0.9998 | 0.9995 | 0.9998 | 0.9989 |
Specificity | 0.8795 | 0.9497 | 0.9918 | 0.9999 | 0.9999 | 0.9999 | 1.0000 | 0.9998 | 0.9999 | 0.9999 |
Jaccard index | 0.4939 | 0.7860 | 0.9637 | 0.9996 | 0.9991 | 0.9996 | 0.9996 | 0.9989 | 0.9996 | 0.9986 |
C.1 Disclosures
No conflicts of interest.
C.2 Code, Data, and Materials Availability
The complete source code utilized in this work can be accessed via our GitHub repository at https://github.com/vinloo/u-net-receptive-field-study. In addition to this, we have developed an open-source tool designed to calculate and suggest an appropriate TRF size based on a specified U-Net configuration and dataset. This tool is intended to aid researchers and practitioners in the field and is included in the repository.
References
- [1] G. Litjens, T. Kooi, B. E. Bejnordi, et al., “A survey on deep learning in medical image analysis,” Medical Image Analysis 42, 60–88 (2017).
- [2] M. H. Hesamian, W. Jia, X. He, et al., “Deep learning techniques for medical image segmentation: achievements and challenges,” Journal of digital imaging 32, 582–596 (2019).
- [3] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” (2015).
- [4] C. Williams, F. Falck, G. Deligiannidis, et al., “A unified framework for u-net design and analysis,” (2023).
- [5] O. Oktay, J. Schlemper, L. L. Folgoc, et al., “Attention u-net: Learning where to look for the pancreas,” (2018).
- [6] W. Luo, Y. Li, R. Urtasun, et al., “Understanding the effective receptive field in deep convolutional neural networks,” (2017).
- [7] A. Araujo, W. Norris, and J. Sim, “Computing receptive fields of convolutional neural networks,” Distill (2019). https://distill.pub/2019/computing-receptive-fields.
- [8] B. Behboodi, M. Fortin, C. J. Belasso, et al., “Receptive field size as a key design parameter for ultrasound image segmentation with u-net,” in 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 2117–2120 (2020).
- [9] K. Sytwu, C. Groschner, and M. C. Scott, “Understanding the Influence of Receptive Field and Network Complexity in Neural Network-Guided TEM Image Analysis,” Microscopy and Microanalysis 28, 1896–1904 (2022).
- [10] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” (2016).
- [11] M. Sarıgül, B. Ozyildirim, and M. Avci, “Differential convolutional neural network,” Neural Networks 116, 279–287 (2019).
- [12] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” (2018).
- [13] J. Wu, Y. Zhang, K. Wang, et al., “Skip connection u-net for white matter hyperintensities segmentation from mri,” IEEE Access 7, 155194–155202 (2019).
- [14] C.-C. J. Kuo, “Understanding convolutional neural networks with a mathematical model,” Journal of Visual Communication and Image Representation 41, 406–413 (2016).
- [15] A. Paszke, S. Gross, F. Massa, et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems 32 (2019).
- [16] Z. Zhang, “Improved adam optimizer for deep neural networks,” in 2018 IEEE/ACM 26th international symposium on quality of service (IWQoS), 1–2, IEEE (2018).
- [17] L. Prechelt, “Early stopping-but when?,” in Neural Networks: Tricks of the trade, 55–69, Springer (2002).
- [18] T. L. A. van den Heuvel, D. de Bruijn, C. L. de Korte, et al., “Automated measurement of fetal head circumference using 2D ultrasound images,” (2018).
- [19] Y. Lu, J. Bai, M. Zhou, et al., “Jnu-ifm,” (2022).
- [20] Y. Lu, M. Zhou, D. Zhi, et al., “The JNU-IFM dataset for segmenting pubic symphysis-fetal head,” Data in Brief 41, 107904 (2022).
- [21] A. J. Daniel, C. E. Buchanan, T. Allcock, et al., “T2-weighted kidney mri segmentation,” (2021).
- [22] A. J. Daniel, C. E. Buchanan, T. Allcock, et al., “Automated renal segmentation in healthy and chronic kidney disease subjects using a convolutional neural network,” Magnetic Resonance in Medicine 86, 1125–1136 (2021).
- [23] V. Danilov, “Chest x-ray dataset for lung segmentation,” (2022).
- [24] R. H. Kassamali and S. Jafarieh, “Passion and hard work produces high quality research in uk: response to focus on china: should clinicians engage in research? and lessons from other countries,” Quantitative Imaging in Medicine and Surgery 4(6) (2014).
- [25] T. Wunderling, B. Golla, P. Poudel, et al., “Comparison of thyroid segmentation techniques for 3d ultrasound,” in Proceedings of SPIE Medical Imaging, (Orlando, USA) (2017).
- [26] W. C. Anna Montoya, “Ultrasound nerve segmentation,” (2016).
- [27] A. W. Setiawan, “Image segmentation metrics in skin lesion: Accuracy, sensitivity, specificity, dice coefficient, jaccard index, and matthews correlation coefficient,” in 2020 International Conference on Computer Engineering, Network, and Intelligent Multimedia (CENIM), 97–102 (2020).
- [28] D. Müller, I. Soto-Rey, and F. Kramer, “Towards a guideline for evaluation metrics in medical image segmentation,” BMC Research Notes 15 (2022).
- [29] S. Weglarczyk, “Kernel density estimation and its application,” ITM Web Conf. 23, 00037 (2018).
- [30] B. W. Silverman, Density Estimation for Statistics and Data Analysis, 47–48. Chapman & Hall, London (1986).
- [31] J. K. Harpole, C. M. Woods, T. L. Rodebaugh, et al., “How bandwidth selection algorithms impact exploratory data analysis using kernel density estimation,” Psychological Methods 19, 428–443 (2014).
- [32] Y. Yu and Y. Zhang, “Multi-layer perceptron trainability explained via variability,” (2023).
- [33] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence 39(12), 2481–2495 (2017).
- [34] H. Zhao, J. Shi, X. Qi, et al., “Pyramid scene parsing network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2881–2890 (2017).
- [35] L.-C. Chen, G. Papandreou, I. Kokkinos, et al., “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4), 834–848 (2017).
List of Figures
- 1 Variable Attention U-Net in which the depth, the kernel size of the convolution layers, and the number of channels can be tuned to alter the size of the TRF. It can be converted to a regular U-Net by removing the attention gates and gating signals.
- 2 Example of Theoretical Receptive Field (TRF) and Effective Receptive Field (ERF) in an image. The yellow square denotes the TRF, the maximum input area influencing the output pixel located at the centre of the square. The gray pixels, representing the ERF, show the actual input area affecting a neuron’s activation, with intensity indicating the impact level.
- 3 Typical images and segmentation masks for the synthetic datasets (A and B) and medical datasets (Fetal head, Fetal head 2, Kidneys, Lungs, Thyroid, Nerve).
- 4 Examples of determining the threshold for the ERF rate with KDE for bimodally distributed ERF pixel values (top row) and positively skewed ERF pixel values (bottom row).
- 5 Performance on the shapes datasets (A and B) for the regular U-Net.
- 6 Various plots for the analyses of the medical datasets for U-Net: (a) DSC vs. TRF size, (b) ERF rate vs. TRF size, (c) training time (epochs) vs. TRF size, (d) Dice score vs. TRF; and for Attention U-Net: (e) DSC vs. TRF size, (f) ERF rate vs. TRF size, (g) training time (epochs) vs. TRF size, (h) Dice score vs. TRF.
- 7 Examples of the TRF (yellow square), ERF (pixels within the TRF) in the top row in each subfigure and the predicted segmentation for various TRF sizes in the bottom row, on the samples from the datasets in Figure 3.
- (a) Examples on the fetal head dataset.
- (b) Examples on the kidneys dataset.
- (c) Examples on the nerve dataset.
- (d) Examples on the thyroid dataset.
- 8 Examples of the TRF (yellow square), ERF (pixels within the TRF) in the top row and the predicted segmentation for various TRF sizes in the bottom row, on the sample from the lungs dataset from Figure 3.
- 9 Examples of the TRF (yellow square), ERF (pixels within the TRF) in the top row and the predicted segmentation for various TRF sizes in the bottom row, on the sample from the fetal head 2 dataset from Figure 3.
List of Tables
- 1 All U-Net configurations with different TRF sizes. The TRF size is influenced by the convolutional kernel size and the vertical depth of the network.
- 2 All the results for the different evaluation measures on the medical datasets (fetal head, fetal head 2, kidneys, lungs, nerve, and thyroid) for the U-Net.
- 3 Summary of the insights from the results. Values marked no* mean that the RoI can be identified visually by contrast, but that not all regions with this contrast are included in the mask.