H-SynEx: Using synthetic images and ultra-high resolution
ex vivo MRI for hypothalamus subregion segmentation

Livia Rodrigues, Martina Bocchetta, Oula Puonti, Douglas Greve, Ana Carolina Londe, Marcondes França, Simone Appenzeller, Juan Eugenio Iglesias, Leticia Rittner

Abstract

The hypothalamus is a small structure located in the center of the brain and is involved in significant functions such as sleep, temperature, and appetite control. Various neurological disorders are also associated with hypothalamic abnormalities. Automated image analysis of this structure from brain MRI is thus highly desirable to study the hypothalamus in vivo. However, most automated segmentation tools currently available focus exclusively on T1w images. In this study, we introduce H-SynEx, a machine learning method for automated segmentation of hypothalamic subregions that generalizes across different MRI sequences and resolutions without retraining. H-SynEx was trained with synthetic images built from label maps derived from ultra-high resolution ex vivo MRI scans, which enable finer-grained manual segmentation than 1 $mm$ isotropic in vivo images. We validated our method using the Dice Coefficient (DSC) and Average Hausdorff Distance (AVD) across in vivo images from six different datasets with six different MRI sequences (T1, T2, proton density, quantitative T1, fractional anisotropy, and FLAIR). Statistical analysis compared hypothalamic subregion volumes in controls, Alzheimer’s disease (AD), and behavioral variant frontotemporal dementia (bvFTD) subjects using the Area Under the Receiver Operating Characteristic curve (AUROC) and the Wilcoxon rank-sum test. Our results show that H-SynEx successfully leverages information from ultra-high resolution scans to segment in vivo images from different MRI sequences. Our automated segmentation was able to discriminate controls from Alzheimer’s disease patients on FLAIR images with 5 $mm$ spacing. H-SynEx is openly available at https://github.com/liviamarodrigues/hsynex.

keywords:

Hypothalamus segmentation, ex vivo MRI, domain randomization


[1] Universidade Estadual de Campinas, School of Electrical and Computer Engineering
[2] Massachusetts General Hospital, Harvard Medical School
[3] Dementia Research Centre, Department of Neurodegenerative Disease, UCL Queen Square Institute of Neurology, University College London, London, United Kingdom
[4] Centre for Cognitive and Clinical Neuroscience, Division of Psychology, Department of Life Sciences, College of Health, Medicine and Life Sciences, Brunel University London, United Kingdom
[5] Universidade Estadual de Campinas, School of Medical Sciences
[6] Centre for Medical Image Computing, University College London
[7] Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology

Highlights

The development of a fully automated segmentation method trained on synthetic images derived from ex vivo MRI label maps capable of identifying hypothalamic subregions across various MRI sequences and resolutions, including clinical acquisitions with large slice spacing;

The usage of ultra-high resolution ex vivo images to build the label maps yields a highly accurate model of the hypothalamus anatomy.

H-SynEx outperforms other state-of-the-art methods in two patient-control comparisons conducted in this study and is currently the only method capable of segmenting hypothalamic subregions on MRI sequences other than T1w and T2w.

1 Introduction

The hypothalamus is a small, cone-shaped, gray-matter structure located in the central part of the brain. It is composed of subnuclei containing the cell bodies of multiple neuron subtypes. Despite its small dimensions, the hypothalamus plays a significant role in controlling sleep, body temperature, appetite, and emotions, among other functions [1, 2]. In the literature, several studies establish a connection between the whole hypothalamus and neurodegenerative diseases such as Alzheimer’s disease [3], Huntington’s disease [4, 5], Behavioral Variant Frontotemporal Dementia (bvFTD) [6, 7], Amyotrophic Lateral Sclerosis (ALS) [8, 9], among others [10, 11, 12, 13]. Some studies suggest a differential involvement of the hypothalamic subregions across conditions [6], leading to the belief that studying these subregions individually is essential for a better understanding of these conditions.

MRI enables the study of the human brain in vivo, but many analyses (e.g., volumetry) require manual segmentations that are challenging and time-consuming. For the hypothalamus, manual segmentation is particularly prone to high inter- and intra-rater variability due to its small size and low contrast with neighboring tissue [14, 15, 16]. Even with the help of semi-automated methods, the segmentation of a single scan can take up to 40 minutes [17], making large-scale studies impractical at most research sites. To better understand the role of the hypothalamus, several studies use different MRI sequences [8, 10, 12, 3, 18]. However, these studies are limited to select sites and require specialists with neuroanatomical knowledge to perform manual annotation.

Numerous supervised methods have been proposed for automated hypothalamus segmentation on T1w [19, 14, 16, 15, 20] and T2w images [16]. However, none of these methods can segment images at anisotropic resolution (often the case in clinical MRI) or in sequences other than the ones they were trained on (T1w/T2w). They all require retraining to function across different sequences and resolutions, which in turn requires more labeled data. The use of semi-supervised models on medical images enhances the generalization of networks without necessarily increasing the quantity of annotated data [21, 22]. However, most of these models work on only one type of MRI sequence and usually need retraining to adapt to different sequences. Synthetic images allow the construction of training datasets with flawless ground truths [23, 24] and the development of methods capable of generalizing across different MRI sequences [25, 24].

So far, all automated hypothalamus segmentation methods have been developed using manual segmentations of in vivo images with resolutions ranging between 0.8 $mm$ and 1 $mm$ . Because the hypothalamus is a small structure, its delineation is significantly affected by partial volume effects, even in high-resolution images (such as 0.8 $mm$ ). Recently, the use of ultra-high resolution ex vivo MRI has proven beneficial in the segmentation of small structures such as the hippocampus, amygdala, and thalamus [26, 27, 28], as it permits a better visualization of their anatomical boundaries, leading to more accurate manual annotation.

In this article, we train a model using synthetic images derived from label maps built from ultra-high resolution ex vivo MRI. Using synthetic images provides robustness against changes in MRI contrast, while constructing the label maps from ex vivo images provides more accurate delineation of the hypothalamus at higher resolution, enhancing the automated segmentation quality.

H-SynEx, our automated method for hypothalamic subregion segmentation, demonstrates robustness across different MRI contrasts and resolutions. In our experiments, we evaluate its resilience across T1w, T2w, PD, qT1, FA, and FLAIR sequences, as well as in data with 5 $mm$ spacing.

2 Data

2.1 Training Data

The data used for training H-SynEx comprises synthetic images derived from 3D segmentations (label maps). These label maps are built using a dataset consisting of 10 post mortem MRI acquisitions of brain hemispheres [29] of 5 male and 5 female specimens who died of natural causes with no clinical diagnoses or neuropathology. The voxel resolution ranges from 120 to 150 $\mu$ m. The age at the time of death ranges from 54 to 79 years, with an average of 66.4 $\pm$ 8.46 years. The dataset is publicly available at the Distributed Archives for Neurophysiology Data Integration (DANDI Archive, https://dandiarchive.org/dandiset/000026/draft/files?location=) [30] (Figure 1).

Figure 1: Ex vivo MR images. Examples of three images used during the method development.

2.2 Test Data

The method evaluation relies on in vivo images from 6 different datasets (Table 1):

1. FreeSurfer Maintenance (FSM) [20]: Composed of 29 subjects, of which 7 were used for validation and 22 for testing. For each subject, we have T1-weighted (T1w), T2-weighted (T2w), proton density (PD), fractional anisotropy (FA), and quantitative T1 (qT1) acquisitions (Figure 2). In all cases the voxel resolution is 1 $mm$ isotropic. FSM contains manual labeling for the whole hypothalamus and its subregions (right and left anterior-superior, anterior-inferior, tuberal-superior, tuberal-inferior, and posterior). The manual segmentation was performed on in vivo images, and thus has limited accuracy. This dataset was approved by the Massachusetts General Hospital Internal Review Board for the protection of human subjects and all subjects gave written informed consent.

2. MiLI [15]: The MICLab-LNI Initiative comprises manual and automated segmentations of the entire hypothalamus conducted on T1w images with slice thickness between 0.9 $mm$ and 1.2 $mm$ . However, it lacks segmentations for hypothalamic subregions. It includes subjects from various open datasets such as MiLI, OASIS [31], and IXI [32]. We only used the manually segmented images, totaling 55 from MiLI (30 controls and 25 ataxia patients), 23 from OASIS, and 19 from IXI. As the latter dataset also encompasses T2w and proton density (PD) acquisitions, we incorporated these modalities in our experiments.

3. ADNI [33]: We used a total of 572 controls (280 male and 292 female with average age of $75.5\pm 6.4$ and $73.6\pm 6.01$ , respectively) and 271 Alzheimer’s disease (AD) patients (143 male and 98 female with average age of $75.34\pm 7.6$ and $73.8\pm 7.6$ , respectively) for both T1w (1 $mm$ isotropic) and FLAIR ( $0.85mm\times 0.85mm\times 5mm$ ) modalities. The ADNI dataset does not have manual segmentation of the hypothalamus.

4. NIFD [34]: From the Neuroimaging in Frontotemporal Dementia dataset, we used 111 controls (49 male and 62 female with average age of $61.8\pm 7.4$ and $63.4\pm 7.8$ , respectively) and 74 behavioral variant frontotemporal dementia (bvFTD) patients (51 male and 23 female with average age of $61.16\pm 5.8$ and $62.4\pm 7.7$ , respectively). The voxel resolution is 1 $mm$ isotropic. The NIFD dataset does not have manual segmentation of the hypothalamus.

Table 1: Datasets used for model validation and testing. WS: whole structure; SR: subregion.

Split	Dataset	Sequence type	Number of acquisitions	Subjects	Voxel resolution	Manual segmentation content	Segmentation protocol
Validation	FSM	T1w, T2w, PD, FA, qT1	35	7 controls	1 $mm$ isotropic	WS/SR	WS: Author [20]; SR: Bocchetta et al. [6]
Testing	FSM	T1w, T2w, PD, FA, qT1	110	22 controls	1 $mm$ isotropic	WS/SR	WS: Author [20]; SR: Bocchetta et al. [6]
Testing	MiLI	T1w	55	30 controls, 25 patients	Slice thickness between 0.9 $mm$ and 1.2 $mm$	WS	WS: Rodrigues et al. [15]
Testing	MiLI-OASIS	T1w	23	23 controls	Slice thickness between 0.9 $mm$ and 1.2 $mm$	WS	WS: Rodrigues et al. [15]
Testing	MiLI-IXI	T1w, T2w, PD	57	19 controls	Slice thickness between 0.9 $mm$ and 1.2 $mm$	WS	WS: Rodrigues et al. [15]
Testing	ADNI	T1w, FLAIR	1686	572 controls, 271 AD patients	1 $mm$ isotropic (T1w); $0.85\times 0.85\times 5mm$ (FLAIR)	No manual segmentation	-
Testing	NIFD	T1w	185	111 controls, 74 bvFTD patients	1 $mm$ isotropic	No manual segmentation	-

3 Methods

3.1 Preprocessing of ex vivo MRI

We train our neural networks with synthetic images generated from label maps. To create these label maps, we first performed the following operations on the ex vivo scans:

1. Preprocessing: We reoriented the images to conform to the positive RAS standard, flipped the right hemispheres, removed all non-brain voxels, and performed bias field correction. We also resampled the voxels to 0.3 $mm$ to balance high resolution against computational cost (Figure 3(a,b)).

2. Creation of label maps: Starting with the preprocessed MRI data, we manually delineated the hypothalamus and its subregions. We also needed a whole-brain segmentation to provide context around the hypothalamus. Since the other brain structures are not the primary focus of the segmentation, their labels do not need to be delineated manually: they may contain noise and need not correspond directly to brain structures. Therefore, we generated automated whole-brain segmentations using k-means, with the value of $k$ varying from 4 to 9 to introduce more variability into the dataset. Lastly, we merged the manual and automated segmentations of the brain hemisphere (Figure 3(c)) and mirrored the result to generate a complete whole-brain label map $L\left[D\times H\times W\right]$ (Figure 3(d)). The mirroring is conducted using an optimization technique that aims to minimize gaps and overlaps. More details on the creation of the label maps can be found in Rodrigues et al. [35].

3. Find MNI coordinates: Given that the hypothalamus is a small structure, the use of spatial priors is helpful during training. To provide them, we compute MNI coordinates for each input voxel by registering the label maps to the MNI space. Using the label maps $L\left[D\times H\times W\right]$ , we generate a Gaussian image $G\left[D\times H\times W\right]$ that simulates a T1w MRI. Subsequently, we register $G$ to the MNI space using NiftyReg [36] and obtain the MNI coordinates $C\left[3\times D\times H\times W\right]$ of the registered image. $C$ serves as an additional input channel during training to support the network.

4. Crop: We crop $L$ and $C$ around the hypothalamus, resulting in two standardized arrays, $L_{\text{crop}}\left[200\times 200\times 200\right]$ and $C_{\text{crop}}\left[3\times 200\times 200\times 200\right]$ , which correspond to a field of view of $60\times 60\times 60mm$ (200 voxels at 0.3 $mm$ per voxel).

5. One-hot array: We convert $L_{\text{crop}}$ into a one-hot array $L_{\text{one}}\left[V\times 200\times 200\times 200\right]$ , where $V$ is the number of labels present in $L_{\text{crop}}$ . $V$ varies according to the number of k-means labels $k$ employed in the whole-brain segmentation.
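The cropping and one-hot steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the helper names `crop_around` and `one_hot` are hypothetical, and the crop centre is assumed to be supplied by the caller (for example, the centroid of the hypothalamus labels).

```python
import numpy as np

def crop_around(volume, center, size=200):
    """Crop a cube of `size` voxels around `center` from the last three axes.

    Works for a label map [D, H, W] and for coordinates [3, D, H, W].
    Assumes every spatial dimension is at least `size` voxels.
    """
    slices = []
    for c, dim in zip(center, volume.shape[-3:]):
        # Shift the window if needed so the cube stays inside the volume.
        start = min(max(c - size // 2, 0), dim - size)
        slices.append(slice(start, start + size))
    return volume[(..., *slices)]

def one_hot(labels):
    """Return a [V, D, H, W] one-hot array and the V label values found."""
    values = np.unique(labels)
    encoded = (labels[None] == values[:, None, None, None]).astype(np.float32)
    return encoded, values

# Field of view of the crop: 200 voxels at 0.3 mm per voxel.
fov_mm = 200 * 0.3
```

Because the one-hot encoding is built from the labels actually present in the crop, the number of channels $V$ changes with the number of k-means clusters, as described in the text.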

3.2 Manual segmentation of the hypothalamus in training data: ex vivo images

H-SynEx is capable of segmenting 10 subregions of the hypothalamus, namely the right and left anterior-inferior, anterior-superior, tuberal-inferior, tuberal-superior, and posterior subregions. However, since each ex vivo image contains only one hemisphere of the brain, the manual segmentation was performed on only one side of the hypothalamus. The whole-structure and subregion segmentation protocols are based on Rodrigues et al. [15] and Bocchetta et al. [6], respectively. We also automatically delineate the fornix using morphological closing. The details of the manual segmentation protocol used to delineate the hypothalamus are described in [35].

3.3 Training

3.3.1 Synthetic Images Generation

The synthetic image generation (Figure 5(a)) is performed on the fly during training. At each iteration, one of the training label maps, $L$ , is randomly selected. $L$ goes through the preprocessing presented in Section 3.1, which results in the cropped label map ( $L_{\text{crop}}$ ) and MNI coordinates ( $C_{\text{crop}}$ ). Then, we apply aggressive geometric augmentation, which encompasses random crop, rotation, and elastic transformation, to both $L_{\text{crop}}$ and $C_{\text{crop}}$ . Next, we use the generative model proposed by SynthSeg [24], based on Gaussian Mixture Models conditioned on the transformed $L_{\text{crop}}$ , with randomized parameters for contrast and resolution, to create the final synthetic image. The transformed $L_{\text{crop}}$ is the target used to train the network. To assist training, we use a Euclidean distance map ( $E$ ) derived from the target, which has been shown to help locate boundary features during segmentation tasks [37]. $E$ is part of the loss function and is only employed during training; it is not needed during inference. The final input of the network is the concatenation of the synthetic image and the transformed $C_{\text{crop}}$ .
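The core idea of sampling a synthetic image from a label map can be illustrated with a short NumPy sketch. It is a deliberately simplified stand-in for the SynthSeg generative model, which additionally simulates bias fields, blurring, and resolution changes; the function name `synth_from_labels` and the sampling ranges for the means and standard deviations are assumptions chosen for illustration.

```python
import numpy as np

def synth_from_labels(labels, rng, mean_range=(0.0, 255.0), std_range=(1.0, 25.0)):
    """Draw one random Gaussian per label and sample every voxel from the
    Gaussian of its label. `labels` holds non-negative integer label ids."""
    n_labels = int(labels.max()) + 1
    means = rng.uniform(*mean_range, size=n_labels)  # random contrast per label
    stds = rng.uniform(*std_range, size=n_labels)    # random noise level per label
    image = rng.normal(means[labels], stds[labels])
    # Rescale intensities to [0, 1] so the network sees a fixed range.
    image = (image - image.min()) / (image.max() - image.min() + 1e-8)
    return image.astype(np.float32)
```

Since a new set of means and standard deviations is drawn at every call, the same label map yields a different contrast each time, which is what encourages the network to rely on shape rather than on a specific MRI sequence.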

3.3.2 Training architecture

Two distinct sub-models were trained separately, one for the entire hypothalamus ( $M_{\text{hyp}}$ ) and another specifically for its subregions ( $M_{\text{sub}}$ ) (Figure 5(b)). Both $M_{\text{hyp}}$ and $M_{\text{sub}}$ are 3D U-Nets [38, 39]. In both cases, we added a skip connection between the input channels holding the transformed $C_{\text{crop}}$ and the final convolutional block, to ensure that the original positional encoding is readily available at full resolution in the decoder as well. $M_{\text{hyp}}$ receives $I$ , the concatenation of the synthetic image and the transformed $C_{\text{crop}}$ , as input and outputs $O_{\text{hyp}}$ . The input of $M_{\text{sub}}$ is defined as $I_{\text{sub}}=I*O_{\text{hyp}}$ . While $O_{\text{hyp}}$ is a 2-channel array representing the hypothalamus and its background, $O_{\text{sub}}$ , the output of $M_{\text{sub}}$ , is a 13-channel array encompassing the 10 subregions, the right and left fornices, and the background.

3.3.3 Loss function and training details

The loss function applied to $M_{\text{hyp}}$ (1) is a combination of Dice Loss ( $DL$ ) and Mean Square Error ( $MSE$ ), while the loss function applied to $M_{\text{sub}}$ (2) combines $DL$ and Cross Entropy ( $CE$ ). Although our goal is to optimize the Dice coefficient, the Dice loss has flat gradients away from the optimum, which is the situation at initialization. This issue is mitigated by combining it with other loss functions such as the $MSE$ and the $CE$ loss, which provide better gradient information and improve training efficiency.

$$L_{\text{hyp}}=\alpha\cdot DL\left(T,T_{\text{pred}}\right)+\beta\cdot\text{MSE}\left(E,E_{\text{pred}}\right)\qquad(1)$$

$$L_{\text{sub}}=\alpha\cdot DL\left(T,T_{\text{pred}}\right)+\beta\cdot\text{CE}\left(T,T_{\text{pred}}\right)\qquad(2)$$
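Equations (1) and (2) can be written out as a small NumPy sketch. The exact Dice formulation (soft Dice averaged over channels, with a smoothing term `eps`) and the function names are assumptions made for illustration; only the weighted combinations and the values $\alpha=0.3$ and $\beta=0.7$ come from the text.

```python
import numpy as np

def dice_loss(target, pred, eps=1e-6):
    """Soft Dice loss averaged over channels; arrays are [C, D, H, W]."""
    axes = tuple(range(1, target.ndim))
    inter = (target * pred).sum(axis=axes)
    denom = target.sum(axis=axes) + pred.sum(axis=axes)
    return float(1.0 - ((2.0 * inter + eps) / (denom + eps)).mean())

def mse(e, e_pred):
    """Mean squared error between true and predicted distance maps."""
    return float(((e - e_pred) ** 2).mean())

def cross_entropy(target, pred, eps=1e-7):
    """Voxel-wise cross entropy; `pred` holds per-channel probabilities."""
    log_p = np.log(np.clip(pred, eps, 1.0))
    return float(-(target * log_p).sum(axis=0).mean())

def loss_hyp(t, t_pred, e, e_pred, alpha=0.3, beta=0.7):
    """Equation (1): Dice loss on the labels plus MSE on the distance map."""
    return alpha * dice_loss(t, t_pred) + beta * mse(e, e_pred)

def loss_sub(t, t_pred, alpha=0.3, beta=0.7):
    """Equation (2): Dice loss plus cross entropy on the labels."""
    return alpha * dice_loss(t, t_pred) + beta * cross_entropy(t, t_pred)
```

In an actual training loop these would be implemented with differentiable tensor operations; the NumPy version only makes the arithmetic of the two equations explicit.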

For both models, we used the Adam optimizer with a learning rate of $5\times 10^{-5}$ , a batch size of 32, and values of $\alpha$ and $\beta$ of $0.3$ and $0.7$ , respectively. As a stopping criterion, we simply trained $M_{\text{hyp}}$ for 40000 training steps and did not use any validation set. For $M_{\text{sub}}$ , however, we used 35 images from FSM (5 acquisitions from different MRI sequences from each of 7 distinct subjects) as a validation set (Table 1). We set an early stopping criterion based on the DSC of the validation set, with a minimum improvement of $\delta_{min}$ = 0.001. The network stopped after 28000 training steps. Both 3D U-Net modules are composed of an encoder of 5 levels with 24, 48, 96, 192, and 384 feature maps. Each convolutional block is composed of three layers: group normalization, convolution, and activation function (ReLU).
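An early stopping rule driven by the validation DSC and a minimum improvement $\delta_{min}$ can be sketched as below. The text only specifies $\delta_{min}=0.001$; the `patience` parameter (how many consecutive evaluations without improvement are tolerated) and the class name are assumptions introduced for this illustration.

```python
class EarlyStopping:
    """Stop when the validation DSC has not improved by more than
    `delta_min` for `patience` consecutive evaluations."""

    def __init__(self, delta_min=0.001, patience=5):  # patience is hypothetical
        self.delta_min = delta_min
        self.patience = patience
        self.best = float("-inf")
        self.bad_evals = 0

    def step(self, val_dsc):
        """Record one validation DSC; return True if training should stop."""
        if val_dsc > self.best + self.delta_min:
            self.best = val_dsc
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

A training loop would call `step` after each validation pass and break out of the loop once it returns `True`.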

3.4 Inference and Post processing

The inference process is summarized in Figure 6. The first step is preprocessing, in which we find the MNI coordinates ( $C_{\text{inf}}$ ) of the input image using a fast deep learning algorithm, EasyReg [25]. The input of $M_{\text{hyp}}$ , defined as $A_{\text{inf}}$ , is obtained by cropping and concatenating $C_{\text{inf}}$ and the original image ( $I_{\text{inf}}$ ). The input of $M_{\text{sub}}$ , however, is formed by the product of $A_{\text{inf}}$ , the output of $M_{\text{hyp}}$ , and the ventral diencephalon (VDC) label, which is derived from the whole-brain segmentation produced by EasyReg [25]. We include the VDC label because we found that it reduces false positives within the anterior subregion. The post-processing phase comprises two sequential steps: rescaling the final segmentation to match the voxel size of the original image ( $I_{\text{inf}}$ ), and excluding voxels that belong to the third ventricle by using the whole-brain segmentation obtained from EasyReg [25].
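The masking operations in this pipeline can be illustrated with NumPy. The sketch assumes that the output of $M_{\text{hyp}}$ is binarized with an argmax (with channel 1 being the hypothalamus) before the product is taken, and that the VDC and third-ventricle labels are available as boolean masks on the same grid; the function names are hypothetical, and the rescaling step is omitted.

```python
import numpy as np

def sub_input(a_inf, o_hyp, vdc_mask):
    """Build the input of M_sub.

    a_inf:    [4, D, H, W] image plus three MNI coordinate channels
    o_hyp:    [2, D, H, W] background/hypothalamus scores from M_hyp
    vdc_mask: [D, H, W] boolean ventral diencephalon mask
    """
    hyp_mask = o_hyp.argmax(axis=0) == 1   # assumes channel 1 is hypothalamus
    keep = hyp_mask & vdc_mask             # voxels kept for the subregion model
    return a_inf * keep[None]

def remove_third_ventricle(seg, ventricle_mask):
    """Set to background (0) every voxel labelled as third ventricle."""
    out = seg.copy()
    out[ventricle_mask] = 0
    return out
```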

3.5 Statistical Analysis

The statistical analysis was done using the AVD and DSC combined with Wilcoxon signed-rank tests to assess the statistical significance of differences in performance across methods. We also compared the ability of H-SynEx and competing methods to find statistical differences in the volume of hypothalamus subregions between controls and patients (AD and bvFTD). Since the datasets have few subjects and we cannot establish that the distributions are Gaussian, the statistical analyses were conducted with non-parametric tests. We used the Wilcoxon rank-sum test to assess the significance of the difference in medians between groups, and the area under the receiver operating characteristic curve (AUROC) as a non-parametric measure of effect size between groups. Finally, we used the DeLong test to compare AUROCs across methods operating on the same sample. All statistical tests were conducted at a significance level of 5% ( $p<0.05$ ).
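The AUROC used here as an effect size has a direct rank interpretation: it equals the probability that a randomly chosen subject from one group has a larger value than a randomly chosen subject from the other, which is the Mann-Whitney U statistic divided by the product of the group sizes. The pure-Python sketch below makes this explicit; the function name and the convention that controls are expected to have the larger volumes are assumptions for illustration.

```python
def auroc(controls, patients):
    """AUROC for separating two groups by a single value (e.g. a volume).

    Returns the fraction of (control, patient) pairs in which the control
    value is larger, counting ties as one half.
    """
    wins = 0.0
    for c in controls:
        for p in patients:
            if c > p:
                wins += 1.0
            elif c == p:
                wins += 0.5
    return wins / (len(controls) * len(patients))
```

A value of 0.5 means the two groups are indistinguishable by this measure, while values near 1 indicate that control volumes are almost always larger than patient volumes.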

4 Experiments and Results

H-SynEx was trained using synthetic images derived from ultra-high-resolution ex vivo label maps. While the synthetic approach increases the network's ability to generalize across different sequences, the use of ex vivo images improves the ability to delineate the hypothalamus due to their ultra-high resolution. Given that, our experiments were structured to assess the method’s applicability under diverse conditions (Table 2).

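Table 2: Summary of conducted experiments

Experiment	Objective	Testing dataset	Number of acquisitions per MRI sequence	MRI sequence
Inter-rater metrics	To establish a baseline for evaluation metrics	FSM	10	T1w
Direct comparison with manual segmentation on different sequences	To assess whether the method is capable of segmenting different MRI sequences	FSM	22	T1w, T2w, PD, FA, qT1
Direct comparison with manual segmentation on different sequences	To assess whether the method is capable of segmenting different MRI sequences	IXI	19	T1w, T2w, PD
Comparing against state-of-the-art methods	To compare H-SynEx against other available state-of-the-art methods using only T1w images	MiLI	55	T1w
Comparing against state-of-the-art methods	To compare H-SynEx against other available state-of-the-art methods using only T1w images	MiLI-OASIS	23	T1w
Comparing against state-of-the-art methods	To compare H-SynEx against other available state-of-the-art methods using only T1w images	MiLI-IXI	19	T1w
Comparing against state-of-the-art methods	To compare H-SynEx against other available state-of-the-art methods using only T1w images	FSM	22	T1w
Application in group studies	To assess the method's usability in group studies	ADNI	843	T1w
Application in group studies	To assess the method's usability in group studies	NIFD	185	T1w
Resilience to large slice spacing	To assess usability on diverse MRI resolutions	ADNI	843	T1w, FLAIR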

4.1 Consistency between labeling protocols

One of the primary challenges in analyzing the results of our experiments is that each dataset used in testing has a distinct manual segmentation protocol, none of which aligns with the one employed in training H-SynEx, due to the difference between in vivo and ex vivo images [35]. Therefore, our initial experiment aims to establish reference values for DSC and AVD by comparing inter-rater metrics across distinct segmentation protocols applied to T1w images. We compare manual segmentations of 10 FSM images delineated by two different raters: the first uses the FSM protocol (Table 1), while the second employs the protocol used during the construction of the label maps (Section 3.2).

The results (Table 3) show that, although the AVD is influenced by the different protocols, its values remain small, with the highest being 0.43 $mm$ . The DSC, however, is affected by both the variations in segmentation protocols and the small size of the hypothalamus subregions, resulting in values of 0.66 or lower.

Table 3: Inter-rater metrics (median) for 10 subjects from FSM

	DSC	AVD(mm)
Anterior	0.63	0.41
Tuberal	0.66	0.43
Posterior	0.66	0.38
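For reference, the two metrics reported in Table 3 can be computed from a pair of binary masks as sketched below. The DSC follows its standard definition. For the AVD, several variants exist in the literature; this sketch assumes the symmetric form that averages the two directed mean nearest-neighbor distances over all mask voxels, which may differ from the exact variant used in the experiments. The brute-force distance matrix is only practical for small structures.

```python
import numpy as np

def dsc(a, b):
    """Dice coefficient between two binary masks of the same shape."""
    a = a.astype(bool)
    b = b.astype(bool)
    return 2.0 * (a & b).sum() / (a.sum() + b.sum())

def avd(a, b, spacing=(1.0, 1.0, 1.0)):
    """Average Hausdorff distance in mm (assumed symmetric-mean variant)."""
    pa = np.argwhere(a) * np.asarray(spacing)   # voxel coordinates in mm
    pb = np.argwhere(b) * np.asarray(spacing)
    # Pairwise Euclidean distances between every voxel of a and of b.
    d = np.sqrt(((pa[:, None, :] - pb[None, :, :]) ** 2).sum(axis=-1))
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```

For two 2x2x2 cubes shifted by one voxel along one axis, half of the voxels overlap, so the DSC is 0.5 even though the AVD is only 0.5 mm. This illustrates why the DSC penalizes small structures more heavily than a distance-based metric does.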

4.2 Direct comparison with manual segmentation on different sequences

In this experiment, we aim to evaluate the ability of H-SynEx to properly segment the subregions of the hypothalamus in different MRI sequences. We employed five different sequences from FSM (T1w, T2w, proton density (PD), fractional anisotropy (FA), and quantitative T1 (qT1)) and three from IXI (T1w, T2w, and PD). As other methods from the literature operate exclusively on T1w images, a quantitative comparison of their metrics with those of H-SynEx was not possible in this experiment.

Analyzing the H-SynEx metrics on different sequences, we can see that the method performs best on T1w images (Figure 8). Yet, it is capable of segmenting the hypothalamus and its subregions in all the proposed MRI sequences, as can be seen in Figure 7.

4.3 Comparing against other state-of-the-art methods

To compare H-SynEx with other state-of-the-art models [14, 15, 20], we used T1w images from the MiLI and FSM datasets and analyzed the whole hypothalamus segmentation. It is worth noting that the MiLI segmentation protocol does not include the mammillary bodies. Therefore, for this dataset, we excluded the posterior subregion from the results before computing the metrics. Similarly, HypAST (Rodrigues et al. [15]) does not segment the posterior subregion, so in this case we also excluded it from FSM before computing the metrics.

Given that Billot et al. [14] works only on T1w images, we compared its results on the hypothalamus subregions with those of H-SynEx on 22 T1w images from FSM (Table 4). Finally, to compare H-SynEx with ScLimbic [20] and Rodrigues et al. [15], we used the whole structure (Table 5).

Table 4: AVD and DSC (median) for H-SynEx and Billot et al. on different subregions for the FSM dataset. † indicates statistical significance on a two-sided Wilcoxon rank-sum test using Bonferroni correction for $p<0.05$ .

Metric	Subregion	H-SynEx	Billot et al.
AVD (mm)	Anterior	0.54†	1.32
AVD (mm)	Tuberal	0.49†	0.66
AVD (mm)	Posterior	0.33†	0.52
DSC	Anterior	0.53†	0.33
DSC	Tuberal	0.59	0.58
DSC	Posterior	0.67†	0.55

Table 5: AVD and DSC (median) for H-SynEx, HypAST [15], ScLimbic [20] and Billot et al. [14] on different datasets (MiLI, IXI, OASIS, and FSM) for the entire hypothalamus (except MB). The symbols indicate statistical significance on a two-sided Wilcoxon rank-sum test using Bonferroni correction for $p<0.05$ : (*) Billot vs H-SynEx; (^†) ScLimbic vs H-SynEx; (^‡) Billot vs ScLimbic. Since ScLimbic was trained using the FSM dataset, we did not consider these results. Similarly, since HypAST was trained using data from MiLI and IXI and uses the same segmentation protocol as OASIS, we did not consider these results.

Metric	Method	MiLI	IXI	OASIS	FSM
AVD (mm)	Billot	0.46	0.61*^‡	0.47	0.40
AVD (mm)	HypAST	-	-	-	0.41
AVD (mm)	ScLimbic	0.39^†^‡	0.44	0.49	-
AVD (mm)	H-SynEx	0.45	0.45	0.5	0.43
DSC	Billot	0.66*	0.6	0.65*^‡	0.68
DSC	HypAST	-	-	-	0.69
DSC	ScLimbic	0.67^†^‡	0.64^†^‡	0.59	-
DSC	H-SynEx	0.63	0.62	0.58	0.65

4.4 Application to group studies

In this experiment, we apply H-SynEx to images acquired from both patient and control groups to simulate the real-world application of this method by physicians. We also assess the ability of the network to separate groups as a proxy for performance on datasets that have no ground truth segmentation.

Several studies in the literature point to hypothalamic atrophy in both AD and bvFTD patients [6, 40]. Therefore, to evaluate the group studies, we compared the hypothalamic subregion volumes of patient and control groups from ADNI (AD subjects) and NIFD (bvFTD subjects). We normalized the volumes by dividing them by the total intracranial volume (TIV), provided by SynthSeg [24]. This normalization is common practice in volumetric studies with brain MRI. For comparative purposes, we conducted the same analysis using Billot et al. and compared it with H-SynEx using the DeLong test [41].
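The TIV normalization described above amounts to counting the voxels of a subregion, converting the count to a physical volume, and dividing by the intracranial volume. A minimal NumPy sketch follows; the function name and argument layout are assumptions, and the TIV is taken as a given number in cubic millimeters.

```python
import numpy as np

def normalized_volume(seg, label, voxel_size_mm, tiv_mm3):
    """Volume of one label in `seg`, divided by total intracranial volume.

    seg:           integer label map [D, H, W]
    label:         integer id of the subregion of interest
    voxel_size_mm: voxel dimensions, e.g. (0.85, 0.85, 5.0) for the FLAIR scans
    tiv_mm3:       total intracranial volume in cubic millimeters
    """
    voxel_volume = float(np.prod(voxel_size_mm))
    return float((seg == label).sum()) * voxel_volume / tiv_mm3
```

With 5 $mm$ slice spacing, each voxel covers about 3.6 cubic millimeters, so a single misclassified voxel changes the estimated volume far more than it would at 1 $mm$ isotropic resolution.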

Regarding the applicability of the methods to group studies (Table 6), H-SynEx achieved statistical significance ( $p<0.05$ ) in the Wilcoxon rank-sum test in all hypothalamic subregions when comparing AD vs. controls, while Billot et al. was unable to detect differences in the tuberal-inferior region. Additionally, in some cases, we observed a higher AUROC for H-SynEx, along with $p<0.05$ for the DeLong test, indicating the ability of H-SynEx to better discern differences between the two groups in this dataset. Regarding NIFD, the results were similar for both models, except for the tuberal-inferior region.

4.5 Resilience to large slice spacing

In this experiment, we applied H-SynEx to FLAIR images from the ADNI dataset acquired with a slice spacing (and thickness) of $5mm$ in the axial plane. Here, we want to evaluate our method’s capability to identify hypothalamic atrophy at larger spacings, which are common in clinical MRI. Since no other method in the literature works with FLAIR images, we only evaluated H-SynEx segmentations, using the 5 $mm$ spacing FLAIR images of the same ADNI subjects used in Section 4.4. When analyzing the volumes, H-SynEx returns statistically significant results (Table 6) when comparing patient and control volumes normalized by TIV in all subregions except the posterior subregion.

Table 6: AUROC values for patients vs. controls for the H-SynEx and Billot methods in the ADNI and NIFD datasets. For the ADNI dataset, we also analyze our method when applied to FLAIR images with a spacing of 5 $mm$ . Stars indicate the level of statistical significance (two-sided Wilcoxon rank-sum test) between both cohorts (* $p<0.05$ , ** $p<0.01$ ). ^† indicates statistical significance on the DeLong test ( $p<0.05$ ) between the H-SynEx and Billot methods. ^‡ indicates statistical significance on the DeLong test ( $p<0.05$ ) between H-SynEx applied to T1w images and H-SynEx applied to FLAIR images.

Dataset	ADNI			NIFD
	H-SynEx Flair	H-SynEx T1w	Billot T1w	H-SynEx T1w	Billot T1w
Whole	0.66**^‡	0.74**	0.65**^†	0.79**	0.74**
a-sHyp	0.60**^‡	0.69**	0.72**	0.76**	0.75**
a-iHyp	0.60**	0.64**	0.55*^†	0.72**	0.62**
supTub	0.68**^‡	0.60**	0.67**^†	0.76**	0.76**
infTub	0.67**^‡	0.73**	0.52^†	0.74**	0.59*^†
postHyp	0.52^‡	0.72**	0.70**	0.7**	0.73**

5 Discussion and Conclusion

Due to the small size of the hypothalamus and its low contrast compared to neighboring tissues, its manual segmentation is challenging, and variable among and within raters. These characteristics extend across various MRI sequences. To address this issue, we introduced H-SynEx, a novel automated segmentation method for the hypothalamus and its subregions. To the best of our knowledge, H-SynEx is the first method to combine ultra-high-resolution ex vivo MRI and synthetic images. This integration has allowed us to develop a method capable of effectively segmenting small structures, such as hypothalamus subregions, across various MRI sequences and resolutions, including FLAIR images with a spacing of 5 $mm$ .

Typically, when evaluating how well a developed segmentation method generalizes, we compare it to others found in the existing literature. To do this, it is common to use a dataset that none of the methods have seen during training. However, when these methods use training sets with different segmentation protocols, this difference can introduce bias, favoring the method trained under the same protocol as the test images. Because we used ex vivo images to construct the training set, the segmentation protocol used in training H-SynEx differs from that of any in vivo image set. Consequently, the main challenge in analyzing the results lies in the difference between the training and test protocols. To address this, we compared the manual segmentations of two raters who employed distinct protocols on 10 T1w images from the FSM dataset and found inter-rater DSC values lower than or equal to 0.66 and AVD values higher than or equal to 0.38. We use these values as a baseline for analyzing the metrics in the subsequent experiments.

In the second experiment (Section 4.2), we analyzed the usability of H-SynEx across different MRI sequences. T1w images yielded the best results. However, despite the lower DSC and higher AVD values for the other sequences, it is important to emphasize that the manual segmentations of the hypothalamus subregions in both FSM and IXI were performed on T1w images and were therefore not influenced by the different contrasts of the other sequences. Additionally, while the FSM images for each subject are already registered, this is not the case for the IXI dataset. Hence, the manual segmentations had to be registered before being used on the other sequences acquired from the same subject. Both the registration and the use of a different sequence for manual segmentation may compromise the final results. Finally, we noticed a high variability in both metrics, which may be explained by the small size of the hypothalamus. This hypothesis is supported by comparing the volumes delineated by H-SynEx and by manual segmentation in the FSM dataset (Figure 9). We can see that both the posterior and anterior subregions, which show greater variability in the DSC and AVD, are smaller than the tuberal subregion. Furthermore, the variability in volumes across sequences and subregions appears to be less pronounced than the variability in the metrics. For instance, for the anterior subregions we can see a large variability in the DSC, which is less pronounced in both the AVD and the volumetric analysis. This suggests that the small size of the anterior subregion may be affecting the final DSC values. The same analysis is valid for the posterior region.

When comparing H-SynEx with other state-of-the-art methods, we see that H-SynEx outperforms Billot et al. in almost every metric for subregion segmentation. Here, it is important to highlight that although the DSC values seem low at first glance, they are not far from the values observed in the inter-rater analysis. The AVD values of H-SynEx are even closer to the inter-rater AVD, particularly in the posterior subregion, where even lower AVD values are observed. Additionally, the AVD values of H-SynEx are substantially lower than those reported by Billot et al. Considering AVD and DSC for the whole structure (Table 5), H-SynEx outperforms Billot et al. and returns results similar to HypAST [15] and ScLimbic [20] on the former, despite not achieving the best performance on the latter. However, when dealing with small structures with complex boundaries, distance metrics such as AVD are more suitable for comparing different methods [42]. Also, it is important to emphasize that all other methods were trained exclusively on in vivo T1w images and thus did not have to deal with a domain gap. Despite not achieving the highest quantitative results on T1w images, H-SynEx offers a distinct advantage. Built upon well-established domain randomization methods, it demonstrates superior generalization ability across MRI sequences. This robustness stems from its exposure to widely varying synthetic images during training, which makes it more adaptable to different imaging conditions.

When comparing hypothalamic volumes of patient and control groups on T1w images, we confirmed that our method detects the expected differences in all subregions in the ADNI and NIFD datasets, with AUROCs of 0.74 and 0.79, respectively, and $p$-value $<0.05$ for the Wilcoxon rank-sum test in both cases. Notably, the AUROC values for NIFD are higher than those for ADNI (Table 6). This behavior is expected, since bvFTD patients tend to exhibit more pronounced hypothalamic atrophy than AD patients (10-12% volume loss in AD and 15-20% in bvFTD) [43]. Additionally, we determined that the H-SynEx results differ statistically from those of Billot et al. for the entire hypothalamus and for most subregions in the ADNI dataset, with $p$-value $<0.05$ for the DeLong test.
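
The group comparison can be sketched as follows: a minimal, standard-library example that computes the AUROC of a volume measurement and a two-sided Wilcoxon rank-sum p-value. The p-value uses the normal approximation without tie correction, which is an assumption of this sketch and not necessarily the exact procedure used in the study. The volumes are made-up illustrative numbers, not data from ADNI or NIFD.

```python
import math


def auroc(controls, patients):
    """Probability that a random control has a larger volume than a
    random patient; ties count as 0.5."""
    wins = 0.0
    for c in controls:
        for p in patients:
            if c > p:
                wins += 1.0
            elif c == p:
                wins += 0.5
    return wins / (len(controls) * len(patients))


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation,
    no tie correction)."""
    n1, n2 = len(x), len(y)
    u = auroc(x, y) * n1 * n2  # Mann-Whitney U statistic of x
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical normalized volumes, for illustration only.
control_volumes = [1.00, 0.95, 1.05, 0.98, 1.10, 0.92]
patient_volumes = [0.85, 0.90, 0.80, 0.97, 0.88, 0.83]

print("AUROC:", auroc(control_volumes, patient_volumes))
print("p-value:", rank_sum_p_value(control_volumes, patient_volumes))
```

The AUROC here is the Mann-Whitney U statistic divided by the number of control-patient pairs, which is why both quantities can share one pairwise comparison. In practice, a library routine such as `scipy.stats.ranksums` would normally replace the hand-written p-value.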

Finally, we analyzed the same ADNI subjects used in Experiment 4, but using FLAIR images with a spacing of 5 $mm$. Similarly to the T1w analysis, the method was able to differentiate between patients and controls in almost all subregions, the exception being the posterior subregion. This may be explained by the 5 $mm$ spacing of the FLAIR images, which causes many images to lack the mammillary bodies or to show them in just one slice. For this reason, the low AUROC values in this subregion are expected. We also plotted the correlation between T1w and FLAIR normalized volumes (Figure 10) to investigate whether H-SynEx is consistent between them. For controls and AD subjects, respectively, the anterior subregion displays a moderate correlation (r=0.40 and r=0.50), and the tuberal subregions have strong correlations (r=0.79 and r=0.80). As expected, the posterior correlation is weak in both cases (r=0.11 and r=0.22). These results support the hypothesis that the method can be used at challenging resolutions and still detect differences among groups.
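
The consistency check between sequences amounts to correlating paired volumes of the same subjects. The sketch below assumes that r denotes the Pearson correlation coefficient, which the text does not state explicitly. The volume lists are hypothetical placeholders.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two paired samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two paired samples with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical normalized volumes of one subregion, same subjects in both lists.
t1w_volumes = [0.50, 0.55, 0.60, 0.65, 0.70]
flair_volumes = [0.52, 0.53, 0.63, 0.62, 0.72]

print("r:", pearson_r(t1w_volumes, flair_volumes))
```

Computing r separately for controls and for AD subjects, as in Figure 10, would simply mean calling the function once per group.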

Although H-SynEx leverages randomized synthetic images to mitigate training bias, a limitation remains: the model’s accuracy on unseen data can still be affected by the image contrast itself. For instance, in Experiment 4.2, both IXI and FSM have only one label per subject, drawn on T1w images. Therefore, the manual segmentations used to generate the quantitative results were not influenced by the different contrasts, which may affect the final results. Also, we showed that the smallest subregions (anterior and posterior) had the largest variability, especially in DSC, an overlap measure known to be sensitive to small structures [42].

To the best of our knowledge, we have presented the first automated method for hypothalamic subregion segmentation capable of working across different in vivo MRI sequences and resolutions without retraining. By producing reliable and consistent segmentations, H-SynEx facilitates the analysis of the hypothalamus in various pre-existing datasets, whether in research or clinical settings. Our tool is publicly available and has the potential to increase our understanding of the roles played by the hypothalamus and its subregions in neurodegenerative diseases and other related conditions.

6 Acknowledgements

L. Rodrigues acknowledges the Coordination for the Improvement of Higher Education Personnel (88887.716540/2022-00). M. Bocchetta is supported by a Fellowship award from the Alzheimer’s Society, UK (AS-JF-19a-004-517). J. E. Iglesias acknowledges NIH 1RF1MH123195, 1R01AG070988, and a grant from the Jack Satter foundation. L. Rittner acknowledges CNPq 313598/2020-7 and FAPESP 2013/07559-3. S. Appenzeller acknowledges CAPES Print, CAPES 001, and BRAINN.

References

[1] C. Neudorfer, J. Germann, G. J. Elias, R. Gramer, A. Boutet, A. M. Lozano, A high-resolution in vivo magnetic resonance imaging atlas of the human hypothalamic region, Scientific Data 7 (1) (2020) 305.
[2] C. B. Saper, B. B. Lowell, The hypothalamus, Current Biology 24 (23) (2014) R1111–R1116.
[3] R. Piyush, S. Ramakrishnan, Analysis of sub-anatomic volume changes in Alzheimer brain using diffusion tensor imaging, in: 2014 40th Annual Northeast Bioengineering Conference (NEBEC), IEEE, 2014, pp. 1–2.
[4] S. Gabery, N. Georgiou-Karistianis, et al., Volumetric analysis of the hypothalamus in huntington disease using 3T MRI: The image-hd study, PloS one 10 (2) (2015) e0117593.
[5] D. M. Bartlett, A. Reyes, et al., Investigating the relationships between hypothalamic volume and measures of circadian rhythm and habitual sleep in premanifest huntington’s disease, Neurobiology of sleep and circadian rhythms 6 (2019) 1–8.
[6] M. Bocchetta, E. Gordon, et al., Detailed volumetric analysis of the hypothalamus in behavioral variant frontotemporal dementia, Journal of Neurology 262 (2015) 2635–2642.
[7] O. Piguet, Å. Petersén, B. Yin Ka Lam, S. Gabery, K. Murphy, J. R. Hodges, G. M. Halliday, Eating and hypothalamus changes in behavioral-variant frontotemporal dementia, Annals of Neurology 69 (2) (2011) 312–319.
[8] M. Gorges, P. Vercruysse, et al., Hypothalamic atrophy is related to body mass index and age at onset in amyotrophic lateral sclerosis, Journal of Neurology, Neurosurgery & Psychiatry 88 (12) (2017) 1033–1041.
[9] R. M. Ahmed, F. Steyn, L. Dupuis, Hypothalamus and weight loss in amyotrophic lateral sclerosis, Handbook of Clinical Neurology 180 (2021) 327–338.
[10] J. Seong, J. Y. Kang, J. S. Sun, K. W. Kim, Hypothalamic inflammation and obesity: a mechanistic review, Archives of pharmacal research 42 (2019) 383–392.
[11] S. Modi, D. Thaploo, et al., Individual differences in trait anxiety are associated with gray matter alterations in hypothalamus: Preliminary neuroanatomical evidence, Psychiatry Research: Neuroimaging 283 (2019) 45–54.
[12] F. H. Wolfe, G. Auzias, et al., Focal atrophy of the hypothalamus associated with third ventricle enlargement in autism spectrum disorder, Neuroreport 26 (17) (2015) 1017–1022.
[13] M. Gutierrez, M. Garcia, J. Rodriguez, S. Rivero, S. Jacobelli, Hypothalamic-pituitary-adrenal axis function and prolactin secretion in systemic lupus erythematosus, Lupus 7 (6) (1998) 404–408.
[14] B. Billot, M. Bocchetta, et al., Automated segmentation of the hypothalamus and associated subunits in brain MRI, NeuroImage 223 (2020) 117287.
[15] L. Rodrigues, T. J. R. Rezende, G. Wertheimer, Y. Santos, M. França, L. Rittner, A benchmark for hypothalamus segmentation on t1-weighted mr images, NeuroImage 264 (2022) 119741.
[16] S. Estrada, D. Kügler, E. Bahrami, P. Xu, D. Mousa, M. Breteler, N. A. Aziz, M. Reuter, Fastsurfer-hypvinn: Automated sub-segmentation of the hypothalamus and adjacent structures on high-resolutional brain mri, arXiv preprint arXiv:2308.12736 (2023).
[17] J. Wolff, S. Schindler, et al., A semi-automated algorithm for hypothalamus volumetry in 3 Tesla magnetic resonance images, Psychiatry Research: Neuroimaging 277 (2018) 45–51.
[18] E. A. Schur, S. J. Melhorn, S.-K. Oh, J. M. Lacy, K. E. Berkseth, S. J. Guyenet, J. A. Sonnen, V. Tyagi, M. Rosalynn, B. De Leon, et al., Radiologic evidence that hypothalamic gliosis is associated with obesity and insulin resistance in humans, Obesity 23 (11) (2015) 2142–2148.
[19] L. Rodrigues, T. Rezende, et al., Hypothalamus fully automatic segmentation from MR images using a U-Net based architecture, in: 15th SIPAIM, Vol. 11330, International Society for Optics and Photonics, 2020, p. 113300J.
[20] D. N. Greve, B. Billot, D. Cordero, A. Hoopes, M. Hoffmann, A. V. Dalca, B. Fischl, J. E. Iglesias, J. C. Augustinack, A deep learning toolbox for automatic segmentation of subcortical limbic structures from mri images, Neuroimage 244 (2021) 118610.
[21] A. R. Fayjie, R. Dutta, P. Kashyap, U. R. Kumar, P. Vandewalle, Semi-supervised adversarial few-shot learning for medical image segmentation (2022).
[22] G. Bortsova, F. Dubost, L. Hogeweg, I. Katramados, M. De Bruijne, Semi-supervised medical image segmentation via learning consistency under transformations, in: Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part VI 22, Springer, 2019, pp. 810–818.
[23] V. Thambawita, P. Salehi, S. A. Sheshkal, S. A. Hicks, H. L. Hammer, S. Parasa, T. d. Lange, P. Halvorsen, M. A. Riegler, Singan-seg: Synthetic training data generation for medical image segmentation, PloS one 17 (5) (2022) e0267976.
[24] B. Billot, D. N. Greve, O. Puonti, A. Thielscher, K. Van Leemput, B. Fischl, A. V. Dalca, J. E. Iglesias, et al., Synthseg: Segmentation of brain mri scans of any contrast and resolution without retraining, Medical image analysis 86 (2023) 102789.
[25] J. E. Iglesias, Easyreg: A ready-to-use deep learning tool for symmetric affine and nonlinear brain mri registration (2023).
[26] J. E. Iglesias, J. C. Augustinack, K. Nguyen, C. M. Player, A. Player, M. Wright, N. Roy, M. P. Frosch, A. C. McKee, L. L. Wald, et al., A computational atlas of the hippocampal formation using ex vivo, ultra-high resolution mri: application to adaptive segmentation of in vivo mri, Neuroimage 115 (2015) 117–137.
[27] Z. M. Saygin, D. Kliemann, J. E. Iglesias, A. J. van der Kouwe, E. Boyd, M. Reuter, A. Stevens, K. Van Leemput, A. McKee, M. P. Frosch, et al., High-resolution magnetic resonance imaging reveals nuclei of the human amygdala: manual segmentation to automatic atlas, Neuroimage 155 (2017) 370–382.
[28] J. E. Iglesias, R. Insausti, G. Lerma-Usabiaga, M. Bocchetta, K. Van Leemput, D. N. Greve, A. Van der Kouwe, B. Fischl, C. Caballero-Gaudes, P. M. Paz-Alonso, et al., A probabilistic atlas of the human thalamic nuclei combining ex vivo mri and histology, Neuroimage 183 (2018) 314–326.
[29] I. Costantini, L. Morgan, J. Yang, Y. Balbastre, D. Varadarajan, L. Pesce, M. Scardigli, G. Mazzamuto, V. Gavryusev, F. M. Castelli, et al., A cellular resolution atlas of broca’s area, Science Advances 9 (41) (2023) eadg3844.
[30] Distributed archives for neurophysiology data integration, doi:10.5281/zenodo.7041535.
[31] P. J. LaMontagne, T. L. Benzinger, J. C. Morris, S. Keefe, R. Hornbeck, C. Xiong, E. Grant, J. Hassenstab, K. Moulder, A. Vlassenko, et al., OASIS-3: longitudinal neuroimaging, clinical, and cognitive dataset for normal aging and Alzheimer disease, MedRxiv (2019).
[32] IXI Dataset, https://brain-development.org/ixi-dataset/, accessed: 2023-11-29.
[33] S. G. Mueller, M. W. Weiner, L. J. Thal, R. C. Petersen, C. Jack, W. Jagust, J. Q. Trojanowski, A. W. Toga, L. Beckett, The alzheimer’s disease neuroimaging initiative, Neuroimaging Clinics 15 (4) (2005) 869–877.
[34] NIFD Dataset, https://ida.loni.usc.edu/collaboration/access/appLicense.jsp, accessed: 2023-11-29.
[35] L. Rodrigues, M. Bocchetta, O. Puonti, D. Greve, A. C. Londe, M. França, S. Appenzeller, L. Rittner, J. E. Iglesias, High-resolution segmentations of the hypothalamus and its subregions for training of segmentation models (2024). arXiv:2406.19492.
[36] M. Modat, J. McClelland, S. Ourselin, Lung registration using the niftyreg package, Medical image analysis for the clinic-a grand Challenge 2010 (2010) 33–42.
[37] X. Liu, L. Yang, J. Chen, S. Yu, K. Li, Region-to-boundary deep learning model with multi-scale feature fusion for medical image segmentation, Biomedical Signal Processing and Control 71 (2022) 103165.
[38] A. Wolny, L. Cerrone, A. Vijayan, R. Tofanelli, A. V. Barro, M. Louveaux, C. Wenzl, S. Strauss, D. Wilson-Sánchez, R. Lymbouridou, S. S. Steigleder, C. Pape, A. Bailoni, S. Duran-Nebreda, G. W. Bassel, J. U. Lohmann, M. Tsiantis, F. A. Hamprecht, K. Schneitz, A. Maizel, A. Kreshuk, Accurate and versatile 3d segmentation of plant tissues at cellular resolution, eLife 9 (2020) e57613. doi:10.7554/eLife.57613.
[39] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, O. Ronneberger, 3d u-net: learning dense volumetric segmentation from sparse annotation, in: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, Springer, 2016, pp. 424–432.
[40] A. Tao, Z. Myslinski, Y. Pan, C. Iadecola, J. Dyke, G. Chiang, M. Ishii, Hypothalamic atrophy in alzheimer’s disease (1819) (2021).
[41] E. R. DeLong, D. M. DeLong, D. L. Clarke-Pearson, Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach, Biometrics (1988) 837–845.
[42] A. A. Taha, A. Hanbury, Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool, BMC medical imaging 15 (1) (2015) 1–28.
[43] P. Vercruysse, D. Vieau, et al., Hypothalamic alterations in neurodegenerative diseases and their relation to abnormal energy metabolism, Front.Mol. Neurosci. 11 (2018) 2.
