License: CC BY 4.0
arXiv:2312.03133v1 [eess.IV] 05 Dec 2023

Predicting Bone Degradation Using Vision Transformer and Synthetic Cellular Microstructures Dataset

Mohammad Saber Hashemi1, Azadeh Sheidaei*,1
Abstract

Bone degradation, especially for astronauts in microgravity conditions, is crucial for space exploration missions since the lower applied external forces accelerate the diminution in bone stiffness and strength substantially. Although existing computational models help us understand this phenomenon and possibly restrict its effect in the future, they are time-consuming to simulate the changes in the bones, not just the bone microstructures, of each individual in detail. In this study, a robust yet fast computational method to predict and visualize bone degradation has been developed. Our deep-learning method, TransVNet, can take in different 3D voxelized images and predict their evolution throughout months utilizing a hybrid 3D-CNN-VisionTransformer autoencoder architecture. Because of limited available experimental data and challenges of obtaining new samples, a digital twin dataset of diverse and initial bone-like microstructures was generated to train our TransVNet on the evolution of the 3D images through a previously developed degradation model for microgravity.
Keywords: vision transformer, hybrid encoder, 3D image sequencing, cellular microstructures, trabecular bone degradation

Figure 1: Overview of our proposed deep learning framework, TransVNet.

Introduction

Deep learning is a subfield of machine learning that uses artificial neural networks to model and solve complex problems. It is a type of machine learning that involves training artificial neural networks on large datasets to learn complex patterns and relationships in the data. Artificial neural networks have been devised to mimic human or natural intelligence with biological computing networks by encountering different input situations or samples and learning the correct actions, i.e., different neurons’ activation, via optimizing an objective or loss function. Deep learners have outperformed not only traditional machine learning models but also humans in certain tasks, such as image recognition and game-playing, through customization and massive training for specific tasks (Janiesch, Zschech, and Heinrich 2021), thanks to the development of powerful hardware and efficient implementation of backpropagation algorithms (Rumelhart, Hinton, and Williams 1986; Wright et al. 2022). However, there is still a long road to reach the ideal general intelligence with minimum data and computing necessities similar to humans. Deep learning has been used in many static or sequential applications, such as image recognition, speech recognition, natural language processing, and many other domains, such as drug discovery and genomics. Recently, Large Language Models (LLMs), such as GPT-3 (Generative Pre-trained Transformer 3) (Brown et al. 2020), BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2018), and T5 (Text-to-Text Transfer Transformer) (Raffel et al. 2020), trained on massive amounts of text data available on the Internet, have demonstrated their outstanding capability in generating coherent and contextually relevant human-like texts that can be used for various applications like language translation, chatbots, and programming. 
Deep learners’ capacity to learn complicated concepts at various levels of complexity is provided by layering several linear and non-linear elements for processing. They typically combine low-level features to obtain abstract high-level features, which can significantly alleviate the local minimum problem (LeCun, Bengio, and Hinton 2015; Liu and Wang 2021).

Machine learning can be categorized in terms of the provided data features and learning methods as follows: supervised, semi-supervised, self-supervised, and unsupervised. Machine learning models can be either discriminative, such as logistic regression, support vector machines, decision trees, and random forests, or generative, such as Naive Bayes, Gaussian mixture models, Boltzmann machines, and deep generative models. Discriminative models learn to distinguish between different classes of data and are therefore typically used for supervised learning tasks. In contrast, generative models learn to generate new data samples similar to the data they were trained on and are therefore typically used for unsupervised learning tasks (Goodfellow, Bengio, and Courville 2016).

In this study, the specific application and the available data require a discriminative model to predict the next evolution of a given 3D segmented image, i.e., the microstructure, in the same given format, i.e., 3D segmented image. Since computational simulations were used for data generation, the dataset is automatically labeled, i.e., each given 3D input image is associated with a sequence of 3D images, and supervised learning is suitable. In the literature, different deep learning methods have been proposed for image sequence prediction, such as 3DCNN (Ji et al. 2012) and ConvLSTM (Azad et al. 2019). However, our proposed method is tailored for 3D images and is simpler because each image only depends on the previous one; therefore, long-term memory issues do not affect our studied problem.

The quality of the bone texture depends on various factors, among which gravity is of great importance since it constantly applies a compressive force on body organs, especially bones. Under normal conditions on Earth, the compressive force helps in maintaining a healthy bone structure and preventing the weakening of bone. In the microgravity of outer space, bone cells react to the new loading condition quickly by degrading bone microstructures at a rate faster than normal. Whereas bone loss under normal conditions is mostly due to aging and diseases, in the absence of gravitational forces the performance of bone cells changes significantly, and a high absorption rate causes a huge loss of bone minerals in a short time. Generally, the degradation rate is fast at first but then slows down. According to statistical studies, astronauts lose about 2 percent of their bone mineral density per month in their lower limb bones and vertebrae as a consequence of being in microgravity and adapting to the new condition (Keyak et al. 2009; Vico et al. 2000; LeBlanc et al. 2000). The bone loss graph in the absence of gravity looks like a curve with decreasing slope over time. According to Qin (2010), there are some similarities between microgravity bone degradation and bone loss due to aging in normal conditions. The research suggests that in both cases, the bone loss rate is rapid at the beginning of the degradation process and then slows down until it reaches a steady state. The model presented by Qin predicts that a healthy person loses about 35 percent of their bone mineral density during a 3-year exposure to microgravity. Even though current spaceflights usually do not take longer than six months, prediction of changes in bone quality over a long time could be useful in understanding the mechanism of bone degradation. 
Moreover, future space flights may take longer, and more knowledge of the mechanism of bone resorption and formation will be beneficial to prevent bone loss in such situations.

Our virtually generated dataset consists of in-silico data of bone microstructure generation and degradation throughout the study time (at most 36 months). The data format is 3D matrices, or 3D voxelized images, of the artificially generated and degraded bone microstructures under microgravity, which is especially important for space exploration and experiment missions. Each data point is stored as a 4D matrix inside a ".mat" file produced by the bone degradation code written in MATLAB (Bagherian et al. 2020). The fourth dimension is the month (e.g., index 0 denotes the initial 3D microstructure, index 1 denotes the changed/degraded microstructure after 1 month, …). The 3D images are binary in the current dataset as there are two constituents inside the bones: soft bone marrow and hard bone mineral. Therefore, the features are segmented images, and the labels are the categorical values or classes of the voxels in 3D images. The input and output both have the same format, as we are interested in finding the degraded microstructure in detail, given the input 3D microstructure. However, the deep-learning framework is general for similar tasks. So far, we have generated more than 1000 microstructures with the previously developed degradation model (Bagherian et al. 2020). 10% of the total dataset will be used for testing purposes, and the rest will be used for training. In the training cycles, 15% of the available training data will be used as the validation dataset.
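The data layout and split described above can be illustrated with a minimal sketch. It assumes that one sample has already been read into a NumPy array of shape (H, W, T, months); reading the file itself (for example with `scipy.io.loadmat`) is left out because the variable names inside the ".mat" files are not given here. The function names `make_pairs` and `split_counts` are hypothetical, and the random toy volume only stands in for a real microstructure.

```python
import numpy as np

def make_pairs(series, horizon=1):
    """Turn one 4D sample (H, W, T, months) into (input, month, target) triples.

    Index 0 along the last axis is the initial microstructure and index k is
    the microstructure after k months. `horizon` is the prediction step.
    """
    n_steps = series.shape[-1]
    return [
        (series[..., t], t, series[..., t + horizon])
        for t in range(n_steps - horizon)
    ]

def split_counts(n_total, test_frac=0.10, val_frac=0.15):
    """Sample counts for a 10% test split and a 15%-of-training validation split."""
    n_test = round(n_total * test_frac)
    n_val = round((n_total - n_test) * val_frac)
    return n_total - n_test - n_val, n_val, n_test

# Toy stand-in for one sample: an 8^3 binary volume over months 0..36.
rng = np.random.default_rng(0)
sample = (rng.random((8, 8, 8, 37)) > 0.5).astype(np.uint8)

pairs = make_pairs(sample)
print(len(pairs))          # 36 one-month transitions
print(split_counts(1000))  # (765, 135, 100) as (train, validation, test)
```

With exactly 1000 microstructures, this split would leave 765 samples for training, 135 for validation, and 100 for testing; the actual counts depend on the final dataset size.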

Our deep learning algorithm is based on TransUNet (Chen et al. 2021), a framework which, itself, is based on U-Net (Ronneberger, Fischer, and Brox 2015) and Vision-Transformer (Dosovitskiy et al. 2020). However, our proposed model works with 3D images and predicts the next time frames for the degradation and similar time-series tasks. We utilize the Transformer’s ability to embed mixed data sources. For instance, the time is embedded, as well as the embeddings related to the input microstructure or 3D image. The output is the microstructure at the next time step.

The outcomes of this computational framework include a novel virtual dataset of bone or cellular microstructures as well as their degradation evolution through a simulation method; and the novel deep learning framework, called TransVNet, that can be used for similar time-series prediction of 3D-image evolution and other generative models for material design; and some experimental results on how it performs given the available dataset. The generative variant is not part of this study. In Figure 1, the high-level overview of our proposed algorithm is depicted.

Methods and Materials

Virtual Data Generation

A fast and computationally efficient program called HetMiGen (short for Heterogeneous Microstructure Generator) has been developed to artificially generate 3D heterogeneous microstructures without the need for any reference 2D cut-section images of physical material microstructures obtained through expensive experimental methods such as SEM. This in-silico data generation is especially useful for computational materials science since it does not rely on reconstruction techniques based on expensive experiments, let alone the complications of physical material processing, which cannot cover the whole design space for material design purposes. The C++ source code can be compiled for and deployed on different machines with Linux or Windows OSs, and the executable can be readily run given a CSV file in which each line contains the microstructural parameters of the microstructure to be generated: the microstructure id number, the number of phases in addition to the background phase, the volume fraction of each phase, the number of initial seeds, the increment (positive or negative) of seeds’ addition in the next seed addition iterations, the frequency by which seeds are added, the radius of the local neighborhood to be checked for proximity for each phase (zero means no check is needed and seeds can be grown until they touch others), whether each phase should be clustered at the end, the rate of growth decay for each phase (zero means that growth parameters are fixed throughout the evolution iterations), and the probabilistic growth thresholds for Cellular Automata based on the considered neighborhood type (von Neumann in the current version of the code). By changing the input parameters, it can generate multitudes of heterogeneous microstructures of different material systems consisting of two or more material phases with different morphologies. 
The parameters have an affinity to the physical process of manufacturing and the thermomechanical evolution of microstructure as well. The details of this algorithm are discussed in a paper to be published by the author. A sufficiently large number of cellular microstructures were generated using this code to create the initial dataset of 3D segmented bone-like images. The overview of the algorithm flowchart is shown in Figure 2.
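To make the idea of seed-based probabilistic growth on a von Neumann neighborhood concrete, the following toy sketch grows a single phase on a 3D grid until a target volume fraction is reached. It is explicitly not the HetMiGen algorithm, whose details are deferred to a separate paper: the function `grow_phase`, its parameters, the single growth probability, and the periodic boundaries are all assumptions made for illustration only.

```python
import numpy as np

def grow_phase(shape, n_seeds, target_vf, p_grow, rng, max_iters=1000):
    """Toy probabilistic cellular-automaton growth of one phase on a 3D grid.

    Illustrative only; this is NOT the HetMiGen algorithm.
    """
    grid = np.zeros(shape, dtype=bool)
    # Place the initial seeds at distinct random voxels.
    idx = rng.choice(grid.size, size=n_seeds, replace=False)
    grid.flat[idx] = True
    for _ in range(max_iters):
        if grid.mean() >= target_vf:
            break
        # A voxel is a growth candidate if any of its six face neighbors
        # (von Neumann neighborhood) already belongs to the phase.
        # np.roll wraps around, so the boundaries are periodic (an assumption).
        neighbor = np.zeros(shape, dtype=bool)
        for axis in range(3):
            for shift in (-1, 1):
                neighbor |= np.roll(grid, shift, axis=axis)
        candidates = neighbor & ~grid
        # Each candidate joins the phase with probability p_grow.
        grid |= candidates & (rng.random(shape) < p_grow)
    return grid

rng = np.random.default_rng(0)
micro = grow_phase((16, 16, 16), n_seeds=10, target_vf=0.3, p_grow=0.5, rng=rng)
print(micro.shape, float(micro.mean()))
```

Because whole growth fronts are added per iteration, the final volume fraction overshoots the target slightly; a real generator would need a finer stopping rule, along with the per-phase parameters listed above.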

Figure 2: HetMiGen realization algorithm flowchart to generate initial 3D segmented microstructures.

Based on the initial dataset, the degraded or time-series evolution of the microstructures has been simulated using a previously developed method and its associated parallel MATLAB code (Bagherian et al. 2020). Those data samples are then preprocessed in MATLAB such that only the high-quality ones with more than 95% clustering in the bone mineral segment are selected, and such that an equal number of microstructures is present for different volume fraction ranges, since the volume fraction is the most important feature of the 3D images and a balanced dataset is needed for accurate prediction after training. Therefore, data sampling is balanced among different volume fraction ranges during the training. In addition to this balanced sampling, appropriate data augmentation techniques, such as random rotation and random flipping, are considered during training for robustness. The main features considered are the segmented 3D images with their associated time stamps. However, other expert-picked features of the input microstructures, such as their TPCF and TPCCF statistical descriptors, can also be considered to supplement the aforementioned main features if they are needed or proven to improve the final network performance.
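A possible form of the rotation and flip augmentation is sketched below. It assumes rotations by multiples of 90 degrees, which keep a binary voxel grid exact; whether arbitrary-angle rotations were used is not stated. The helper name `augment_pair` is hypothetical. The key point is that the input and its target must receive the same transformation so that the pair stays physically consistent.

```python
import numpy as np

def augment_pair(x, y, rng):
    """Apply the same random 90-degree rotation and flips to input and target."""
    # Rotate by k * 90 degrees in a randomly chosen plane (pair of axes).
    axes = tuple(int(a) for a in rng.choice(3, size=2, replace=False))
    k = int(rng.integers(0, 4))
    x, y = np.rot90(x, k, axes), np.rot90(y, k, axes)
    # Flip each axis independently with probability 0.5.
    for axis in range(3):
        if rng.random() < 0.5:
            x, y = np.flip(x, axis), np.flip(y, axis)
    return x, y

rng = np.random.default_rng(0)
volume_t = rng.random((6, 6, 6)) > 0.5    # toy input microstructure
volume_t1 = volume_t & (rng.random((6, 6, 6)) > 0.1)  # toy "degraded" target
aug_t, aug_t1 = augment_pair(volume_t, volume_t1, rng)
print(aug_t.shape, aug_t.sum() == volume_t.sum())
```

Both operations only permute voxels, so the volume fraction of each phase is unchanged by the augmentation.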

TransVNet, a Hybrid Autoencoder

To the author’s knowledge, the 3D image sequence prediction problem has not been studied extensively yet. However, our proposed method, TransVNet, is based on the state-of-the-art in computer vision, as it can take advantage of both local convolutional operators and global attention operators. More specifically, our network uses local and global features in encoding the input data and uses the localized information of the previous or input microstructures in reconstructing or decoding the encoded features via skip connections to associated CNN-extracted features of the input image. Another advantage of our proposed method compared to 2D networks such as TransUNet is that it directly processes 3D images. One might argue that it is possible to consider a 2D network operating sequentially on a stack of 2D images that together form the input 3D image. Nevertheless, such an approach discards the 3D information and relationships in the input image and may not be as accurate. This is especially important when the input image has long 3D features in all directions, which is the case for cellular morphologies found in nature, such as bone microstructures. Due to the much bigger input feature size of 3D images than 2D images and limited computational resources, a minimum number of features for different levels of encoder and decoder has been considered. Also, the ViT parameters were fixed for faster training and lower computational cost. However, no pre-trained 3D CNN feature extractor was found, so the CNN feature extractor is completely trained. As the objective is the correct labeling of the output voxels for segmentation or classification, the Cross-Entropy (CE) loss criterion was considered in the loss function by applying the Softmax function on the segmentation mask, i.e., the log-likelihoods of possible classes of data which are binary in our studied case. 
As many segmentation studies have reported DSC scores for their performance measurement, it is also used to assess the accuracy of the trained network and complement the Cross-Entropy term in the loss function. The DSC measures the spatial overlap between two segmented images, A and B target regions, and is defined as $DSC(A,B)=2(A\cap B)/(A+B)$. In binary manual segmentation, this coefficient may be derived from a two-by-two contingency table of segmentation classification probabilities. Thus, minimizing the total loss functional $\mathcal{L}$ with respect to the network parameters, $\theta$, is the learning objective:

$$
\begin{aligned}
\mathcal{L}^{b}_{\theta}\left[x^{t+1}=f_{\theta}(\hat{y}^{t}),\hat{y}^{t+1}\right]
&=\frac{CE(x^{t+1},\hat{y}^{t+1})+\left(1-DSC(y^{t+1},\hat{y}^{t+1})\right)}{2}\\
&=\frac{1}{2}\Biggl[\left(\frac{-1}{H\times W\times T}\sum_{n=1}^{H\times W\times T}\log\frac{\exp\left(x_{n,\hat{y}_{n}^{t+1}}^{t+1}\right)}{\sum_{c=1}^{C}\exp\left(x_{n,c}^{t+1}\right)}\right)\\
&\qquad+\left(1-DSC\left(\operatorname*{arg\,max}_{c}\left\{\frac{\exp\left(x_{n,c}^{t+1}\right)}{\sum_{c=1}^{C}\exp\left(x_{n,c}^{t+1}\right)}\right\},\hat{y}^{t+1}\right)\right)\Biggr]
\end{aligned}
\tag{1}
$$

, where $b$ denotes sample $b$ in a batch of size $B$, $\hat{y}^{t}$ is the input of the network (the segmented image at time step $t$), $\hat{y}^{t+1}$ is the target of the network (the segmented image at time step $t+1$), $x^{t+1}$ is the output of the segmentation mask layer (the log-likelihoods of each class at different voxels), $y^{t+1}$ is the label or segmented image associated with $x^{t+1}$, $C$ is the number of classes (2 in our case), and $N$ is the total number of voxels.
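The per-sample loss of Equation 1 can be written out in a few lines of NumPy, which may help in checking the notation. This is a sketch under assumptions rather than the training code: the function name `ce_dice_loss` is hypothetical, the DSC is computed on the class labeled 1 only, and the hard argmax follows the equation literally. Since an argmax is not differentiable, gradient-based training commonly substitutes a soft Dice term computed on the Softmax probabilities; which variant was used here is not stated.

```python
import numpy as np

def ce_dice_loss(logits, target, eps=1e-7):
    """Average of cross-entropy and Dice loss for one sample.

    logits: (C, H, W, T) raw class scores, x^{t+1} in Equation 1
    target: (H, W, T) integer class labels, y_hat^{t+1} in Equation 1
    Returns (loss, dsc).
    """
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    # Cross-entropy: mean negative log-probability of the true class.
    true_log_prob = np.take_along_axis(log_prob, target[None], axis=0)[0]
    ce = -true_log_prob.mean()
    # Hard prediction y^{t+1}: the most probable class at each voxel.
    pred = log_prob.argmax(axis=0)
    # DSC on class 1 (assumption): twice the overlap over the summed sizes.
    a, b = pred == 1, target == 1
    dsc = 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum() + eps)
    return 0.5 * (ce + (1.0 - dsc)), dsc

rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=(4, 4, 4))
logits = rng.standard_normal((2, 4, 4, 4))
loss, dsc = ce_dice_loss(logits, target)
print(float(loss), float(dsc))
```

A prediction that matches the target with high confidence drives both terms, and hence the averaged loss, toward zero.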

As seen in the overview of the architecture, the encoder is hybrid, meaning it consists of a 3D CNN feature extractor followed by a vision transformer. The goal is to predict the next evolution of the microstructure, called $\hat{y}$, with $\hat{C}$ channels for the desired number of segmentation classes (two in this study because each voxel represents the material phase at that location, which can be either bone marrow or bone mineral), given a 3D input image $y\in\mathbb{R}^{H\times W\times T}$ ($160^{3}$ dimensions in this study following resizing of the HetMiGen outputs of $150^{3}$ dimensions) with three default input channels (the RGB format in general). The CNN local feature extractor encodes the input into its low-dimensional high-level features, with $C'$ (32 in this study) channels and a dimension size of $H'=\frac{H}{2^{\#CNNDownScaling}}=\frac{160}{2^{3}}$ (in this study). 
Each block consists of a double 3D Convolution followed by ReLU nonlinearity and Batch Normalization layers, and a DownSampler through $1\times 1\times 1$ Convolution with a stride of two. The vision transformer needs a series given by image sequentialization. Therefore, the input image is tokenized into 3D flattened patches of size $P$:

$$
\left\{x_{i}^{P}\in\mathbb{R}^{P^{3}\cdot C}\;\middle|\;i=1,\dots,N=\frac{HWT}{P^{3}}\right\}
\tag{2}
$$
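The tokenization of Equation 2 amounts to a reshape and an axis permutation. The sketch below assumes a channels-first array of shape (C, H, W, T) and non-overlapping cubic patches; the function name `tokenize_3d` and the ordering of voxels inside each flattened patch are illustrative choices, not details taken from the original implementation.

```python
import numpy as np

def tokenize_3d(features, patch):
    """Split a (C, H, W, T) array into N = HWT / P^3 flattened patches.

    Returns an array of shape (N, P^3 * C), one row per patch.
    """
    c, h, w, t = features.shape
    assert h % patch == 0 and w % patch == 0 and t % patch == 0
    gh, gw, gt = h // patch, w // patch, t // patch
    # Expose the patch grid: (C, gh, P, gw, P, gt, P).
    x = features.reshape(c, gh, patch, gw, patch, gt, patch)
    # Bring grid indices first, then the voxels of each patch, then channels.
    x = x.transpose(1, 3, 5, 2, 4, 6, 0)
    return x.reshape(gh * gw * gt, patch**3 * c)

feats = np.arange(2 * 4 * 4 * 4).reshape(2, 4, 4, 4)
tokens = tokenize_3d(feats, patch=2)
print(tokens.shape)  # (8, 16): 8 patches, each with 2^3 voxels x 2 channels
```

For example, a single-channel $64^{3}$ volume with $P=16$ yields 64 tokens, while $P=8$ yields 512.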

Then, these patches will be transformed into $D$-dimensional vectors, with $D$ being the hidden state dimension of the vision transformer, by a learnable linear transformation, $E\in\mathbb{R}^{(P^{3}\cdot C')\times D}$. The important point is that we also add the patch positional embedding and the time embedding, via a linear transformation, to the image embedding to encode features not present in the raw input image. Then, the transformer with $L$ layers will transform the embedded sequence as follows (MSA is Multi-head Self Attention, MLP is multi-layer perceptron, and LN is the layer normalization operator).

$$
z_{0}=[x_{1}^{P}E;\dots;x_{N}^{P}E]+Emb_{time}+Emb_{position}
\tag{3}
$$

$$
z_{l}'=MSA(LN(z_{l-1}))+z_{l-1}
\tag{4}
$$

$$
z_{l}=MLP(LN(z_{l}'))+z_{l}'
\tag{5}
$$
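Equations 3 to 5 can be traced with a small NumPy forward pass. The sketch makes several simplifying assumptions that are not taken from the source: the time embedding is the scalar month multiplied by a learnable vector and broadcast over all tokens, layer normalization has no learnable scale or shift, and the MLP uses ReLU for brevity (ViT uses GELU). All names and the random weights are hypothetical.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    """LN over the feature axis, without learnable scale and shift."""
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def msa(z, wq, wk, wv, wo, n_heads):
    """Multi-head self-attention over a token sequence of shape (N, D)."""
    n, d = z.shape
    dh = d // n_heads
    # Project and split into heads: each of q, k, v has shape (heads, N, dh).
    q, k, v = [(z @ w).reshape(n, n_heads, dh).transpose(1, 0, 2)
               for w in (wq, wk, wv)]
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ wo

def transformer_layer(z, p, n_heads):
    # Equation 4: attention on the normalized input plus a residual.
    z_prime = msa(layer_norm(z), p["wq"], p["wk"], p["wv"], p["wo"], n_heads) + z
    # Equation 5: two-layer MLP on the normalized result plus a residual.
    hidden = np.maximum(layer_norm(z_prime) @ p["w1"], 0.0)
    return hidden @ p["w2"] + z_prime

rng = np.random.default_rng(0)
n, patch_dim, d, heads = 8, 16, 12, 3
tokens = rng.standard_normal((n, patch_dim))       # flattened patches x_i^P
E = 0.1 * rng.standard_normal((patch_dim, d))      # patch embedding matrix
emb_pos = 0.1 * rng.standard_normal((n, d))        # positional embedding
w_time = 0.1 * rng.standard_normal(d)              # linear time embedding
month = 4.0
z0 = tokens @ E + month * w_time + emb_pos         # Equation 3

params = {name: 0.1 * rng.standard_normal(shape) for name, shape in {
    "wq": (d, d), "wk": (d, d), "wv": (d, d), "wo": (d, d),
    "w1": (d, 4 * d), "w2": (4 * d, d)}.items()}
z1 = transformer_layer(z0, params, heads)
print(z1.shape)  # (8, 12)
```

Each layer preserves the sequence shape $N\times D$, so $L$ layers can be stacked by calling `transformer_layer` repeatedly with separate parameters.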

The decoder is a 3D CNN Upsampler. There are multiple cascaded trainable blocks to reach the high-dimensional output or segmentation mask from the low-dimensional encoded sequence of feature vectors. First, the sequence of $N$ encoded features, $z_{L}\in\mathbb{R}^{N\times D}$, is reshaped into a 3D image with the dimension of $D\times\frac{H'}{P}\times\frac{W'}{P}\times\frac{T'}{P}$. Then, each upsampling block will double the size of the image. Each block consists of a trilinear upsampling function with an enlargement factor of two, a double 3D-Convolution, a ReLU nonlinearity, and a Batch Normalization. As mentioned above, the V-shaped architecture enables feature aggregation via Skip Connections from the CNN encoder layers to their associated CNN decoder ones. The decoder layers with input sizes lower than the smallest CNN encoder features can get their encoder features via Max Pooling of the last CNN encoder outputs, i.e., the smallest features.
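The spatial sizes implied by the encoder and decoder description can be traced with simple arithmetic. In the sketch below, the input size of 160 and the three CNN downscalings come from the text, whereas the patch size of 2 is an assumption borrowed from the name of the "Hybrid-resolution160-patch2" model in the ablation table. The sketch also assumes that tokenization acts on the CNN features, so the sequence length is $H'W'T'/P^{3}$, and that decoding continues until the input resolution is restored. The function name `decoder_shapes` is hypothetical.

```python
def decoder_shapes(input_size=160, n_down=3, patch=2):
    """Trace the spatial edge length through encoder, tokens, and decoder."""
    feat = input_size // 2**n_down   # H' after the CNN feature extractor
    grid = feat // patch             # H'/P: edge of the reshaped token grid
    n_tokens = grid**3               # sequence length seen by the transformer
    sizes = [grid]
    while sizes[-1] < input_size:    # each decoder block doubles the size
        sizes.append(sizes[-1] * 2)
    return feat, n_tokens, sizes

print(decoder_shapes())  # (20, 1000, [10, 20, 40, 80, 160])
```

Under these assumptions, the first decoder block receives a $10^{3}$ grid, which is smaller than the smallest CNN encoder feature map ($20^{3}$); this is the case in which the skip features are obtained by Max Pooling of the last CNN encoder output.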

Results

Data generation

Using the HetMiGen code, a diverse dataset of initial bone-like geometries represented by binary-segmented 3D images was generated. Then, based on the previous study, we simulated the degradation of the microstructures over 36 monthly time steps. This provided us with a sizeable training dataset to test the performance of TransVNet. The result of data generation is summarized by the histogram of Figure 3. As inferred from the figure, the dataset is unbalanced with respect to the volume fraction of the bone marrow segment in the initial microstructures. Therefore, the high-volume fraction microstructures were sampled more frequently in training in addition to data augmentation through rotation and flipping.
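One common way to sample under-represented volume-fraction ranges more often is inverse-frequency weighting per histogram bin. The sketch below illustrates that idea only; the bin count, the function name `balanced_weights`, and the weighting rule are assumptions, since the exact sampling scheme is not specified.

```python
import numpy as np

def balanced_weights(volume_fractions, n_bins=10):
    """Per-sample sampling probabilities that equalize volume-fraction bins."""
    vf = np.asarray(volume_fractions, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index of each sample; clip so that vf == 1.0 falls in the last bin.
    bins = np.clip(np.digitize(vf, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(bins, minlength=n_bins)
    # Inverse-frequency weights: samples in rare bins get larger weights.
    w = 1.0 / counts[bins]
    return w / w.sum(), bins

vf = [0.10, 0.12, 0.15, 0.18, 0.80]          # toy volume fractions
probs, bins = balanced_weights(vf, n_bins=2)
print(probs)  # the single high-volume-fraction sample gets probability 0.5

rng = np.random.default_rng(0)
batch = rng.choice(len(vf), size=4, p=probs)  # indices drawn for one batch
```

With these weights, every non-empty bin has the same total probability of being drawn, regardless of how many microstructures it contains.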

Figure 3: Histogram of the number of microstructures generated versus the volume fraction ranges.

The network performance and ablation study

To the best of our knowledge, there is no appropriate baseline method for our specific problem, the time-series prediction of 3D segmented images. However, the initial training performance is better than that of TransUNet applied to 2D medical segmentation tasks. The curves of the loss values during the training are plotted in Figure 4. The main and only training signal is the average value of the Dice loss and the Cross-Entropy loss shown by the red line. The Dice loss is calculated as $Dice=1-DSC$ as explained above. Its value at the end is 0.084, which means that DSC is 91.64%. The best DSC score of TransUNet was 89.71%, calculated as the average value for different classes of organs segmented by their network. We also considered Hausdorff distance as another performance metric in our experiments. Hausdorff distance (HD) measures how far two subsets of a metric space are from each other.
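For reference, the symmetric Hausdorff distance between two binary masks can be computed by brute force as sketched below. This is a generic definition in voxel units, not necessarily the exact variant used in our experiments (for instance, whether it is evaluated on surfaces only or averaged over samples is not detailed here). The function name is hypothetical, and the pairwise distance matrix makes this version suitable for small volumes only; SciPy's `scipy.spatial.distance.directed_hausdorff` is a more scalable alternative.

```python
import numpy as np

def hausdorff_distance(mask_a, mask_b):
    """Symmetric Hausdorff distance (in voxels) between two binary masks."""
    a = np.argwhere(mask_a).astype(float)
    b = np.argwhere(mask_b).astype(float)
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both masks must contain at least one voxel")
    # Pairwise Euclidean distances between all foreground voxels.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Largest distance from any voxel of one set to its nearest voxel
    # in the other set, taken in both directions.
    return max(d.min(axis=1).max(), d.min(axis=0).max())

pred = np.zeros((6, 6, 6), dtype=bool)
truth = np.zeros((6, 6, 6), dtype=bool)
pred[0, 0, 0] = True
truth[3, 4, 0] = True
print(hausdorff_distance(pred, truth))  # 5.0
```

Identical masks give a distance of zero, and a single stray voxel far from the target can dominate the value, which is why HD complements the overlap-based DSC.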

Figure 4: The loss values during the training; red line denotes the main training signal, the average of the Dice loss and the Cross-Entropy loss, while the blue one shows the Cross-Entropy alone.
(a) Evolution of microstructure 025980 with an initial mineral volume fraction of 0.406.
(b) Evolution of microstructure 101962 with an initial mineral volume fraction of 0.235.
Figure 5: Samples of the network performance on the test set (leftmost column: evolution after 4 months; middle column: evolution after 8 months; rightmost column: evolution after 12 months) (red color: network error).
Table 1: Ablation and scaling experimental results on our test dataset.
| Method | DSC ($\in[0,1]$, $\uparrow$) | HD (mm or voxels, $\downarrow$) |
| --- | --- | --- |
| ViTOnly-pretrained | 0.040122 | 14.725012 |
| ViTOnly | 0.118725 | 21.709163 |
| ViTOnly-patch8 | 0.208199 | 20.731199 |
| Hybrid-patch8 | 0.981572 | 0.033904 |
| Hybrid-patch8-months4 | 0.960275 | 0.300317 |
| Hybrid-resolution160-patch2-months4 | 0.978929 | 0.184443 |
| Hybrid-resolution160-patch2-months4-epoch1 | 0.955390 | 0.722508 |

We performed an ablation study to further assess the capabilities of our network and show the superiority of our proposed TransVNet architecture compared with other possible models. To make the comparison faster and more computationally efficient, we first considered resized microstructures with 643superscript64364^{3}64 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT resolution instead of the original 1503superscript1503150^{3}150 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT resolution in our generated dataset. The first model on the encoder side was simply the ViT trained on ImageNet with fixed parameters in its layers and trainable ones in the embedding layer with a patch size of 16, thereby having 6416×6416×6416=6464166416641664\frac{64}{16}\times\frac{64}{16}\times\frac{64}{16}=64divide start_ARG 64 end_ARG start_ARG 16 end_ARG × divide start_ARG 64 end_ARG start_ARG 16 end_ARG × divide start_ARG 64 end_ARG start_ARG 16 end_ARG = 64 tokens for global information inference through the transformer-based architecture. After four training epochs, the results showed very poor performance on the unseen test dataset. Making the hidden layers trainable also did not improve the performance to a satisfactory level; however, considering the lower patch size of 8, leading to the higher number of tokens as the transformer layers’ input, improved it significantly yet less than a satisfactory level. Then, we considered our proposed hybrid model in the encoder with Skip Connections to the decoder, which resulted in very high and satisfactory performance metrics. 
To determine whether our proposed method is also effective in the more challenging task of predicting the segmented image four steps ahead, where the morphological changes are much larger than in next-step prediction, the segmented image after four months was used as the training target, i.e., $\hat{y}^{t+1}$ was replaced with $\hat{y}^{t+4}$ in Equation 1. The performance metrics show that the model remains very accurate in this setting, although it performs below the level achieved on the less challenging next-step task.
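
As a minimal sketch of the change in targets described above, the snippet below pairs each snapshot with the snapshot a fixed number of months later. The helper `make_pairs` and the toy sequence of 13 monthly snapshots are assumptions made for illustration only; they are not the authors' data pipeline.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def make_pairs(sequence: Sequence[T], horizon: int = 1) -> List[Tuple[T, T]]:
    """Pair each snapshot at step t with the target at step t + horizon.

    Hypothetical helper: `sequence` is one microstructure's ordered
    monthly snapshots; `horizon` is the prediction distance in months.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    return [
        (sequence[t], sequence[t + horizon])
        for t in range(len(sequence) - horizon)
    ]


# Toy stand-in for one evolution: labels for months 0..12 (assumed length).
months = list(range(13))

next_step = make_pairs(months, horizon=1)  # input month t, target month t+1
four_step = make_pairs(months, horizon=4)  # input month t, target month t+4

print(next_step[0], len(next_step))
print(four_step[0], len(four_step))
```

A longer horizon yields fewer training pairs per sequence and larger morphological differences between input and target, which is consistent with the harder task described above.
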

Next, we focused on the scalability of our proposed method by considering a network whose architecture is similar to the last successful model yet suited to the original-resolution inputs and outputs, and by starting its training from the shared learned parameters of the trained lower-resolution network. This experiment also resulted in high performance metrics. Therefore, our proposed method is advantageous since it achieves satisfactory performance, even after one training epoch, when the learned weights of the lower-resolution network are transferred to fine-tune a high-resolution network. The detailed numerical results are provided in Table 1. Sample predictions of the last network on the test dataset are visualized in Figure 5. White shows the mineral phase of the 3D bone microstructures, while blue denotes the marrow phase. Red voxels mark the discrepancy between the network prediction and the ground truth, i.e., the bone degradation simulation results. The leftmost, middle, and rightmost columns show the fourth-month, eighth-month, and twelfth-month snapshots of the evolution, respectively.
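
To make the DSC column of Table 1 and the red discrepancy voxels of Figure 5 more concrete, the following sketch computes a standard binary Dice similarity coefficient and a voxel-wise mismatch mask with NumPy. It is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, the volumes are toy data, and the paper's exact metric (for example, any per-class averaging) may differ.

```python
import numpy as np


def dice_score(pred, target) -> float:
    """Binary Dice coefficient 2|A ∩ B| / (|A| + |B|) for two voxel masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (a convention).
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)


def discrepancy_mask(pred, target) -> np.ndarray:
    """Voxels where prediction and ground truth disagree (shown red in a plot)."""
    return np.logical_xor(
        np.asarray(pred, dtype=bool), np.asarray(target, dtype=bool)
    )


# Toy ground truth: a 4x4x4 volume whose first two slices are mineral phase.
target = np.zeros((4, 4, 4), dtype=bool)
target[:2] = True

# Toy prediction: identical except for one missed mineral voxel.
pred = target.copy()
pred[0, 0, 0] = False

score = dice_score(pred, target)
mask = discrepancy_mask(pred, target)
print(round(score, 4), int(mask.sum()))
```

In this toy case the single mismatched voxel is the only entry set in the mask, and the score stays close to 1, mirroring how a small number of red voxels corresponds to a high DSC.
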

Discussion and Conclusions

In conclusion, our scalable and fast HetMiGen code was used to generate an artificial dataset of 1000 different geometries of porous bone microstructures at their initial time step. Then, the time-series evolution of the initial 3D segmented images was simulated via a previously developed bone degradation code. We proposed a new deep learning network, called TransVNet, and applied it directly to the fast prediction of bone degradation using our artificially created (in-silico) dataset. Our experimental results show that TransVNet outperforms its ancestor, TransUNet, which has been applied to the different yet less challenging task of 2D medical image segmentation. Our proposed method is also scalable: it remains very accurate even for image sequences with higher levels of change, and it can be efficiently fine-tuned for high-resolution prediction using the previously trained parameters of a lower-resolution network. In future studies, the effect of the current model's hyperparameters on its performance can be examined further. Furthermore, the architecture can be changed by using transposed convolution operators in the decoder and by processing more than one input time step at once, exploiting the transformer's capability of finding sequential relationships, especially for other time-series data with longer history dependency. The results of such variations can be compared to determine the best-performing network.

References

  • Azad et al. (2019) Azad, R.; Asadi-Aghbolaghi, M.; Fathy, M.; and Escalera, S. 2019. Bi-Directional ConvLSTM U-Net with Densley Connected Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops.
  • Bagherian et al. (2020) Bagherian, A.; Baghani, M.; George, D.; Rémond, Y.; Chappard, C.; Patlazhan, S.; and Baniassadi, M. 2020. A novel numerical model for the prediction of patient-dependent bone density loss in microgravity based on micro-CT images. Continuum Mechanics and Thermodynamics, 32(3): 927–943.
  • Brown et al. (2020) Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901.
  • Chen et al. (2021) Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A. L.; and Zhou, Y. 2021. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306.
  • Devlin et al. (2018) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dosovitskiy et al. (2020) Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  • Goodfellow, Bengio, and Courville (2016) Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep learning. MIT Press.
  • Janiesch, Zschech, and Heinrich (2021) Janiesch, C.; Zschech, P.; and Heinrich, K. 2021. Machine learning and deep learning. Electronic Markets, 31(3): 685–695.
  • Ji et al. (2012) Ji, S.; Xu, W.; Yang, M.; and Yu, K. 2012. 3D convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1): 221–231.
  • Keyak et al. (2009) Keyak, J.; Koyama, A.; LeBlanc, A.; Lu, Y.; and Lang, T. 2009. Reduction in proximal femoral strength due to long-duration spaceflight. Bone, 44(3): 449–453.
  • LeBlanc et al. (2000) LeBlanc, A.; Schneider, V.; Shackelford, L.; West, S.; Oganov, V.; Bakulin, A.; and Voronin, L. 2000. Bone mineral and lean tissue loss after long duration space flight. J Musculoskelet Neuronal Interact, 1(2): 157–60.
  • LeCun, Bengio, and Hinton (2015) LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature, 521(7553): 436–444.
  • Liu and Wang (2021) Liu, J.; and Wang, X. 2021. Plant diseases and pests detection based on deep learning: a review. Plant Methods, 17: 1–18.
  • Qin (2010) Qin, Y. 2010. Challenges to the musculoskeleton during a journey to Mars: assessment and counter measures. Journal of Cosmology, 12: 3778–3780.
  • Raffel et al. (2020) Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1): 5485–5551.
  • Ronneberger, Fischer, and Brox (2015) Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, 234–241. Springer.
  • Rumelhart, Hinton, and Williams (1986) Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature, 323(6088): 533–536.
  • Vico et al. (2000) Vico, L.; Collet, P.; Guignandon, A.; Lafage-Proust, M.-H.; Thomas, T.; Rehailia, M.; and Alexandre, C. 2000. Effects of long-term microgravity exposure on cancellous and cortical weight-bearing bones of cosmonauts. The Lancet, 355(9215): 1607–1611.
  • Wright et al. (2022) Wright, L. G.; Onodera, T.; Stein, M. M.; Wang, T.; Schachter, D. T.; Hu, Z.; and McMahon, P. L. 2022. Deep physical neural networks trained with backpropagation. Nature, 601(7894): 549–555.