Realism in Action: Anomaly-Aware Diagnosis of Brain Tumors from Medical Images Using YOLOv8 and DeiT

Seyed Mohammad Hossein Hashemi, Leila Safari*, Amirhossein Dadashzade Taromi S.M.H. H. is a B.Sc. graduate in Computer Engineering from University of Zanjan. He is currently a prospective graduate student. (e-mail: [email protected].*Correspondence: L. S. is a faculty of Computer Engineering Department at University of Zanjan, Zanjan, Iran. (e-mail: [email protected]).A. D.T. is a B.Sc. graduate in Computer Engineering from University of Zanjan. He is currently a M.Sc. student in Artificial Intelligence at Institute for Advanced Studies in Basic Sciences, Zanjan, Iran. (e-mail: [email protected]).

Abstract

In the field of medical sciences, reliable detection and classification of brain tumors from images remains a formidable challenge due to the rarity of tumors within the population of patients. Therefore, the ability to detect tumors in anomaly scenarios is paramount for ensuring timely interventions and improved patient outcomes. This study addresses the issue by leveraging deep learning (DL) techniques to detect and classify brain tumors in challenging situations. The curated data set from the National Brain Map** Lab (NBML) comprises 81 patients, including 30 Tumor cases and 51 Normal cases. The detection and classification pipelines are separated into two consecutive tasks. The detection phase involved comprehensive data analysis and pre-processing to modify the number of image samples and the number of patients of each class to anomaly distribution (9 Normal per 1 Tumor) to comply with real world scenarios. Next, in addition to common evaluation metrics for the testing, we employed a novel performance evaluation method called Patient to Patient (PTP), focusing on the realistic evaluation of the model. In the detection phase, we fine-tuned a YOLOv8n detection model to detect the tumor region. Subsequent testing and evaluation yielded competitive performance both in common evaluation metrics and PTP metrics. Furthermore, using the Data Efficient Image Transformer (DeiT) module, we distilled a Vision Transformer (ViT) model from a fine-tuned ResNet152 as a teacher in the classification phase. This approach demonstrates promising strides in reliable tumor detection and classification, offering potential advancements in tumor diagnosis for real-world medical imaging scenarios.

Index Terms:

Anomaly distribution, Brain tumors, Tumor diagnosis, Medical imaging, ResNet152, Vision Transformer, YOLOv8n

I Introduction

Reliability and precision are pivotal factors in the context of brain tumor diagnosis. The brain, a vital organ situated within the human body that oversees the entire nervous system [1]. Consequently, any deviations within the brain can significantly impact human health. Among these anomalies, brain tumors stand out as particularly severe. Brain tumors involve the uncontrolled and aberrant growth of cells within the brain. These tumors can be categorized into two groups: primary tumors and secondary tumors. Primary tumors emerge within the brain tissue itself, whereas secondary tumors stem from other areas of the body, migrating to the brain tissue through the bloodstream [2]. Glioma and Meningioma are two severe types of brain tumors among primary tumors. If not identified in their early stages, these tumors can lead to fatal outcomes for patients [3]. As the World Health Organization (WHO) outlined, brain tumors are categorized into four grades. Grade 1 and grade 2 tumors correspond to less aggressive forms (such as Meningioma), whereas grade 3 and grade 4 tumors encompass more aggressive varieties (like Glioma)[1].

In the context of the detection and monitoring of brain tumors, nowadays, with huge advancements in the medical imaging field, there are various imaging technologies used by radiologists and doctors to observe internal human body organs, such as computed tomography (CT), positron emission tomography (PET), and magnetic resonance imaging (MRI). Among the array of modalities available, MRI emerges as the foremost selection for non-invasive brain tumor detection and evaluation. This preference is owed to its remarkable resolution and superior ability to provide contrasting details of soft tissues [4]. Manually scrutinizing these images is a laborious and demanding endeavor. Moreover, it is susceptible to errors, particularly given the surge in patient volumes and the relatively low incidence rate of brain tumors [5, 6]. This combination makes detecting and categorizing these tumors a formidable challenge. In response, our objective is to formulate a resilient, anomaly-conscious, computer-aided, automated solution. This approach aims to enhance the accuracy of clinical brain tumor diagnosis.

The initial challenge in the study of tumor diagnosis lies in the significant variability of tumors in terms of their shapes, textures, and contrasts, both within and between cases [7]. Within the realm of tumor diagnosis, scientists have harnessed a diverse array of machine learning (ML) techniques. These include Support Vector Machines (SVMs), K-Nearest Neighbor (KNN), Decision Trees, and Naive Bayes. Furthermore, in the context of deep learning algorithms, they employed a diverse array of methods, including Convolutional Neural Networks (CNNs), VGGNets [8], GoogleNet [9], and ResNets [10]. These advanced algorithms have been instrumental in assisting with tumor diagnosis.

Nevertheless, in the context of brain tumor diagnosis, a significant challenge persists, stemming from the inherent rarity of brain tumor occurrences within a larger population. Based on the most recent statistical data released by Johns Hopkins Medicine, the incidence rate of brain and nervous system tumors in the United States is approximately 30 adults per 100,000 individuals [6]. Indeed, the incidence rate can differ significantly based on age, gender, location, and other population factors. Nevertheless, a thorough review of recent data suggests that this number remains very low. Although the rare occurrence of this illness is comforting, difficulties arise when doctors need to identify these uncommon cases in a large and diverse population [11].

So far, several significant contributions have been made to offer robust solutions for this task. Often, these models, regardless of their architecture, have been trained on semi-balanced data sets where potential class imbalances have been resolved by data techniques such as Random Over-Sampling, Synthetic Minority Over-Sampling Technique (SMOTE), and Weighted Loss Functions. However, in this study, we aimed to raise the realism leverage of the problem. We proposed a DL-based solution that has been trained and evaluated on close-to-clinical diagnosis scenarios. This unique perspective on the problem could enable us to close the gap between academic research outcomes and the practical usage of this study.

To resolve the mentioned issues, our solution utilizes two key factors. Firstly, (1) we harness a substantial quantity of distinct image data from the data set we acquired from NBML. This data set encompasses records from 81 patients who underwent monitoring using various imaging technologies, such as CT, PET, and MRI. Secondly, (2) we implement a meticulous data pipeline. This pipeline not only refines the data set to skew its distribution towards non-tumor samples but also preserves category proportions while augmenting data within each class, thus enhancing overall sample diversity.

In the context of the tumor detection phase, our approach involves leveraging the robust and adaptable capabilities of the YOLOv8n detection model and fine-tuning it to classify Tumor cases from Normal ones. We meticulously trained the YOLOv8n model on our data set to yield accurate yet anomaly-resistant detections. Subsequently, we proceed to the evaluation phase, where we gauge the model’s efficacy in tumor detection. This assessment employs a novel PTP evaluation function, providing a pragmatic understanding of the model’s performance during execution.

Lastly, we utilized a distilled ViT using the DeiT architecture with a fine-tuned ResNet152 as the teacher model [12]. We incorporated this model for its attention mechanism and the lightness, features we exploit to our advantage. This model contributes to classifying brain tumor types into three distinct classes: Meningioma, Pituitary, and Glioma.

The following paragraphs are structured as follows: Section II is a literature review on related works in this field, Section III explains the materials and methods we utilized, Section IV presents the results we achieved, and Section V concludes the article.

II Related Works

Following the advancements of ML algorithms, many valuable contributions have been made to offer robust and accurate solutions for the brain tumor classification problem [13, 14, 15]. The article [7], authored by J. Kang and his colleagues, proposed a novel pipeline to classify brain tumors. They initially pre-processed the input data and then utilized an array of CNN pre-trained models (e.g., ResNet, DenseNet, VGG, AlexNet, MnasNet, etc.) to extract features from them. Next, they evaluated those deep features using a range of ML classifiers (e.g., Random Forest, SVM, KNN, etc.). Based on their evaluation outcome, they cherry-picked and combined their top three feature sets to form a highly distinctive feature vector for each image, making it easier for the model to learn and classify the image. Lastly, using the concept of bagging, they created a classifier comprising nine ML classifiers that predict the label based on the input feature.

During the last decade, computer hardware advancements, especially Graphical Processing Units (GPU), have caused DL solutions to be an undeniable and consistent solution for many tasks. In this regard, various CNN-based architectures (e.g., DenseNets, Xception, VGG-Nets, etc.) have been introduced for image processing tasks, and many of them have been applied to studying brain tumor classification[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27].

The paper [28] proposes a framework based on unsupervised deep generative neural networks that combine Variational Auto Encoders (VAEs) and Generative Adversarial Networks (GANs) to generate realistic brain tumor MRIs. The proposed method significantly improves the performance of the ResNet50 classifier, achieving an average accuracy improvement from %72.63 to %96.25. The framework shows potential as a valuable clinical tool for medical experts.

The article [29] utilizes the DenseNet201 pre-trained DL model which is fine-tuned and trained using a deep transfer of imbalanced data learning. The features of the trained model are extracted from the average pool layer, which captures deep information about each type of tumor. However, the author claims that the layer’s characteristics alone are insufficient for a reliable classification of brain tumors. Therefore, two techniques for feature selection are proposed: Entropy-Kurtosis based High Feature Values (EKbHFV) and a Modified Genetic Algorithm (MGA) based on meta-heuristics. The selected features from the genetic algorithm are further refined using a new threshold function. The features obtained from EKbHFV and MGA are fused using a non-redundant serial-based approach and classified using a multi-class SVM cubic classifier. Their evaluation results indicate a significant accuracy of %95.

This paper [30] proposed a model called FT-ViT, a fine-tuned vision transformer model that uses image processing, including image patching, to further extract learnable features from them. Their testing results suggest a significant accuracy score of %98.13.

The article [31] discusses using an ensemble of pre-trained ViT (vision transformer) models to classify brain tumors from MR images. They trained and evaluated an array of ViT models, namely B/16, B/32, L/16, and L/32, and analyzed their performance on brain tumor classification tasks. Based on their final results, they concluded that the best singular model among the pre-trained ViT models is L/32, with an overall test accuracy of %98.2 at a resolution of 384 × 384. However, they added that when all four ViT models are combined as a stack of classifiers ensembled together, the overall accuracy improves to %98.7.

In [32], the authors focused on the YOLOv7 model and aimed to improve its performance through transfer learning. First, they employed image enhancement techniques to enhance the visual representation of brain tissue in MRI scans and also utilized data augmentation technique to increase their samples. To enhance the training phase of the model, the authors incorporated the Convolutional Block Attention Module (CBAM) into YOLOv7, enhancing its feature extraction capabilities for salient regions associated with brain malignancies. Also, they added a Spatial Pyramid Pooling Fast+ (SPPF+) layer to improve the model’s sensitivity. Their proposed model achieved a remarkable %99.5 accuracy in tumor detection task based on their testing results.

III Materials and Methods

In this section, we explain the steps we took for data preparation and then elaborate our proposed framework for tumor detection and classification. Our general approach here is to break down the complex task of tumor diagnosis into smaller sub-problems. Accurate brain tumor diagnosis involves two distinct goals: 1.Detecting tumors within a predominantly Normal dataset, and 2. Identifying unusual brain tissues and their type from their unique characteristics in highly noisy scenes (e.g., shape and suspicious tissue placement).

The first phase of our framework, tumor detection, focuses on addressing the first goal, which is training a model that is resilient to anomaly-distributed populations and can accurately detect brain tumors in various imaging modalities.

The second phase of our framework, tumor classification, involves through data collection and preparation steps to merge multiple publicly available datasets into a single source of data and convert them into a suitable format for the classifier. Next, we created a custom DeiT model and trained it to classify brain tumors in three classes of Meningioma, Pituitary, and Glioma.

III-A Brain Tumor Detection

This phase includes comprehensive data collection from private data sources and unique pre-processing steps to mimic realistic tumor diagnosis scenarios. In this regard, we collected a relatively small image dataset of Normal and Tumor classes with samples belonging to 81 patients, with 30 being Tumor cases and 51 Normal. Next, we subjected the dataset to a custom method of data pre-processing that included data distribution modification techniques and data augmentation, to transform the dataset into a suitable format.

Coming to the next step, we trained and fine-tuned the YOLOv8n model for the tumor detection task. Furthermore, during the evaluation step, in addition to common evaluation metrics (i.e, F1-Score, Recall and Precision) we employed a novel PTP evaluation function to facilitate a pragmatic assessment of the model’s performance.

Figure 1 depicts the proposed brain tumor detection procedure. As its shown in the figure, the raw data passes through a data pre-processing pipeline and transformed into a suitable format for the detection model, the YOLOv8n. Based on the model’s prediction, the tumor-detected samples are fed into the classifier for tumor type classification. The details of the sub-components of Figure 1 is explained in the following subsections.

Refer to caption — Figure 1: Brain Tumor Detection Process.

III-B Data Collection

During the training phase of our brain tumor detection model, we procured a relatively small dataset from NBML, including 81 patients, with 30 cases designated as tumor-positive and the remaining categorized as Normal. Notably, each patient’s folder featured various images and encompassed a spectrum of imaging modalities, including PET, CT, and MRI scans. Consequently, we assert that the data set inherently incorporates a form of data augmentation. In simpler terms, the inclusion of diverse imaging modalities for each patient mitigates the potential for bias towards any particular modality, such as MRI. Importantly, it is imperative to emphasize that the data set obtained from NBML has been exclusively employed for the detection phase and it is held privately, with all associated rights and credits attributed to NBML.

III-C Data Pre-processing

The essence of our data preparation pipeline centers on the careful pre-processing of our data set to closely mirror real-world scenarios. In this section, we will explain the first critical aspect of our methodology, which is distributing the patient data to ensure it reflects real-world scenarios for testing. In section D, we will outline the specific steps we took to determine the threshold for classifying patients into their corresponding classes.

In the context of brain tumors, the United States typically reports [33] incidence rates ranging from 0.03 to 0.06. We recognize that other nations may experience different rates due to their distinct factors. Given the limitations stemming from our limited dataset for the detection phase and the absence of comprehensive external data sources, we exercised caution by adopting a conservative estimate of a 0.1 incidence rate. The decision was made to simulate the anomalous scenario of detecting a Tumor case from Normal cases. Subsequently, following our presumed brain tumor distribution in the general population, we segmented the data set into two sets: one for training and the other for testing, with a particular stipulation. Specifically, in the training data set, we ensured that there were nine randomly selected Normal images for every tumor image. This approach was rooted in our hypothesis that, throughout both the training and evaluation phases, the model’s focus should not lie in discerning each individual’s situation but rather in acquiring knowledge from a diverse spectrum of patients exhibiting various scenarios and learning more robust and distinctive features.

Moving on to the testing set, we specifically selected 30 patients, encompassing 27 cases categorized as Normal and 3 cases as Tumor, with their corresponding folders housing all associated images. We aimed to maintain a distribution of Tumor and Normal patients that closely aligns with real-world scenarios while upholding the distribution pattern established during training. Notably, this step presented a primary challenge as each patient possessed a varying number of images, and the initial data set distribution significantly diverged from our desired goal. Additionally, we had to address the issue of particular images from tumor patients not exhibiting any signs of brain tumors, necessitating their removal and cleansing from the training data set. To address the challenges at hand, we followed a systematic approach as below:

1. Utilizing a third-party software (MicroDicom) to convert DICOM-format patient folders and their associated images into “.png” format, standardizing their resolution to 540x540 pixels. Our initial data examination revealed approximately 18,000 Normal images distributed among 51 patients, while around 30 Tumor patients contributed roughly 12,000 image samples.

2. Resizing all images to a uniform size allowed us to compress each patient’s folder into a ZIP archive, providing a precise count of images per patient and facilitating the selection process for each data set section.

3. Prioritizing data preservation and efficiency for the testing set, opting for the 27 normal patient folders with the smallest ZIP file sizes, resulting in 27 patients.

4. Choosing the three files with the smallest ZIP sizes from the uncleaned tumor patient folders. Ending up with a total of 30 patients for the testing set, achieving a tumor case distribution rate of 10%.

5. selectively picking the remaining cleaned tumor case folders to construct our training set, and including approximately 1.4k tumor-indicative images, supplemented by 12,000 normal samples. This created an intentionally skewed training data set with a tumor-to-normal sample distribution of 10%, meaning we assigned nine normal images for each tumor image.

6. To enhance the diversity of our training samples and proactively mitigate the risk of overfitting while having more control over the data preparation process, we developed a tailored data augmentation class utilizing Python’s built-in libraries. This augmentation class encompassed modifications such as brightness intensity adjustments, rotations, as well as vertical and horizontal flips, all while ensuring the corresponding bounding boxes (the rectangular area indicating the tumor tissue) were appropriately adjusted. The execution of the data augmentation pipeline has led to a substantial expansion in the overall data set volume, all while preserving its initial distribution. This augmentation procedure has contributed to enhancing and diversifying our training data, resulting in an overall improvement in data quality.

III-D Proposed PTP Evaluation Metrics

In our research, our primary goal is to improve the model performance in real-world situations compared to academic assessments. To achieve this, we have introduced a new evaluation method called PTP. In this methodology, instead of indiscriminately feeding MRI images into the model without considering their patient origins, we adopt a more pragmatic strategy: feed in all the MRI scans from each patient. The model then carefully examines each image for any signs of abnormalities related to brain tumors. It keeps track of the number of images showing these signs, and if this count goes above a certain threshold, the model classifies the case as indicative of a tumor. So, instead of relying solely on standard evaluation metrics like F1-Score, Recall, Precision, and Accuracy, we have adopted a PTP-based evaluation metrics.

The PTP-based evaluation metrics are implemented as a straightforward Python function that systematically iterates through each patient’s directory within the testing data set. Within this process, it feeds the images specific to each patient into the model for analysis. While the model processes each image individually and generates corresponding predictions, the PTP function maintains a record of these predictions. After completing this iterative procedure and analyzing all images within a patient’s folder, the PTP function calculates a patient-specific tumor threshold. This threshold represents the proportion of images indicative of tumors within the entire collection of a patient images. If this computed threshold surpasses the predefined General Tumor Threshold (GTT), the PTP function classifies the patient as a Tumor case. This process is repeated individually for each patient, with the PTP function iterating through all patients in the dataset. Ultimately, the PTP function computes additional metrics, encompassing PTP-ACC, PTP Recall, PTP-Precision, and PTP-F1. It is noteworthy that this function is designed and deployed only as an evaluation metric for the Tumor detection model; hence, during the classification phase, we utilized common evaluation metrics such as accuracy to assess the model.

Here is a concise explanation of the mentioned metrics:

• PTP-ACC: This metric assesses the model’s accuracy in classifying patients as having tumors or not based on their individual tumor thresholds. This metric quantifies the proportion of correctly identified patient cases in the total patient population. In essence, it measures how well the model can distinguish patients with tumors from those without, providing a valuable indicator of its overall accuracy in patient classification.

• PTP-Recall: This metric evaluates the model’s ability to correctly identify patients with tumors among all individuals who genuinely have tumors. It quantifies the ratio of true positive patient cases (those correctly identified as having tumors) to the total number of patients with tumors in the data set. PTP-Recall is a crucial measure of the model’s sensitivity, highlighting its effectiveness in capturing all patients with tumors and minimizing the risk of missing any cases.

• PTP-Precision: This metric gauges the precision of the model in labeling patients as having tumors, considering the instances it has identified as positive cases. It calculates the ratio of true positive patient cases to the total number of patient cases labeled as having tumors by the model. This metric provides insights into the model’s precision in patient classification, emphasizing its ability to minimize false positive identifications while maintaining accuracy.

• PTP-F1: This metric combines the metrics of PTP-Precision and PTP-Recall into a single score to offer a balanced evaluation of the model’s performance in identifying patients with tumors. Calculated as the harmonic mean of these two metrics, PTP-F1 takes into account both false positives and false negatives in patient classification. This is particularly valuable when there is an imbalance between the number of patients with and without tumors, as it provides a comprehensive assessment of the model’s overall performance.

III-D1 General Tumor Threshold

An essential preliminary step in deploying the PTP function is the determination of the GTT applied uniformly across the patient population. The GTT is defined as the threshold value representing the proportion of images the model must classify as indicative of tumors within all the images belonging to a patient to designate it as a tumor case. To establish this threshold, we adopted an approach wherein we provided the entire patient data set, comprising patients from the training and validation sets, in its original format and fed it into the trained model. Subsequently, we allowed the PTP function to calculate the number of tumor-indicative images per patient across the entire training and validation sets. Based on those values, we picked a suitable yet reliable GTT threshold value. The details of the GTT calculation further explained in the results section.

III-E Detection Model

Within the domain of computer vision, the YOLO series of algorithms, developed by Ultralytics, has gained significant prominence and widespread usage, primarily due to their remarkable ability to achieve high accuracy while maintaining a compact model size. This attribute makes YOLO accessible to many developers, as it can be effectively trained on a single GPU. The latest advancement in the YOLO framework, known as YOLOv8, exhibits versatility across various applications such as object identification, image categorization, and segmentation. Over time, YOLO models have undergone multiple iterations, each building upon its predecessor to address prior limitations and enhance overall performance.

The architectural modifications introduced in YOLOv8 have significantly improved its object detection capabilities, especially in complex and noisy scenes, surpassing the performance of earlier YOLO versions. As shown in Figure 5, this updated architecture incorporates a backbone comprising a sequence of convolutional layers responsible for extracting features at different resolutions. These extracted features are subsequently processed through a neck module for consolidation before channeling into the detection head [35].

One of the distinctive characteristics of the YOLOv8 architecture lies in its anchor-free design. This method eliminates the necessity for predefined reference anchor points, hence making the model way more efficient. In addition, the anchor-free brings a higher level of adaptability in scales and aspect ratios for YOLOv8.

Although almost all the loss functions are intact in YOLOv8 compared to the previous version, the V8 design deviates from conventional objectness loss and utilizes distributional focal loss instead. The distributional focal loss presents a novel approach to object detection by treating the continuous distribution of box locations as a discretized probability distribution. Instead of regarding box locations as exact coordinates, this perspective views them as probability distributions [36].

In our approach, based on what we observed from YOLOv8 for each version performance, as illustrated in Figure 4, we picked the YOLOv8n as the backbone model for the tumor detection phase. The decision was primarily influenced by its relatively compact model size and faster processing speed compared to the other variations. Moreover, YOLOv8n exhibited substantial performance enhancements when contrasted with its predecessors from versions ”v5” to ”v7”. To achieve this, we harnessed the formidable computational capabilities of the YOLOv8n object detection model, leveraging its pre-trained weights to train it specifically for tumor tissue and region detection and fine-tuning it on our purposefully skewed training data set.

III-F Brain Tumor Classification

This section covers the second phase of the proposed pipeline, the brain tumor classification step. In this step, the Knowledge Distillation (KD) technique is deployed to train a lightweight ViT from the ResNet152 model on our classification dataset (Figure 7).

KD [37], is a training method in which a less powerful student model learns from the guidance of a more capable teacher network. Unlike traditional training, where only the teacher’s highest-scoring outputs (hard labels) are considered, KD uses the complete output vector generated by the teacher’s Softmax function. This approach not only enhances the performance of the student model but can also be seen as a way of compressing the knowledge contained in the larger teacher model into a smaller, more efficient student model. In the context of KD, following up on the further advancement of computer vision, DeiT [12] marks a significant stride by initially applying the KD technique to ViT, aimed at distilling valuable insights from CNNs to enhance training efficiency.

The inherent design of CNNs, characterized by translation-invariant convolution operations, makes them particularly well-suited for image-related tasks, owing to their local inductive bias, which proves advantageous in vision tasks. Therefore, the application of DeiT has the potential to facilitate a swifter and more effective convergence of ViT when deployed in vision-related tasks. In this regard, we employed the KD process, harnessed the computational capacity of ResNet152, and fine-tuned a reliable CNN as the backbone of our classification.

III-G ResNet-152

ResNet-152, Residual Network with 152 layers, is a significant deep CNN architecture primarily tailored for image classification and feature extraction tasks. It belongs to the ResNet model family, is widely recognized for its outstanding performance in computer vision applications, and represents an improved iteration of the original ResNet-152. A notable innovation within ResNet models is the inclusion of residual blocks, which effectively address the challenge of vanishing gradients when training deep neural networks (DNN). The ”vanishing gradients” problem in DNN occurs when gradients become too small during training, hindering the learning process of the model.

III-H Vision Transformer

One of the main components of our pipeline is ViT. This model [38] works by treating pieces of an image like words in a sentence, trying to mimic how the original transformer model was used for understanding language [39]. Unlike the original transformer, which had both an encoder and a decoder, ViT keeps things more straightforward with just an encoder in its design. In ViT, the input image has dimensions $\mathbb{R}^{H\times W\times C}$ . It’s then split into $N$ smaller pieces called patches, each sized at $P\times P\times C$ , where $N=\frac{{HW}}{{p^{2}}}$ (H: height, W: width, C: number of channels)[30].

Next, the model creates a linear representation for these patches and adds position information to these representations to know where each patch is located. Additionally, an additional patch is included in the embedding that can be adjusted through learning. This embedding is used for the final classification step and is processed by a multi-layer Perceptron (MLP) head. Moreover, the combined information from the patches and their position embeddings are taken and passed through a transformer encoder model. This encoder model consists of multiple layers that alternate between multi-headed self-attention and MLP blocks (as illustrated in Figure 4).

III-I Evaluation Metrics

Upon completing the training and testing phases, it is imperative to employ standardized assessment criteria to evaluate the model’s effectiveness. In recent studies, the researchers utilized a range of evaluation metrics, including Precision, Recall, Sensitivity, Specificity, Accuracy, F1-score. These metrics are derived by applying the model to the data set and tallying the occurrences of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). TP denotes instances where the model accurately identified and labeled tumor cases, while FP refers to non-tumor cases incorrectly classified as tumors. FN represents tumors that went unrecognized during the diagnostic process. TN signifies true negatives, where the model’s predictions are aligned with the actual negative cases. As for the classification phase, we utilized a diverse range of evaluation metrics to clearly observe the model’s performance. It is critical to mention that the PTP evaluation metrics are only employed in the tumor detection phase exclusively.

Precision=TP/(TP+FP)

(1)

Recall=TP/(TP+FN)

(2)

Notably, the F1-score, which computes the harmonic mean between Precision and Recall, is frequently regarded as a primary metric for evaluating model performance in situations where data sets exhibit an imbalance in class distribution.

F1=2\times(Precision\times Recall)/(Precision+Recall)

(3)

III-J Classification Dataset

During the classification phase, we accurately combined two datasets (www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset, accessed on August 2023 and Figshare dataset [40]) for our training and validation sets and also divided a set for benchmarking the model. To diversify and increase the number of samples per class, we took similar to detection phase data pre-processing steps.

III-K Computational Resources

We primarily utilized a local system with a single CUDA-enabled Nvidia GeForce GTX 1650 GPU. On occasion, we also used Google Colab, which provided 12 GB of RAM and a T4 GPU. For model training, validation, and testing, we employed PyTorch version 2.0.1+cu117 with Python 3.9.7.

Furthermore, we deployed the ViT model using the P Wang GitHub repository [41] implementation, while the ResNet152 model was trained and evaluated using the Torch Vision models module. Custom Python scripts were developed as needed in data cleaning and bounding box augmentation modules to support our workflow.

IV Results

In this section, we will elaborate on the experiments and the results we achieved from deploying the proposed pipeline. First, we are going to explain the detection phase results, and then we will transition to the next stage, which is classification.

IV-A Detection Results

The initial step in this phase was data pre-processing. After conducting a comprehensive data prepration and transforming the data set into a suitable format, we loaded the YOLOv8n model pre-trained weights, tailored its hyper-parameters and fine-tuned it on our detection dataset. For this phase, we used the dataset we acquired from NBML for both the training and validation steps.

In first training attempt we ended up with a precise model with low recall score as we were inducing unwanted misinformation into the model’s learning path. The bounding boxes that we specified for tumor regions were accurate in the majority of the augmentations including the vertical and horizontal flips.

TABLE I: YOLOv8n Model Validation Box Results over 50 Epochs

Data	Precision	Recall	mAP50
With-Rotated Samples	0.74	0.58	0.68
Without-Rotated Samples	0.87	0.71	0.80

However, the rotation of the image and the corresponding object of interest (Tumor) was injecting false information into the Tumor class. This happened when we attempted to rotate the bounding box around its initial center using our custom module, and some non-tumor regions were injected in the new bounding box region. Therefore, we decided to stay with more reliable augmentation effects such as vertical, horizontal flips and brightness intensity.

Furthermore, in consideration of our constrained access to computational resources, we opted for the most compact iteration of YOLOv8, denoted as Nano. We employed larger batch sizes to expedite the training duration per epoch. The consideration of employing advanced versions of YOLOv8, such as YOLOv8m or YOLOv8L, suggests the possibility of achieving improved results. Nevertheless, it is important to acknowledge that this choice comes with the trade-off of extended training periods and increased demands on computational resources.

TABLE II: YOLOv8n Model Configuration

Opt	Sched	lr0	lrf	AMP	Epochs
SGD	CosLR	0.01	0.00001	False	50

Opt: Optimizer, Sched: Scheduler, lr0: Initial learning rate, lrf: Final learning rate, AMP: Automatic Mixed Precision.

TABLE III: YOLOv8n Model Evaluation Results

Class	Precision	Recall	F1	Support
Tumor	0.99	0.96	0.97	1905
Normal	0.99	0.99	0.99	20750
AVG	0.99	0.975	0.98	22655

We trained the mentioned model for 50 epochs, and the evaluation results indicated (Table III) that the model is highly accurate in detecting tumor-confiscated images, and despite being agile and super lightweight with only 3.2M parameters, it does have a reliable performance.

TABLE IV: YOLOv8n Model PTP Evaluation Results

Support	Accuracy	F1	Precision	Recall
3 Tumor Cases	1.0	1.0	1.0	1.0
27 Normal Cases	1.0	1.0	1.0	1.0

Furthermore, the model manages to achieve significant scores in our PTP evaluation, as detailed in Table IV. The value of GTT is the output of a meticulous data analysis step in our training and evaluation sets. We incorporated the training and validation sets in their original format into the model and calculated the value of GTT for each patient of these sets. After a exploratory analysis of the patient specific tumor threshold (Figure 8) values among the both Normal and Tumor cases, we estimated the value of the GTT to be at least be 0.04%. This value is calculated from the average of the first quartile and the median values of tumor-indicative distribution.

IV-B Classification Results

TABLE V: DeiT hyper-parameters tuning experiments

No.	Hard Distillation	Temperature	Depth	Patch Size	Dimension	Attention Head	MLP Dim	Val-Accuracy
1	False (Default)	2 (Default)	4 (Default)	24 (Default)	256 (Default)	16 (Default)	128 (Default)	81.91
2	True	2 (Default)	4 (Default)	24 (Default)	256 (Default)	16 (Default)	128 (Default)	84.74
3	True	1	4 (Default)	24 (Default)	256 (Default)	16 (Default)	128 (Default)	83.22
4	True	9	4 (Default)	24 (Default)	256 (Default)	16 (Default)	128 (Default)	81.69
5	True	3	6	24 (Default)	256 (Default)	16 (Default)	128 (Default)	82.35
6	True	3	2	32	256 (Default)	16 (Default)	128 (Default)	85.40
7	True	3	2	24	256 (Default)	16 (Default)	128 (Default)	86.05
8	True	3	2	24	1024	16 (Default)	128 (Default)	68.19
9	True	3	2	24	128	16 (Default)	128 (Default)	85.40
10	True	3	2	24	512	16 (Default)	128 (Default)	74.29
11	True	3	2	24	128	64	128 (Default)	88.67
12	True	3	2	24	128	64	256	88.45
13	True	3	2	24	128	64	2048	87.58
14	True	3	2	24	128	64	512	89.76

The approach we took for this phase is knowledge distilation based on DeiT architecture. Using the custom data augmentation module, we intially created a relatively large dataset from training set which is obtained from Figshare dataset. Then, we fine-tuned a heavy weight yet strong teacher model using ResNet152 architecture.

During the process of distillation, having an access to a strong teacher model is a necessity. By utilizing the DeiT architecture, the teacher model (ResNet152) expertise in classifying the object of interest using its rich CNN kernels would flow into the student model and eventually trains an efficient student model from scratch. The obvious benefit of DeiT architecture is that it enables us to train compact ViT models from rich and stronger teachers. This is specifically useful when we have limited samples of data.

Proceeding to the subsequent stage, we initiated the hyper-parameters tuning process for the DeiT model. We explored 14 different architectures to determine optimal values for hyper-parameters. The specifics of these experiments are delineated in table V.

Based on our observations, we settled on a DeiT model and fine-tuned it using our dataset. We allocated 15% of the original dataset specifically for benchmarking the classifier. It’s noteworthy that we utilized the same set of data for both the teacher and student models. This measure was implemented to prevent any potential bias or misleading results during testing, ensuring the integrity of our results and guarding against any form of unwanted data manipulation.

TABLE VI: Teacher Classifier Test Results

Tumor Class	Precision	Recall	F1	Support
Meningioma	0.92	0.91	0.91	107
Glioma	0.99	0.97	0.98	214
Pituitary	0.94	0.97	0.96	140
Weighted AVG	0.97	0.97	0.97	461

TABLE VII: Distilled Student Classifier Test Results

Tumor Class	Precision	Recall	F1	Support
Meningioma	0.82	0.82	0.82	107
Glioma	0.95	0.92	0.93	214
Pituitary	0.95	0.99	0.97	140
Weighted AVG	0.92	0.92	0.92	461

V Discussion and Conclusion

In conclusion, this article delves into the comprehensive evaluation of state-of-the-art models, specifically YOLOv8 and DeiT, for the task of tumor detection and classification. The distinctive contribution lies in the introduction of novel performance metrics, notably PTP, into the evaluation framework. These metrics not only assess the models’ proficiency in accurately detecting and classifying brain tumors but also gauge their ability to make informed decisions regarding the overall patient condition.

To rigorously assess the models, we curated a new dataset comprising Tumor and Normal cases, maintaining a conservative ratio of 10% for tumor cases in the entire population. The detection results, evaluated using common metrics, demonstrated a commendable performance, with an F1-Score of 0.98. Notably, the model showcased robust accuracy even in anomalous scenarios, achieving a perfect score of 1.0 as measured by the PTP-F1 metric.

In the final phase, we adopted the DeiT architecture and fine-tuned a lightweight student ViT using a ResNet152-based teacher model. Despite training the model for a limited number of epochs, the process of distilling knowledge from a complex CNN into a more compact student model proved to be a promising direction for our task. The student model achieved a test accuracy of 0.92 within 20 training epochs.

It is essential to note that the results presented in classification section rely solely on common evaluation metrics, and PTP metrics were not applied. This decision stems from the belief that the design logic of PTP is currently tailored for the detection phase. However, exploring the integration of PTP-like metrics in classification remains an intriguing avenue for future research endeavors. We acknowledge that the current value represented by GTT may not be perfectly precise for practical deployment, and there is potential for refinement within a broader population of patients.

VI References

References

[1] D. N. Louis, A. Perry, G. Reifenberger, A. Von Deimling, D. Figarella-Branger, W. K. Cavenee, H. Ohgaki, O. D. Wiestler, P. Kleihues, and D. W. Ellison, “The 2016 world health organization classification of tumors of the central nervous system: a summary,” Acta neuropathologica, vol. 131, pp. 803–820, 2016.
[2] G. S. Tandel, M. Biswas, O. G. Kakde, A. Tiwari, H. S. Suri, M. Turk, J. R. Laird, C. K. Asare, A. A. Ankrah, N. Khanna et al., “A review on a deep learning perspective in brain cancer classification,” Cancers, vol. 11, no. 1, p. 111, 2019.
[3] A. K. Anaraki, M. Ayati, and F. Kazemi, “Magnetic resonance imaging-based brain tumor grades classification and grading via convolutional neural networks and genetic algorithms,” biocybernetics and biomedical engineering, vol. 39, no. 1, pp. 63–74, 2019.
[4] R. Augustine, A. Al Mamun, A. Hasan, S. A. Salam, R. Chandrasekaran, R. Ahmed, and A. S. Thakor, “Imaging cancer cells with nanostructures: Prospects of nanotechnology driven non-invasive cancer diagnosis,” Advances in Colloid and Interface Science, vol. 294, p. 102457, 2021.
[5] K. Popuri, D. Cobzas, A. Murtha, and M. Jägersand, “3d variational brain tumor segmentation using dirichlet priors on a clustered feature set,” International journal of computer assisted radiology and surgery, vol. 7, pp. 493–506, 2012.
[6] “Brain Tumors and Brain Cancer,” 2023, [Online; accessed 31. Aug. 2023]. [Online]. Available: https://www.hopkinsmedicine.org/health/conditions-and-diseases/brain-tumor
[7] J. Kang, Z. Ullah, and J. Gwak, “Mri-based brain tumor classification using ensemble of deep features and machine learning classifiers,” Sensors, vol. 21, no. 6, p. 2222, 2021.
[8] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[10] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[11] “Brain Tumor - Statistics,” 2023, [Online; accessed 31. Aug. 2023]. [Online]. Available: https://www.cancer.net/cancer-types/brain-tumor/statistics
[12] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International conference on machine learning. PMLR, 2021, pp. 10 347–10 357.
[13] M. Devi, S. Maheswaran et al., “An efficient method for brain tumor detection using texture features and svm classifier in mr images,” Asian Pacific journal of cancer prevention: APJCP, vol. 19, no. 10, p. 2789, 2018.
[14] E. I. Zacharaki, S. Wang, S. Chawla, D. Soo Yoo, R. Wolf, E. R. Melhem, and C. Davatzikos, “Classification of brain tumor type and grade using mri texture and shape in a machine learning scheme,” Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine, vol. 62, no. 6, pp. 1609–1618, 2009.
[15] S. Shrot, M. Salhov, N. Dvorski, E. Konen, A. Averbuch, and C. Hoffmann, “Application of mr morphologic, diffusion tensor, and perfusion imaging in the classification of brain tumors using machine learning scheme,” Neuroradiology, vol. 61, pp. 757–765, 2019.
[16] S. Deepak and P. Ameer, “Retrieval of brain mri with tumor using contrastive loss based similarity on googlenet encodings,” Computers in biology and medicine, vol. 125, p. 103993, 2020.
[17] Z. N. K. Swati, Q. Zhao, M. Kabir, F. Ali, Z. Ali, S. Ahmed, and J. Lu, “Brain tumor classification for mr images using transfer learning and fine-tuning,” Computerized Medical Imaging and Graphics, vol. 75, pp. 34–46, 2019.
[18] Y. Zhuge, H. Ning, P. Mathen, J. Y. Cheng, A. V. Krauze, K. Camphausen, and R. W. Miller, “Automated glioma grading on conventional mri images using deep convolutional neural networks,” Medical physics, vol. 47, no. 7, pp. 3044–3053, 2020.
[19] R. Pomponio, G. Erus, M. Habes, J. Doshi, D. Srinivasan, E. Mamourian, V. Bashyam, I. M. Nasrallah, T. D. Satterthwaite, Y. Fan et al., “Harmonization of large mri datasets for the analysis of brain imaging patterns throughout the lifespan,” NeuroImage, vol. 208, p. 116450, 2020.
[20] M. A. Naser and M. J. Deen, “Brain tumor segmentation and grading of lower-grade glioma using deep learning in mri images,” Computers in biology and medicine, vol. 121, p. 103758, 2020.
[21] Ö. Polat and C. Güngen, “Classification of brain tumors from mr images using deep transfer learning,” The Journal of Supercomputing, vol. 77, no. 7, pp. 7236–7252, 2021.
[22] H. A. Khan, W. Jue, M. Mushtaq, and M. U. Mushtaq, “Brain tumor classification in mri image using convolutional neural network,” Mathematical Biosciences and Engineering, 2021.
[23] M. M. Badža and M. Č. Barjaktarović, “Classification of brain tumors from mri images using a convolutional neural network,” Applied Sciences, vol. 10, no. 6, p. 1999, 2020.
[24] E. U. Haq, H. Jianjun, K. Li, H. U. Haq, and T. Zhang, “An mri-based deep learning approach for efficient classification of brain tumors,” Journal of Ambient Intelligence and Humanized Computing, pp. 1–22, 2021.
[25] A. Sekhar, S. Biswas, R. Hazra, A. K. Sunaniya, A. Mukherjee, and L. Yang, “Brain tumor classification using fine-tuned googlenet features and machine learning algorithms: Iomt enabled cad system,” IEEE journal of biomedical and health informatics, vol. 26, no. 3, pp. 983–991, 2021.
[26] N. S. Shaik and T. K. Cherukuri, “Multi-level attention network: application to brain tumor classification,” Signal, Image and Video Processing, vol. 16, no. 3, pp. 817–824, 2022.
[27] M. F. Alanazi, M. U. Ali, S. J. Hussain, A. Zafar, M. Mohatram, M. Irfan, R. AlRuwaili, M. Alruwaili, N. H. Ali, and A. M. Albarrak, “Brain tumor/mass classification framework using magnetic-resonance-imaging-based isolated and developed transfer deep-learning model,” Sensors, vol. 22, no. 1, p. 372, 2022.
[28] B. Ahmad, J. Sun, Q. You, V. Palade, and Z. Mao, “Brain tumor classification using a combination of variational autoencoders and generative adversarial networks,” Biomedicines, vol. 10, no. 2, p. 223, 2022.
[29] M. I. Sharif, M. A. Khan, M. Alhussein, K. Aurangzeb, and M. Raza, “A decision support system for multimodal brain tumor classification using deep learning,” Complex & Intelligent Systems, pp. 1–14, 2021.
[30] A. A. Asiri, A. Shaf, T. Ali, U. Shakeel, M. Irfan, K. M. Mehdar, H. T. Halawani, A. H. Alghamdi, A. F. A. Alshamrani, and S. M. Alqhtani, “Exploring the power of deep learning: Fine-tuned vision transformer for accurate and efficient brain tumor detection in mri scans,” Diagnostics, vol. 13, no. 12, p. 2094, 2023.
[31] S. Tummala, S. Kadry, S. A. C. Bukhari, and H. T. Rauf, “Classification of brain tumor from magnetic resonance imaging using vision transformers ensembling,” Current Oncology, vol. 29, no. 10, pp. 7498–7511, 2022.
[32] A. B. Abdusalomov, M. Mukhiddinov, and T. K. Whangbo, “Brain Tumor Detection Based on Deep Learning Approaches and Magnetic Resonance Imaging,” Cancers, vol. 15, no. 16, August 2023.
[33] “Cancer of the Brain and Other Nervous System - Cancer Stat Facts,” 2023, [Online; accessed 31. Aug. 2023]. [Online]. Available: https://seer.cancer.gov/statfacts/html/brain.html
[34] “Brief summary of YOLOv8 model structure $\cdot$ Issue #189 $\cdot$ ultralytics/ultralytics,” august 2023, [Online; accessed 13. Aug. 2023]. [Online]. Available: https://github.com/ultralytics/ultralytics/issues/189
[35] J. Solawetz, “What is YOLOv8? The Ultimate Guide.” Roboflow Blog, December 2023. [Online]. Available: https://blog.roboflow.com/whats-new-in-yolov8
[36] W. M. Elmessery, J. Gutiérrez, G. G. Abd El-Wahhab, I. A. Elkhaiat, I. S. El-Soaly, S. K. Alhag, L. A. Al-Shuraym, M. A. Akela, F. S. Moghanm, and M. F. Abdelshafie, “Yolo-based model for automatic detection of broiler pathological phenomena through visual and thermal images in intensive poultry houses,” Agriculture, vol. 13, no. 8, p. 1527, 2023.
[37] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
[38] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[40] J. Cheng, “Brain tumor dataset,” figshare. Dataset, vol. 1512427, no. 5, 2017.
[41] “vit-pytorch,” 2023, [Online; accessed 31. Aug. 2023]. [Online]. Available: https://github.com/lucidrains/vit-pytorch