License: arXiv.org perpetual non-exclusive license
arXiv:2402.01067v1 [eess.IV] 01 Feb 2024

Assessing Patient Eligibility for Inspire Therapy through Machine Learning and Deep Learning Models

Mohsena Chowdhury (Toronto Metropolitan University, Toronto, ON, Canada), Tejas Vyas (Toronto Metropolitan University, Toronto, ON, Canada), Rahul Alapati (University of Kansas Medical Center, Kansas City, KS, USA), Andrés M Bur (University of Kansas Medical Center, Kansas City, KS, USA), and Guanghui Wang (Toronto Metropolitan University, Toronto, ON, Canada)
(2024)
Abstract.

Inspire therapy is an FDA-approved internal neurostimulation treatment for obstructive sleep apnea. However, not all patients respond to this therapy, which makes determining candidacy a challenge even for experienced otolaryngologists. This paper makes the first attempt to leverage both machine learning and deep learning techniques to discern patient responsiveness to Inspire therapy using medical data and videos captured through Drug-Induced Sleep Endoscopy (DISE), an essential procedure for Inspire therapy. To achieve this, we gathered and annotated three datasets from 127 patients. Two of these datasets comprise endoscopic videos focused on the Base of the Tongue and Velopharynx. The third dataset comprises the patients' clinical information. Using these datasets, we benchmarked and compared the performance of six deep learning models and five classical machine learning algorithms. The results demonstrate the potential of employing machine learning and deep learning techniques to determine a patient's eligibility for Inspire therapy, paving the way for future advancements in this field.

Inspire therapy, DISE video, base of the tongue, velopharynx, machine learning, deep learning, classification
Conference: May 24–26, 2024; Chongqing, China

1. Introduction

Obstructive sleep apnea is a common sleep disorder that impacts millions of people worldwide. It is characterized by repetitive episodes of partial or complete obstruction of the upper airway during sleep, leading to fragmented sleep and decreased oxygen levels in the blood. Treatment for obstructive sleep apnea aims to reduce daytime sleepiness and the morbidity and mortality associated with increased risks of ischemic heart disease, cardiac arrhythmias, hypertension, and other vascular complications (Gasparini et al., 2021)(Lévy et al., 2015).

In previous studies, AI-based technologies have demonstrated great potential in the diagnosis and treatment of patients with obstructive sleep apnea. By utilizing AI in sleep medicine, clinicians can enhance their ability to accurately diagnose and tailor treatment plans for individual patients (Molnár et al., 2022)(Su and Lu, 2023). AI technologies can analyze sleep patterns and identify specific markers of obstructive sleep apnea, allowing for more efficient and accurate diagnoses. They can also assist in monitoring the effectiveness and adherence of treatment, making adjustments as needed to optimize patient outcomes (Huang et al., 2022)(Molnár et al., 2022)(Van den Bossche et al., 2021). Additionally, AI can aid in identifying patients who may not respond well to traditional treatment methods, such as continuous positive airway pressure or mandibular advancement devices, guiding clinicians in considering alternative therapies like positional therapy (Brennan and Kirby, 2023).

Most prior AI-based studies applied machine learning and predictive analytics to large amounts of clinical data to identify patterns and markers of obstructive sleep apnea (Huang et al., 2022)(Molnár et al., 2022). In recent years, ML-based classification of snores produced in different airway states has been used to guide OSA treatment before Drug-Induced Sleep Endoscopy (DISE) (Huang et al., 2022)(Liu et al., 2022). Deep learning-based approaches have seen very limited use on DISE videos, mainly to identify airway collapse patterns and locations (Hanif et al., 2021) and severity scores (Hanif et al., 2023).

In this study, we employ endoscopic images obtained from the DISE procedure, where patients are sedated to simulate sleep and the upper airway is evaluated to determine eligibility for surgical treatment for obstructive sleep apnea (OSA). This research aims to predict which patients will respond to Inspire, a surgical implant designed to stimulate tongue movement. The prediction categorizes patients as Responders or Non-responders based on the analysis of endoscopy images. This study is the first endeavor to predict a patient's response with an emphasis on base of tongue (BOT) or velopharynx (VP) throat region images. We implemented and compared the performance of five machine learning algorithms: Decision Tree (Ke et al., 2017), Gradient Boosting (Ke et al., 2017), k-nearest Neighbors (Peterson, 2009), Logistic Regression (Sperandei, 2014), and Random Forest (Biau and Scornet, 2016), as well as six deep learning models: VGG-16 (Simonyan and Zisserman, 2014), ResNet-50 (Jian et al., 2016), ResNet-101 (Jian et al., 2016), EfficientNet-B0 (Tan and Le, 2019), DenseNet-121 (Huang et al., 2017), and DenseNet-169 (Huang et al., 2017).

The major contributions of this paper are as follows:

  • We conducted the first study with the objective of predicting the patient response to assess eligibility for Inspire therapy using endoscopy images from Drug-Induced Sleep Endoscopy videos.

  • We generated and annotated three datasets from a cohort of 127 patients, totaling 24,777 image frames. The dataset encompasses 88 cases of responders and 39 cases of non-responders.

  • We implemented and benchmarked the performance of five machine learning algorithms and six deep neural network models utilizing the generated datasets.

2. Background

Obstructive Sleep Apnea (OSA) is a widespread health disorder marked by recurrent instances of upper airway collapse during sleep. This condition causes disrupted sleep patterns and chronic hypoxemia, which in turn lead to various secondary health implications, including hypertension, cardiovascular disease, and cognitive impairment. Continuous Positive Airway Pressure (CPAP) has traditionally served as the primary treatment for OSA. However, challenges with patient tolerance often hinder satisfactory compliance with this device. An alternative therapy is Inspire, an FDA-approved hypoglossal nerve stimulator. The hypoglossal nerve controls tongue protrusion and retraction. Stimulating specific branches that induce forward tongue protrusion makes the muscles in the neck rigid, preventing collapse and subsequent airway obstruction. This stimulation is synchronized with the inhalation phase, ensuring that the tongue protrudes during inspiration, the period when neck muscles are most susceptible to collapse and airway obstruction.

The evaluation of eligibility for Inspire involves a procedure called Drug-Induced Sleep Endoscopy (DISE). This entails administering a light amount of anesthesia to simulate sleep. The airway is then observed at four distinct locations: the velopharynx, oropharynx, tongue base, and epiglottis (VOTE). At each location, the surgeon assesses the orientation and degree of collapsibility of the airway using the VOTE score. This score is crucial in determining the most suitable therapies for patients, as specific patterns of airway collapse may not respond uniformly to all treatments. Fig. 1 illustrates the location of the velopharynx, oropharynx, tongue base, and epiglottis in the airway and the VOTE score criteria and classification (Kastoer et al., 2018)(Qian et al., 2016).

Figure 1. The VOTE score criteria and classification (Kastoer et al., 2018)(Qian et al., 2016).

For example, the Inspire Hypoglossal Nerve stimulator is approved for patients with predominant anterior-posterior velopharyngeal collapse. This means that this therapy is mostly beneficial for patients with anterior and posterior wall collapse during inspiration at the level of the velopharynx. If patients had concentric collapse, meaning the whole airway shrinks in the shape of a circle, at the level of the velopharynx, the Inspire device is less likely to alleviate airway obstruction. Similarly, if patients had predominant lateral wall collapse, meaning the sides of the upper airway collapse inward, the Inspire device is again less likely to be therapeutic. Additional FDA criteria for the hypoglossal nerve stimulator include $AHI \geq 15$, $AHI \leq 100$, and $BMI \leq 40\,kg/m^{2}$.
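The criteria listed in this paragraph can be written down as a simple rule check. The following is a minimal sketch that only encodes the thresholds and the collapse pattern as stated in the text; the function name, the argument names, and the string labels for collapse patterns are hypothetical and chosen for illustration.

```python
def meets_listed_criteria(ahi: float, bmi: float, vp_collapse: str) -> bool:
    """Check the criteria listed in the text (illustrative encoding only).

    ahi         -- Apnea-Hypopnea Index in events/hour
    bmi         -- body mass index in kg/m^2
    vp_collapse -- predominant collapse pattern at the velopharynx,
                   one of the hypothetical labels "anterior-posterior",
                   "concentric", or "lateral"
    """
    ahi_ok = 15 <= ahi <= 100          # AHI >= 15 and AHI <= 100
    bmi_ok = bmi <= 40                 # BMI <= 40 kg/m^2
    pattern_ok = vp_collapse == "anterior-posterior"
    return ahi_ok and bmi_ok and pattern_ok


# Hypothetical patients (values are illustrative, not from the study).
print(meets_listed_criteria(ahi=32.0, bmi=29.5, vp_collapse="anterior-posterior"))  # True
print(meets_listed_criteria(ahi=32.0, bmi=29.5, vp_collapse="concentric"))          # False
print(meets_listed_criteria(ahi=12.0, bmi=29.5, vp_collapse="anterior-posterior"))  # False
```

Meeting these criteria is a precondition for the therapy, not a prediction of response, which is the question the learning models in this paper address.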

3. Dataset

In this study, we collected data from 127 patients diagnosed at the Department of Otolaryngology-Head and Neck Surgery at the University of Kansas Medical Center (KUMC). The endoscopic images were captured during DISE, a procedure involving patient sedation to simulate sleep while evaluating the upper airway for eligibility for surgical treatment for OSA. The dataset comprises 24,777 images of throat regions focusing on the base of the tongue (BOT) or velopharynx (VP). This dataset is categorized into “responder” and “non-responder” classes, indicating the respective patient groups that exhibited a “response to the therapy” and those who “did not respond to the therapy,” as determined by the Sher criteria ($AHI < 15$ events/hour and a $> 50\%$ decrease in post-operative AHI from the pre-operative value). The characteristics of the dataset are shown in Table 1.
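The labeling rule above amounts to two conditions that must hold together. A minimal sketch follows, assuming the pre- and post-operative AHI are available as plain numbers; the function and argument names are hypothetical.

```python
def is_responder(pre_op_ahi: float, post_op_ahi: float) -> bool:
    """Return True if the Sher criteria, as stated in the text, are met.

    Both conditions must hold:
      1. post-operative AHI below 15 events/hour
      2. more than a 50% decrease from the pre-operative AHI
    """
    if pre_op_ahi <= 0:
        raise ValueError("pre-operative AHI must be positive")
    relative_decrease = (pre_op_ahi - post_op_ahi) / pre_op_ahi
    return post_op_ahi < 15 and relative_decrease > 0.5


# Hypothetical patients (values are illustrative, not from the study).
print(is_responder(pre_op_ahi=40.0, post_op_ahi=10.0))  # True: AHI < 15 and 75% decrease
print(is_responder(pre_op_ahi=40.0, post_op_ahi=18.0))  # False: AHI is still >= 15
print(is_responder(pre_op_ahi=20.0, post_op_ahi=12.0))  # False: only a 40% decrease
```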

Fig. 2 displays sample images of four patients, two categorized as “responder” and two as “non-responder,” showcasing both BOT and VP image frames for each patient. Notably, distinguishing between these two patient classes based solely on their BOT or VP images proves challenging, even for experienced otolaryngologists and clinical experts in this field. The imaging process introduces significant variations in illumination, viewpoint, occlusion, and reflection, as evident in these images. Moreover, images from the same patient may exhibit substantial visual differences, further complicating the problem. One objective of this study is to investigate whether current deep learning techniques can autonomously extract pertinent features capable of discerning between “responder” and “non-responder” based on their BOT and/or VP images.

Table 1. Characteristics of the dataset
Total patients | Patients (responder) | Patients (non-responder) | Images (responder) | Images (non-responder) | Min frame resolution | Max frame resolution
127 | 88 | 39 | 16,515 | 8,262 | 720×480 | 1920×1080
Figure 2. Sample image frames of four patients sampled from their corresponding BOT and VP sequences. Two patients are responders (1st row) and two are non-responders (2nd row).

We collected two video sequences for each patient corresponding to BOT and VP. Thus, we created two distinct datasets for binary classification to assess prediction accuracy based on different throat regions. The VP dataset comprises a total of 11,298 images, with 64% belonging to responders and 36% to non-responders. The BOT dataset consists of 13,479 video images with approximately 69% responders and 31% non-responders. To obtain more statistically meaningful results, we employ 10-fold cross-validation by randomly dividing the entire dataset into 10 equal patient-wise folds. Due to variations in the number of collected frames from different patients, our training and test datasets exhibit diverse ranges. Additionally, we compare the results obtained from these two datasets in combination. The dataset sizes used for all deep learning classification models are detailed in Table 2.

Table 2. Statistical information of the dataset.
Dataset | Images | Images (responder) | Images (non-responder)
VP | 11,298 | 7,257 | 4,041
BOT | 13,479 | 9,258 | 4,221
Combined (VP+BOT) | 24,777 | 16,515 | 8,262
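The patient-wise 10-fold split described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the seed, and the synthetic frame list are hypothetical. Because 127 is not divisible by 10, the folds in this sketch are only near-equal in patient count, and the number of frames per fold varies with the number of frames per patient.

```python
import random


def patient_wise_folds(patient_ids, n_folds=10, seed=0):
    """Shuffle the unique patient IDs and deal them into n_folds groups."""
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(unique_ids)
    return [unique_ids[i::n_folds] for i in range(n_folds)]


def split_frames(frames, test_patients):
    """Split (patient_id, frame_name) pairs so no patient spans both sets."""
    test_set = set(test_patients)
    train = [f for f in frames if f[0] not in test_set]
    test = [f for f in frames if f[0] in test_set]
    return train, test


# Synthetic data: 127 patients with a varying number of frames each.
frames = [(pid, f"frame_{k}") for pid in range(127) for k in range(1 + pid % 5)]
folds = patient_wise_folds([pid for pid, _ in frames])

# Use fold 0 as the test set and the other nine folds for training.
train, test = split_frames(frames, folds[0])
train_patients = {pid for pid, _ in train}
test_patients = {pid for pid, _ in test}
print(len(folds), sum(len(f) for f in folds))     # 10 127
print(train_patients.isdisjoint(test_patients))   # True
```

Reusing the same `folds` list for the clinical data keeps the patient IDs in each trial identical across modalities.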

In addition to the video datasets, we gathered comprehensive clinical information for all patients. The clinical data encompasses 22 essential features, including race, ethnicity, BMI level, Pre-operative Apnea-Hypopnea Index, Sleep Apnea Severity, OSA-severity, responder status, and more. We explored the prospect of making predictions solely based on the clinical data. To achieve this, we implemented and compared five machine learning-based classification models. We applied 10-fold cross-validation using the same split as the video data for a robust evaluation of model performance.

4. Methods

In this paper, we implemented and benchmarked the performance of six deep learning models on the video datasets and five ML algorithms on the clinical dataset. In recent years, convolutional neural networks (CNN) have achieved huge success in image classification (McClannahan et al., 2021)(Zhang et al., 2023), object detection (Bur et al., 2023)(Li et al., 2021), and segmentation (Patel et al., 2022)(Xiao et al., 2023). We implemented the following classical CNN models.

  • VGG is known for its simplicity and effectiveness. It won 1st and 2nd place in the localization and classification tasks, respectively, of the 2014 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Simonyan and Zisserman, 2014).

  • ResNet was introduced to address the vanishing gradient problem in deep networks via residual blocks, allowing for the training of very deep networks (Jian et al., 2016). Two popular structures are ResNet-50 and ResNet-101.

  • EfficientNet-B0 was designed to achieve better performance with fewer parameters by introducing compound scaling to balance network depth, width, and resolution (Tan and Le, 2019).

  • DenseNet introduced densely connected blocks, where each layer receives input from all preceding layers, promoting feature reuse and efficient parameter utilization (Huang et al., 2017). DenseNet-121 and DenseNet-169 are two popular structures.

The following five ML algorithms have been implemented.

  • Logistic Regression (LR) is commonly used for binary classification. The model is well-suited for linearly separable problems, but may struggle with complex relationships (Sperandei, 2014).

  • Decision Tree (DT) uses an interpretable tree structure for hierarchical decision-making. It is simple but prone to overfitting on complex datasets (Ke et al., 2017).

  • Random Forest (RF) is an ensemble method that constructs multiple decision trees during training and outputs the mode of the classes. It can handle complex relationships but may lack interpretability (Biau and Scornet, 2016).

  • k-Nearest Neighbors (k-NN) makes predictions based on the majority class or average of the k-nearest data points in the feature space. It is computationally expensive for large datasets (Peterson, 2009).

  • Gradient Boosting (GB) is an ensemble learning technique that builds a series of weak learners and combines their predictions to form a strong learner. It is powerful for classification but is computationally intensive (Ke et al., 2017).

Table 3. The implementation specifics and hyperparameter settings for each model.
Key settings | VGG-16 | ResNet-50/101 | EfficientNet | DenseNet-121/169
Batch size | 32 | 64 | 32 | 64
Learning rate | 0.0001 | 0.0001 | 0.015 | 0.00001
Loss function | NLLLoss | Cross-entropy | NLLLoss | Cross-entropy
Optimizer | ADAM | SGD | ADAM | ADAM
Dropout | 0.4 | - | - | 0.5
Epochs | 50 | 50 | 50 | 50

Evaluation metrics. To assess the performance of our models, we employed several standard evaluation metrics in image classification, including precision, recall, F1 score, AUC score, and overall accuracy (Vyas et al., 2024). These metrics serve as benchmarks to gauge the effectiveness of our models. Accuracy quantifies the overall correctness of the model’s predictions. It is calculated as the ratio of correctly classified instances to the total instances. Precision represents the ratio of true positive predictions to the total predicted positives, offering insights into the model’s ability to minimize false positives, particularly when their associated costs are high.

Recall measures the ratio of true positive predictions to the total actual positives, reflecting the model’s proficiency in identifying all relevant instances, which is crucial when the cost of false negatives is significant. The F1 score, a harmonic mean of precision and recall, provides a balanced assessment of the model’s performance, particularly valuable in scenarios with class imbalances. The AUC (area under the ROC curve) score is a metric tailored for evaluating classification models, especially in binary classification tasks. The ROC (receiver operating characteristic) curve visually represents the trade-off between sensitivity (true positive rate) and specificity (true negative rate) at various thresholds, with the AUC score quantifying the area under this curve.
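As a small worked example of these definitions, the sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts, and computes AUC as the fraction of positive/negative pairs in which the positive instance receives the higher score (ties count one half). The function names are hypothetical. The counts in the first usage line are illustrative: they correspond to a hypothetical classifier that labels all 127 patients (88 responders, 39 non-responders) as responders, which gives a reference point for reading scores on this imbalanced dataset.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


def auc_score(labels, scores) -> float:
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier that predicts "responder" for every patient.
baseline = classification_metrics(tp=88, fp=39, tn=0, fn=0)
print({k: round(v, 3) for k, v in baseline.items()})
# {'accuracy': 0.693, 'precision': 0.693, 'recall': 1.0, 'f1': 0.819}

print(auc_score([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
print(auc_score([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5]))  # 0.5 (no ranking information)
```

The last line shows why AUC complements accuracy and F1: a model that assigns the same score to every instance can still reach a high F1 on imbalanced data, but its AUC stays at 0.5.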

5. Experiments

5.1. Evaluation of Deep Learning Models

Pre-processing. In this study, the original dataset spans a diverse range of resolutions, from $720 \times 480$ to $1920 \times 1080$. However, the developed deep neural networks require input images of a fixed size of $224 \times 224$. To align with this requirement, we initially downscaled all images in both BOT and VP datasets to the expected input size of the networks. During the data augmentation phase, we introduced random horizontal and vertical shifts within a range of 0.5 and applied rotations up to 35 degrees.

Training settings. We evaluated six deep learning networks, i.e., VGG-16, ResNet-50, ResNet-101, EfficientNet-B0, DenseNet-121, and DenseNet-169, on the two DISE medical image datasets. Additionally, a comparison was made with a combined dataset. Implementation specifics and hyperparameter settings for each model are outlined in Table 3. All models were pre-trained on the ImageNet dataset, and the top layer was fine-tuned using our training set, with a 40% dropout for VGG-16 and a 50% dropout for DenseNet-121/169. All models were trained for 50 epochs. All experiments were executed in PyTorch on an NVIDIA A100 GPU.

Table 4. Patient level accuracy on VP and BOT datasets
Method | Patient acc (VP) | Patient acc (BOT)
VGG-16 | 0.711 ± 0.04 | 0.642 ± 0.05
ResNet-50 | 0.636 ± 0.05 | 0.697 ± 0.15
ResNet-101 | 0.631 ± 0.03 | 0.693 ± 0.11
DenseNet-121 | 0.659 ± 0.09 | 0.645 ± 0.14
DenseNet-169 | 0.713 ± 0.08 | 0.626 ± 0.12
EfficientNet-B0 | 0.505 ± 0.08 | 0.594 ± 0.08

Model performance. Given the limited sample size of the collected dataset, all models were evaluated using 10-fold cross-validation to yield less biased results. The data were randomly divided into 10 equal groups at the patient level to prevent cross-contamination of the training and test sets. We thus conducted 10 experiments for each model, each time using one subset as the test data and training the model on the remaining data. The statistical performance of all models is detailed in Table 5, which presents mean values of frame-level accuracy, training accuracy, F1 score, and AUC score for each model. Insights drawn from the results include: (i) The two best-performing models, VGG-16 and DenseNet-169, exhibited higher accuracy on the Velopharynx (VP) dataset than on the BOT and combined datasets, suggesting that the VP area may contain more discriminative features for classification than the base of the tongue (BOT); this pattern does not hold for every model (ResNet-101, for example, scored higher on BOT). (ii) Most models performed comparably in our experiments. DenseNet-169 achieved the highest frame-level accuracy (0.691 on VP), while EfficientNet yielded lower performance than the other models.

Figure 3. The precision and recall of all models on BOT (left) and VP (right) datasets.
Table 5. Performance of deep learning models (mean ± std).
Method_DB | Frame acc | Train acc | F1 score | AUC
VGG-16_VP | 0.676 ± 0.04 | 0.833 ± 0.05 | 0.780 | 0.595
VGG-16_BOT | 0.608 ± 0.05 | 0.799 ± 0.03 | 0.725 | 0.512
VGG-16_combine | 0.636 ± 0.04 | 0.780 ± 0.04 | 0.763 | 0.494
ResNet-50_VP | 0.626 ± 0.03 | 0.748 ± 0.02 | 0.745 | 0.519
ResNet-50_BOT | 0.608 ± 0.13 | 0.755 ± 0.04 | 0.729 | 0.504
ResNet-50_combine | 0.642 ± 0.05 | 0.734 ± 0.02 | 0.765 | 0.501
ResNet-101_VP | 0.626 ± 0.03 | 0.722 ± 0.01 | 0.732 | 0.509
ResNet-101_BOT | 0.675 ± 0.11 | 0.731 ± 0.02 | 0.790 | 0.501
ResNet-101_combine | 0.674 ± 0.05 | 0.726 ± 0.01 | 0.711 | 0.497
DenseNet-121_VP | 0.633 ± 0.08 | 0.797 ± 0.02 | 0.711 | 0.464
DenseNet-121_BOT | 0.635 ± 0.15 | 0.708 ± 0.04 | 0.680 | 0.464
DenseNet-121_combine | 0.621 ± 0.04 | 0.784 ± 0.05 | 0.682 | 0.461
DenseNet-169_VP | 0.691 ± 0.09 | 0.712 ± 0.03 | 0.728 | 0.510
DenseNet-169_BOT | 0.642 ± 0.13 | 0.721 ± 0.04 | 0.748 | 0.509
DenseNet-169_combine | 0.682 ± 0.06 | 0.778 ± 0.06 | 0.712 | 0.504
EfficientNet_VP | 0.522 ± 0.08 | 0.539 ± 0.03 | 0.400 | 0.500
EfficientNet_BOT | 0.606 ± 0.09 | 0.565 ± 0.04 | 0.578 | 0.500
EfficientNet_combine | 0.506 ± 0.08 | 0.562 ± 0.02 | 0.445 | 0.500

All deep learning models in this study provide predictions for individual images, and accuracy is computed at the image level. However, since the medical dataset was categorized into responder and non-responder classes based on patients, it is more meaningful to make predictions at the patient level (i.e., sequence level). In this paper, we introduce patient accuracy, which measures accuracy at the patient level by taking a majority vote over the frame-level predictions within each sequence. As shown in Table 4, accuracy improves in most cases at the patient level. For instance, the accuracy for VGG on the VP, BOT, and combined datasets increases from 67.6%, 60.8%, and 63.6% to 71.1%, 64.2%, and 68.9%, respectively. Overall, both VGG and DenseNet-169 demonstrate the best performance for the VP dataset.
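The aggregation from frame-level to patient-level predictions can be sketched as a majority vote per patient. This is a minimal illustration, not the authors' implementation: the function names and the toy predictions are hypothetical, and the tie-breaking rule (the label seen first among the tied labels wins) is an assumption, since the text does not specify one.

```python
from collections import Counter, defaultdict


def patient_level_predictions(frame_predictions):
    """Aggregate (patient_id, predicted_label) pairs by majority vote."""
    votes = defaultdict(Counter)
    for patient_id, label in frame_predictions:
        votes[patient_id][label] += 1
    # most_common(1) picks the label with the most votes; on a tie it
    # returns the tied label that was encountered first (an assumption here).
    return {pid: counts.most_common(1)[0][0] for pid, counts in votes.items()}


def patient_accuracy(frame_predictions, true_labels):
    """Fraction of patients whose majority-vote label matches the truth."""
    predicted = patient_level_predictions(frame_predictions)
    correct = sum(predicted[pid] == true_labels[pid] for pid in predicted)
    return correct / len(predicted)


# Toy example: 1 = responder, 0 = non-responder, three frames per patient.
frame_preds = [("P1", 1), ("P1", 1), ("P1", 0),
               ("P2", 0), ("P2", 1), ("P2", 0),
               ("P3", 1), ("P3", 0), ("P3", 0)]
truth = {"P1": 1, "P2": 0, "P3": 1}

frame_acc = sum(label == truth[pid] for pid, label in frame_preds) / len(frame_preds)
print(round(frame_acc, 3))                              # 0.556 (5 of 9 frames)
print(round(patient_accuracy(frame_preds, truth), 3))   # 0.667 (2 of 3 patients)
```

In this toy case the vote absorbs isolated frame errors for P1 and P2, so patient accuracy exceeds frame accuracy, while P3 shows that a patient with mostly wrong frames is still misclassified.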

Precision and Recall. Precision and recall serve as critical metrics in classification tasks, offering insights beyond mere accuracy, particularly valuable for imbalanced datasets or scenarios where the cost of false positives and false negatives varies significantly. In Fig. 3, the precision and recall of all DL models on BOT and VP datasets are illustrated. Notably, DenseNet-169 achieved the highest precision rates of 78.14% for VP and 77.18% for BOT. Meanwhile, VGG-16 and ResNet-101 showed competitively high recall values across the two datasets. Precision quantifies the accuracy of positive predictions by the classifier, while recall measures the classifier’s proficiency in identifying all relevant instances of the positive class. Striking a balance between precision and recall often involves trade-offs tailored to specific needs in practice.

5.2. Evaluation of Machine Learning Models

Training settings. All machine learning models were assessed using the clinical information dataset. The evaluation employed a 10-fold cross-validation approach with identical data splits as those used for the video datasets. Consequently, the training and test sets in each trial shared the same patient IDs as those utilized in the training of deep neural network models. We leverage the Scikit-learn library (Pedregosa et al., 2011) to implement the ML models, and the hyperparameters and settings for these models are detailed below.

  • Decision Tree: Criterion = ‘gini’; Max Depth = None; Min Samples Leaf = 10; Min Samples Split = 2

  • Logistic Regression: C = 0.001; Penalty = ‘none’; Solver = ‘sag’

  • Gradient Boosting: Learning Rate = 0.01; Max Depth = 5; Max Features = ‘log2’; N Estimators = 200; Subsample = 0.7

  • Random Forest: Bootstrap = True; Max Depth = 5; Min Samples Leaf = 1; Min Samples Split = 8; N Estimators = 100

  • k-Nearest Neighbors: Metric = ‘euclidean’; N Neighbors = 30; Weights = ‘uniform’

Model performance. The performance of all ML models is assessed using the following metrics: F1 score, AUC score, and overall accuracy. The evaluation results are presented in Table 6, from which we can see that the performance of the ML models closely aligns with that of the DL models. Among all ML models, Logistic Regression stands out with the highest accuracy at 68.5% and the highest F1 score at 0.813, closely followed by k-Nearest Neighbors (67.5% and 0.804) and Gradient Boosting. A higher F1 score indicates a better balance between precision and recall, positioning Logistic Regression as the top-performing model in terms of accuracy and F1 score. It is noteworthy that, although Gradient Boosting achieves an accuracy of 64.2%, slightly lower than LR and k-NN, it attains the highest AUC score of 0.531 among the ML models. AUC assesses the model’s ability to distinguish between positive and negative instances.

Table 6. Performance of machine learning models.
Algorithm | Accuracy | F1 Score | AUC
Decision Tree | 0.578 ± 0.10 | 0.692 | 0.506
Gradient Boosting | 0.642 ± 0.14 | 0.749 | 0.531
k-Nearest Neighbors | 0.675 ± 0.06 | 0.804 | 0.493
Logistic Regression | 0.685 ± 0.02 | 0.813 | 0.494
Random Forest | 0.636 ± 0.13 | 0.746 | 0.519

In this study, the leading ML algorithm and the top DL model demonstrated comparable accuracy. DL models can automatically learn hierarchical feature representations from raw data and are particularly good at capturing intricate patterns in extensive and complex datasets, thereby eliminating the need for manual feature engineering. However, DL models are computationally expensive and demand substantial training data. Classical ML algorithms, on the other hand, depend less on large datasets and are computationally efficient. Nevertheless, they require manual feature engineering and rely on domain knowledge for feature collection and extraction, which may be challenging for medical applications where relevant information is often limited. A promising approach is to combine the two techniques to capitalize on the advantages each offers.

6. Conclusion

In this paper, we have showcased the potential of employing machine learning and deep learning models to forecast patient responsiveness to Inspire therapy. Two datasets were meticulously collected and annotated from Drug-Induced Sleep Endoscopy (DISE) videos, alongside a clinical information dataset of 127 patients. Six deep learning models and five machine learning models were implemented and evaluated using the curated datasets. The insights derived from this study serve as a valuable reference for future research in this domain. Acknowledging the limitations imposed by the datasets’ restricted size and the limited accuracy achieved when using individual datasets, we are presently developing a multimodal fusion framework. This framework aims to enhance the performance of existing learning models by exploring both video and text data.

Acknowledgements.
The project is partly supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the National Institutes of Health (NIH).

References

  • Biau and Scornet (2016) Gérard Biau and Erwan Scornet. 2016. A random forest guided tour. Test 25 (2016), 197–227.
  • Brennan and Kirby (2023) Hannah L Brennan and Simon D Kirby. 2023. The role of artificial intelligence in the treatment of obstructive sleep apnea. Journal of Otolaryngology-Head & Neck Surgery 52, 1 (2023), 1–6.
  • Bur et al. (2023) Andrés M Bur, Tianxiao Zhang, Xiangyu Chen, et al. 2023. Interpretable Computer Vision to Detect and Classify Structural Laryngeal Lesions in Digital Flexible Laryngoscopic Images. Otolaryngology–Head and Neck Surgery (2023).
  • Gasparini et al. (2021) Giulio Gasparini, Gianmarco Saponaro, Mattia Todaro, et al. 2021. Functional upper airway space endoscopy: a prognostic indicator in obstructive sleep apnea treatment with mandibular advancement devices. International Journal of Environmental Research and Public Health 18, 5 (2021), 2393.
  • Hanif et al. (2021) Umaer Hanif, Eric Kezirian, Eva Kirkegaard Kiær, Emmanuel Mignot, Helge BD Sorensen, and Poul Jennum. 2021. Upper airway classification in sleep endoscopy examinations using convolutional recurrent neural networks. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 3957–3960.
  • Hanif et al. (2023) Umaer Hanif, Eva Kirkegaard Kiaer, Robson Capasso, Stanley Y Liu, Emmanuel JM Mignot, Helge BD Sorensen, and Poul Jennum. 2023. Automatic scoring of drug-induced sleep endoscopy for obstructive sleep apnea using deep learning. Sleep Medicine 102 (2023), 19–29.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4700–4708.
  • Huang et al. (2022) Zhengfei Huang, Pien FN Bosschieter, Ghizlane Aarab, Maurits KA van Selms, Joost W Vanhommerig, Antonius AJ Hilgevoord, Frank Lobbezoo, and Nico De Vries. 2022. Predicting upper airway collapse sites found in drug-induced sleep endoscopy from clinical data and snoring sounds in patients with obstructive sleep apnea: a prospective clinical study. Journal of clinical sleep medicine 18, 9 (2022), 2119–2131.
  • Jian et al. (2016) S Jian, H Kaiming, R Shaoqing, and Z Xiangyu. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision & Pattern Recognition. 770–778.
  • Kastoer et al. (2018) C Kastoer, LBL Benoist, M Dieltjens, B Torensma, LH De Vries, PE Vonk, MJL Ravesloot, and Nico de Vries. 2018. Comparison of upper airway collapse patterns and its clinical significance: drug-induced sleep endoscopy in patients without obstructive sleep apnea, positional and non-positional obstructive sleep apnea. Sleep and Breathing 22 (2018), 939–948.
  • Ke et al. (2017) Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems 30 (2017).
  • Lévy et al. (2015) Patrick Lévy, Malcolm Kohler, Walter T McNicholas, et al. 2015. Obstructive sleep apnoea syndrome. Nature reviews Disease primers 1, 1 (2015), 1–21.
  • Li et al. (2021) Kaidong Li, Nina Y Wang, Yiju Yang, and Guanghui Wang. 2021. Sgnet: A super-class guided network for image classification and object detection. In 2021 18th Conference on Robots and Vision (CRV). IEEE, 127–134.
  • Liu et al. (2022) Yitao Liu, Yang Feng, Yanru Li, Wen Xu, Xingjun Wang, and Demin Han. 2022. Automatic classification of the obstruction site in obstructive sleep apnea based on snoring sounds. American Journal of Otolaryngology 43, 6 (2022), 103584.
  • McClannahan et al. (2021) Brian McClannahan, Cucong Zhong, and Guanghui Wang. 2021. Classification of Long Noncoding RNA Elements Using Deep Convolutional Neural Networks and Siamese Networks. arXiv preprint arXiv:2102.05582 (2021).
  • Molnár et al. (2022) Viktória Molnár, Zoltán Lakner, András Molnár, Dávid László Tárnoki, Ádám Domonkos Tárnoki, László Kunos, Zsófia Jokkel, and László Tamás. 2022. The Predictive Role of the Upper-Airway Adipose Tissue in the Pathogenesis of Obstructive Sleep Apnoea. Life 12, 10 (2022), 1543.
  • Patel et al. (2022) Krushi Bharatbhai Patel, Fengjun Li, and Guanghui Wang. 2022. Fuzzynet: A fuzzy attention module for polyp segmentation. In NeurIPS’22 Workshop on All Things Attention: Bridging Different Perspectives on Attention.
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. the Journal of machine Learning research 12 (2011), 2825–2830.
  • Peterson (2009) Leif E Peterson. 2009. K-nearest neighbor. Scholarpedia 4, 2 (2009), 1883.
  • Qian et al. (2016) Kun Qian, Christoph Janott, Vedhas Pandit, Zixing Zhang, Clemens Heiser, Winfried Hohenhorst, Michael Herzog, Werner Hemmert, and Björn Schuller. 2016. Classification of the excitation location of snore sounds in the upper airway by acoustic multifeature analysis. IEEE Transactions on Biomedical Engineering 64, 8 (2016), 1731–1741.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Sperandei (2014) Sandro Sperandei. 2014. Understanding logistic regression analysis. Biochemia medica 24, 1 (2014), 12–18.
  • Su and Lu (2023) Hsing-Hao Su and Chuan-Pin Lu. 2023. Development of a Deep Learning-Based Epiglottis Obstruction Ratio Calculation System. Sensors 23, 18 (2023), 7669.
  • Tan and Le (2019) Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning. PMLR, 6105–6114.
  • Van den Bossche et al. (2021) Karlien Van den Bossche, Eli Van de Perck, Andrew Wellman, Elahe Kazemeini, Marc Willemen, Johan Verbraecken, Olivier M Vanderveken, Daniel Vena, and Sara Op de Beeck. 2021. Comparison of drug-induced sleep endoscopy and natural sleep endoscopy in the assessment of upper airway pathophysiology during sleep: protocol and study design. Frontiers in Neurology 12 (2021), 768973.
  • Vyas et al. (2024) Tejas Vyas, Mohsena Chowdhury, Xiaojiao Xiao, et al. 2024. Predicting Mitral Valve mTEER Surgery Outcomes Using Machine Learning and Deep Learning Techniques. arXiv preprint arXiv:2401.13197 (2024).
  • Xiao et al. (2023) Xiaojiao Xiao, Qinmin Vivian Hu, and Guanghui Wang. 2023. Edge-aware multi-task network for integrating quantification segmentation and uncertainty prediction of liver tumor on multi-modality non-contrast MRI. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 652–661.
  • Zhang et al. (2023) Tianxiao Zhang, Andrés M Bur, Shannon Kraft, et al. 2023. Gender, Smoking History, and Age Prediction from Laryngeal Images. Journal of Imaging 9, 6 (2023), 109.