Robust Surgical Phase Recognition From Annotation Efficient Supervision
Abstract
Surgical phase recognition is a key task in computer-assisted surgery, aiming to automatically identify and categorize the different phases within a surgical procedure. Despite substantial advancements, most current approaches rely on fully supervised training, requiring expensive and time-consuming frame-level annotations. Timestamp supervision has recently emerged as a promising alternative, significantly reducing annotation costs while maintaining competitive performance. However, models trained on timestamp annotations can be negatively impacted by missing phase annotations, leading to a potential drawback in real-world scenarios. In this work, we address this issue by proposing a robust method for surgical phase recognition that can handle missing phase annotations effectively. Furthermore, we introduce the SkipTag@K annotation approach to the surgical domain, enabling a flexible balance between annotation effort and model performance. Our method achieves competitive results on two challenging datasets, demonstrating its efficacy in handling missing phase annotations and its potential for reducing annotation costs. Specifically, we achieve an accuracy of 85.1% on the MultiBypass140 dataset using only 3 annotated frames per video, showcasing the effectiveness of our method and the potential of the SkipTag@K setup. We perform extensive experiments to validate the robustness of our method and provide valuable insights to guide future research in surgical phase recognition. Our work contributes to the advancement of surgical workflow recognition and paves the way for more efficient and reliable surgical phase recognition systems.
1 Introduction
Surgical data science [1, 2] plays an important role in computer-aided surgery. Surgical phase recognition, a task that has drawn growing interest in recent years [3], focuses on automatically identifying and categorizing the different phases within a surgical procedure. This task has various applications, such as automatic indexing of surgical video databases [4] and skill assessment of surgeons [5, 6].
Despite the substantial advancements in surgical phase recognition [3], most current approaches rely on fully supervised training, requiring frame-level annotations for all videos in the training set. Obtaining such a training set is both time-consuming and costly, as it necessitates experienced surgeons to meticulously annotate the videos. This issue may hinder the adoption of automatic surgical phase recognition tools for new surgical procedures and use cases. To address this challenge, Ding et al. [7] proposed a promising direction based on timestamp annotations, where only a single frame per phase is annotated. This approach has been shown empirically to reduce annotation time by 74 % compared to full annotation while still achieving comparable results.
However, when dealing with complex surgeries where the order of phases is not deterministic, a potential drawback of timestamp annotation is the possibility of missing phases that were overlooked by the annotator. In this work, we delve into this issue and explore the robustness of the model in relation to such missing phases, proposing a method that is robust to missing phase annotations.
Furthermore, we introduce the SkipTag@K annotation method to the surgical domain, where only K frames from a video are annotated. This method has considerable potential for reducing annotation time. By employing SkipTag@K, we achieve an accuracy of 83.6% on the Cholec80 dataset [4] using only 2 samples per video, significantly reducing the annotation burden.
In summary, our main contributions are as follows:
1. We address the problem of missing labels in timestamp supervision and present a robust method to handle this issue.
2. We introduce SkipTag@K to the surgical domain and demonstrate its potential for efficient annotation.
3. We provide in-depth empirical studies of our method across two challenging datasets in the surgical domain.
Our code will be made available upon acceptance.
2 Related Work
2.1 Surgical phase recognition
Surgical phase recognition has been explored using various approaches across several types of surgeries, including cataract surgeries [8], laparoscopic cholecystectomies [4], and TKR procedures [9]. Early works relied on hand-crafted features and statistical models [10], [11]. Recent works leverage deep learning for surgical phase recognition, often adopting a two-stage network architecture. These architectures first extract features using backbone networks such as ResNet [12], Inception [13], or Vision Transformers (ViT) [14]. Temporal dependencies are then modeled using an additional model such as Long Short-Term Memory (LSTM) [15], Multi-Stage Temporal Convolutional Networks (MS-TCN) [16], or Transformer networks [17]. These two-stage approaches aim to effectively capture spatial and temporal information from the surgical videos. While most works focus on the time-consuming fully supervised setup [3], several works have explored alternatives. In the semi-supervised setup, a subset of the videos is fully annotated, and the rest remain unannotated. In this direction, Ramesh et al. [18] trained self-supervised feature extractors using four different methods, including the DINO method, and evaluated the models in both fully-supervised and semi-supervised setups. They trained their models exclusively on the Cholec80 dataset and demonstrated the results on Cholec80, as well as the models’ generalization abilities on other datasets. In our work, we adapt this training methodology to each dataset separately, namely the Cholec80 and MultiBypass140 datasets. Additional works have also pursued the semi-supervised direction [19, 20, 21, 22]. Other explored directions in the surgical domain include active learning [23], [24], and timestamp supervision [7].
While active learning methods select clips or entire videos for annotation, our work focuses on setups where single frames are annotated, such as timestamp supervision and SkipTag@K. Single-frame annotations are less time-intensive than annotating long clips or entire videos, making them more efficient for obtaining labeled data. In contrast to active learning, where the model iteratively selects samples for annotation, SkipTag@K allows all samples to be selected at the same time, further simplifying the annotation process. Several works have previously explored the use of pseudo-labels [25, 26, 20]. They first trained an offline model on a limited supervised training set in order to generate pseudo-labels on the entire training set, then trained an online model on those pseudo-labels.
2.2 Action segmentation
Surgical phase recognition can be viewed as a specialized case of the action segmentation task. The key differences lie in the unique characteristics of the surgical domain. Several weakly supervised setups have been explored in relation to the action segmentation task, such as timestamp supervision [27, 28, 29, 30], set supervision [31], and transcript-based supervision where Huang et al. [32] suggested an extension to the Connectionist Temporal Classification (CTC) loss that utilizes the similarity between consecutive frames. The challenge of missing actions has also been highlighted by prior works, such as Souri et al. [33] which suggests an optimization-based pseudo-label expansion mechanism, and Rahaman et al. [34] which proposes an EM-based approach. Rahaman et al. also introduce the SkipTag setting, where a fixed number of frames (K) are annotated per video. However, they only evaluate their method with K set to the average number of actions per video. We extend this concept to SkipTag@K and evaluate our method using different K values. Unlike Rahaman et al.’s method, our approach does not assume a prior on the action lengths. The semi-supervised setup was also explored in the action segmentation domain [35].
2.3 Imbalanced data
The distribution of surgical phases is often highly imbalanced, with some phases occurring more frequently or lasting longer than others. This imbalance can lead to biased models that struggle to accurately recognize less prevalent phases. Various techniques have been proposed to address the class imbalance problem. Focal loss [36], originally introduced for object detection, has been widely adopted in many domains, including the surgical domain. Zhang et al. [37] employed an unweighted focal loss for surgical phase recognition, while Ramesh et al. [38] utilized a class-weighted loss to mitigate the impact of imbalanced data. Weighted focal loss has also been applied in the surgical domain, albeit for surgical image classification tasks rather than phase recognition [39, 40].
2.4 Uncertainty Estimation
Monte Carlo Dropout (MCD) [41] is a widely used method for estimating uncertainty in deep learning models by performing multiple forward passes with dropout enabled during inference, and it has a solid theoretical basis. Temperature scaling [42] is another approach that calibrates the confidence scores of a trained model by introducing a temperature parameter to the softmax function. In surgical phase recognition, Bodenstedt et al. [24] employed MCD with several estimators, including an entropy-based estimator for uncertainty estimation in an active learning framework. Ding et al. [7] leveraged MCD for uncertainty estimation using the standard deviation of predictions as a measure of uncertainty to generate reliable pseudo-labels for training.
3 Methods
![Refer to caption](extracted/5686056/assets/sheme.png)
The prediction and training schemes are illustrated in Fig. 1.
3.1 Problem Formulation
Let $V$ be a surgical video containing $T$ frames. Our goal is to predict $Y = (y_1, \dots, y_T)$, which contains a phase classification for each frame in the video from a set of phase options $\{1, \dots, C\}$, where $C$ is the number of phase options. We have a set of VN videos with corresponding annotations. The annotation type varies depending on the supervision setup. In the fully-supervised setup, every video of length $T$ has a corresponding labeling $(y_1, \dots, y_T)$. In the timestamp supervision setup, only one frame from each phase is annotated, resulting in a labeling $\{(t_i, y_{t_i})\}_{i=1}^{PN}$, where PN is the number of phases. In the SkipTag@$K$ supervision, a subset of $K$ frames is chosen and annotated, resulting in the labeling $\{(t_i, y_{t_i})\}_{i=1}^{K}$. We denote $y_t$ as the ground-truth of the $t$-th frame, and $\tilde{y}_t$ is equal to $y_t$ if a ground-truth exists for the $t$-th frame or a previously generated pseudo-label if such was generated for the $t$-th frame.
3.2 Feature extraction
Following the approach proposed by Ramesh et al. [18], we utilize a ResNet-50 [12] model to extract frame-wise features. The ResNet-50 model is initialized with weights pre-trained on ImageNet [43] and then fine-tuned using self-supervised learning on the corresponding dataset with the DINO [44] method.
3.3 Pseudo-labels generation
Using the extracted features, we can utilize our temporal model to generate a probability matrix $P \in \mathbb{R}^{T \times C}$, where $p_t$ is the $t$-th row containing an estimated distribution over phases for the $t$-th frame. We denote $p_{t,c}$ as the predicted probability of the $t$-th frame for the $c$-th phase type. The predicted phase for the $t$-th frame is determined by selecting the phase with the highest probability according to the estimated distribution. Our training method relies on the use of pseudo-labels, which are generated using a technique similar to the Uncertainty-Aware Temporal Diffusion (UATD) method proposed by Ding et al. [7]. However, our approach differs from UATD by employing entropy as the uncertainty estimator, as opposed to the standard deviation used in [7].
3.4 Loss
The loss function used in our approach consists of several components, each of which will be explained in detail below.
For the balanced classification loss, we employ the weighted focal loss [36], which is designed to handle imbalanced data more effectively than the standard cross-entropy loss. This loss component is calculated as follows:
(1) |
(2) |
where $m_t = 1$ if a label (either ground truth or pseudo-label) exists for the $t$-th frame and 0 otherwise, $T$ is the total number of frames in the video, $p_{t,\tilde{y}_t}$ is the predicted probability of the model for the $t$-th frame belonging to the label class $\tilde{y}_t$, and $\gamma$ is a hyper-parameter that controls the focus on hard examples. $w_c$ is an inverse class frequency weight [45] calculated as follows:
(3) |
where $N$ is the total number of annotated frames and $N_c$ is the number of annotated frames belonging to the $c$-th phase option. This weighting scheme helps to mitigate the impact of class imbalance during training.
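As a rough illustration of the balanced classification loss described above, the following sketch computes a class-weighted focal loss over the labeled frames of a single video. The function names, the plain `N / N_c` weight, the normalization by the total number of frames, and the default `gamma` are assumptions made for illustration; the exact forms of Eqs. 1-3 may differ.

```python
import math

def inverse_frequency_weights(labels, num_classes):
    """Weight for class c = N / N_c, counted over annotated frames only."""
    annotated = [y for y in labels if y is not None]
    weights = []
    for c in range(num_classes):
        n_c = annotated.count(c)
        # Assumption: a class with no annotated frame gets weight 0.
        weights.append(len(annotated) / n_c if n_c else 0.0)
    return weights

def weighted_focal_loss(probs, labels, weights, gamma=2.0):
    """probs[t][c]: predicted probability; labels[t]: class id or None."""
    total = 0.0
    for p_t, y in zip(probs, labels):
        if y is None:  # mask: no ground truth or pseudo-label for this frame
            continue
        p = max(p_t[y], 1e-12)  # guard against log(0)
        total += -weights[y] * (1.0 - p) ** gamma * math.log(p)
    return total / len(probs)  # assumed normalization by video length

# Toy video: 6 frames, 2 phases, two unlabeled frames.
labels = [0, None, 0, 0, 1, None]
probs = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]
w = inverse_frequency_weights(labels, num_classes=2)
loss = weighted_focal_loss(probs, labels, w, gamma=2.0)
```

In this toy example the rarer phase receives the larger weight, and setting `gamma=0` reduces the focal term to a class-weighted cross-entropy.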
Entropy Loss aims to encourage the model to be more certain about its predictions. It is calculated as follows:
(4) |
(5) |
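A minimal sketch of an entropy penalty of this kind is shown below, assuming the loss is the mean Shannon entropy of the per-frame predicted distributions. The averaging over all frames and the function names are assumptions for illustration, not necessarily the exact form of Eqs. 4-5.

```python
import math

def entropy(dist):
    """Shannon entropy of one probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def entropy_loss(probs):
    """Mean per-frame entropy; lower values mean more confident predictions."""
    return sum(entropy(p_t) for p_t in probs) / len(probs)

uniform = [[0.5, 0.5]] * 4      # maximally uncertain frames
confident = [[0.99, 0.01]] * 4  # nearly certain frames
```

Minimizing this quantity pushes each frame's distribution toward a single phase, which is the behavior the entropy loss is meant to encourage.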
Confidence loss is adopted from Li et al. [27]. This loss encourages the predicted confidence of a phase to increase monotonically toward an annotated frame of that phase and to decrease monotonically away from it. It helps to suppress outliers and enforce temporal consistency in the model’s predictions. The loss is calculated as follows:
(6) |
(7) |
where $\mathcal{A}$ is the set of time points of annotated frames and TP is the number of annotated time points.
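The idea behind the confidence loss can be sketched as follows, loosely following the formulation of Li et al. [27]: for each annotated frame, any step in log-probability that moves in the wrong direction between the neighboring annotated frames is penalized. The window boundaries, the normalization by the number of annotated time points, and all names are assumptions for illustration.

```python
import math

def confidence_loss(probs, timestamps):
    """probs[t][c]: predicted probability; timestamps: sorted (frame, class) pairs."""
    last = len(probs) - 1
    total = 0.0
    for i, (t_i, a) in enumerate(timestamps):
        # Window spans from the previous to the next annotated frame (assumption).
        start = timestamps[i - 1][0] if i > 0 else 0
        end = timestamps[i + 1][0] if i + 1 < len(timestamps) else last
        for t in range(start + 1, end + 1):
            step = (math.log(max(probs[t][a], 1e-12))
                    - math.log(max(probs[t - 1][a], 1e-12)))
            if t <= t_i:
                total += max(0.0, -step)  # confidence should not drop before t_i
            else:
                total += max(0.0, step)   # confidence should not rise after t_i
    return total / len(timestamps)

# Class-0 confidence peaks at the annotated frame 2: no penalty.
peaked = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
# Class-0 confidence oscillates: penalized.
bumpy = [[0.9, 0.1], [0.2, 0.8], [0.9, 0.1], [0.2, 0.8], [0.9, 0.1]]
```

A prediction whose confidence rises to the annotated frame and falls afterwards incurs zero loss, while oscillating predictions are penalized in proportion to the size of the wrong-direction steps.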
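Smoothness loss is designed to encourage the model to predict a smooth phase segmentation, penalizing changes between neighboring frames. It is based on a truncated mean squared error over the log probabilities, as done in [46, 27, 7]. The smoothness loss is calculated as follows:
$$\mathcal{L}_{smooth} = \frac{1}{TC}\sum_{t,c}\tilde{\Delta}_{t,c}^{2} \tag{8}$$
$$\tilde{\Delta}_{t,c} = \begin{cases}\Delta_{t,c}, & \Delta_{t,c} \le \theta \\ \theta, & \text{otherwise}\end{cases} \tag{9}$$
$$\Delta_{t,c} = \left|\log p_{t,c} - \log p_{t-1,c}\right| \tag{10}$$
where $T$ is the number of frames, $C$ is the number of phase options, $p_{t,c}$ is the predicted probability of the $c$-th phase at the $t$-th frame, and $\theta$ is a threshold hyper-parameter.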
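The truncated mean squared error can be sketched in a few lines, assuming the common MS-TCN-style formulation in which the absolute difference of log-probabilities between consecutive frames is clipped at a threshold before squaring. The default `threshold` value and the function name are assumptions for illustration.

```python
import math

def smoothness_loss(probs, threshold=4.0):
    """Truncated MSE over log-probabilities of consecutive frames."""
    num_frames, num_classes = len(probs), len(probs[0])
    total = 0.0
    for t in range(1, num_frames):
        for c in range(num_classes):
            delta = abs(math.log(max(probs[t][c], 1e-12))
                        - math.log(max(probs[t - 1][c], 1e-12)))
            total += min(delta, threshold) ** 2  # clip large jumps at the threshold
    return total / (num_frames * num_classes)

constant = [[0.7, 0.3]] * 5                       # no change between frames
jump = [[1 - 1e-9, 1e-9], [1e-9, 1 - 1e-9]]        # abrupt phase switch
```

The clipping keeps genuine phase transitions, which produce very large jumps, from dominating the loss, while small frame-to-frame fluctuations are still penalized quadratically.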
Star Temporal Classification (STC) loss, which is a variation of the CTC [47] loss, is used to encourage the model to predict the correct order of the phases. CTC loss aims to tackle misalignment between input and output sequences of varying lengths, and is widely used for automatic speech recognition [48] and handwritten text recognition [49]. The CTC introduces an additional blank token $\epsilon$ and a collapse function $\mathcal{B}$ that maps frame-wise predictions to dense predictions by removing blank tokens and repeated predictions. Let $X$ be a video containing $T$ frames and $l$ a segment-based labeling.
The CTC loss can be expressed as follows:
$$\mathcal{L}_{CTC} = -\log P(l \mid X) \tag{11}$$
$$P(l \mid X) = \sum_{\pi \in \mathcal{B}^{-1}(l)} P(\pi \mid X) \tag{12}$$
where $\pi$ represents a sparse phase prediction that would have collapsed to the correct segment-based labeling $l$. Using the CTC’s assumption of conditional independence,
$$P(\pi \mid X) = \prod_{t=1}^{T} p_{t,\pi_t} \tag{13}$$
where the per-frame probabilities $p_{t,\pi_t}$ are predicted by the model. The CTC loss can be efficiently calculated using dynamic programming. CTC can also be calculated using weighted finite-state transducers [50]. STC aims to allow learning from weakly supervised labels. The STC is based on adding an additional star token that can represent a missing label. This idea allows handling of a flexible number of missing tokens, while encouraging the model to predict the correct phase order.
To obtain the segment-based labeling, we use the labeled frames and remove consecutive duplicate labels. In experiments where STC is used, we add an additional phase that the model can predict. After the STC loss is calculated, model predictions from a close neighboring frame are copied so that the rest of the method can proceed correctly.
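The two sequence operations mentioned in this section can be illustrated with a short sketch: the CTC collapse function, which merges repeated predictions and then drops blank tokens, and the construction of an ordered phase list from sparse frame labels by removing consecutive duplicates. The token string and function names are hypothetical, and the STC star token itself is not modeled here.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical name for the CTC blank token

def collapse(frame_labels, blank=BLANK):
    """CTC collapse: merge consecutive repeats, then remove blank tokens."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]

def transcript_from_sparse(labels):
    """Ordered phase list from sparse frame labels (None = unlabeled frame)."""
    kept = [y for y in labels if y is not None]
    return [label for label, _ in groupby(kept)]

frame_path = ["A", "A", BLANK, "A", "B", "B"]
sparse = [None, "A", None, "A", "B", None, "C"]
```

Here `collapse(frame_path)` yields `["A", "A", "B"]` because the blank separates two occurrences of phase A, while `transcript_from_sparse(sparse)` yields `["A", "B", "C"]`.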
The total loss is a weighted sum of the individual loss components:
(14) |
where the weights of the individual loss components are hyper-parameters.
3.5 Additional training stage
![Refer to caption](extracted/5686056/assets/unceranity-graph.png)
Using the trained model, we create fixed partial pseudo-labels that are then used to train a new model.
When generating those pseudo-labels, we aim to utilize all of the model’s predictions, except for those corresponding to transition moments and their surrounding frames. To achieve this, we first need to detect transition moments. We use a scaled entropy measure for this purpose:
$$u_t = H\left(\mathrm{softmax}\left(z_t / \tau\right)\right) \tag{15}$$
where $z_t$ is the model’s logit output for the $t$-th frame, $\tau$ is the scaling temperature, and $H$ is the entropy measure as defined in Eq. 4. We use a low temperature ($\tau < 1$) to push the model’s outputs to the far ends of the spectrum. By employing this estimator, we can easily detect the transition events using a simple threshold, as illustrated in Fig. 2. We then consider all predictions that are not within a window of frames from a phase transition event and utilize them as pseudo-labels for further training. The new model is trained using an unweighted focal loss combined with a smoothness loss component.
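The pseudo-label selection described above can be sketched as follows: the entropy of the temperature-scaled softmax is thresholded to flag likely transition frames, and every frame within a fixed window of a flagged frame is left unlabeled. The threshold, window size, and function names are hypothetical values chosen for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]

def scaled_entropy(logits, temperature):
    """Entropy of the temperature-scaled softmax distribution."""
    return -sum(p * math.log(p) for p in softmax(logits, temperature) if p > 0.0)

def partial_pseudo_labels(frame_logits, temperature=0.25, threshold=0.3, window=1):
    """Argmax label per frame, or None near a detected transition."""
    uncertain = [t for t, z in enumerate(frame_logits)
                 if scaled_entropy(z, temperature) > threshold]
    labels = []
    for t, z in enumerate(frame_logits):
        near_transition = any(abs(t - u) <= window for u in uncertain)
        labels.append(None if near_transition
                      else max(range(len(z)), key=z.__getitem__))
    return labels

# Toy logits: phase 0, an ambiguous frame, then phase 1.
frame_logits = [[4.0, 0.0]] * 3 + [[1.0, 1.0]] + [[0.0, 4.0]] * 3
pseudo = partial_pseudo_labels(frame_logits)
```

With these toy values, only the ambiguous middle frame and its immediate neighbors are excluded, so `pseudo` is `[0, 0, None, None, None, 1, 1]`.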
3.6 Implementation Details
All experiments were performed on NVIDIA A100 Tensor Core and NVIDIA RTX 6000 Ada Generation GPUs. Both datasets were down-sampled to 1 fps. We follow Ramesh et al. [18] regarding the self-supervised feature extractor fine-tuning configuration. We used the Adam optimizer with an initial learning rate of 5e-4 to train the temporal model for 50 epochs. We resize the images to a 224 × 399 resolution before propagating them through our feature extractor. We employ the TCN-based model suggested by Li et al. [27] as our temporal model. We use for all experiments.
4 Experiments
4.1 Datasets
Cholec80 [4] contains 80 videos of laparoscopic cholecystectomy procedures, with an average duration of 38 minutes. The dataset contains 7 types of phases. It was recorded at 25 fps with a resolution of 854x480 or 1920x1080. We follow Ding et al. [7] and use the first 40 videos as a training set and the remaining videos as a test set.
MultiBypass140 [51] contains 140 videos of laparoscopic Roux-en-Y gastric bypass surgeries, with an average duration of 91 minutes. The dataset contains 12 types of phases. It was recorded at 25 fps in two medical centers with a resolution of 720x576, 854x480, or 1920x1080. We follow Lavanchy and Ramesh et al. [51] and split the dataset into three parts: training, validation, and test, which contain 80, 20, and 40 videos, respectively.
4.2 Evaluation Metrics
We follow previous works [19, 7, 15] and report frame-level evaluation metrics: accuracy (AC), precision (PR), recall (RE), Jaccard (JA), and F1. For each phase prediction P and ground truth GT, PR, RE, JA, and F1 are calculated as follows:
$$PR = \frac{|GT \cap P|}{|P|}, \quad RE = \frac{|GT \cap P|}{|GT|}, \quad JA = \frac{|GT \cap P|}{|GT \cup P|}, \quad F1 = \frac{2 \cdot PR \cdot RE}{PR + RE} \tag{16}$$
The scores for each phase are averaged across all the phases in a video. Accuracy is calculated globally for each video. When evaluating on Cholec80, we follow [18, 7] and report 10-second ’relaxed’ metrics, meaning that we allow two correct phases in the 10-second window surrounding each phase transition.
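The per-video computation of these metrics can be sketched as follows, using the standard set-based definitions of precision, recall, Jaccard, and F1. Averaging over the phases present in the ground truth of the video is an assumption, the relaxed boundary handling used for Cholec80 is not included, and all names are illustrative.

```python
def phase_metrics(pred, gt, phase):
    """Set-based PR, RE, JA, and F1 for one phase in one video."""
    p = {t for t, y in enumerate(pred) if y == phase}
    g = {t for t, y in enumerate(gt) if y == phase}
    inter, union = len(p & g), len(p | g)
    pr = inter / len(p) if p else 0.0
    re = inter / len(g) if g else 0.0
    ja = inter / union if union else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re > 0 else 0.0
    return pr, re, ja, f1

def video_metrics(pred, gt):
    """Phase-averaged PR/RE/JA/F1 and frame-level accuracy for one video."""
    per_phase = [phase_metrics(pred, gt, c) for c in sorted(set(gt))]
    n = len(per_phase)
    pr, re, ja, f1 = (sum(m[i] for m in per_phase) / n for i in range(4))
    ac = sum(1 for a, b in zip(pred, gt) if a == b) / len(gt)
    return {"PR": pr, "RE": re, "JA": ja, "F1": f1, "AC": ac}

gt = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]  # the transition is predicted one frame early
scores = video_metrics(pred, gt)
```

In this toy video a single misplaced transition frame lowers accuracy to 5/6 and the phase-averaged precision to 0.875.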
4.3 Sampling Distribution
Figure 3 presents the distribution of per-frame phase annotations in the Cholec80 dataset. The full distribution, shown in yellow, illustrates a significant imbalance between classes, with some phases occurring much more frequently than others. This imbalance poses a challenge for surgical phase recognition models, as they may struggle to accurately classify underrepresented phases. In contrast, the timestamp labels, depicted in pink, are distributed more evenly across the phases. This suggests that while the duration of phases may vary, the number of phase occurrences is relatively balanced in the dataset. The more uniform distribution of timestamp labels can be attributed to the fact that they are selected based on the presence of each phase rather than its duration. To create SkipTag@K annotations, we split the video into K equal partitions and uniformly sample a single frame from each partition. The blue shades represent the SkipTag@K sampling method with K values of 2, 4, and 7, where K denotes the fixed number of frames annotated per video. As K increases, the SkipTag@K distributions more closely resemble the full distribution, indicating that this sampling strategy effectively captures the original data distribution. The similarity between the SkipTag@7 and full distributions suggests that annotating just 7 frames per video can provide a representative sample of the phase distribution in the Cholec80 dataset.
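The SkipTag@K sampling procedure described above (K equal partitions, one uniformly chosen frame per partition) can be sketched as follows. The function name and the handling of partition boundaries by integer division are assumptions for illustration.

```python
import random

def skiptag_sample(num_frames, k, rng=None):
    """Pick one frame index uniformly at random from each of k equal partitions.

    Assumes num_frames >= k so that every partition is non-empty.
    """
    rng = rng or random.Random()
    indices = []
    for i in range(k):
        start = i * num_frames // k
        end = (i + 1) * num_frames // k  # exclusive upper bound
        indices.append(rng.randrange(start, end))
    return indices

# Example: SkipTag@4 on a 100-frame video, seeded for reproducibility.
frames_to_annotate = skiptag_sample(100, 4, random.Random(0))
```

Because every partition contributes exactly one frame, the selected frames are spread over the whole video, and all K frames can be chosen up front without any model in the loop.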
![Refer to caption](x1.png)
4.4 Robustness to Missing Phase Annotations
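Setup | Method | Cholec80 RE (%) | Cholec80 PR (%) | Cholec80 JA (%) | Cholec80 AC (%) | Cholec80 F1 (%) | MultiBypass140 RE (%) | MultiBypass140 PR (%) | MultiBypass140 JA (%) | MultiBypass140 AC (%) | MultiBypass140 F1 (%) |
---|---|---|---|---|---|---|---|---|---|---|---|
Timestamp supervision (miss rate = 0) | Ding et al. [7] | 90.5 ± 5.9 | 89.5 ± 4.4 | 79.9 ± 8.5 | 91.9 ± 5.6 | - | 74.3 ± 10.0 | 75.7 ± 13.2 | 65.2 ± 11.6 | 87.2 ± 10.6 | 72.1 ± 11.3 |
Timestamp supervision (miss rate = 0) | Ours | 97 ± 9.8 | 84.8 ± 6.9 | 76.1 ± 8.8 | 90.4 ± 5.9 | 87.9 ± 5.4 | 78.3 ± 11.7 | 75.9 ± 11.5 | 67.8 ± 13 | 88.4 ± 9.7 | 74.6 ± 12.3 |
Miss rate = 0.1 | Ding et al. [7] | 86.4 ± 11.5 | 87.9 ± 6.7 | 74.9 ± 8.0 | 86.5 ± 9.6 | 86.5 ± 6.7 | 75.1 ± 10.1 | 75.2 ± 12.6 | 65.1 ± 12.0 | 86.3 ± 10.0 | 72.1 ± 11.3 |
Miss rate = 0.1 | Ours | 97.3 ± 9.3 | 84.1 ± 6.8 | 75.4 ± 8.4 | 89.8 ± 5.9 | 87.4 ± 5.8 | 76.9 ± 11.3 | 73.9 ± 12.4 | 65.9 ± 13.7 | 87.5 ± 9.9 | 72.8 ± 13 |
Miss rate = 0.2 | Ding et al. [7] | 81.8 ± 19.2 | 85.2 ± 8.45 | 67.8 ± 14.0 | 84 ± 8.8 | 81.6 ± 8.45 | 70.5 ± 13.0 | 69.3 ± 13.9 | 58.5 ± 12.6 | 82.5 ± 10.3 | 66.4 ± 13.1 |
Miss rate = 0.2 | Ours | 96.8 ± 9.3 | 81.9 ± 7.2 | 72.7 ± 9.15 | 86.9 ± 7.4 | 84.9 ± 6.5 | 75.7 ± 12.2 | 73.1 ± 12.3 | 64.8 ± 13.6 | 87 ± 9.8 | 71.7 ± 13 |
Miss rate = 0.3 | Ding et al. [7] | 76.6 ± 16.1 | 80.2 ± 9.6 | 58.9 ± 13.5 | 78 ± 12.1 | 77.1 ± 9.6 | 61.4 ± 11.3 | 60.6 ± 12.2 | 46.7 ± 11.3 | 71.5 ± 14.3 | 55.7 ± 11.5 |
Miss rate = 0.3 | Ours | 96.5 ± 9.5 | 82.4 ± 7.4 | 73.2 ± 9 | 87.4 ± 7.5 | 85.5 ± 5.8 | 76.7 ± 12.4 | 72.2 ± 13 | 63.7 ± 14.6 | 85.7 ± 9.5 | 71.1 ± 14 |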
To demonstrate the impact of missing phase annotations and the robustness of our method, we compare our approach with Ding et al.’s method. We simulated missing phase annotations by randomly removing phase labels from the timestamp annotations with varying miss rate probabilities. As shown in Table 1, Ding et al.’s method is significantly affected by missing phase annotations. This is evident from the increased standard deviations in the evaluation metrics on the Cholec80 dataset and the substantial drop in performance, ranging from a 10% to a 28% decline when comparing the timestamp results (miss rate of 0) to the results with a miss rate of 0.3. In contrast, our method’s performance declines only slightly. It achieves competitive results on the Cholec80 dataset in the timestamp setup, surpasses Ding et al.’s method on the MultiBypass140 dataset in the timestamp setup, and consistently outperforms their approach on both datasets across all missing rate settings and almost all metrics.
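The simulation of missing phase annotations can be sketched as follows, assuming each timestamp label is dropped independently with probability equal to the miss rate. The independence assumption, the data layout, and the function name are illustrative rather than a description of the exact procedure used in the experiments.

```python
import random

def drop_timestamps(timestamps, miss_rate, rng=None):
    """Remove each (frame, phase) timestamp independently with prob. miss_rate."""
    rng = rng or random.Random()
    return [ts for ts in timestamps if rng.random() >= miss_rate]

# One timestamp per phase occurrence: (frame index, phase id).
timestamps = [(120, 0), (640, 1), (1500, 2), (1900, 3), (2300, 4)]
kept = drop_timestamps(timestamps, miss_rate=0.3, rng=random.Random(0))
```

With a miss rate of 0 all timestamps are kept, and the surviving timestamps always preserve their original temporal order.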
Figure 4 further illustrates the performance comparison between our method and Ding et al.’s approach under different missing rate probabilities.
![Refer to caption](x2.png)
4.5 SkipTag@K Evaluation
In this section, we evaluate the performance of our model in the challenging SkipTag@K setup, where only K frames from each video are annotated. We present results for three K values chosen relative to the average number of phases per video in the corresponding dataset, which is 6.8 for Cholec80 and 9.3 for MultiBypass140. The resulting K values are 7, 4, and 2 for Cholec80 and 9, 5, and 3 for MultiBypass140.
Setup | Method | RE (%) | PR (%) | JA (%) | AC (%) | F1 (%) |
---|---|---|---|---|---|---|
SkipTag@7 | Ding et al. [7] | 86.5 ± 11.5 | 87.9 ± 6.7 | 74.9 ± 8.0 | 86.5 ± 9.6 | 86.5 ± 6.7 |
SkipTag@7 | Ours | 87.8 ± 11.5 | 88.1 ± 8.5 | 73.6 ± 11.9 | 91.2 ± 5.9 | 86.4 ± 7.8 |
SkipTag@4 | Ding et al. [7] | 50.4 ± 32.0 | 84.4 ± 13.3 | 40.6 ± 24.7 | 76.7 ± 9.8 | 57.1 ± 13.3 |
SkipTag@4 | Ours | 84.0 ± 13.0 | 82.3 ± 10.4 | 66.9 ± 13.5 | 88.9 ± 7.6 | 84.1 ± 9.7 |
SkipTag@2 | Ding et al. [7] | 32.1 ± 39.3 | 88.0 ± 17.7 | 23.8 ± 26.0 | 65.6 ± 12.1 | 39.1 ± 17.7 |
SkipTag@2 | Ours | 71.2 ± 12.2 | 81.0 ± 9.0 | 57.9 ± 11.1 | 83.6 ± 7.1 | 78.1 ± 7.9 |

Setup | Method | RE (%) | PR (%) | JA (%) | AC (%) | F1 (%) |
---|---|---|---|---|---|---|
SkipTag@9 | Ding et al. [7] | 46.6 ± 7.3 | 44.1 ± 10.1 | 37.9 ± 9.4 | 76.5 ± 11.3 | 42.8 ± 8.9 |
SkipTag@9 | Ours | 77.8 ± 10.5 | 75.8 ± 10.6 | 68.3 ± 12.3 | 88.7 ± 10.5 | 74.7 ± 11.2 |
SkipTag@5 | Ding et al. [7] | 31.5 ± 5.8 | 26.8 ± 8.3 | 22.2 ± 6.8 | 64.4 ± 11.7 | 26.8 ± 6.8 |
SkipTag@5 | Ours | 74.9 ± 11.9 | 75.0 ± 13.3 | 66.3 ± 14.3 | 87.7 ± 11.7 | 72.7 ± 13.4 |
SkipTag@3 | Ding et al. [7] | 25.0 ± 8.4 | 21.5 ± 11.2 | 16.1 ± 8.3 | 49.9 ± 16.1 | 20.3 ± 9.1 |
SkipTag@3 | Ours | 68.6 ± 12.8 | 65.7 ± 14.4 | 58.5 ± 14.4 | 85.1 ± 12.3 | 65.0 ± 14.3 |
Tables 2 and 3 demonstrate that our model achieves impressive results in the SkipTag@K setup, despite the limited number of annotated frames. Remarkably, the decline in performance when reducing the number of annotated samples is relatively small compared to the reduction in the number of samples itself. For instance, on both datasets, the F1 measure drops by less than 13% when comparing the results obtained with the largest tested K value to those obtained with the lowest tested K value, even though the larger number of samples is at least 3 times as large. Similarly, the accuracy drops by less than 9% under the same conditions. In contrast, Ding et al.’s method experiences a drastic performance drop when the number of annotated samples is reduced.
4.6 Ablations
4.6.1 Component Analysis
Method | RE (%) | PR (%) | JA (%) | AC (%) | F1 (%) |
---|---|---|---|---|---|
Base | 34.1 ± 17.5 | 54.3 ± 15.8 | 22.6 ± 15.2 | 50.9 ± 17.1 | 58 ± 17.8 |
+DINO FE | 28.3 ± 2.7 | 67.3 ± 10.6 | 18.7 ± 4.0 | 67.7 ± 10.9 | 72.1 ± 14.9 |
+Conf loss | 37.8 ± 7.3 | 77.4 ± 9.5 | 27.9 ± 6.2 | 70.9 ± 9.5 | 78 ± 11.0 |
+Focal loss | 63 ± 11.0 | 76.6 ± 12.4 | 51.4 ± 10.7 | 81.8 ± 9.1 | 77.1 ± 8.8 |
+Loss reweighting | 81.5 ± 10.0 | 86.8 ± 9.4 | 68.1 ± 9.9 | 89.2 ± 6.6 | 84.6 ± 6.7 |
+STC loss | 86.8 ± 10.6 | 88.1 ± 7.2 | 72.5 ± 11.0 | 90.8 ± 5.6 | 85.6 ± 6.9 |
+Additional training | 87.8 ± 11.5 | 88.1 ± 8.5 | 73.6 ± 11.9 | 91.2 ± 5.9 | 86.4 ± 7.8 |
Table 4 presents an ablation study that highlights the contribution of each component in our method, using the SkipTag@7 setup on the Cholec80 dataset. The first modification is the replacement of the feature extractor from a ResNet-50 model pretrained on ImageNet to a self-supervised DINO model fine-tuned on the relevant dataset. This change leads to significant improvements across most metrics and reduces the model’s standard deviation, indicating more robust performance. Subsequent modifications demonstrate the impact of various loss functions and training strategies. The introduction of the confidence loss, focal loss, loss re-weighting, and STC loss all contribute to incremental improvements in the model’s performance. The final component is the additional training phase, which utilizes the generated pseudo-labels to further refine the model. This step results in further performance gains, achieving the highest scores across all evaluation metrics.
4.6.2 Additional training stage
Setup | Model | RE (%) | PR (%) | JA (%) | AC (%) | F1 (%) |
---|---|---|---|---|---|---|
Timestamp supervision | Base | 97.5 ± 9.2 | 84.9 ± 7.6 | 76.4 ± 10.3 | 90.3 ± 6.8 | 87.9 ± 6.9 |
Timestamp supervision | temperature = 0.5 | 97.1 ± 9.6 | 82.9 ± 7.4 | 73.9 ± 9.3 | 88.7 ± 6.2 | 85.9 ± 6.7 |
Timestamp supervision | temperature = 0.25 | 97.0 ± 9.8 | 84.8 ± 6.9 | 76.1 ± 8.8 | 90.4 ± 5.9 | 87.9 ± 5.4 |
Missing 0.1 | Base | 96.5 ± 9.1 | 82.5 ± 7.4 | 73.0 ± 9.2 | 87.8 ± 6.7 | 85.4 ± 6.6 |
Missing 0.1 | temperature = 0.5 | 96.6 ± 9.6 | 82.7 ± 6.9 | 73.5 ± 8.6 | 88.3 ± 6.2 | 85.5 ± 6.2 |
Missing 0.1 | temperature = 0.25 | 97.3 ± 9.3 | 84.1 ± 6.8 | 75.4 ± 8.4 | 89.8 ± 5.9 | 87.4 ± 5.8 |
Missing 0.2 | Base | 96.8 ± 9.3 | 81.9 ± 7.2 | 72.7 ± 9.1 | 86.9 ± 7.4 | 84.9 ± 6.5 |
Missing 0.2 | temperature = 0.5 | 96.7 ± 9.1 | 82.0 ± 6.6 | 72.5 ± 8.3 | 87.3 ± 6.2 | 85.0 ± 5.6 |
Missing 0.2 | temperature = 0.25 | 96.6 ± 9.1 | 83.2 ± 6.2 | 74.0 ± 8.0 | 88.0 ± 6.6 | 85.8 ± 5.3 |
Missing 0.3 | Base | 96.7 ± 9.3 | 80.7 ± 7.9 | 71.3 ± 10.4 | 85.4 ± 8.9 | 83.2 ± 8.7 |
Missing 0.3 | temperature = 0.5 | 96.1 ± 9.6 | 80.9 ± 7.1 | 71.5 ± 9.4 | 86.5 ± 7.2 | 84.3 ± 6.1 |
Missing 0.3 | temperature = 0.25 | 96.5 ± 9.5 | 82.4 ± 7.4 | 73.2 ± 9.0 | 87.4 ± 7.5 | 85.5 ± 5.8 |
SkipTag@7 | Base | 86.8 ± 10.6 | 88.1 ± 7.2 | 72.5 ± 11.0 | 90.8 ± 5.6 | 85.6 ± 6.9 |
SkipTag@7 | temperature = 0.5 | 83.5 ± 12.7 | 83.8 ± 10.7 | 69.2 ± 13.3 | 88.4 ± 8.2 | 84.7 ± 8.1 |
SkipTag@7 | temperature = 0.25 | 87.8 ± 11.5 | 88.1 ± 8.5 | 73.6 ± 11.9 | 91.2 ± 5.9 | 86.4 ± 7.8 |
SkipTag@4 | Base | 83.0 ± 12.1 | 84.0 ± 7.7 | 66.0 ± 11.8 | 88.3 ± 6.3 | 80.7 ± 9.9 |
SkipTag@4 | temperature = 0.5 | 53.9 ± 10.5 | 76.3 ± 11.9 | 40.6 ± 9.1 | 80.1 ± 7.1 | 81.5 ± 12.1 |
SkipTag@4 | temperature = 0.25 | 84.0 ± 13.0 | 82.3 ± 10.4 | 66.9 ± 13.5 | 88.9 ± 7.6 | 84.1 ± 9.7 |
SkipTag@2 | Base | 73.8 ± 12.5 | 81.6 ± 9.4 | 59.4 ± 12.7 | 83.4 ± 7.8 | 77.4 ± 9.1 |
SkipTag@2 | temperature = 0.5 | 69.8 ± 11.9 | 78.1 ± 10.3 | 54.9 ± 11.5 | 81.4 ± 8.2 | 74.8 ± 9.2 |
SkipTag@2 | temperature = 0.25 | 71.2 ± 12.2 | 81.0 ± 9.0 | 57.9 ± 11.1 | 83.6 ± 7.1 | 78.1 ± 7.9 |

Setup | Model | RE (%) | PR (%) | JA (%) | AC (%) | F1 (%) |
---|---|---|---|---|---|---|
Timestamp supervision | Base | 72.6 ± 10.0 | 70.0 ± 10.6 | 61.9 ± 12.0 | 86.7 ± 10.0 | 69.0 ± 11.3 |
Timestamp supervision | temperature = 0.5 | 78.3 ± 11.7 | 75.9 ± 11.5 | 67.8 ± 13.0 | 88.4 ± 9.7 | 74.6 ± 12.3 |
Timestamp supervision | temperature = 0.25 | 77.2 ± 12.0 | 75.9 ± 12.2 | 67.2 ± 13.2 | 88.1 ± 9.8 | 73.9 ± 12.8 |
Missing 0.1 | Base | 71.3 ± 10.5 | 69.0 ± 10.6 | 60.6 ± 12.0 | 85.9 ± 10.0 | 67.8 ± 11.2 |
Missing 0.1 | temperature = 0.5 | 76.9 ± 11.3 | 73.9 ± 12.4 | 65.9 ± 13.7 | 87.5 ± 9.9 | 72.8 ± 13.0 |
Missing 0.1 | temperature = 0.25 | 78.2 ± 11.4 | 75.3 ± 13.2 | 67.2 ± 14.0 | 87.7 ± 10.3 | 74.1 ± 13.3 |
Missing 0.2 | Base | 73.9 ± 12.3 | 72.1 ± 12.4 | 62.7 ± 13.4 | 85.4 ± 10.7 | 70.1 ± 13.0 |
Missing 0.2 | temperature = 0.5 | 75.7 ± 12.2 | 73.1 ± 12.3 | 64.8 ± 13.6 | 87.0 ± 9.8 | 71.7 ± 13.0 |
Missing 0.2 | temperature = 0.25 | 71.1 ± 11.9 | 72.1 ± 13.3 | 62.0 ± 13.6 | 85.1 ± 12.6 | 68.9 ± 13.0 |
Missing 0.3 | Base | 75.0 ± 11.1 | 71.2 ± 12.0 | 62.0 ± 12.6 | 84.7 ± 10.1 | 70.1 ± 12.1 |
Missing 0.3 | temperature = 0.5 | 76.7 ± 12.4 | 72.2 ± 13.0 | 63.7 ± 14.6 | 85.7 ± 9.5 | 71.1 ± 14.0 |
Missing 0.3 | temperature = 0.25 | 76.7 ± 12.0 | 72.5 ± 13.0 | 63.8 ± 14.3 | 85.8 ± 10.0 | 71.3 ± 13.7 |
SkipTag@9 | Base | 72.8 ± 10.7 | 73.4 ± 11.5 | 63.9 ± 13.0 | 87.6 ± 10.4 | 70.5 ± 12.2 |
SkipTag@9 | temperature = 0.5 | 77.8 ± 10.5 | 75.8 ± 10.6 | 68.3 ± 12.3 | 88.7 ± 10.5 | 74.7 ± 11.2 |
SkipTag@9 | temperature = 0.25 | 77.7 ± 10.3 | 75.7 ± 10.8 | 68.1 ± 12.1 | 88.7 ± 10.5 | 74.5 ± 11.1 |
SkipTag@5 | Base | 69.3 ± 12.4 | 73.0 ± 12.5 | 60.7 ± 13.8 | 86.1 ± 11.5 | 67.9 ± 12.9 |
SkipTag@5 | temperature = 0.5 | 74.9 ± 11.9 | 75.0 ± 13.3 | 66.3 ± 14.3 | 87.7 ± 11.7 | 72.7 ± 13.4 |
SkipTag@5 | temperature = 0.25 | 73.9 ± 11.9 | 75.7 ± 13.3 | 65.5 ± 14.5 | 87.6 ± 11.9 | 72.1 ± 13.5 |
SkipTag@3 | Base | 62.7 ± 11.6 | 62.2 ± 13.8 | 52.7 ± 13.3 | 83.1 ± 11.8 | 59.2 ± 12.9 |
SkipTag@3 | temperature = 0.5 | 68.6 ± 12.8 | 65.7 ± 14.4 | 58.5 ± 14.4 | 85.1 ± 12.3 | 65.0 ± 14.3 |
SkipTag@3 | temperature = 0.25 | 67.6 ± 12.8 | 65.4 ± 14.3 | 57.3 ± 14.9 | 84.5 ± 11.8 | 63.8 ± 14.6 |
In this section, we focus on the pseudo-label generation mechanism and investigate the effect of the temperature used in it. Tables 5 and 6 present a comparison between the base model (before pseudo-label generation) and models trained on pseudo-labels generated with temperatures of 0.5 and 0.25. The comparison is performed for various setups, including timestamp supervision, missing annotations, and SkipTag@K. As illustrated in Figure 2, lower temperature values result in more extreme uncertainty estimates, leading to a larger number of frames falling below the uncertainty threshold and consequently being annotated with pseudo-labels. For the Cholec80 dataset (Table 5), we observe that a temperature of 0.25 is preferred over a temperature of 0.5 across all supervision setups. The additional training stage improves the results in most setups.
For the MultiBypass140 dataset (Table 6), the optimal temperature value varies depending on the setup. However, both temperature values outperform the base model in nearly all setups, demonstrating the effectiveness of the additional training stage across different setups.
5 Discussion and Conclusions
In this study, we focused on annotation time efficient techniques for surgical phase recognition. We addressed the challenge of missing phase annotations in the promising timestamp supervision approach and presented a robust method that is able to tackle this hurdle. This direction paves the way for timestamp annotation in a realistic setup where some phase annotations may be unintentionally omitted, further promoting its adoption as a viable alternative to the fully supervised setup. Additionally, we introduced SkipTag@K to the surgical domain, offering a flexible trade-off between annotation effort and model performance and demonstrated the competitive results obtained with it. Remarkably, we achieved impressive performance on both datasets using as few as 2 samples for Cholec80 and 3 samples for MultiBypass140 per video which corresponds to only 0.037% and 0.0023% of the corresponding training sets respectively.
While we successfully tackled the issue of missing annotation labels, our work did not address the problem of incorrect label assignments during the annotation process. This opens up an avenue for future research to investigate methods that can handle both missing and incorrect labels. In the SkipTag@K experiments, we employed a simple strategy of uniformly selecting a sample from each of the K equally divided video segments. Although this approach yielded impressive results, exploring more sophisticated sampling strategies, such as clustering-based techniques, could potentially lead to even better performance. Another potential future direction is to investigate the use of easily obtainable weak signals from the surgical procedure itself to efficiently achieve partial labeling. Such signals could include the use of specific surgical tools, changes in the surgical scene, or even audio cues from the operating room. By leveraging these inherent weak signals, we may be able to further reduce the annotation burden while maintaining high performance in surgical phase recognition.
References
- [1] L. Maier-Hein, S. S. Vedula, S. Speidel, N. Navab, R. Kikinis, A. Park, M. Eisenmann, H. Feussner, G. Forestier, S. Giannarou et al., “Surgical data science for next-generation interventions,” Nature Biomedical Engineering, vol. 1, no. 9, pp. 691–696, 2017.
- [2] L. Maier-Hein, M. Eisenmann, D. Sarikaya, K. März, T. Collins, A. Malpani, J. Fallert, H. Feussner, S. Giannarou, P. Mascagni et al., “Surgical data science–from concepts toward clinical translation,” Medical image analysis, vol. 76, p. 102306, 2022.
- [3] K. C. Demir, H. Schieber, T. Weise, D. Roth, M. May, A. Maier, and S. H. Yang, “Deep learning in surgical workflow analysis: A review of phase and step recognition,” IEEE Journal of Biomedical and Health Informatics, 2023.
- [4] A. P. Twinanda, S. Shehata, D. Mutter, J. Marescaux, M. De Mathelin, and N. Padoy, “Endonet: a deep architecture for recognition tasks on laparoscopic videos,” IEEE transactions on medical imaging, vol. 36, no. 1, pp. 86–97, 2016.
- [5] D. Liu, Q. Li, T. Jiang, Y. Wang, R. Miao, F. Shan, and Z. Li, “Towards unified surgical skill assessment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 9522–9531.
- [6] M. Komatsu, D. Kitaguchi, M. Yura, N. Takeshita, M. Yoshida, M. Yamaguchi, H. Kondo, T. Kinoshita, and M. Ito, “Automatic surgical phase recognition-based skill assessment in laparoscopic distal gastrectomy using multicenter videos,” Gastric Cancer, vol. 27, no. 1, pp. 187–196, 2024.
- [7] X. Ding, X. Yan, Z. Wang, W. Zhao, J. Zhuang, X. Xu, and X. Li, “Less is more: Surgical phase recognition from timestamp supervision,” IEEE Transactions on Medical Imaging, 2023.
- [8] O. Zisimopoulos, E. Flouty, I. Luengo, P. Giataganas, J. Nehme, A. Chow, and D. Stoyanov, “Deepphase: surgical phase recognition in cataracts videos,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part IV 11. Springer, 2018, pp. 265–272.
- [9] A. Kadkhodamohammadi, N. Sivanesan Uthraraj, P. Giataganas, G. Gras, K. Kerr, I. Luengo, S. Oussedik, and D. Stoyanov, “Towards video-based surgical workflow understanding in open orthopaedic surgery,” Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, vol. 9, no. 3, pp. 286–293, 2021.
- [10] T. Blum, H. Feußner, and N. Navab, “Modeling and segmentation of surgical workflow from laparoscopic video,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2010: 13th International Conference, Beijing, China, September 20-24, 2010, Proceedings, Part III 13. Springer, 2010, pp. 400–407.
- [11] N. Padoy, T. Blum, S.-A. Ahmadi, H. Feussner, M.-O. Berger, and N. Navab, “Statistical modeling and recognition of surgical workflow,” Medical image analysis, vol. 16, no. 3, pp. 632–641, 2012.
- [12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [13] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
- [14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
- [15] Y. Jin, Q. Dou, H. Chen, L. Yu, J. Qin, C.-W. Fu, and P.-A. Heng, “Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network,” IEEE transactions on medical imaging, vol. 37, no. 5, pp. 1114–1126, 2017.
- [16] T. Czempiel, M. Paschali, M. Keicher, W. Simson, H. Feussner, S. T. Kim, and N. Navab, “Tecno: Surgical phase recognition with multi-stage temporal convolutional networks,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23. Springer, 2020, pp. 343–352.
- [17] T. Czempiel, M. Paschali, D. Ostler, S. T. Kim, B. Busam, and N. Navab, “Opera: Attention-regularized transformers for surgical phase recognition,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24. Springer, 2021, pp. 604–614.
- [18] S. Ramesh, V. Srivastav, D. Alapatt, T. Yu, A. Murali, L. Sestini, C. I. Nwoye, I. Hamoud, S. Sharma, A. Fleurentin et al., “Dissecting self-supervised learning methods for surgical computer vision,” Medical Image Analysis, vol. 88, p. 102844, 2023.
- [19] X. Shi, Y. Jin, Q. Dou, and P.-A. Heng, “Semi-supervised learning with progressive unlabeled data excavation for label-efficient surgical workflow recognition,” Medical Image Analysis, vol. 73, p. 102158, 2021.
- [20] T. Yu, D. Mutter, J. Marescaux, and N. Padoy, “Learning from a tiny dataset of manual annotations: a teacher/student approach for surgical phase recognition,” arXiv preprint arXiv:1812.00033, 2018.
- [21] Y. Chen, Q. L. Sun, and K. Zhong, “Semi-supervised spatio-temporal cnn for recognition of surgical workflow,” EURASIP Journal on Image and Video Processing, vol. 2018, pp. 1–9, 2018.
- [22] G. Yengera, D. Mutter, J. Marescaux, and N. Padoy, “Less is more: Surgical phase recognition with less annotations through self-supervised pre-training of cnn-lstm networks,” arXiv preprint arXiv:1805.08569, 2018.
- [23] X. Shi, Y. Jin, Q. Dou, and P.-A. Heng, “Lrtd: Long-range temporal dependency based active learning for surgical workflow recognition,” International Journal of Computer Assisted Radiology and Surgery, vol. 15, pp. 1573–1584, 2020.
- [24] S. Bodenstedt, D. Rivoir, A. Jenke, M. Wagner, M. Breucha, B. Müller-Stich, S. T. Mees, J. Weitz, and S. Speidel, “Active learning using deep bayesian networks for surgical workflow analysis,” International journal of computer assisted radiology and surgery, vol. 14, pp. 1079–1087, 2019.
- [25] Y. Zhang, S. Bano, A.-S. Page, J. Deprest, D. Stoyanov, and F. Vasconcelos, “Retrieval of surgical phase transitions using reinforcement learning,” in International conference on medical image computing and computer-assisted intervention. Springer, 2022, pp. 497–506.
- [26] T. M. Ward, D. A. Hashimoto, Y. Ban, D. W. Rattner, H. Inoue, K. D. Lillemoe, D. L. Rus, G. Rosman, and O. R. Meireles, “Automated operative phase identification in peroral endoscopic myotomy,” Surgical endoscopy, vol. 35, pp. 4008–4015, 2021.
- [27] Z. Li, Y. Abu Farha, and J. Gall, “Temporal action segmentation from timestamp supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8365–8374.
- [28] N. Behrmann, S. A. Golestaneh, Z. Kolter, J. Gall, and M. Noroozi, “Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation,” in European Conference on Computer Vision. Springer, 2022, pp. 52–68.
- [29] H. Khan, S. Haresh, A. Ahmed, S. Siddiqui, A. Konin, M. Z. Zia, and Q.-H. Tran, “Timestamp-supervised action segmentation with graph convolutional networks,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 10619–10626.
- [30] D. Du, E. Li, L. Si, F. Xu, and F. Sun, “Timestamp-supervised action segmentation from the perspective of clustering,” in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, E. Elkind, Ed. International Joint Conferences on Artificial Intelligence Organization, Aug. 2023, pp. 690–698, Main Track. [Online]. Available: https://doi.org/10.24963/ijcai.2023/77
- [31] J. Li and S. Todorovic, “Anchor-constrained viterbi for set-supervised action segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9806–9815.
- [32] D.-A. Huang, L. Fei-Fei, and J. C. Niebles, “Connectionist temporal modeling for weakly supervised action labeling,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer, 2016, pp. 137–153.
- [33] Y. Souri, Y. A. Farha, E. Bahrami, G. Francesca, and J. Gall, “Robust action segmentation from timestamp supervision,” in 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, November 21-24, 2022. BMVA Press, 2022. [Online]. Available: https://bmvc2022.mpi-inf.mpg.de/0392.pdf
- [34] R. Rahaman, D. Singhania, A. Thiery, and A. Yao, “A generalized and robust framework for timestamp supervision in temporal action segmentation,” in European Conference on Computer Vision. Springer, 2022, pp. 279–296.
- [35] D. Singhania, R. Rahaman, and A. Yao, “Iterative contrast-classify for semi-supervised temporal action segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 2262–2270.
- [36] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
- [37] B. Zhang, A. Ghanem, A. Simes, H. Choi, and A. Yoo, “Surgical workflow recognition with 3dcnn for sleeve gastrectomy,” International Journal of Computer Assisted Radiology and Surgery, vol. 16, no. 11, pp. 2029–2036, 2021.
- [38] S. Ramesh, D. Dall’Alba, C. Gonzalez, T. Yu, P. Mascagni, D. Mutter, J. Marescaux, P. Fiorini, and N. Padoy, “Multi-task temporal convolutional networks for joint recognition of surgical phases and steps in gastric bypass procedures,” International journal of computer assisted radiology and surgery, vol. 16, pp. 1111–1119, 2021.
- [39] D. N. Le, H. X. Le, L. T. Ngo, and H. T. Ngo, “Transfer learning with class-weighted and focal loss function for automatic skin cancer classification,” arXiv preprint arXiv:2009.05977, 2020.
- [40] R. Qin, K. Qiao, L. Wang, L. Zeng, J. Chen, and B. Yan, “Weighted focal loss: An effective loss function to overcome unbalance problem of chest x-ray14,” in IOP Conference Series: Materials Science and Engineering, vol. 428, no. 1. IOP Publishing, 2018, p. 012022.
- [41] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in international conference on machine learning. PMLR, 2016, pp. 1050–1059.
- [42] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in International conference on machine learning. PMLR, 2017, pp. 1321–1330.
- [43] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
- [44] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660.
- [45] V. Lertnattee and T. Theeramunkong, “Analysis of inverse class frequency in centroid-based text classification,” in IEEE International Symposium on Communications and Information Technology, 2004. ISCIT 2004., vol. 2. IEEE, 2004, pp. 1171–1176.
- [46] Y. A. Farha and J. Gall, “Ms-tcn: Multi-stage temporal convolutional network for action segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3575–3584.
- [47] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369–376.
- [48] J. Li et al., “Recent advances in end-to-end automatic speech recognition,” APSIPA Transactions on Signal and Information Processing, vol. 11, no. 1, 2022.
- [49] W. AlKendi, F. Gechter, L. Heyberger, and C. Guyeux, “Advancements and challenges in handwritten text recognition: A comprehensive survey,” Journal of Imaging, vol. 10, no. 1, p. 18, 2024.
- [50] A. Hannun, V. Pratap, J. Kahn, and W.-N. Hsu, “Differentiable weighted finite-state transducers,” arXiv preprint arXiv:2010.01003, 2020.
- [51] J. L. Lavanchy, S. Ramesh, D. Dall’Alba, C. Gonzalez, P. Fiorini, B. P. Müller-Stich, P. C. Nett, J. Marescaux, D. Mutter, and N. Padoy, “Challenges in multi-centric generalization: phase and step recognition in roux-en-y gastric bypass surgery,” International journal of computer assisted radiology and surgery, pp. 1–9, 2024.