Robust Surgical Phase Recognition From Annotation Efficient Supervision

Or Rubin and Shlomi Laufer

This study was funded by The Bernard M. Gordon Center for Systems Engineering at the Technion. The authors are with the Department of Data and Decision Sciences, Technion, Israel Institute of Technology, Haifa 3200, Israel (e-mail: [email protected]; [email protected]).
Abstract

Surgical phase recognition is a key task in computer-assisted surgery, aiming to automatically identify and categorize the different phases within a surgical procedure. Despite substantial advancements, most current approaches rely on fully supervised training, requiring expensive and time-consuming frame-level annotations. Timestamp supervision has recently emerged as a promising alternative, significantly reducing annotation costs while maintaining competitive performance. However, models trained on timestamp annotations can be negatively impacted by missing phase annotations, leading to a potential drawback in real-world scenarios. In this work, we address this issue by proposing a robust method for surgical phase recognition that can handle missing phase annotations effectively. Furthermore, we introduce the SkipTag@K annotation approach to the surgical domain, enabling a flexible balance between annotation effort and model performance. Our method achieves competitive results on two challenging datasets, demonstrating its efficacy in handling missing phase annotations and its potential for reducing annotation costs. Specifically, we achieve an accuracy of 85.1% on the MultiBypass140 dataset using only 3 annotated frames per video, showcasing the effectiveness of our method and the potential of the SkipTag@K setup. We perform extensive experiments to validate the robustness of our method and provide valuable insights to guide future research in surgical phase recognition. Our work contributes to the advancement of surgical workflow recognition and paves the way for more efficient and reliable surgical phase recognition systems.

1 Introduction

Surgical data science [1, 2] is of great importance to the field of computer-aided surgery. Surgical phase recognition is an important task that has attracted growing interest in recent years [3], focusing on automatically identifying and categorizing the different phases within a surgical procedure. This task has various applications, such as automatic indexing of surgical video databases [4] and skill assessment of surgeons [5, 6].
Despite the substantial advancements in surgical phase recognition [3], most current approaches rely on fully supervised training, requiring frame-level annotations for all videos in the training set. Obtaining such a training set is both time-consuming and costly, as it requires experienced surgeons to meticulously annotate the videos. This issue may hinder the adoption of automatic surgical phase recognition tools for new surgical procedures and use cases. To address this challenge, Ding et al. [7] proposed a promising direction based on timestamp annotations, where only a single frame per phase is annotated. This approach has been shown empirically to reduce annotation time by 74% compared to full annotation while still achieving comparable results.

However, when dealing with complex surgeries where the order of phases is not deterministic, a potential drawback of timestamp annotation is the possibility of missing phases that were overlooked by the annotator. In this work, we delve into this issue and explore the robustness of the model in relation to such missing phases, proposing a method that is robust to missing phase annotations.

Furthermore, we introduce the SkipTag@K annotation method to the surgical domain, where only K frames from a video are annotated. This method has considerable potential in terms of annotation time efficiency. By employing SkipTag@K, we achieve an accuracy of 83.6% on the Cholec80 dataset [4] using only 2 samples per video, significantly reducing the annotation burden.

In summary, our main contributions are as follows:

  1. We address the problem of missing labels in timestamp supervision and present a robust method to handle this issue.

  2. We introduce SkipTag@K to the surgical domain and demonstrate its potential for efficient annotation.

  3. We provide in-depth empirical studies of our method across two challenging datasets in the surgical domain.

Our code will be made available upon acceptance.

2 Related Work

2.1 Surgical phase recognition

Surgical phase recognition has been explored using various approaches across several types of surgeries, including cataract surgeries [8], laparoscopic cholecystectomies [4], and TKR procedures [9]. Early works relied on hand-crafted features and statistical models [10], [11]. Recent works leverage the capabilities of deep learning for surgical phase recognition, often adopting a two-stage network architecture. These architectures first extract features using backbone networks such as ResNet [12], Inception [13], or Vision Transformers (ViT) [14]. Temporal dependencies are then modeled using an additional model such as Long Short-Term Memory (LSTM) [15], Multi-Stage Temporal Convolutional Networks (MS-TCN) [16], or Transformer networks [17]. These two-stage approaches aim to effectively capture spatial and temporal information from the surgical videos.

While most works focus on the time-consuming fully supervised setup [3], several works have explored alternatives. In the semi-supervised setup, a subset of the videos is fully annotated, and the rest remain unannotated. In this direction, Ramesh et al. [18] trained self-supervised feature extractors using four different methods, including the DINO method, and evaluated the models in both fully-supervised and semi-supervised setups. They trained their models exclusively on the Cholec80 dataset and demonstrated the results on Cholec80, as well as the models’ generalization abilities on other datasets. In our work, we adapt this training methodology to each dataset separately, namely the Cholec80 and MultiBypass140 datasets. Additional works have also pursued the semi-supervised direction [19, 20, 21, 22]. Other explored directions in the surgical domain include active learning [23], [24], and timestamp supervision [7].

While active learning methods select clips or entire videos for annotation, our work focuses on setups where single frames are annotated, such as timestamp supervision and SkipTag@K. Single-frame annotations are less time-intensive than annotating long clips or entire videos, making them more efficient for obtaining labeled data. In contrast to active learning, where the model iteratively selects samples for annotation, SkipTag@K allows the selection of all samples at the same time, further simplifying the annotation process. Several works have previously explored the use of pseudo-labels [25, 26], [20]. These works first trained an offline model on a limited supervised training set in order to generate pseudo-labels for the entire training set, and then trained an online model on those pseudo-labels.

2.2 Action segmentation

Surgical phase recognition can be viewed as a specialized case of the action segmentation task. The key differences lie in the unique characteristics of the surgical domain. Several weakly supervised setups have been explored in relation to the action segmentation task, such as timestamp supervision [27, 28, 29, 30], set supervision [31], and transcript-based supervision, where Huang et al. [32] suggested an extension to the Connectionist Temporal Classification (CTC) loss that utilizes the similarity between consecutive frames. The challenge of missing actions has also been highlighted by prior works, such as Souri et al. [33], who suggest an optimization-based pseudo-label expansion mechanism, and Rahaman et al. [34], who propose an EM-based approach. Rahaman et al. also introduce the SkipTag setting, where a fixed number of frames (K) are annotated per video. However, they only evaluate their method with K set to the average number of actions per video. We extend this concept to SkipTag@K and evaluate our method using different K values. Unlike Rahaman et al.’s method, our approach does not assume a prior on the action lengths. The semi-supervised setup was also explored in the action segmentation domain [35].

2.3 Imbalanced data

The distribution of surgical phases is often highly imbalanced, with some phases occurring more frequently or lasting longer than others. This imbalance can lead to biased models that struggle to accurately recognize less prevalent phases. Various techniques have been proposed to address the class imbalance problem. Focal loss [36], originally introduced for object detection, has been widely adopted in many domains, including the surgical domain. Zhang et al. [37] employed an unweighted focal loss for surgical phase recognition, while Ramesh et al. [38] utilized a class-weighted loss to mitigate the impact of imbalanced data. Weighted focal loss has also been applied in the surgical domain, albeit for surgical image classification tasks rather than phase recognition [39, 40].

2.4 Uncertainty Estimation

Monte Carlo Dropout (MCD) [41] is a widely used method for estimating uncertainty in deep learning models by performing multiple forward passes with dropout enabled during inference, and it has a solid theoretical basis. Temperature scaling [42] is another approach that calibrates the confidence scores of a trained model by introducing a temperature parameter to the softmax function. In surgical phase recognition, Bodenstedt et al. [24] employed MCD with several estimators, including an entropy-based estimator for uncertainty estimation in an active learning framework. Ding et al. [7] leveraged MCD for uncertainty estimation using the standard deviation of predictions as a measure of uncertainty to generate reliable pseudo-labels for training.

3 Methods

Figure 1: Overview of our proposed surgical phase recognition method. (a) The model prediction pipeline for generating phase predictions from an input surgical video. It consists of a two-stage architecture: a feature extraction model followed by a temporal model. (b) The feature extractor training pipeline. A ResNet-50 model pre-trained on ImageNet is fine-tuned using self-supervised learning on the target surgical video dataset. (c) The temporal model training pipeline. It includes an initial training stage using our proposed loss function to create a base model. The base model then generates pseudo-labels which are used to train the final temporal model.

The prediction and training schemes are illustrated in Fig. 1.

3.1 Problem Formulation

Let $X=[x_1,\ldots,x_T]$ be a surgical video containing $T$ frames. Our goal is to predict $\hat{Y}=[\hat{y_1},\ldots,\hat{y_T}]$, which contains a phase classification for each frame in the video from a set of phase options $\mathcal{P}=\{1,\ldots,C\}$, where $C$ is the number of phase options. We have a set of VN videos $\overline{X}=\{X_i\}_{i=1}^{\text{VN}}$ with corresponding annotations. The annotation type varies depending on the supervision setup. In the fully-supervised setup, every video $X\in\overline{X}$ of length $T$ has a corresponding labeling $Y_{full}=[y_1,\ldots,y_T]$.
In the timestamp supervision setup, only one frame from each phase is annotated, resulting in a labeling $Y_{ts}=[y_{t_1},\ldots,y_{t_{\text{PN}}}]$, where PN is the number of phases. In the SkipTag@$K$ supervision, a subset of $K$ frames is chosen and annotated, resulting in the labeling $Y_{\text{SkipTag@}K}=[y_{t_1},\ldots,y_{t_K}]$. We denote $y_t$ as the ground-truth of the $t$-th frame, and $\tilde{y_t}$ is equal to $y_t$ if a ground-truth exists for the $t$-th frame or a previously generated pseudo-label if such was generated for the $t$-th frame.
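To make the two sparse annotation setups concrete, the following minimal sketch simulates them from a fully labeled video. The helper names (`segments`, `timestamp_annotations`, `skiptag_annotations`) are hypothetical, and two choices are assumptions rather than part of the formulation above: frames are drawn uniformly at random, and "one frame from each phase" is read as one frame per contiguous phase segment.

```python
import random

def segments(labels):
    """Split a frame-wise label list into (start, end, phase) runs; end is exclusive."""
    runs, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            runs.append((start, t, labels[start]))
            start = t
    return runs

def timestamp_annotations(labels, rng):
    """One annotated frame per phase segment (uniform choice is an assumption)."""
    return {rng.randrange(s, e): p for s, e, p in segments(labels)}

def skiptag_annotations(labels, k, rng):
    """K annotated frames per video, drawn uniformly at random (an assumption)."""
    frames = sorted(rng.sample(range(len(labels)), k))
    return {t: labels[t] for t in frames}

# Toy video with three phases and 12 frames.
full = [1] * 5 + [2] * 3 + [3] * 4
rng = random.Random(0)
ts = timestamp_annotations(full, rng)   # maps frame index -> phase, one per segment
st = skiptag_annotations(full, 2, rng)  # maps frame index -> phase, K = 2 frames
```

With SkipTag@K, nothing guarantees that every phase receives an annotated frame, which is why robustness to missing phases matters in this setup.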

3.2 Feature extraction

Following the approach proposed by Ramesh et al. [18], we utilize a ResNet-50 [12] model to extract frame-wise features. The ResNet-50 model is initialized with weights pre-trained on ImageNet [43] and then fine-tuned using self-supervised learning on the corresponding dataset with the DINO [44] method.

3.3 Pseudo-labels generation

Using the extracted features, we can utilize our temporal model $M(\cdot)$ to generate a probability matrix $P\in[0,1]^{T\times C}$, where $p_t$ is the $t$-th row containing an estimated distribution between phases for the $t$-th frame. We denote $P_{a,b}$ as the predicted probability of the $a$-th frame for the $b$-th phase type. The predicted phase for the $t$-th frame is determined by selecting the phase with the highest probability according to the estimated distribution. Our training method relies on the use of pseudo-labels, which are generated using a technique similar to the Uncertainty-Aware Temporal Diffusion (UATD) method proposed by Ding et al. [7]. However, our approach differs from UATD by employing entropy as the uncertainty estimator, as opposed to the standard deviation used in [7].

3.4 Loss

The loss function used in our approach consists of several components, each of which is explained in detail below.
Balanced Classification Loss. We employ the weighted focal loss [36], which is designed to handle imbalanced data more effectively than the standard cross-entropy loss. This loss component is calculated as follows:

$$FL(q_o) = -(1-q_o)^{\gamma}\log(q_o) \tag{1}$$
$$\mathcal{L}_{cls} = \frac{1}{\sum_{t=1}^{T} m_t}\sum_{t=1}^{T} m_t \cdot w_{\tilde{y_t}} \cdot FL(P_{t,\tilde{y_t}}) \tag{2}$$

where $m_t=1$ if a label (either ground truth or pseudo-label) exists for the $t$-th frame and 0 otherwise. $T$ is the total number of frames in the video, $P_{t,\tilde{y_t}}$ is the predicted probability of the model for the $t$-th frame belonging to the label class $\tilde{y_t}$, and $\gamma$ is a hyper-parameter that controls the focus on hard examples. $w_{\tilde{y_t}}$ is an inverse class frequency weight [45] calculated as follows:

$$w_c = \frac{N}{N_c} \tag{3}$$

where $N$ is the total number of annotated frames and $N_c$ is the number of annotated frames belonging to the $c$-th phase option. This weighting scheme helps to mitigate the impact of class imbalance during training.
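As an illustration of Eqs. 1 to 3, the following pure-Python sketch computes the inverse class frequency weights and the masked, weighted focal loss for a single video. It is a sketch under stated assumptions, not the training code: the function names are hypothetical, phases are 0-indexed, unlabeled frames are marked with `None` (playing the role of $m_t=0$), classes without any annotated frame receive weight 0, and the default `gamma=2.0` is a placeholder value.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """w_c = N / N_c over annotated frames (Eq. 3); None marks unlabeled frames."""
    annotated = [y for y in labels if y is not None]
    counts = Counter(annotated)
    n = len(annotated)
    # Assumption: a class with no annotated frame gets weight 0.0.
    return [n / counts[c] if counts[c] else 0.0 for c in range(num_classes)]

def focal(q, gamma):
    """FL(q) = -(1 - q)^gamma * log(q) (Eq. 1); q is clamped to avoid log(0)."""
    q = max(q, 1e-12)
    return -((1.0 - q) ** gamma) * math.log(q)

def classification_loss(probs, labels, weights, gamma=2.0):
    """Masked, class-weighted focal loss averaged over labeled frames (Eq. 2)."""
    labeled = [(p, y) for p, y in zip(probs, labels) if y is not None]
    if not labeled:
        return 0.0
    total = sum(weights[y] * focal(p[y], gamma) for p, y in labeled)
    return total / len(labeled)

# Toy example: 4 frames, 2 phases, one frame without a label.
labels = [0, None, 0, 1]
weights = inverse_frequency_weights(labels, 2)
probs = [[0.5, 0.5]] * 4
loss = classification_loss(probs, labels, weights)
```

With `gamma=0` the focal term reduces to the usual negative log-likelihood, so the sketch then behaves as a class-weighted cross-entropy.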

Entropy Loss aims to encourage the model to be more certain about its predictions. It is calculated as follows:

$$l_{\text{Entropy}}(q) = H(q) = -\sum_{j=1}^{C} q_j \cdot \log(q_j) \tag{4}$$
$$\mathcal{L}_{\text{Entropy}} = \frac{1}{\sum_{t=1}^{T} m_t}\sum_{t=1}^{T} m_t \cdot l_{\text{Entropy}}(p_t) \tag{5}$$
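Eqs. 4 and 5 can be sketched in a few lines of pure Python. The function names are hypothetical, and the convention $0 \cdot \log 0 = 0$ used to skip zero probabilities is an assumption of this sketch.

```python
import math

def entropy(q):
    """H(q) = -sum_j q_j * log(q_j) (Eq. 4), skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in q if p > 0.0)

def entropy_loss(probs, mask):
    """Average entropy over frames with m_t = 1 (Eq. 5)."""
    n = sum(mask)
    if n == 0:
        return 0.0
    return sum(m * entropy(p) for p, m in zip(probs, mask)) / n

# Three frames; the last one has no label and is masked out.
probs = [[0.5, 0.5], [1.0, 0.0], [0.25, 0.75]]
mask = [1, 1, 0]
loss = entropy_loss(probs, mask)
```

A one-hot prediction contributes zero, while a uniform prediction contributes the maximum value $\log C$, so minimizing this term pushes labeled frames toward confident predictions.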

Confidence loss is adopted from Li et al. [27]. This loss encourages the model to predict a monotonic increase and decrease in the confidence of a phase around a ground truth prediction. This loss helps to suppress outliers and enforce temporal consistency in the model’s predictions. The loss is calculated as follows:

$$\delta_{t,y_{t_i}} = \begin{cases}\max\left(0, \log(P_{t,y_{t_i}}) - \log(P_{t-1,y_{t_i}})\right) & \text{if } t \geq t_i \\ \max\left(0, \log(P_{t-1,y_{t_i}}) - \log(P_{t,y_{t_i}})\right) & \text{if } t < t_i\end{cases} \tag{6}$$
$$\mathcal{L}_{conf} = \frac{1}{T'}\sum_{i=2}^{\text{TP}-1}\left(\sum_{t=t_{(i-1)}}^{t_{(i+1)}} \delta_{t,y_{t_i}}\right) \tag{7}$$

where $T_{sparse}=\{t_1,\ldots,t_{\text{TP}}\}$ is the set of time points of annotated frames, TP is the number of annotated time points, and $T'=2(t_{\text{TP}}-t_1)$.
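The double sum in Eqs. 6 and 7 is easier to follow in code. The sketch below is a direct, unoptimized transcription in pure Python, with a hypothetical function name and 0-indexed frames; it assumes strictly positive probabilities and skips $t=0$, where frame $t-1$ does not exist.

```python
import math

def confidence_loss(probs, timestamps):
    """probs[t][c] are frame-wise probabilities; timestamps is a sorted list of
    (frame index, label) pairs for the annotated frames."""
    tp = len(timestamps)
    if tp < 3:
        return 0.0  # the outer sum over i = 2 .. TP-1 is empty
    t_norm = 2 * (timestamps[-1][0] - timestamps[0][0])  # T' in Eq. 7
    total = 0.0
    for i in range(1, tp - 1):  # the paper's i = 2 .. TP-1 (1-indexed)
        t_i, y = timestamps[i]
        lo = max(timestamps[i - 1][0], 1)  # frame t-1 must exist
        hi = timestamps[i + 1][0]
        for t in range(lo, hi + 1):
            diff = math.log(probs[t][y]) - math.log(probs[t - 1][y])
            # From t_i onward, penalize increases; before t_i, penalize decreases.
            total += max(0.0, diff) if t >= t_i else max(0.0, -diff)
    return total / t_norm

# Confidence of phase 1 rises toward the annotated frame 2 and falls afterwards.
p1 = [0.1, 0.9, 0.9, 0.5, 0.1]
probs = [[1.0 - p, p] for p in p1]
loss = confidence_loss(probs, [(0, 0), (2, 1), (4, 0)])
```

In this toy case the loss is zero, because the confidence for the annotated phase never decreases before the timestamp and never increases from the timestamp onward.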

Smoothness loss is designed to encourage the model to predict a smooth phase segmentation, penalizing changes between neighboring frames. This is done based on a truncated mean squared error over the log probabilities as done in [46, 27, 7]. The smoothness loss is calculated as follows:

$$\Delta_{t,c} = \left|\log(P_{t,c}) - \log(P_{t-1,c})\right| \tag{8}$$
$$\tilde{\Delta}_{t,c} = \begin{cases}\Delta_{t,c} & \Delta_{t,c} \leq \tau_S \\ \tau_S & \text{otherwise}\end{cases} \tag{9}$$
$$\mathcal{L}_S = \frac{1}{(T-1)\cdot C}\sum_{t=2}^{T}\sum_{c=1}^{C}\tilde{\Delta}_{t,c}^{2} \tag{10}$$

where $\tau_S$ is a threshold hyper-parameter.
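A minimal pure-Python sketch of Eqs. 8 to 10 follows. The function name is hypothetical and the probabilities are assumed to be strictly positive so that the logarithms are defined.

```python
import math

def smoothness_loss(probs, tau):
    """Truncated mean squared error over log-probability differences."""
    num_frames, num_classes = len(probs), len(probs[0])
    if num_frames < 2:
        return 0.0
    total = 0.0
    for t in range(1, num_frames):
        for c in range(num_classes):
            delta = abs(math.log(probs[t][c]) - math.log(probs[t - 1][c]))  # Eq. 8
            total += min(delta, tau) ** 2  # truncation at tau (Eq. 9), then square
    return total / ((num_frames - 1) * num_classes)  # Eq. 10

# An abrupt flip between two frames: both differences exceed tau = 1 and are clipped.
flip = smoothness_loss([[0.9, 0.1], [0.1, 0.9]], tau=1.0)
flat = smoothness_loss([[0.5, 0.5], [0.5, 0.5]], tau=1.0)
```

The truncation bounds the penalty of any single change, so genuine phase transitions are not punished without limit while small frame-to-frame fluctuations are still discouraged.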

Star Temporal Classification (STC) loss, which is a variation of the CTC [47] loss, is used to encourage the model to predict the correct order of the phases. CTC loss aims to tackle misalignment between input and output sequences of varying lengths, and is widely used for automatic speech recognition [48] and handwritten text recognition [49]. CTC introduces an additional $blank$ token and a collapse function $B(\cdot)$ that maps frame-wise predictions to dense predictions by removing $blank$ tokens and repeated predictions. Let $X$ be a video containing $T$ frames and a segment-based labeling $y_{\text{seg}}\in\mathcal{P}^{PN}$.

The CTC loss can be expressed as follows:

\begin{align}
\mathcal{L}_{CTC} &= -\ln P(y_{\text{seg}}|X) \tag{11} \\
&= -\ln \sum_{\{\pi | B(\pi)=y_{\text{seg}}\}} P(\pi|X) \tag{12}
\end{align}

where $\pi=[\pi_1,\ldots,\pi_T]\in\mathcal{P}^{T}$ represents a sparse phase prediction that would have collapsed to the correct segment-based labeling $y_{\text{seg}}$. Using the CTC’s assumption of conditional independence,

$$P(\pi|X)=\prod_{t=1}^{T}P(\pi_t|t,X) \tag{13}$$

where $P(\pi_t|t,X)$ are predicted by the model. The CTC loss can be efficiently calculated using dynamic programming. CTC can also be calculated using weighted finite-state transducers [50]. STC aims to allow learning from weakly supervised labels. STC is based on adding an additional $star$ token that can represent a missing label. This idea allows handling of a flexible number of missing tokens, while encouraging the model to predict the correct phase order.
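To illustrate Eqs. 11 to 13, the sketch below implements the collapse function $B(\cdot)$ and evaluates the plain CTC loss by brute-force enumeration of all frame-wise paths. This is only feasible for toy inputs and is not the dynamic-programming or transducer-based computation mentioned above; it also does not include the $star$ token of STC. The names and the choice of index 0 for the $blank$ token are assumptions of this sketch.

```python
import itertools
import math

BLANK = 0  # assumption: index 0 is reserved for the blank token

def collapse(path):
    """B(.): merge consecutive repeated tokens, then drop blank tokens."""
    out, prev = [], None
    for tok in path:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out

def ctc_loss_bruteforce(probs, target):
    """-ln of the summed probability of every path that collapses to target.
    probs[t][v] is P(pi_t = v | t, X); target must be reachable in len(probs) frames."""
    num_frames, vocab = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(vocab), repeat=num_frames):
        if collapse(path) == list(target):
            # Conditional independence (Eq. 13): multiply per-frame probabilities.
            total += math.prod(probs[t][path[t]] for t in range(num_frames))
    return -math.log(total)

# Two frames, vocabulary {blank, phase 1}, uniform predictions, target [1].
loss = ctc_loss_bruteforce([[0.5, 0.5], [0.5, 0.5]], [1])
```

In the toy case, three of the four possible paths, (blank, 1), (1, blank), and (1, 1), collapse to the target, so the summed probability is 0.75.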

To obtain $y_{\text{seg}}$, we use $\tilde{y}=\{\tilde{y_t} : 1\leq t\leq T \text{ and } \tilde{y_t} \text{ exists}\}$ and remove consecutive duplicate labels. In experiments where STC is used, we add an additional $blank$ phase that the model can predict. Model predictions from a close neighboring frame are copied after the STC loss is calculated to allow correct flow of the method.
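The construction of $y_{\text{seg}}$ described above amounts to dropping unlabeled frames and merging consecutive duplicates. A minimal sketch, with a hypothetical function name and `None` marking frames that have neither a ground-truth label nor a pseudo-label:

```python
from itertools import groupby

def segment_labels(labels):
    """Keep the available labels in temporal order and merge consecutive duplicates."""
    available = [y for y in labels if y is not None]
    return [phase for phase, _ in groupby(available)]

# Frames without a label are skipped; a phase may reappear later in the video.
y_seg = segment_labels([None, 1, 1, None, 2, None, None, 1, 3, 3])
```

Because only consecutive duplicates are merged, a phase that recurs after a different phase appears again in the resulting sequence.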

The total loss is a weighted sum of the individual loss components:

$$\mathcal{L}=\mathcal{L}_{cls}+\alpha_1\mathcal{L}_S+\alpha_2\mathcal{L}_{\text{Entropy}}+\alpha_3\mathcal{L}_{conf}+\alpha_4\mathcal{L}_{stc} \tag{14}$$

where $\alpha_1,\alpha_2,\alpha_3,$ and $\alpha_4$ are hyper-parameters.

3.5 Additional training stage

Figure 2: Illustration of the uncertainty measure used to identify phase transition events with different temperature scaling values (T) on a Cholec80 video. The black vertical lines represent a surrounding window of 2W frames centered on the ground truth transition events between surgical phases. Lower temperature values result in a more stable uncertainty measure, enabling more robust transition event detection compared to using no temperature scaling (T=1). The red horizontal line represents the uncertainty threshold.

Using the trained model M()𝑀M(\cdot)italic_M ( ⋅ ), we create fixed partial pseudo-labels that are then used to train a new model.

When generating those pseudo-labels, we aim to utilize all of the model’s predictions, except for those corresponding to transition moments and their surrounding frames. To achieve this, we first need to detect transition moments. We use a temperature-scaled entropy measure for this purpose:

$$U(l) = H\left(\operatorname{softmax}\left(\frac{l}{M_{T}}\right)\right) \tag{15}$$

where $l$ denotes the model’s output logits, $M_{T}$ is the scaling temperature, and $H$ is the entropy measure defined in Eq. 4. We use $0 < M_{T} < 1$ to push the model’s outputs toward the extremes of the probability range. With this estimator, transition events can be detected using a simple threshold $\tau_{transition}$, as illustrated in Fig. 2. We then take all predictions that are not within a window of $W$ frames of a phase transition event and use them as pseudo-labels for further training. The new model is trained using an unweighted focal loss combined with a smoothness loss component.
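The pseudo-label generation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: frames whose uncertainty exceeds the threshold are treated as transition events, "within a window of $W$ frames" is read as an index distance of at most $W$, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uncertainty(logits, temperature):
    """Temperature-scaled entropy of Eq. 15."""
    return entropy(softmax(logits, temperature))


def partial_pseudo_labels(frame_logits, temperature, tau_transition, window):
    """Return one label per frame; None marks frames left unlabeled.

    Assumption: a frame is a transition event when its uncertainty exceeds
    `tau_transition`, and every frame at most `window` indices away from a
    transition event is excluded from the pseudo-labels.
    """
    scores = [uncertainty(l, temperature) for l in frame_logits]
    labels = [max(range(len(l)), key=l.__getitem__) for l in frame_logits]
    for t, score in enumerate(scores):
        if score > tau_transition:
            start = max(0, t - window)
            stop = min(len(labels), t + window + 1)
            for i in range(start, stop):
                labels[i] = None
    return labels


# Toy video: three confident frames of phase 0, one ambiguous frame,
# then three confident frames of phase 1.
frames = [[4.0, 0.0, 0.0]] * 3 + [[1.0, 1.0, 0.0]] + [[0.0, 4.0, 0.0]] * 3
print(partial_pseudo_labels(frames, temperature=0.25, tau_transition=0.5, window=1))
# prints [0, 0, None, None, None, 1, 1]
```

In the toy example, dividing the logits by 0.25 drives the entropy of the confident frames close to zero while the ambiguous frame stays well above the threshold, so only the ambiguous frame and its immediate neighbors are left unlabeled.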

3.6 Implementation Details

All experiments were performed on NVIDIA A100 Tensor Core and NVIDIA RTX 6000 Ada Generation GPUs. Both datasets were down-sampled to 1 fps. We follow Ramesh et al. [18] for the fine-tuning configuration of the self-supervised feature extractor. We used the Adam optimizer with an initial learning rate of 5e-4 to train the temporal model for 50 epochs. We resize the images to a 224 × 399 resolution before passing them through our feature extractor. We employ the TCN-based model suggested by Li et al. [27] as our temporal model. We use $W=25$ and $\tau_{S}=16$ for all experiments.

4 Experiments

4.1 Datasets

Cholec80 [4] contains 80 videos of laparoscopic cholecystectomy procedures, with an average duration of 38 minutes. The dataset contains 7 types of phases. It was recorded at 25 fps with a resolution of 854x480 or 1920x1080. We follow Ding et al. [7] and use the first 40 videos as a training set and the remaining videos as a test set.
MultiBypass140 [51] contains 140 videos of laparoscopic Roux-en-Y gastric bypass surgeries, with an average duration of 91 minutes. The dataset contains 12 types of phases. It was recorded at 25 fps in two medical centers with a resolution of 720x576, 854x480, or 1920x1080. We follow Lavanchy and Ramesh et al. [51] and split the dataset into three parts, training, validation, and test, which contain 80, 20, and 40 videos, respectively.

4.2 Evaluation Metrics

We follow previous works [19, 7, 15] and report frame-level evaluation metrics: accuracy (AC), precision (PR), recall (RE), Jaccard (JA), and F1. For each phase, given the prediction P and the ground truth GT, PR, RE, JA, and F1 are calculated as follows:

$$\begin{split} PR = \frac{|GT \cap P|}{|P|}, \quad RE = \frac{|GT \cap P|}{|GT|}, \\ JA = \frac{|GT \cap P|}{|GT \cup P|}, \quad F1 = \frac{2 \cdot PR \cdot RE}{RE + PR} \end{split} \tag{16}$$

The scores for each phase are averaged across all the phases in a video. Accuracy is calculated globally for each video. When evaluating on Cholec80, we follow [18, 7] and report 10-second ’relaxed’ metrics, meaning that both adjacent phases are accepted as correct within the 10-second window surrounding each phase transition.
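The per-phase metrics of Eq. 16 and the per-video accuracy can be sketched on frame-level label sequences as below. This is an illustrative sketch only: it omits the relaxed boundary handling used for Cholec80, and returning 0 when a denominator is empty is our assumption rather than part of the evaluation protocol.

```python
def phase_metrics(gt, pred, phase):
    """Precision, recall, Jaccard, and F1 of one phase (Eq. 16).

    `gt` and `pred` are equal-length sequences of per-frame phase labels.
    Assumption: a metric with an empty denominator is reported as 0.0.
    """
    gt_frames = {i for i, g in enumerate(gt) if g == phase}
    pred_frames = {i for i, p in enumerate(pred) if p == phase}
    inter = len(gt_frames & pred_frames)
    union = len(gt_frames | pred_frames)
    pr = inter / len(pred_frames) if pred_frames else 0.0
    re = inter / len(gt_frames) if gt_frames else 0.0
    ja = inter / union if union else 0.0
    f1 = 2 * pr * re / (re + pr) if (re + pr) > 0 else 0.0
    return pr, re, ja, f1


def video_accuracy(gt, pred):
    """Fraction of frames in a video whose predicted phase is correct."""
    return sum(g == p for g, p in zip(gt, pred)) / len(gt)


# Toy video with two phases; one frame of phase 0 is mislabeled as phase 1.
gt = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
print(phase_metrics(gt, pred, phase=0))  # PR = 1.0, RE = JA = 2/3, F1 = 0.8
print(video_accuracy(gt, pred))          # 5 of 6 frames are correct
```

Averaging the per-phase tuples over the phases of a video then gives the video-level PR, RE, JA, and F1.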

4.3 Sampling Distribution

Figure 3 presents the distribution of per-frame phase annotations in the Cholec80 dataset. The full distribution, shown in yellow, illustrates a significant imbalance between classes, with some phases occurring much more frequently than others. This imbalance poses a challenge for surgical phase recognition models, as they may struggle to accurately classify underrepresented phases. In contrast, the timestamp labels, depicted in pink, are distributed more evenly across the phases. This suggests that while the duration of phases may vary, the number of phase occurrences is relatively balanced in the dataset. The more uniform distribution of timestamp labels can be attributed to the fact that they are selected based on the presence of each phase rather than its duration. To create SkipTag@K annotations, we split the video into K equal partitions and uniformly sample a single frame from each partition. The blue shades represent the SkipTag@K sampling method with K values of 2, 4, and 7, where K denotes the fixed number of frames annotated per video. As K increases, the SkipTag@K distributions more closely resemble the full distribution, indicating that this sampling strategy effectively captures the original data distribution. The similarity between the SkipTag@7 and full distributions suggests that annotating just 7 frames per video can provide a representative sample of the phase distribution in the Cholec80 dataset.

Figure 3: Distribution of per-frame phase annotations in the Cholec80 dataset. The full distribution (yellow) reveals significant class imbalance, while the timestamp labels (pink) are more evenly distributed. SkipTag@K sampling with K=2, 4, and 7 (blue shades) effectively captures the original data distribution, with increasing similarity to the full distribution as K increases.
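The SkipTag@K sampling procedure described in this section can be sketched in a few lines. The function name `skiptag_sample` and the use of integer boundaries for the K partitions are illustrative assumptions; the annotator would then label the phase of each returned frame.

```python
import random


def skiptag_sample(num_frames, k, rng=None):
    """Pick one frame index uniformly at random from each of k partitions.

    The video is split into k contiguous partitions of (nearly) equal
    length; partition i covers indices [i*n//k, (i+1)*n//k).
    """
    if not 1 <= k <= num_frames:
        raise ValueError("k must be between 1 and the number of frames")
    rng = rng or random.Random()
    bounds = [i * num_frames // k for i in range(k + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(k)]


# A 38-minute video at 1 fps has 2280 frames; sample 7 frames to annotate.
indices = skiptag_sample(num_frames=2280, k=7, rng=random.Random(0))
```

Since each partition contributes exactly one frame, long phases are more likely to be sampled than short ones, which is why the resulting label distribution approaches the full per-frame distribution as K grows.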

4.4 Robustness to Missing Phase Annotations

Method | Cholec80: RE (%) PR (%) JA (%) AC (%) F1 (%) | MultiBypass140: RE (%) PR (%) JA (%) AC (%) F1 (%)
Timestamp supervision (miss rate = 0)
Ding et al. [7] | 90.5±5.9 89.5±4.4 79.9±8.5 91.9±5.6 - | 74.3±10.0 75.7±13.2 65.2±11.6 87.2±10.6 72.1±11.3
Ours | 97±9.8 84.8±6.9 76.1±8.8 90.4±5.9 87.9±5.4 | 78.3±11.7 75.9±11.5 67.8±13 88.4±9.7 74.6±12.3
Miss rate = 0.1
Ding et al. [7] | 86.4±11.5 87.9±6.7 74.9±8.0 86.50±9.6 86.5±6.7 | 75.1±10.1 75.2±12.6 65.1±12.0 86.3±10.0 72.1±11.3
Ours | 97.3±9.3 84.1±6.8 75.4±8.4 89.8±5.9 87.4±5.8 | 76.9±11.3 73.9±12.4 65.9±13.7 87.5±9.9 72.8±13
Miss rate = 0.2
Ding et al. [7] | 81.8±19.2 85.2±8.45 67.8±14.0 84±8.8 81.6±8.45 | 70.5±13.0 69.3±13.9 58.5±12.6 82.5±10.3 66.4±13.1
Ours | 96.8±9.3 81.9±7.2 72.7±9.15 86.9±7.4 84.9±6.5 | 75.7±12.2 73.1±12.3 64.8±13.6 87±9.8 71.7±13
Miss rate = 0.3
Ding et al. [7] | 76.6±16.1 80.2±9.6 58.9±13.5 78±12.1 77.1±9.6 | 61.4±11.3 60.6±12.2 46.7±11.3 71.5±14.3 55.7±11.5
Ours | 96.5±9.5 82.4±7.4 73.2±9 87.4±7.5 85.5±5.8 | 76.7±12.4 72.2±13 63.7±14.6 85.7±9.5 71.1±14

Table 1: Robustness to missing phase annotations

To demonstrate the impact of missing phase annotations and the robustness of our method, we compare our approach with Ding et al.’s method [7]. We simulated missing phase annotations by randomly removing phase labels from the timestamp annotations with varying miss rate probabilities $p_{m}$. As shown in Table 1, Ding et al.’s method is significantly affected by missing phase annotations. This is evident from the increased standard deviations in the evaluation metrics on the Cholec80 dataset and the substantial drop in performance, ranging from a 10% to a 28% decline when comparing the timestamp results ($p_{m}=0$) to the results with $p_{m}=0.3$. In contrast, our method’s performance declines only slightly. It achieves competitive results on the Cholec80 dataset in the timestamp setup, surpasses Ding et al.’s method on the MultiBypass140 dataset in the timestamp setup, and consistently outperforms their approach on both datasets across all missing rate settings on almost all metrics.
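The missing-annotation simulation can be illustrated with the sketch below. It assumes that each timestamp label is dropped independently with probability $p_{m}$, which is one plausible reading of the procedure; the function name `drop_timestamps` and the `(frame_index, phase)` representation are hypothetical.

```python
import random


def drop_timestamps(timestamps, miss_rate, rng=None):
    """Remove each timestamp label independently with probability `miss_rate`.

    `timestamps` is a list of (frame_index, phase) pairs, one per annotated
    phase occurrence. The surviving pairs are returned in their original order.
    """
    rng = rng or random.Random()
    return [ts for ts in timestamps if rng.random() >= miss_rate]


# Toy timestamp annotation for one video with seven phases.
annotation = [(120, 0), (610, 1), (1300, 2), (1750, 3), (1900, 4), (2050, 5), (2200, 6)]
partial = drop_timestamps(annotation, miss_rate=0.3, rng=random.Random(0))
```

With a miss rate of 0 the annotation is returned unchanged, which corresponds to the standard timestamp supervision setup.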

Figure 4 further illustrates the performance comparison between our method and Ding et al.’s approach under different missing rate probabilities.

Figure 4: Robustness comparison between our method and Ding et al.’s approach under different missing rate probabilities for the Cholec80 and MultiBypass140 datasets. The six subfigures depict the evaluation metrics of accuracy, Jaccard index, and F1 score for both datasets. As the missing rate increases, Ding et al.’s method experiences a significant drop in performance across all metrics on both datasets, while our method maintains stable performance with only a slight decline.

4.5 SkipTag@K Evaluation

In this section, we evaluate the performance of our model in the challenging SkipTag@K setup, where only K frames from each video are annotated. We present results for three K values relative to the average number of phases $N_{avg}$ in the corresponding dataset. The K values are $\mathrm{round}(N_{avg})$, $\lceil \frac{N_{avg}}{2} \rceil$, and $\lceil \frac{N_{avg}}{4} \rceil$, where $N_{avg}$ is 6.8 for Cholec80 and 9.3 for MultiBypass140.
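For reference, the three K values per dataset follow directly from these formulas; the helper name `skiptag_k_values` below is only for illustration.

```python
import math


def skiptag_k_values(n_avg):
    """Return (round(N_avg), ceil(N_avg / 2), ceil(N_avg / 4))."""
    return (round(n_avg), math.ceil(n_avg / 2), math.ceil(n_avg / 4))


print(skiptag_k_values(6.8))  # Cholec80: prints (7, 4, 2)
print(skiptag_k_values(9.3))  # MultiBypass140: prints (9, 5, 3)
```

These are the SkipTag@7, @4, @2 and SkipTag@9, @5, @3 setups reported in Tables 2 and 3.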

Setup Method RE (%) PR (%) JA (%) AC (%) F1 (%)
SkipTag@7 Ding et al. [7] 86.5±11.5 87.9±6.7 74.9±8.0 86.5±9.6 86.5±6.7
Ours 87.8±11.5 88.1±8.5 73.6±11.9 91.2±5.9 86.4±7.8
SkipTag@4 Ding et al. [7] 50.4±32.0 84.4±13.3 40.6±24.7 76.7±9.8 57.1±13.3
Ours 84.0±13.0 82.3±10.4 66.9±13.5 88.9±7.6 84.1±9.7
SkipTag@2 Ding et al. [7] 32.1±39.3 88.0±17.7 23.8±26.0 65.6±12.1 39.1±17.7
Ours 71.2±12.2 81.0±9.0 57.9±11.1 83.6±7.1 78.1±7.9
Table 2: SkipTag@K results on the Cholec80 dataset
Setup Method RE (%) PR (%) JA (%) AC (%) F1 (%)
SkipTag@9 Ding et al. [7] 46.6±7.3 44.1±10.1 37.9±9.4 76.5±11.3 42.8±8.9
Ours 77.8±10.5 75.8±10.6 68.3±12.3 88.7±10.5 74.7±11.2
SkipTag@5 Ding et al. [7] 31.5±5.8 26.8±8.3 22.2±6.8 64.4±11.7 26.8±6.8
Ours 74.9±11.9 75.0±13.3 66.3±14.3 87.7±11.7 72.7±13.4
SkipTag@3 Ding et al. [7] 25.0±8.4 21.5±11.2 16.1±8.3 49.9±16.1 20.3±9.1
Ours 68.6±12.8 65.7±14.4 58.5±14.4 85.1±12.3 65.0±14.3
Table 3: SkipTag@K results on the MultiBypass140 dataset

Tables 2 and 3 demonstrate that our model achieves impressive results in the SkipTag@K setup, despite the limited number of annotated frames. Remarkably, the decline in performance when reducing the number of annotated samples is relatively small compared to the reduction in the number of samples itself. For instance, on both datasets, the F1 measure drops by less than 13% when comparing the results obtained with the largest tested K value to those obtained with the lowest tested K value, even though the larger number of samples is at least 3 times as large. Similarly, the accuracy drops by less than 9% under the same conditions. In contrast, Ding et al.’s method experiences a drastic performance drop when the number of annotated samples is reduced.

4.6 Ablations

4.6.1 Component Analysis

Method RE (%) PR (%) JA (%) AC (%) F1 (%)
Base 34.1±17.5 54.3±15.8 22.6±15.2 50.9±17.1 58±17.8
+DINO FE 28.3±2.7 67.3±10.6 18.7±4.0 67.7±10.9 72.1±14.9
+Conf loss 37.8±7.3 77.4±9.5 27.9±6.2 70.9±9.5 78±11.0
+Focal loss 63±11.0 76.6±12.4 51.4±10.7 81.8±9.1 77.1±8.8
+Loss reweighting 81.5±10.0 86.8±9.4 68.1±9.9 89.2±6.6 84.6±6.7
+STC loss 86.8±10.6 88.1±7.2 72.5±11.0 90.8±5.6 85.6±6.9
+Additional training 87.8±11.5 88.1±8.5 73.6±11.9 91.2±5.9 86.4±7.8
Table 4: Ablation study demonstrating the impact of each component in our method, using the SkipTag@7 setup on the Cholec80 dataset.

Table 4 presents an ablation study that highlights the contribution of each component in our method, using the SkipTag@7 setup on the Cholec80 dataset. The first modification is the replacement of the feature extractor from a ResNet-50 model pretrained on ImageNet to a self-supervised DINO model fine-tuned on the relevant dataset. This change leads to significant improvements across most metrics and reduces the model’s standard deviation, indicating more robust performance. Subsequent modifications demonstrate the impact of various loss functions and training strategies. The introduction of the confidence loss, focal loss, loss re-weighting, and STC loss all contribute to incremental improvements in the model’s performance. The final component is the additional training phase, which utilizes the generated pseudo-labels to further refine the model. This step results in further performance gains, achieving the highest scores across all evaluation metrics.

4.6.2 Additional training stage

RE (%) PR (%) JA (%) AC (%) F1 (%)
Timestamp supervision Base 97.5±9.2 84.9±7.6 76.4±10.3 90.3±6.8 87.9±6.9
$M_{T}$=0.5 97.1±9.6 82.9±7.4 73.9±9.3 88.7±6.2 85.9±6.7
$M_{T}$=0.25 97.0±9.8 84.8±6.9 76.1±8.8 90.4±5.9 87.9±5.4
Missing 0.1 Base 96.5±9.1 82.5±7.4 73.0±9.2 87.8±6.7 85.4±6.6
$M_{T}$=0.5 96.6±9.6 82.7±6.9 73.5±8.6 88.3±6.2 85.5±6.2
$M_{T}$=0.25 97.3±9.3 84.1±6.8 75.4±8.4 89.8±5.9 87.4±5.8
Missing 0.2 Base 96.8±9.3 81.9±7.2 72.7±9.1 86.9±7.4 84.9±6.5
$M_{T}$=0.5 96.7±9.1 82.0±6.6 72.5±8.3 87.3±6.2 85.0±5.6
$M_{T}$=0.25 96.6±9.1 83.2±6.2 74.0±8.0 88.0±6.6 85.8±5.3
Missing 0.3 Base 96.7±9.3 80.7±7.9 71.3±10.4 85.4±8.9 83.2±8.7
$M_{T}$=0.5 96.1±9.6 80.9±7.1 71.5±9.4 86.5±7.2 84.3±6.1
$M_{T}$=0.25 96.5±9.5 82.4±7.4 73.2±9.0 87.4±7.5 85.5±5.8
SkipTag@7 Base 86.8±10.6 88.1±7.2 72.5±11.0 90.8±5.6 85.6±6.9
$M_{T}$=0.5 83.5±12.7 83.8±10.7 69.2±13.3 88.4±8.2 84.7±8.1
$M_{T}$=0.25 87.8±11.5 88.1±8.5 73.6±11.9 91.2±5.9 86.4±7.8
SkipTag@4 Base 83.0±12.1 84.0±7.7 66.0±11.8 88.3±6.3 80.7±9.9
$M_{T}$=0.5 53.9±10.5 76.3±11.9 40.6±9.1 80.1±7.1 81.5±12.1
$M_{T}$=0.25 84.0±13.0 82.3±10.4 66.9±13.5 88.9±7.6 84.1±9.7
SkipTag@2 Base 73.8±12.5 81.6±9.4 59.4±12.7 83.4±7.8 77.4±9.1
$M_{T}$=0.5 69.8±11.9 78.1±10.3 54.9±11.5 81.4±8.2 74.8±9.2
$M_{T}$=0.25 71.2±12.2 81.0±9.0 57.9±11.1 83.6±7.1 78.1±7.9
Table 5: Impact of temperature scaling on the additional training stage for the Cholec80 dataset.
RE (%) PR (%) JA (%) AC (%) F1 (%)
Timestamp supervision Base 72.6±10.0 70.0±10.6 61.9±12.0 86.7±10.0 69.0±11.3
$M_{T}$=0.5 78.3±11.7 75.9±11.5 67.8±13.0 88.4±9.7 74.6±12.3
$M_{T}$=0.25 77.2±12.0 75.9±12.2 67.2±13.2 88.1±9.8 73.9±12.8
Missing 0.1 Base 71.3±10.5 69.0±10.6 60.6±12.0 85.9±10.0 67.8±11.2
$M_{T}$=0.5 76.9±11.3 73.9±12.4 65.9±13.7 87.5±9.9 72.8±13.0
$M_{T}$=0.25 78.2±11.4 75.3±13.2 67.2±14.0 87.7±10.3 74.1±13.3
Missing 0.2 Base 73.9±12.3 72.1±12.4 62.7±13.4 85.4±10.7 70.1±13.0
$M_{T}$=0.5 75.7±12.2 73.1±12.3 64.8±13.6 87.0±9.8 71.7±13.0
$M_{T}$=0.25 71.1±11.9 72.1±13.3 62.0±13.6 85.1±12.6 68.9±13.0
Missing 0.3 Base 75.0±11.1 71.2±12.0 62.0±12.6 84.7±10.1 70.1±12.1
$M_{T}$=0.5 76.7±12.4 72.2±13.0 63.7±14.6 85.7±9.5 71.1±14.0
$M_{T}$=0.25 76.7±12.0 72.5±13.0 63.8±14.3 85.8±10.0 71.3±13.7
SkipTag@9 Base 72.8±10.7 73.4±11.5 63.9±13.0 87.6±10.4 70.5±12.2
$M_{T}$=0.5 77.8±10.5 75.8±10.6 68.3±12.3 88.7±10.5 74.7±11.2
$M_{T}$=0.25 77.7±10.3 75.7±10.8 68.1±12.1 88.7±10.5 74.5±11.1
SkipTag@5 Base 69.3±12.4 73.0±12.5 60.7±13.8 86.1±11.5 67.9±12.9
$M_{T}$=0.5 74.9±11.9 75.0±13.3 66.3±14.3 87.7±11.7 72.7±13.4
$M_{T}$=0.25 73.9±11.9 75.7±13.3 65.5±14.5 87.6±11.9 72.1±13.5
SkipTag@3 Base 62.7±11.6 62.2±13.8 52.7±13.3 83.1±11.8 59.2±12.9
$M_{T}$=0.5 68.6±12.8 65.7±14.4 58.5±14.4 85.1±12.3 65.0±14.3
$M_{T}$=0.25 67.6±12.8 65.4±14.3 57.3±14.9 84.5±11.8 63.8±14.6
Table 6: Impact of temperature scaling on the additional training stage for the MultiBypass140 dataset.

In this section, we focus on the pseudo-label generation mechanism and investigate the effect of the temperature $M_{T}$ it uses. Tables 5 and 6 present a comparison between the base model (before pseudo-label generation) and models trained on pseudo-labels generated with $M_{T}=0.25$ and $M_{T}=0.5$. The comparison is performed for various setups, including timestamp supervision, missing annotations, and SkipTag@K. As illustrated in Figure 2, lower temperature values result in more extreme uncertainty estimates, leading to a larger number of frames falling below the uncertainty threshold and consequently being annotated with pseudo-labels. For the Cholec80 dataset (Table 5), we observe that setting $M_{T}=0.25$ is preferred over setting $M_{T}=0.5$ across all supervision setups. The additional training stage improves the results in most setups.

For the MultiBypass140 dataset (Table 6), the optimal temperature value varies depending on the setup. However, both $M_{T}=0.25$ and $M_{T}=0.5$ outperform the base model in nearly all setups, the exception being $M_{T}=0.25$ at a miss rate of 0.2, demonstrating the effectiveness of the additional training stage across different setups.

5 Discussion and Conclusions

In this study, we focused on annotation-efficient techniques for surgical phase recognition. We addressed the challenge of missing phase annotations in the promising timestamp supervision approach and presented a robust method that is able to tackle this hurdle. This direction paves the way for timestamp annotation in a realistic setup where some phase annotations may be unintentionally omitted, further promoting its adoption as a viable alternative to the fully supervised setup. Additionally, we introduced SkipTag@K to the surgical domain, offering a flexible trade-off between annotation effort and model performance, and demonstrated the competitive results obtained with it. Remarkably, we achieved impressive performance on both datasets using as few as 2 annotated frames per video for Cholec80 and 3 for MultiBypass140, which correspond to only 0.037% and 0.0023% of the corresponding training sets, respectively.

While we successfully tackled the issue of missing annotation labels, our work did not address the problem of incorrect label assignments during the annotation process. This opens up an avenue for future research to investigate methods that can handle both missing and incorrect labels. In the SkipTag@K experiments, we employed a simple strategy of uniformly selecting a sample from each of the K equally divided video segments. Although this approach yielded impressive results, exploring more sophisticated sampling strategies, such as clustering-based techniques, could potentially lead to even better performance. Another potential future direction is to investigate the use of easily obtainable weak signals from the surgical procedure itself to efficiently achieve partial labeling. Such signals could include the use of specific surgical tools, changes in the surgical scene, or even audio cues from the operating room. By leveraging these inherent weak signals, we may be able to further reduce the annotation burden while maintaining high performance in surgical phase recognition.

References

  • [1] L. Maier-Hein, S. S. Vedula, S. Speidel, N. Navab, R. Kikinis, A. Park, M. Eisenmann, H. Feussner, G. Forestier, S. Giannarou et al., “Surgical data science for next-generation interventions,” Nature Biomedical Engineering, vol. 1, no. 9, pp. 691–696, 2017.
  • [2] L. Maier-Hein, M. Eisenmann, D. Sarikaya, K. März, T. Collins, A. Malpani, J. Fallert, H. Feussner, S. Giannarou, P. Mascagni et al., “Surgical data science–from concepts toward clinical translation,” Medical image analysis, vol. 76, p. 102306, 2022.
  • [3] K. C. Demir, H. Schieber, T. Weise, D. Roth, M. May, A. Maier, and S. H. Yang, “Deep learning in surgical workflow analysis: A review of phase and step recognition,” IEEE Journal of Biomedical and Health Informatics, 2023.
  • [4] A. P. Twinanda, S. Shehata, D. Mutter, J. Marescaux, M. De Mathelin, and N. Padoy, “Endonet: a deep architecture for recognition tasks on laparoscopic videos,” IEEE transactions on medical imaging, vol. 36, no. 1, pp. 86–97, 2016.
  • [5] D. Liu, Q. Li, T. Jiang, Y. Wang, R. Miao, F. Shan, and Z. Li, “Towards unified surgical skill assessment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 9522–9531.
  • [6] M. Komatsu, D. Kitaguchi, M. Yura, N. Takeshita, M. Yoshida, M. Yamaguchi, H. Kondo, T. Kinoshita, and M. Ito, “Automatic surgical phase recognition-based skill assessment in laparoscopic distal gastrectomy using multicenter videos,” Gastric Cancer, vol. 27, no. 1, pp. 187–196, 2024.
  • [7] X. Ding, X. Yan, Z. Wang, W. Zhao, J. Zhuang, X. Xu, and X. Li, “Less is more: Surgical phase recognition from timestamp supervision,” IEEE Transactions on Medical Imaging, 2023.
  • [8] O. Zisimopoulos, E. Flouty, I. Luengo, P. Giataganas, J. Nehme, A. Chow, and D. Stoyanov, “Deepphase: surgical phase recognition in cataracts videos,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part IV 11.   Springer, 2018, pp. 265–272.
  • [9] A. Kadkhodamohammadi, N. Sivanesan Uthraraj, P. Giataganas, G. Gras, K. Kerr, I. Luengo, S. Oussedik, and D. Stoyanov, “Towards video-based surgical workflow understanding in open orthopaedic surgery,” Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, vol. 9, no. 3, pp. 286–293, 2021.
  • [10] T. Blum, H. Feußner, and N. Navab, “Modeling and segmentation of surgical workflow from laparoscopic video,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2010: 13th International Conference, Beijing, China, September 20-24, 2010, Proceedings, Part III 13.   Springer, 2010, pp. 400–407.
  • [11] N. Padoy, T. Blum, S.-A. Ahmadi, H. Feussner, M.-O. Berger, and N. Navab, “Statistical modeling and recognition of surgical workflow,” Medical image analysis, vol. 16, no. 3, pp. 632–641, 2012.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [13] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
  • [14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [15] Y. Jin, Q. Dou, H. Chen, L. Yu, J. Qin, C.-W. Fu, and P.-A. Heng, “Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network,” IEEE transactions on medical imaging, vol. 37, no. 5, pp. 1114–1126, 2017.
  • [16] T. Czempiel, M. Paschali, M. Keicher, W. Simson, H. Feussner, S. T. Kim, and N. Navab, “Tecno: Surgical phase recognition with multi-stage temporal convolutional networks,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23.   Springer, 2020, pp. 343–352.
  • [17] T. Czempiel, M. Paschali, D. Ostler, S. T. Kim, B. Busam, and N. Navab, “Opera: Attention-regularized transformers for surgical phase recognition,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24.   Springer, 2021, pp. 604–614.
  • [18] S. Ramesh, V. Srivastav, D. Alapatt, T. Yu, A. Murali, L. Sestini, C. I. Nwoye, I. Hamoud, S. Sharma, A. Fleurentin et al., “Dissecting self-supervised learning methods for surgical computer vision,” Medical Image Analysis, vol. 88, p. 102844, 2023.
  • [19] X. Shi, Y. Jin, Q. Dou, and P.-A. Heng, “Semi-supervised learning with progressive unlabeled data excavation for label-efficient surgical workflow recognition,” Medical Image Analysis, vol. 73, p. 102158, 2021.
  • [20] T. Yu, D. Mutter, J. Marescaux, and N. Padoy, “Learning from a tiny dataset of manual annotations: a teacher/student approach for surgical phase recognition,” arXiv preprint arXiv:1812.00033, 2018.
  • [21] Y. Chen, Q. L. Sun, and K. Zhong, “Semi-supervised spatio-temporal cnn for recognition of surgical workflow,” EURASIP Journal on Image and Video Processing, vol. 2018, pp. 1–9, 2018.
  • [22] G. Yengera, D. Mutter, J. Marescaux, and N. Padoy, “Less is more: Surgical phase recognition with less annotations through self-supervised pre-training of cnn-lstm networks,” arXiv preprint arXiv:1805.08569, 2018.
  • [23] X. Shi, Y. Jin, Q. Dou, and P.-A. Heng, “Lrtd: Long-range temporal dependency based active learning for surgical workflow recognition,” International Journal of Computer Assisted Radiology and Surgery, vol. 15, pp. 1573–1584, 2020.
  • [24] S. Bodenstedt, D. Rivoir, A. Jenke, M. Wagner, M. Breucha, B. Müller-Stich, S. T. Mees, J. Weitz, and S. Speidel, “Active learning using deep bayesian networks for surgical workflow analysis,” International journal of computer assisted radiology and surgery, vol. 14, pp. 1079–1087, 2019.
  • [25] Y. Zhang, S. Bano, A.-S. Page, J. Deprest, D. Stoyanov, and F. Vasconcelos, “Retrieval of surgical phase transitions using reinforcement learning,” in International conference on medical image computing and computer-assisted intervention.   Springer, 2022, pp. 497–506.
  • [26] T. M. Ward, D. A. Hashimoto, Y. Ban, D. W. Rattner, H. Inoue, K. D. Lillemoe, D. L. Rus, G. Rosman, and O. R. Meireles, “Automated operative phase identification in peroral endoscopic myotomy,” Surgical endoscopy, vol. 35, pp. 4008–4015, 2021.
  • [27] Z. Li, Y. Abu Farha, and J. Gall, “Temporal action segmentation from timestamp supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8365–8374.
  • [28] N. Behrmann, S. A. Golestaneh, Z. Kolter, J. Gall, and M. Noroozi, “Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation,” in European Conference on Computer Vision.   Springer, 2022, pp. 52–68.
  • [29] H. Khan, S. Haresh, A. Ahmed, S. Siddiqui, A. Konin, M. Z. Zia, and Q.-H. Tran, “Timestamp-supervised action segmentation with graph convolutional networks,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2022, pp. 10619–10626.
  • [30] D. Du, E. Li, L. Si, F. Xu, and F. Sun, “Timestamp-supervised action segmentation from the perspective of clustering,” in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, E. Elkind, Ed.   International Joint Conferences on Artificial Intelligence Organization, Aug. 2023, pp. 690–698, Main Track. [Online]. Available: https://doi.org/10.24963/ijcai.2023/77
  • [31] J. Li and S. Todorovic, “Anchor-constrained viterbi for set-supervised action segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9806–9815.
  • [32] D.-A. Huang, L. Fei-Fei, and J. C. Niebles, “Connectionist temporal modeling for weakly supervised action labeling,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14.   Springer, 2016, pp. 137–153.
  • [33] Y. Souri, Y. A. Farha, E. Bahrami, G. Francesca, and J. Gall, “Robust action segmentation from timestamp supervision,” in 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, November 21-24, 2022.   BMVA Press, 2022. [Online]. Available: https://bmvc2022.mpi-inf.mpg.de/0392.pdf
  • [34] R. Rahaman, D. Singhania, A. Thiery, and A. Yao, “A generalized and robust framework for timestamp supervision in temporal action segmentation,” in European Conference on Computer Vision.   Springer, 2022, pp. 279–296.
  • [35] D. Singhania, R. Rahaman, and A. Yao, “Iterative contrast-classify for semi-supervised temporal action segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 2262–2270.
  • [36] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
  • [37] B. Zhang, A. Ghanem, A. Simes, H. Choi, and A. Yoo, “Surgical workflow recognition with 3dcnn for sleeve gastrectomy,” International Journal of Computer Assisted Radiology and Surgery, vol. 16, no. 11, pp. 2029–2036, 2021.
  • [38] S. Ramesh, D. Dall’Alba, C. Gonzalez, T. Yu, P. Mascagni, D. Mutter, J. Marescaux, P. Fiorini, and N. Padoy, “Multi-task temporal convolutional networks for joint recognition of surgical phases and steps in gastric bypass procedures,” International journal of computer assisted radiology and surgery, vol. 16, pp. 1111–1119, 2021.
  • [39] D. N. Le, H. X. Le, L. T. Ngo, and H. T. Ngo, “Transfer learning with class-weighted and focal loss function for automatic skin cancer classification,” arXiv preprint arXiv:2009.05977, 2020.
  • [40] R. Qin, K. Qiao, L. Wang, L. Zeng, J. Chen, and B. Yan, “Weighted focal loss: An effective loss function to overcome unbalance problem of chest x-ray14,” in IOP Conference Series: Materials Science and Engineering, vol. 428, no. 1.   IOP Publishing, 2018, p. 012022.
  • [41] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in international conference on machine learning.   PMLR, 2016, pp. 1050–1059.
  • [42] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in International conference on machine learning.   PMLR, 2017, pp. 1321–1330.
  • [43] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   IEEE, 2009, pp. 248–255.
  • [44] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660.
  • [45] V. Lertnattee and T. Theeramunkong, “Analysis of inverse class frequency in centroid-based text classification,” in IEEE International Symposium on Communications and Information Technology, 2004. ISCIT 2004., vol. 2.   IEEE, 2004, pp. 1171–1176.
  • [46] Y. A. Farha and J. Gall, “Ms-tcn: Multi-stage temporal convolutional network for action segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3575–3584.
  • [47] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369–376.
  • [48] J. Li et al., “Recent advances in end-to-end automatic speech recognition,” APSIPA Transactions on Signal and Information Processing, vol. 11, no. 1, 2022.
  • [49] W. AlKendi, F. Gechter, L. Heyberger, and C. Guyeux, “Advancements and challenges in handwritten text recognition: A comprehensive survey,” Journal of Imaging, vol. 10, no. 1, p. 18, 2024.
  • [50] A. Hannun, V. Pratap, J. Kahn, and W.-N. Hsu, “Differentiable weighted finite-state transducers,” arXiv preprint arXiv:2010.01003, 2020.
  • [51] J. L. Lavanchy, S. Ramesh, D. Dall’Alba, C. Gonzalez, P. Fiorini, B. P. Müller-Stich, P. C. Nett, J. Marescaux, D. Mutter, and N. Padoy, “Challenges in multi-centric generalization: phase and step recognition in roux-en-y gastric bypass surgery,” International journal of computer assisted radiology and surgery, pp. 1–9, 2024.