-
The Endoscapes Dataset for Surgical Scene Segmentation, Object Detection, and Critical View of Safety Assessment: Official Splits and Benchmark
Authors:
Aditya Murali,
Deepak Alapatt,
Pietro Mascagni,
Armine Vardazaryan,
Alain Garcia,
Nariaki Okamoto,
Guido Costamagna,
Didier Mutter,
Jacques Marescaux,
Bernard Dallemagne,
Nicolas Padoy
Abstract:
This technical report provides a detailed overview of Endoscapes, a dataset of laparoscopic cholecystectomy (LC) videos with highly intricate annotations targeted at automated assessment of the Critical View of Safety (CVS). Endoscapes comprises 201 LC videos with frames annotated sparsely but regularly with segmentation masks, bounding boxes, and CVS assessment by three different clinical experts…
▽ More
This technical report provides a detailed overview of Endoscapes, a dataset of laparoscopic cholecystectomy (LC) videos with highly intricate annotations targeted at automated assessment of the Critical View of Safety (CVS). Endoscapes comprises 201 LC videos with frames annotated sparsely but regularly with segmentation masks, bounding boxes, and CVS assessment by three different clinical experts. Altogether, there are 11090 frames annotated with CVS and 1933 frames annotated with tool and anatomy bounding boxes from the 201 videos, as well as an additional 422 frames from 50 of the 201 videos annotated with tool and anatomy segmentation masks. In this report, we provide detailed dataset statistics (size, class distribution, dataset splits, etc.) and a comprehensive performance benchmark for instance segmentation, object detection, and CVS prediction. The dataset and model checkpoints are publically available at https://github.com/CAMMA-public/Endoscapes.
△ Less
Submitted 26 January, 2024; v1 submitted 19 December, 2023;
originally announced December 2023.
-
Challenges in Multi-centric Generalization: Phase and Step Recognition in Roux-en-Y Gastric Bypass Surgery
Authors:
Joel L. Lavanchy,
Sanat Ramesh,
Diego Dall'Alba,
Cristians Gonzalez,
Paolo Fiorini,
Beat Muller-Stich,
Philipp C. Nett,
Jacques Marescaux,
Didier Mutter,
Nicolas Padoy
Abstract:
Most studies on surgical activity recognition utilizing Artificial intelligence (AI) have focused mainly on recognizing one type of activity from small and mono-centric surgical video datasets. It remains speculative whether those models would generalize to other centers. In this work, we introduce a large multi-centric multi-activity dataset consisting of 140 videos (MultiBypass140) of laparoscop…
▽ More
Most studies on surgical activity recognition utilizing Artificial intelligence (AI) have focused mainly on recognizing one type of activity from small and mono-centric surgical video datasets. It remains speculative whether those models would generalize to other centers. In this work, we introduce a large multi-centric multi-activity dataset consisting of 140 videos (MultiBypass140) of laparoscopic Roux-en-Y gastric bypass (LRYGB) surgeries performed at two medical centers: the University Hospital of Strasbourg (StrasBypass70) and Inselspital, Bern University Hospital (BernBypass70). The dataset has been fully annotated with phases and steps. Furthermore, we assess the generalizability and benchmark different deep learning models in 7 experimental studies: 1) Training and evaluation on BernBypass70; 2) Training and evaluation on StrasBypass70; 3) Training and evaluation on the MultiBypass140; 4) Training on BernBypass70, evaluation on StrasBypass70; 5) Training on StrasBypass70, evaluation on BernBypass70; Training on MultiBypass140, evaluation 6) on BernBypass70 and 7) on StrasBypass70. The model's performance is markedly influenced by the training data. The worst results were obtained in experiments 4) and 5) confirming the limited generalization capabilities of models trained on mono-centric data. The use of multi-centric training data, experiments 6) and 7), improves the generalization capabilities of the models, bringing them beyond the level of independent mono-centric training and validation (experiments 1) and 2)). MultiBypass140 shows considerable variation in surgical technique and workflow of LRYGB procedures between centers. Therefore, generalization experiments demonstrate a remarkable difference in model performance. These results highlight the importance of multi-centric datasets for AI model generalization to account for variance in surgical technique and workflows.
△ Less
Submitted 18 December, 2023;
originally announced December 2023.
-
TRUSTED: The Paired 3D Transabdominal Ultrasound and CT Human Data for Kidney Segmentation and Registration Research
Authors:
William Ndzimbong,
Cyril Fourniol,
Loic Themyr,
Nicolas Thome,
Yvonne Keeza,
Beniot Sauer,
Pierre-Thierry Piechaud,
Arnaud Mejean,
Jacques Marescaux,
Daniel George,
Didier Mutter,
Alexandre Hostettler,
Toby Collins
Abstract:
Inter-modal image registration (IMIR) and image segmentation with abdominal Ultrasound (US) data has many important clinical applications, including image-guided surgery, automatic organ measurement and robotic navigation. However, research is severely limited by the lack of public datasets. We propose TRUSTED (the Tridimensional Renal Ultra Sound TomodEnsitometrie Dataset), comprising paired tran…
▽ More
Inter-modal image registration (IMIR) and image segmentation with abdominal Ultrasound (US) data has many important clinical applications, including image-guided surgery, automatic organ measurement and robotic navigation. However, research is severely limited by the lack of public datasets. We propose TRUSTED (the Tridimensional Renal Ultra Sound TomodEnsitometrie Dataset), comprising paired transabdominal 3DUS and CT kidney images from 48 human patients (96 kidneys), including segmentation, and anatomical landmark annotations by two experienced radiographers. Inter-rater segmentation agreement was over 94 (Dice score), and gold-standard segmentations were generated using the STAPLE algorithm. Seven anatomical landmarks were annotated, important for IMIR systems development and evaluation. To validate the dataset's utility, 5 competitive Deep Learning models for automatic kidney segmentation were benchmarked, yielding average DICE scores from 83.2% to 89.1% for CT, and 61.9% to 79.4% for US images. Three IMIR methods were benchmarked, and Coherent Point Drift performed best with an average Target Registration Error of 4.53mm. The TRUSTED dataset may be used freely researchers to develop and validate new segmentation and IMIR methods.
△ Less
Submitted 19 October, 2023;
originally announced October 2023.
-
Weakly Supervised Temporal Convolutional Networks for Fine-grained Surgical Activity Recognition
Authors:
Sanat Ramesh,
Diego Dall'Alba,
Cristians Gonzalez,
Tong Yu,
Pietro Mascagni,
Didier Mutter,
Jacques Marescaux,
Paolo Fiorini,
Nicolas Padoy
Abstract:
Automatic recognition of fine-grained surgical activities, called steps, is a challenging but crucial task for intelligent intra-operative computer assistance. The development of current vision-based activity recognition methods relies heavily on a high volume of manually annotated data. This data is difficult and time-consuming to generate and requires domain-specific knowledge. In this work, we…
▽ More
Automatic recognition of fine-grained surgical activities, called steps, is a challenging but crucial task for intelligent intra-operative computer assistance. The development of current vision-based activity recognition methods relies heavily on a high volume of manually annotated data. This data is difficult and time-consuming to generate and requires domain-specific knowledge. In this work, we propose to use coarser and easier-to-annotate activity labels, namely phases, as weak supervision to learn step recognition with fewer step annotated videos. We introduce a step-phase dependency loss to exploit the weak supervision signal. We then employ a Single-Stage Temporal Convolutional Network (SS-TCN) with a ResNet-50 backbone, trained in an end-to-end fashion from weakly annotated videos, for temporal activity segmentation and recognition. We extensively evaluate and show the effectiveness of the proposed method on a large video dataset consisting of 40 laparoscopic gastric bypass procedures and the public benchmark CATARACTS containing 50 cataract surgeries.
△ Less
Submitted 11 April, 2023; v1 submitted 21 February, 2023;
originally announced February 2023.
-
Live Laparoscopic Video Retrieval with Compressed Uncertainty
Authors:
Tong Yu,
Pietro Mascagni,
Juan Verde,
Jacques Marescaux,
Didier Mutter,
Nicolas Padoy
Abstract:
Searching through large volumes of medical data to retrieve relevant information is a challenging yet crucial task for clinical care. However the primitive and most common approach to retrieval, involving text in the form of keywords, is severely limited when dealing with complex media formats. Content-based retrieval offers a way to overcome this limitation, by using rich media as the query itsel…
▽ More
Searching through large volumes of medical data to retrieve relevant information is a challenging yet crucial task for clinical care. However the primitive and most common approach to retrieval, involving text in the form of keywords, is severely limited when dealing with complex media formats. Content-based retrieval offers a way to overcome this limitation, by using rich media as the query itself. Surgical video-to-video retrieval in particular is a new and largely unexplored research problem with high clinical value, especially in the real-time case: using real-time video hashing, search can be achieved directly inside of the operating room. Indeed, the process of hashing converts large data entries into compact binary arrays or hashes, enabling large-scale search operations at a very fast rate. However, due to fluctuations over the course of a video, not all bits in a given hash are equally reliable. In this work, we propose a method capable of mitigating this uncertainty while maintaining a light computational footprint. We present superior retrieval results (3-4 % top 10 mean average precision) on a multi-task evaluation protocol for surgery, using cholecystectomy phases, bypass phases, and coming from an entirely new dataset introduced here, critical events across six different surgery types. Success on this multi-task benchmark shows the generalizability of our approach for surgical video retrieval.
△ Less
Submitted 12 June, 2023; v1 submitted 8 March, 2022;
originally announced March 2022.
-
Temporally Constrained Neural Networks (TCNN): A framework for semi-supervised video semantic segmentation
Authors:
Deepak Alapatt,
Pietro Mascagni,
Armine Vardazaryan,
Alain Garcia,
Nariaki Okamoto,
Didier Mutter,
Jacques Marescaux,
Guido Costamagna,
Bernard Dallemagne,
Nicolas Padoy
Abstract:
A major obstacle to building models for effective semantic segmentation, and particularly video semantic segmentation, is a lack of large and well annotated datasets. This bottleneck is particularly prohibitive in highly specialized and regulated fields such as medicine and surgery, where video semantic segmentation could have important applications but data and expert annotations are scarce. In t…
▽ More
A major obstacle to building models for effective semantic segmentation, and particularly video semantic segmentation, is a lack of large and well annotated datasets. This bottleneck is particularly prohibitive in highly specialized and regulated fields such as medicine and surgery, where video semantic segmentation could have important applications but data and expert annotations are scarce. In these settings, temporal clues and anatomical constraints could be leveraged during training to improve performance. Here, we present Temporally Constrained Neural Networks (TCNN), a semi-supervised framework used for video semantic segmentation of surgical videos. In this work, we show that autoencoder networks can be used to efficiently provide both spatial and temporal supervisory signals to train deep learning models. We test our method on a newly introduced video dataset of laparoscopic cholecystectomy procedures, Endoscapes, and an adaptation of a public dataset of cataract surgeries, CaDIS. We demonstrate that lower-dimensional representations of predicted masks can be leveraged to provide a consistent improvement on both sparsely labeled datasets with no additional computational cost at inference time. Further, the TCNN framework is model-agnostic and can be used in conjunction with other model design choices with minimal additional complexity.
△ Less
Submitted 27 December, 2021;
originally announced December 2021.
-
Rendezvous: Attention Mechanisms for the Recognition of Surgical Action Triplets in Endoscopic Videos
Authors:
Chinedu Innocent Nwoye,
Tong Yu,
Cristians Gonzalez,
Barbara Seeliger,
Pietro Mascagni,
Didier Mutter,
Jacques Marescaux,
Nicolas Padoy
Abstract:
Out of all existing frameworks for surgical workflow analysis in endoscopic videos, action triplet recognition stands out as the only one aiming to provide truly fine-grained and comprehensive information on surgical activities. This information, presented as <instrument, verb, target> combinations, is highly challenging to be accurately identified. Triplet components can be difficult to recognize…
▽ More
Out of all existing frameworks for surgical workflow analysis in endoscopic videos, action triplet recognition stands out as the only one aiming to provide truly fine-grained and comprehensive information on surgical activities. This information, presented as <instrument, verb, target> combinations, is highly challenging to be accurately identified. Triplet components can be difficult to recognize individually; in this task, it requires not only performing recognition simultaneously for all three triplet components, but also correctly establishing the data association between them. To achieve this task, we introduce our new model, the Rendezvous (RDV), which recognizes triplets directly from surgical videos by leveraging attention at two different levels. We first introduce a new form of spatial attention to capture individual action triplet components in a scene; called Class Activation Guided Attention Mechanism (CAGAM). This technique focuses on the recognition of verbs and targets using activations resulting from instruments. To solve the association problem, our RDV model adds a new form of semantic attention inspired by Transformer networks; called Multi-Head of Mixed Attention (MHMA). This technique uses several cross and self attentions to effectively capture relationships between instruments, verbs, and targets. We also introduce CholecT50 - a dataset of 50 endoscopic videos in which every frame has been annotated with labels from 100 triplet classes. Our proposed RDV model significantly improves the triplet prediction mean AP by over 9% compared to the state-of-the-art methods on this dataset.
△ Less
Submitted 3 March, 2022; v1 submitted 7 September, 2021;
originally announced September 2021.
-
Intraoperative time out to promote the implementation of the critical view of safety in laparoscopic cholecystectomy: a video-based assessment of 343 procedures
Authors:
Pietro Mascagni,
Maria Rita Rodriguez-Luna,
Takeshi Urade,
Emanuele Felli,
Patrick Pessaux,
Didier Mutter,
Jacques Marescaux,
Guido Costamagna,
Bernard Dallemagne,
Nicolas Padoy
Abstract:
Background: The critical view of safety (CVS) is poorly adopted in surgical practices although it is ubiquitously recommended to prevent major bile duct injuries during laparoscopic cholecystectomy (LC). This study aims to determine whether performing a short intraoperative time out can improve CVS implementation. Methods: Surgeons performing LCs at an academic centre were invited to perform a 5-s…
▽ More
Background: The critical view of safety (CVS) is poorly adopted in surgical practices although it is ubiquitously recommended to prevent major bile duct injuries during laparoscopic cholecystectomy (LC). This study aims to determine whether performing a short intraoperative time out can improve CVS implementation. Methods: Surgeons performing LCs at an academic centre were invited to perform a 5-second long time out to verify CVS before dividing the cystic duct (5-second rule). The primary endpoint was to compare the rate of CVS achievement between LCs performed in the year before and the year after the 5-second rule. The CVS achievement rate was computed using the mediated video-based assessment of two independent reviewers. Clinical outcomes, LC workflows, and postoperative reports were also compared. Results: Three hundred and forty-three (171 before and 172 after the 5-second rule) of the 381 LCs performed over the 2-year study were analysed. After the implementation of the 5-second rule, the rate of CVS achievement increased significantly (15.9 vs 44.1 %; P<0.001) as well as the rate of bailout procedures (8.2 vs 15.7 %; P=0.045), the median time to clip the cystic duct or artery (17:26 [interquartile range: 16:46] vs 23:12 [17:16] minutes; P=0.007), and the rate of postoperative CVS reporting (1.3 vs 28.8 %; P<0.001). Morbidity was comparable (1.75 vs 2.33 % before and after the 5-second rule respectively; P=0.685). Conclusion: Performing a short intraoperative time out improves CVS implementation during LC. Systematic intraoperative cognitive aids should be studied to sustain the uptake of guidelines.
△ Less
Submitted 6 April, 2021;
originally announced April 2021.
-
Multi-Task Temporal Convolutional Networks for Joint Recognition of Surgical Phases and Steps in Gastric Bypass Procedures
Authors:
Sanat Ramesh,
Diego Dall'Alba,
Cristians Gonzalez,
Tong Yu,
Pietro Mascagni,
Didier Mutter,
Jacques Marescaux,
Paolo Fiorini,
Nicolas Padoy
Abstract:
Purpose: Automatic segmentation and classification of surgical activity is crucial for providing advanced support in computer-assisted interventions and autonomous functionalities in robot-assisted surgeries. Prior works have focused on recognizing either coarse activities, such as phases, or fine-grained activities, such as gestures. This work aims at jointly recognizing two complementary levels…
▽ More
Purpose: Automatic segmentation and classification of surgical activity is crucial for providing advanced support in computer-assisted interventions and autonomous functionalities in robot-assisted surgeries. Prior works have focused on recognizing either coarse activities, such as phases, or fine-grained activities, such as gestures. This work aims at jointly recognizing two complementary levels of granularity directly from videos, namely phases and steps. Method: We introduce two correlated surgical activities, phases and steps, for the laparoscopic gastric bypass procedure. We propose a Multi-task Multi-Stage Temporal Convolutional Network (MTMS-TCN) along with a multi-task Convolutional Neural Network (CNN) training setup to jointly predict the phases and steps and benefit from their complementarity to better evaluate the execution of the procedure. We evaluate the proposed method on a large video dataset consisting of 40 surgical procedures (Bypass40). Results: We present experimental results from several baseline models for both phase and step recognition on the Bypass40 dataset. The proposed MTMS-TCN method outperforms in both phase and step recognition by 1-2% in accuracy, precision and recall, compared to single-task methods. Furthermore, for step recognition, MTMS-TCN achieves a superior performance of 3-6% compared to LSTM based models in accuracy, precision, and recall. Conclusion: In this work, we present a multi-task multi-stage temporal convolutional network for surgical activity recognition, which shows improved results compared to single-task models on the Bypass40 gastric bypass dataset with multi-level annotations. The proposed method shows that the joint modeling of phases and steps is beneficial to improve the overall recognition of each type of activity.
△ Less
Submitted 24 February, 2021;
originally announced February 2021.
-
Recognition of Instrument-Tissue Interactions in Endoscopic Videos via Action Triplets
Authors:
Chinedu Innocent Nwoye,
Cristians Gonzalez,
Tong Yu,
Pietro Mascagni,
Didier Mutter,
Jacques Marescaux,
Nicolas Padoy
Abstract:
Recognition of surgical activity is an essential component to develop context-aware decision support for the operating room. In this work, we tackle the recognition of fine-grained activities, modeled as action triplets <instrument, verb, target> representing the tool activity. To this end, we introduce a new laparoscopic dataset, CholecT40, consisting of 40 videos from the public dataset Cholec80…
▽ More
Recognition of surgical activity is an essential component to develop context-aware decision support for the operating room. In this work, we tackle the recognition of fine-grained activities, modeled as action triplets <instrument, verb, target> representing the tool activity. To this end, we introduce a new laparoscopic dataset, CholecT40, consisting of 40 videos from the public dataset Cholec80 in which all frames have been annotated using 128 triplet classes. Furthermore, we present an approach to recognize these triplets directly from the video data. It relies on a module called Class Activation Guide (CAG), which uses the instrument activation maps to guide the verb and target recognition. To model the recognition of multiple triplets in the same frame, we also propose a trainable 3D Interaction Space, which captures the associations between the triplet components. Finally, we demonstrate the significance of these contributions via several ablation studies and comparisons to baselines on CholecT40.
△ Less
Submitted 10 July, 2020;
originally announced July 2020.
-
Weakly Supervised Convolutional LSTM Approach for Tool Tracking in Laparoscopic Videos
Authors:
Chinedu Innocent Nwoye,
Didier Mutter,
Jacques Marescaux,
Nicolas Padoy
Abstract:
Purpose: Real-time surgical tool tracking is a core component of the future intelligent operating room (OR), because it is highly instrumental to analyze and understand the surgical activities. Current methods for surgical tool tracking in videos need to be trained on data in which the spatial positions of the tools are manually annotated. Generating such training data is difficult and time-consum…
▽ More
Purpose: Real-time surgical tool tracking is a core component of the future intelligent operating room (OR), because it is highly instrumental to analyze and understand the surgical activities. Current methods for surgical tool tracking in videos need to be trained on data in which the spatial positions of the tools are manually annotated. Generating such training data is difficult and time-consuming. Instead, we propose to use solely binary presence annotations to train a tool tracker for laparoscopic videos.
Methods: The proposed approach is composed of a CNN + Convolutional LSTM (ConvLSTM) neural network trained end-to-end, but weakly supervised on tool binary presence labels only. We use the ConvLSTM to model the temporal dependencies in the motion of the surgical tools and leverage its spatio-temporal ability to smooth the class peak activations in the localization heat maps (Lh-maps).
Results: We build a baseline tracker on top of the CNN model and demonstrate that our approach based on the ConvLSTM outperforms the baseline in tool presence detection, spatial localization, and motion tracking by over 5.0%, 13.9%, and 12.6%, respectively.
Conclusions: In this paper, we demonstrate that binary presence labels are sufficient for training a deep learning tracking model using our proposed method. We also show that the ConvLSTM can leverage the spatio-temporal coherence of consecutive image frames across a surgical video to improve tool presence detection, spatial localization, and motion tracking.
keywords: Surgical workflow analysis, tool tracking, weak supervision, spatio-temporal coherence, ConvLSTM, endoscopic videos
△ Less
Submitted 15 February, 2019; v1 submitted 4 December, 2018;
originally announced December 2018.
-
Learning from a tiny dataset of manual annotations: a teacher/student approach for surgical phase recognition
Authors:
Tong Yu,
Didier Mutter,
Jacques Marescaux,
Nicolas Padoy
Abstract:
Vision algorithms capable of interpreting scenes from a real-time video stream are necessary for computer-assisted surgery systems to achieve context-aware behavior. In laparoscopic procedures one particular algorithm needed for such systems is the identification of surgical phases, for which the current state of the art is a model based on a CNN-LSTM. A number of previous works using models of th…
▽ More
Vision algorithms capable of interpreting scenes from a real-time video stream are necessary for computer-assisted surgery systems to achieve context-aware behavior. In laparoscopic procedures one particular algorithm needed for such systems is the identification of surgical phases, for which the current state of the art is a model based on a CNN-LSTM. A number of previous works using models of this kind have trained them in a fully supervised manner, requiring a fully annotated dataset. Instead, our work confronts the problem of learning surgical phase recognition in scenarios presenting scarce amounts of annotated data (under 25% of all available video recordings). We propose a teacher/student type of approach, where a strong predictor called the teacher, trained beforehand on a small dataset of ground truth-annotated videos, generates synthetic annotations for a larger dataset, which another model - the student - learns from. In our case, the teacher features a novel CNN-biLSTM-CRF architecture, designed for offline inference only. The student, on the other hand, is a CNN-LSTM capable of making real-time predictions. Results for various amounts of manually annotated videos demonstrate the superiority of the new CNN-biLSTM-CRF predictor as well as improved performance from the CNN-LSTM trained using synthetic labels generated for unannotated videos. For both offline and online surgical phase recognition with very few annotated recordings available, this new teacher/student strategy provides a valuable performance improvement by efficiently leveraging the unannotated data.
△ Less
Submitted 30 September, 2020; v1 submitted 30 November, 2018;
originally announced December 2018.
-
Future-State Predicting LSTM for Early Surgery Type Recognition
Authors:
Siddharth Kannan,
Gaurav Yengera,
Didier Mutter,
Jacques Marescaux,
Nicolas Padoy
Abstract:
This work presents a novel approach for the early recognition of the type of a laparoscopic surgery from its video. Early recognition algorithms can be beneficial to the development of 'smart' OR systems that can provide automatic context-aware assistance, and also enable quick database indexing. The task is however ridden with challenges specific to videos belonging to the domain of laparoscopy,…
▽ More
This work presents a novel approach for the early recognition of the type of a laparoscopic surgery from its video. Early recognition algorithms can be beneficial to the development of 'smart' OR systems that can provide automatic context-aware assistance, and also enable quick database indexing. The task is however ridden with challenges specific to videos belonging to the domain of laparoscopy, such as high visual similarity across surgeries and large variations in video durations. To capture the spatio-temporal dependencies in these videos, we choose as our model a combination of a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) network. We then propose two complementary approaches for improving early recognition performance. The first approach is a CNN fine-tuning method that encourages surgeries to be distinguished based on the initial frames of laparoscopic videos. The second approach, referred to as 'Future-State Predicting LSTM', trains an LSTM to predict information related to future frames, which helps in distinguishing between the different types of surgeries. We evaluate our approaches on a large dataset of 425 laparoscopic videos containing 9 types of surgeries (Laparo425), and achieve on average an accuracy of 75% having observed only the first 10 minutes of a surgery. These results are quite promising from a practical standpoint and also encouraging for other types of image-guided surgeries.
△ Less
Submitted 4 September, 2019; v1 submitted 28 November, 2018;
originally announced November 2018.
-
Weakly-Supervised Learning for Tool Localization in Laparoscopic Videos
Authors:
Armine Vardazaryan,
Didier Mutter,
Jacques Marescaux,
Nicolas Padoy
Abstract:
Surgical tool localization is an essential task for the automatic analysis of endoscopic videos. In the literature, existing methods for tool localization, tracking and segmentation require training data that is fully annotated, thereby limiting the size of the datasets that can be used and the generalization of the approaches. In this work, we propose to circumvent the lack of annotated data with…
▽ More
Surgical tool localization is an essential task for the automatic analysis of endoscopic videos. In the literature, existing methods for tool localization, tracking and segmentation require training data that is fully annotated, thereby limiting the size of the datasets that can be used and the generalization of the approaches. In this work, we propose to circumvent the lack of annotated data with weak supervision. We propose a deep architecture, trained solely on image level annotations, that can be used for both tool presence detection and localization in surgical videos. Our architecture relies on a fully convolutional neural network, trained end-to-end, enabling us to localize surgical tools without explicit spatial annotations. We demonstrate the benefits of our approach on a large public dataset, Cholec80, which is fully annotated with binary tool presence information and of which 5 videos have been fully annotated with bounding boxes and tool centers for the evaluation.
△ Less
Submitted 18 July, 2018; v1 submitted 14 June, 2018;
originally announced June 2018.
-
Less is More: Surgical Phase Recognition with Less Annotations through Self-Supervised Pre-training of CNN-LSTM Networks
Authors:
Gaurav Yengera,
Didier Mutter,
Jacques Marescaux,
Nicolas Padoy
Abstract:
Real-time algorithms for automatically recognizing surgical phases are needed to develop systems that can provide assistance to surgeons, enable better management of operating room (OR) resources and consequently improve safety within the OR. State-of-the-art surgical phase recognition algorithms using laparoscopic videos are based on fully supervised training. This limits their potential for wide…
▽ More
Real-time algorithms for automatically recognizing surgical phases are needed to develop systems that can provide assistance to surgeons, enable better management of operating room (OR) resources and consequently improve safety within the OR. State-of-the-art surgical phase recognition algorithms using laparoscopic videos are based on fully supervised training. This limits their potential for widespread application, since creation of manual annotations is an expensive process considering the numerous types of existing surgeries and the vast amount of laparoscopic videos available. In this work, we propose a new self-supervised pre-training approach based on the prediction of remaining surgery duration (RSD) from laparoscopic videos. The RSD prediction task is used to pre-train a convolutional neural network (CNN) and long short-term memory (LSTM) network in an end-to-end manner. Our proposed approach utilizes all available data and reduces the reliance on annotated data, thereby facilitating the scaling up of surgical phase recognition algorithms to different kinds of surgeries. Additionally, we present EndoN2N, an end-to-end trained CNN-LSTM model for surgical phase recognition and evaluate the performance of our approach on a dataset of 120 Cholecystectomy laparoscopic videos (Cholec120). This work also presents the first systematic study of self-supervised pre-training approaches to understand the amount of annotations required for surgical phase recognition. Interestingly, the proposed RSD pre-training approach leads to performance improvement even when all the training data is manually annotated and outperforms the single pre-training approach for surgical phase recognition presently published in the literature. It is also observed that end-to-end training of CNN-LSTM networks boosts surgical phase recognition performance.
△ Less
Submitted 22 May, 2018;
originally announced May 2018.
-
RSDNet: Learning to Predict Remaining Surgery Duration from Laparoscopic Videos Without Manual Annotations
Authors:
Andru Putra Twinanda,
Gaurav Yengera,
Didier Mutter,
Jacques Marescaux,
Nicolas Padoy
Abstract:
Accurate surgery duration estimation is necessary for optimal OR planning, which plays an important role in patient comfort and safety as well as resource optimization. It is, however, challenging to preoperatively predict surgery duration since it varies significantly depending on the patient condition, surgeon skills, and intraoperative situation. In this paper, we propose a deep learning pipeli…
▽ More
Accurate surgery duration estimation is necessary for optimal OR planning, which plays an important role in patient comfort and safety as well as resource optimization. It is, however, challenging to preoperatively predict surgery duration since it varies significantly depending on the patient condition, surgeon skills, and intraoperative situation. In this paper, we propose a deep learning pipeline, referred to as RSDNet, which automatically estimates the remaining surgery duration (RSD) intraoperatively by using only visual information from laparoscopic videos. Previous state-of-the-art approaches for RSD prediction are dependent on manual annotation, whose generation requires expensive expert knowledge and is time-consuming, especially considering the numerous types of surgeries performed in a hospital and the large number of laparoscopic videos available. A crucial feature of RSDNet is that it does not depend on any manual annotation during training, making it easily scalable to many kinds of surgeries. The generalizability of our approach is demonstrated by testing the pipeline on two large datasets containing different types of surgeries: 120 cholecystectomy and 170 gastric bypass videos. The experimental results also show that the proposed network significantly outperforms a traditional method of estimating RSD without utilizing manual annotation. Further, this work provides a deeper insight into the deep learning network through visualization and interpretation of the features that are automatically learned.
△ Less
Submitted 3 December, 2018; v1 submitted 9 February, 2018;
originally announced February 2018.
-
Single- and Multi-Task Architectures for Tool Presence Detection Challenge at M2CAI 2016
Authors:
Andru P. Twinanda,
Didier Mutter,
Jacques Marescaux,
Michel de Mathelin,
Nicolas Padoy
Abstract:
The tool presence detection challenge at M2CAI 2016 consists of identifying the presence/absence of seven surgical tools in the images of cholecystectomy videos. Here, we propose to use deep architectures that are based on our previous work where we presented several architectures to perform multiple recognition tasks on laparoscopic videos. In this technical report, we present the tool presence d…
▽ More
The tool presence detection challenge at M2CAI 2016 consists of identifying the presence/absence of seven surgical tools in the images of cholecystectomy videos. Here, we propose to use deep architectures that are based on our previous work where we presented several architectures to perform multiple recognition tasks on laparoscopic videos. In this technical report, we present the tool presence detection results using two architectures: (1) a single-task architecture designed to perform solely the tool presence detection task and (2) a multi-task architecture designed to perform jointly phase recognition and tool presence detection. The results show that the multi-task network only slightly improves the tool presence detection results. In constrast, a significant improvement is obtained when there are more data available to train the networks. This significant improvement can be regarded as a call for action for other institutions to start working toward publishing more datasets into the community, so that better models could be generated to perform the task.
△ Less
Submitted 27 October, 2016;
originally announced October 2016.
-
Single- and Multi-Task Architectures for Surgical Workflow Challenge at M2CAI 2016
Authors:
Andru P. Twinanda,
Didier Mutter,
Jacques Marescaux,
Michel de Mathelin,
Nicolas Padoy
Abstract:
The surgical workflow challenge at M2CAI 2016 consists of identifying 8 surgical phases in cholecystectomy procedures. Here, we propose to use deep architectures that are based on our previous work where we presented several architectures to perform multiple recognition tasks on laparoscopic videos. In this technical report, we present the phase recognition results using two architectures: (1) a s…
▽ More
The surgical workflow challenge at M2CAI 2016 consists of identifying 8 surgical phases in cholecystectomy procedures. Here, we propose to use deep architectures that are based on our previous work where we presented several architectures to perform multiple recognition tasks on laparoscopic videos. In this technical report, we present the phase recognition results using two architectures: (1) a single-task architecture designed to perform solely the surgical phase recognition task and (2) a multi-task architecture designed to perform jointly phase recognition and tool presence detection. On top of these architectures we propose to use two different approaches to enforce the temporal constraints of the surgical workflow: (1) HMM-based and (2) LSTM-based pipelines. The results show that the LSTM-based approach is able to outperform the HMM-based approach and also to properly enforce the temporal constraints into the recognition process.
△ Less
Submitted 28 October, 2016; v1 submitted 27 October, 2016;
originally announced October 2016.
-
Automatic View-Point Selection for Inter-Operative Endoscopic Surveillance
Authors:
Anant S. Vemuri,
Stephane A. Nicolau,
Jacques Marescaux,
Luc Soler,
Nicholas Ayache
Abstract:
Esophageal adenocarcinoma arises from Barrett's esophagus, which is the most serious complication of gastroesophageal reflux disease. Strategies for screening involve periodic surveillance and tissue biopsies. A major challenge in such regular examinations is to record and track the disease evolution and re-localization of biopsied sites to provide targeted treatments. In this paper, we extend our…
▽ More
Esophageal adenocarcinoma arises from Barrett's esophagus, which is the most serious complication of gastroesophageal reflux disease. Strategies for screening involve periodic surveillance and tissue biopsies. A major challenge in such regular examinations is to record and track the disease evolution and re-localization of biopsied sites to provide targeted treatments. In this paper, we extend our original inter-operative relocalization framework to provide a constrained image based search for obtaining the best view-point match to the live view. Within this context we investigate the effect of: the choice of feature descriptors and color-space; filtering of uninformative frames and endoscopic modality, for view-point localization. Our experiments indicate an improvement in the best view-point retrieval rate to [92%,87%] from [73%,76%] (in our previous approach) for NBI and WL.
△ Less
Submitted 13 October, 2016;
originally announced October 2016.
-
ORBSLAM-based Endoscope Tracking and 3D Reconstruction
Authors:
Nader Mahmoud,
IƱigo Cirauqui,
Alexandre Hostettler,
Christophe Doignon,
Luc Soler,
Jacques Marescaux,
J. M. M. Montiel
Abstract:
We aim to track the endoscope location inside the surgical scene and provide 3D reconstruction, in real-time, from the sole input of the image sequence captured by the monocular endoscope. This information offers new possibilities for develo** surgical navigation and augmented reality applications. The main benefit of this approach is the lack of extra tracking elements which can disturb the sur…
▽ More
We aim to track the endoscope location inside the surgical scene and provide 3D reconstruction, in real-time, from the sole input of the image sequence captured by the monocular endoscope. This information offers new possibilities for develo** surgical navigation and augmented reality applications. The main benefit of this approach is the lack of extra tracking elements which can disturb the surgeon performance in the clinical routine. It is our first contribution to exploit ORBSLAM, one of the best performing monocular SLAM algorithms, to estimate both of the endoscope location, and 3D structure of the surgical scene. However, the reconstructed 3D map poorly describe textureless soft organ surfaces such as liver. It is our second contribution to extend ORBSLAM to be able to reconstruct a semi-dense map of soft organs. Experimental results on in-vivo pigs, shows a robust endoscope tracking even with organs deformations and partial instrument occlusions. It also shows the reconstruction density, and accuracy against ground truth surface obtained from CT.
△ Less
Submitted 29 August, 2016;
originally announced August 2016.
-
EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos
Authors:
Andru P. Twinanda,
Sherif Shehata,
Didier Mutter,
Jacques Marescaux,
Michel de Mathelin,
Nicolas Padoy
Abstract:
Surgical workflow recognition has numerous potential medical applications, such as the automatic indexing of surgical video databases and the optimization of real-time operating room scheduling, among others. As a result, phase recognition has been studied in the context of several kinds of surgeries, such as cataract, neurological, and laparoscopic surgeries. In the literature, two types of featu…
▽ More
Surgical workflow recognition has numerous potential medical applications, such as the automatic indexing of surgical video databases and the optimization of real-time operating room scheduling, among others. As a result, phase recognition has been studied in the context of several kinds of surgeries, such as cataract, neurological, and laparoscopic surgeries. In the literature, two types of features are typically used to perform this task: visual features and tool usage signals. However, the visual features used are mostly handcrafted. Furthermore, the tool usage signals are usually collected via a manual annotation process or by using additional equipment. In this paper, we propose a novel method for phase recognition that uses a convolutional neural network (CNN) to automatically learn features from cholecystectomy videos and that relies uniquely on visual information. In previous studies, it has been shown that the tool signals can provide valuable information in performing the phase recognition task. Thus, we present a novel CNN architecture, called EndoNet, that is designed to carry out the phase recognition and tool presence detection tasks in a multi-task manner. To the best of our knowledge, this is the first work proposing to use a CNN for multiple recognition tasks on laparoscopic videos. Extensive experimental comparisons to other methods show that EndoNet yields state-of-the-art results for both tasks.
△ Less
Submitted 23 May, 2016; v1 submitted 9 February, 2016;
originally announced February 2016.