-
Why is the winner the best?
Authors:
Matthias Eisenmann,
Annika Reinke,
Vivienn Weru,
Minu Dietlinde Tizabi,
Fabian Isensee,
Tim J. Adler,
Sharib Ali,
Vincent Andrearczyk,
Marc Aubreville,
Ujjwal Baid,
Spyridon Bakas,
Niranjan Balu,
Sophia Bano,
Jorge Bernal,
Sebastian Bodenstedt,
Alessandro Casella,
Veronika Cheplygina,
Marie Daum,
Marleen de Bruijne,
Adrien Depeursinge,
Reuben Dorent,
Jan Egger,
David G. Ellis,
Sandy Engelhardt,
Melanie Ganz,
et al. (100 additional authors not shown)
Abstract:
International benchmarking competitions have become fundamental for the comparative performance assessment of image analysis methods. However, little attention has been given to investigating what can be learnt from these competitions. Do they really generate scientific progress? What are common and successful participation strategies? What makes a solution superior to a competing method? To address this gap in the literature, we performed a multi-center study with all 80 competitions that were conducted in the scope of IEEE ISBI 2021 and MICCAI 2021. Statistical analyses performed based on comprehensive descriptions of the submitted algorithms linked to their rank as well as the underlying participation strategies revealed common characteristics of winning solutions. These typically include the use of multi-task learning (63%) and/or multi-stage pipelines (61%), and a focus on augmentation (100%), image preprocessing (97%), data curation (79%), and postprocessing (66%). The "typical" lead of a winning team is a computer scientist with a doctoral degree, five years of experience in biomedical image analysis, and four years of experience in deep learning. Two core general development strategies stood out for highly-ranked teams: the reflection of the metrics in the method design and the focus on analyzing and handling failure cases. According to the organizers, 43% of the winning algorithms exceeded the state of the art but only 11% completely solved the respective domain problem. The insights of our study could help researchers (1) improve algorithm development strategies when approaching new problems, and (2) focus on open research questions revealed by this work.
Submitted 30 March, 2023;
originally announced March 2023.
-
Simulating and reporting frequentist operating characteristics of clinical trials that borrow external information
Authors:
Annette Kopp-Schneider,
Manuel Wiesenfarth,
Leonhard Held,
Silvia Calderazzo
Abstract:
Borrowing of information from historical or external data to inform inference in a current trial is an expanding field in the era of precision medicine, where trials are often performed in small patient cohorts for practical or ethical reasons. Many approaches for borrowing from external data have been proposed. Even though these methods are mainly based on Bayesian approaches by incorporating external information into the prior for the current analysis, frequentist operating characteristics of the analysis strategy are of interest. In particular, type I error and power at a prespecified point alternative are the focus. It is well known that borrowing from external information may alter the type I error rate. We propose a procedure to investigate and report the frequentist operating characteristics in this context. The approach evaluates the type I error rate of the test with borrowing from external data and calibrates the test without borrowing to this type I error rate. On this basis, a fair comparison of power between the test with and without borrowing is achieved.
Submitted 24 February, 2023;
originally announced February 2023.
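The calibration idea in this abstract can be illustrated with a minimal sketch. It is not the authors' procedure or code: it assumes a one-arm trial with a normal outcome of known standard deviation, a one-sided test of theta <= 0, and fixed (non-dynamic) borrowing through a conjugate normal prior. All function names and parameter values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def borrowing_threshold(n, sigma, prior_mean, prior_sd, alpha):
    """Smallest sample mean at which P(theta > 0 | data) >= 1 - alpha.

    Assumes a conjugate N(prior_mean, prior_sd^2) prior built from the
    external data (an illustrative choice, not taken from the paper).
    """
    precision = 1 / prior_sd**2 + n / sigma**2
    z = STD_NORMAL.inv_cdf(1 - alpha)
    return (z * sqrt(precision) - prior_mean / prior_sd**2) * sigma**2 / n


def rejection_prob(threshold, theta, n, sigma):
    """Probability that the sample mean exceeds `threshold` when the true mean is theta."""
    standard_error = sigma / sqrt(n)
    return 1 - STD_NORMAL.cdf((threshold - theta) / standard_error)


def compare(n=20, sigma=1.0, prior_mean=0.5, prior_sd=0.5,
            alpha=0.025, theta_alt=0.5):
    # Step 1: type I error rate of the test with borrowing, at theta = 0.
    c_borrow = borrowing_threshold(n, sigma, prior_mean, prior_sd, alpha)
    t1e_borrow = rejection_prob(c_borrow, 0.0, n, sigma)

    # Step 2: calibrate the test without borrowing to that same rate.
    c_calibrated = STD_NORMAL.inv_cdf(1 - t1e_borrow) * sigma / sqrt(n)

    # Step 3: compare power at the prespecified point alternative.
    return {
        "type_I_error_with_borrowing": t1e_borrow,
        "power_with_borrowing": rejection_prob(c_borrow, theta_alt, n, sigma),
        "power_calibrated_no_borrowing": rejection_prob(c_calibrated, theta_alt, n, sigma),
    }


if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.4f}")
```

In this toy setting the borrowing test is simply a threshold on the sample mean, so once the test without borrowing is calibrated to the same type I error rate the two power values coincide. More complex settings, such as dynamic borrowing, would typically need simulation rather than closed-form calculation.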
-
Understanding metric-related pitfalls in image analysis validation
Authors:
Annika Reinke,
Minu D. Tizabi,
Michael Baumgartner,
Matthias Eisenmann,
Doreen Heckmann-Nötzel,
A. Emre Kavur,
Tim Rädsch,
Carole H. Sudre,
Laura Acion,
Michela Antonelli,
Tal Arbel,
Spyridon Bakas,
Arriel Benis,
Matthew Blaschko,
Florian Buettner,
M. Jorge Cardoso,
Veronika Cheplygina,
Jianxu Chen,
Evangelia Christodoulou,
Beth A. Cimini,
Gary S. Collins,
Keyvan Farahani,
Luciana Ferrer,
Adrian Galdran,
Bram van Ginneken,
et al. (53 additional authors not shown)
Abstract:
Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
Submitted 23 February, 2024; v1 submitted 3 February, 2023;
originally announced February 2023.
-
Biomedical image analysis competitions: The state of current participation practice
Authors:
Matthias Eisenmann,
Annika Reinke,
Vivienn Weru,
Minu Dietlinde Tizabi,
Fabian Isensee,
Tim J. Adler,
Patrick Godau,
Veronika Cheplygina,
Michal Kozubek,
Sharib Ali,
Anubha Gupta,
Jan Kybic,
Alison Noble,
Carlos Ortiz de Solórzano,
Samiksha Pachade,
Caroline Petitjean,
Daniel Sage,
Donglai Wei,
Elizabeth Wilden,
Deepak Alapatt,
Vincent Andrearczyk,
Ujjwal Baid,
Spyridon Bakas,
Niranjan Balu,
Sophia Bano,
et al. (331 additional authors not shown)
Abstract:
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Submitted 12 September, 2023; v1 submitted 16 December, 2022;
originally announced December 2022.
-
Tree-based exploratory identification of predictive biomarkers in observational data
Authors:
Julia Krzykalla,
Axel Benner,
Annette Kopp-Schneider
Abstract:
The idea of "stratified medicine" is an important driver of methodological research on the identification of predictive biomarkers. Most methods proposed so far for this purpose have been developed for use on randomized data only. However, especially for rare cancers, data from clinical registries or observational studies might be the only available data source. For such data, methods for an unbiased estimation of the average treatment effect are well established. Research on confounder adjustment when investigating the heterogeneity of treatment effects and the variables responsible for this is usually restricted to regression modelling. In this paper, we demonstrate how predMOB, a tree-based method that specifically searches for predictive factors, can be combined with common strategies for confounder adjustment (covariate adjustment, matching, Inverse Probability of Treatment Weighting (IPTW)). In an extensive simulation study, we show that covariate adjustment allows the correct identification of predictive factors in the presence of confounding, whereas IPTW fails in situations in which the true predictive factor is not completely independent of the confounding mechanism. A combination of covariate adjustment and IPTW performs as well as covariate adjustment alone, but might be more robust in complex settings. An application to the German Breast Cancer Study Group (GBSG) Trial 2 illustrates these conclusions.
Submitted 16 December, 2022;
originally announced December 2022.
-
Robust incorporation of historical information with known type I error rate inflation
Authors:
Silvia Calderazzo,
Annette Kopp-Schneider
Abstract:
Bayesian clinical trials can benefit from available historical information through the elicitation of informative prior distributions. Concerns are however often raised about the potential for prior-data conflict and the impact of Bayes test decisions on frequentist operating characteristics, with particular attention being assigned to inflation of type I error rates. This motivates the development of principled borrowing mechanisms that strike a balance between frequentist and Bayesian decisions. Ideally, the trust assigned to historical information defines the degree of robustness to prior-data conflict one is willing to sacrifice. However, such a relationship is often not directly available when explicitly considering inflation of type I error rates. We build on available literature relating frequentist and Bayesian test decisions, and investigate a rationale for inflation of type I error rate which explicitly and linearly relates the amount of borrowing and the amount of type I error rate inflation in one-arm studies. A novel dynamic borrowing mechanism tailored to hypothesis testing is additionally proposed. We show that, while dynamic borrowing rules out a simple closed-form type I error rate computation, an explicit upper bound can still be enforced. Connections with the robust mixture prior approach, particularly in relation to the choice of the mixture weight and robust component, are made. Simulations are performed to show the properties of the approach for normal and binomial outcomes.
Submitted 30 November, 2022;
originally announced November 2022.
-
Labeling instructions matter in biomedical image analysis
Authors:
Tim Rädsch,
Annika Reinke,
Vivienn Weru,
Minu D. Tizabi,
Nicholas Schreck,
A. Emre Kavur,
Bünyamin Pekdemir,
Tobias Roß,
Annette Kopp-Schneider,
Lena Maier-Hein
Abstract:
Biomedical image analysis algorithm validation depends on high-quality annotation of reference datasets, for which labeling instructions are key. Despite their importance, their optimization remains largely unexplored. Here, we present the first systematic study of labeling instructions and their impact on annotation quality in the field. Through comprehensive examination of professional practice and international competitions registered at the MICCAI Society, we uncovered a discrepancy between annotators' needs for labeling instructions and their current quality and availability. Based on an analysis of 14,040 images annotated by 156 annotators from four professional companies and 708 Amazon Mechanical Turk (MTurk) crowdworkers using instructions with different information density levels, we further found that including exemplary images significantly boosts annotation performance compared to text-only descriptions, while solely extending text descriptions does not. Finally, professional annotators consistently outperform MTurk crowdworkers. Our study raises awareness of the need for quality standards in biomedical image analysis labeling instructions.
Submitted 20 July, 2022;
originally announced July 2022.
-
Metrics reloaded: Recommendations for image analysis validation
Authors:
Lena Maier-Hein,
Annika Reinke,
Patrick Godau,
Minu D. Tizabi,
Florian Buettner,
Evangelia Christodoulou,
Ben Glocker,
Fabian Isensee,
Jens Kleesiek,
Michal Kozubek,
Mauricio Reyes,
Michael A. Riegler,
Manuel Wiesenfarth,
A. Emre Kavur,
Carole H. Sudre,
Michael Baumgartner,
Matthias Eisenmann,
Doreen Heckmann-Nötzel,
Tim Rädsch,
Laura Acion,
Michela Antonelli,
Tal Arbel,
Spyridon Bakas,
Arriel Benis,
Matthew Blaschko,
et al. (49 additional authors not shown)
Abstract:
Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international expert consortium created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint - a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), data set and algorithm output. Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as a classification task at image, object or pixel level, namely image-level classification, object detection, semantic segmentation, and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool, which also provides a point of access to explore weaknesses, strengths and specific recommendations for the most common validation metrics. The broad applicability of our framework across domains is demonstrated by an instantiation for various biological and medical image analysis use cases.
Submitted 23 February, 2024; v1 submitted 3 June, 2022;
originally announced June 2022.
-
How can we learn (more) from challenges? A statistical approach to driving future algorithm development
Authors:
Tobias Roß,
Pierangela Bruno,
Annika Reinke,
Manuel Wiesenfarth,
Lisa Koeppel,
Peter M. Full,
Bünyamin Pekdemir,
Patrick Godau,
Darya Trofimova,
Fabian Isensee,
Sara Moccia,
Francesco Calimeri,
Beat P. Müller-Stich,
Annette Kopp-Schneider,
Lena Maier-Hein
Abstract:
Challenges have become the state-of-the-art approach to benchmark image analysis algorithms in a comparative manner. While the validation on identical data sets was a great step forward, results analysis is often restricted to pure ranking tables, leaving relevant questions unanswered. Specifically, little effort has been put into the systematic investigation of what characterizes images in which state-of-the-art algorithms fail. To address this gap in the literature, we (1) present a statistical framework for learning from challenges and (2) instantiate it for the specific task of instrument instance segmentation in laparoscopic videos. Our framework relies on the semantic meta data annotation of images, which serves as foundation for a General Linear Mixed Models (GLMM) analysis. Based on 51,542 meta data annotations performed on 2,728 images, we applied our approach to the results of the Robust Medical Instrument Segmentation (ROBUST-MIS) challenge 2019 and revealed underexposure, motion and occlusion of instruments as well as the presence of smoke or other objects in the background as major sources of algorithm failure. Our subsequent method development, tailored to the specific remaining issues, yielded a deep learning model with state-of-the-art overall performance and specific strengths in the processing of images in which previous methods tended to fail. Due to the objectivity and generic applicability of our approach, it could become a valuable tool for validation in the field of medical image analysis and beyond.
Submitted 17 June, 2021;
originally announced June 2021.
-
Machine learning-based analysis of hyperspectral images for automated sepsis diagnosis
Authors:
Maximilian Dietrich,
Silvia Seidlitz,
Nicholas Schreck,
Manuel Wiesenfarth,
Patrick Godau,
Minu Tizabi,
Jan Sellner,
Sebastian Marx,
Samuel Knödler,
Michael M. Allers,
Leonardo Ayala,
Karsten Schmidt,
Thorsten Brenner,
Alexander Studier-Fischer,
Felix Nickel,
Beat P. Müller-Stich,
Annette Kopp-Schneider,
Markus A. Weigand,
Lena Maier-Hein
Abstract:
Sepsis is a leading cause of mortality and critical illness worldwide. While robust biomarkers for early diagnosis are still missing, recent work indicates that hyperspectral imaging (HSI) has the potential to overcome this bottleneck by monitoring microcirculatory alterations. Automated machine learning-based diagnosis of sepsis based on HSI data, however, has not been explored to date. Given this gap in the literature, we leveraged an existing data set to (1) investigate whether HSI-based automated diagnosis of sepsis is possible and (2) put forth a list of possible confounders relevant for HSI-based tissue classification. While we were able to classify sepsis with an accuracy of over 98% using the existing data, our research also revealed several subject-, therapy- and imaging-related confounders that may lead to an overestimation of algorithm performance when not balanced across the patient groups. We conclude that further prospective studies, carefully designed with respect to these confounders, are necessary to confirm the preliminary results obtained in this study.
Submitted 15 June, 2021;
originally announced June 2021.
-
Common Limitations of Image Processing Metrics: A Picture Story
Authors:
Annika Reinke,
Minu D. Tizabi,
Carole H. Sudre,
Matthias Eisenmann,
Tim Rädsch,
Michael Baumgartner,
Laura Acion,
Michela Antonelli,
Tal Arbel,
Spyridon Bakas,
Peter Bankhead,
Arriel Benis,
Matthew Blaschko,
Florian Buettner,
M. Jorge Cardoso,
Jianxu Chen,
Veronika Cheplygina,
Evangelia Christodoulou,
Beth Cimini,
Gary S. Collins,
Sandy Engelhardt,
Keyvan Farahani,
Luciana Ferrer,
Adrian Galdran,
Bram van Ginneken,
et al. (68 additional authors not shown)
Abstract:
While the importance of automatic image analysis is continuously increasing, recent meta-research revealed major flaws with respect to algorithm validation. Performance metrics are particularly key for meaningful, objective, and transparent performance assessment and validation of the used automatic algorithms, but relatively little attention has been given to the practical pitfalls when using specific metrics for a given image analysis task. These are typically related to (1) the disregard of inherent metric properties, such as the behaviour in the presence of class imbalance or small target structures, (2) the disregard of inherent data set properties, such as the non-independence of the test cases, and (3) the disregard of the actual biomedical domain interest that the metrics should reflect. This dynamically updated living document aims to illustrate important limitations of performance metrics commonly applied in the field of image analysis. In this context, it focuses on biomedical image analysis problems that can be phrased as image-level classification, semantic segmentation, instance segmentation, or object detection tasks. The current version is based on a Delphi process on metrics conducted by an international consortium of image analysis experts from more than 60 institutions worldwide.
Submitted 6 December, 2023; v1 submitted 12 April, 2021;
originally announced April 2021.
-
Heidelberg Colorectal Data Set for Surgical Data Science in the Sensor Operating Room
Authors:
Lena Maier-Hein,
Martin Wagner,
Tobias Ross,
Annika Reinke,
Sebastian Bodenstedt,
Peter M. Full,
Hellena Hempe,
Diana Mindroc-Filimon,
Patrick Scholz,
Thuy Nuong Tran,
Pierangela Bruno,
Anna Kisilenko,
Benjamin Müller,
Tornike Davitashvili,
Manuela Capek,
Minu Tizabi,
Matthias Eisenmann,
Tim J. Adler,
Janek Gröhl,
Melanie Schellenberg,
Silvia Seidlitz,
T. Y. Emmy Lai,
Bünyamin Pekdemir,
Veith Roethlingshoefer,
Fabian Both,
et al. (8 additional authors not shown)
Abstract:
Image-based tracking of medical instruments is an integral part of surgical data science applications. Previous research has addressed the tasks of detecting, segmenting and tracking medical instruments based on laparoscopic video data. However, the proposed methods still tend to fail when applied to challenging images and do not generalize well to data they have not been trained on. This paper introduces the Heidelberg Colorectal (HeiCo) data set - the first publicly available data set enabling comprehensive benchmarking of medical instrument detection and segmentation algorithms with a specific emphasis on method robustness and generalization capabilities. Our data set comprises 30 laparoscopic videos and corresponding sensor data from medical devices in the operating room for three different types of laparoscopic surgery. Annotations include surgical phase labels for all video frames as well as information on instrument presence and corresponding instance-wise segmentation masks for surgical instruments (if any) in more than 10,000 individual frames. The data has successfully been used to organize international competitions within the Endoscopic Vision Challenges 2017 and 2019.
Submitted 23 February, 2021; v1 submitted 7 May, 2020;
originally announced May 2020.
-
Robust Medical Instrument Segmentation Challenge 2019
Authors:
Tobias Ross,
Annika Reinke,
Peter M. Full,
Martin Wagner,
Hannes Kenngott,
Martin Apitz,
Hellena Hempe,
Diana Mindroc Filimon,
Patrick Scholz,
Thuy Nuong Tran,
Pierangela Bruno,
Pablo Arbeláez,
Gui-Bin Bian,
Sebastian Bodenstedt,
Jon Lindström Bolmgren,
Laura Bravo-Sánchez,
Hua-Bin Chen,
Cristina González,
Dong Guo,
Pål Halvorsen,
Pheng-Ann Heng,
Enes Hosgor,
Zeng-Guang Hou,
Fabian Isensee,
Debesh Jha,
et al. (25 additional authors not shown)
Abstract:
Intraoperative tracking of laparoscopic instruments is often a prerequisite for computer and robotic-assisted interventions. While numerous methods for detecting, segmenting and tracking of medical instruments based on endoscopic video images have been proposed in the literature, key limitations remain to be addressed: Firstly, robustness, that is, the reliable performance of state-of-the-art methods when run on challenging images (e.g. in the presence of blood, smoke or motion artifacts). Secondly, generalization; algorithms trained for a specific intervention in a specific hospital should generalize to other interventions or institutions.
In an effort to promote solutions for these limitations, we organized the Robust Medical Instrument Segmentation (ROBUST-MIS) challenge as an international benchmarking competition with a specific focus on the robustness and generalization capabilities of algorithms. For the first time in the field of endoscopic image processing, our challenge included a task on binary segmentation and also addressed multi-instance detection and segmentation. The challenge was based on a surgical data set comprising 10,040 annotated images acquired from a total of 30 surgical procedures from three different types of surgery. The validation of the competing methods for the three tasks (binary segmentation, multi-instance detection and multi-instance segmentation) was performed in three different stages with an increasing domain gap between the training and the test data. The results confirm the initial hypothesis, namely that algorithm performance degrades with an increasing domain gap. While the average detection and segmentation quality of the best-performing algorithms is high, future research should concentrate on detection and segmentation of small, crossing, moving and transparent instrument(s) (parts).
Submitted 19 May, 2020; v1 submitted 23 March, 2020;
originally announced March 2020.
-
Methods and open-source toolkit for analyzing and visualizing challenge results
Authors:
Manuel Wiesenfarth,
Annika Reinke,
Bennett A. Landman,
Manuel Jorge Cardoso,
Lena Maier-Hein,
Annette Kopp-Schneider
Abstract:
Biomedical challenges have become the de facto standard for benchmarking biomedical image analysis algorithms. While the number of challenges is steadily increasing, surprisingly little effort has been invested in ensuring high quality design, execution and reporting for these international competitions. Specifically, results analysis and visualization in the event of uncertainties have been given almost no attention in the literature. Given these shortcomings, the contribution of this paper is two-fold: (1) We present a set of methods to comprehensively analyze and visualize the results of single-task and multi-task challenges and apply them to a number of simulated and real-life challenges to demonstrate their specific strengths and weaknesses; (2) We release the open-source framework challengeR as part of this work to enable fast and wide adoption of the methodology proposed in this paper. Our approach offers an intuitive way to gain important insights into the relative and absolute performance of algorithms, which cannot be revealed by commonly applied visualization techniques. This is demonstrated by the experiments performed within this work. Our framework could thus become an important tool for analyzing and visualizing challenge results in the field of biomedical image analysis and beyond.
Submitted 5 December, 2019; v1 submitted 11 October, 2019;
originally announced October 2019.
-
BIAS: Transparent reporting of biomedical image analysis challenges
Authors:
Lena Maier-Hein,
Annika Reinke,
Michal Kozubek,
Anne L. Martel,
Tal Arbel,
Matthias Eisenmann,
Allan Hanbury,
Pierre Jannin,
Henning Müller,
Sinan Onogur,
Julio Saez-Rodriguez,
Bram van Ginneken,
Annette Kopp-Schneider,
Bennett Landman
Abstract:
The number of biomedical image analysis challenges organized per year is steadily increasing. These international competitions have the purpose of benchmarking algorithms on common data sets, typically to identify the best method for a given problem. Recent research, however, revealed that common practice related to challenge reporting does not allow for adequate interpretation and reproducibility of results. To address the discrepancy between the impact of challenges and the quality (control), the Biomedical Image Analysis ChallengeS (BIAS) initiative developed a set of recommendations for the reporting of challenges. The BIAS statement aims to improve the transparency of the reporting of a biomedical image analysis challenge regardless of field of application, image modality or task category assessed. This article describes how the BIAS statement was developed and presents a checklist which authors of biomedical image analysis challenges are encouraged to include in their submission when giving a paper on a challenge into review. The purpose of the checklist is to standardize and facilitate the review process and raise interpretability and reproducibility of challenge results by making relevant information explicit.
Submitted 31 August, 2020; v1 submitted 9 October, 2019;
originally announced October 2019.
-
A large annotated medical image dataset for the development and evaluation of segmentation algorithms
Authors:
Amber L. Simpson,
Michela Antonelli,
Spyridon Bakas,
Michel Bilello,
Keyvan Farahani,
Bram van Ginneken,
Annette Kopp-Schneider,
Bennett A. Landman,
Geert Litjens,
Bjoern Menze,
Olaf Ronneberger,
Ronald M. Summers,
Patrick Bilic,
Patrick F. Christ,
Richard K. G. Do,
Marc Gollub,
Jennifer Golia-Pernicka,
Stephan H. Heckers,
William R. Jarnagin,
Maureen K. McHugo,
Sandy Napel,
Eugene Vorontsov,
Lena Maier-Hein,
M. Jorge Cardoso
Abstract:
Semantic segmentation of medical images aims to associate a pixel with a label in a medical image without human initialization. The success of semantic segmentation algorithms is contingent on the availability of high-quality imaging data with corresponding labels provided by experts. We sought to create a large collection of annotated medical image datasets of various clinically relevant anatomies available under open source license to facilitate the development of semantic segmentation algorithms. Such a resource would allow: 1) objective assessment of general-purpose segmentation methods through comprehensive benchmarking and 2) open and free access to medical image data for any researcher interested in the problem domain. Through a multi-institutional effort, we generated a large, curated dataset representative of several highly variable segmentation tasks that was used in a crowd-sourced challenge - the Medical Segmentation Decathlon held during the 2018 Medical Image Computing and Computer Aided Interventions Conference in Granada, Spain. Here, we describe these ten labeled image datasets so that these data may be effectively reused by the research community.
Submitted 24 February, 2019;
originally announced February 2019.
-
Why rankings of biomedical image analysis competitions should be interpreted with care
Authors:
Lena Maier-Hein,
Matthias Eisenmann,
Annika Reinke,
Sinan Onogur,
Marko Stankovic,
Patrick Scholz,
Tal Arbel,
Hrvoje Bogunovic,
Andrew P. Bradley,
Aaron Carass,
Carolin Feldmann,
Alejandro F. Frangi,
Peter M. Full,
Bram van Ginneken,
Allan Hanbury,
Katrin Honauer,
Michal Kozubek,
Bennett A. Landman,
Keno März,
Oskar Maier,
Klaus Maier-Hein,
Bjoern H. Menze,
Henning Müller,
Peter F. Neher,
Wiro Niessen
, et al. (13 additional authors not shown)
Abstract:
International challenges have become the standard for validation of biomedical image analysis methods. Given their scientific impact, it is surprising that a critical analysis of common practices related to the organization of challenges has not yet been performed. In this paper, we present a comprehensive analysis of biomedical image analysis challenges conducted up to now. We demonstrate the importance of challenges and show that the lack of quality control has critical consequences. First, reproducibility and interpretation of the results is often hampered as only a fraction of relevant information is typically provided. Second, the rank of an algorithm is generally not robust to a number of variables such as the test data used for validation, the ranking scheme applied and the observers that make the reference annotations. To overcome these problems, we recommend best practice guidelines and define open research questions to be addressed in the future.
Submitted 18 September, 2019; v1 submitted 6 June, 2018;
originally announced June 2018.
-
Exploiting the potential of unlabeled endoscopic video data with self-supervised learning
Authors:
Tobias Ross,
David Zimmerer,
Anant Vemuri,
Fabian Isensee,
Manuel Wiesenfarth,
Sebastian Bodenstedt,
Fabian Both,
Philip Kessler,
Martin Wagner,
Beat Müller,
Hannes Kenngott,
Stefanie Speidel,
Annette Kopp-Schneider,
Klaus Maier-Hein,
Lena Maier-Hein
Abstract:
Surgical data science is a new research field that aims to observe all aspects of the patient treatment process in order to provide the right assistance at the right time. Due to the breakthrough successes of deep learning-based solutions for automatic image annotation, the availability of reference annotations for algorithm training is becoming a major bottleneck in the field. The purpose of this paper was to investigate the concept of self-supervised learning to address this issue.
Our approach is guided by the hypothesis that unlabeled video data can be used to learn a representation of the target domain that boosts the performance of state-of-the-art machine learning algorithms when used for pre-training. Core of the method is an auxiliary task based on raw endoscopic video data of the target domain that is used to initialize the convolutional neural network (CNN) for the target task. In this paper, we propose the re-colorization of medical images with a generative adversarial network (GAN)-based architecture as auxiliary task. A variant of the method involves a second pre-training step based on labeled data for the target task from a related domain. We validate both variants using medical instrument segmentation as target task.
The proposed approach can be used to radically reduce the manual annotation effort involved in training CNNs. Compared to the baseline approach of generating annotated data from scratch, our method reduced the number of labeled images required by up to 75% in exploratory experiments without sacrificing performance. Our method also outperforms alternative methods for CNN pre-training, such as pre-training on publicly available non-medical or medical data using the target task (in this instance: segmentation).
As it makes efficient use of available (non-)public and (un-)labeled data, the approach has the potential to become a valuable tool for CNN (pre-)training.
Submitted 31 January, 2018; v1 submitted 27 November, 2017;
originally announced November 2017.