How Many Annotators Do We Need? -- A Study on the Influence of Inter-Observer Variability on the Reliability of Automatic Mitotic Figure Assessment
Authors:
Frauke Wilm,
Christof A. Bertram,
Christian Marzahl,
Alexander Bartel,
Taryn A. Donovan,
Charles-Antoine Assenmacher,
Kathrin Becker,
Mark Bennett,
Sarah Corner,
Brieuc Cossic,
Daniela Denk,
Martina Dettwiler,
Beatriz Garcia Gonzalez,
Corinne Gurtner,
Annika Lehmbecker,
Sophie Merz,
Stephanie Plog,
Anja Schmidt,
Rebecca C. Smedley,
Marco Tecilla,
Tuddow Thaiwong,
Katharina Breininger,
Matti Kiupel,
Andreas Maier,
Robert Klopfleisch
, et al. (1 additional authors not shown)
Abstract:
Density of mitotic figures in histologic sections is a prognostically relevant characteristic for many tumours. Due to high inter-pathologist variability, deep learning-based algorithms are a promising solution to improve tumour prognostication. Pathologists are the gold standard for database development, however, labelling errors may hamper development of accurate algorithms. In the present work…
▽ More
Density of mitotic figures in histologic sections is a prognostically relevant characteristic for many tumours. Due to high inter-pathologist variability, deep learning-based algorithms are a promising solution to improve tumour prognostication. Pathologists are the gold standard for database development, however, labelling errors may hamper development of accurate algorithms. In the present work we evaluated the benefit of multi-expert consensus (n = 3, 5, 7, 9, 11) on algorithmic performance. While training with individual databases resulted in highly variable F$_1$ scores, performance was notably increased and more consistent when using the consensus of three annotators. Adding more annotators only resulted in minor improvements. We conclude that databases by few pathologists and high label accuracy may be the best compromise between high algorithmic performance and time investment.
△ Less
Submitted 8 January, 2021; v1 submitted 4 December, 2020;
originally announced December 2020.
Deep learning algorithms out-perform veterinary pathologists in detecting the mitotically most active tumor region
Authors:
Marc Aubreville,
Christof A. Bertram,
Christian Marzahl,
Corinne Gurtner,
Martina Dettwiler,
Anja Schmidt,
Florian Bartenschlager,
Sophie Merz,
Marco Fragoso,
Olivia Kershaw,
Robert Klopfleisch,
Andreas Maier
Abstract:
Manual count of mitotic figures, which is determined in the tumor region with the highest mitotic activity, is a key parameter of most tumor grading schemes. It can be, however, strongly dependent on the area selection due to uneven mitotic figure distribution in the tumor section.We aimed to assess the question, how significantly the area selection could impact the mitotic count, which has a know…
▽ More
Manual count of mitotic figures, which is determined in the tumor region with the highest mitotic activity, is a key parameter of most tumor grading schemes. It can be, however, strongly dependent on the area selection due to uneven mitotic figure distribution in the tumor section.We aimed to assess the question, how significantly the area selection could impact the mitotic count, which has a known high inter-rater disagreement. On a data set of 32 whole slide images of H&E-stained canine cutaneous mast cell tumor, fully annotated for mitotic figures, we asked eight veterinary pathologists (five board-certified, three in training) to select a field of interest for the mitotic count. To assess the potential difference on the mitotic count, we compared the mitotic count of the selected regions to the overall distribution on the slide.Additionally, we evaluated three deep learning-based methods for the assessment of highest mitotic density: In one approach, the model would directly try to predict the mitotic count for the presented image patches as a regression task. The second method aims at deriving a segmentation mask for mitotic figures, which is then used to obtain a mitotic density. Finally, we evaluated a two-stage object-detection pipeline based on state-of-the-art architectures to identify individual mitotic figures. We found that the predictions by all models were, on average, better than those of the experts. The two-stage object detector performed best and outperformed most of the human pathologists on the majority of tumor cases. The correlation between the predicted and the ground truth mitotic count was also best for this approach (0.963 to 0.979). Further, we found considerable differences in position selection between pathologists, which could partially explain the high variance that has been reported for the manual mitotic count.
△ Less
Submitted 21 October, 2020; v1 submitted 12 February, 2019;
originally announced February 2019.