Nuclear Pleomorphism in Canine Cutaneous Mast Cell Tumors: Comparison of Reproducibility and Prognostic Relevance between Estimates, Manual Morphometry and Algorithmic Morphometry
Authors:
Andreas Haghofer,
Eda Parlak,
Alexander Bartel,
Taryn A. Donovan,
Charles-Antoine Assenmacher,
Pompei Bolfa,
Michael J. Dark,
Andrea Fuchs-Baumgartinger,
Andrea Klang,
Kathrin Jäger,
Robert Klopfleisch,
Sophie Merz,
Barbara Richter,
F. Yvonne Schulman,
Hannah Janout,
Jonathan Ganz,
Josef Scharinger,
Marc Aubreville,
Stephan M. Winkler,
Matti Kiupel,
Christof A. Bertram
Abstract:
Variation in nuclear size and shape is an important criterion of malignancy for many tumor types; however, categorical estimates by pathologists have poor reproducibility. Measurements of nuclear characteristics (morphometry) can improve reproducibility, but manual methods are time consuming. The aim of this study was to explore the limitations of estimates and develop alternative morphometric sol…
▽ More
Variation in nuclear size and shape is an important criterion of malignancy for many tumor types; however, categorical estimates by pathologists have poor reproducibility. Measurements of nuclear characteristics (morphometry) can improve reproducibility, but manual methods are time consuming. The aim of this study was to explore the limitations of estimates and develop alternative morphometric solutions for canine cutaneous mast cell tumors (ccMCT). We assessed the following nuclear evaluation methods for measurement accuracy, reproducibility, and prognostic utility: 1) anisokaryosis (karyomegaly) estimates by 11 pathologists; 2) gold standard manual morphometry of at least 100 nuclei; 3) practicable manual morphometry with stratified sampling of 12 nuclei by 9 pathologists; and 4) automated morphometry using a deep learning-based segmentation algorithm. The study dataset comprised 96 ccMCT with available outcome information. The study dataset comprised 96 ccMCT with available outcome information. Inter-rater reproducibility of karyomegaly estimates was low ($κ$ = 0.226), while it was good (ICC = 0.654) for practicable morphometry of the standard deviation (SD) of nuclear size. As compared to gold standard manual morphometry (AUC = 0.839, 95% CI: 0.701 - 0.977), the prognostic value (tumor-specific survival) of SDs of nuclear area for practicable manual morphometry (12 nuclei) and automated morphometry were high with an area under the ROC curve (AUC) of 0.868 (95% CI: 0.737 - 0.991) and 0.943 (95% CI: 0.889 - 0.996), respectively. This study supports the use of manual morphometry with stratified sampling of 12 nuclei and algorithmic morphometry to overcome the poor reproducibility of estimates.
△ Less
Submitted 23 May, 2024; v1 submitted 26 September, 2023;
originally announced September 2023.
How Many Annotators Do We Need? -- A Study on the Influence of Inter-Observer Variability on the Reliability of Automatic Mitotic Figure Assessment
Authors:
Frauke Wilm,
Christof A. Bertram,
Christian Marzahl,
Alexander Bartel,
Taryn A. Donovan,
Charles-Antoine Assenmacher,
Kathrin Becker,
Mark Bennett,
Sarah Corner,
Brieuc Cossic,
Daniela Denk,
Martina Dettwiler,
Beatriz Garcia Gonzalez,
Corinne Gurtner,
Annika Lehmbecker,
Sophie Merz,
Stephanie Plog,
Anja Schmidt,
Rebecca C. Smedley,
Marco Tecilla,
Tuddow Thaiwong,
Katharina Breininger,
Matti Kiupel,
Andreas Maier,
Robert Klopfleisch
, et al. (1 additional authors not shown)
Abstract:
Density of mitotic figures in histologic sections is a prognostically relevant characteristic for many tumours. Due to high inter-pathologist variability, deep learning-based algorithms are a promising solution to improve tumour prognostication. Pathologists are the gold standard for database development, however, labelling errors may hamper development of accurate algorithms. In the present work…
▽ More
Density of mitotic figures in histologic sections is a prognostically relevant characteristic for many tumours. Due to high inter-pathologist variability, deep learning-based algorithms are a promising solution to improve tumour prognostication. Pathologists are the gold standard for database development, however, labelling errors may hamper development of accurate algorithms. In the present work we evaluated the benefit of multi-expert consensus (n = 3, 5, 7, 9, 11) on algorithmic performance. While training with individual databases resulted in highly variable F$_1$ scores, performance was notably increased and more consistent when using the consensus of three annotators. Adding more annotators only resulted in minor improvements. We conclude that databases by few pathologists and high label accuracy may be the best compromise between high algorithmic performance and time investment.
△ Less
Submitted 8 January, 2021; v1 submitted 4 December, 2020;
originally announced December 2020.