¹ Technische Hochschule Ingolstadt, Ingolstadt, Germany
² Gestalt Diagnostics, Spokane, USA
³ University of Veterinary Medicine, Vienna, Austria
⁴ Freie Universität Berlin, Berlin, Germany
⁵ SeaWorld Yas Island, Abu Dhabi, UAE
⁶ Cornell University, Ithaca, USA
⁷ USAMV, Cluj-Napoca, Romania
⁸ Medical University of Vienna, Vienna, Austria
⁹ Hoffmann-La Roche, Basel, Switzerland
¹⁰ Ross University School of Veterinary Medicine, Basseterre, St. Kitts and Nevis
¹¹ University of Florida, Gainesville, USA
¹² University Hospital Erlangen, Erlangen, Germany
¹³ IDEXX Laboratories, Kornwestheim, Germany
¹⁴ The Schwarzman Animal Medical Center, New York, USA
¹⁵ Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

On the Value of PHH3 for Mitotic Figure Detection on H&E-stained Images

Jonathan Ganz¹, Christian Marzahl², Jonas Ammeling¹, Barbara Richter³, Chloé Puget⁴, Daniela Denk⁵, Elena A. Demeter⁶, Flaviu A. Tabaran⁷, Gabriel Wasinger⁸, Karoline Lipnik³, Marco Tecilla⁹, Matthew J. Valentine¹⁰, Michael J. Dark¹¹, Niklas Abele¹², Pompei Bolfa¹⁰, Ramona Erber¹²,¹⁵, Robert Klopfleisch⁴, Sophie Merz¹³, Taryn A. Donovan¹⁴, Samir Jabari¹², Christof A. Bertram³, Katharina Breininger¹⁵, Marc Aubreville¹

Abstract

The count of mitotic figures (MFs) observed in hematoxylin and eosin (H&E)-stained slides is an important prognostic marker as it is a measure of tumor cell proliferation. However, the identification of MFs is known to have low inter-rater agreement. Deep learning algorithms can standardize this task, but they require large amounts of annotated data for training and validation. Furthermore, label noise introduced during the annotation process may impede the algorithm’s performance. Unlike H&E, the mitosis-specific antibody phospho-histone H3 (PHH3) specifically highlights MFs. Counting MFs on slides stained against PHH3 leads to higher agreement among raters and has therefore recently been used as a ground truth for the annotation of MFs in H&E. However, as PHH3 facilitates the recognition of cells that are indistinguishable in the H&E stain alone, the use of this ground truth could potentially introduce noise into the H&E-related dataset, impacting model performance. This study analyzes the impact of PHH3-assisted MF annotation on inter-rater reliability and object-level agreement through an extensive multi-rater experiment. We found that the annotators’ object-level agreement increased when using PHH3-assisted labeling. Subsequently, MF detectors were evaluated on the resulting datasets to investigate the influence of PHH3-assisted labeling on the models’ performance. Additionally, a novel dual-stain MF detector was developed to investigate the interpretation shift of PHH3-assisted labels used in H&E; it clearly outperformed single-stain detectors. However, the PHH3-assisted labels did not have a positive effect on solely H&E-based models. The high performance of our dual-input detector reveals an information mismatch between the H&E and PHH3-stained images as the cause of this effect.

Keywords: Mitotic Figure Detection · Computational Pathology · PHH3

1 Introduction

In tumor diagnosis, a crucial step is the examination of tissue samples by pathologists to derive important information related to the tumor and its appropriate treatment. One factor of interest is the proliferation fraction of the respective tumor, which can be assessed through the number of cells undergoing cell division, depicted by mitotic figures [2]. The mitotic count (MC), defined as the number of MFs within a standardized area of ten consecutive high-power fields [13], is part of various tumor grading systems in human [7, 10] as well as veterinary medicine [8, 14]. However, counting MFs is a task known to have a low inter-rater agreement [19, 3]. The use of slide scanners has enabled the digitization of entire slides into whole slide images (WSIs), which can be evaluated automatically using deep learning (DL) methods. In the context of MF detection, DL algorithms have already demonstrated human-like performance [3]. Nevertheless, those algorithms rely on large amounts of annotated data, and the annotation quality is known to affect the performance of the trained DL model [4, 21]. In contrast to the hematoxylin and eosin (H&E) stain, which is the standard stain used in histopathology, the mitosis-specific antibody phospho-histone H3 (PHH3) specifically highlights the cell nucleus during mitosis [6]. Staining against PHH3 is a validated method with diagnostic and prognostic significance, emphasizing its utility in the assessment of the MC in various tumor types [5, 6, 20, 9, 16]. Furthermore, different studies reported that an MC derived solely from PHH3 leads to lower variability of the MC among different raters compared with the MC acquired through H&E [9, 6]. Nevertheless, H&E is the routine stain in pathology, and thus detectors intended for clinical use have to be applicable to this stain. 
In order to still be able to leverage the advantages of PHH3 staining to improve the quality of H&E-stained MF datasets, staining against PHH3 has been used to assist the annotation of MFs in H&E by de-staining H&E-stained slides and re-staining them against PHH3 [17, 3]. The re-stained slides are then digitized and registered with their H&E counterparts. However, PHH3 is more sensitive to early mitotic phases and less sensitive to late phases of the cell cycle [20]. As a result, the MFs highlighted in slides stained against PHH3 differ from those recognizable in H&E, leading to an elevated (though biologically more accurate) MC [6]. Our hypothesis is that providing an annotator with PHH3 assistance, as described above, may introduce a bias: MFs that would not have been annotated, and cannot be identified, with only the H&E-stained slides available may be included in the dataset. This, in turn, leads to an information mismatch and, from the perspective of the H&E stain, noisy MF labels. Our main contributions are:

• Extensive multi-rater experiments were conducted to show PHH3-assisted annotation’s superior labeling consistency and to generate high-quality datasets to evaluate our methods.
• Using our novel dual-stain MF detector, we show the principal superiority of PHH3-assisted labels, given that both stains are available.
• We show that PHH3-assisted labels can cause interpretation shifts in H&E when training MF detectors. This implies that a ‘biologically accurate’ PHH3-assisted ground truth may not improve performance if information mismatch is ignored.

2 Datasets

Two different datasets were used in this study, one for the training of the object detectors and one for the multi-rater experiment. Both datasets utilize corresponding region of interest (ROI) pairs of tumor tissue stained with two stains: The source slides were initially stained with H&E, digitized, and subsequently de-stained and re-stained with PHH3 before being digitized again. The resulting WSIs were then registered using a registration algorithm for WSIs [12]. From the original WSIs, the ROIs were selected by two pathologists based on tissue and scan quality and the area of perceived highest mitotic activity.

2.1 Image Dataset for the Annotation Study

The dataset used in the annotation study consists of 20 ROIs representing four different tumor types, two of human origin and two of veterinary origin. Tumors of different types and species were included in order to draw broader conclusions from the results of the study. Five samples each of human astrocytoma and meningioma were collected from the diagnostic archive of anonymized hospital, after prior ethics approval (No. ANON). The slides were digitized using a Hamamatsu NanoZoomer S60 at $40\times$ magnification. From the diagnostic archive of anonymized university, five samples each of canine cutaneous mast cell tumor (CCMCT) and canine mammary carcinoma (CMC) were collected. For these slides, no ethics approval was needed. These slides were digitized with a 3DHistech Pannoramic Scan II at $40\times$ magnification. Each ROI was selected to cover $2.37\,\mathrm{mm}^{2}$ of tissue, which is equivalent to approximately ten high-power fields [13]. In the remainder of this paper, we refer to the dataset generated by annotating these images as the study dataset.
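
As a quick consistency check on the stated ROI area, the following worked arithmetic assumes a high-power field seen through a $40\times$ objective with an ocular of field number 22 mm. These optics are an assumption made here for illustration and are not stated in this paper. With field diameter $d$ and single-field area $A_{\mathrm{HPF}}$:

$$d=\frac{22\,\mathrm{mm}}{40}=0.55\,\mathrm{mm},\qquad A_{\mathrm{HPF}}=\pi\left(\frac{d}{2}\right)^{2}\approx 0.237\,\mathrm{mm}^{2},\qquad 10\,A_{\mathrm{HPF}}\approx 2.37\,\mathrm{mm}^{2}.$$

Under this assumption, ten fields add up to the ROI area used in the study.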

2.2 Development Dataset

To examine the impact of PHH3-assisted MF annotation on the performance of MF detectors, we trained detectors with annotations acquired only using H&E-stained slides and with annotations that were generated through PHH3-assisted annotation. For this purpose, we used a dataset consisting of ROIs of multiple tumors and species from different laboratories. The dataset contained ten samples of CCMCT, nine samples of CMC, ten samples of canine hemangiosarcoma, nine samples of feline lymphoma, ten samples of feline soft tissue sarcoma, three samples of human astrocytoma, ten samples of human bladder cancer, nine samples of human colon carcinoma, ten samples of human melanoma, and four samples of human meningioma. We refer to this as the development dataset. Two different ground truth definitions were available for this dataset. The first definition relied solely on H&E-stained slides, while the second was created with PHH3-assisted labeling. The H&E-only annotations were created as described in [4] by three pathologists with at least five years of experience in MF identification, assisted by a DL model with high recall that screened the slides for MF candidates. To be accepted as an MF, a candidate had to be agreed upon by at least two pathologists. The PHH3-assisted annotations were created by a single expert using an open-source web-based annotation server [11] where the two corresponding stains could be superimposed on each other with variable transparency. This way, the expert was able to assess the immunohistochemistry (IHC) label present in the PHH3-stained slide and the morphological features visible in the H&E slide at once. If the registration was not perfect, e.g., due to tissue deformation, the position of the respective cell in the H&E-stained slide was annotated. Cells that had a positive IHC label but lacked MF morphology in the H&E stain were not annotated, as these cells are not identifiable as MFs in the H&E-stained slides.

3 Methods

We investigated the value of PHH3-assisted MF annotation from two different perspectives. The impact on the inter-rater agreement was investigated via a multi-rater experiment. In a second experiment, the datasets resulting from this experiment were used to evaluate MF detectors trained on labels derived with and without PHH3 assistance.

3.1 Human Expert Mitotic Figure Annotation Study

To investigate the impact of co-registered PHH3 slides on inter-rater agreement, a study was conducted with 13 pathology experts. The study consisted of two phases, separated by a four-week washout period to prevent participants from recalling the cases. To further prevent bias, the slides were presented in randomized order and under different names in each phase. During the initial phase, participants only had access to H&E-stained ROIs. In the subsequent phase, participants could overlay co-registered PHH3-stained ROIs on the H&E-stained image with adjustable transparency, allowing for simultaneous examination of both stains. Additionally, participants were given three different labels to choose from when annotating MFs in the second phase, depending on whether an MF was identifiable in both H&E and PHH3, only in H&E, or only in PHH3. In this paper, only annotations of the first two classes are considered, as these match the annotation semantics of the PHH3-assisted ground truth of the MIDOG dataset. For imperfect registration, participants were instructed to annotate the position of the respective cell in the H&E-stained ROI. Before the second phase, a pathologist highly experienced in PHH3-assisted labeling conducted an online training session with the participants.

3.2 Detection Architectures

Figure 1: Architectural overview of our dual-stain MF detector.

To assess the impact of PHH3-assisted labeling on the performance of object detectors, we used the Fully Convolutional One-Stage Object Detector (FCOS) [18]. Our hypothesis was that the annotation assisted by PHH3 will introduce MFs into the dataset that are not distinguishable in H&E alone. The solely H&E-based detector lacks the information from the PHH3-stained section, which causes an interpretation shift in the labels and introduces noise into the H&E dataset. This noise could potentially impede the performance of models trained on such a dataset. If this is correct, a detector that is able to use the information from both stains should show better results on such a dataset than a pure H&E detector. To test the hypothesis we extended the FCOS detector into a Dual-Input FCOS (DI-FCOS) detector as depicted in Fig. 1, which was trained on the corresponding H&E and PHH3 patches simultaneously. The DI-FCOS model consists of two ResNet18 backbones. Feature fusion is achieved via mid-fusion, i.e., information from different input modalities is combined at an intermediate stage within the network [15]. In particular, the features of each ResNet level are fused before they are forwarded to the feature pyramid, using one merging network for each input level of the feature pyramid. Let $\mathbf{H}\in\mathbb{R}^{C\times H\times W}$ and $\mathbf{P}\in\mathbb{R}^{C\times H\times W}$ be the feature maps of a level of the H&E and PHH3 backbone, where $C$ represents the number of channels of the respective level and $H$ and $W$ are the sizes of the feature maps which depend on the size of the input image. 
Then the fused features $\mathbf{F}\in\mathbb{R}^{C\times H\times W}$ are computed by $\mathbf{F}=\mathrm{ReLU}(\mathrm{Conv}(\mathrm{LayerNorm}(\mathrm{Cat}(\mathbf{H},\mathbf{P}))))$, where Cat denotes a concatenation of the feature maps along the channel dimension and Conv is a $1\times 1$ convolution that halves the number of channels from $2C$ back to $C$ after the concatenation. The rest of the network follows the standard FCOS architecture as described in [18]. To confirm that a difference in model performance between the FCOS and DI-FCOS models is not due to the DI-FCOS models’ higher number of parameters, we also compare it to an FCOS detector with a ResNet101 backbone. For all experiments, the feature maps from the second, third, and fourth blocks of the respective backbone were used to construct the feature pyramids for both the standard FCOS and the DI-FCOS model. We used a fixed learning rate of $10^{-4}$ and AdamW as the optimizer. All models were trained until convergence, which was monitored using the average precision (AP) on the validation set; the same metric was used for early stopping and for model selection. All models were trained on patches with a height and width of 512 pixels, and patches were selected so that at least $50\,\%$ of the training patches contained MFs. A standard image augmentation pipeline was used during the training of each object detector in this study.
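
The fusion step for a single pyramid level can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this work: the axis over which LayerNorm normalizes (here the channel axis at each spatial position), the absence of learned normalization parameters, and the randomly initialized `weight` and `bias` are all assumptions, and the names `layer_norm` and `fuse` are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize over the channel axis at every spatial position (assumed axis)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def fuse(h_feat, p_feat, weight, bias):
    """F = ReLU(Conv1x1(LayerNorm(Cat(H, P)))) for maps of shape (C, height, width)."""
    cat = np.concatenate([h_feat, p_feat], axis=0)  # (2C, height, width)
    normed = layer_norm(cat)
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    conv = np.einsum("oc,chw->ohw", weight, normed) + bias[:, None, None]
    return np.maximum(conv, 0.0)  # ReLU


# Toy feature maps standing in for one level of the two backbones.
rng = np.random.default_rng(0)
C, height, width = 64, 16, 16
h_feat = rng.standard_normal((C, height, width))   # H&E branch
p_feat = rng.standard_normal((C, height, width))   # PHH3 branch
weight = rng.standard_normal((C, 2 * C)) * 0.1     # maps 2C channels back to C
bias = np.zeros(C)

fused = fuse(h_feat, p_feat, weight, bias)
print(fused.shape)  # (64, 16, 16)
```

Because the output has the same shape as each single-stain feature map, the fused tensor can be passed to an unmodified feature pyramid, which is what allows the rest of the detector to stay identical to the single-input model.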

4 Results

We first describe the results of the human-rater experiment as the resulting dataset was part of the evaluation of the object detectors.

To measure agreement at the object level, we compared each rater’s annotation on images of the study dataset to the consensus of all other raters’ annotations for each phase of the experiment. We formed the consensus by matching the annotations of the remaining raters for each image using a distance-based clustering approach. An annotation was added to a cluster if it was no more than 7.5 micrometers away from the center of the cluster, approximately corresponding to the diameter of a nucleus [1]. A cluster was considered an MF if it contained annotations from at least six raters. A higher threshold could degrade the consensus because raters tend to miss MFs that are difficult to identify [19, 4]; such easily overlooked MFs would then be excluded. For the creation of the H&E-only label set of the development dataset, a DL model with high recall was used to overcome this problem. Although this threshold is a hyperparameter that can influence the results, the same trend was observed across different thresholds (see supplementary Figure S1). The agreement between the raters’ annotations and the described consensus was then measured via Dice Similarity Coefficient/F1-score, as proposed in [19], precision, and recall. Additionally, we evaluated the inter-rater reliability of the MC using the intraclass correlation coefficient (ICC). Figure 2 (a) shows the precision and recall values for each rater in the different phases of the study. Each rater demonstrated an increase in either recall, precision, or both when comparing the results of phase P1 to P2. If there was a slight decrease in either precision or recall in phase P2, it was always accompanied by a substantial increase in the other metric. The results over all raters are given in Figure 2 (b). Each metric increased by a large margin from P1 to P2. 
In particular, we found the average F1-score to increase from $0.53\pm 0.11$ to $0.74\pm 0.11$, the average precision from $0.53\pm 0.20$ to $0.78\pm 0.17$, and the average recall from $0.67\pm 0.19$ to $0.77\pm 0.19$. Furthermore, the ICC increased from $0.90$ in P1 to $0.99$ in P2.
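
The leave-one-rater-out agreement measure described above can be illustrated with a small, self-contained sketch. The 7.5 micrometer radius and the six-rater threshold come from the text; everything else is an assumption made for illustration, in particular the greedy clustering order, the running-mean cluster center, the one-to-one matching in `score`, and the toy coordinates. The function names are hypothetical.

```python
from math import hypot

RADIUS_UM = 7.5   # clustering radius from the text
MIN_RATERS = 6    # consensus threshold from the text


def cluster(annotations, radius=RADIUS_UM):
    """Greedily cluster (rater, x, y) annotations given in micrometers."""
    clusters = []
    for rater, x, y in annotations:
        best, best_d = None, radius
        for c in clusters:
            d = hypot(x - c["center"][0], y - c["center"][1])
            if d <= best_d:  # nearest cluster center within the radius
                best, best_d = c, d
        if best is None:
            clusters.append({"points": [(x, y)], "raters": {rater}, "center": (x, y)})
        else:
            best["points"].append((x, y))
            best["raters"].add(rater)
            n = len(best["points"])
            best["center"] = (sum(px for px, _ in best["points"]) / n,
                              sum(py for _, py in best["points"]) / n)
    return clusters


def consensus(annotations, held_out, min_raters=MIN_RATERS):
    """Consensus MF centers from all raters except the held-out one."""
    others = [a for a in annotations if a[0] != held_out]
    return [c["center"] for c in cluster(others) if len(c["raters"]) >= min_raters]


def score(rater_points, consensus_points, radius=RADIUS_UM):
    """Precision, recall, and F1 of one rater against the consensus."""
    unmatched = list(consensus_points)
    tp = 0
    for x, y in rater_points:
        hit = next((c for c in unmatched if hypot(x - c[0], y - c[1]) <= radius), None)
        if hit is not None:
            unmatched.remove(hit)  # each consensus MF is matched at most once
            tp += 1
    fp, fn = len(rater_points) - tp, len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# Toy image: raters 1-6 mark the same cell, raters 1-3 mark a second cell,
# and rater 0 marks the first cell plus one isolated location.
annotations = [(r, 100.0 + 0.5 * r, 100.0) for r in range(1, 7)]
annotations += [(r, 300.0, 300.0) for r in range(1, 4)]
annotations += [(0, 101.0, 99.0), (0, 500.0, 500.0)]

cons = consensus(annotations, held_out=0)
own = [(x, y) for r, x, y in annotations if r == 0]
precision, recall, f1 = score(own, cons)
print(len(cons), precision, recall, round(f1, 3))  # 1 0.5 1.0 0.667
```

In the toy example, the second cell is excluded from the consensus because only three raters marked it, so rater 0 is not penalized for missing it, while the isolated annotation counts as a false positive.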
Second, we investigated the performance of MF detectors that were trained with H&E-only and with PHH3-assisted labels. To increase the statistical robustness of the results, we trained each object detector in a five-fold Monte Carlo cross-validation scheme. For this, the development dataset was randomly split five times into $70\%$ training, $15\%$ validation, and $15\%$ test cases. To ensure comparability of the results, all models were trained and tested on the same five splits. Furthermore, we evaluated the resulting models on the study datasets derived from P1 (H&E-only) and P2 (PHH3-assisted) of the human rater experiment. The results of both evaluations are given in Table 1. The AP was used to measure the object detection performance.
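
A minimal sketch of such a case-level Monte Carlo split is given below. The 84 cases correspond to the sum of the samples listed for the development dataset; the fixed seed, the rounding of the split sizes, the absence of stratification by tumor type, and the function name `monte_carlo_splits` are assumptions made for illustration.

```python
import random


def monte_carlo_splits(case_ids, n_splits=5, train_frac=0.70, val_frac=0.15, seed=0):
    """Draw independent random train/val/test partitions at the case level."""
    rng = random.Random(seed)  # fixed seed so every model sees the same splits
    splits = []
    for _ in range(n_splits):
        shuffled = list(case_ids)
        rng.shuffle(shuffled)
        n_train = round(train_frac * len(shuffled))
        n_val = round(val_frac * len(shuffled))
        splits.append({
            "train": shuffled[:n_train],
            "val": shuffled[n_train:n_train + n_val],
            "test": shuffled[n_train + n_val:],  # remaining cases
        })
    return splits


cases = [f"case_{i:02d}" for i in range(84)]  # hypothetical case identifiers
splits = monte_carlo_splits(cases)
for s in splits:
    print(len(s["train"]), len(s["val"]), len(s["test"]))
```

Unlike k-fold cross-validation, the test sets of different repetitions may overlap, which is why reusing the identical five splits for every model matters for comparability.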

Table 1: Results of the FCOS and DI-FCOS models on the test sets of the five-fold cross-validation for the different label sets of the development dataset, and results of model inference on the study dataset. Given are the means and standard deviations of the average precision across the cross-validation splits. We found only a minor impact of the number of annotators required for the consensus ground truth, see Figure S3.

Test dataset	Model	Parameters	Trained: H&E-only, evaluated: H&E-only	Trained: H&E-only, evaluated: PHH3-assisted	Trained: PHH3-assisted, evaluated: H&E-only	Trained: PHH3-assisted, evaluated: PHH3-assisted
Development dataset	FCOS (ResNet18)	19.0 M	$0.70\pm 0.03$	$0.64\pm 0.02$	$0.71\pm 0.03$	$0.68\pm 0.04$
Development dataset	FCOS (ResNet101)	51.0 M	$\mathbf{0.74}\pm 0.04$	$0.68\pm 0.05$	$0.71\pm 0.04$	$0.69\pm 0.06$
Development dataset	DI-FCOS (ResNet18)	39.9 M	$\mathbf{0.74}\pm 0.04$	$0.73\pm 0.05$	$0.72\pm 0.04$	$\mathbf{0.79}\pm 0.05$
Study dataset	FCOS (ResNet18)	19.0 M	$0.64\pm 0.02$	$0.58\pm 0.03$	$0.61\pm 0.04$	$0.61\pm 0.02$
Study dataset	FCOS (ResNet101)	51.0 M	$0.66\pm 0.03$	$0.60\pm 0.02$	$0.62\pm 0.03$	$0.60\pm 0.04$
Study dataset	DI-FCOS (ResNet18)	39.9 M	$0.66\pm 0.04$	$\mathbf{0.72}\pm 0.06$	$0.61\pm 0.05$	$\mathbf{0.81}\pm 0.05$

We found that the single-input FCOS models performed at least as well, and in all but one case better, on the H&E-only labels compared to the PHH3-assisted labels, regardless of whether they were trained with H&E-only labels or with PHH3-assisted labels. The DI-FCOS models outperformed the FCOS models on the PHH3-assisted label sets of both datasets. On the H&E-only label sets of both datasets, the FCOS model with the larger backbone performed on par with the DI-FCOS model. Overall, the highest performance was observed when the DI-FCOS models were trained and tested on PHH3-assisted labels, with a mean AP of $0.79\pm 0.05$ on the development dataset and a mean AP of $0.81\pm 0.05$ on the study dataset (see Figure S2 for examples). The performance of these models was considerably lower when tested on the H&E-only labels ($0.72\pm 0.04$ and $0.61\pm 0.05$, respectively).

5 Discussion

This study demonstrates that the inter-rater reliability and agreement of MF annotations substantially improve with the use of co-registered PHH3 stains of the exact same slide as an annotation assistance. However, the performance of the single-input object detectors on the PHH3-assisted labels also confirmed our hypothesis that the PHH3 assistance causes MFs to be included in the dataset that are not distinguishable in H&E alone. Given the only marginal difference between both single-input models when tested on the H&E-only labels, it appears that including those MFs for model training has only a minor effect. However, if a PHH3-assisted label set is used to evaluate an MF detector, this may underestimate the performance of the detector. This is likely due to a low recall on cells that are not identifiable as MFs in H&E. Since the IHC label is dependent on the biological process of cell division, the PHH3-assisted ground truth can be considered a more accurate estimate of the actual biological ground truth. Therefore, while an MF detector trained with PHH3-assisted data may perform less favorably on an H&E-only ground truth, its predictions could be closer to the biological ground truth. The results of our DI-FCOS detectors on the PHH3-assisted labels demonstrate that high performance on a PHH3-assisted label set is possible when information from both modalities is available. Although MFs in the PHH3- and H&E-stained sections may not always align perfectly (see Figure S2), the DI-FCOS results demonstrate robustness against this displacement. Fine registration of patches before inputting them into the detector could potentially enhance the results even further. Considering that similar results were achieved with both label sets in the development dataset, it is reasonable to conclude that PHH3-assisted labeling with only one rater could replace a complex multi-rater approach as described in [4]. 
Hence, future research can investigate how to use its potential for labeling in H&E without any negative effects.

References

[1] Aubreville, M., Stathonikos, N., Bertram, C.A., Klopfleisch, R., Ter Hoeve, N., Ciompi, F., Wilm, F., Marzahl, C., Donovan, T.A., Maier, A., et al.: Mitosis domain generalization in histopathology images—the MIDOG challenge. Medical Image Analysis 84, 102699 (2023)
[2] Baak, J.P., van Diest, P.J., Voorhorst, F.J., van der Wall, E., Beex, L.V., Vermorken, J.B., Janssen, E.A.: Prospective multicenter validation of the independent prognostic value of the mitotic activity index in lymph node-negative breast cancer patients younger than 55 years. Journal of Clinical Oncology 23, 5993–6001 (2005)
[3] Bertram, C.A., Aubreville, M., Donovan, T.A., Bartel, A., Wilm, F., Marzahl, C., Assenmacher, C.A., Becker, K., Bennett, M., Corner, S., et al.: Computer-assisted mitotic count using a deep learning–based algorithm improves interobserver reproducibility and accuracy. Veterinary pathology 59(2), 211–226 (2022)
[4] Bertram, C.A., Veta, M., Marzahl, C., Stathonikos, N., Maier, A., Klopfleisch, R., Aubreville, M.: Are pathologist-defined labels reproducible? Comparison of the TUPAC16 mitotic figure dataset with an alternative set of labels. In: Interpretable and Annotation-Efficient Learning for Medical Image Computing. pp. 204–213. Springer (2020)
[5] Colman, H., Giannini, C., Huang, L., Gonzalez, J., Hess, K., Bruner, J., Fuller, G., Langford, L., Pelloski, C., Aaron, J., Burger, P., Aldape, K.: Assessment and prognostic significance of mitotic index using the mitosis marker phospho-histone H3 in low and intermediate-grade infiltrating astrocytomas. American Journal of Surgical Pathology 30, 657–664 (2006)
[6] Duregon, E., Cassenti, A., Pittaro, A., Ventura, L., Senetta, R., Rudà, R., Cassoni, P.: Better see to better agree: Phosphohistone H3 increases interobserver agreement in mitotic count for meningioma grading and imposes new specific thresholds. Neuro-Oncology 17, 663–669 (2015)
[7] Elston, C.W., Ellis, I.O.: Pathological prognostic factors in breast cancer. I. The value of histological grade in breast cancer: experience from a large study with long‐term follow‐up. Histopathology 19, 403–410 (1991)
[8] Kiupel, M., Webster, J., Bailey, K., Best, S., DeLay, J., Detrisac, C., Fitzgerald, S., Gamble, D., Ginn, P., Goldschmidt, M., et al.: Proposal of a 2-tier histologic grading system for canine cutaneous mast cell tumors to more accurately predict biological behavior. Veterinary pathology 48(1), 147–155 (2011)
[9] Laflamme, P., Mansoori, B.K., Sazanova, O., Orain, M., Couture, C., Simard, S., Trahan, S., Manem, V., Joubert, P.: Phospho-histone-H3 immunostaining for pulmonary carcinoids: impact on clinical appraisal, interobserver correlation, and diagnostic processing efficiency. Human Pathology 106, 74–81 (2020)
[10] Louis, D.N., Perry, A., Wesseling, P., Brat, D.J., Cree, I.A., Figarella-Branger, D., Hawkins, C., Ng, H.K., Pfister, S.M., Reifenberger, G., Soffietti, R., Deimling, A.V., Ellison, D.W.: The 2021 WHO classification of tumors of the central nervous system: A summary. Neuro-Oncology 23, 1231–1251 (2021)
[11] Marzahl, C., Aubreville, M., Bertram, C.A., Maier, J., Bergler, C., Kröger, C., Voigt, J., Breininger, K., Klopfleisch, R., Maier, A.: EXACT: a collaboration toolset for algorithm-aided annotation of images with annotation version control. Scientific Reports 11(1), 4343 (2021)
[12] Marzahl, C., Wilm, F., Tharun, L., Perner, S., Bertram, C.A., Kröger, C., Voigt, J., Klopfleisch, R., Maier, A., Aubreville, M., et al.: Robust quad-tree based registration on whole slide images. In: MICCAI Workshop on Computational Pathology. pp. 181–190. PMLR (2021)
[13] Meuten, D.J., Moore, F.M., George, J.W.: Mitotic count and the field of view area: Time to standardize. Veterinary Pathology 53, 7–9 (2016)
[14] Peña, L., Andrés, P.D., Clemente, M., Cuesta, P., Pérez-Alenza, M.: Prognostic value of histological grading in noninflammatory canine mammary carcinomas in a prospective study with two-year follow-up: relationship with clinical and histological characteristics. Veterinary Pathology 50(1), 94–105 (2013)
[15] Qingyun, F., Zhaokui, W.: Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery. Pattern Recognition 130, 108786 (2022)
[16] Skaland, I., Janssen, E.A., Gudlaugsson, E., Klos, J., Kjellevold, K.H., Søiland, H., Baak, J.P.: Validating the prognostic value of proliferation measured by phosphohistone H3 (PPH3) in invasive lymph node-negative breast cancer patients less than 71 years of age. Breast Cancer Research and Treatment 114, 39–45 (2009)
[17] Tellez, D., Balkenhol, M., Otte-Höller, I., van de Loo, R., Vogels, R., Bult, P., Wauters, C., Vreuls, W., Mol, S., Karssemeijer, N., et al.: Whole-slide mitosis detection in H&E breast histology using PHH3 as a reference to train distilled stain-invariant convolutional networks. IEEE transactions on medical imaging 37(9), 2126–2136 (2018)
[18] Tian, Z., Shen, C., Chen, H., He, T.: FCOS: Fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9627–9636 (2019)
[19] Veta, M., Diest, P.J.V., Jiwa, M., Al-Janabi, S., Pluim, J.P.: Mitosis counting in breast cancer: Object-level interobserver agreement and comparison to an automatic method. PLoS ONE 11 (2016)
[20] Voss, S.M., Riley, M.P., Lokhandwala, P.M., Wang, M., Yang, Z.: Mitotic count by phosphohistone H3 immunohistochemical staining predicts survival and improves interobserver reproducibility in well-differentiated neuroendocrine tumors of the pancreas. The American Journal of Surgical Pathology 39(1), 13–24 (2015)
[21] Wilm, F., Bertram, C.A., Marzahl, C., Bartel, A., Donovan, T.A., Assenmacher, C.A., Becker, K., Bennett, M., Corner, S., Cossic, B., et al.: Influence of inter-annotator variability on automatic mitotic figure assessment. In: Bildverarbeitung für die Medizin 2021: Proceedings, German Workshop on Medical Image Computing, Regensburg, March 7-9, 2021. pp. 241–246. Springer (2021)