(1) Technische Hochschule Ingolstadt, Ingolstadt, Germany
(2) Gestalt Diagnostics, Spokane, USA
(3) University of Veterinary Medicine, Vienna, Austria
(4) Freie Universität Berlin, Berlin, Germany
(5) SeaWorld Yas Island, Abu Dhabi, UAE
(6) Cornell University, Ithaca, USA
(7) USAMV, Cluj-Napoca, Romania
(8) Medical University of Vienna, Vienna, Austria
(9) Hoffmann-La Roche, Basel, Switzerland
(10) Ross University School of Veterinary Medicine, Basseterre, St. Kitts and Nevis
(11) University of Florida, Gainesville, USA
(12) University Hospital Erlangen, Erlangen, Germany
(13) IDEXX Laboratories, Kornwestheim, Germany
(14) The Schwarzman Animal Medical Center, New York, USA
(15) Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

On the Value of PHH3 for Mitotic Figure Detection on H&E-stained Images

Jonathan Ganz (1), Christian Marzahl (2), Jonas Ammeling (1), Barbara Richter (3), Chloé Puget (4), Daniela Denk (5), Elena A. Demeter (6), Flaviu A. Tabaran (7), Gabriel Wasinger (8), Karoline Lipnik (3), Marco Tecilla (9), Matthew J. Valentine (10), Michael J. Dark (11), Niklas Abele (12), Pompei Bolfa (10), Ramona Erber (12, 15), Robert Klopfleisch (4), Sophie Merz (13), Taryn A. Donovan (14), Samir Jabari (12), Christof A. Bertram (3), Katharina Breininger (15), Marc Aubreville (1)
Abstract

The count of mitotic figures (MFs) observed in hematoxylin and eosin (H&E)-stained slides is an important prognostic marker as it is a measure for tumor cell proliferation. However, the identification of MFs has a known low inter-rater agreement. Deep learning algorithms can standardize this task, but they require large amounts of annotated data for training and validation. Furthermore, label noise introduced during the annotation process may impede the algorithm’s performance. Unlike H&E, the mitosis-specific antibody phospho-histone H3 (PHH3) specifically highlights MFs. Counting MFs on slides stained against PHH3 leads to higher agreement among raters and has therefore recently been used as a ground truth for the annotation of MFs in H&E. However, as PHH3 facilitates the recognition of cells indistinguishable from H&E stain alone, the use of this ground truth could potentially introduce noise into the H&E-related dataset, impacting model performance. This study analyzes the impact of PHH3-assisted MF annotation on inter-rater reliability and object level agreement through an extensive multi-rater experiment. We found that the annotators’ object-level agreement increased when using PHH3-assisted labeling. Subsequently, MF detectors were evaluated on the resulting datasets to investigate the influence of PHH3-assisted labeling on the models’ performance. Additionally, a novel dual-stain MF detector was developed to investigate the interpretation-shift of PHH3-assisted labels used in H&E, which clearly outperformed single-stain detectors. However, the PHH3-assisted labels did not have a positive effect on solely H&E-based models. The high performance of our dual-input detector reveals an information mismatch between the H&E and PHH3-stained images as the cause of this effect.

Keywords:
Mitotic Figure Detection · Computational Pathology · PHH3

1 Introduction

In tumor diagnosis, a crucial step is the examination of tissue samples by pathologists to derive important information related to the tumor and its appropriate treatment. One factor of interest is the proliferation fraction of the respective tumor, which can be assessed through the number of cells undergoing cell division, depicted by mitotic figures [2]. The mitotic count (MC), defined as the number of MFs within a standardized area of ten consecutive high-power fields [13], is part of various tumor grading systems in human [7, 10] as well as veterinary medicine [8, 14]. However, counting MFs is a task known to have a low inter-rater agreement [19, 3]. The use of slide scanners has enabled the digitization of entire slides into whole slide images, which can be evaluated automatically using deep learning (DL) methods. In the context of MF detection, DL algorithms already demonstrated human-like performance [3]. Nevertheless, those algorithms rely on large amounts of annotated data, and the annotation quality is known to affect the performance of the trained DL model [4, 21]. In contrast to the hematoxylin and eosin (H&E) stain, which is the standard stain used in histopathology, the mitosis-specific antibody phosphohistone H3 (PHH3) specifically highlights the cell nucleus during mitosis [6]. Staining against PHH3 is a validated method with diagnostic and prognostic significance, emphasizing its utility in the assessment of the MC in various tumor types [5, 6, 20, 9, 16]. Furthermore, different studies reported that MC derived solely by PHH3 leads to lower variability of the MC among different raters compared with the MC acquired through H&E [9, 6]. Nevertheless, H&E is the routine stain in pathology, and thus detectors intended for clinical use have to be applicable to this stain. 
In order to still be able to leverage the advantages of PHH3 staining to improve the quality of H&E-stained MF datasets, staining against PHH3 has been used as an assistance for the annotation of MFs in H&E by de-staining H&E-stained slides and re-staining them against PHH3 [17, 3]. The re-stained slides are then digitized and registered with their H&E counterparts. However, PHH3 is more sensitive to early mitotic phases and less sensitive to late phases of the cell cycle [20]. As a result, the MFs highlighted in slides stained against PHH3 differ from those recognizable in H&E, leading to an elevated (though biologically more accurate) MC [6]. Our hypothesis is that providing an annotator with PHH3 assistance, as described above, may introduce a potential bias that results in the inclusion of MFs in the dataset that would not have been annotated with only H&E available. This, in turn, may result in the inclusion of MFs that cannot be identified with only the H&E-stained slides available, leading to an information mismatch and, from the perspective of the H&E, noisy MF labels. Our main contributions are:

  • Extensive multi-rater experiments were conducted to show PHH3-assisted annotation’s superior labeling consistency and to generate high-quality datasets to evaluate our methods.

  • Using our novel dual-stain MF detector, we show the principal superiority of PHH3-assisted labels, given that both stains are available.

  • We show that PHH3-assisted labels can cause interpretation shifts in H&E when training MF detectors. This implies that a ’biologically accurate’ PHH3-assisted ground truth may not improve performance if information mismatch is ignored.

2 Datasets

Two different datasets were used in this study, one for the training of the object detectors and one for the multi-rater experiment. Both datasets utilize corresponding region of interest (ROI) pairs of tumor tissue stained with two stains: The source slides were initially stained with H&E, digitized, and subsequently de-stained and re-stained with PHH3 before being digitized again. The resulting WSIs were then registered using a registration algorithm for WSIs [12]. From the original WSIs, the ROIs were selected by two pathologists based on tissue and scan quality and the area of perceived highest mitotic activity.

2.1 Image Dataset for the Annotation Study

The dataset used in the annotation study consists of 20 ROIs representing four different tumor types, two of human origin and two of veterinary origin. Tumors of different types and species were included in order to draw broader conclusions from the results of the study. Five samples each of human astrocytoma and meningioma were collected from the diagnostic archive of anonymized hospital, after prior ethics approval (No. ANON). The slides were digitized using a Hamamatsu NanoZoomer S60 at 40× magnification. From the diagnostic archive of anonymized university, five samples each of canine cutaneous mast cell tumor (CCMCT) and canine mammary carcinoma (CMC) were collected. For these slides, no ethics approval was needed. These slides were digitized with a 3DHistech Pannoramic Scan II at 40× magnification. Each ROI was selected to cover 2.37 mm² of tissue, which is equivalent to approximately ten high-power fields [13]. In the remainder of this paper, we refer to the dataset generated by annotating these images as the study dataset.
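As a rough cross-check of the ROI size, the arithmetic below assumes a standard high-power field obtained with a 40× objective and an ocular field number of 22 mm, i.e., a field diameter of 0.55 mm. These optical parameters are an assumption made for illustration and are not stated in this paper.

```latex
\[
A_{\mathrm{HPF}} = \pi \left( \frac{0.55\,\mathrm{mm}}{2} \right)^{2} \approx 0.237\,\mathrm{mm}^{2},
\qquad
10 \cdot A_{\mathrm{HPF}} \approx 2.37\,\mathrm{mm}^{2}
\]
```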

2.2 Development Dataset

To examine the impact of PHH3-assisted MF annotation on the performance of MF detectors, we trained detectors with annotations acquired only using H&E-stained slides and with annotations that were generated through PHH3-assisted annotation. For this purpose, we used a dataset consisting of ROIs of multiple tumors and species from different laboratories. The dataset contained ten samples of CCMCT, nine samples of CMC, ten samples of canine hemangiosarcoma, nine samples of feline lymphoma, ten samples of feline soft tissue sarcoma, three samples of human astrocytoma, ten samples of human bladder cancer, nine samples of human colon carcinoma, ten samples of human melanoma, and four samples of human meningioma. We refer to this as the development dataset. Two different ground truth definitions were available for this dataset. The first definition relied solely on H&E-stained slides, while the second was created with PHH3-assisted labeling. The H&E-only annotations were created as described in [4] by three pathologists with at least five years of experience in MF identification, assisted by a DL model with high recall that screened the slides for MF candidates. To be accepted as a MF, at least two pathologists had to agree upon a candidate. The PHH3-assisted annotations were created by a single expert using an open source web-based annotation server [11] where the two corresponding stains could be superimposed on each other with variable transparency. This way, the expert was able to assess the immunohistochemistry (IHC) label present in the PHH3-stained slide and the morphological features visible in the H&E slide at once. If the registration was not perfect, e.g., due to tissue deformation, the position of the respective cell in the H&E-stained slide was annotated. Cells that had a positive IHC label but that were lacking MF morphology in the H&E stain were not annotated, as these cells are not identifiable as MF in the H&E-stained slides.

3 Methods

We investigated the value of PHH3-assisted MF annotation from two different perspectives. The impact on the inter-rater agreement was investigated via a multi-rater experiment. In a second experiment, the datasets resulting from this experiment were used to evaluate MF detectors trained on labels derived with and without PHH3 assistance.

3.1 Human Expert Mitotic Figure Annotation Study

To investigate the impact of co-registered PHH3 slides on inter-rater agreement, a study was conducted with 13 pathology experts. The study consisted of two phases, separated by a four-week washout period to prevent participants from recalling the cases. To further prevent bias, the slides were presented in randomized order and under different names in each phase. During the initial phase, participants only had access to H&E-stained ROIs. In the subsequent phase, participants could overlay co-registered PHH3-stained ROIs on the H&E-stained image with adjustable transparency, allowing for simultaneous examination of both stains. Additionally, participants were given three different labels to choose from when annotating MFs in the second phase, depending on whether a MF was identifiable in both H&E and PHH3, only in H&E, or only in PHH3. In this paper, only annotations of the first two classes are considered, as these match the annotation semantics of the PHH3-assisted ground truth of the MIDOG dataset. For imperfect registration, participants were instructed to annotate the position of the respective cell in the H&E-stained ROI. Before the second phase, a pathologist highly experienced in PHH3-assisted labeling conducted an online training with the participants.

3.2 Detection Architectures

Figure 1: Architectural overview of our dual-stain MF detector.

To assess the impact of PHH3-assisted labeling on the performance of object detectors, we used the Fully Convolutional One-Stage Object Detector (FCOS) [18]. Our hypothesis was that the annotation assisted by PHH3 will introduce MFs into the dataset that are not distinguishable in H&E alone. The solely H&E-based detector lacks the information from the PHH3-stained section, which causes an interpretation shift in the labels and introduces noise into the H&E dataset. This noise could potentially impede the performance of models trained on such a dataset. If this is correct, a detector that is able to use the information from both stains should show better results on such a dataset than a pure H&E detector. To test the hypothesis we extended the FCOS detector into a Dual-Input FCOS (DI-FCOS) detector as depicted in Fig. 1, which was trained on the corresponding H&E and PHH3 patches simultaneously. The DI-FCOS model consists of two ResNet18 backbones. Feature fusion is achieved via mid-fusion, i.e., information from different input modalities is combined at an intermediate stage within the network [15]. In particular, the features of each ResNet level are fused before they are forwarded to the feature pyramid, using one merging network for each input level of the feature pyramid. Let $\mathbf{H} \in \mathbb{R}^{C \times H \times W}$ and $\mathbf{P} \in \mathbb{R}^{C \times H \times W}$ be the feature maps of a level of the H&E and PHH3 backbone, where $C$ represents the number of channels of the respective level and $H$ and $W$ are the sizes of the feature maps which depend on the size of the input image.
Then the fused features $\mathbf{F} \in \mathbb{R}^{C \times H \times W}$ are computed by $\mathbf{F} = \text{ReLU}(\text{Conv}(\text{LayerNorm}(\text{Cat}(\mathbf{H}, \mathbf{P}))))$, where Cat denotes a concatenation of the feature vectors along the channel dimension and Conv is a $1 \times 1$ convolution which halves the number of input channels to $C$ after the concatenation. The rest of the network follows the standard FCOS architecture as described in [18]. To confirm that a difference in model performance between the FCOS and DI-FCOS models is not due to the DI-FCOS models’ higher number of parameters, we also compare it to an FCOS detector with a ResNet101 backbone. For all experiments, the feature maps from the second, third, and fourth blocks of the respective backbone were used to construct the feature pyramids for both the standard FCOS and the DI-FCOS model. We used a fixed learning rate of $10^{-4}$ and AdamW as the optimizer. All models were trained until convergence, which was observed using the average precision (AP) metric on the validation set, which we also used for early stopping and for model selection. All models were trained on patches with a height and width of 512 pixels, and patches were selected so that at least 50% of the training patches contained MFs. A standard image augmentation pipeline was used during the training of each object detector in this study.
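The tensor shapes in the fusion step can be written out explicitly. The following is a sketch under assumptions: the symbols $\mathbf{Z}$, $\hat{\mathbf{Z}}$, $\mathbf{W}$, and $b_c$ are introduced here for illustration only, and neither the presence of a bias term nor the axes over which the layer normalization is computed are specified in the text.

```latex
\begin{align*}
\mathbf{Z} &= \text{Cat}(\mathbf{H}, \mathbf{P}) \in \mathbb{R}^{2C \times H \times W}, \\
\hat{\mathbf{Z}} &= \text{LayerNorm}(\mathbf{Z}) \in \mathbb{R}^{2C \times H \times W}, \\
\mathbf{F}_{c,i,j} &= \max\Bigl(0,\; \sum_{k=1}^{2C} \mathbf{W}_{c,k}\, \hat{\mathbf{Z}}_{k,i,j} + b_c \Bigr),
\qquad \mathbf{W} \in \mathbb{R}^{C \times 2C},\; c = 1, \dots, C .
\end{align*}
```

Because a $1 \times 1$ convolution acts on each spatial position $(i, j)$ independently, the spatial size $H \times W$ is unchanged and the fused map has the same shape as the feature map of a single backbone, so it can be passed to the feature pyramid without further adaptation.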

4 Results

We first describe the results of the human-rater experiment as the resulting dataset was part of the evaluation of the object detectors.

Figure 2: Panel (a) displays the precision and recall of each rater plotted against the consensus of the remaining raters for phases one and two. The results of the first (P1) and second (P2) phases of the study are marked by a dot and a cross, respectively. Each rater is represented by a different color. The F1-score, precision, and recall of each rater against the consensus of the remaining raters are given in panel (b). Means are indicated by the black crosses.

To measure agreement at the object level, we compared each rater’s annotation on images of the study dataset to the consensus of all other raters’ annotations for each phase of the experiment. We formed the consensus by matching the annotations of the remaining raters for each image using a distance-based clustering approach. An annotation was added to a cluster if it was no more than 7.5 micrometers away from the center of the cluster, approximately corresponding to the diameter of a nucleus [1]. A cluster was considered a MF if it contained annotations from at least six raters. Using a higher threshold could result in suboptimal results because raters tend to miss MFs that are difficult to identify [19, 4]. Hence, a higher threshold may lead to the exclusion of MFs that are likely to be overlooked. For the creation of the H&E-only label set of the development dataset, a DL model with high recall was used to overcome this problem. Although this threshold is a hyperparameter that can influence the results, the same trend was observed across different thresholds (see supplementary Figure S1). The agreement between the raters’ annotations and the described consensus was then measured via Dice Similarity Coefficient/F1-score, as proposed in [19], precision, and recall. Additionally, we evaluated the inter-rater reliability of the MC using the intraclass correlation coefficient (ICC). Figure 2 (a) shows the precision and recall values for each rater in the different phases of the study. Each rater demonstrated an increase in either recall, precision, or both when comparing the results of phase P1 to P2. If there was a slight decrease in either precision or recall in phase P2, it was always accompanied by a substantial increase in the other metric. The results over all raters are given in Figure 2 (b). Each metric increased by a large margin from P1 to P2. 
In particular, we found the average F1-score to increase from 0.53 ± 0.11 to 0.74 ± 0.11, the average precision from 0.53 ± 0.20 to 0.78 ± 0.17, and the average recall from 0.67 ± 0.19 to 0.77 ± 0.19. Furthermore, the ICC increased from 0.90 in P1 to 0.99 in P2.
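The leave-one-rater-out evaluation described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the greedy cluster assignment, the running-mean cluster center, the greedy one-to-one matching, and all function and class names are assumptions. Only the 7.5 micrometer radius and the six-rater threshold are taken from the text.

```python
import math
from dataclasses import dataclass, field

MAX_DIST_UM = 7.5  # clustering radius in micrometers (from the text)
MIN_RATERS = 6     # raters required for a consensus MF (from the text)


@dataclass
class Cluster:
    points: list = field(default_factory=list)  # (x_um, y_um) annotations
    raters: set = field(default_factory=set)    # distinct contributing raters

    @property
    def center(self):
        xs, ys = zip(*self.points)
        return (sum(xs) / len(xs), sum(ys) / len(ys))


def build_consensus(annotations, max_dist=MAX_DIST_UM, min_raters=MIN_RATERS):
    """annotations: iterable of (rater_id, x_um, y_um) for one image.

    Assumed greedy scheme: each annotation joins the first cluster whose
    current center lies within max_dist, otherwise it opens a new cluster.
    """
    clusters = []
    for rater, x, y in annotations:
        for cluster in clusters:
            if math.dist((x, y), cluster.center) <= max_dist:
                break
        else:
            cluster = Cluster()
            clusters.append(cluster)
        cluster.points.append((x, y))
        cluster.raters.add(rater)
    return [c.center for c in clusters if len(c.raters) >= min_raters]


def score_rater(rater_points, consensus, max_dist=MAX_DIST_UM):
    """Return (precision, recall, F1) of one rater against consensus centers."""
    unmatched = list(consensus)
    tp = 0
    for point in rater_points:
        hit = next((c for c in unmatched if math.dist(point, c) <= max_dist), None)
        if hit is not None:
            unmatched.remove(hit)  # each consensus MF is matched at most once
            tp += 1
    fp = len(rater_points) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1
```

To score a given rater, `build_consensus` would be called on the annotations of all other raters for an image, and the held-out rater's points would then be passed to `score_rater`.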
Second, we investigated the performance of MF detectors that were trained with H&E-only and with PHH3-assisted labels. To increase the statistical informativeness of the results, we trained each object detector in a five-fold Monte Carlo cross-validation scheme. For this, the development dataset was randomly split five times into 70% training, 15% validation, and 15% test cases. To ensure comparability of the results, all models were trained and tested on the same five splits. Furthermore, we evaluated the resulting models on the study datasets derived from P1 (H&E-only) and P2 (PHH3-assisted) of the human rater experiment. The results of both evaluations are given in Table 1. The AP was used to measure the object detection performance.
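The Monte Carlo cross-validation scheme can be sketched as below. This is an illustrative sketch rather than the authors' code: the function name, the fixed seed, the rounding of split sizes, and the unstratified case-level shuffling are assumptions, since the text does not say whether splits were stratified by tumor type.

```python
import random


def monte_carlo_splits(case_ids, n_splits=5, train_frac=0.70, val_frac=0.15, seed=0):
    """Return n_splits independent random (train, val, test) partitions.

    Unlike k-fold cross-validation, every split is drawn anew from the full
    case list, so test sets of different splits may overlap.
    """
    rng = random.Random(seed)  # fixed seed: all models see the same splits
    cases = sorted(case_ids)
    n_train = round(train_frac * len(cases))
    n_val = round(val_frac * len(cases))
    splits = []
    for _ in range(n_splits):
        shuffled = cases[:]
        rng.shuffle(shuffled)
        train = shuffled[:n_train]
        val = shuffled[n_train:n_train + n_val]
        test = shuffled[n_train + n_val:]  # remaining cases form the test set
        splits.append((train, val, test))
    return splits
```

Reusing the same seed for every model is one simple way to guarantee that all detectors are trained and tested on identical splits. With the 84 cases listed in Section 2.2, this sketch would yield 59 training, 13 validation, and 12 test cases per split.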

Table 1: Results of the FCOS and DI-FCOS models on the test sets of the five-fold cross-validation on the different label sets of the development dataset, and results of the inference of the models on the study dataset. Given are the means and standard deviations of the average precision as a result of cross-validation. We found only a minor impact of the threshold of annotators in the ground truth, see Figure S3.

| Test dataset | Model | Parameters | Trained using HE-only labels, evaluated on HE-only labels | Trained using HE-only labels, evaluated on PHH3-assisted labels | Trained using PHH3-assisted labels, evaluated on HE-only labels | Trained using PHH3-assisted labels, evaluated on PHH3-assisted labels |
|---|---|---|---|---|---|---|
| Development dataset | FCOS (ResNet18) | 19.0 M | 0.70 ± 0.03 | 0.64 ± 0.02 | 0.71 ± 0.03 | 0.68 ± 0.04 |
| Development dataset | FCOS (ResNet101) | 51.0 M | **0.74** ± 0.04 | 0.68 ± 0.05 | 0.71 ± 0.04 | 0.69 ± 0.06 |
| Development dataset | DI-FCOS (ResNet18) | 39.9 M | **0.74** ± 0.04 | 0.73 ± 0.05 | 0.72 ± 0.04 | **0.79** ± 0.05 |
| Study dataset | FCOS (ResNet18) | 19.0 M | 0.64 ± 0.02 | 0.58 ± 0.03 | 0.61 ± 0.04 | 0.61 ± 0.02 |
| Study dataset | FCOS (ResNet101) | 51.0 M | 0.66 ± 0.03 | 0.60 ± 0.02 | 0.62 ± 0.03 | 0.60 ± 0.04 |
| Study dataset | DI-FCOS (ResNet18) | 39.9 M | 0.66 ± 0.04 | **0.72** ± 0.06 | 0.61 ± 0.05 | **0.81** ± 0.05 |

We found that the single-input FCOS models performed better on the H&E-only labels than on the PHH3-assisted labels, regardless of whether they were trained with H&E-only labels or with PHH3-assisted labels. The DI-FCOS models outperformed the FCOS models on the PHH3-assisted label sets of both datasets. On the H&E-only label sets of both datasets, the FCOS model with the larger backbone performed on par with the DI-FCOS model. Overall, the highest performance was observed when the DI-FCOS models were trained and tested on PHH3-assisted labels, with a mean AP of 0.79 ± 0.05 on the development dataset and a mean AP of 0.81 ± 0.05 on the study dataset (see Figure S2 for examples). The performance of these models was much lower when tested on the H&E-only labels (0.72 ± 0.04 and 0.61 ± 0.05, respectively).
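For readers less familiar with the metric, the AP of a detector on one label set can be computed along the following lines. This is a hedged sketch: the text does not state which AP variant was used, so the non-interpolated form below is an assumption, as are the function name and the input format, which presumes that detections have already been matched to ground truth MFs.

```python
def average_precision(detections, n_ground_truth):
    """Non-interpolated average precision.

    detections: list of (confidence, is_true_positive) pairs, where the flag
    comes from a prior matching of detections to ground truth MFs.
    n_ground_truth: total number of ground truth MFs, including missed ones.
    """
    if n_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    ap = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            # Precision at this rank, weighted by the recall step 1 / n_gt.
            ap += (true_positives / rank) / n_ground_truth
    return ap
```

Because the ground truth enters both through the matching flags and through `n_ground_truth`, the same set of detections can receive a different AP under the H&E-only and the PHH3-assisted label sets.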

5 Discussion

This study demonstrates that the inter-rater reliability and agreement of MF annotations substantially improve with the use of co-registered PHH3-stains of the exact same slide as an annotation assistance. However, the performance of the single-input object detectors on the PHH3-assisted labels also confirmed our hypothesis that the PHH3 assistance causes MFs to be included in the data set that are not distinguishable in the H&E alone. Given the only marginal difference between both single input models when tested on the H&E-only labels, it appears that including those MFs for model training has only a minor effect. However, if a PHH3-assisted label set is used to evaluate an MF detector, this may underestimate the performance of the detector. This is likely due to a low recall on cells that are non-identifiable as MFs in H&E. Since the IHC label is dependent on the biological process of cell division, the PHH3-assisted ground truth can be considered a more accurate estimate of the actual biological ground truth. Therefore, while a MF detector trained with PHH3-assisted data may perform less favorably on a H&E-only ground truth, its predictions could be closer to the biological ground truth. The results of our DI-FCOS detectors on the PHH3-assisted labels demonstrate that high performance on a PHH3-assisted label set is possible when information from both modalities is available. Although MFs in the PHH3- and H&E-stained sections may not always align perfectly (see Figure S2), the DI-FCOS results demonstrate robustness against this displacement. Fine registration of patches before inputting them into the detector could potentially enhance the results even further. Considering that similar results were achieved with both label sets in the development dataset, it is reasonable to conclude that PHH3-assisted labeling with only one rater could replace a complex multi-rater approach as described in [4]. 
Hence, future research can investigate how to use its potential for labeling in H&E without any negative effects.

References

  • [1] Aubreville, M., Stathonikos, N., Bertram, C.A., Klopfleisch, R., Ter Hoeve, N., Ciompi, F., Wilm, F., Marzahl, C., Donovan, T.A., Maier, A., et al.: Mitosis domain generalization in histopathology images—the MIDOG challenge. Medical Image Analysis 84, 102699 (2023)
  • [2] Baak, J.P., van Diest, P.J., Voorhorst, F.J., van der Wall, E., Beex, L.V., Vermorken, J.B., Janssen, E.A.: Prospective multicenter validation of the independent prognostic value of the mitotic activity index in lymph node-negative breast cancer patients younger than 55 years. Journal of Clinical Oncology 23, 5993–6001 (2005)
  • [3] Bertram, C.A., Aubreville, M., Donovan, T.A., Bartel, A., Wilm, F., Marzahl, C., Assenmacher, C.A., Becker, K., Bennett, M., Corner, S., et al.: Computer-assisted mitotic count using a deep learning–based algorithm improves interobserver reproducibility and accuracy. Veterinary pathology 59(2), 211–226 (2022)
  • [4] Bertram, C.A., Veta, M., Marzahl, C., Stathonikos, N., Maier, A., Klopfleisch, R., Aubreville, M.: Are pathologist-defined labels reproducible? Comparison of the TUPAC16 mitotic figure dataset with an alternative set of labels. In: Interpretable and Annotation-Efficient Learning for Medical Image Computing. pp. 204–213. Springer (2020)
  • [5] Colman, H., Giannini, C., Huang, L., Gonzalez, J., Hess, K., Bruner, J., Fuller, G., Langford, L., Pelloski, C., Aaron, J., Burger, P., Aldape, K.: Assessment and prognostic significance of mitotic index using the mitosis marker phospho-histone h3 in low and intermediate-grade infiltrating astrocytomas. American Journal of Surgical Pathology 30, 657–664 (5 2006)
  • [6] Duregon, E., Cassenti, A., Pittaro, A., Ventura, L., Senetta, R., Rudà, R., Cassoni, P.: Better see to better agree: Phosphohistone h3 increases interobserver agreement in mitotic count for meningioma grading and imposes new specific thresholds. Neuro-Oncology 17, 663–669 (5 2015)
  • [7] Elston, C.W., Ellis, I.O.: Pathological prognostic factors in breast cancer. i. the value of histological grade in breast cancer: experience from a large study with long‐term follow‐up. Histopathology 19, 403–410 (1991)
  • [8] Kiupel, M., Webster, J., Bailey, K., Best, S., DeLay, J., Detrisac, C., Fitzgerald, S., Gamble, D., Ginn, P., Goldschmidt, M., et al.: Proposal of a 2-tier histologic grading system for canine cutaneous mast cell tumors to more accurately predict biological behavior. Veterinary pathology 48(1), 147–155 (2011)
  • [9] Laflamme, P., Mansoori, B.K., Sazanova, O., Orain, M., Couture, C., Simard, S., Trahan, S., Manem, V., Joubert, P.: Phospho-histone-h3 immunostaining for pulmonary carcinoids: impact on clinical appraisal, interobserver correlation, and diagnostic processing efficiency. Human Pathology 106, 74–81 (12 2020)
  • [10] Louis, D.N., Perry, A., Wesseling, P., Brat, D.J., Cree, I.A., Figarella-Branger, D., Hawkins, C., Ng, H.K., Pfister, S.M., Reifenberger, G., Soffietti, R., Deimling, A.V., Ellison, D.W.: The 2021 WHO classification of tumors of the central nervous system: A summary. Neuro-Oncology 23, 1231–1251 (8 2021)
  • [11] Marzahl, C., Aubreville, M., Bertram, C.A., Maier, J., Bergler, C., Kröger, C., Voigt, J., Breininger, K., Klopfleisch, R., Maier, A.: EXACT: a collaboration toolset for algorithm-aided annotation of images with annotation version control. Scientific reports 11(1), 4343 (2021)
  • [12] Marzahl, C., Wilm, F., Tharun, L., Perner, S., Bertram, C.A., Kröger, C., Voigt, J., Klopfleisch, R., Maier, A., Aubreville, M., et al.: Robust quad-tree based registration on whole slide images. In: MICCAI Workshop on Computational Pathology. pp. 181–190. PMLR (2021)
  • [13] Meuten, D.J., Moore, F.M., George, J.W.: Mitotic count and the field of view area: Time to standardize. Veterinary Pathology 53,  7–9 (1 2016)
  • [14] Peña, L., Andrés, P.D., Clemente, M., Cuesta, P., Pérez-Alenza, M.: Prognostic value of histological grading in noninflammatory canine mammary carcinomas in a prospective study with two-year follow-up: relationship with clinical and histological characteristics. Veterinary Pathology 50(1), 94–105 (2013)
  • [15] Qingyun, F., Zhaokui, W.: Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery. Pattern Recognition 130, 108786 (2022)
  • [16] Skaland, I., Janssen, E.A., Gudlaugsson, E., Klos, J., Kjellevold, K.H., Søiland, H., Baak, J.P.: Validating the prognostic value of proliferation measured by phosphohistone h3 (pph3) in invasive lymph node-negative breast cancer patients less than 71 years of age. Breast Cancer Research and Treatment 114, 39–45 (3 2009)
  • [17] Tellez, D., Balkenhol, M., Otte-Höller, I., van de Loo, R., Vogels, R., Bult, P., Wauters, C., Vreuls, W., Mol, S., Karssemeijer, N., et al.: Whole-slide mitosis detection in H&E breast histology using PHH3 as a reference to train distilled stain-invariant convolutional networks. IEEE transactions on medical imaging 37(9), 2126–2136 (2018)
  • [18] Tian, Z., Shen, C., Chen, H., He, T.: FCOS: Fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9627–9636 (2019)
  • [19] Veta, M., Diest, P.J.V., Jiwa, M., Al-Janabi, S., Pluim, J.P.: Mitosis counting in breast cancer: Object-level interobserver agreement and comparison to an automatic method. PLoS ONE 11 (8 2016)
  • [20] Voss, S.M., Riley, M.P., Lokhandwala, P.M., Wang, M., Yang, Z.: Mitotic count by phosphohistone h3 immunohistochemical staining predicts survival and improves interobserver reproducibility in well-differentiated neuroendocrine tumors of the pancreas. The American journal of surgical pathology 39(1), 13–24 (2015)
  • [21] Wilm, F., Bertram, C.A., Marzahl, C., Bartel, A., Donovan, T.A., Assenmacher, C.A., Becker, K., Bennett, M., Corner, S., Cossic, B., et al.: Influence of inter-annotator variability on automatic mitotic figure assessment. In: Bildverarbeitung für die Medizin 2021: Proceedings, German Workshop on Medical Image Computing, Regensburg, March 7-9, 2021. pp. 241–246. Springer (2021)