\addbibresource{medbibliography.bib}
\addbibresource{MLbibliography.bib}

AI-based Anomaly Detection for Clinical-Grade Histopathological Diagnostics

Jonas Dippel: Machine Learning Group, Technische Universität Berlin, Berlin, Germany. BIFOLD – Berlin Institute for the Foundations of Learning and Data, Berlin, Germany.
Niklas Prenißl: Institute of Pathology, Charité – Universitätsmedizin Berlin, Berlin, Germany. Berlin Institute of Health at Charité – Universitätsmedizin Berlin, BIH Biomedical Innovation Academy, BIH Charité Junior Digital Clinician Scientist Program, Charitéplatz 1, 10117 Berlin, Germany.
Julius Hense: Machine Learning Group, Technische Universität Berlin, Berlin, Germany. BIFOLD – Berlin Institute for the Foundations of Learning and Data, Berlin, Germany.
Philipp Liznerski: RPTU, Kaiserslautern, Germany.
Tobias Winterhoff: Aignostics GmbH, Berlin, Germany.
Simon Schallenberg: Institute of Pathology, Charité – Universitätsmedizin Berlin, Berlin, Germany.
Marius Kloft: RPTU, Kaiserslautern, Germany.
Oliver Buchstab: Institute of Pathology, Ludwig-Maximilians-Universität, Munich, Germany.
David Horst: Institute of Pathology, Charité – Universitätsmedizin Berlin, Berlin, Germany. German Cancer Research Center (DKFZ) & German Cancer Consortium (DKTK), Munich Partner Site.
Maximilian Alber: Institute of Pathology, Charité – Universitätsmedizin Berlin, Berlin, Germany. Aignostics GmbH, Berlin, Germany.
Lukas Ruff: Aignostics GmbH, Berlin, Germany.
Klaus-Robert Müller: Machine Learning Group, Technische Universität Berlin, Berlin, Germany. BIFOLD – Berlin Institute for the Foundations of Learning and Data, Berlin, Germany. Department of Artificial Intelligence, Korea University, Seoul, Korea. Max-Planck Institute for Informatics, Saarbrücken, Germany.
Frederick Klauschen: BIFOLD – Berlin Institute for the Foundations of Learning and Data, Berlin, Germany. Institute of Pathology, Charité – Universitätsmedizin Berlin, Berlin, Germany. Institute of Pathology, Ludwig-Maximilians-Universität, Munich, Germany. German Cancer Research Center (DKFZ) & German Cancer Consortium (DKTK), Munich Partner Site.

Abstract

While previous studies have demonstrated the potential of AI to diagnose diseases in imaging data, clinical implementation is still lagging behind. This is partly because AI models require training with large numbers of examples, which are only available for common diseases. In clinical reality, however, only a few diseases are common, whereas the majority of diseases are less frequent (long-tail distribution). Current AI models overlook or misclassify these diseases. We propose a deep anomaly detection approach that requires training data only from common diseases yet also detects all less frequent diseases. We collected two large real-world datasets of gastrointestinal biopsies, which are prototypical of the problem. Herein, the ten most common findings account for approximately 90% of cases, whereas the remaining 10% contain 56 disease entities, including many cancers. In total, 17 million histological images from 5,423 cases were used for training and evaluation. Without any specific training for the diseases, our best-performing model reliably detected a broad spectrum of infrequent (“anomalous”) pathologies with 95.0% (stomach) and 91.0% (colon) AUROC and generalized across scanners and hospitals. By design, the proposed anomaly detection can be expected to detect any pathological alteration in the diagnostic tail of gastrointestinal biopsies, including rare primary or metastatic cancers. This study establishes the first effective clinical application of AI-based anomaly detection in histopathology that can flag anomalous cases, facilitate case prioritization, reduce missed diagnoses and enhance the general safety of AI models, thereby driving AI adoption and automation in routine diagnostics and beyond.

1 Introduction

Diagnostic pathology is facing serious challenges due to a shortage of pathologists in many parts of the world and too few young doctors entering the profession \parencitepathologists-gap. Meanwhile, the diagnostic workload, and cancer burden in particular, is rising with an aging population \parencitecancercases. Moreover, diagnostic procedures are getting more complex due to the demands of precision medicine. Studies have reported significant diagnostic error rates ranging from 0.1% to 10% of cases, depending on methodology and case selection \parencitediagnosticerror1, diagnosticerror2, and these rates are at risk of increasing further as time pressure grows.

Artificial intelligence (AI) is often proposed as a solution to these challenges \parencitehisto-xai-review, stenzinger2022artificial,van2021deep. Seminal studies have shown that deep learning-based approaches can classify common diseases \parenciteByeon2022,Steinbuss2020,campanella2019clinical,strom2020prostate, identify tumor origin \parencitelu2021origin, prognosticate patient outcome \parencitecourtiol2019mesothelioma,chen2022multimodal, quantify biomarkers, and even predict certain mutations from H&E images \parencitebinder-morphological, arslan2024pancancer, kather2020pancancer, demonstrating the great potential of AI for histopathology. However, all these approaches follow the paradigm of supervised learning, that is, they require the to-be-recognized pathological patterns to be present and labeled in the training data. As a result, existing AI solutions are limited to common diagnoses only, for which sufficiently large amounts of training examples are available.

In clinical practice, however, only a few diagnoses are common and the vast majority of diseases are relatively rare, reflected by a long-tail distribution of diseases, exemplified for colon and gastric routine biopsies in Figure 1. Critically, the challenge for pathologists is to reliably detect and diagnose all those infrequent diseases among the common (and therefore easy-to-diagnose) cases. Current AI models fail to support this crucial aspect, as it is often practically impossible to gather sufficient training data for the long tail of infrequent conditions. As a result, existing classifiers tend to produce false predictions on uncommon differential diagnoses or miss them completely \parencitevan2021deep, evanserrorspath. This issue is mostly ignored in the literature, where allegedly high performances are only reported on curated datasets of common findings \parenciteuncertainty_kompa, uncertainty_ovadia, evanserrorspath. In consequence, human confirmation is currently required for every slide subjected to AI analysis, as less common pathologies must always be expected in routine diagnostics. We believe that this shortcoming is a major obstacle to the adoption of AI in histopathological diagnostics.

In this study, we address this fundamental problem and examine whether we can detect infrequent findings in histopathological images using AI-based anomaly detection (AD) \parenciteRuffRev. In contrast to supervised learning, AD follows the paradigm that certain data inputs are too infrequent to be sufficiently represented during model training. Instead of trying to learn insufficiently represented patterns, AD methods aim to very precisely characterize the frequent findings, which in our setting includes normal cases and common pathologies that can be learned by supervised methods. Samples deviating from the learned common characteristics are consequently deemed “anomalies.” Since only frequent findings are used for AD model training, there is no need for extensive data collection or annotation gathering of rarer instances from the tail of the disease distribution.

We propose different modern AD methods for histopathology and apply them to whole slide images (WSI) of gastrointestinal (GI) biopsies, which arguably pose the most frequent diagnostic question in histopathology. Here, most cases belong to one of ten common diagnoses while the remaining patients suffer from one of the many rarer diseases in the long tail (Figure 1), making this a particularly relevant use case. The proposed AI-AD can serve as a clinical AI assistant that flags critical cases requiring the pathologists’ particular attention during routine diagnostics. Further, it may enhance the safety profile of supervised AI models and drive AI adoption and automation.

Figure 1: Diseases in GI biopsies. Bar plot showing the frequency distribution of diagnoses in colon and stomach biopsies in the Charité cohort. Frequent findings are highlighted in green and represent the common or “normal” cases (90/91% of all cases for stomach/colon). The distribution has a long tail of infrequent/rare diagnoses or “anomalies” (red), which the AI-based AD approach aims to detect (NFS = not further specified, MINEN = mixed neuroendocrine-nonneuroendocrine neoplasm).

2 Methods

2.1 Datasets

We continuously digitized H&E-stained slides of gastric and colon specimens from routine diagnostics at the Charité university hospital (years 2020/2021) and translated the diagnoses from the clinical reports into SNOMED-CT codes. 1,973 slides with frequent findings were included in our Charité dataset (stomach = 961, colon = 1,012) along with 200 slides presenting anomalous findings (126 from years 2020/2021, 74 from the archives of years 2000–2021). All slides were scanned with a 3DHistech P1000 scanner. Resulting dataset characteristics are presented in Appendix B.1. In total, 65 distinct diagnoses are represented.

We created detailed annotations for all anomalous slides for evaluation purposes. The annotations delineated (i) regions of interest containing the actual diagnosis-defining anomalies, (ii) non-diagnosis-defining anomalous regions such as tumor-adjacent inflammation, and (iii) artifacts such as pen marks or blurry areas.

Additionally, we retrieved 2,901 slides from various tissue types other than colon and stomach as auxiliary training data for the OE approach (see below). Slides were taken from routine diagnostics at the Charité pathology department, including healthy and diseased tissue from different organs (details in Appendix B.1).

To demonstrate the ability of our methodology to generalize across hospitals and scanners, we collected an independent dataset with 192 gastric and 157 colon slides from the Institute of Pathology, LMU Munich (more details in Appendix B.5).

2.2 Evaluation scheme and training data

We report the mean and standard deviation from a 5-fold cross-validation evaluation. As cellular changes in low-grade and high-grade adenomas exist on a continuum and the detection of high-grade changes is crucial, we excluded low-grade adenomas from the training data (further details in Appendix C.7).
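The cross-validation summary can be illustrated with a short sketch. It assumes that each fold holds out one fifth of the normal slides and that the reported spread is the sample standard deviation of the per-fold metric; the helper names `k_fold_splits` and `summarize`, the slide identifiers, and the example AUROC values are hypothetical and not taken from the study.

```python
import statistics


def k_fold_splits(items, k=5):
    """Yield (train, held_out) pairs; every item is held out exactly once."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to folds
    for i, held_out in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, held_out


def summarize(fold_metrics):
    """Return (mean, sample standard deviation) of the per-fold metric values."""
    return statistics.mean(fold_metrics), statistics.stdev(fold_metrics)


# Hypothetical slide identifiers and per-fold AUROC values, for illustration only.
slides = [f"slide_{i:02d}" for i in range(10)]
splits = list(k_fold_splits(slides, k=5))
mean_auroc, std_auroc = summarize([0.94, 0.95, 0.96, 0.95, 0.95])
```

Whether the population or the sample standard deviation is meant is not stated in the text; the sketch uses the sample version.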

All AD models operate on $340\times 340$ pixel patches from the WSI. To aggregate the patch scores to the slide level, we selected the 10% of patches with the highest anomaly scores and computed the mean of their anomaly scores.
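The slide-level aggregation described above could be sketched as follows. This is a minimal illustration under stated assumptions: the function name `slide_anomaly_score` is hypothetical, and rounding the number of selected patches up (with at least one patch kept) is a choice of this sketch that the text does not specify.

```python
import math


def slide_anomaly_score(patch_scores, top_fraction=0.10):
    """Mean of the highest-scoring `top_fraction` of patch anomaly scores."""
    if not patch_scores:
        raise ValueError("a slide needs at least one patch score")
    # Keep at least one patch; rounding up is an assumption of this sketch.
    k = max(1, math.ceil(top_fraction * len(patch_scores)))
    top = sorted(patch_scores, reverse=True)[:k]
    return sum(top) / k


# Hypothetical patch scores for one slide with 20 patches.
scores = [0.1] * 18 + [0.9, 0.7]
slide_score = slide_anomaly_score(scores)  # mean of the two highest scores
```

Averaging only the top-scoring patches keeps a small anomalous region from being diluted by the large amount of unremarkable tissue on the same slide.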

2.3 Deep anomaly detection

Deep AD methods learn meaningful feature maps from high-dimensional data via deep learning to distinguish “normal” from “anomalous” patterns in the learned feature space \parenciteRuffRev. Since anomalous data are naturally infrequent and hard to obtain, the models are usually trained with normal data only (unsupervised) or a few additional real anomalies (semi-supervised) \parenciteRuffRev. For this study, we restricted our methods to the unsupervised setting. We compared two deep AD paradigms: self-supervised learning and OE.

Deep AD with self-supervised learning

The idea of self-supervised learning-based AD is to train a deep neural network on an auxiliary task such as contrasting semantically matching vs. different samples in feature space \parencitetack2020. A trained model compresses high-dimensional images into a low-dimensional feature space, where dissimilarities can be measured and thus anomalies detected. We assessed two variants: (i) the feature extraction model CTransPath \parencitectranspath, trained on 32,220 H&E-stained diagnostic slides comprising 32 cancer subtypes from TCGA and PAIP, and (ii) the CTransPath model fine-tuned on the frequent finding patches with a deep one-class classification (OCC) loss \parencitereiss2021panda. In each setting, we created separate models for gastric and colon data, respectively. Further approaches, which showed inferior performance, are described in Appendix C.5. To determine patch anomaly scores in feature space, we applied a modified $k$-nearest neighbors (kNN) algorithm.
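To make the feature-space scoring concrete, the following sketch shows a plain kNN anomaly score. The modification applied in this study is not detailed in this section, so this is the textbook variant only; the function name, the choice of Euclidean distance, the value of `k`, and the toy two-dimensional features are all assumptions for illustration.

```python
import math


def knn_anomaly_score(query, normal_features, k=3):
    """Mean Euclidean distance from `query` to its k nearest normal features."""
    distances = sorted(math.dist(query, ref) for ref in normal_features)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Hypothetical 2-D features; real patch embeddings have many more dimensions.
normal_bank = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
typical_patch = (0.05, 0.05)
unusual_patch = (3.0, 4.0)

typical_score = knn_anomaly_score(typical_patch, normal_bank)
unusual_score = knn_anomaly_score(unusual_patch, normal_bank)
```

A patch whose embedding lies far from all stored normal embeddings receives a large score, which is the behavior the paragraph above relies on.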

Deep AD with OE

The idea of OE is to collect vast amounts of informative auxiliary data that—unlike true anomalous data—are easy to obtain in large numbers \parencitehendrycks2019deep. A classifier is trained to distinguish the common data from this auxiliary data. If the auxiliary data deviates from the common data distribution while still retaining a substantial degree of similarity, the model is able to learn the specific characteristics of the common class, enabling it to detect true anomalies \parencitehendrycks2019deep, liznerski2022exposing.

To adapt OE to histopathological images, we defined all frequent finding patches of one tissue type (stomach or colon) as normal. Patches from other collected tissue types (prostate, kidney, liver, etc.) from our separate OE dataset and normal patches from the respective other tissue type (colon or stomach) are auxiliary anomalies. We hypothesized that patches from a range of other tissue types are a close proxy for potential anomalies, while being sufficiently similar to the normal patches of the target tissue type. We trained a deep neural network to distinguish normal patches and these auxiliary anomalies, employing a binary cross-entropy loss \parenciteliznerski2022exposing. The model learns a compact decision boundary around patches of the frequent findings (Figure 2, TRAINING). After training, we computed the anomaly score of a patch as the probability the model assigns to the anomaly class (Figure 2, INFERENCE). We also fine-tuned the CTransPath model on the OE task, which showed similar performance to a randomly initialized ResNet-18 model (detailed ablation in Appendix D.4).
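The training signal and the inference-time score described above can be sketched in a few lines. The sketch assumes the network outputs one logit per patch, with label 1 for auxiliary anomalies and 0 for normal patches; the function names and example logits are hypothetical, and the network itself is omitted.

```python
import math


def sigmoid(logit):
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def bce_loss(logits, labels, eps=1e-12):
    """Mean binary cross-entropy; label 1 = auxiliary anomaly, 0 = normal."""
    total = 0.0
    for logit, y in zip(logits, labels):
        p = min(max(sigmoid(logit), eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def anomaly_score(logit):
    """Probability assigned to the anomaly class."""
    return sigmoid(logit)


# Hypothetical logits for two normal patches and two auxiliary-anomaly patches.
example_logits = [-2.0, -1.0, 1.5, 3.0]
example_labels = [0, 0, 1, 1]
example_loss = bce_loss(example_logits, example_labels)
```

At inference, no labels are needed: `anomaly_score` is applied to the logit of each patch, and true anomalies are expected to fall on the auxiliary-anomaly side of the learned boundary.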

3 Results

For our study, we collected and digitized 5,423 tissue slides at two hospitals (Charité, Berlin and University Hospital of Ludwig-Maximilians-University, Munich) resulting in 17M histological images for training and evaluation purposes. The Charité cohort, which we used for training and primary validation, includes 2,173 GI biopsies showcasing 65 distinct diagnoses and 200 cases of anomalous findings (as specified in Figure 1). Additionally, we retrieved 2,901 slides from various tissue types other than colon and stomach as auxiliary training data, which we use in the OE approach. Dataset statistics are reported in Appendix B.1.

We applied and extended modern AD methodologies \parenciteRuffRev, namely self-supervised AD \parencitetack2020 (with CTransPath \parencitectranspath and OCC \parencitereiss2021panda) and OE \parencitehendrycks2019deep, liznerski2022exposing, which have recently drastically reduced error rates on natural image benchmarks but had not yet been adapted to diagnostic pathology. Herein, we focused on the clinically relevant use case of detecting all diagnoses in the diagnostic long tail. For training, we exclusively exposed the AD models to patches of frequent findings (as well as auxiliary slides for the OE approach). Importantly, infrequent findings (i.e., the anomalies) were not shown during training.

AD performance results, depicting how accurately infrequent diseases were detected, are presented in Table 1. Among the self-supervised methods, the kNN variant achieved the best slide-AUROC scores of 94.95% (stomach) and 89.76% (colon), and the OCC variant achieved the best patch-AUROC scores of 89.73% (stomach) and 87.03% (colon). With the OE-based method, we attained higher AUROC scores than with the self-supervised methods, with slide-AUROC scores of 95.04% (stomach) and 91.01% (colon) and patch-AUROC scores of 91.37% (stomach) and 90.47% (colon). These results demonstrate that deep anomaly detection can reliably detect long-tail diseases in histopathological slides.

Table 1: Performance of the three proposed anomaly detection methods on the Charité cohort. Slide-AUROC measures slide separability based on the aggregated slide anomaly scores and respective slide diagnosis labels. Patch-AUROC measures the separability of individual patches based on the patch anomaly scores and ground-truth labels provided by pathologist annotations. Reported results are mean and standard deviation based on a 5-fold cross-validation of the normal training data, expressed as percentages.

\begin{tabular}{llcccc}
\hline
 & & \multicolumn{2}{c}{Stomach} & \multicolumn{2}{c}{Colon} \\
Model & Diagnosis Group & slide-AUROC & patch-AUROC & slide-AUROC & patch-AUROC \\
\hline
Self-supervision w/ kNN & All & $94.95\pm 1.16$ & $87.21\pm 0.36$ & $89.76\pm 0.77$ & $85.09\pm 0.63$ \\
 & Neoplastic, malignant & $95.23\pm 1.0$ & $87.03\pm 0.47$ & $97.48\pm 0.47$ & $90.99\pm 0.64$ \\
 & Neoplastic, other & $94.95\pm 1.39$ & $92.04\pm 0.25$ & $95.36\pm 0.46$ & $88.9\pm 1.72$ \\
 & Inflammation & $91.84\pm 1.9$ & $87.86\pm 0.59$ & $90.03\pm 1.16$ & $84.46\pm 0.71$ \\
 & Other & $98.37\pm 0.74$ & $92.78\pm 0.25$ & $51.83\pm 2.47$ & $44.57\pm 1.96$ \\
\hline
Self-supervision w/ OCC & All & $93.76\pm 1.39$ & $89.73\pm 0.47$ & $88.51\pm 0.69$ & $87.03\pm 0.49$ \\
 & Neoplastic, malignant & $95.24\pm 1.31$ & $91.01\pm 0.59$ & $96.12\pm 0.8$ & $92.35\pm 0.6$ \\
 & Neoplastic, other & $90.28\pm 1.66$ & $92.19\pm 0.37$ & $93.64\pm 0.62$ & $91.61\pm 0.76$ \\
 & Inflammation & $91.11\pm 1.55$ & $92.96\pm 0.53$ & $89.94\pm 1.04$ & $86.73\pm 0.51$ \\
 & Other & $96.95\pm 1.07$ & $92.17\pm 0.51$ & $43.93\pm 2.16$ & $44.85\pm 1.52$ \\
\hline
Outlier Exposure & All & $95.04\pm 0.54$ & $91.37\pm 0.34$ & $91.01\pm 0.69$ & $90.47\pm 0.33$ \\
 & Neoplastic, malignant & $97.72\pm 0.44$ & $95.02\pm 0.28$ & $96.97\pm 0.61$ & $96.23\pm 0.27$ \\
 & Neoplastic, other & $88.45\pm 0.82$ & $90.51\pm 0.48$ & $95.72\pm 0.91$ & $94.17\pm 0.38$ \\
 & Inflammation & $93.4\pm 1.02$ & $95.75\pm 0.34$ & $94.42\pm 1.07$ & $90.24\pm 0.41$ \\
 & Other & $95.61\pm 0.3$ & $92.44\pm 0.67$ & $40.41\pm 1.86$ & $37.25\pm 0.86$ \\
\hline
\end{tabular}
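As a reading aid for the AUROC values in Table 1, the metric can be computed directly from two groups of anomaly scores: it is the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one. The sketch below uses this pairwise formulation, with ties counted as half; the function name and example scores are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def auroc(normal_scores, anomalous_scores):
    """Probability that a random anomaly outscores a random normal sample.

    Ties count as half a win (the Mann-Whitney U formulation of AUROC).
    """
    if not normal_scores or not anomalous_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(normal_scores) * len(anomalous_scores))


# Hypothetical slide anomaly scores: 5 of the 6 anomaly/normal pairs are ordered correctly.
example_auroc = auroc([0.1, 0.2, 0.4], [0.3, 0.8])
```

Under this reading, a value of 50% corresponds to chance-level separation, and values below 50% indicate that the anomalous group tends to score lower than the normal group.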

3.1 AI-AD detects diverse pathological patterns

In contrast to previous work, we aimed to detect all histological tissue changes beyond common pathologies. This includes neoplasms, inflammation, infections, as well as other tissue changes like calcinosis, xanthoma, or pancreatic heterotopia. Our approach addresses the real-world clinical setting, in which any type of pathological change has to be expected at any time.

We were largely successful in this task, as almost all diseases from various diagnostic groups resulted in considerably elevated anomaly scores (Figure 3). Importantly, malignant tumors of very different morphology and histogenesis, such as carcinomas, neuroendocrine tumors, lymphomas, metastatic melanomas, or sarcomas, were reliably assigned high anomaly scores. In fact, of all diagnostic groups, slide-AUROCs for malignancies were highest with $97.72\%$ for stomach and $96.97\%$ for colon, respectively. This is crucial, as detecting malignancy is the most consequential task in histopathological diagnostics. Infrequent benign and precancerous neoplastic changes were also reliably detected (slide-AUROC $88.45\%$ for stomach, $95.72\%$ for colon). Additionally, the AD model effectively recognized inflammation of the colon (slide-AUROC of $94.42\%$; for stomach most types of inflammation are frequent and therefore non-anomalous).

Pathological alterations that were not yet recognized by the AD model are pseudomelanosis coli, a harmless brown discoloration of colonic mucosa, as well as intestinal spirochaetosis, a bacterial infection presenting with a thin fuzzy line of bacteria on the surface of colon epithelium.

Importantly, on the Charité cohort, setting a conservative detection threshold on the slide anomaly scores to achieve 100% (99%, 95%) anomaly sensitivity already allows 36.2% (51.91%, 72.93%) of stomach and 4.21% (4.99%, 38.2%) of colon cases to be reliably predicted as not abnormal. When anomalous cases of pseudomelanosis coli and intestinal spirochaetosis are excluded, which are either of no clinical significance (pseudomelanosis) or usually require special stains even for trained pathologists (spirochaetosis), the colon numbers increase to 22.29% (49.46%, 85.22%). These results demonstrate the potential of the presented AD methodology for pathologist time savings and safe automation in diagnostics.
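The threshold analysis can be illustrated as follows. The sketch assumes that a slide is flagged when its anomaly score is at or above the threshold and that the threshold is chosen as the highest value that still flags the required share of anomalous slides; the function names and the toy scores are hypothetical.

```python
import math


def threshold_for_sensitivity(anomalous_scores, sensitivity=1.0):
    """Largest threshold that still flags at least `sensitivity` of anomalies.

    A slide is flagged when its score is >= the returned threshold.
    """
    ranked = sorted(anomalous_scores, reverse=True)
    # Number of anomalies that must be flagged, rounded up; the small offset
    # guards against floating-point noise in the product.
    needed = max(1, math.ceil(sensitivity * len(ranked) - 1e-9))
    return ranked[needed - 1]


def fraction_ruled_out(normal_scores, threshold):
    """Share of normal slides scoring below the threshold (not flagged)."""
    return sum(s < threshold for s in normal_scores) / len(normal_scores)


# Hypothetical slide scores, for illustration only.
anomalies = [0.9, 0.8, 0.6, 0.35]
normals = [0.1, 0.2, 0.3, 0.4, 0.5]
t_full = threshold_for_sensitivity(anomalies, sensitivity=1.0)
ruled_out_full = fraction_ruled_out(normals, t_full)
```

Lowering the required sensitivity raises the threshold, so more normal slides fall below it; this is the trade-off reflected in the three percentages quoted per organ.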

3.2 Heatmaps enable visual feedback and interpretation of AI-AD

Interpretability of AI predictions is crucial in a medical setting \parencitehisto-xai-review, bach2015pixel, samek2021explaining, and it is important to guide experts to the anomalous patterns. Furthermore, it is important to verify that the model’s predictions are not based on shortcut features, e.g. tissue artifacts (so-called “Clever Hans” effects \parencitelapuschkin2019unmasking, kauffmann2020clever).

Our patch-based approach allowed us to create heatmaps that highlight regions determined anomalous by our model. Exemplary heatmaps, which showcase the reliable detection performances for a broad range of anomalous malignant and benign findings, are shown in Figure 4. Additional heatmaps of complete tissue cuts along with pathologists’ annotations are provided in Appendix C.4.
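A heatmap of this kind can be assembled by placing each patch score at its grid position on the slide. The sketch below assumes that patches lie on a regular grid addressed by (row, column) indices and that positions without tissue are left empty; the function name `build_heatmap` and the example values are hypothetical, and color mapping and overlay on the slide image are omitted.

```python
def build_heatmap(patch_scores, fill=None):
    """Arrange patch scores into a 2-D grid indexed by (row, column).

    `patch_scores` maps (row, col) grid positions to anomaly scores; positions
    without tissue stay at `fill`.
    """
    if not patch_scores:
        return []
    n_rows = max(r for r, _ in patch_scores) + 1
    n_cols = max(c for _, c in patch_scores) + 1
    grid = [[fill] * n_cols for _ in range(n_rows)]
    for (r, c), score in patch_scores.items():
        grid[r][c] = score
    return grid


# Hypothetical scores for three tissue patches on a 2 x 2 grid.
heatmap = build_heatmap({(0, 0): 0.1, (0, 1): 0.2, (1, 1): 0.9})
```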

Interestingly, tissue artifacts were often not or only partly highlighted in heatmaps, which is in line with a lower patch-AUROC for artifacts than for anomalies (Appendix C.3). This is important, as such artifacts can be common among slides and should therefore not result in markedly enhanced slide anomaly scores or uninformative heatmaps.

3.3 AI-AD generalizes across hospitals and scanners

Research indicates that the performance of AI models can drastically deteriorate with changes in input data characteristics \parencitehendrycks2021many. In histopathology, variations due to different lab staining protocols or differences in scanner equipment are known critical factors \parencitecampanella2019clinical. To test the generalization performance of our trained models, we evaluated their performance on an independent cohort from the Institute of Pathology at Ludwig-Maximilians-University (LMU) Munich, where slides were digitized with a different scanner type. We collected cases with frequent findings (63 for colon, 164 for stomach) as well as cases presenting anomalous findings (94 for colon, 28 for stomach), resulting in a total of more than 500k image patches for generalization validation purposes (Appendix B.1).

Evaluating model performance without re-training the models on the new data yielded competitive slide-AUROCs of 94.5% (stomach) and 85.88% (colon) for our previously best-performing approach (detailed results in Appendix C.1). Focusing the validation analysis on the clinically most relevant anomalies, i.e., malignant pathologies, we reached slide-AUROCs of $94.77\%$ for stomach and $95.02\%$ for colon.

4 Discussion

The long-tail disease distribution encountered in clinical reality (only a few diseases are common, most diseases are less common) poses a significant challenge for the implementation of AI in medical diagnostics. It is often impossible to accurately represent all diseases in the long diagnostic tail during model training and, in most cases, not even attempted. This causes critical diagnostic errors by AI models, compromising security and clinical usability \parencitevan2021deep, evanserrorspath. The AD approach we developed for histopathology addresses this critical shortcoming, as it does not depend on training data for the long-tail diseases. We were able to show high detection performances within GI biopsies and demonstrated generalization across labs, staining patterns, and scanner characteristics.

First promising adaptations of AD have been presented in dermatology and radiology \parencitedermatology,radiology. However, AD has been largely unexplored for histopathology so far. In previous publications, the focus has been on the detection of single pathologies defined as anomalous, as for example breast cancer metastases in lymph nodes \parencitepocevivciute2021unsupervised,shvetsova2021anomaly,stepec2021unsupervised, linmans2024diffusion. In contrast, our dataset with a large number of different infrequent findings reflects the clinical reality, and thereby stands out from the related AD work in histopathology. Additionally, there are methodological limitations, as previous work is predominantly not based on the recent advances of OE or self-supervised learning, with the exception of \parenciteZINGMANpublished who consider a variant of the latter for the detection of non-alcoholic fatty liver disease in mouse liver tissue. Rather, most previous works use AD models such as generative adversarial networks \parencitepocevivciute2021unsupervised, stepec2021unsupervised,zehnder2022multiscale, autoencoders \parenciteshvetsova2021anomaly, or flow-based models \parencitepawlowski2021abnormality, which have been found less effective \parenciteRuffRev.

Our results further provide interesting insights into state-of-the-art AD methods on histopathological data: (1) we show that the right trade-off between similarity and diversity of the OE data is crucial for generalization, (2) only about 100 slides of common findings are sufficient for characterization of the normal training data, and (3) color augmentations and stain normalization are critical for generalization to different scanners and hospitals. We present respective ablation studies for these insights in Appendix D.

There are certain limitations to our AI-AD approach, particularly with respect to the detection of extremely subtle tissue changes in colon biopsies. These include pseudomelanosis coli, collagenous/lymphocytic colitis, and intestinal spirochaetosis, which are difficult to detect even for the trained expert and, in the case of collagenous/lymphocytic colitis and spirochaetosis, require additional special stains. While lower anomaly scores are therefore not surprising in such cases, additional strategies will need to be developed in the future to improve model performance further. The implementation of semi-supervised learning methodologies may hold significant potential in this aspect \parenciteruff2020dsad. As some pathologies, such as architectural changes in colon mucosa, are hardly detectable on individual patches of the currently used size – even for pathologists – multi-scale approaches with larger tissue context should be explored.

Our AD model used as a stand-alone clinical AI assistant has the potential to substantially improve both diagnostic efficiency and quality by reducing the number of missed diagnoses through identifying “suspicious” cases and highlighting anomalies in histological slides. Critically, because of its design, it can be expected to reliably detect any kind of primary or metastatic cancer in stomach/colon samples even beyond the entities we evaluated it on. To our knowledge, no other published AI tool is capable of this in a zero-shot manner, even across other tissues. An integrated approach of AD and supervised detection of common findings (for GI samples, e.g., \parenciteByeon2022, Steinbuss2020) could in the future enhance the safety profile of supervised models and even lead to an overall safe automatic processing of samples. Our results indicate that with the current performance, already up to a third of biopsies with frequent findings could be automatically diagnosed without the risk of missing any less frequent and potentially severe diseases. This fraction can be expected to grow with future model improvements and may ultimately leave only a subset of cases requiring manual review, which could drastically reduce pathologists’ workload and pave the way for largely automated and safe AI-based histopathological diagnostics.

Acknowledgements

This work was partly funded by the German Ministry for Education and Research (under refs 01IS14013A-E, 01GQ1115, 01GQ0850, 01IS18056A, 01IS18025A and 01IS18037A) and BBDC/BZML and BIFOLD. NP is a participant in the BIH Charité Junior Digital Clinician Scientist Program funded by the Charité – Universitätsmedizin Berlin, and the Berlin Institute of Health at Charité (BIH). KRM was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea Government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program, Korea University and No. 2022-0-00984, Development of Artificial Intelligence Technology for Personalized Plug-and-Play Explanation and Verification of Explanation). Part of this work was conducted within the DFG research unit FOR 5359 (KL 2698/6-1 and KL 2698/7-1). MK acknowledges support by the Carl-Zeiss Foundation, the DFG awards KL 2698/2-1 and KL 2698/5-1, and the BMBF awards 03|B0770E and 01|S21010C.

Ethics approval

The current project has been considered by the local ethics committees in Berlin and Munich under reference numbers EA1/125/23 and 23-0334, respectively.

Author contributions

J.D., N.P., P.L., L.R., K.-R.M., F.K. conceptualized the project. N.P., F.K., O.B., L.R. were responsible for the data collection. N.P., F.K. performed data curation, data annotation and data analysis. J.D., J.H., P.L. carried out data analysis and model construction; J.D., J.H., N.P., P.L. performed model validation and data visualization. T.W. helped with dataset building. N.P., J.D., J.H., P.L., L.R., K.-R.M., F.K. wrote the first draft, which was reviewed and edited by T.W., S.S., M.K., O.B., D.H., M.A.; K.-R.M., F.K. and L.R. supervised the project. Funding was secured by N.P., K.-R.M., M.K., and F.K. Correspondence to L.R., K.-R.M., F.K.

\printbibliography

Appendix A Related work

This section reviews related work in more detail and provides a thorough introduction to the latest deep anomaly detection methods and their applications in histopathology.

A.1 Deep Anomaly Detection (AD)

In contrast to classical approaches to AD that are known to perform poorly on high-dimensional data such as images \parenciteRuffRev, modern AD methods employ deep neural networks that scale well with higher dimensions \parenciteruff2018deep,pang2021,RuffRev. The first deep methods used autoencoders \parencitesakurada2014anomaly,zhou2017anomaly,nguyen2019, generative models \parenciteschlegl2017unsupervised,deecke2018image,schlegl2019, and one-class models \parenciteruff2018deep,wu2019deep to perform AD. Autoencoders are neural networks trained to compress inputs into a low-dimensional space, from which they then reconstruct the original input. Trained only on normal samples, autoencoders are less able to accurately reconstruct anomalous patterns. The difference between the original input and the reconstruction can be used for determining an anomaly score. Generative models can be used for AD in a similar manner; e.g. by taking the difference of the most similar generated sample to the original. In deep one-class models, a neural network is trained to map normal samples to a latent space so that the samples are encompassed in a hypersphere of minimal radius. For inference, samples that lie outside of this hypersphere are considered anomalous. Recently, self-supervised learning \parencitegolan2018deep,hendrycks2019using,tack2020 and outlier exposure \parencitehendrycks2019deep,liznerski2022exposing caused a breakthrough in deep AD, halving the error rates on established benchmarks \parencitetack2020, liznerski2022exposing.
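For concreteness, the reconstruction-based and one-class scores described above can be written compactly. The notation is introduced here for illustration only: $e$ and $d$ denote the encoder and decoder of an autoencoder, $\phi_W$ a feature network with weights $W$, $c$ a fixed hypersphere center, and $x_1,\dots,x_n$ the normal training samples. Regularization and soft-boundary terms used in published one-class objectives are omitted, so this is a simplified sketch rather than the exact formulation of any cited method.

\[
s_{\mathrm{AE}}(x) = \lVert x - d(e(x)) \rVert_2^2,
\qquad
s_{\mathrm{OC}}(x) = \lVert \phi_W(x) - c \rVert_2^2,
\qquad
\min_W \; \frac{1}{n}\sum_{i=1}^{n} \lVert \phi_W(x_i) - c \rVert_2^2 .
\]

In both cases a larger score indicates a more anomalous input; for the one-class model with hypersphere radius $R$, a sample is considered anomalous when $s_{\mathrm{OC}}(x) > R^2$.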

Self-supervised Anomaly Detection

Self-supervised learning emerged as a means to learn general-purpose features for various tasks without the need for manual labeling. In self-supervised learning, labels for learning are automatically generated. For example, in \parencitegidaris2018unsupervised, the authors rotated each training image by four different angles and trained the neural network to predict the angle. They termed the trained network “RotNet.” RotNet is also the first self-supervised method that was used for deep AD \parencitegolan2018deep, marking the beginning of self-supervised deep AD. Training RotNet exclusively on normal samples leads to uncertainty when applied to anomalies. Thus, the uncertainty of the prediction can be used as an anomaly score. Later works improved upon this by adding and combining additional geometric transformations \parencitehendrycks2019using. More recent approaches used contrastive losses in combination with image transformations to further improve the AD performance \parencitetack2020, sohn2021, zou2022spot. In the CSI method \parencitetack2020, the contrastive loss of SimCLR \parencitechen2020simple is used to train a network to map diverse transformations of normal images close together, while pushing different normal samples apart. The network also has to predict the kind of transformation, as was done in \parencitehendrycks2019using, and integrates a k-nearest neighbor algorithm for final detection. CSI is still the state-of-the-art AD method for natural images when one is restricted from using extra training data in the form of outlier exposure or pretraining.
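The rotation pretext task can be sketched without any deep learning library. The sketch below only generates the automatically labeled training views and shows one possible uncertainty-based score; it assumes rotations by multiples of 90 degrees, represents an image as a list of rows, and uses hypothetical function names. The classifier that would be trained on these views is omitted.

```python
def rotate90(image):
    """Rotate a 2-D list-of-rows image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_task(image):
    """Return four (rotated_image, label) pairs for 0, 90, 180, 270 degrees."""
    samples = []
    current = [list(row) for row in image]
    for label in range(4):
        samples.append((current, label))
        current = rotate90(current)
    return samples


def rotation_uncertainty(prob_of_true_rotation):
    """One possible anomaly score: 1 minus the mean probability that a
    classifier assigns to the correct rotation of each of the four views."""
    return 1.0 - sum(prob_of_true_rotation) / len(prob_of_true_rotation)


# Toy 2 x 2 "image" for illustration only.
views = rotation_task([[1, 2], [3, 4]])
```

A classifier trained on normal images should assign high probability to the correct rotation of normal inputs, giving a score near 0, whereas chance-level predictions on an anomalous input give a score near 0.75.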

Outlier Exposure

AD algorithms are typically unsupervised, because it is considered infeasible to find anomalous data that sufficiently represents everything anomalous. An anomaly detector trained with samples that represent just a subset of all notions of anomalousness is prone to be biased towards the seen notions, generalizes poorly, and thus performs inaccurately overall \parenciteruff2020dsad. However, \parencitehendrycks2019deep found that random data samples from a domain (e.g. random images from the web for natural image tasks) are most likely anomalous for any given AD problem in the domain. Using a huge corpus of such “auxiliary anomalies” during training leads to effective generalization and substantially enhances the performance of anomaly detectors on natural image benchmarks. They called their approach “outlier exposure” (OE). Later works investigated the behavior of models trained with OE \parenciteliznerski2022exposing and found that training a binary classifier with a standard cross-entropy loss to distinguish between normal data and OE samples yields the best results. They also proposed a modification of the unsupervised deep one-class loss \parenciteruff2018deep, which they termed “hypersphere classification” (HSC), that performs slightly worse but is more robust to non-representative OE sample distributions. This is particularly important when only a few OE samples are available. While OE with random samples from the internet is the state of the art for natural image AD, it is less suitable for subtle anomalies such as defective versions of normal samples (e.g., a cracked screw when intact screws are normal). Here, random images from the internet are too dissimilar from the data and thus too easy to detect as anomalous. It was found that the most effective kind of OE for this setup can be generated synthetically by perturbing normal samples to look like defective versions \parencitegoyal2020drocc,liznerski2021explainable,mirzaei2022fake.
The powerful idea of outlier exposure has not yet been explored in histopathology.
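The binary-classification variant of OE discussed in this subsection can be written as a standard cross-entropy objective. The notation is ours and purely illustrative: $f_\theta(x)\in(0,1)$ is the predicted probability that $x$ is anomalous, $\mathcal{D}_{\mathrm{norm}}$ is the normal training data, and $\mathcal{D}_{\mathrm{OE}}$ is the set of auxiliary anomalies. Weighting both terms equally is an assumption.

```latex
\begin{equation}
  \min_{\theta}\;
  -\frac{1}{|\mathcal{D}_{\mathrm{norm}}|}
    \sum_{x \in \mathcal{D}_{\mathrm{norm}}} \log\bigl(1 - f_\theta(x)\bigr)
  \;-\;
  \frac{1}{|\mathcal{D}_{\mathrm{OE}}|}
    \sum_{x' \in \mathcal{D}_{\mathrm{OE}}} \log f_\theta(x')
\end{equation}
```

Under this formulation, $f_\theta(x)$ itself can serve as the anomaly score at test time.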

A.2 Explaining deep AD with heatmaps

In our paper, we provide heatmaps that represent the anomaly scores for overlapping patches, thereby providing an interpretable AD method. Explaining deep anomaly detectors via heatmaps is a rather recent line of research. It draws from the broad literature on explainable AI (see e.g. \parencitebaehrens2010explain,simonyan2013deep,bach2015pixel,sundararajan2017axiomatic,lapuschkin2019unmasking,gunning2019xai,samek2021explaining,holzinger2022xxai,liznerski2024reimagining,varshneya2024interpretable). The most common approach is to make the neural network directly attribute features \parenciteliznerski2021explainable, so that each feature (e.g., pixel) is assigned a separate anomaly score, together forming the anomaly heatmap. Early works used generative models or autoencoders, where the pixel-wise reconstruction error yields the heatmap \parencitebaur2018deep, bergmann2019mvtec, dehaene2020Iterative. In a more modern approach, \parencitekauffmann2020towards,liznerski2021explainable utilized the properties of fully convolutional neural networks to have the network directly output pixel-wise anomaly scores. Other works used gradient-based explanations (e.g., Grad-CAM \parenciteselvaraju2017grad) to highlight the regions the AD model focuses on, usually aligning with the more anomalous regions \parencitevenkataramanan2020attention,li2021cutpaste. In combination with synthetically generated anomalies that are perturbed versions of normal samples, one can also directly train a semantic segmentation model that assigns a probability of anomalousness per pixel \parenciteschluter2022natural. The training target is the pixel-wise difference between the generated anomaly and the normal sample. In the most recent approaches, an arbitrary feature learner is trained, and then the discrepancy between extracted feature maps for a test sample and a collection of normal feature maps is used to assign an anomaly score per pixel \parencitedefard2021padim,roth2022towards,zou2022spot.

A.3 AD in histopathology

Until recently, applying AD to histopathology remained largely unexplored. Virtually all previous publications focused on detecting single disease entities as anomalous \parencitepocevivciute2021unsupervised,shvetsova2021anomaly,stepec2021unsupervised,he2022review, such as the detection of breast cancer metastasis within lymph nodes. For this narrow task, an AUROC of up to 94.7% was reached \parencitestepec2021unsupervised, using image-to-image translation. Zehnder et al. aimed to detect three different kinds of anomalous changes (necrosis, peritonitis, inflammation) in mouse liver tissue \parencitezehnder2022multiscale. However, their dataset was small (50 WSIs in total) and contained as few as 53 tiles for individual abnormal classes. In this regard, our diverse stomach & colon datasets with 1,973 WSIs of frequent findings and 200 WSIs of anomalous findings provide a much larger data basis. Methodologically, previous publications focused on AD models such as generative adversarial networks \parencitevstepec2020image,pocevivciute2021unsupervised, stepec2021unsupervised, zehnder2022multiscale, autoencoders \parenciteshvetsova2021anomaly, or flow-based models \parencitepawlowski2021abnormality. Zingman et al. have used AD to detect anomalous liver tissue showing patterns of non-alcoholic fatty liver disease in mice \parenciteZINGMANpublished. Interestingly, they also used other tissue types (liver, brain, kidney, heart, lung, pancreas, spleen) during the training process of their model. However, their approach was to use other tissue types to optimize image representations with subsequent AD using a one-class classifier. This differs significantly from our outlier exposure approach.

If a model that detects anomalies also differentiates between different classes of normal inputs, it is typically called an out-of-distribution (OOD) detector. In contrast to AD models, OOD detectors usually require labels and use the confidence of the classifier to detect OOD samples. Linmans et al. trained a large set of models to differentiate between normal lymph node tissue and breast cancer metastasis. They then analyzed the models’ ability to accurately detect OOD inputs of diffuse large B-cell lymphoma. The best-performing model, a 5-multi-head ensemble, reached an AUROC of 81.02% \parencitelinmans2023predictive. Dolezal et al. tested the ability to detect OOD inputs for a classifier trained to differentiate between lung squamous cell carcinoma and lung adenocarcinoma. Interestingly, even though their model was implemented to predict the uncertainty of the classifier accurately, 21.5% of input slides showing non-lung, non-adenocarcinoma, non-squamous OOD cancer types were incorrectly assigned to in-distribution classes with high confidence \parenciteuncertainty_dolezal.

Appendix B Additional details

B.1 Dataset statistics

Table 2 provides an overview of the Charité cohort, including the auxiliary slides used for our outlier exposure model. For the LMU cohort, Table 3 provides further diagnosis statistics.

Table 2: Diagnosis and tissue distribution in the Charité cohort.

Diagnosis	Slides
Frequent findings	961
Chronic gastritis, NFS	322
Normal tissue	230
Chemical gastritis (Type C)	213
Bacterial gastritis (Type B)	113
Acute and chronic gastritis, NFS	65
Autoimmune gastritis (Type A)	18
Neoplastic, malignant	60
Gastric adenocarcinoma, NFS	11
Gastric adenocarcinoma, signet-ring-cell	10
Marginal zone lymphoma	7
Metastasis, Adenocarcinoma of breast	5
Neuroendocrine tumor	4
Metastasis, Melanoma	4
Squamous cell carcinoma	3
Neuroendocrine carcinoma	3
Gastrointestinal stromal tumor	2
MINEN	2
Isolated lymphangiosis carcinomatosa	1
Gastric adenocarcinoma, hepatoid	1
B-cell lymphoma	1
Undifferentiated sarcoma	1
Undifferentiated carcinoma	1
Metastasis, Adenocarcinoma of lung	1
Adenosquamous carcinoma	1
Metastasis, Urothelial carcinoma	1
Metastasis, Adenocarcinoma of pancreas	1
Neoplastic, other	22
Adenoma, foveolar type	7
Tubular adenoma	6
Hyperplastic polyp	4
Adenoma, oxyntic type	1
Tubulovillous adenoma	1
Peutz-Jeghers polyp	1
Leiomyoma	1
Fundic gland polyp	1
Inflammation	12
Ulcer	11
Lymphocytic gastritis	1
Other	6
Pancreatic heterotopia	2
Xanthoma	2
Helminthosis	1
Calcinosis	1

(a)

Diagnosis	Slides
Frequent findings	1012
Normal tissue	507
Adenoma, low grade	321
Hyperplastic polyp	139
Sessile serrated lesion	45
Neoplastic, malignant	31
Colorectal adenocarcinoma	23
Neuroendocrine tumor	4
Metastasis, Melanoma	2
Squamous cell carcinoma	1
Undifferentiated sarcoma	1
Neoplastic, other	9
Tubulovillous adenoma, high grade	4
Tubular adenoma, high grade	3
Leiomyoma	1
Villous adenoma, high grade	1
Inflammation	52
Ulcerative colitis	15
Crohn’s disease	8
Inflammatory pseudopolyp	6
Acute and chronic colitis	5
Lymphocytic colitis	4
Ischemic colitis	4
Collagenous colitis	4
Ulcer	4
Sevelamer-induced colitis	1
Pseudomembranous colitis	1
Other	8
Pseudomelanosis Coli	3
Intestinal spirochaetosis	3
Helminthosis	2

(b)

Category	Slides
Auxiliary (for “Outlier Exposure”)	2,901
Prostate	599
Kidney	459
Liver	402
Lung	401
Small intestine	283
Breast	204
Other, mixed	553

(c)

Table 3: Diagnosis distribution of the colon (first table) and stomach (second table) LMU cohort.

Diagnosis	Slides
Frequent findings	63
Adenoma, low grade	35
Hyperplastic polyp	13
Normal tissue	11
Sessile serrated lesion	4
Neoplastic, malignant	30
Colorectal adenocarcinoma	25
Neuroendocrine tumor	3
Metastasis, Adenocarcinoma of lung	1
Metastasis, Melanoma	1
Neoplastic, other	6
Tubulovillous adenoma, high grade	6
Inflammation	55
Ulcerative colitis	23
Crohn’s disease	17
Collagenous colitis	12
Chronic colitis, NFS	2
Pseudomembranous colitis	1
Other	3
Pseudomelanosis Coli	3

Diagnosis	Slides
Frequent findings	164
Chronic gastritis, NFS	65
Normal tissue	53
Bacterial gastritis (Type B)	34
Chemical gastritis (Type C)	12
Neoplastic, malignant	24
Marginal zone lymphoma	8
Gastric adenocarcinoma, signet-ring-cell	7
Gastric adenocarcinoma, NFS	5
Neuroendocrine tumor	3
Metastasis, Adenocarcinoma of ovary	1
Neoplastic, other	2
Hyperplastic polyp	2
Inflammation	2
Ulcer	2

B.2 Training details

To fine-tune self-supervised learning models with a one-class loss, we followed the training procedure of PANDA \parencitereiss2021panda. We used a learning rate of $10^{-2}$, a batch size of 32, and the SGD optimizer. To prevent a collapse of the representations, we clipped the gradient norm to $10^{-3}$, froze the first blocks of the network, and did not update batch norm statistics during fine-tuning.
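Gradient-norm clipping, as mentioned above, rescales the gradient whenever its norm exceeds a threshold. The following is a minimal sketch on plain Python lists. The function names are hypothetical, and an actual training run would rely on the corresponding utilities of a deep learning framework instead.

```python
import math


def clip_grad_norm(grads, max_norm=1e-3):
    """Rescale a flat gradient vector so that its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


def sgd_step(params, grads, lr=1e-2, max_norm=1e-3):
    """One plain SGD update with gradient-norm clipping (no momentum)."""
    clipped = clip_grad_norm(grads, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)]
```

With the values stated above (learning rate $10^{-2}$, clipping threshold $10^{-3}$), a single update in this sketch moves the parameters by at most $10^{-5}$ in L2 norm, which is consistent with the stated goal of preventing a collapse of the representations.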

For OE, we trained the network with the standard SGD optimizer with momentum $0.9$, a learning rate of $5\cdot 10^{-4}$, a batch size of 32, and a weight decay of $0.0001$. For augmentation, we used resized crops and color jitter. To achieve a balance between patches of frequent GI findings and OE data, we sampled an equal number of both in each batch. The OE patches were further sampled to include 50% of near-tissue types (stomach, colon, small intestine) and 50% of far-tissue types (all other tissues), which we defined according to their morphological similarity to the target tissue. As some basic tissues (e.g. connective tissue, muscle tissue) can be found in both our frequent findings and the OE data, we aimed to remove patches with such overlapping tissue components. We did so by computing the similarity between patches from both groups and removing samples from the OE data with a cosine similarity of more than $0.9$. Each model was trained within a single day on an NVIDIA A100 GPU.
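The OE filtering and batch composition described above can be sketched as follows. Several details are assumptions made only for illustration: the feature vectors in which the cosine similarity is computed, the rule of comparing each OE patch against every normal patch, and the application of the 50/50 near/far split within each batch. All function names are hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_oe(oe_features, normal_features, threshold=0.9):
    """Return indices of OE patches that are not too similar to any normal patch."""
    keep = []
    for i, feat in enumerate(oe_features):
        if all(cosine_similarity(feat, n) <= threshold for n in normal_features):
            keep.append(i)
    return keep


def sample_batch(normal, near_oe, far_oe, batch_size=32, rng=None):
    """Sample a batch with half normal patches (label 0) and half OE patches
    (label 1), the OE half split evenly between near and far tissue types."""
    rng = rng or random.Random(0)
    n_normal = batch_size // 2
    n_near = (batch_size - n_normal) // 2
    n_far = batch_size - n_normal - n_near
    batch = [(x, 0) for x in rng.choices(normal, k=n_normal)]
    batch += [(x, 1) for x in rng.choices(near_oe, k=n_near)]
    batch += [(x, 1) for x in rng.choices(far_oe, k=n_far)]
    rng.shuffle(batch)
    return batch
```

The pairwise comparison in `filter_oe` is quadratic in the number of patches. At the scale of millions of patches, an approximate nearest-neighbor search would likely be needed instead.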

B.3 Evaluation Details

To obtain robust patch-level results, we applied test-time augmentation as an ensembling technique \parenciteshorten2019survey. We used the same augmentations as during training (random crops, color jitter) to generate $n=10$ views of the same patch, computed the anomaly score for each view, and then averaged over all the views for a final patch anomaly score.
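The averaging step can be sketched in a few lines. Here, `score_fn` (the anomaly detector) and `augment_fn` (a random augmentation that takes a patch and a random number generator) are hypothetical placeholders for the trained model and the training-time augmentations.

```python
import random


def tta_score(patch, score_fn, augment_fn, n_views=10, seed=0):
    """Average the anomaly score over n_views randomly augmented views of a patch."""
    rng = random.Random(seed)
    scores = [score_fn(augment_fn(patch, rng)) for _ in range(n_views)]
    return sum(scores) / len(scores)
```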

To point pathologists to anomalous regions within a slide, we generated fine-grained anomaly score heatmaps from the patch predictions. First, we extracted patches from WSIs with an overlap of 75 pixels. Second, the patches were pre-processed and passed through the model with test-time augmentation as described above. The resulting patch scores were then aggregated into a spatial map, where the scores of overlapping patches were averaged, creating a smooth heatmap. A color was assigned to each tissue patch based on the anomaly score.
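The aggregation of overlapping patch scores can be sketched as a per-pixel running mean. This is an illustration rather than the actual implementation: the function name and input layout are assumptions, and a practical version would operate on arrays and a coarser grid, since looping over every pixel of a WSI in pure Python is far too slow. With $340\times 340$ patches and an overlap of 75 pixels, the stride between neighboring patches would be 265 pixels.

```python
def aggregate_heatmap(height, width, patch_size, patches):
    """Average patch scores per pixel.

    patches is an iterable of (top, left, score) tuples. Pixels covered by
    several patches receive the mean of their scores; uncovered pixels are None.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top, left, score in patches:
        for r in range(max(top, 0), min(top + patch_size, height)):
            for c in range(max(left, 0), min(left + patch_size, width)):
                total[r][c] += score
                count[r][c] += 1
    return [
        [total[r][c] / count[r][c] if count[r][c] else None for c in range(width)]
        for r in range(height)
    ]
```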

B.4 Image pre-processing pipeline

For each collected slide, we computed a tissue boundary using standard computer vision operations. Subsequently, we extracted patches of size $340\times 340$ pixels from the identified tissue regions at 20x magnification, corresponding to a resolution of roughly 0.5 microns per pixel (mpp). We ignored patches with more than 80% background and applied Reinhard’s stain normalization method \parencitereinhard2001stainnorm to each patch with the average stain statistics of our frequent findings as a normalization target.
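The tiling and background filtering can be sketched on a binary tissue mask. The mask representation, the function name, and the default stride (non-overlapping tiles) are assumptions for illustration. Tissue detection and Reinhard stain normalization are not covered here.

```python
def extract_patch_coords(tissue_mask, patch_size=340, stride=340, max_background=0.8):
    """Return (top, left) coordinates of patches with at most
    max_background fraction of background pixels.

    tissue_mask is a 2D list with 1 for tissue pixels and 0 for background.
    """
    height = len(tissue_mask)
    width = len(tissue_mask[0]) if height else 0
    coords = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            tissue = sum(
                tissue_mask[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            )
            background_fraction = 1 - tissue / (patch_size * patch_size)
            if background_fraction <= max_background:
                coords.append((top, left))
    return coords
```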

B.5 Independent validation on the LMU Munich cohort

To demonstrate the ability of our methodology to generalize across hospitals and scanners, we collected an independent dataset with H&E-stained slides of 192 gastric and 157 colon specimens from the archives of the Institute of Pathology, LMU Munich, from routine diagnostics between 2020 and 2023. Details of the LMU dataset are provided in Table 3.

The LMU slides were scanned with a different scanner (Leica Aperio GT 450) and, after digitization, pre-processed in the same manner as described above. Slides with pronounced lab-specific tearing artifacts that were markedly distinct from those encountered in the Charité cohort were excluded (for both frequent and anomalous findings). Inflammatory colon changes were graded on a scale of 0 (no inflammation) to 3 (high inflammation) and matched to inflammation levels in the Charité cohort in order to increase comparability. Subsequently, we applied our trained AD models to all patches from the slides and aggregated the patch scores to slide scores using the same strategy as on the Charité cohort. No re-training was performed; that is, no model was exposed to any of the LMU data at training time. For each assessed method, we evaluated all five models trained via 5-fold cross-validation on the Charité dataset and report the mean and standard deviation of the slide-AUROC scores on the LMU cohort.
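The reporting scheme (one slide-AUROC per cross-validation model, then mean and standard deviation) can be sketched as follows. The AUROC is computed via its rank interpretation. Whether the reported standard deviation is the population or the sample version is not stated, so the use of the population version here is an assumption, as are the function names.

```python
import statistics


def auroc(scores, labels):
    """AUROC as the probability that a random anomalous slide (label 1)
    scores higher than a random frequent-finding slide (label 0).
    Ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize_folds(fold_scores, labels):
    """Mean and population standard deviation of the AUROC (in percent)
    over the slide scores produced by each cross-validation model."""
    values = [100.0 * auroc(scores, labels) for scores in fold_scores]
    return statistics.mean(values), statistics.pstdev(values)
```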

Appendix C Additional results

This section provides additional experimental results, supplementing the results presented in the main paper.

C.1 Performance on the external LMU cohort

Table 4 shows the performance on the external hold-out LMU cohort.

Table 4: Anomaly detection performance on the stomach and colon LMU cohort. We only report slide-AUROC scores as no pixel-wise annotations were available for this cohort.

		Stomach	Colon
Model	Diagnosis Group	slide-AUROC	slide-AUROC
Self-supervision w/ kNN		$88.6\pm 0.1$	$84.44\pm 0.61$
	Neoplastic, malignant	$87.91\pm 0.15$	$97.89\pm 0.1$
	Neoplastic, other	$99.15\pm 0.09$	$100.0\pm 0.0$
	Inflammation	$90.24\pm 0.64$	$78.72\pm 0.98$
	Other	-	$23.6\pm 0.96$
	w/o pseudomelanosis and collagenous colitis	-	$90.05\pm 0.54$
Self-supervision w/ OCC		$89.92\pm 0.85$	$87.43\pm 0.61$
	Neoplastic, malignant	$89.28\pm 0.98$	$97.79\pm 0.25$
	Neoplastic, other	$96.67\pm 0.65$	$100.0\pm 0.0$
	Inflammation	$89.11\pm 0.85$	$83.27\pm 0.85$
	Other	-	$34.92\pm 1.35$
	w/o pseudomelanosis and collagenous colitis	-	$93.3\pm 0.61$
Outlier Exposure w/ BCE		$94.5\pm 0.93$	$85.88\pm 0.94$
	Neoplastic, malignant	$94.77\pm 0.88$	$95.02\pm 0.37$
	Neoplastic, other	$99.47\pm 0.27$	$98.57\pm 0.24$
	Inflammation	$90.08\pm 2.24$	$82.53\pm 1.42$
	Other	-	$30.48\pm 2.86$
	w/o pseudomelanosis and collagenous colitis	-	$91.71\pm 0.57$

C.2 Detection performance for each distinct diagnosis in the Charité cohort

We report patch-AUROC and slide-AUROC for each distinct diagnosis in our Charité cohort in Table 5 (stomach) and Table 6 (colon).

Table 5: Patch-AUROC and slide-AUROC scores for each distinct diagnosis in the stomach Charité cohort for the OE model. The scores are computed by assessing only the anomalous data from the respective diagnosis type and all frequent findings.

Diagnosis Group	Diagnosis	patch-AUROC	slide-AUROC
Neoplastic, malignant	Gastric adenocarcinoma, NFS	$95.54\pm 0.39$	$98.14\pm 0.53$
	Gastric adenocarcinoma, signet-ring-cell	$97.66\pm 0.20$	$99.18\pm 0.22$
	Gastric adenocarcinoma, hepatoid	$96.63\pm 0.48$	$99.27\pm 0.47$
	Adenosquamous carcinoma	$97.52\pm 0.18$	$99.79\pm 0.29$
	Squamous cell carcinoma	$97.67\pm 0.18$	$99.76\pm 0.27$
	Undifferentiated carcinoma	$99.63\pm 0.16$	$99.79\pm 0.29$
	Isolated lymphangiosis carcinomatosa	$94.02\pm 0.53$	$99.27\pm 0.61$
	Neuroendocrine tumor	$91.21\pm 0.86$	$93.70\pm 0.90$
	MINEN	$98.29\pm 0.22$	$99.16\pm 0.40$
	Neuroendocrine carcinoma	$98.40\pm 0.23$	$96.39\pm 0.60$
	B-cell lymphoma, NFS	$93.26\pm 0.43$	$97.61\pm 0.59$
	Marginal zone lymphoma	$91.12\pm 0.50$	$95.52\pm 0.77$
	Gastrointestinal stromal tumor	$98.10\pm 0.25$	$99.63\pm 0.30$
	Undifferentiated sarcoma	$97.80\pm 0.23$	$99.79\pm 0.29$
	Metastasis, Adenocarcinoma of pancreas	$96.40\pm 0.39$	$98.23\pm 0.81$
	Metastasis, Adenocarcinoma of breast	$91.58\pm 0.60$	$97.79\pm 0.88$
	Metastasis, Adenocarcinoma of lung	$96.63\pm 0.22$	$99.79\pm 0.29$
	Metastasis, Urothelial carcinoma	$99.05\pm 0.18$	$99.79\pm 0.29$
	Metastasis, Melanoma	$92.17\pm 0.56$	$95.00\pm 0.72$
Neoplastic, other	Tubular adenoma	$94.47\pm 0.25$	$99.51\pm 0.38$
	Tubulovillous adenoma	$98.25\pm 0.30$	$99.69\pm 0.29$
	Adenoma, oxyntic type	$73.33\pm 2.13$	$64.42\pm 3.70$
	Adenoma, foveolar type	$86.69\pm 0.84$	$73.03\pm 1.82$
	Fundic gland polyp	$88.43\pm 1.24$	$98.95\pm 0.76$
	Hyperplastic polyp	$92.87\pm 0.51$	$97.04\pm 0.64$
	Leiomyoma	$95.18\pm 0.54$	$99.16\pm 0.61$
	Peutz-Jeghers polyp	$73.00\pm 1.06$	$87.22\pm 1.52$
Inflammation	Ulcer	$96.73\pm 0.26$	$95.26\pm 0.89$
	Lymphocytic gastritis	$69.23\pm 3.00$	$72.93\pm 2.70$
Other	Helminthosis	$99.85\pm 0.11$	$99.79\pm 0.29$
	Xanthoma	$86.80\pm 1.09$	$92.24\pm 0.92$
	Calcinosis	$92.25\pm 0.84$	$97.71\pm 0.31$
	Pancreatic heterotopia	$96.35\pm 0.49$	$95.84\pm 0.71$

Table 6: Patch-AUROC and slide-AUROC scores for each distinct diagnosis in the colon Charité cohort for the OE model. The scores are computed by assessing only the anomalous data from the respective diagnosis type and all frequent findings.

Diagnosis Group	Diagnosis	patch-AUROC	slide-AUROC
Neoplastic, malignant	Colorectal adenocarcinoma	$96.14\pm 0.27$	$98.53\pm 0.67$
	Squamous cell carcinoma	$95.63\pm 0.34$	$97.49\pm 1.51$
	Neuroendocrine tumor	$95.03\pm 0.37$	$85.68\pm 1.06$
	Undifferentiated sarcoma	$98.85\pm 0.26$	$100.0\pm 0.00$
	Metastasis, Melanoma	$98.03\pm 0.41$	$99.85\pm 0.23$
Neoplastic, other	Tubular adenoma, high grade	$89.56\pm 0.46$	$90.23\pm 1.95$
	Tubulovillous adenoma, high grade	$95.87\pm 0.46$	$99.35\pm 0.40$
	Villous adenoma, high grade	$88.27\pm 1.16$	$93.39\pm 1.93$
	Leiomyoma	$97.83\pm 0.24$	$100.0\pm 0.00$
Inflammation	Crohn’s disease	$91.86\pm 0.46$	$98.44\pm 0.65$
	Ulcerative colitis	$87.95\pm 0.66$	$95.20\pm 1.45$
	Acute and chronic colitis	$91.28\pm 0.47$	$94.19\pm 1.75$
	Ischemic colitis	$96.03\pm 0.41$	$99.49\pm 0.64$
	Collagenous colitis	$70.19\pm 0.68$	$83.82\pm 3.07$
	Lymphocytic colitis	$51.71\pm 0.38$	$79.39\pm 0.71$
	Ulcer	$96.84\pm 0.31$	$99.19\pm 0.46$
	Inflammatory pseudopolyp	$92.78\pm 0.44$	$98.75\pm 0.62$
	Pseudomembranous colitis	$95.19\pm 0.64$	$97.21\pm 1.00$
	Sevelamer-induced colitis	$73.79\pm 1.04$	$85.99\pm 5.34$
Other	Helminthosis	$78.50\pm 1.49$	$95.90\pm 1.24$
	Intestinal spirochaetosis	$40.87\pm 1.43$	$23.56\pm 2.58$
	Pseudomelanosis Coli	$28.86\pm 0.90$	$20.28\pm 3.36$

C.3 Artifacts vs. anomalies

We also annotated tissue and processing artifacts on anomaly slides to check whether they might cause “Clever Hans” effects \parencitelapuschkin2019unmasking,kauffmann2020clever, where a slide receives a high anomaly score for the wrong reasons, i.e., because of artifacts rather than the infrequent finding itself. However, compared to annotated anomaly regions, the artifact regions received markedly lower anomaly scores. The artifact regions had patch-AUROCs of 80.89% (colon) and 75.64% (stomach), while anomaly regions received patch-AUROCs of 90.47% (colon) and 91.37% (stomach). This indicates that artifacts did not have a major influence on our results.

C.4 Anomaly heatmaps

We also investigated the model performance qualitatively by presenting heatmaps of our AD model to pathologists. In the main paper, we provide selected excerpts of heatmaps to visually demonstrate the performance of our AD model. Here, in Figures 5–10, we present additional heatmaps (right side) along with pixel-wise annotations by pathologists (left side). Areas annotated with red color indicate anomalous regions that define the final diagnosis. Areas annotated in yellow indicate other anomalous regions that do not directly define the final diagnosis (e.g. inflammatory changes adjacent to tumor tissue). We show complete tissue cuts of whole slide images.

C.5 kNN with self-supervised models

Any model that provides informative image representations can be used for AD by simply applying the k-nearest-neighbor (kNN) algorithm in representation space. This often results in state-of-the-art or competitive performance \parencitedn2, reiss2021panda, tack2020, muttenthaler2023improving.
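A minimal sketch of kNN-based scoring in representation space is given below. The choice of $k$, the Euclidean distance, and the use of the mean distance to the $k$ nearest normal embeddings are assumptions for illustration, since these details are not specified here.

```python
import math


def knn_anomaly_score(query, normal_embeddings, k=5):
    """Mean Euclidean distance from query to its k nearest normal embeddings.

    Embeddings far away from all normal training embeddings receive a high score.
    """
    if not normal_embeddings:
        raise ValueError("need at least one normal embedding")
    distances = sorted(math.dist(query, e) for e in normal_embeddings)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)
```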

Table 7: Overview of self-supervised models that have been pretrained on a large collection of histopathology images. All methods maximize the similarity of two augmented views from the same image.

Name	Architecture	Method	Dataset
HIPT \parencitechen2022scaling	ViT-S/16	DINO	33 cancer types, 104M 256×256 images
R50 SimCLR BRCA \parencitechen2022self	ResNet-50	SimCLR	TCGA-BRCA
Ciga et al. \parenciteciga2022self	ResNet-18	SimCLR	23 non-WSI and 35 WSI datasets
RetCCL \parenciteretccl	ResNet-50	contrastive learning	32,000 WSIs
CTransPath \parencitectranspath	SwinTransformer	mod. SimCLR	TCGA + PAIP

Suitable models for our scenario, i.e., those pretrained with diverse self-supervised techniques on a large collection of histopathology images, are readily available online. Additionally, we trained our own self-supervised model with the SimCLR framework on the normal data. Further, we fine-tuned the self-supervised models with a one-class loss \parencitereiss2021panda.

Table 8: Anomaly detection performance with self-supervised models on the stomach and colon Charité cohort. All models use kNN for AD.

		Colon		Stomach
Cohort	Method	patch-AUROC	slide-AUROC	patch-AUROC	slide-AUROC
Charité	R50 ImageNet	54.65	66.42	61.11	73.94
	R50 SimCLR BRCA \parencitechen2022self	67.02	84.57	70.84	86.63
	Ciga et al. \parenciteciga2022self	68.07	75.91	76.22	90.55
	Self-trained R18 SimCLR	60.13	80.79	73.66	91.54
	HIPT \parencitechen2022scaling	75.55	85.84	75.68	87.31
	RetCCL \parenciteretccl	77.95	86.86	84.65	93.64
	CTransPath \parencitectranspath	84.30	90.39	87.16	95.09

We investigated the AD performance of those models with kNN. We report only the best model in the main paper; in this section, we show the results for all considered self-supervised models. Table 7 displays an overview of the considered models, and Table 8 shows the results for the Charité cohort. We observed that the CTransPath model \parencitectranspath outperforms all other models on both tissue types. Table 9 shows the models after fine-tuning with a one-class loss. Fine-tuning improves the patch-AUROC for some models but does not bring significant improvements on the slide level.

Table 9: Anomaly detection performance with self-supervised models on the stomach and colon Charité cohort after fine-tuning with a one-class loss. All models use kNN for AD.

		Colon		Stomach
Cohort	Method	patch-AUROC	slide-AUROC	patch-AUROC	slide-AUROC
Charité	R50 ImageNet	54.03	63.30	61.75	68.74
	R50 SimCLR BRCA \parencitechen2022self	66.05	82.83	70.27	84.08
	R18 SimCLR \parenciteciga2022self	70.69	72.88	77.83	83.80
	HIPT \parencitechen2022scaling	76.98	84.82	80.49	87.41
	RetCCL \parenciteretccl	82.72	85.89	86.51	87.18
	CTransPath \parencitectranspath	86.24	88.93	89.56	93.52

C.6 Autoencoder

Many of the discussed previous works for AD in histopathology used reconstruction-based methods to detect anomalies \parencitepocevivciute2021unsupervised, litjens20181399, stepec2021unsupervised, zehnder2022multiscale. Therefore, we also investigated the performance of an autoencoder model on our dataset. As mentioned in the related work, an autoencoder is trained on the normal data to compress the image into a low-dimensional representation and then reconstructs the image from that latent representation. The main idea is that the model has a larger reconstruction error on anomalies, as it had only been trained on normal data, thus making the reconstruction error an anomaly score.

We trained a simple autoencoder with a bottleneck dimension of 512 and 6 blocks with 2 convolutional layers each in the encoder and decoder, respectively. Table 10 shows the performance of this autoencoder model.

Table 10: Anomaly detection performance of a simple autoencoder model compared to our OE based model.

		Colon		Stomach
Cohort	Method	patch-AUROC	slide-AUROC	patch-AUROC	slide-AUROC
Charité	Autoencoder	59.67	65.58	51.33	63.73
	Outlier Exposure	90.38	91.49	90.90	94.40
LMU	Autoencoder	-	55.54	-	55.12
	Outlier Exposure	-	84.58	-	93.07

We observe that the autoencoder performs poorly compared to our OE approach. Especially for the stomach cohort, the patch-level performance is barely better than random (51.33 patch-AUROC). We could potentially improve the autoencoder model with an additional adversarial or perceptual loss. However, such modifications usually yield only small performance gains \parencitebergmann2019mvtec. In contrast to industrial defect datasets \parencitebergmann2019mvtec, where there is typically only small variation within the normal data, the normal data in our case has a large intrinsic variation, which likely makes the reconstruction error a less discriminative anomaly signal.

C.7 Multi-scale considerations

Even for pathologists, some pathological patterns are barely detectable on individual patches of some fixed size or may look very similar to frequently found and healthy tissue characteristics. We experienced this issue in some anomalous findings.

For example, histomorphological changes from normal colon tissue to adenomas with low-grade or high-grade cellular changes and even to adenocarcinoma occur on a continuous scale. We noticed that findings from different stages along that continuum can look very similar on a patch level, as exemplified in Figure 11. To set a clear distinction between non-neoplastic colon tissue and high-grade epithelial changes/adenocarcinoma, which are crucial to detect, we decided to exclude low-grade adenomas from our training data. This enabled us to get high detection rates for potentially cancerous findings, while at the same time preserving low anomaly scores for low-grade adenomas because of their high similarity to deep crypt characteristics in regular colon mucosa on a patch level.

Similarly, tissue changes consistent with leiomyoma can often not be differentiated from normal smooth muscle tissue on a patch level. Here, the size and context of anomalous regions are needed to make an accurate diagnosis in clinical practice. However, as smooth muscle tissue of deep biopsies in our dataset tends to get slightly higher anomaly scores than mucosal tissue, the leiomyomas in our dataset received notable slide anomaly scores after aggregation.

As shown, tissue context beyond patch size is often helpful in detecting anomalies. Architectural changes in colon mucosa during inflammation or post-inflammation are another example of this. Multi-scale approaches, integrating tissue contexts at different magnification levels, might be a promising avenue to improve the detection performance for these kinds of anomalies, which we plan to explore in future work.

Appendix D Ablation experiments

We performed ablation experiments to investigate the effects of the different building blocks in our OE model.

D.1 Varying normal data size

We used a large number of frequent-finding slides ($\approx$1000 each for stomach and colon) for training our AD model. In this ablation, we investigated how much training data is necessary for the AD model to generalize to unseen anomalies and whether collecting more training slides would result in a significant performance gain. To test this scaling behavior, we varied the number of slides that we used during training and observed the resulting AD performance on the test set.

We report results for 1, 10, and 100 randomly sampled slides, as well as the full training dataset. We trained all models for the same number of iterations. Figure 12 shows the performance on the Charité cohort with the x-axis being on a logarithmic scale.

The plots show patch-AUROC and slide-AUROC scores with respect to the number of WSIs used during training. As expected, the results highlight that more WSIs in the training data lead to an improved AD performance. Having more data of frequent findings available at training time allows the model to see more patterns and therefore enables better generalization to unseen data. We observed that the performance started to saturate at around 100 slides. Adding an order of magnitude more training slides only marginally improved the performance. This indicates that 100 slides are already sufficient to capture most of the patterns in the training data. Collecting this number of slides from routine diagnostics is practical, and the scaling behavior is consistent between colon and stomach. Therefore, we believe that AD can be adapted well for other tissues with limited slide collection needs.

D.2 OE datasets

In the following, we show that the selection of suitable OE data is nontrivial and of utmost importance for strong generalization to true anomalies. Table 11 shows the performance for different OE datasets, which we discuss below. Throughout all ablations, we used the same train/test split and do not report results on full 5-fold cross-validation. As the standard deviation on the 5-fold cross-validation is low, significant performance differences are also evident with this reduced evaluation scheme.

Table 11: AD performance for different OE datasets on the Charité cohort. Further information about the datasets is given in the following paragraphs.

	Colon		Stomach
Dataset	patch-AUROC	slide-AUROC	patch-AUROC	slide-AUROC
ImageNet-1K	50.72	52.17	55.91	75.21
TCGA	78.84	82.73	80.73	88.80
Full OE Set	90.38	91.49	90.90	94.40
Charité OE mixed	86.09	88.48	88.34	93.53
Kidney subset	85.13	86.78	85.47	88.32
Breast subset	84.98	88.96	87.13	89.90
Small Intestine	90.41	91.74	89.54	92.44
Stomach	90.73	91.61	-	-
Colon	-	-	88.48	90.99

OE using natural images

On natural image AD benchmarks (e.g., one vs. rest with CIFAR10), OE with a diverse set of natural images is most effective \parencitehendrycks2019deep, ruff2020rethinking, liznerski2022exposing. In a first experiment, we investigated if this also holds in the histopathology regime, i.e., if a diverse OE set of natural images already suffices for strong generalization to true anomalies. The data distribution of natural images strongly differs from our medical images; therefore, we expected generalization to true anomalies to be challenging. We used the same training setup as for our main results but exchanged the OE dataset with the popular ImageNet-1K dataset \parenciteimagenet containing a set of 1000 image classes.

During training, we observed that the binary cross-entropy loss decreased rapidly within the first optimization steps. This shows that the model can easily differentiate between histopathology and natural images. After training, we evaluated our model on the test set, which yielded a poor performance of 50.72 patch-AUROC (colon) and 55.91 patch-AUROC (stomach) (see Table 11). The binary classification between natural images and our histopathology images is trivial in this case, and therefore the model does not learn features that generalize to real anomalies.

OE using TCGA data

Table 12: Statistics of the TCGA OE dataset. Patches were created at 20x magnification with a patch size of 340x340 pixels.

Study	Description	Slides	Patches
BRCA	Breast invasive carcinoma	50	265,017
CHOL	Cholangiocarcinoma	39	433,769
ESCA	Esophageal carcinoma	50	338,414
LUAD	Lung adenocarcinoma	49	336,924
PAAD	Pancreatic adenocarcinoma	48	332,582
PRAD	Prostate adenocarcinoma	50	372,134
UCEC	Uterine corpus endometrial carcinoma	50	490,544
Total		336	2,569,384

Previous studies suggest that OE becomes more effective the more similar the OE data are to the normal data \parencite{goyal2020drocc}, as this forces the model to learn a tighter decision boundary around the normal data. To obtain an OE dataset closer to our normal colon and stomach slides, we collected publicly available histopathology images from seven selected studies of The Cancer Genome Atlas (TCGA, https://www.cancer.gov/tcga): BRCA, CHOL, ESCA, LUAD, PAAD, PRAD, and UCEC. The studies were chosen for some (however distant) morphological proximity to colon and stomach tissue. We did not consider studies containing colon or stomach slides, as they might already include some of the anomalies we aim to predict. We randomly sampled up to 50 slides per study and preprocessed the slides as described in the Methods section. The resulting dataset statistics are presented in Table 12. We then trained models to discriminate between normal colon or stomach patches on one side and the TCGA OE patches on the other side and evaluated them on the held-out Charité cohort. Compared to using natural images as OE, this improved the AD performance by a large margin (see Table 11).

OE using separately collected Charité data

While TCGA data are closer to our histopathology images than natural images are, there are significant differences in how labs process slides. Also, TCGA data are limited in the diversity of patterns, as they mostly consist of cancerous tissue. Therefore, we collected a diverse set of OE slides from Charité hospital and preprocessed them in the same manner as our frequent findings in the Charité cohort. This limits the model’s ability to exploit low-level features, such as staining or resolution, to differentiate OE from normal data. The resulting Charité OE mixed dataset consists of 2901 slides. We observed a significant performance improvement over the TCGA OE dataset (see Table 11), underscoring the importance of the OE data distribution being close to that of the normal data.

Diversity of morphological patterns

The observations in the previous sections showed that the composition of the OE data is crucial for strong generalization to true anomalies. We hypothesized that presenting the model with diverse morphological patterns helps to tighten the decision boundary around the normal data. Therefore, we evaluated different scenarios with varying diversity of OE data. The Kidney subset and Breast subset of Charité OE mixed are limited in morphological diversity, as they contain only one tissue type. As expected, these limited subsets were outperformed by the more diverse complete Charité OE mixed dataset. Interestingly, however, single tissue types that are morphologically very similar to the normal data (separately collected slides of small intestine biopsies, colon slides when stomach is the normal class, and stomach slides when colon is the normal class) were competitive with the Charité OE mixed performance. Combining all tissue types (Full OE Set) yielded the overall strongest performance.

Sampling

The previous section has shown that both the diversity of the OE data and their similarity to the normal data are important for a well-generalizing AD model. Therefore, our approach samples similar tissue and more diverse tissue equally often. We defined tissue that is similar to the normal data as small intestine plus colon (when stomach is normal) or stomach (when colon is normal), and all other tissue types as diverse tissue types. We sampled the OE data from these two sets with equal probability.
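The balanced sampling described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' data loader: the function name, the patch identifiers, and the pool sizes are hypothetical, and drawing uniformly within the chosen pool is an assumption the text does not specify.

```python
import random


def sample_oe_patch(similar_pool, diverse_pool, rng):
    """Choose the similar or the diverse pool with probability 0.5 each,
    then draw one patch from the chosen pool (uniformly, by assumption)."""
    pool = similar_pool if rng.random() < 0.5 else diverse_pool
    return rng.choice(pool)


rng = random.Random(0)

# Hypothetical patch identifiers; the diverse pool is deliberately much larger.
similar = ["small_intestine_0", "small_intestine_1", "stomach_0"]
diverse = [f"other_tissue_{i}" for i in range(97)]

draws = [sample_oe_patch(similar, diverse, rng) for _ in range(10_000)]
similar_share = sum(d in similar for d in draws) / len(draws)
print(f"share of draws from the similar pool: {similar_share:.3f}")
```

Sampling the pool first and the patch second keeps the two groups balanced even when the diverse pool contains far more patches than the similar pool, which a uniform draw over the union would not.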

D.3 Data augmentation

The appearance of histopathology slides can vary considerably across stain manufacturers, scanners, and storage times \parencite{macenko2009stainnorm, reinhard2001stainnorm, histo-xai-review}. To avoid overfitting to the slide characteristics of the training set, we used stain normalization as well as training- and test-time augmentations. In this section, we evaluate how much each of these mechanisms influences model performance.

Training augmentations

During training, we applied a set of data augmentations. We used random resized crops, which generate a crop in the range of 10% to 100% while maintaining $0.75$ of the image aspect ratio. Further, we used color jitter and transformed the image to grayscale with a probability of 20%. The effect of different training augmentations is shown in Table 13. In all scenarios below, we used stain normalization as a preprocessing step.

Table 13: AD performance for OE with different forms of training augmentations for colon and stomach in the Charité and LMU cohorts.

		Colon		Stomach
Cohort	Augmentation	patch-AUROC	slide-AUROC	patch-AUROC	slide-AUROC
Charité	Crop Augmentations	89.23	91.16	90.18	95.77
	Weaker Augmentations	90.25	91.55	91.19	95.02
	Strong Augmentations	90.38	91.49	90.90	94.40
LMU	Crop Augmentations	-	82.57	-	90.51
	Weaker Augmentations	-	85.60	-	91.42
	Strong Augmentations	-	84.58	-	93.07

We observed that data augmentations resulted in small performance improvements on the Charité cohort. On the LMU cohort, however, we saw larger improvements from color augmentations. This indicates that the augmentations improve the model's robustness to distribution shifts and also highlights the need to evaluate the models on a held-out test cohort from a different hospital.
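As a rough illustration of the random resized crop and grayscale steps named above, the sketch below samples crop boxes and applies a grayscale conversion with probability 0.2. Several points are assumptions and not taken from the paper: the 10% to 100% range is read as a fraction of the image area, $0.75$ is read as a lower bound on the crop aspect ratio with $1/0.75$ as the upper bound, the luma weights are the common ITU-R BT.601 values, and all function names are hypothetical. Color jitter is omitted because its parameters are not stated.

```python
import math
import random


def sample_crop(width, height, rng, scale=(0.10, 1.00), min_ratio=0.75):
    """Sample a crop box (left, top, w, h) inside a width x height image."""
    for _ in range(10):
        # Assumption: 'scale' is the fraction of the image area kept by the crop.
        area = rng.uniform(*scale) * width * height
        # Assumption: aspect ratio w/h is log-uniform in [min_ratio, 1/min_ratio].
        ratio = math.exp(rng.uniform(math.log(min_ratio), -math.log(min_ratio)))
        w = int(round(math.sqrt(area * ratio)))
        h = int(round(math.sqrt(area / ratio)))
        if 0 < w <= width and 0 < h <= height:
            left = rng.randint(0, width - w)
            top = rng.randint(0, height - h)
            return left, top, w, h
    return 0, 0, width, height  # fallback: keep the whole image


def maybe_grayscale(pixels, rng, p=0.2):
    """With probability p, replace each (r, g, b) pixel by its luma value."""
    if rng.random() >= p:
        return list(pixels)
    gray = []
    for r, g, b in pixels:
        y = round(0.299 * r + 0.587 * g + 0.114 * b)
        gray.append((y, y, y))
    return gray


rng = random.Random(0)
# 340 x 340 is the patch size quoted for the TCGA OE data in Table 12.
boxes = [sample_crop(340, 340, rng) for _ in range(1000)]
print(boxes[:3])
```

In practice, an image library's built-in random resized crop, color jitter, and random grayscale transforms would replace these hand-written helpers.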

Test-time augmentation

We used test-time augmentations to make our anomaly score estimation more robust. In detail, we augmented each image $k$ times, computed an anomaly score for each augmented view with our model, and then averaged the $k$ scores. Table 14 shows the effect of test-time augmentation on the Charité and LMU cohorts.

Table 14: The effect of test-time augmentations on AD performance in the Charité and LMU cohorts.

		Colon		Stomach
Cohort	Augmentation	patch-AUROC	slide-AUROC	patch-AUROC	slide-AUROC
Charité	No Test-time augmentations	89.72	91.09	90.74	94.66
	5 Augmentations	90.32	91.54	90.87	94.47
	10 Augmentations	90.38	91.49	90.90	94.40
LMU	No Test-time augmentations	-	82.24	-	87.59
	5 Augmentations	-	84.77	-	92.49
	10 Augmentations	-	84.58	-	93.07

We observed that, on the Charité cohort, the performance with more test-time augmentations only marginally increased. However, for the LMU cohort, the performance notably increased from using no test-time augmentations to averaging the result of 5 augmentations. We used 10 augmentations for our results in the main paper.
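The test-time augmentation procedure can be sketched as follows: score $k$ augmented views of a patch and average the scores. This is a minimal sketch under assumptions, not the authors' inference code. The scoring and augmentation functions are toy stand-ins, and treating $k = 0$ as scoring the unaugmented patch is our reading of the "No Test-time augmentations" rows in Table 14.

```python
import random
from statistics import fmean


def tta_score(image, score_fn, augment_fn, k, rng):
    """Average the anomaly score over k augmented views of one image.

    With k == 0 the unaugmented image is scored once (assumption).
    """
    if k == 0:
        return score_fn(image)
    return fmean(score_fn(augment_fn(image, rng)) for _ in range(k))


# Toy stand-ins (hypothetical): the "model" scores mean intensity,
# and the "augmentation" adds small Gaussian noise to every pixel.
def toy_score(image):
    return fmean(image)


def toy_augment(image, rng):
    return [px + rng.gauss(0.0, 0.1) for px in image]


rng = random.Random(0)
patch = [0.2, 0.4, 0.6]
single = tta_score(patch, toy_score, toy_augment, k=0, rng=rng)
averaged = tta_score(patch, toy_score, toy_augment, k=10, rng=rng)
print(single, averaged)
```

Averaging reduces the variance that any single view contributes to the score, which is consistent with the larger gains observed on the shifted LMU cohort.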

D.4 Model variations

In this section, we present an ablation on the model architecture and objective function of our AD model.

Network

We compared fine-tuning a pretrained model (CTransPath \parencite{ctranspath}) with training a ResNet-18 \parencite{he2016deep} model from scratch on the OE task. Table 15 shows results on both cohorts in the 5-fold cross-validation setting from the main paper.

Table 15: AD performance for OE with different network architectures for colon and stomach in the Charité and LMU cohorts.

		Colon		Stomach
Cohort	Architecture	slide-AUROC	patch-AUROC	slide-AUROC	patch-AUROC
Charité	ResNet-18 \parencite{he2016deep}	$90.57$	$90.3$	$95.11$	$91.26$
	CTransPath \parencite{ctranspath}	$91.01$	$90.47$	$95.04$	$91.37$
LMU	ResNet-18 \parencite{he2016deep}	$86.37$	$-$	$92.96$	$-$
	CTransPath \parencite{ctranspath}	$85.88$	$-$	$94.5$	$-$

We observed that pretraining is not needed to obtain a competitive anomaly detection model. The OE task seems sufficient for the model to learn suitable representations that generalize from auxiliary anomalies to true anomalies.

Loss function

Previous work has found that with OE, a simple binary cross-entropy loss outperforms specialized AD losses like DeepSAD \parencite{ruff2020dsad} and HSC \parencite{liznerski2022exposing}, which derive anomaly scores directly from the latent space. However, the authors attribute improved robustness to HSC, which suggests better performance in scenarios with limited data availability or suboptimal OE samples. Hence, we assessed whether DeepSAD or HSC could improve the AD performance on our colon and stomach cohorts. Table 16 shows the respective patch-AUROC and slide-AUROC scores on the Charité cohort. Neither method resulted in significant performance gains.

Table 16: AD performance for OE with different loss functions in the colon and stomach Charité cohorts.

	Colon		Stomach
Loss function	patch-AUROC	slide-AUROC	patch-AUROC	slide-AUROC
OE w/ BCE	90.38	91.49	90.90	94.40
OE w/ HSC	89.09	91.55	89.14	93.40
OE w/ DeepSAD	89.11	91.66	89.30	94.60
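For reference, the "OE w/ BCE" objective amounts to a binary classification between normal patches and OE patches. The sketch below is a plain-Python illustration, not the training code used here: the label convention (0 for normal, 1 for OE), the use of the sigmoid output as anomaly score, and the function names are assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_oe_loss(logits, labels, eps=1e-12):
    """Mean binary cross-entropy over a batch.

    labels: 0 for a normal patch, 1 for an OE patch (auxiliary anomaly).
    eps guards against log(0) for saturated predictions.
    """
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)  # read as the anomaly score of the patch (assumption)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(logits)


# Hypothetical logits for one normal patch and one OE patch.
labels = [0, 1]
undecided = bce_oe_loss([0.0, 0.0], labels)    # model outputs 0.5 for both
separated = bce_oe_loss([-5.0, 5.0], labels)   # correct and confident
inverted = bce_oe_loss([5.0, -5.0], labels)    # confident but wrong
print(undecided, separated, inverted)
```

A loss that collapses within a few steps, as reported for ImageNet-1K as OE data, corresponds to the well-separated case being reached almost immediately.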

In summary, our ablations show that a sufficiently large amount of training data and a well-matched OE dataset are far more important for a successful AD model than the choice of network architecture or objective function.