PathAlign: A vision–language model for whole slide images in histopathology
Abstract
Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision–language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image–text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision–language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image–text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.
![Refer to caption](x1.png)
1 Introduction
Recent work in the field of digital histopathology has moved beyond task-specific image classifiers, or even image-only foundation models, to advances using image–text data for vision–language modeling (Huang et al., 2023, Ikezogwo et al., 2024, Lu et al., 2023a, Sun et al., 2024b). The training data for such efforts have predominantly been based on small patches or regions-of-interest (ROIs) extracted from within a Whole Slide Image (WSI), paired with associated patch-level text descriptions, for example the captions and figures for histopathology images in journal articles or educational resources. While such sources can provide useful pairs for local histological features, many pathology tasks involve slide-level or case-level interpretation. Additionally, curated WSI-level text descriptions accurately paired with specific slides are less readily available than patch-level captions, particularly at the scale necessary for machine-learning-based approaches. Even when pathology reports are available, it can be challenging to identify the specific slides that are associated with the reported findings. This is because reporting is typically done for the entire case, but there may be many slides for each case, some of which contribute more meaningfully to the diagnosis and reported findings than others. At least in part due to this data-curation challenge, robust strategies to develop vision–language models for WSIs in pathology have been limited to a small number of recent examples.
In this work, we develop PathAlign to further address some of the challenges of image–text alignment for gigapixel WSIs (see Figure 1). PathAlign learns vision–language alignment using WSIs paired with the corresponding diagnostic text from pathology reports. This approach enables capabilities that rely on image–text alignment at a slide-level, bringing us closer to the possibility of applications such as automatic report generation and case-level visual question answering for digital histopathology. We utilize embeddings from a patch-level foundation model (Lai et al., 2023) as inputs to a BLIP-2 (Li et al., 2023) framework and train two models: one variant trained only with the image–text contrastive loss for efficient embedding-based retrieval, and a second variant using the standard two-stage BLIP-2 training that further integrates a frozen LLM to enable WSI-level text generation and basic visual question-answering capabilities. We present one of the first quantitative pathologist evaluations of WSI-level text generation and image-to-text retrieval. Additionally, we evaluate model performance for WSI classification and present an example of slide prioritization as one practical use case for LLM integration.
2 Related work
The emergence of large language models (LLMs) and large multimodal models (LMMs) has created an entirely new field of multimodal and generative AI systems, with approaches such as CLIP (Radford et al., 2021), BLIP-2 (Li et al., 2023), LLaVa (Liu et al., 2024), CoCa (Yu et al., 2022, Zuo et al., 2023), and others. For pathology specifically, a number of recent works describe promising results for vision–language models. These include utilization of a variety of different data sources for image–text pairs, including social media posts by experts (Tsuneki and Kanavati, 2022), YouTube video and caption curation (Ikezogwo et al., 2024), pathology reports with patch extraction (Zhang et al., 2023a, b), and large-scale curation of figure-caption pairs from medical literature and educational resources (Lu et al., 2023a, Sun et al., 2024b, Gamper and Rajpoot, 2021, Sun et al., 2024a). Additionally, Sun et al. (2024a) recently evaluated many publicly available LMMs on a large, patch-based visual question answering (VQA) dataset, which was itself generated and curated with the help of the GPT-4V LMM. While many of these works focus on patch-level modeling, initial strategies have also been described to represent WSIs, including patch aggregation (Lu et al., 2023b, Song et al., 2024, Ciga et al., 2021), multimodal pretraining (Jaume et al., 2024), hierarchical and self-supervised learning (Chen et al., 2022, Hou et al., 2024), and WSI–language alignment (Xu et al., 2024, Shaikovski et al., 2024). These contributions represent important milestones, and also highlight some of the unique challenges for aligning large images and text information at the WSI-level.
3 Methods
3.1 Data
The primary data used for this work consists of a de-identified dataset (DS1) of 354,089 WSIs from a teaching hospital paired with diagnostic text from pathology reports. The vast majority are hematoxylin and eosin (H&E) stained, with a smaller portion of immunohistochemical (IHC) stained slides. DS1 reflects a real-world distribution of case-types for general pathology practice in the U.S. A summary of the most common tissue sample labels is shown in Supplemental Table C.1. The study was reviewed by Advarra IRB (Columbia, Maryland) and deemed exempt from further review as all data is retrospective and de-identified. In order to further enrich our data for cancer cases, we also included de-identified data from The Cancer Genome Atlas (TCGA). We utilized the set of 12,268 diagnostic WSIs in TCGA across all 32 TCGA solid tumor study types (where each study type approximately maps to a unique cancer type).
![Refer to caption](x2.png)
3.2 Curating image–text pairs
Pathology specimens are typically processed and accessioned by case, part, and block, with findings reported per part (where part indicates distinct tissue specimens within a single case; see also Supplemental Figure B.1). This results in three high-level categories of association between slides and part-level text (see Figure 2): (1) a single slide from a single block; (2) multiple slides from a single block; (3) one or more slides per block across multiple blocks. The probability that some of the information in the part-level text does not actually apply to a given slide (because that particular slide is not representative of the final diagnostic finding) increases from category 1 to category 3. This raises the challenge of pairing any given slide with the portion of the pathology report that actually describes the findings on that particular slide.
To at least partially address this challenge, we first pair each slide with its associated, part-level text from the report using part indicators present in both the full text and the WSI metadata. Next, we separate the DS1 dataset into a “clean” set of all category 1 along with category 2 slides where these ambiguities are mitigated (calling this set DS1-Clean), and a “noisy” set consisting of all categories (DS1-Noisy).
For TCGA, instead of parsing heterogeneous reports to map text to slides without part indicators, we utilized structured case-level metadata available for TCGA (Liu et al., 2018) to generate a basic description in the label : finding format, analogous to the typical structure of part-level text in DS1.
We provide additional details about creating image–text pairs in Section A.1 in the Appendix.
Name | Description | Split | # Cases | # Parts | # Slides |
DS1-Clean | Image–text pairs from DS1 with only one WSI available for the corresponding part-level diagnostic note. | | | | |
DS1-Noisy | All available WSIs with the corresponding part-level note, including instances where there are multiple WSIs for the same part-level diagnostic note. | | | | |
TCGA | Image–text pairs from all FFPE images in TCGA; the diagnostic note text was created based on available TCGA study type and metadata. | | | N/A | |
3.3 Data splits
For DS1, slides from DS1-Clean were randomly split by case into train, validation, and test sets (90/5/5 split). All slides from cases not included in the validation and test splits of DS1-Clean were combined with remaining category 2 and category 3 WSIs to form DS1-Noisy, a larger dataset used for training only ( WSIs).
For TCGA, diagnostic H&E slides across all TCGA study types () were split into train, validation and test sets on a per-study basis by tissue source site (TSS) to enable better evaluation of generalization across tissue and image processing variability from different sites. TSSs were assigned with a target split-ratio of 2:1:1 across train, validation and test splits within each TCGA study, though the final ratios varied due to site-size variability (see Table 1 and the Supplemental Table of TSS splits for details).
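The case-level split described for DS1 can be illustrated with a short sketch. This is not the pipeline used in this work; the function name `split_by_case`, the `slide_to_case` mapping, and the fixed seed are hypothetical choices made for illustration. The point it shows is that the 90/5/5 split is drawn over cases, so all slides of a case land in the same split.

```python
import random


def split_by_case(slide_to_case, fractions=(0.90, 0.05, 0.05), seed=0):
    """Assign slides to train/validation/test so that a case never spans splits.

    slide_to_case: dict mapping a slide identifier to its case identifier
    (hypothetical input format, assumed for this sketch).
    """
    # Shuffle the unique cases, not the slides, to avoid case-level leakage.
    cases = sorted(set(slide_to_case.values()))
    rng = random.Random(seed)
    rng.shuffle(cases)

    n_train = round(fractions[0] * len(cases))
    n_val = round(fractions[1] * len(cases))

    case_split = {}
    for i, case in enumerate(cases):
        if i < n_train:
            case_split[case] = "train"
        elif i < n_train + n_val:
            case_split[case] = "validation"
        else:
            case_split[case] = "test"

    # Every slide inherits the split of its case.
    return {slide: case_split[case] for slide, case in slide_to_case.items()}
```

The same idea applies to the TCGA split if the grouping key is the tissue source site rather than the case, with the target fractions changed to 2:1:1 within each study.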
![Refer to caption](x4.png)
3.4 Modeling
Patch sampling
We represent each WSI by a set of up to 10,240 tissue-containing patches of size at 10X magnification (1 micron-per-pixel), which covers all patches for 97.8% of DS1 WSIs and 91.4% of TCGA WSIs (see Supplemental Figure B.3, with additional details in Section A.2).
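To make the patch budget concrete, the sketch below caps a slide at 10,240 patches. The rule used when a slide exceeds the budget is described in Section A.2 and is not reproduced here; uniform random subsampling without replacement is an assumption of this example, as are the function name `sample_patches` and the representation of a patch by its (x, y) grid coordinate.

```python
import random

MAX_PATCHES = 10_240  # patch budget per WSI stated in the text


def sample_patches(patch_coords, max_patches=MAX_PATCHES, seed=0):
    """Return at most max_patches tissue patch coordinates for one WSI.

    Slides within the budget are returned unchanged. Larger slides are
    subsampled uniformly at random (an assumption of this sketch).
    """
    patch_coords = list(patch_coords)
    if len(patch_coords) <= max_patches:
        return patch_coords
    rng = random.Random(seed)
    # Sort so the output order does not depend on the sampling order.
    return sorted(rng.sample(patch_coords, max_patches))
```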
Patch encoder
We pretrained a pathology-specific patch-level encoder via self-supervised learning using the train split of DS1, following the approach described by Lai et al. (2023) using Masked Siamese Networks (Assran et al., 2022) as the SSL method along with the RandStainNA color augmentation method (Shen et al., 2022). This patch encoder uses a ViT-S architecture (Dosovitskiy et al., 2020, Steiner et al., 2021) and maps pixel patches into embeddings of size 384.
WSI encoder
Our WSI-encoder consists of the image transformer submodule of the Q-Former in the BLIP-2 framework (Li et al., 2023). The input to the WSI-encoder is the sequence of up to 10,240 patch-embeddings from the patch-encoder, with non-learnable sine and cosine position encodings for the patch coordinates incorporated for both x and y axes (Vaswani et al., 2017). The learned query vectors in the Q-Former are used to cross-attend to the input WSI data.
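A minimal sketch of fixed sine and cosine position encodings for two-dimensional patch coordinates follows. The frequency schedule is the one from Vaswani et al. (2017). How the two axes are combined is not specified in this section, so encoding each axis separately and concatenating the results is an assumption of this example, as are the function names and the choice to match the encoding size to the 384-dimensional patch embedding.

```python
import math


def sincos_1d(position, dim):
    """Sinusoidal encoding of one scalar position into `dim` channels."""
    assert dim % 2 == 0, "dim must be even (one sin and one cos per frequency)"
    encoding = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        encoding.append(math.sin(position * freq))
        encoding.append(math.cos(position * freq))
    return encoding


def sincos_2d(x, y, dim):
    """Encode a patch grid coordinate (x, y) into `dim` channels.

    Half of the channels encode x and half encode y (concatenation is an
    assumption of this sketch, not a detail taken from the paper).
    """
    assert dim % 4 == 0, "dim must be divisible by 4"
    return sincos_1d(x, dim // 2) + sincos_1d(y, dim // 2)
```

For example, `sincos_2d(3, 7, 384)` yields a 384-dimensional vector that could be added to the embedding of the patch at grid position (3, 7).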
Image–text alignment
PathAlign is based on the BLIP-2 (Li et al., 2023) vision–language model architecture and training approach (see Figure 1). In the first stage, the WSI and text encoders are trained to align their representations, using learned query vectors to cross-attend to the WSI data. For input to the WSI encoder, WSIs are represented via sequences of patch-level embeddings produced from a patch-level SSL-trained histopathology foundation model along with their positional coordinates. For the second stage, we discard the text encoder from stage 1, and graft the pretrained WSI-encoder to a frozen generative LLM via a linear layer with further fine tuning for text generation. We train one stage 1 model with the image–text contrastive loss only, referring to this variant as PathAlign-R (for retrieval, based on use of this model for cross-modal retrieval tasks). The second variant is trained using the standard two-stage BLIP-2 training procedure along with LLM integration for text generation. We refer to this variant as PathAlign-G (for generation), and use a frozen PaLM-2 S (Anil et al., 2023) model as the LLM. Additional details are provided in Section A.3 including hyper-parameter settings in Supplemental Table C.3.
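The image–text contrastive objective used for PathAlign-R can be sketched as a symmetric cross-entropy over a batch similarity matrix. The example below is a simplification under stated assumptions: each WSI is represented by a single embedding, whereas BLIP-2 compares the text embedding against multiple query outputs, and the temperature value of 0.07 is a placeholder rather than a setting from Supplemental Table C.3.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    image_embs[i] and text_embs[i] are assumed to be a matched pair;
    every other pairing in the batch acts as a negative.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    # Cosine similarity scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))
```

The loss is small when each WSI embedding is closest to its own diagnostic text and large when matched pairs are no more similar than mismatched ones.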
| Example 1 | Example 2 | Example 3 |
WSI thumbnail | ![]() | ![]() | ![]() |
Enlarged view | ![]() | ![]() | ![]() |
Original text | duodenum, biopsy : unremarkable intestinal mucosa. | cervix : biopsy: - low grade squamous intraepithelial lesion (cin 1, mild dysplasia). | skin, biopsy : intradermal nevus. |
Top retrieved text | duodenum, third part, biopsy : small bowel mucosa with no pathologic diagnosis. | cervix : biopsy: - high grade squamous intraepithelial lesion (cin-2; hsil). | skin, punch biopsy : intradermal nevus. |
Generated text | duodenum, biopsy : duodenal mucosa with no significant pathologic changes. | cervix, biopsy : low grade squamous intraepithelial lesion (cin 1). | skin, punch biopsy : compound nevus. |
Pathologist review | Agree with all | Favor HSIL (high grade) | Favor compound nevus |
3.5 Evaluation
Text retrieval and generation
Two U.S.-board certified pathologists evaluated texts for top-K image-to-text retrieval (PathAlign-R) and text generation (PathAlign-G). Automatic evaluation was also performed (primarily for model development) using a similarity score threshold for embeddings from a text-similarity model to determine accurate retrievals (see Section A.5). Retrieval was performed using cosine-similarity between WSI and text embeddings using the corpus of unique texts in the test set ( unique diagnostic texts). For pathologist evaluation, text examples were rated on a five-point scale based on concurrent review of the corresponding WSI (scoring instruction details in Supplemental Table C.8). Additional details including information about the retrieval task setup and the 120 test set WSIs sampled for pathologist evaluation are provided in Section A.4.
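The retrieval step described above reduces to ranking a corpus of text embeddings by cosine similarity to a WSI embedding. A minimal sketch follows; the function names and the `(text, embedding)` corpus format are hypothetical, and the embeddings are assumed to have been computed beforehand.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(wsi_embedding, text_corpus, k=3):
    """Return the k (text, score) pairs most similar to the WSI embedding.

    text_corpus: list of (text, embedding) pairs, one per unique text.
    """
    scored = [(text, cosine(wsi_embedding, emb)) for text, emb in text_corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With `k=1` this returns the single top retrieved text, and with `k=3` the candidates that would be shown to a pathologist for top-3 evaluation.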
WSI classification
We evaluated PathAlign-R on four WSI classification tasks: (1) NSCLC subtyping: non-small cell lung cancer subtyping using LUAD and LUSC in TCGA; (2) RCC subtyping: renal cell carcinoma subtyping using KIRC, KIRP and KICH in TCGA; (3) BRCA subtyping: breast cancer subtyping of ductal versus lobular carcinoma using BRCA and subtype metadata in TCGA; (4) Procedure type: biopsy vs. resection classification using a subset of DS1. To perform classification, WSI embeddings are compared to text embeddings for the classes of interest, using a curated set of texts per class. Texts used for each class are provided in Supplemental Table C.7. See Section A.4.3 for additional WSI classification details. Confidence intervals were computed via bootstrapped resampling with replacement over 1000 replicates.
Task | AUC | Balanced accuracy |
NSCLC Subtyping | 0.945 | 0.875 |
RCC Subtyping | 0.971 | 0.889 |
BRCA Subtyping | 0.879 | 0.775 |
Procedure Classification | 0.987 | 0.942 |
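The text-prompted classification and the bootstrapped confidence intervals described in this section can be sketched as follows. Several details are assumptions of this example rather than facts from the paper: per-class scores are taken as the mean cosine similarity over that class's texts (Section A.4.3 gives the actual procedure), the interval is a percentile bootstrap, and all function names are hypothetical.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def classify(wsi_emb, class_text_embs):
    """Pick the class whose texts are, on average, most similar to the WSI.

    class_text_embs: dict mapping class name to a list of text embeddings.
    """
    scores = {c: sum(cosine(wsi_emb, t) for t in embs) / len(embs)
              for c, embs in class_text_embs.items()}
    return max(scores, key=scores.get), scores


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def bootstrap_ci(y_true, y_pred, metric, n_replicates=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval from resampling slides with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_replicates):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx],
                            [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_replicates)]
    hi = stats[min(n_replicates - 1, int((1 - alpha / 2) * n_replicates))]
    return lo, hi
```

For AUC, the per-class scores returned by `classify` would be used in place of the hard label.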
4 Results
Image-to-text retrieval
Pathologist evaluation of image-to-text retrieval is summarized in Figure 3(a). Top-1 and top-3 retrieval accuracy were 73.5% and 91.3%, respectively (based on a rating score of 4 or 5 to define accurate text). The original diagnostic text was scored as 4 or 5 for 86.5% of ratings. Plots for the individual raters are provided in Supplemental Figure 4(a). Sub-analysis by “common” and “less common” specimen-type categories did not suggest bias towards retrieval of common cases (Supplemental Figure B.7). While automatic evaluation of image-to-text retrieval (as well as text-to-image and image-to-image retrieval) was also performed, this was primarily used for hyper-parameter tuning; details and test set results for automatic evaluation are in Section A.5 and Supplemental Table C.4.
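For readers who want to see how a rating-based top-K accuracy can be computed, a small sketch follows. It assumes that a WSI counts as a hit when at least one of its K highest-ranked retrieved texts received a score of 4 or 5; that aggregation rule, the input format, and the function name are assumptions of this example, and the exact definition used in this study is given in Section A.4.

```python
def top_k_accuracy(ratings_per_wsi, k, accurate_threshold=4):
    """Fraction of WSIs with an accurate text among the top-k retrievals.

    ratings_per_wsi: one list per WSI holding the five-point scores of its
    retrieved texts, ordered from rank 1 downward (hypothetical format).
    """
    hits = sum(
        1 for ratings in ratings_per_wsi
        if any(score >= accurate_threshold for score in ratings[:k])
    )
    return hits / len(ratings_per_wsi)
```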
Image-based text generation
Evaluation of generated text is summarized in Figure 3(b) and Figure B.6, with examples in Table 2. For images where either original text or AI generated text was rated 4 or above, the AI generated text was determined to be equivalent or better than the original text in 75% of ratings. For all WSIs ( images), generated text was rated to be 4 or 5 (i.e. mostly or highly accurate) for 78% of ratings. See Figure 3(b) and Figure B.6 for complete results, including subanalysis by finding type of normal, mild, or significant (as based on the original diagnostic text). Data for the individual raters as well as the inter-rater confusion matrix for scoring of original diagnostic texts are provided in Supplemental Figure 4(b) and Supplemental Figure B.5, respectively.
WSI classification
Exploring additional vision–language applications
To highlight one potential application utilizing LLM integration, we demonstrate a case prioritization example. We randomly select 200 colon biopsies representing a theoretical case load and use PathAlign-G to return the set, sorted by likely “severity” of the findings. The results and prompt are summarized in the Supplemental Table on slide prioritization. While not perfect, all carcinoma cases are appropriately in the top category, most tubular adenomas and other findings in the second category, and most hyperplastic polyps along with benign biopsies in the lowest risk category, thus highlighting the promising potential to organize or group cases with flexible natural language queries.
5 Discussion
Many important applications in histopathology involve interpretation of WSIs. Leveraging advances in efficient vision–language pretraining (Li et al., 2023) and self-supervised patch-level encoders (Lai et al., 2023), we develop a pathology report aligned WSI-encoder using a real-world dataset of over 350,000 gigapixel WSIs with diagnostic text from associated pathology reports. We evaluate this model for classification, cross-modal retrieval, and generation of text describing pathologic findings.
Our work complements recent efforts on language aligned WSI-encoders such as PathM3 (Zhou et al., 2024), PRISM (Shaikovski et al., 2024) and GigaPath (Xu et al., 2024). Compared to prior work, we explore an alternative method for efficient image–text alignment with WSIs based on the BLIP-2 approach. This enables us to align our WSI-encoder with a pretrained LLM (PaLM-2 S) for generating text from WSIs. While evaluation of the other recent models has been limited to classification tasks, automated scoring, and qualitative review of text generation, we report the first quantitative pathologist evaluation of cross-modal retrieval and text generation.
The text generation evaluations provide several interesting insights. On one hand, they reflect the impressive capability of the domain-specific WSI-encoder to align with a pretrained LLM even when images are gigapixel-sized. The generated texts are generally quite accurate at reflecting important information about the WSI, often showcasing important slide-level capabilities by providing information that requires aggregating information and context from multiple patches. Examples of this include the type of procedure or biopsy, and perhaps more impressively, the concept of low grade versus high grade cervical dysplasia, which is defined in part by the extent of the epithelial thickness that is affected, and thus likely requires contextual information within the slide (example in Table 2).
On the other hand, there are clearly still some shortcomings in the details provided by the generated text, such as specific grades for prostate and breast cancer. We also observe some confabulations, particularly in the specimen type when specimen information cannot be readily inferred from the image (e.g. neck contents, lymph node). This is likely due, in part, to imperfect removal of this type of specimen information when we processed part labels, but also reflects the inherent importance of context when reviewing slides and writing reports. While prompting the model with the specimen information along with the image is one strategy that might reflect real world availability of the part label during slide review, we did not find this to significantly improve text generation in our study. Efforts to more effectively clean training data and to more thoroughly evaluate confabulations and optimize prompting strategies using available metadata are opportunities for future work.
While PathAlign-R performs well on the cancer subtyping tasks (see Table 3), a direct comparison to other image–text pathology models (e.g. CONCH (Lu et al., 2023a), GigaPath (Xu et al., 2024)) cannot be made due to different image splits, as well as our inclusion of TCGA data in training (albeit from different TSS than those used for testing).
While our training datasets are comparable in size to prior work on WSI-level image–text alignment (Xu et al., 2024, Shaikovski et al., 2024), they are relatively small compared to datasets that have been used for image–text alignment from natural images (). In initial experiments, we observed that training on the full set of training data (DS1-Noisy) provided significantly better performance than training on only the cleaner, but smaller train split of DS1-Clean, further supporting the potential for data scaling. Because pathology reports are in principle available for all WSIs that have gone through clinical workflows, we hope to see future work build on our approach with larger-scale datasets of WSIs paired with pathology reports.
5.1 Limitations
A primary challenge in aligning pathology WSIs with diagnostic text is the many-to-one nature of slides to the associated portion of the diagnostic report. In this work, we curated our dataset in a manner that minimized this issue for the validation and test splits, but this also resulted in a relative enrichment in these splits for the types of cases that typically only have one slide per submitted pathology part, such as colon, skin, and cervical biopsies along with other small specimens. Additionally, there are at least some instances for which there are “missing” slides, such that the single slide used for inference may not contain all the information represented in the paired text. Curation of datasets to include only the representative slides for the reported findings and modeling at the level of multiple slides remain as opportunities to further address this issue.
The use of TCGA, while useful for enriching DS1 with cancer cases and providing increased training diversity, also introduces specific limitations. For example, we used available structured metadata to generate “synthetic” diagnostic captions for these images. While an effort was made to diversify the language used for describing any given entity, these captions still do not necessarily represent the reporting style for these types of cancer cases in real clinical reports. For example, an actual diagnostic report for a cancer resection might describe many aspects of the tumor grading, staging, and subtyping in a manner that is more extensive than the available structured metadata from TCGA.
Due to lack of available out-of-distribution datasets for this study, image-to-text retrieval and text generation tasks were only evaluated on in-distribution data for DS1. Experiments to evaluate generalization of this model to diverse data sources are warranted. Analysis on additional tasks such as text-to-image retrieval could also be performed and existing evaluations could be improved by increasing the evaluation set size, including greater diversity of cases and findings, and including a larger number of raters.
5.2 Future work
Future work could explore additional vision–language modeling strategies coupled with different LLMs and further instruction tuning. While we benefited from lower computational costs by using a relatively small number of query vectors, direct patch-to-patch interaction across the WSI is potentially not fully captured in the cross-attention mechanism and might be improved further, such as through efficient self-attention. Modeling at the level of multiple slides across entire parts or cases, higher magnifications, or a pyramid of multiple magnifications could also further enable useful applications.
6 Conclusion
This work demonstrates the novel development of multimodal pathology models using WSIs paired with curated portions of original diagnostic reports along with a pre-trained patch encoder and an LLM. These initial results highlight the potential for WSI-text alignment in a manner that can incorporate the reasoning capabilities of large multimodal models.
Acknowledgements
We thank Wei-Hung Weng, Tiam Jaroensri, and Michael Howell for useful feedback on the manuscript. We thank the Google Research team for software and hardware infrastructure support as well as operations team members involved in the digitization and program management aspects related to this study; especially Melissa Moran, Robert MacDonald, Allen Chai, Robert Nagle, and Josh Pomorski. We also thank Kenneth Philbrick, Liron Yatziv, Can Kirmizi, and Rory Pilgrim for helpful technical discussions and Tiya Tiyasirichokchai for advice on figure design. We acknowledge James Wren and Colin Wageman for data-related discussions and thank the pathologists who reviewed model output for this study. We thank Todd Lilje, Daniel Ward, and the Naval Medical Center San Diego Laboratory and Clinical Investigations Departments for administrative and research support and guidance. N.O. is a military Service member. This work was prepared as part of their official duties. Title 17, U.S.C., §105 provides that copyright protection under this title is not available for any work of the U.S. Government. Title 17, U.S.C., §101 defines a U.S. Government work as a work prepared by a military Service member or employee of the U.S. Government as part of that person’s official duties. The study protocol was approved by the Naval Medical Center San Diego Institutional Review Board in compliance with all applicable Federal regulations governing the protection of human subjects. Support for this study included funding support from Google under NCRADA-16-471. The results shown here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
References
- Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Assran et al. (2022) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. In European Conference on Computer Vision, pages 456–473. Springer, 2022.
- Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005.
- Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.
- Chen et al. (2022) Richard J Chen, Chengkuan Chen, Yicong Li, Tiffany Y Chen, Andrew D Trister, Rahul G Krishnan, and Faisal Mahmood. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16144–16155, 2022.
- Ciga et al. (2021) Ozan Ciga, Tony Xu, Sharon Nofech-Mozes, Shawna Noy, Fang-I Lu, and Anne L Martel. Overcoming the limitations of patch-based learning to detect cancer in whole slide images. Scientific Reports, 11(1):8894, 2021.
- Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Gamper and Rajpoot (2021) Jevgenij Gamper and Nasir Rajpoot. Multiple instance captioning: Learning representations from histopathology textbooks and articles. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16549–16559, 2021.
- Hou et al. (2024) Xinhai Hou, Cheng Jiang, Akhil Kondepudi, Yiwei Lyu, Asadur Zaman Chowdury, Honglak Lee, and Todd C Hollon. A self-supervised framework for learning whole slide representations. arXiv preprint arXiv:2402.06188, 2024.
- Huang et al. (2023) Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas Montine, and James Zou. Leveraging medical twitter to build a visual–language foundation model for pathology ai. bioRxiv, pages 2023–03, 2023.
- Ikezogwo et al. (2024) Wisdom Ikezogwo, Saygin Seyfioglu, Fatemeh Ghezloo, Dylan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay Krishna, and Linda Shapiro. Quilt-1m: One million image-text pairs for histopathology. Advances in Neural Information Processing Systems, 36, 2024.
- Jaume et al. (2024) Guillaume Jaume, Lukas Oldenburg, Anurag Vaidya, Richard J Chen, Drew FK Williamson, Thomas Peeters, Andrew H Song, and Faisal Mahmood. Transcriptomics-guided slide representation learning in computational pathology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9632–9644, 2024.
- Lai et al. (2023) Jeremy Lai, Faruk Ahmed, Supriya Vijay, Tiam Jaroensri, Jessica Loo, Saurabh Vyawahare, Saloni Agarwal, Fayaz Jamil, Yossi Matias, Greg S Corrado, et al. Domain-specific optimization and diverse evaluation of self-supervised models for histopathology. arXiv preprint arXiv:2310.13259, 2023.
- Li et al. (2023) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
- Lin (2004) Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
- Liu et al. (2024) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
- Liu et al. (2018) Jianfang Liu, Tara Lichtenberg, Katherine A Hoadley, Laila M Poisson, Alexander J Lazar, Andrew D Cherniack, Albert J Kovatich, Christopher C Benz, Douglas A Levine, Adrian V Lee, et al. An integrated tcga pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell, 173(2):400–416, 2018.
- Lu et al. (2023a) Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Andrew Zhang, Long Phi Le, et al. Towards a visual-language foundation model for computational pathology. arXiv preprint arXiv:2307.12914, 2023a.
- Lu et al. (2023b) Ming Y Lu, Bowen Chen, Andrew Zhang, Drew FK Williamson, Richard J Chen, Tong Ding, Long Phi Le, Yung-Sung Chuang, and Faisal Mahmood. Visual language pretrained multiple instance zero-shot transfer for histopathology images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19764–19775, 2023b.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Shaikovski et al. (2024) George Shaikovski, Adam Casson, Kristen Severson, Eric Zimmermann, Yi Kan Wang, Jeremy D Kunz, Juan A Retamero, Gerard Oakley, David Klimstra, Christopher Kanan, et al. Prism: A multi-modal generative foundation model for slide-level histopathology. arXiv preprint arXiv:2405.10254, 2024.
- Shen et al. (2022) Yiqing Shen, Yulin Luo, Dinggang Shen, and Jing Ke. RandStainNA: Learning stain-agnostic features from histology slides by bridging stain augmentation and normalization. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 212–221. Springer, 2022.
- Song et al. (2024) Andrew H Song, Richard J Chen, Tong Ding, Drew FK Williamson, Guillaume Jaume, and Faisal Mahmood. Morphological prototyping for unsupervised slide representation learning in computational pathology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11566–11578, 2024.
- Steiner et al. (2021) Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your vit? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.
- Sun et al. (2024a) Yuxuan Sun, Hao Wu, Chenglu Zhu, Sunyi Zheng, Qizi Chen, Kai Zhang, Yunlong Zhang, Xiaoxiao Lan, Mengyue Zheng, Jingxiong Li, et al. PathMMU: A massive multimodal expert-level benchmark for understanding and reasoning in pathology. arXiv preprint arXiv:2401.16355, 2024a.
- Sun et al. (2024b) Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Lin Sun, Zhongyi Shui, Yunlong Zhang, Honglin Li, and Lin Yang. Pathasst: A generative foundation ai assistant towards artificial general intelligence of pathology. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5034–5042, 2024b.
- Tsuneki and Kanavati (2022) Masayuki Tsuneki and Fahdi Kanavati. Inference of captions from histopathological patches. In International Conference on Medical Imaging with Deep Learning, pages 1235–1250. PMLR, 2022.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Xu et al. (2024) Hanwen Xu, Naoto Usuyama, Jaspreet Bagga, Sheng Zhang, Rajesh Rao, Tristan Naumann, Cliff Wong, Zelalem Gero, Javier González, Yu Gu, et al. A whole-slide foundation model for digital pathology from real-world data. Nature, pages 1–8, 2024.
- Yu et al. (2022) Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- Zhang et al. (2023a) Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023a.
- Zhang et al. (2023b) Yunkun Zhang, Jin Gao, Mu Zhou, Xiaosong Wang, Yu Qiao, Shaoting Zhang, and Dequan Wang. Text-guided foundation model adaptation for pathological image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 272–282. Springer, 2023b.
- Zhou et al. (2024) Qifeng Zhou, Wenliang Zhong, Yuzhi Guo, Michael Xiao, Hehuan Ma, and Junzhou Huang. Pathm3: A multimodal multi-task multiple instance learning framework for whole slide image classification and captioning. arXiv preprint arXiv:2403.08967, 2024.
- Zuo et al. (2023) Jialong Zuo, Changqian Yu, Nong Sang, and Changxin Gao. Plip: Language-image pre-training for person representation learning. arXiv preprint arXiv:2305.08386, 2023.
Appendix A Supplemental methods
A.1 Creating image–text pairs
As is typical for pathology reports, each case in DS1 has an associated diagnostic text, corresponding to the final diagnosis (i.e. “bottom line” text) from the pathology report (see Supplemental Figure B.2). These diagnostic texts are structured into part-level text sections. Because each case may have several different parts, we parse these to the part-level via regular expressions. For each part-level text section, there is a label (description of tissue site or surgical procedure) and a finding (description of diagnostic findings). Because the label text is typically based on the specimen processing and preparation, it often includes information such as anatomic location and laterality that may not be inferable from the WSI alone. Occasionally the findings section also includes this type of information. As such, we further apply a set of regular expressions to remove information from both the labels and findings that cannot reliably be determined from the WSI. Additionally, as diagnostic reporting in pathology often exhibits a common style across pathologists, many free text descriptions for different slides will be the same when the findings are essentially the same.
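The part-level parsing and clean-up described above can be sketched as follows. This is only an illustration: the regular expressions actually used are not given in the text, so `PART_PATTERN`, `LATERALITY`, and the assumed "A. label: finding" layout are hypothetical stand-ins.

```python
import re

# Hypothetical layout: one part per line, e.g. "A. Breast, left, core biopsy: Finding."
PART_PATTERN = re.compile(r"^\s*([A-Z])[.)]\s*(.+?)\s*:\s*(.+?)\s*$", re.MULTILINE)
# Hypothetical example of information that cannot be inferred from the WSI alone.
LATERALITY = re.compile(r"\b(left|right|bilateral)\b,?\s*", re.IGNORECASE)


def clean(text):
    """Remove laterality terms, collapse whitespace, and lowercase the text."""
    text = LATERALITY.sub("", text)
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip(" ,").lower()


def parse_parts(diagnostic_text):
    """Split a case-level diagnostic text into (label, finding) pairs, one per part."""
    return [(clean(label), clean(finding))
            for _, label, finding in PART_PATTERN.findall(diagnostic_text)]


report = ("A. Breast, left, core biopsy: Invasive ductal carcinoma.\n"
          "B. Colon, biopsy: Tubular adenoma.")
for label, finding in parse_parts(report):
    print(f"{label} : {finding}")
```

In practice the set of removal patterns would need to cover anatomic location as well as laterality, and would be tuned to the reporting style of the source institution.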
Within the framework of the three categories established for part-to-slide mappings, it is also possible that the pathologist reviewed more slides than those that were archived and subsequently digitized, so even category 1 has some possibility of the text not corresponding specifically to the image. While we believe this possibility to be a rare occurrence in DS1, complete accessioning metadata from each case was not available, so formal quantification of “missing slides” could not be performed.
For TCGA, although pathology reports are available (as PDFs), they are submitted from a variety of source sites with significant variability in structure and detail. Additionally, they do not specify which portion of the report corresponds to the available images. Instead of parsing these heterogeneous reports, we utilized structured case-level metadata available for TCGA (Liu et al., 2018) to generate a basic description in the label : finding format, analogous to the typical structure of part-level text in DS1. Specifically, we used the tissue type, histological type, and histologic grade (when available) to generate captions such as bladder, resection : histological type: invasive urothelial carcinoma; tumor grade: high grade. We associate each slide with a caption generated from the metadata in this way. For TCGA, the possibility that a given slide in isolation is not representative of this metadata is at least partially mitigated because typically only one or two diagnostic slides per case were submitted and these slides were selected to be representative of the case-level diagnosis and to have substantial tumor content.
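A minimal sketch of the metadata-to-caption step, reproducing the example caption above. The function name and argument names are hypothetical, and treating the procedure as an explicit argument (defaulting to `resection`) is an assumption; the text lists only tissue type, histological type, and histologic grade as inputs.

```python
def tcga_caption(tissue, histological_type, grade=None, procedure="resection"):
    """Build a 'label : finding' caption from structured case-level metadata."""
    label = f"{tissue}, {procedure}"
    finding = f"histological type: {histological_type}"
    if grade:  # grade is only included when it is available
        finding += f"; tumor grade: {grade}"
    return f"{label} : {finding}".lower()


print(tcga_caption("bladder", "invasive urothelial carcinoma", "high grade"))
```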
A.2 Patch sampling
Tissue masks were generated via a sequence of image processing operations: transforming RGB images to HSV space, performing morphological operations to group together connected regions, thresholding on pixel intensity, and post-processing with an erosion operation to remove remaining noise. Using these tissue masks, we identify all tissue-containing patches of size 224 × 224 pixels, with a stride of 192 pixels (32 pixels of overlap). We use at most 10,240 patches per WSI (sampling without replacement when the total number of tissue patches exceeds this count).
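The grid extraction and capped sampling can be sketched as below. The patch edge of 224 pixels is inferred from the stated stride (192) plus overlap (32); the `contains_tissue` callback is a hypothetical stand-in for a lookup into the tissue mask, whose exact inclusion criterion is not specified.

```python
import random

PATCH = 224           # assumed edge length: stride 192 + 32 px overlap
STRIDE = 192
MAX_PATCHES = 10_240  # cap on patches per WSI


def patch_origins(height, width, patch=PATCH, stride=STRIDE):
    """Top-left corners (y, x) of a regular grid of patches that fit in the image."""
    return [(y, x)
            for y in range(0, height - patch + 1, stride)
            for x in range(0, width - patch + 1, stride)]


def sample_tissue_patches(origins, contains_tissue, max_patches=MAX_PATCHES, seed=0):
    """Keep tissue-containing patches; subsample without replacement above the cap."""
    tissue = [o for o in origins if contains_tissue(o)]
    if len(tissue) <= max_patches:
        return tissue
    return random.Random(seed).sample(tissue, max_patches)


origins = patch_origins(1000, 1000)
kept = sample_tissue_patches(origins, lambda origin: True, max_patches=10)
print(len(origins), len(kept))
```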
A.3 Modeling: image–text alignment
To avoid false negatives from similar diagnostic findings within training batches in the image–text contrastive (ITC) loss and image–text matching (ITM) loss, we mask out negative pairs $(\text{image}_i, \text{text}_j)$, $i \neq j$, with high text similarity between $\text{text}_i$ and $\text{text}_j$, using text embeddings from the Universal Sentence Encoder model (Cer et al., 2018). We threshold on a cosine similarity of 0.985, which typically requires very high similarity with only slight differences in syntax or word choice (see Supplemental Table C.2 for examples). To further reduce the impact of false negatives, we did not use hard-negative mining for the ITM loss.
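The masking step can be illustrated for the image-to-text direction of a contrastive loss. This is a plain-Python sketch under stated assumptions, not the training implementation: the toy embeddings stand in for Universal Sentence Encoder outputs, and masked negatives are simply dropped from the softmax denominator.

```python
import math

THRESHOLD = 0.985  # cosine-similarity threshold stated in the text


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def false_negative_mask(text_embs, threshold=THRESHOLD):
    """mask[i][j] is True when the pair (image_i, text_j) is kept in the loss."""
    n = len(text_embs)
    return [[i == j or cosine(text_embs[i], text_embs[j]) <= threshold
             for j in range(n)] for i in range(n)]


def masked_itc_image_to_text(logits, mask):
    """Mean cross-entropy over images, excluding masked negatives from the softmax."""
    losses = []
    for i, row in enumerate(logits):
        kept = [z for z, keep in zip(row, mask[i]) if keep]
        m = max(kept)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(z - m) for z in kept))
        losses.append(log_denom - row[i])
    return sum(losses) / len(losses)


text_embs = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]  # texts 0 and 1 are near-duplicates
mask = false_negative_mask(text_embs)
loss = masked_itc_image_to_text([[0.0] * 3 for _ in range(3)], mask)
print(mask[0], round(loss, 4))
```

The positive pair on the diagonal is always kept, so near-duplicate texts only ever remove negatives.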
For PathAlign-R, we chose not to use ITM reranking during retrieval for practical considerations, with efficiency in terms of potential model-serving and typical client-end API use in mind. When using the cosine similarity between contrastive embeddings for ranking, we found that training with the ITC loss alone and using a single learnable query in the Q-Former worked better for retrieval. For text generation (PathAlign-G), we found that the standard BLIP-2 approach of training a stage 1 model including all losses worked best, along with 32 learnable query vectors. Hyper-parameter settings are provided in Supplemental Table C.3. All model selection choices were made using the validation set.
A.4 Evaluation details
A.4.1 Pathologist evaluation
A subset of 120 WSIs from the DS1 test set was selected for evaluation by pathologists. For subset selection, the test set was first divided into two categories: (1) the most common specimen types (colon, rectum, cervix, and skin biopsies) and (2) other specimen types. Then, 60 images were sampled from each of these two categories. For each WSI, pathologists were presented with the image in a web-based digital pathology viewer along with five retrieved diagnostic texts (from PathAlign-R), a generated text (from PathAlign-G), and the original diagnostic text. Texts were provided in random order and pathologists were blinded to the source of each (i.e. retrieved, generated, original) to avoid any bias in interpretation. At evaluation, 5 images were dropped because pathologist review indicated poor image quality () or a need for immunohistochemistry (IHC) () for confident interpretation.
A.4.2 Evaluating image-to-text retrieval
For image-to-text retrieval, PathAlign-R takes a WSI and a corpus of texts as input, and scores the texts according to their embedding similarity with the input image. The text corpus comprises all unique diagnostic texts from WSI-text pairs in the DS1 test set ( unique diagnostic texts). Similarity scoring was performed using cosine similarity between embeddings for the input WSI and the diagnostic text. For measuring top-K retrieval accuracy, we consider a retrieved text accurate if it received a rating of at least 4 (i.e. mostly accurate without clinically significant error or omission). To address the fact that retrieval results could be influenced by the frequency of similar cases in the corpus, we limited retrieval to unique diagnostic texts (i.e. duplicates removed before retrieval), and we also performed sub-analysis on the “common” and “less common” specimen types as defined in prior sections.
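A small sketch of this retrieval and scoring procedure, using toy two-dimensional embeddings in place of model outputs. Counting an image as a top-K success when at least one of its K highest-ranked texts is rated 4 or above is an assumption about how per-text ratings are aggregated; function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrieve(image_emb, corpus, k=5):
    """corpus: list of (text, embedding). Duplicate texts are removed before ranking."""
    seen, unique = set(), []
    for text, emb in corpus:
        if text not in seen:
            seen.add(text)
            unique.append((text, emb))
    ranked = sorted(unique, key=lambda pair: cosine(image_emb, pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def top_k_accuracy(ratings_per_image, k, min_rating=4):
    """ratings_per_image[i] holds pathologist ratings of image i's texts, in rank order."""
    hits = sum(any(r >= min_rating for r in ratings[:k]) for ratings in ratings_per_image)
    return hits / len(ratings_per_image)


corpus = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("a", [1.0, 0.0]), ("c", [1.0, 1.0])]
top2 = retrieve([1.0, 0.1], corpus, k=2)
ratings = [[2, 4, 5], [1, 2, 5], [3, 3, 3]]
print(top2, top_k_accuracy(ratings, k=2), top_k_accuracy(ratings, k=3))
```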
A.4.3 WSI classification
To perform classification, PathAlign-R is given a WSI and a class is assigned based on similarity of the image embeddings to the text embeddings for classes of interest, using a curated set of texts per class. This can be thought of as a highly constrained version of image-to-text retrieval, except that instead of scoring the similarity between the model’s WSI embedding and embeddings for the corpus of texts, the scoring is done using the average similarity between the WSI embeddings and the set of texts associated with each class. This approach is often referred to as zero-shot classification in the literature, but since the concepts in these tasks are contained in the training data, we refer to this simply as WSI classification here.
For the three subtyping tasks, texts for each class were curated to represent common diagnostic texts for WSIs of each subtype in the training set (identified via regular expression matching). For the procedure classification task, the biopsy class is represented by just the single word biopsy, while the resection class is represented by resection as well as a variety of tissue-specific resection types (e.g. lobectomy, mastectomy, nephrectomy, etc.; see Supplemental Table C.7). WSIs for the procedure classification task were selected by randomly sampling 250 WSIs with a diagnostic text containing the word biopsy and selecting all WSIs containing any of the resection texts () from the DS1 test set. We evaluate classification performance with both macro-averaged AUC (using average similarity directly) as well as balanced accuracy (taking the max similarity score across classes). Confidence intervals were computed via bootstrapped resampling with replacement over 1000 replicates.
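The classification rule and the bootstrap procedure can be sketched as follows. The embeddings are toy values, the function names are hypothetical, and the percentile form of the confidence interval is an assumption (the text states only that resampling with replacement over 1000 replicates was used).

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def class_scores(image_emb, class_text_embs):
    """Average similarity between the WSI embedding and each class's text embeddings."""
    return {c: sum(cosine(image_emb, e) for e in embs) / len(embs)
            for c, embs in class_text_embs.items()}


def predict(image_emb, class_text_embs):
    scores = class_scores(image_emb, class_text_embs)
    return max(scores, key=scores.get)  # class with the highest average similarity


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def bootstrap_ci(y_true, y_pred, metric, n_rep=1000, alpha=0.05, seed=0):
    """Percentile interval from resampling cases with replacement (assumed form)."""
    rng, n, stats = random.Random(seed), len(y_true), []
    for _ in range(n_rep):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_rep)], stats[int((1 - alpha / 2) * n_rep) - 1]


class_texts = {"biopsy": [[1.0, 0.0]], "resection": [[0.0, 1.0], [0.1, 1.0]]}
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
low, high = bootstrap_ci(y_true, y_pred, balanced_accuracy)
print(predict([0.9, 0.2], class_texts), balanced_accuracy(y_true, y_pred), low, high)
```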
A.5 Automatic evaluation
During model development, we used automatic evaluations for cross-modal retrieval and diagnostic text generation to guide hyper-parameter selection and modeling choices. The methods we used and the test set results for the final models are described below.
Cross-modal retrieval
We performed cross-modal retrieval analysis for image–text pairs at the WSI-level, a task with implications for finding and curating cases of interest across educational, research, and clinical workflows.
In the image-to-text retrieval setting, the model is given a WSI and tasked with scoring a corpus of diagnostic texts according to how relevant they are for the given WSI. In our case, the text corpus consists of all unique diagnostic texts from WSI-text pairs in the validation dataset. The scoring is done using cosine-similarity between the model’s embedding for the input WSI and the model’s embeddings for all diagnostic texts in the corpus. Because there may be many texts that accurately match the image, a key challenge in evaluating this type of retrieval is determining the full set of diagnostic texts in the corpus that match with the input WSI, which is required for standard retrieval metrics such as MAP, NDCG and top-K accuracy.
Since our datasets consist of (WSI, diagnostic text) pairs, we have one known ground-truth diagnostic text match. However, there are potentially many other texts in the corpus that describe the same diagnostic finding in different ways. To estimate the set of matching diagnostic texts, we compute the cosine similarity between embeddings from the ground-truth text and all other texts in the corpus using the Universal Sentence Encoder model (Cer et al., 2018). Any text with a cosine similarity score above a threshold of 0.985 was considered a match. This threshold was manually tuned to be as low as possible without including false positive matches (on the validation set). However, because of this high similarity threshold, and because the Universal Sentence Encoder does not always map syntactically different yet diagnostically equivalent texts to similar embeddings, not all true-positive matches are included (see examples in Table C.6). This is a limitation of this automated yet large-scale retrieval analysis, which we address through evaluations performed by a human expert reviewing both images and text.
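The estimated match set and the resulting ranking metrics can be sketched as below. The toy vectors stand in for Universal Sentence Encoder embeddings, the function names are hypothetical, and average precision is shown as one representative metric (MAP is its mean over queries).

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def matching_texts(gt_index, text_embs, threshold=0.985):
    """Corpus indices treated as correct matches for the ground-truth text."""
    gt = text_embs[gt_index]
    return {j for j, emb in enumerate(text_embs)
            if j == gt_index or cosine(gt, emb) > threshold}


def average_precision(ranked, relevant):
    """ranked: corpus indices ordered by model score; relevant: set of matching indices."""
    hits, total = 0, 0.0
    for rank, idx in enumerate(ranked, start=1):
        if idx in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def top_k_hit(ranked, relevant, k):
    return any(idx in relevant for idx in ranked[:k])


sentence_embs = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
relevant = matching_texts(0, sentence_embs)
ranked = [2, 0, 1]  # hypothetical model ranking of the corpus for one WSI
print(relevant, average_precision(ranked, relevant), top_k_hit(ranked, relevant, 1))
```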
In the text-to-image retrieval setting, the model is given a diagnostic text and tasked with scoring a corpus of WSIs according to how relevant they are to the diagnostic text. The retrieval analysis was performed analogously to the image-to-text setting, with the set of matching WSIs estimated using similarity between the input diagnostic text and the diagnostic texts associated with each WSI in the corpus of WSIs.
Results are summarized in Table C.4.
Image-to-image retrieval
We also evaluated image-to-image retrieval, i.e. the problem of finding images with similar associated diagnostic text. This was done analogously to cross-modal retrieval, except that the input image is excluded from the set of matching images, and input images that do not have any matching images (i.e. there are no other images in the image corpus where the similarity between associated texts is above the required threshold) were excluded from analysis. For automatic evaluation of image-to-image retrieval, the similarity score between the original texts for the input and the retrieved images was calculated, again using a threshold of 0.985 for defining accurate retrieval. Results are summarized in Table C.6, where they are presented in comparison to performance using the averaged patch-level embeddings from the domain-specific patch encoder (PathSSL) that we used to embed patches for our model.
Text generation
Text generation was evaluated automatically by computing ROUGE-L (Lin, 2004) and METEOR (Banerjee and Lavie, 2005) scores between original and generated diagnostic text. Results are summarized in Table C.5.
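For reference, the ROUGE-L F-measure is based on the longest common subsequence (LCS) between the original and generated text. The sketch below assumes lowercase whitespace tokenization and equal weighting of precision and recall; the tokenizer and weighting used for the reported scores are not specified, so values from this sketch may differ from those of a standard package.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f(reference, candidate):
    """ROUGE-L F-measure with equal weight on precision and recall."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


original = "colon, biopsy : tubular adenoma."
generated = "colon, biopsy : hyperplastic polyp."
print(rouge_l_f(original, original), rouge_l_f(original, generated))
```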
Appendix B Supplementary figures
![Refer to caption](x5.png)
![Refer to caption](x6.png)
![Refer to caption](extracted/5695862/figures/tethys_seqlen_hist.png)
![Refer to caption](extracted/5695862/figures/tcga_seqlen_hist.png)
![Refer to caption](x7.png)
![Refer to caption](x8.png)
![Refer to caption](extracted/5695862/figures/rating-conf-pathologists.png)
![Refer to caption](x9.png)
![Refer to caption](x10.png)
Appendix C Supplementary tables
Part Label | Percentage | Part Label | Percentage |
colon, biopsy | 6.19 | endocervical curettings | 0.94 |
skin, shave biopsy | 5.49 | cervix, leep | 0.91 |
skin, excisional biopsy | 4.17 | breast, mastectomy | 0.83 |
cervix, biopsy | 3.72 | prostate, prostatectomy | 0.83 |
lymph node, excision | 2.43 | breast, lumpectomy | 0.83 |
skin, punch biopsy | 2.3 | duodenum, biopsy | 0.78 |
cervix | 2.2 | placenta | 0.74 |
cervical biopsy | 1.77 | appendix, appendectomy | 0.57 |
rectum, biopsy | 1.48 | endometrial biopsy | 0.56 |
skin, biopsy | 1.41 | gallbladder, cholecystectomy | 0.56 |
colon, polypectomy | 1.12 | endocervical curettage | 0.53 |
esophagus, biopsy | 1.06 | prostate, biopsy | 0.51 |
stomach, biopsy | 0.98 | ||
Threshold | Text_i | Text_j | Score_ij |
< 0.985 | soft tissue, supraclavicular region, biopsy : classical hodgkin lymphoma. | cervix : biopsy: high grade squamous intraepithelial lesion (cin-ii). | 0.619 |
< 0.985 | colon, cecum, polypectomy : fragments of tubulovillous adenoma. | colon, polypectomy : tubular adenoma. | 0.681 |
< 0.985 | cervical polyp biopsy : low-grade squamous intraepithelial lesion (cin i). | cervix, biopsy : benign squamous mucosa, no transformation zone identified. | 0.771 |
< 0.985 | skin, excisional biopsy : neurofibroma. | skin, excisional biopsy : dermatofibroma. | 0.953 |
< 0.985 | colon, biopsy : hyperplastic polyp. | colon, biopsy : adenomatous polyp. | 0.957 |
< 0.985 | cervix, biopsy : high-grade squamous intraepithelial lesion. | cervix, biopsy : low-grade squamous intraepithelial lesion. | 0.970 |
< 0.985 | uterine cervix, biopsy : benign cervical tissue, no dysplasia identified. | cervix : biopsy: benign cervical tissue, no dysplasia identified. | 0.978 |
> 0.985 | skin, excisional biopsy : epidermal inclusion cyst, excised. | skin, excisional biopsy : epidermal inclusion cyst. | 0.990 |
> 0.985 | skin, shave biopsy : ulcerated sclerosing basal cell carcinoma; extends to the base of the biopsy. | skin, shave biopsy : basal cell carcinoma, extends to the base of the biopsy. | 0.989 |
> 0.985 | cervix, biopsy : low grade squamous intraepithelial lesion (cin-i). | cervix : biopsy: low grade squamous intraepithelial lesion. | 0.989 |
> 0.985 | colon, biopsy : chronic active colitis, no dysplasia identified. | colon, biopsy : chronic active colitis, mild. -no dysplasia identified. | 0.987 |
> 0.985 | appendix, appendectomy : acute suppurative appendicitis. | vermiform appendix, appendectomy : acute suppurative appendicitis. | 0.987 |
> 0.985 | colon polyp, biopsy : adenomatous polyp. | colon, polyp, excisional biopsy : adenomatous polyp. | 0.987 |
> 0.985 | terminal ileum, biopsy : small bowel mucosa with no pathologic diagnosis. | ileum, terminal, biopsy : small bowel mucosa with no pathologic diagnosis. | 0.986 |
Shared: Stage 1 | |
Initialization | Random |
ITC/ITM false-negative masking | Yes |
Q-Former query dimension | 192 |
Q-Former intermediate dimension | 3072 |
ITC projection layer dimension | 128 |
Learning rate scheduler | Linear warmup + cosine decay |
Learning rate | |
Weight decay | 0.05 |
AdamW β1, β2 | 0.9, 0.998 |
Linear-warmup steps | 2000 |
Maximum training steps (w/ early stopping) | 100000 |
Batch-size | 1024 |
Learnable contrastive temperature | Yes |
PathAlign-R: Stage 1 | |
Learnable queries | 1 |
ITC, ITM, ITG loss coefficients | 1.0, 0.0, 0.0 |
Initial contrastive temperature | 0.01 |
PathAlign-G: Stage 1 | |
Number of learnable query vectors | 32 |
ITC, ITM, ITG loss-coefficients | 1.0, 0.5, 1.0 |
ITM false-negative masking | Yes |
Learning rate | 1e-4 |
Initial contrastive temperature | 0.07 |
PathAlign-G: Stage 2 | |
Optimizer | Adafactor with Adam |
Learning rate | |
Adam β1, β2 | 0.9, 0.999 |
Warmup steps | 1000 |
Weight decay | |
Maximum training steps (w/ early stopping) | 200000 |
Batch size | 64 |
Gradient-clipping norm | 10.0 |
LLM | PaLM-2 S |
Decoding | greedy |
Dataset | Query | Corpus | MAP | NDCG | Top-1 | Top-5 | Top-10 |
DS1 | Text | Image | 0.22 | 0.43 | 0.16 | 0.41 | 0.57 |
DS1 | Image | Text | 0.21 | 0.37 | 0.10 | 0.33 | 0.48 |
TCGA | Text | Image | 0.49 | 0.76 | 0.58 | 0.87 | 0.90 |
TCGA | Image | Text | 0.50 | 0.62 | 0.34 | 0.73 | 0.84 |
Dataset | ROUGE-L (F-Measure) | METEOR |
DS1 | 0.579 | 0.612 |
Dataset | Model | MAP | NDCG | Top-1 | Top-5 | Top-10 |
DS1 | PathSSL | 0.23 | 0.47 | 0.29 | 0.48 | 0.57 |
DS1 | PathAlign-R | 0.26 | 0.50 | 0.27 | 0.52 | 0.63 |
DS1 | PathSSL | 0.29 | 0.67 | 0.62 | 0.83 | 0.88 |
DS1 | PathAlign-R | 0.40 | 0.72 | 0.60 | 0.84 | 0.89 |
Task | Task prefix | Class | Class prefix |
NSCLC Subtyping (TCGA) | | | |
RCC Subtyping (TCGA) | | | |
BRCA Subtyping (TCGA) | | | |
Procedure type (DS1) | | | |
Rating | Description and instructions |
1 | Completely inaccurate – May describe something that can occur in the specimen/tissue type pictured, but fundamentally incorrect, or may be the wrong tissue type or concept altogether. |
2 | Partially accurate (i.e. related but wrong) – The text might describe an entity that is related to the image, and occurring in that specimen type, but the image is definitively a different diagnostic entity. – May accurately describe something that is seen on the image, but additional, essential info is missing or incorrect. |
3 | Mostly accurate with clinically significant error/omission – The text is a good match/description for the image, but something minor is incorrect or missing that may have clinical or diagnostic implications. |
4 | Mostly accurate without clinically significant error/omission – The text is a very good match/description for the image, but there may be a minor, clinically insignificant aspect that is incorrect or missing. For example, the diagnosis is accurate and acceptable, but doesn’t capture all of the details. |
5 | Highly accurate – The text is a great description of the image, with no obvious information missing or incorrect. – Note that even a very short summary or a description of “no pathologic findings” can still belong in this rating. |
Cannot Interpret | Please provide a very brief comment regarding the issue and/or what additional info you would need. If you can interpret the image to some extent, but need IHC or other studies to be more confident, please still provide a score based on your best interpretation of the available image and provide details in the comments. |
Category | Generated text rating | Original text rating |
AI preferred | 4 | 3 |
Both ok, AI preferred | 5 | 4 |
Both ok, same rating | 4 | 4 |
Both ok, same rating | 5 | 5 |
Both ok, original preferred | 4 | 5 |
Original preferred | 3 | 4 |
Both with errors or omissions | 3 | 3 |
Original findings | AI prioritization | Count |
tubular adenoma with high grade dysplasia and focal intramucosal carcinoma. | 3 | 1 |
invasive moderately differentiated adenocarcinoma of the colon. | 3 | 1 |
invasive poorly differentiated colonic adenocarcinoma. | 3 | 1 |
fragment of ulcer debris. - no colonic mucosa identified. | 3 | 1 |
adenomatous polyp. | 2 | 26 |
tubular adenoma. | 2 | 5 |
multiple fragments of flat and polypoid colonic mucosa with adenomatous epithelium, consistent with multiple colonic adenomas. | 2 | 1 |
adenomatous polyp with focal high grade dysplasia and trauma related changes. | 2 | 1 |
mild active colitis, no evidence of chronicity. | 2 | 1 |
adenomatous polyps. | 2 | 1 |
essentially unremarkable colonic mucosa. | 2 | 1 |
tubular adenoma. - negative for high grade dysplasia or carcinoma. | 2 | 1 |
tubular adenoma fully excised in the sections examined. | 2 | 1 |
chronic colitis with moderate to severe activity. no dysplasia identified. | 2 | 1 |
adenomatous polyp, low grade. | 2 | 1 |
adenomatous polyp (tubular adenoma). electrocautery margin appears uninvolved. | 2 | 1 |
fragments of adenomatous polyp. | 2 | 1 |
active chronic colitis with crypt abscess. | 2 | 1 |
colonic mucosa with no pathologic diagnosis. | 1 | 16 |
colonic mucosa with no significant pathologic abnormality. | 1 | 14 |
hyperplastic polyp. | 1 | 12 |
unremarkable colonic mucosa. | 1 | 8 |
essentially unremarkable colonic mucosa. | 1 | 7 |
benign colonic mucosa with no significant pathologic abnormality. | 1 | 6 |
colonic mucosa with no significant pathologic changes. | 1 | 5 |
tubular adenoma. | 1 | 5 |
colonic mucosa with no diagnostic alteration. | 1 | 4 |
chronic active colitis. | 1 | 4 |
colonic mucosa with no pathologic diagnosis; negative for dysplasia. | 1 | 4 |
colonic mucosa with no significant microscopic abnormality. | 1 | 3 |
polypoid fragment of benign colonic mucosa. | 1 | 2 |
no significant abnormalities. | 1 | 2 |
benign colonic mucosa with no diagnostic abnormality. | 1 | 2 |
benign colonic mucosa with no diagnostic alteration. no dysplasia identified. | 1 | 2 |
colonic mucosa overlying lymphoid aggregates otherwise no significant microscopic abnormality. | 1 | 2 |
fragments of benign colonic mucosa. | 1 | 2 |
unremarakble colonic mucosa. | 1 | 2 |
colonic mucosa with no diagnostic alteration; negative for dysplasia. | 1 | 2 |
active colitis with non-necrotic granulomas and features of remote and persistent injury. | 1 | 2 |
chronic inactive colitis. | 1 | 2 |
sessile serrated polyp. | 1 | 2 |
tubular adenomas (2). - negative for high grade dysplasia or carcinoma. | 1 | 1 |
tubular adenoma with surface cautery artifact. | 1 | 1 |
diminutive adenomatous polyp. | 1 | 1 |
focal active colitis. | 1 | 1 |
chronic inactive colitis. - no dysplasia or granulomas identified. | 1 | 1 |
benign colonic mucosa with no significant microscopic abnormality. | 1 | 1 |
colonic quiescent colitis with hyperplastic change; negative for dysplasia. | 1 | 1 |
polypoid fragments of benign colonic mucosa. | 1 | 1 |
benign colonic mucosa with rare clusters of neutrophils in the lamina propria. - no chronic architectural changes, granulomas or dysplasia identified. | 1 | 1 |
unremarkable colonic mucosa with increased eosinophils; likely due to medication. | 1 | 1 |
tubular adenoma. - negative for high grade dysplasia or carcinoma. | 1 | 1 |
no diagnostic alteration. | 1 | |
quiescent colitis with focal hyperplasia. no dysplasia identified. | 1 | 1 |
colonic mucosa and fibroadipose submucosa with no pathologic diagnosis. | 1 | 1 |
features consistent with submucosal lipoma. | 1 | 1 |
tubular adenoma. - no high grade dysplasia or carinoma identified. | 1 | 1 |
consistent with hyperplastic polyps (2). | 1 | 1 |
benign colonic mucosa with no significant pathology. | 1 | 1 |
benign colonic mucosa with no pathologic diagnosis. | 1 | 1 |
colonic mucosa with mild crypt architectural distortion; no dysplasia identified. | 1 | 1 |
inactive chronic crypt destructive colitis without granulomas; no dysplasia identified. | 1 | 1 |
hyperplastic polyps (2). fragments of unremarkable colonic mucosa (3). | 1 | 1 |
benign colonic mucosa with hyperplastic change. | 1 | 1 |
benign polypoid fragment of colonic mucosa with no microscopic abnormality. | 1 | 1 |
adenomatoid polyp. | 1 | 1 |
colonic mucosa with no pathologic changes. | 1 | 1 |
tubular adenoma(s). | 1 | 1 |
colonic mucosa with glandular architectural changes consistent with chronic inactive colitis. - negative for dysplasia. | 1 | 1 |
hyperplastic polyp. colonic mucosa with lymphoid aggregate formation. | 1 | 1 |
fragments of unremarkable colonic mucosa. | 1 | 1 |
benign colonic mucosa with focal hyperplastic changes. | 1 | 1 |
tubular adenoma. - benign colonic mucosa. | 1 | 1 |
colonic mucosa with benign lymphoid aggregates and no pathologic diagnosis. | 1 | 1 |
benign colonic mucosa with prominent lymphoid aggregate. | 1 | 1 |
fragments of colonic mucosa with hyperplastic change. | 1 | 1 |
polypoid fragment of colonic mucosa with lamina propria edema, fibrosis and mild chronic inflammation. | 1 | 1 |
portions of colonic mucosa with no significant microscopic abnormalities. | 1 | 1 |
focal acute inflammation. | 1 | 1 |
colonic mucosa with mild architectural disorder. - negative for dysplasia. | 1 | 1 |
fragments of colonic mucosa with no significant pathologic changes. | 1 | 1 |
portions of colonic mucosa with pigmented macrophages in the lamina propria. | 1 | 1 |
benign colonic mucosa with no pathologic abnormality. | 1 | 1 |
unremarkable colonic/rectal mucosa. | 1 | 1 |
Study | Split | TSS code | #Cases | #Slides |
ACC | train | OR (one predominant TSS for this study and split) | 82 | 201 |
validation | PA, P6 | 4 | 2 | |
test | PK, OU | 6 | 24 | |
BLCA | train | XF, ZF, DK, FD, BT | 206 | 188 |
validation | FJ, SY, E5, 5N, K4, 2F, LT, GU, BL, H4, E7, CU, LC, R3, UY | 98 | 117 | |
test | G2, S5, YF, 4Z, CF, YC, HQ, FT, PQ, GV, GD, KQ, C4, MV, GC | 108 | 153 | |
BRCA | train | A2, E2, AR, A8, D8, BH | 574 | 605 |
validation | PE, XX, AQ, HN, UU, MS, PL, A1, EW, GM, 5T, GI, AN, W8, AC, OK, B6 | 250 | 238 | |
test | OL, 4H, LL, LQ, WT, S3, 3C, UL, Z7, V7, JL, E9, C8, A7, LD, 5L, AO | 273 | 284 | |
CESC | train | VS, EK, C5 | 146 | 116 |
validation | LP, R2, RA, XS, 4J, BI, HM, EX, PN, ZX, IR, 2W, DR, DS, WL, JX, ZJ, HG, GH | 77 | 74 | |
test | FU, DG, Q1, UC, MU, MY, EA, MA, JW | 84 | 89 | |
CHOL | train | W5 | 21 | 18 |
CHOL | validation | ZU, 3X, 4G, ZD | 12 | 9 |
CHOL | test | YR, ZH, W6, WD | 12 | 12 |
COAD | train | AA, A6 | 225 | 781 |
COAD | validation | AU, QG, RU, SS, QL, DM, AY, D5, 4T, F4, WS, 3L, AD | 107 | 266 |
COAD | test | CK, CM, AM, 4N, G4, CA, AZ, 5M, NH, T9 | 125 | 363 |
DLBC | train | FF, FA | 12 | 16 |
DLBC | validation | RQ, G8 | 6 | 6 |
DLBC | test | GS, VB, FM, GR | 7 | 10 |
ESCA | train | LN, L5 | 86 | 59 |
ESCA | validation | V5, M9, ZR, R6, XP, IC, L7, Q9, X8, RE, VR, KH | 49 | 38 |
ESCA | test | JY, S8, IG, 2H, Z6 | 50 | 50 |
GBM | train | 06, 12, 02 | 307 | 564 |
GBM | validation | 15, 4W, 87, 26, 76, 28, 41, 19 | 133 | 100 |
GBM | test | 14, 32, 81, 27, 16, RR, OX, 74, 08 | 155 | 197 |
HNSC | train | CQ, CV, CN | 249 | 234 |
HNSC | validation | RS, 4P, BB, IQ, C9, DQ, P3, UF, MZ, HL, H7, KU | 100 | 97 |
HNSC | test | T3, HD, D6, T2, BA, MT, QK, TN, CX, WA, UP, F7 | 125 | 141 |
KICH | train | KL, UW | 49 | 37 |
KICH | validation | KN, NP | 26 | 26 |
KICH | test | KO, KM | 38 | 23 |
KIRC | train | BP, B0 | 249 | 251 |
KIRC | validation | A3, T7, DV, B2, MW, 6D, GK, G6, 3Z, EU, CW, MM, B8 | 141 | 127 |
KIRC | test | AS, AK, CZ, CJ, B4 | 147 | 147 |
KIRP | train | A4, 5P, B9, UZ, SX, BQ, 2Z | 152 | 150 |
KIRP | validation | F9, HE, IZ, B1, UN, P4, IA, WN, DW, AT, O9, PJ, 4A | 65 | 61 |
KIRP | test | AL, Y8, MH, Q2, V9, G7, EV, GL, 2K, B3, DZ, KV, J7 | 73 | 87 |
LGG | train | HT, S9, FG, DU | 279 | 487 |
LGG | validation | HW, FN, KT, WY, E1, WH, VM, IK, TM, QH, VW | 107 | 147 |
LGG | test | CS, DH, F6, DB, VV, P5, RY, TQ, EZ, R8, W9 | 129 | 177 |
LIHC | train | G3, DD | 184 | 187 |
LIHC | validation | O8, BC, RG, YA, NI, RC, 5R, K7, ED, WJ, T1, 3K, 4R, XR, 2V, PD, BW, WX, MR, QA, ZS, ES, EP | 90 | 88 |
LIHC | test | ZP, 5C, KR, LG, 2Y, UB, HP, FV, WQ, CC, BD, GJ, MI | 103 | 97 |
LUAD | train | 05, 50, 44, 78, 86, 55 | 273 | 248 |
LUAD | validation | 75, 91, 99, 64, MN, 4B, 95, 97, S2, L9, 67, 35, 71 | 115 | 100 |
LUAD | test | 80, 53, MP, 93, O1, 73, 69, 49, NJ, L4, 83, 62, J2, 38 | 134 | 183 |
LUSC | train | 60, 22, 66, 85, 63, 56, 77 | 260 | 260 |
LUSC | validation | 39, XC, 6A, O2, 68, 52, MF, 98, 34, 70, 90, 18, 51, 58 | 117 | 97 |
LUSC | test | 46, LA, 33, 43, L3, 37, NC, NK, 79, J1, 96, 21, 94, 92 | 127 | 156 |
MESO | train | TS, 3H, MQ, LK | 44 | 46 |
MESO | validation | NQ, SC, UT, ZN | 20 | 26 |
MESO | test | YS, 3U, SH, XT, UD | 23 | 23 |
OV | train | 13, 61, 24 | 274 | 2 |
OV | validation | 42, 57, 25, VG, 10, 36, 20, 23, 5X, 3P | 152 | 100 |
OV | test | OY, 30, 09, WR, 29, 59, 31, 04 | 161 | 4 |
PAAD | train | IB, 2J, HZ | 86 | 88 |
PAAD | validation | YH, M8, XN, 3A, H6, US, YB, H8, L1, RL, XD, LB, HV, YY | 49 | 69 |
PAAD | test | FB, 3E, RB, FZ, 2L, OE, PZ, S4, Z5, F2, Q3 | 50 | 48 |
PCPG | train | QR, WB | 82 | 83 |
PCPG | validation | SQ, RM, P7, RX, SP, XG, P8, W2, S7, PR | 48 | 61 |
PCPG | test | QT, TT, RW, SA, SR, RT | 49 | 51 |
PRAD | train | HC, KK, EJ, G9 | 251 | 217 |
PRAD | validation | X4, YJ, VP, TK, SU, VN, Y6, 2A, V1, M7, HI, FC, XA, ZG | 115 | 100 |
PRAD | test | H9, J9, WW, J4, CH, TP, 4L, XJ, QU, XQ, YL, XK, MG, KC | 134 | 126 |
READ | train | AG | 80 | 72 |
READ | validation | DY, G5, DT, F5, BM, AF, CL | 42 | 42 |
READ | test | EF, EI, CI, AH, DC | 45 | 40 |
SARC | train | 3B, DX | 137 | 360 |
SARC | validation | HB, SG, RN, KF, UE, Z4, IE, QQ, KD, PT, IW, X2, X9, WK, VT, SI | 59 | 117 |
SARC | test | K1, LI, PC, QC, MO, N1, X6, WP, 3R, FX, HS, IS, IF, MB, JV, MJ | 64 | 118 |
SKCM | train | EB, EE | 136 | 136 |
SKCM | validation | FR, W3, QB, BF, IH, HR, WE, YD, RP, LH, D9, 3N, FS, D3, GF, YG, Z2 | 82 | 87 |
SKCM | test | ER, FW, XV, GN, DA | 82 | 83 |
STAD | train | BR, VQ | 205 | 180 |
STAD | validation | MX, ZQ, HF, ZA, RD, SW, EQ, HJ, CD, IP, R5, KB, CG | 117 | 91 |
STAD | test | F1, D7, B7, HU, FP, IN, 3M | 121 | 122 |
TGCT | train | 2G | 64 | 113 |
TGCT | validation | 2X, SB, XY, SO, 4K, VF, YU, X3, W4 | 34 | 52 |
TGCT | test | S6, SN, XE, ZM, WZ | 36 | 38 |
THCA | train | EL, EM, DJ | 254 | 260 |
THCA | validation | E8, DO, IM, 4C, KS, L6, BJ, DE, FK | 117 | 123 |
THCA | test | FE, E3, CE, MK, FY, J8, H2, GE, QD, ET | 136 | 136 |
THYM | train | X7, ZB, XU | 66 | 65 |
THYM | validation | 4V, 3Q, ZC, 3S, 3G, 5V, ZT, 3T | 27 | 30 |
THYM | test | XM, XH, 5G, ZL, 5U, 4X, 5K, YT | 31 | 85 |
UCEC | train | A5, D1, AX, AP, B5 | 294 | 290 |
UCEC | validation | KP, FI, AW, KJ, PG, 2E, EY, BS, DI, SJ, JU, EC, 5B | 113 | 152 |
UCEC | test | QS, EO, H5, QF, 5S, 4E, BK, K6, SL, BG, DF, AJ, E6 | 141 | 147 |
UCS | train | N5, N8, NA | 22 | 50 |
UCS | validation | NG, N9, QN, NF | 13 | 31 |
UCS | test | N6, QM, N7, ND | 15 | 50 |
UVM | train | V4 | 33 | 33 |
UVM | validation | V3, WC | 16 | 16 |
UVM | test | RZ, YZ, VD | 31 | 23 |
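
As a minimal sketch of how a split table like the one above might be applied, the following Python snippet assigns a slide to a split from its TCGA barcode. It assumes that the two-character codes in the table are TCGA tissue source site codes, which appear as the second hyphen-separated field of a TCGA barcode. The dictionary `SITE_SPLITS`, the helper names, and the example barcodes are hypothetical and only cover two studies for illustration; this is not the authors' code.

```python
# Illustrative subset of the table: study -> split -> site codes.
# Only CHOL and UVM are copied here; the other studies would be added the same way.
SITE_SPLITS = {
    "CHOL": {
        "train": {"W5"},
        "validation": {"ZU", "3X", "4G", "ZD"},
        "test": {"YR", "ZH", "W6", "WD"},
    },
    "UVM": {
        "train": {"V4"},
        "validation": {"V3", "WC"},
        "test": {"RZ", "YZ", "VD"},
    },
}


def site_code(barcode: str) -> str:
    """Return the site code, i.e. the second field of a TCGA barcode."""
    parts = barcode.split("-")
    if len(parts) < 3 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    return parts[1]


def split_for(study: str, barcode: str):
    """Look up the split for a barcode, or None if its site is not listed."""
    code = site_code(barcode)
    for split, codes in SITE_SPLITS[study].items():
        if code in codes:
            return split
    return None


def overlapping_sites(study: str) -> set:
    """Return site codes that appear in more than one split of a study."""
    splits = SITE_SPLITS[study]
    names = sorted(splits)
    overlaps = set()
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlaps |= splits[first] & splits[second]
    return overlaps


if __name__ == "__main__":
    # Hypothetical barcodes, used only to exercise the lookup.
    print(split_for("CHOL", "TCGA-W5-0000"))
    print(split_for("UVM", "TCGA-VD-0000"))
    print(overlapping_sites("CHOL"))
```

Checking `overlapping_sites` for each study is a simple way to confirm that no site contributes slides to more than one split, which is the property a site-level split is meant to provide.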