Rethinking Attention-Based Multiple Instance Learning for Whole-Slide Pathological Image Classification: An Instance Attribute Viewpoint

Linghan Cai, Shen** Huang, Ye Zhang, **peng Lu, and Yongbing Zhang, Senior Member, IEEE

This work was supported in part by the National Natural Science Foundation of China under Grants 62031023 and 62331011; in part by the Shenzhen Science and Technology Project under Grants JCYJ20200109142808034 and GXWD20220818170353009; and in part by the Fundamental Research Funds for the Central Universities under Grant HIT.OCEF.2023050. Thanks to the doctors from Haiyu Zhou’s Group at Guangdong Provincial People’s Hospital for providing medical knowledge support for this work. Corresponding author: Yongbing Zhang. LH Cai, Y Zhang, and YB Zhang are with the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, 518055, China (e-mails: [email protected]; [email protected]; [email protected]). SJ Huang is with the Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China (e-mail: [email protected]). JP Lu is with the School of Science, Harbin Institute of Technology, Shenzhen, 518055, China (e-mail: [email protected]).
Abstract

Multiple instance learning (MIL) is a robust paradigm for whole-slide pathological image (WSI) analysis, processing gigapixel-resolution images with slide-level labels. As pioneering efforts, attention-based MIL (ABMIL) and its variants are increasingly becoming popular due to the characteristics of simultaneously handling clinical diagnosis and tumor localization. However, the attention mechanism exhibits limitations in discriminating between instances, which often misclassifies tissues and potentially impairs MIL performance. This paper proposes an Attribute-Driven MIL (AttriMIL) framework to address these issues. Concretely, we dissect the calculation process of ABMIL and present an attribute scoring mechanism that measures the contribution of each instance to bag prediction effectively, quantifying instance attributes. Based on attribute quantification, we develop a spatial attribute constraint and an attribute ranking constraint to model instance correlations within and across slides, respectively. These constraints encourage the network to capture the spatial correlation and semantic similarity of instances, improving the ability of AttriMIL to distinguish tissue types and identify challenging instances. Additionally, AttriMIL employs a histopathology adaptive backbone that maximizes the pre-trained model’s feature extraction capability for collecting pathological features. Extensive experiments on three public benchmarks demonstrate that our AttriMIL outperforms existing state-of-the-art frameworks across multiple evaluation metrics. The implementation code is available at https://github.com/MedCAI/AttriMIL.

Index Terms

histopathological image classification, multiple instance learning, attribute scoring, ranking loss, histopathology adaptive backbone

1 Introduction

Histopathological examination is regarded as the gold standard for cancer diagnosis and prognosis in modern healthcare [1]. During these examinations, pathologists use a microscope to view specimens on stained slides, identifying tumor areas within tissue slices. With advances in scanning technology, traditional glass slides are increasingly being converted into digital whole-slide images (WSIs), providing significant opportunities for computer-assisted diagnosis. Nevertheless, applying computational techniques to histopathological image analysis faces two major challenges: the enormous resolution of WSIs (e.g., 40,000 × 40,000 pixels) and the lack of fine-grained (patch-level) annotations [2]. Consequently, WSI classification is often formulated as a weakly supervised task using multiple instance learning (MIL) [3, 4, 5]. In the MIL paradigm, each WSI is treated as a labeled bag containing multiple unlabeled instances (tile patches). If a bag is negative, all instances within it are negative. Conversely, a positive WSI contains at least one positive instance.

In general, WSI classification requires the MIL framework to complete two tasks: bag classification and instance discrimination, corresponding to clinical diagnosis and tumor localization, respectively. As a seminal work, attention-based MIL (ABMIL) [6] addresses both tasks simultaneously and has therefore been widely adopted in WSI analysis. This architecture processes the input WSI in two steps: embedding image patches as instance-level features, and then aggregating the features into a bag-level representation. In the aggregation phase, an attention pooling operation assigns each instance an attention score, which weights the contribution to the bag aggregation and, to a certain extent, reflects the attribute of the instance (shown in Fig. 1 (a)). Building on the attention mechanism, subsequent studies [7, 8, 9, 10] treat high-attention patches within a positive WSI as positive instances and exploit contrastive learning or prototype learning methods to refine the instance representation for improving classification accuracy.

However, an open question remains in ABMIL: Can attention reliably distinguish between instances? This issue is critical as introducing mislabeled instances could harm MIL and confuse diagnosis [11]. In this paper, we argue that the level of attention in ABMIL is not a reliable measure of instance attributes for two reasons. Firstly, both positive and negative patches contribute to bag prediction. Typical negative instances have high attention, offering high confidence for negative bag prediction and prompting the network to identify positive instances. Secondly, a high level of attention does not necessarily indicate a high contribution to the prediction. As shown in Fig. 1 (a), the outcome of ABMIL is determined by both the attention pooling and bag classification head. Hence, instance discrimination based merely on attention level is incomplete and potentially misleading, and thus an effective measure of instance attributes is worth exploring.

Figure 1: Illustration of ABMIL and our AttriMIL. For features, red cuboids denote positive attributes and blue cuboids are negative; for scores, redder colors represent higher scores and bluer colors indicate lower ones. Notably, for a positive bag, existing methods [6, 7, 8, 4, 5] generally believe the instances with high attention scores are positive instances.

Besides the aforementioned issue, the assumption in MIL algorithms [6, 12, 13] that instances are independent limits WSI classification performance, because it neglects significant correlations between patches. In this paper, we categorize these correlations into two types: intra-slide correlation and inter-slide correlation. Taking a tumor area as an example, the intra-slide correlation emphasizes the spatial distribution of instances within a WSI, reflecting how tumor cells tend to cluster [14]. Similarly, the inter-slide correlation reveals attribute consistency among tumor instances of the same subtype [15]. To model instance correlations, recent works [16, 17, 18] investigate the Transformer [19] for aggregating instances. Despite great progress, the use of position encoding hinders the flexibility of MIL frameworks, as it imposes rigid spatial relationships. Furthermore, these methods do not establish instance connections across WSIs, leading to suboptimal generalization and difficulty in identifying hard instances.

Motivated by the above discussions, we present an Attribute-Driven Multiple Instance Learning (AttriMIL) framework that comprehensively improves ABMIL in the pathological image classification task. To precisely represent instance attributes, we introduce an attribute scoring mechanism (shown in Fig. 1 (b)) that integrates attention pooling with the bag classification head to quantify each instance's contribution to bag prediction. Based on attribute scores, a spatial attribute constraint is applied to maintain the spatial correlations among instances within a single WSI. Meanwhile, we develop an attribute ranking loss to model the instance correlation across WSIs, enhancing the network’s capability to distinguish challenging instances by emphasizing differences between positive and negative instances. Moreover, to model these correlations effectively, we employ a histopathology adaptive backbone that optimizes the pre-trained model at different stages, maximizing the model’s pathological feature extraction ability. In summary, the main contributions of this paper are as follows:

  • An attribute-driven multiple instance learning (AttriMIL) framework is proposed for histopathological image analysis. In detail, AttriMIL improves the instance aggregation and bag prediction process of attention-based multiple instance learning (ABMIL) through attribute scoring, achieving effective instance attribute measurement.

  • Considering patch correlations in WSIs, we introduce two instance constraints for regularizing the training of the MIL framework. For a WSI, a spatial attribute constraint is leveraged to model the spatial dependency between instances. For a set of WSIs, an attribute ranking loss is designed to highlight the attribute differences between instances of different subtypes.

  • Inspired by the parameter-efficient fine-tuning technique [20], we develop a histopathology adaptive backbone for efficient pathological feature extraction, where the backbone enables our AttriMIL to model instance correlations at multiple feature levels.

  • Extensive experiments validate the effectiveness of the proposed approaches, and AttriMIL achieves state-of-the-art results on three public datasets. Notably, AttriMIL also exhibits potential in identifying out-of-distribution (OOD) samples, offering a promising solution for developing a complete computer-aided pathological diagnosis system.

The rest of this study is organized as follows. Section 2 reviews MIL and parameter-efficient fine-tuning. Our method is described in Section 3. Section 4 provides experimental results. Discussion and conclusion are given in Section 5.

Figure 2: Overview of the AttriMIL framework. For an input WSI, AttriMIL crops it into patches and adopts a histopathology adaptive backbone to extract instance features. Afterward, AttriMIL generates instance attribute scores in each subtype branch (tumor and normal in the tumor detection task) using a multi-class attribute scoring mechanism. Each subtype branch treats WSIs of its own subtype as positive and WSIs of the other subtypes as negative. The spatial attribute constraint (“$\nabla$” is a differential operation) and the attribute ranking constraint (“$+$” denotes a weighted sum operation) are applied in the training stage. Next, AttriMIL performs score aggregation to obtain C bag scores and then generates bag prediction probabilities. Instance attribute scores corresponding to the bag prediction are mapped for tumor localization.

2 Related Work

2.1 Instance-based MIL on WSIs

MIL methods for WSI analysis can be broadly divided into two categories: instance-based MIL and bag-based MIL. The main idea of instance-based MIL is to train an instance classifier and then aggregate the instance predictions into a bag prediction. Early solutions [21] adopt a straightforward framework that propagates the bag label to its instances to train the instance classifier. This inevitably introduces noisy instance-level supervision, because only a small portion of a positive WSI is actually positive [22]. To mitigate this problem, several studies [23] explore a mixed-supervision strategy that uses patch-level annotations for a subset of instances, reducing the adverse effects of noise by focusing on the labeled instances. Without instance-level labels, Hou et al. [24] propose PatchCNN, which uses a threshold scheme to select representative instances at both the class and WSI levels. Lin et al. [25] summarize previous studies from a causal perspective and develop IMIL, which chooses key instances via causal intervention and effects. Despite this progress, instance-based MIL usually performs worse than bag-based methods in WSI classification because of the inaccurate instance-level supervision signal.

2.2 Bag-based MIL on WSIs

Bag-based MIL aggregates instance features into a bag representation and classifies it through distance measures or bag classifiers [2]. In WSI classification, training a bag-based MIL model end to end is non-trivial owing to the large memory required to store patches. Therefore, current methods tend to adopt a two-stage framework that trains the instance feature extractor and the feature aggregation network separately. To enhance classification performance, researchers have contributed to both stages. For feature extractors, they introduce a variety of architectures, from convolutional neural networks to vision Transformers [16, 18], and shift the training scheme from ImageNet pre-training to self-supervised learning [26, 4]. Meanwhile, other researchers focus on feature aggregation, moving from non-parametric pooling to learnable pooling [7, 3, 17]. Among these methods, ABMIL [6] has attracted considerable attention for its ability to capture discriminative instances. In this paper, we extend ABMIL with an attribute scoring mechanism and two attribute constraints, achieving accurate WSI classification and tumor localization.

2.3 Parameter-efficient Fine-tuning

Parameter-efficient fine-tuning techniques were first proposed in natural language processing, where fully fine-tuning large language models for every downstream task is impractical [27]. Their goal is to match the performance of full fine-tuning at low training cost (e.g., few trainable parameters and short training time). More recently, parameter-efficient transfer learning has been developed in computer vision [20], where adapter-based methods have been particularly successful [28, 29]. Inspired by these works, AttriMIL incorporates adapters into the ImageNet pre-trained model, reducing the domain gap between natural and pathological images for better feature extraction.

3 Methodology

Fig. 2 shows the overview of AttriMIL. In this section, we first revisit MIL formulations and ABMIL. AttriMIL is then described in detail.

3.1 Preliminaries

3.1.1 Formulate MIL

In MIL, any input WSI is considered as a bag with multiple instances. Taking binary classification as an example, let $X=\{(\mathbf{x}_{1},y_{1}),\ldots,(\mathbf{x}_{N},y_{N})\}$ be a WSI bag, where $\mathbf{x}_{i}$ is the $i^{th}$ instance with an unavailable label $y_{i}\in\{0,1\}$. Under the standard MIL assumption, the bag label $Y$ is formed as:

$$Y=\begin{cases}0, & \text{iff}\ \sum\nolimits_{i}y_{i}=0,\\ 1, & \text{otherwise}.\end{cases} \tag{1}$$

Generally, deep MIL yields the bag prediction through three steps [22]: (1) instance transformation: a feature extractor $f(\cdot)$ is used to extract the instance-level feature $\mathbf{h}_{i}$; (2) instance aggregation: a pooling function $\sigma(\cdot)$ produces the bag feature $\mathbf{z}$; and (3) bag transformation: a bag-level classifier $g(\cdot)$ is applied for prediction. The above procedures can be formulated as:

$$\mathbf{h}_{i}=f(\mathbf{x}_{i}),\ \mathbf{z}=\sigma(\mathbf{h}_{1},\ldots,\mathbf{h}_{N}),\ \hat{Y}=g(\mathbf{z}), \tag{2}$$

where $\sigma(\cdot)$ needs to be a permutation-invariant function to maintain the spatial invariance of MIL methods.
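The three steps of Eq. 2 and the label rule of Eq. 1 can be sketched in a few lines of NumPy. This is a toy illustration only: the linear embedding `f`, the mean pooling `sigma`, and the logistic classifier `g` are hypothetical stand-ins chosen for brevity, not the components used by any particular MIL method. The final lines check that shuffling the instances leaves the bag prediction unchanged, which is what permutation invariance of the pooling function buys.

```python
import numpy as np

def bag_label(instance_labels):
    """Eq. 1: the bag is negative (0) iff every instance label is 0."""
    return int(np.sum(instance_labels) > 0)

def f(X, W_embed):
    """Instance transformation: toy linear embedding with ReLU (stand-in for a CNN)."""
    return np.maximum(X @ W_embed, 0.0)

def sigma(H):
    """Instance aggregation: mean pooling, a permutation-invariant choice."""
    return H.mean(axis=0)

def g(z, c, b):
    """Bag transformation: linear head followed by a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(z @ c + b)))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # bag of N=5 instances with 8 raw features each
W_embed = rng.normal(size=(8, 4))    # hypothetical embedding weights
c, b = rng.normal(size=4), 0.1       # hypothetical classifier parameters

y_hat = g(sigma(f(X, W_embed)), c, b)

# Shuffling the instances must not change the bag prediction.
perm = rng.permutation(len(X))
y_hat_perm = g(sigma(f(X[perm], W_embed)), c, b)
```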

3.1.2 Revisit ABMIL

ABMIL [6] generates a weight for each instance via a gated attention mechanism. Let $\{\mathbf{h}_{1},\ldots,\mathbf{h}_{N}\}$ be a bag of $N$ instance features; ABMIL obtains the bag representation $\mathbf{z}$ through:

$$\mathbf{z}=\sum_{i=1}^{N}a_{i}\mathbf{h}_{i}\in\mathbb{R}^{1\times M}, \tag{3}$$

where $M$ is the dimension of the vectors $\mathbf{z}$ and $\mathbf{h}_{i}$; $a_{i}$ represents the attention score of the $i^{th}$ instance and is calculated by:

$$a_{i}=\frac{\exp\{\mathbf{w}^{\text{T}}(\tanh(\mathbf{V}\mathbf{h}^{\text{T}}_{i}))\odot\text{sigmoid}(\mathbf{U}\mathbf{h}^{\text{T}}_{i})\}}{\sum_{j=1}^{N}\exp\{\mathbf{w}^{\text{T}}(\tanh(\mathbf{V}\mathbf{h}^{\text{T}}_{j}))\odot\text{sigmoid}(\mathbf{U}\mathbf{h}^{\text{T}}_{j})\}}, \tag{4}$$

where $\mathbf{w}\in\mathbb{R}^{L\times 1}$, $\mathbf{V}\in\mathbb{R}^{L\times M}$, and $\mathbf{U}\in\mathbb{R}^{L\times M}$ are parameters, $\mathbf{h}_{i}\in\mathbb{R}^{1\times M}$ is the feature of the $i^{th}$ instance, “T” is the transpose operation, “$\odot$” is element-wise multiplication, $\tanh(\cdot)$ is the hyperbolic tangent function, and $\text{sigmoid}(\cdot)$ is the sigmoid non-linearity.

In ABMIL, the attention score $a_{i}$ reflects the importance of the $i^{th}$ instance in shaping the bag representation, thus providing interpretability to the classification result. Based on attention scores, numerous works [7, 23, 8, 9] explore various ways to further strengthen MIL performance.
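A minimal NumPy sketch of the gated attention pooling in Eqs. 3 and 4 follows. It reads Eq. 4 in the usual gated-attention sense, in which the tanh and sigmoid branches are multiplied element-wise and the result is projected by $\mathbf{w}$; the function name, the row-vector layout of `H`, and the max-subtraction used for numerical stability are choices made for this illustration rather than details taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_pool(H, V, U, w):
    """Gated attention pooling.

    H: (N, M) instance features, one row per instance.
    V, U: (L, M) projection matrices; w: (L,) attention vector.
    Returns the attention scores a (N,) and the bag feature z (M,).
    """
    gate = np.tanh(H @ V.T) * sigmoid(H @ U.T)   # (N, L) gated hidden units
    logits = gate @ w                            # (N,) one logit per instance
    logits = logits - logits.max()               # stabilise the softmax; a is unchanged
    u = np.exp(logits)
    a = u / u.sum()                              # Eq. 4: softmax over the bag
    z = a @ H                                    # Eq. 3: attention-weighted sum
    return a, z

rng = np.random.default_rng(0)
N, M, L = 6, 5, 3
H = rng.normal(size=(N, M))
V, U, w = rng.normal(size=(L, M)), rng.normal(size=(L, M)), rng.normal(size=L)
a, z = gated_attention_pool(H, V, U, w)
```

Because the scores are normalised within each bag, their magnitude depends on how many instances the bag contains, which is one reason raw attention values are hard to compare across slides.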

Figure 3: Illustration of histopathology adaptive backbone. In contrast to previous solutions, the histopathology adaptive backbone adopts feature adapters at different stages of the pre-trained network. For the adapter, the global pooling is used when training the current adapter.

3.2 Attribute Scoring Mechanism

Although the attention score gives the importance of each instance in bag aggregation, it is not equivalent to the contribution to the final prediction. This difference may introduce noise in the selected instances when using attention for instance discrimination. To distinguish between instances more effectively, we dissect ABMIL and introduce an attribute scoring mechanism for precise measurement of instance attributes.

Given a bag feature $\mathbf{z}\in\mathbb{R}^{1\times M}$, ABMIL adopts a fully connected (FC) layer as a classification head for bag prediction:

$$\hat{Y}=b+\mathbf{z}\mathbf{c}, \tag{5}$$

where $b$ is a bias and $\mathbf{c}\in\mathbb{R}^{M\times 1}$ denotes the weight of the classification head. The network is inclined to classify the input as a positive bag when the value of $\hat{Y}$ is high, and as a negative bag when it is low. Combined with Eq. 3, Eq. 5 can be rewritten as:

$$\hat{Y}=b+\sum_{i=1}^{N}a_{i}\mathbf{h}_{i}\mathbf{c}, \tag{6}$$

where the second term reveals that the contribution of the $i^{th}$ instance to the final bag prediction is determined by the combined effect of the attention and the instance prediction rather than by the attention alone. Based on the above analysis, we define the attribute score $s_{i}$ of the $i^{th}$ instance as:

$$s_{i}=u_{i}\mathbf{h}_{i}\mathbf{c}, \tag{7}$$

where $u_{i}=\exp\{\mathbf{w}^{\text{T}}(\tanh(\mathbf{V}\mathbf{h}^{\text{T}}_{i}))\odot\text{sigmoid}(\mathbf{U}\mathbf{h}^{\text{T}}_{i})\}$ denotes the unnormalized attention score of the $i^{th}$ instance. Using $u_{i}$ removes the impact of the instance number, enabling us to compare instance attributes across bags. $s_{i}\in(-\infty,+\infty)$ is the attribute score of the $i^{th}$ instance, in which the sign of the score indicates the instance attribute, while the absolute value indicates the level of emphasis the network places on the instance. Essentially, AttriMIL converts ABMIL’s sequential process of attention calculation and bag prediction into a parallel operation for attribute scoring. The bag prediction is obtained by the sum of normalized instance attribute scores (score aggregation in AttriMIL) with a learnable bias, as in Eq. 6.
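The equivalence between the two routes can be checked numerically. The sketch below uses random toy parameters, and it again reads the gate as an element-wise product of the tanh and sigmoid branches projected by $\mathbf{w}$, which is an interpretive assumption. It computes the attribute scores of Eq. 7 and confirms that dividing their sum by $\sum_j u_j$ and adding the bias reproduces the ABMIL prediction of Eqs. 3 and 5; the normalisation by $\sum_j u_j$ follows from $a_i = u_i / \sum_j u_j$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unnormalized_attention(H, V, U, w):
    """u_i for every instance; H is (N, M), V and U are (L, M), w is (L,)."""
    gate = np.tanh(H @ V.T) * sigmoid(H @ U.T)   # (N, L)
    return np.exp(gate @ w)                      # (N,), strictly positive

rng = np.random.default_rng(0)
N, M, L = 6, 5, 3
H = rng.normal(size=(N, M))                      # toy instance features
V, U = rng.normal(size=(L, M)), rng.normal(size=(L, M))
w, c, b = rng.normal(size=L), rng.normal(size=M), 0.25

u = unnormalized_attention(H, V, U, w)
s = u * (H @ c)                                  # attribute scores, Eq. 7

# ABMIL route: normalise attention, pool features, then classify (Eqs. 3 and 5).
a = u / u.sum()
y_abmil = b + (a @ H) @ c

# Attribute-score route: aggregate normalised scores, then add the bias (Eq. 6).
y_attribute = b + s.sum() / u.sum()
```

Since every $u_i$ is positive, the sign of $s_i$ is decided entirely by the instance-level logit $\mathbf{h}_i\mathbf{c}$, while $u_i$ only scales its magnitude.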

3.3 Spatial Attribute Constraint

Similar to birds of a feather flocking together, image tiles in a WSI exhibit significant spatial relationships, with patches of similar attributes often clustering together. By leveraging this prior information, we can alleviate errors in instance discrimination, thereby enhancing tumor localization and bag classification. With attribute scores, this paper proposes a simple but effective strategy, named spatial attribute constraint, to establish the spatial relation between patches.

Given an input WSI with $N$ instances, the spatial attribute constraint can be formulated as the sum of differences between adjacent instances:

$$\mathcal{L}_{spatial}=\frac{1}{N}\sum_{(i,j)}\sqrt{(s_{i,j}-s_{i+1,j})^{2}+(s_{i,j}-s_{i,j+1})^{2}}, \tag{8}$$

where $s_{i,j}$ denotes the attribute score of the instance located at position $(i,j)$ within a WSI. $s_{i+1,j}$ and $s_{i,j+1}$ represent the attribute scores of the instances below and to the right of the current instance, respectively.

Figure 4: Training loss curves and AUC changes of the validation set under different loss constraints on the Camelyon16 dataset. $\alpha$ is set to 0.1 for the spatial attribute loss and $\beta$ is set to 0.001 for the attribute ranking loss.

To implement the spatial attribute constraint efficiently, the coordinates and neighbors of each instance are recorded in pre-processing. For edge instances whose adjacent positions are missing due to the irregularity of the segmented tissue, we substitute the instance itself for each missing neighbor, so the corresponding difference is zero and the computation stays simple. As shown in Fig. 2, for an input WSI, we calculate the average spatial attribute constraint of each branch to constrain AttriMIL in training.
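A small NumPy sketch of the spatial attribute constraint for one branch is given below. The function name and the representation of patches as a list of grid coordinates with aligned scores are assumptions made for illustration; a missing lower or right neighbor falls back to the instance's own score, which contributes a zero difference.

```python
import numpy as np

def spatial_attribute_loss(coords, scores):
    """Forward value of the spatial attribute constraint for one branch.

    coords: list of (row, col) grid positions of the tissue patches.
    scores: attribute scores aligned with coords.
    """
    index = {tuple(p): k for k, p in enumerate(coords)}   # position -> patch index
    total = 0.0
    for k, (i, j) in enumerate(coords):
        s = scores[k]
        below = scores[index.get((i + 1, j), k)]   # missing neighbour -> own score
        right = scores[index.get((i, j + 1), k)]
        total += np.sqrt((s - below) ** 2 + (s - right) ** 2)
    return total / len(coords)

# Three patches forming an L shape; only (0, 0) has both neighbours present.
coords = [(0, 0), (0, 1), (1, 0)]
scores = [1.0, 4.0, 5.0]
loss = spatial_attribute_loss(coords, scores)      # (sqrt(16 + 9) + 0 + 0) / 3

flat = spatial_attribute_loss(coords, [2.0, 2.0, 2.0])   # constant map -> no penalty
```

In an autograd framework, a small epsilon inside the square root is a common precaution, since the gradient of the square root is unbounded at zero.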

3.4 Attribute Ranking Constraint

In addition to the intra-slide correlation, we also attempt to capture the instance correlation across WSIs, and thus enhance the model’s instance attribute perception. Specifically, our goal is that positive instances exhibit higher attribute scores than negative ones, which can be formulated as:

s(Qp)>s(Qn),𝑠superscript𝑄𝑝𝑠superscript𝑄𝑛s(Q^{p})>s(Q^{n}),italic_s ( italic_Q start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT ) > italic_s ( italic_Q start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) , (9)

where Qpsuperscript𝑄𝑝Q^{p}italic_Q start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT and Qnsuperscript𝑄𝑛Q^{n}italic_Q start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT represent positive and negative instances, s(Qp)𝑠superscript𝑄𝑝s(Q^{p})italic_s ( italic_Q start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT ) and s(Qn)𝑠superscript𝑄𝑛s(Q^{n})italic_s ( italic_Q start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) represent the corresponding instance attribute scores respectively. The above ranking function is acceptable if the instance-level annotations are known during training. However, lacking instance-level annotations precludes the direct application of Eq. 9. Instead, the following multiple instance ranking function can be used:

maxiXps(Qip)>maxiXns(Qin),subscript𝑖superscript𝑋𝑝𝑠subscriptsuperscript𝑄𝑝𝑖subscript𝑖superscript𝑋𝑛𝑠subscriptsuperscript𝑄𝑛𝑖\mathop{\max}\limits_{i\in X^{p}}s({Q}^{p}_{i})>\mathop{\max}\limits_{i\in X^{% n}}s({Q}^{n}_{i}),roman_max start_POSTSUBSCRIPT italic_i ∈ italic_X start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_s ( italic_Q start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) > roman_max start_POSTSUBSCRIPT italic_i ∈ italic_X start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_s ( italic_Q start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , (10)

where $\max$ is taken over all patches in each bag. Unlike the comparison within a single WSI, Eq. 10 ranks only two instances: the one with the highest attribute score in the positive bag and the one with the highest attribute score in the negative bag. The highest-score patch in a positive bag likely represents a true positive instance, whereas the highest-score patch in a negative bag is considered a hard instance, closely resembling a positive instance yet being negative. Based on Eq. 10, we widen the gap in attribute scores between positive and negative instances as follows:

$$
\begin{aligned}
\mathcal{L}_{rank} ={}& \max\Big(0, -\max_{i \in X^{p}} s(Q^{p}_{i}) + \max_{i \in X^{n}} s(Q^{n}_{i})\Big) \\
&+ \max\Big(0, -\max_{i \in X^{p}} s(Q^{p}_{i})\Big) + \max\Big(0, \max_{i \in X^{n}} s(Q^{n}_{i})\Big), \qquad (11)
\end{aligned}
$$

where the first term enlarges the distance between instances of different attributes, and the second and third terms use 0 as a threshold to constrain the attributes of the selected instances.
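As a minimal sketch of Eq. 11, the snippet below evaluates the three hinge terms for one pair of bags. It assumes the attribute scores of each bag are already available as plain lists of floats; the function name `attribute_ranking_loss` and the example scores are illustrative only and are not taken from the released code.

```python
from typing import Sequence


def attribute_ranking_loss(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Eq. 11 for one (positive bag, negative bag) pair of attribute scores."""
    top_pos = max(pos_scores)  # likely a true positive instance
    top_neg = max(neg_scores)  # hard instance of the negative bag
    gap_term = max(0.0, -top_pos + top_neg)  # asks for top_pos >= top_neg
    pos_term = max(0.0, -top_pos)            # asks for top_pos >= 0
    neg_term = max(0.0, top_neg)             # asks for top_neg <= 0
    return gap_term + pos_term + neg_term


# Well-separated pair: every term vanishes.
print(attribute_ranking_loss([2.0, -0.5, 0.3], [-1.0, -0.2]))  # 0.0
# Hard negative above the best positive: 1.0 + 0.5 + 0.5.
print(attribute_ranking_loss([-0.5, -1.0], [0.5, -2.0]))  # 2.0
```

The loss is zero only when the top positive score is non-negative and the top negative score is non-positive, which is what makes 0 usable as an attribute threshold.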

The attribute ranking loss can be extended to subtype classification under the multi-class attribute scoring mechanism. As shown in the gray dotted boxes of Fig. 2, we maintain a positive bank and a negative bank for each subtype branch to record instances, each with a capacity of K. For an input WSI in a branch, we use its highest attribute score instance to update the positive or negative bank, according to whether the label of the WSI corresponds to the branch’s subtype. Then, since a positive slide always contains multiple positive instances, we select the top K instances, calculate the attribute ranking loss against the K recorded instances one by one, and take the mean value. The final attribute ranking loss is the average of the ranking losses over all branches. In experiments, we set K to 4, as the minimum number of positive instances within a positive WSI is 4 in the training process.
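One possible reading of the bank mechanism for a single subtype branch is sketched below. Several details are assumptions rather than statements from the text: the banks store attribute scores (not features), they behave as first-in-first-out queues, the top-K scores of the current WSI are paired element-wise with the entries of the opposite bank, and fewer pairs are used while that bank is still filling. The names `BranchBanks` and `pair_loss` are hypothetical.

```python
from collections import deque
from typing import Deque, Sequence

K = 4  # bank capacity, as set in the experiments


def pair_loss(pos: float, neg: float) -> float:
    """Eq. 11 evaluated on one positive score and one negative score."""
    return max(0.0, neg - pos) + max(0.0, -pos) + max(0.0, neg)


class BranchBanks:
    """Positive and negative score banks for a single subtype branch."""

    def __init__(self, capacity: int = K) -> None:
        self.pos: Deque[float] = deque(maxlen=capacity)
        self.neg: Deque[float] = deque(maxlen=capacity)

    def step(self, scores: Sequence[float], is_branch_subtype: bool) -> float:
        """Update the matching bank, then rank the WSI's top scores against the opposite bank."""
        own, other = (self.pos, self.neg) if is_branch_subtype else (self.neg, self.pos)
        own.append(max(scores))  # record the highest-score instance of this WSI
        if not other:
            return 0.0  # nothing recorded yet to rank against
        top_k = sorted(scores, reverse=True)[: len(other)]
        if is_branch_subtype:
            losses = [pair_loss(p, n) for p, n in zip(top_k, other)]
        else:
            losses = [pair_loss(p, n) for n, p in zip(top_k, other)]
        return sum(losses) / len(losses)


banks = BranchBanks()
print(banks.step([-0.2, -1.0, -0.7], is_branch_subtype=False))  # 0.0 (positive bank still empty)
print(banks.step([1.5, 0.8, -0.3], is_branch_subtype=True))     # 0.0 (pair already well separated)
print(banks.step([0.4, -0.1], is_branch_subtype=False))         # 0.4 (negative score above 0)
```

With several branches, the per-branch values returned by `step` would be averaged to obtain the final ranking loss.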

3.5 Histopathology Adaptive Backbone

Within the bag-based MIL framework, the feature extractor is crucial for embedding patches into deep features to capture semantic information. To enhance instance representation, researchers [26, 4, 16] investigate self-supervised learning for training a powerful extractor. However, these extractors do not consistently outperform the ImageNet pre-trained backbone in pathological image classification tasks, as the final prediction is also determined by the aggregator [30]. On the other hand, pre-trained models have been trained on large-scale image datasets and have demonstrated excellent performance in various downstream tasks. We believe that they can provide pathological features through fine-tuning. In fact, this is a prevalent scheme in existing works [17, 7], which generally integrate a multi-layer perceptron (i.e., an adapter) after a pre-trained network to optimize instance-level features. In this paper, we develop a histopathology adaptive backbone, encouraging AttriMIL to perceive pathological information at various levels for better feature collection.

As shown in Fig. 3 (a), the adapter used in AttriMIL has a bottleneck architecture that consists of two fully connected (FC) layers with an activation layer in between. The first FC layer projects the input to a lower dimension, and the second projects it back to the original dimension. To enhance feature representation, we apply adapters after different stages of the feature extractor and introduce a progressive learning scheme for optimizing them. To be specific, we train the adapters progressively based on the depth of the network. Firstly, we use the first stage of the pre-trained model as the instance feature extractor and optimize the $1^{st}$ adapter. In the process, an auxiliary classifier is used to perform score aggregation and add bias. The training loss $\mathcal{L}$ can be expressed as:

$$\mathcal{L}=\mathcal{L}_{ce}+\alpha\mathcal{L}_{spatial}+\beta\mathcal{L}_{rank}, \qquad (12)$$

where $\mathcal{L}_{ce}=CE(Y,\hat{Y})$ is a cross-entropy loss for bag classification, and $\alpha$ and $\beta$ are used to balance the three terms.

With the trained adapter, we further collect the second-stage output of the optimized pre-trained model and then train the $2^{nd}$ adapter. By repeating the above steps, the training of the histopathology adaptive backbone is completed.
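The bottleneck adapter and the weighted loss of Eq. 12 can be illustrated with a small dependency-free sketch. The choices below are assumptions for illustration: ReLU as the activation, random Gaussian initialisation, toy dimensions (8 and 2), and no residual connection; the class name `BottleneckAdapter` is hypothetical. The default weights in `total_loss` are the values of α and β reported in Section 4.2.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class BottleneckAdapter:
    """FC (down-projection) -> activation -> FC (up-projection)."""

    def __init__(self, dim: int, hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w_down = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        self.b_down = [0.0] * hidden
        self.w_up = [[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(dim)]
        self.b_up = [0.0] * dim

    def __call__(self, x: Vector) -> Vector:
        return linear(relu(linear(x, self.w_down, self.b_down)), self.w_up, self.b_up)


def total_loss(l_ce: float, l_spatial: float, l_rank: float,
               alpha: float = 0.1, beta: float = 0.001) -> float:
    """Eq. 12: cross-entropy plus the two weighted attribute constraints."""
    return l_ce + alpha * l_spatial + beta * l_rank


adapter = BottleneckAdapter(dim=8, hidden=2)
out = adapter([1.0] * 8)
print(len(out))  # 8: the output keeps the input dimension
```

In the progressive scheme, one such adapter would be optimized per backbone stage, each time with the loss of Eq. 12 and an auxiliary classifier, before moving on to the next stage.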

Figure 5: Visual analyses of the spatial attribute constraint and attribute ranking constraint. (a) presents a WSI from the Camelyon16 testing set, with tumor regions surrounded by yellow curves. (b)-(e) show zoomed-in views of the boxed areas in (a). In (b)-(e), tumor areas identified by the model (instance attribute score greater than 0 in the tumor branch) are highlighted in red.

4 Experiments

In this section, extensive experiments are performed to validate the effectiveness of the proposed approaches. In Section 4.1, we introduce the datasets and evaluation metrics used in our experiments. The implementation details are described in Section 4.2. Next, we present ablation studies and comparison results in Sections 4.3 and 4.4, respectively.

Table 1: Evaluation of spatial attribute constraint and attribute ranking constraint in terms of AUC (%). The best result is shown in bold, and the second-best result is underlined. P-values (baseline vs. best) are all less than 0.05.
| Datasets | $\alpha$ (rows) / $\beta$ (columns) | 0 | 0.001 | 0.010 | 0.100 | 1.000 | 10.000 |
|---|---|---|---|---|---|---|---|
| Camelyon16 | 0 | 86.17 | 88.52 | 88.01 | 88.27 | 85.79 | 84.21 |
| | 0.001 | 86.56 | 87.14 | 87.22 | 86.96 | 84.77 | 84.09 |
| | 0.010 | 87.62 | 88.15 | 87.12 | $\underline{89.46}$ | 83.18 | 82.54 |
| | 0.100 | 89.01 | $\mathbf{91.31}$ | 89.26 | 88.34 | 86.42 | 82.93 |
| | 1.000 | 87.76 | 88.16 | 87.96 | 88.97 | 87.29 | 83.72 |
| | 10.000 | 86.76 | 87.79 | 87.19 | 86.58 | 86.86 | 82.81 |
| TCGA-NSCLC | 0 | 94.36 | 95.03 | 94.97 | 94.43 | 94.43 | 94.99 |
| | 0.001 | 95.23 | 95.13 | 95.22 | 95.46 | 95.38 | 94.24 |
| | 0.010 | 95.23 | 95.20 | 94.95 | 94.38 | $\mathbf{95.49}$ | 94.51 |
| | 0.100 | 95.30 | $\underline{95.47}$ | 94.99 | 94.89 | 95.46 | 94.29 |
| | 1.000 | 95.32 | 95.40 | 94.72 | 94.63 | 95.43 | 94.39 |
| | 10.000 | 95.30 | 95.43 | 94.33 | 94.46 | 94.20 | 94.11 |

4.1 Datasets and Evaluation Metrics

4.1.1 Dataset Description

Our methods are evaluated on three public datasets: Camelyon16 [31], TCGA-NSCLC, and UniToPatho [32]. Camelyon16 [31] is a breast cancer detection dataset that consists of 399 slides in 2 classes (normal and tumor). Following [7], we remove the background and crop each WSI into non-overlapping 256×256 patches at 20× magnification. We use the official split of 270/129 slides for training/testing and report the testing results. TCGA-NSCLC (https://portal.gdc.cancer.gov) is a non-small cell lung cancer dataset, including 507 lung adenocarcinoma (LUAD) slides from 444 patients and 486 lung squamous cell carcinoma (LUSC) slides from 452 patients. Following [7], we crop the foreground of each WSI into non-overlapping 256×256 patches, obtaining approximately 15,000 patches per WSI. On this dataset, we conduct 4-fold cross-validation for evaluation. Specifically, we ensure that different slides from one patient case do not appear in both the training and testing sets, and then randomly split the data at a ratio of training : validation : testing = 60 : 15 : 25. UniToPatho [32] is a colon cancer dataset comprising 9,536 hematoxylin and eosin (H&E) stained patches extracted from 292 WSIs. There are six classes in the dataset, namely, normal tissue (NORM), hyperplastic polyp (HP), tubular adenoma & high-grade dysplasia (TA.HG), tubular adenoma & low-grade dysplasia (TA.LG), tubulo-villous adenoma & high-grade dysplasia (TVA.HG), and tubulo-villous adenoma & low-grade dysplasia (TVA.LG). Based on the official code, we crop the provided images into non-overlapping 224×224 patches. In experiments, the official split of 204/88 slides is used for training/testing.
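A patient-level split of the kind described for TCGA-NSCLC can be sketched as follows. This is an illustration under assumptions: the 60 : 15 : 25 ratio is applied to patients rather than to slides, a single split (not the full 4-fold protocol) is produced, and the slide and patient identifiers are made up.

```python
import random
from typing import Dict, List, Tuple


def patient_level_split(
    slide_to_patient: Dict[str, str],
    ratios: Tuple[float, float, float] = (0.60, 0.15, 0.25),
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Split slides into train/val/test so that no patient spans two subsets."""
    patients = sorted(set(slide_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_train = round(ratios[0] * len(patients))
    n_val = round(ratios[1] * len(patients))
    subset_of = {}
    for idx, patient in enumerate(patients):
        if idx < n_train:
            subset_of[patient] = "train"
        elif idx < n_train + n_val:
            subset_of[patient] = "val"
        else:
            subset_of[patient] = "test"
    split: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    for slide, patient in sorted(slide_to_patient.items()):
        split[subset_of[patient]].append(slide)
    return split


# Hypothetical IDs: 20 patients with two slides each.
slides = {f"slide_{i}": f"patient_{i // 2}" for i in range(40)}
split = patient_level_split(slides)
print({name: len(items) for name, items in split.items()})  # {'train': 24, 'val': 6, 'test': 10}
```

Assigning whole patients to a subset, rather than individual slides, is what prevents slides of the same case from leaking between training and testing.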

4.1.2 Evaluation Metrics

We evaluate model performance using class-wise average accuracy (ACC), F1-Score, and the area under the curve (AUC) score. For all three metrics, higher values indicate better performance. In experiments, we took AUC as the major evaluation metric and conducted a DeLong test on it to verify the statistical significance of improvements. To reduce the effect of randomness, we ran the experiments four times and reported the average metrics on the Camelyon16 and UniToPatho datasets. For TCGA-NSCLC, we reported average metrics from 4-fold cross-validation.
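For concreteness, the sketch below computes a class-wise average accuracy under the assumption that it means the unweighted mean of per-class accuracies (per-class recall averaged over classes). The exact definition used in the experiments is not spelled out in the text, and the labels in the example are made up.

```python
from collections import defaultdict
from typing import Sequence


def class_wise_average_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean over classes of the fraction of that class's slides predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(class_wise_average_accuracy(y_true, y_pred))  # (3/4 + 1/2) / 2 = 0.625
```

Unlike plain accuracy (4/6 here), this average gives every class the same weight, which matters for datasets with an unbalanced class distribution.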

Table 2: Evaluation of the use of adapters in histopathology adaptive backbone. P-values (adapter3 vs. adapter1 + adapter2 + adapter3) are all less than 0.05.
| Backbone | Adapter1 | Adapter2 | Adapter3 | Camelyon16 ACC (%) | Camelyon16 AUC (%) | TCGA-NSCLC ACC (%) | TCGA-NSCLC AUC (%) |
|---|---|---|---|---|---|---|---|
| ImageNet Pre-trained | | | | 65.41 | 72.31 | 80.52 | 86.32 |
| | | | | 79.84 | 83.38 | 85.44 | 91.23 |
| | | | | 84.49 | 86.17 | 88.45 | 94.36 |
| | | | | 81.21 | 84.52 | 86.57 | 91.45 |
| | | | | 85.38 | 89.24 | 86.60 | 94.58 |
| | | | | 85.72 | 89.92 | 88.92 | 94.77 |
| | | | | 86.63 | $\mathbf{91.12}$ | $\mathbf{89.50}$ | $\mathbf{95.11}$ |
| SimCLR [33] | | | | $\mathbf{86.87}$ | 89.95 | 89.41 | 95.03 |
| MoCov2 [34] | | | | 86.21 | 88.67 | 89.05 | 94.88 |

Table 3: Quantitative comparison of our AttriMIL and state-of-the-art methods. The best performance is highlighted in bold, and the second-best is underlined. P-values (ABMIL vs. AttriMIL) are all less than 0.05.
| Methods | Camelyon16 ACC (%) | Camelyon16 F1-Score (%) | Camelyon16 AUC (%) | TCGA-NSCLC ACC (%) | TCGA-NSCLC F1-Score (%) | TCGA-NSCLC AUC (%) | UniToPatho ACC (%) | UniToPatho F1-Score (%) | UniToPatho AUC (%) |
|---|---|---|---|---|---|---|---|---|---|
| Mean-Pooling | 79.56 ± 3.15 | 70.16 ± 7.59 | 80.01 ± 7.25 | 84.78 ± 1.26 | 84.74 ± 1.36 | 92.84 ± 1.14 | 66.76 ± 3.34 | $\underline{52.12}$ ± 2.68 | 88.25 ± 1.27 |
| Max-Pooling | 81.32 ± 6.63 | 72.24 ± 4.46 | 84.65 ± 5.03 | 85.65 ± 1.50 | 83.97 ± 2.22 | 91.16 ± 0.79 | 53.69 ± 2.18 | 35.28 ± 4.99 | 73.68 ± 1.84 |
| MIL-RNN [3] | 83.72 ± 2.95 | 76.40 ± 3.87 | 83.21 ± 3.65 | 86.15 ± 2.31 | 84.89 ± 3.55 | 92.25 ± 1.03 | 51.14 ± 4.18 | 43.15 ± 2.78 | 78.18 ± 0.62 |
| DSMIL [26] | $\underline{88.75}$ ± 2.21 | $\underline{84.33}$ ± 2.98 | $\underline{92.33}$ ± 1.43 | $\underline{89.50}$ ± 2.15 | $\mathbf{92.61}$ ± 2.93 | $\underline{95.52}$ ± 1.21 | $\mathbf{67.04}$ ± 2.43 | 49.69 ± 3.55 | $\underline{88.48}$ ± 1.00 |
| CLAM-SB [7] | 83.72 ± 1.12 | 77.89 ± 1.49 | 90.66 ± 2.78 | 88.60 ± 2.47 | 88.61 ± 2.19 | 95.39 ± 1.45 | 52.27 ± 3.40 | 41.61 ± 1.21 | 84.79 ± 1.06 |
| CLAM-MB [7] | 84.41 ± 1.33 | 78.06 ± 2.05 | 90.05 ± 2.35 | 88.90 ± 2.57 | 89.10 ± 2.31 | 95.02 ± 1.53 | 58.23 ± 2.58 | 46.86 ± 2.40 | 84.27 ± 0.70 |
| DGMIL [4] | 82.49 ± 2.93 | 75.10 ± 2.45 | 88.86 ± 3.10 | 88.84 ± 1.43 | 88.71 ± 1.57 | 94.55 ± 1.29 | - | - | - |
| TransMIL [17] | 83.72 ± 2.33 | 81.50 ± 3.56 | 88.86 ± 2.85 | 88.46 ± 2.61 | 88.42 ± 3.74 | 94.32 ± 2.01 | 63.92 ± 4.10 | 42.73 ± 6.67 | 87.02 ± 1.98 |
| DTFD-MIL [5] | 86.61 ± 0.88 | 79.52 ± 1.07 | 88.82 ± 2.49 | 88.46 ± 2.01 | 86.90 ± 2.26 | 93.77 ± 1.24 | 64.53 ± 3.92 | 43.75 ± 5.24 | 86.45 ± 1.92 |
| PMIL [10] | 87.86 ± 1.59 | 80.15 ± 2.16 | 90.20 ± 2.40 | 88.54 ± 1.92 | 86.77 ± 2.42 | 94.83 ± 1.69 | - | - | - |
| ABMIL [6] | 84.49 ± 1.65 | 76.74 ± 3.06 | 88.75 ± 3.15 | 88.90 ± 0.51 | 88.59 ± 0.79 | 94.95 ± 0.30 | 57.38 ± 1.71 | 44.27 ± 4.45 | 85.37 ± 0.37 |
| AttriMIL (Ours) | $\mathbf{90.69}$ ± 1.02 | $\mathbf{87.23}$ ± 1.78 | $\mathbf{93.90}$ ± 1.23 | $\mathbf{90.38}$ ± 2.32 | $\underline{90.24}$ ± 2.21 | $\mathbf{96.13}$ ± 1.20 | $\underline{66.92}$ ± 3.41 | $\mathbf{55.97}$ ± 4.71 | $\mathbf{88.99}$ ± 1.31 |

4.2 Implementation Details

We employ an Adam optimizer with a constant learning rate of 2e-4 for updating learnable weights during the training phase. The mini-batch size for training is set to 1 (bag). The hyper-parameters $\alpha$ and $\beta$ of the loss function (Eq. 12) are set to 0.1 and 0.001, respectively, when training each adapter by default. We adopt the first three blocks of an ImageNet pre-trained ResNet-50 as the feature extractor’s backbone. Notably, we replace batch normalization with group normalization [35] in the backbone to avoid interactions between instances during feature extraction. All experiments are implemented in PyTorch 1.10.0 using an Nvidia GeForce RTX 3090 GPU.

4.3 Ablation Study

In this subsection, we explore and analyze the effectiveness of each component. Details are described in the following parts.

Figure 6: Qualitative comparison of our AttriMIL and state-of-the-art methods. Tumor regions in each WSI are surrounded by yellow curves, and red boxes highlight the salient differences of each method in tumor localization. For (b), instance prediction results are used for patch discrimination. For (c)-(f), attention scores are re-scaled from min-max to [0, 1]. For (g), attribute scores are used for instance discrimination. According to the meaning of attribute score signs, score values less than 0 are set as 0, and values greater than 0 are re-scaled from min-max to [0.5, 1].
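The rescaling applied to attribute scores for the heat maps in (g) can be sketched as below. Two details are assumptions not stated in the caption: a score of exactly 0 is treated like a negative score, and when all positive scores are equal they are mapped to 1. The function name `rescale_attribute_scores` is illustrative.

```python
from typing import List, Sequence


def rescale_attribute_scores(scores: Sequence[float]) -> List[float]:
    """Map attribute scores to heat-map intensities: non-positive -> 0, positive -> [0.5, 1]."""
    positives = [s for s in scores if s > 0]
    if not positives:
        return [0.0] * len(scores)
    lo, hi = min(positives), max(positives)
    span = hi - lo
    out = []
    for s in scores:
        if s <= 0:
            out.append(0.0)
        elif span == 0:
            out.append(1.0)  # assumption: a single positive level maps to full intensity
        else:
            out.append(0.5 + 0.5 * (s - lo) / span)
    return out


print(rescale_attribute_scores([-2.0, 0.0, 1.0, 2.0, 3.0]))  # [0.0, 0.0, 0.5, 0.75, 1.0]
```

Keeping positive scores in [0.5, 1] preserves the sign information of the attribute score in the visualization: any highlighted patch is one the model scores above the threshold of 0.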

4.3.1 Evaluation of Spatial Attribute Constraint

In this part, we investigate the impact of the spatial attribute constraint on WSI classification. We first conducted a hyper-parameter optimization experiment on the Camelyon16 and TCGA-NSCLC datasets, with the results detailed in Table 1. From the column of $\beta=0$, we can observe two points. (1) Regardless of the value of $\alpha$ ($>0$), the incorporation of the spatial attribute constraint consistently increases the performance of the baseline on both datasets, indicating the importance of patch spatial correlation in WSI classification. (2) The effect of the spatial attribute constraint is related to the dataset, as the optimal value of $\alpha$ differs across datasets. We attribute this phenomenon to the differences in instance distribution across the datasets. Compared with TCGA-NSCLC, the tumor area ratio on Camelyon16 is much smaller, which may cause overfitting to negative instances when the value of $\alpha$ is large, thus yielding limited improvements.

To further demonstrate the effectiveness of the spatial attribute constraint, we present the training curves in Fig. 4. Comparing (a) and (b), we observe a large spatial attribute loss (exceeding 350,000) in the vanilla training process, because the baseline (ABMIL) does not take the spatial correlation of instances into account. In contrast, the use of the spatial attribute constraint enables the network to be aware of the intra-slide correlation and improves the bag classification performance. Additionally, Fig. 5 intuitively shows the impact of the spatial attribute constraint. As depicted in the upper box of (c), the spatial constraint leads to fewer holes in the positive regions, underscoring its utility in enhancing tumor localization. Nevertheless, establishing only spatial relationships within a single WSI cannot solve the problem of identifying hard samples, as shown in the bottom box of (c).

4.3.2 Evaluation of Attribute Ranking Constraint

The attribute ranking constraint is designed to enhance the model’s learning capability by leveraging inter-slide correlations. Table 1 presents the influence on the model performance under different settings of $\beta$. From the rows of $\alpha=0$, we obtain two observations: (1) When $\beta$ is set as 0.001, the performance of the baseline is improved by 2.35% and 0.67% on Camelyon16 and TCGA-NSCLC respectively, indicating that the attribute ranking constraint is beneficial to WSI classification. (2) A large $\beta$ value reduces the model performance on Camelyon16, likely because the large ranking loss makes the network overly focus on differences between partial instances and unreasonably decrease the overall loss.

Fig. 4 (a) and (c) reflect the difference in the learning process when the attribute ranking loss is introduced, from which we can observe: (1) When only the cross-entropy loss is used, the attribute ranking loss presents multiple peaks during training and gradually approaches a low level (about 2,000) at the end of training. In other words, the baseline itself attempts to reduce the ranking loss during training, which supports the rationality of introducing the attribute ranking loss. (2) As shown in Fig. 4 (c), the introduction of the attribute ranking loss makes the AUC rise stably during training, which indicates that the attribute ranking constraint is effective for facilitating MIL. In addition, Fig. 5 intuitively shows the superiority of the ranking loss in differentiating hard instances (bottom of (b) and (d)). The model trained with $\mathcal{L}_{rank}$ significantly reduces the wrong classification of instances compared to the baseline.

Furthermore, we explore the joint use of the spatial attribute constraint and attribute ranking constraint. As listed in Table 1, the combination of both constraints significantly improves the performance of the baseline (P-values $<$ 0.05), boosting the AUC of the baseline by 5.14% and 1.10% on Camelyon16 and TCGA-NSCLC, respectively. The results in Fig. 5 (e) visually illustrate the advantages of the dual constraints. In comparison to (b), (c), and (d), the model with both constraints performs better in localizing the tumor regions and successfully distinguishes hard negative instances. These findings show that the attribute constraints can synergistically contribute to model training, and the introduction of intra-slide and inter-slide correlations is meaningful for WSI classification. Moreover, these significant improvements demonstrate the advantage of the attribute scores, since the implementation of both loss functions relies on the attribute scoring mechanism.

4.3.3 Evaluation of Histopathology Adaptive Backbone

In AttriMIL, the histopathology adaptive backbone is used to obtain the optimized instance features. In ablation experiments, we adopted an ImageNet pre-trained ResNet-50 as a backbone, inserted adapters, and successively trained them using the cross-entropy loss function. Table 2 shows the impact of these adapters on WSI classification. From the table, we can derive the following conclusions. (1) The shallow features extracted by the feature extractor are able to reflect pathological properties to a certain extent, as the ResNet-50 with adapter1 achieves 72.31% AUC on Camelyon16 and 86.32% AUC on TCGA-NSCLC. (2) Multiple adapters facilitate the instance feature extraction, and our histopathology adaptive backbone (adapter1 + adapter2 + adapter3) achieves the best performance compared to other settings. In addition, following [26] and [36], we present the classification performance of contrastive learning methods (i.e., SimCLR [33] and MoCov2 [34]) in Table 2. In comparison with these methods, our method achieves superior AUC results. We attribute this to the fact that our approach focuses on pathological image classification, enabling the backbone to adaptively refine features at various levels for powerful task-specific representation.

4.4 Comparisons with State-of-the-Art Methods

In this part, we present classification results for both binary and multi-class classification. The binary classification tasks comprise positive/negative classification on Camelyon16 and LUSC/LUAD subtype classification on TCGA-NSCLC. The multi-class classification task refers to NORM, HP, TA.HG, TA.LG, TVA.HG, and TVA.LG classification on UniToPatho. Table 3 lists the quantitative comparisons of our AttriMIL and current state-of-the-art methods, where Mean-pooling and Max-pooling serve as reference baselines. For a fair comparison, we obtain the experimental results by rerunning their released code with their published instance features.

4.4.1 Quantitative Comparison

In Camelyon16, the tumor areas occupy only a small proportion of each positive WSI (the average proportion is less than 10%). Attention-based methods such as ABMIL [6], CLAM [7], DSMIL [26], and TransMIL [17] consistently outperform the traditional Mean-pooling and Max-pooling. In terms of AUC, ABMIL, DTFD-MIL [5], and CLAM are at least 3% lower than AttriMIL, as these methods ignore the correlation between patches in WSI classification. TransMIL and DSMIL only model the relations between instances within a single WSI, resulting in limited performance. Notably, PMIL [10] is a prototype-based MIL framework that considers the relations across WSIs. However, PMIL mainly emphasizes phenotypic differences between instances and neglects the pathology attribute, leading to inferior performance compared to AttriMIL. In TCGA-NSCLC, the positive WSI generally contains large tumor regions; consequently, all the methods perform better than on the Camelyon16 dataset. Here, AttriMIL achieves the best ACC and AUC, exceeding the second-best method by 0.88% and 0.61%, respectively. For the UniToPatho dataset, Table 3 does not list results for DGMIL and PMIL, as they do not consider multi-subtype classification tasks. UniToPatho has an unbalanced distribution of subtypes and positive areas. On this dataset, AttriMIL achieves the best F1-Score and AUC and the second-best ACC, indicating that AttriMIL can be applied to multi-class problems with unbalanced data.

Figure 7: Instance discrimination ability comparison and distribution position relation between attention and attribute scores on the Camelyon16 dataset. In (a), the positive area coverage rate indicates the proportion of the number of positive instances within the high-score instances to the total number of positive instances. For (b), we rank instances based on attention and attribute scores and present the position relation between them, where 100% represents the highest-score instance. The green lines (mean ± standard deviation) represent the ranking position where the instance’s attribute score is equal to 0.
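The positive area coverage rate in (a) can be illustrated with a short sketch. It assumes that the "high-score instances" are the top fraction of instances ranked by score; the parameter name `top_fraction` and the example values are hypothetical.

```python
from typing import Sequence


def positive_coverage_rate(scores: Sequence[float], labels: Sequence[int], top_fraction: float) -> float:
    """Share of all positive instances that fall among the top-scoring instances."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("the bag contains no positive instance")
    k = max(1, round(top_fraction * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    return hits / n_pos


scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.05, 0.3, 0.6, 0.4, 0.15]
labels = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
print(positive_coverage_rate(scores, labels, top_fraction=0.4))  # 3 of 4 positives -> 0.75
```

The same function can be applied to attention scores, attribute scores, or instance predictions, so that the three can be compared on equal footing.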

Figure 8: Out-of-distribution detection and potential applications using AttriMIL. (a) presents instance attribute score distribution (without normalized attention) and the aggregated bag score in the WSI’s region of interest (RoI) on UniToPatho. For the OOD sample, the bag attribute score in each branch is less than 0. Bar plots in (b) show the OOD performance of AttriMIL under different thresholds. When the threshold is set to 2, AttriMIL can identify more OOD samples. (c) shows the potential impact of AttriMIL on computer-assisted pathological diagnosis.

4.4.2 Qualitative Comparison

Fig. 6 intuitively illustrates the tumor localization capability of different methods, from which we can obtain three observations. (1) Compared to existing methods, our AttriMIL exhibits strong tumor localization power. As shown in the $1^{st}$ row, our method accurately captures tumors with extremely small areas. (2) Previous MIL frameworks often misidentify negative instances. By contrast, our method effectively distinguishes the negative instances in various scenes (shown in the $1^{st}$ and $2^{nd}$ rows). This is due to the introduction of attribute scoring and the attribute ranking loss, which offer an attribute measurement and a threshold (0 in this paper) for differentiating instance attributes. To a certain extent, the attribute ranking loss introduces the advantages of Max-pooling into the attention-based MIL framework for locating discriminative instances. (3) In comparison to other methods, AttriMIL effectively identifies negative regions embedded in the tumor area. For the LUAD slide, AttriMIL assigns low attribute scores to the pulmonary acinus embedded in the tumor area. For the LUSC slide, AttriMIL successfully recognizes the connective tissues, including fibers and immune cells. These results underscore AttriMIL’s advancements in providing highly interpretable and accurate results for WSI analysis.

5 Discussion and Conclusion

5.1 Relationship between Attention and Attribute Scores

In this part, we explore the relationship between attention and attribute scores. We trained an ABMIL using the cross-entropy loss function on Camelyon16 and present comparisons of various scores in Fig. 7. Fig. 7 (a) reports the instance discrimination ability of attention, attribute scores, and instance predictions, from which we can see that the attribute score outperforms the others in localizing positive regions, while attention performs poorly. Fig. 7 (b) shows the relation between attention and attribute scores, where the distribution position relation curve is U-shaped. This suggests that attention tends to focus on significantly positive and negative areas, and the attribute scoring mechanism improves attention for precise instance discrimination. Furthermore, from Eq. 4 and Eq. 6, we can see that the attention mechanism can be regarded as a special bag classifier, which weights instances in exponential form compared to the vanilla fully connected layer, enabling ABMIL to converge quickly. This explains why ABMIL performs well from the instance attribute viewpoint.

5.2 AttriMIL for Out-of-Distribution Detection

Different from previous frameworks, AttriMIL sums the attribute score of each instance for bag aggregation, where the aggregated score in each branch is meaningful due to the characteristic of attribute scores. Specifically, AttriMIL uses 0 as the threshold (refer to Eq. 11). If the bag score exceeds 0, the input WSI corresponds to the attribute of the branch; otherwise, it does not belong to this branch. This feature allows AttriMIL to detect out-of-distribution (OOD) samples, i.e., when the bag scores of all branches are less than 0, the sample is considered to be an OOD bag. To verify this point, we conducted experiments on UniToPatho, where the in-distribution (ID) samples of TA.LG, TA.HG, TVA.LG, and TVA.HG were used for training, and the NORM and HP WSIs were considered OOD samples. Fig. 8 visually shows the performance of AttriMIL in OOD detection; our method thus provides a solution for establishing a complete computer-assisted pathological diagnosis system.
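The OOD rule described above can be written as a short decision function. The sketch assumes that the bag score of a branch is the plain sum of its instance attribute scores (any bias term is omitted) and uses made-up scores; the names `bag_score` and `predict_or_flag_ood` are illustrative, and the `threshold` argument defaults to the value 0 used in the text.

```python
from typing import Dict, Optional, Sequence


def bag_score(instance_scores: Sequence[float]) -> float:
    """Aggregate a branch's instance attribute scores by summation."""
    return sum(instance_scores)


def predict_or_flag_ood(bag_scores: Dict[str, float], threshold: float = 0.0) -> Optional[str]:
    """Return the best-scoring branch, or None when every bag score is below the threshold."""
    if all(score < threshold for score in bag_scores.values()):
        return None  # out-of-distribution bag
    return max(bag_scores, key=bag_scores.get)


# Made-up bag scores for the four in-distribution branches.
in_dist = {"TA.LG": bag_score([2.0, 1.5, -0.4]), "TA.HG": -0.4, "TVA.LG": -1.2, "TVA.HG": -2.0}
ood = {"TA.LG": -0.8, "TA.HG": -1.5, "TVA.LG": -0.3, "TVA.HG": -2.2}
print(predict_or_flag_ood(in_dist))  # TA.LG
print(predict_or_flag_ood(ood))      # None
```

Raising `threshold` makes the rule stricter, so more bags are flagged as OOD at the cost of possibly rejecting in-distribution bags with weak evidence.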

5.3 Conclusion

In this paper, we propose a novel multiple instance learning (MIL) framework named Attribute-Driven MIL (AttriMIL) for whole-slide pathological image analysis. Unlike previous solutions, AttriMIL introduces an attribute scoring scheme that effectively measures the contribution of each instance to bag prediction, thereby quantifying instance attributes. Based on this quantification, two attribute constraints, namely the spatial attribute constraint and the attribute ranking constraint, are developed to model the intra-slide and inter-slide correlations among instances, respectively. These correlations enhance the instance discrimination capability of the network and enable AttriMIL to achieve accurate tumor localization. Moreover, AttriMIL applies a histopathology adaptive backbone, which takes full advantage of the pre-trained model to improve instance feature representation. Extensive experiments on three benchmarks demonstrate the superiority of our method. Furthermore, AttriMIL shows potential in handling out-of-distribution samples, providing a promising solution for building a complete pathological diagnosis system.

References

  • [1] M. Y. Lu, T. Y. Chen, D. F. Williamson, M. Zhao, M. Shady, J. Lipkova, and F. Mahmood, “Ai-based pathology predicts origins for cancers of unknown primary,” Nature, vol. 594, no. 7861, pp. 106–110, 2021.
  • [2] C. L. Srinidhi, O. Ciga, and A. L. Martel, “Deep neural network models for computational histopathology: A survey,” Medical Image Analysis, vol. 67, p. 101813, 2021.
  • [3] G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. Werneck Krauss Silva, K. J. Busam, E. Brogi, V. E. Reuter, D. S. Klimstra, and T. J. Fuchs, “Clinical-grade computational pathology using weakly supervised deep learning on whole slide images,” Nature medicine, vol. 25, no. 8, pp. 1301–1309, 2019.
  • [4] L. Qu, X. Luo, S. Liu, M. Wang, and Z. Song, “Dgmil: Distribution guided multiple instance learning for whole slide image classification,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2022, pp. 24–34.
  • [5] H. Zhang, Y. Meng, Y. Zhao, Y. Qiao, X. Yang, S. E. Coupland, and Y. Zheng, “Dtfd-mil: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 802–18 812.
  • [6] M. Ilse, J. Tomczak, and M. Welling, “Attention-based deep multiple instance learning,” in International conference on machine learning.   PMLR, 2018, pp. 2127–2136.
  • [7] M. Y. Lu, D. F. Williamson, T. Y. Chen, R. J. Chen, M. Barbieri, and F. Mahmood, “Data-efficient and weakly supervised computational pathology on whole-slide images,” Nature biomedical engineering, vol. 5, no. 6, pp. 555–570, 2021.
  • [8] H. Wang, K. Song, J. Fan, Y. Wang, J. Xie, and Z. Zhang, “Hard patches mining for masked image modeling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • [9] J. Shi, L. Tang, Y. Li, X. Zhang, Z. Gao, Y. Zheng, C. Wang, T. Gong, and C. Li, “A structure-aware hierarchical graph-based multiple instance learning framework for pt staging in histopathological image,” IEEE Transactions on Medical Imaging, 2023.
  • [10] J.-G. Yu, Z. Wu, Y. Ming, S. Deng, Y. Li, C. Ou, C. He, B. Wang, P. Zhang, and Y. Wang, “Prototypical multiple instance learning for predicting lymph node metastasis of breast cancer from whole-slide pathological images,” Medical Image Analysis, vol. 85, p. 102748, 2023.
  • [11] L. Qu, Y. Ma, X. Luo, M. Wang, and Z. Song, “Rethinking multiple instance learning for whole slide image classification: A good instance classifier is all you need,” arXiv preprint arXiv:2307.02249, 2023.
  • [12] Y. Chen, J. Bi, and J. Z. Wang, “Miles: Multiple-instance learning via embedded instance selection,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 12, pp. 1931–1947, 2006.
  • [13] M.-A. Carbonneau, V. Cheplygina, E. Granger, and G. Gagnon, “Multiple instance learning: A survey of problem characteristics and applications,” Pattern Recognition, vol. 77, pp. 329–353, 2018.
  • [14] K. J. Cheung and A. J. Ewald, “A collective route to metastasis: Seeding by tumor cell clusters,” Science, vol. 352, no. 6282, pp. 167–169, 2016.
  • [15] W. D. Travis, “Pathology of lung cancer,” Clinics in chest medicine, vol. 23, no. 1, pp. 65–81, 2002.
  • [16] R. J. Chen, C. Chen, Y. Li, T. Y. Chen, A. D. Trister, R. G. Krishnan, and F. Mahmood, “Scaling vision transformers to gigapixel images via hierarchical self-supervised learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 144–16 155.
  • [17] Z. Shao, H. Bian, Y. Chen, Y. Wang, J. Zhang, X. Ji et al., “Transmil: Transformer based correlated multiple instance learning for whole slide image classification,” Advances in neural information processing systems, vol. 34, pp. 2136–2147, 2021.
  • [18] Z. Li, Y. Jiang, M. Lu, R. Li, and Y. Xia, “Survival prediction via hierarchical multimodal co-attention transformer: A computational histology-radiology solution,” IEEE Transactions on Medical Imaging, 2023.
  • [19] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [20] X. He, C. Li, P. Zhang, J. Yang, and X. E. Wang, “Parameter-efficient model adaptation for vision transformers,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 817–825.
  • [21] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial intelligence, vol. 97, no. 1-2, pp. 273–324, 1997.
  • [22] T. Lin, Z. Yu, H. Hu, Y. Xu, and C.-W. Chen, “Interventional bag multi-instance learning on whole-slide pathological images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 830–19 839.
  • [23] P. Tourniaire, M. Ilie, P. Hofman, N. Ayache, and H. Delingette, “Ms-clam: Mixed supervision for the classification and localization of tumors in whole slide images,” Medical Image Analysis, 2023.
  • [24] L. Hou, D. Samaras, T. M. Kurc, Y. Gao, J. E. Davis, and J. H. Saltz, “Patch-based convolutional neural network for whole slide tissue image classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2424–2433.
  • [25] T. Lin, H. Xu, C. Yang, and Y. Xu, “Interventional multi-instance learning with deconfounded instance-level prediction,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  • [26] B. Li, Y. Li, and K. W. Eliceiri, “Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14 318–14 328.
  • [27] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” in International Conference on Machine Learning.   PMLR, 2019, pp. 2790–2799.
  • [28] T. Yang, Y. Zhu, Y. Xie, A. Zhang, C. Chen, and M. Li, “Aim: Adapting image models for efficient video action recognition,” arXiv preprint arXiv:2302.03024, 2023.
  • [29] H. Chen, R. Tao, H. Zhang, Y. Wang, W. Ye, J. Wang, G. Hu, and M. Savvides, “Conv-adapter: Exploring parameter efficient transfer learning for convnets,” arXiv preprint arXiv:2208.07463, 2022.
  • [30] G. Bredell, M. Fischer, P. Szostak, S. Abbasi-Sureshjani, and A. Gomariz, “Aggregation model hyperparameters matter in digital pathology,” arXiv preprint arXiv:2311.17804, 2023.
  • [31] B. Ehteshami Bejnordi, M. Veta, P. Johannes van Diest, B. van Ginneken, N. Karssemeijer, G. Litjens, J. A. W. M. van der Laak, and the CAMELYON16 Consortium, “Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer,” JAMA, vol. 318, no. 22, pp. 2199–2210, Dec. 2017. [Online]. Available: https://doi.org/10.1001/jama.2017.14585
  • [32] C. A. Barbano, D. Perlo, E. Tartaglione, A. Fiandrotti, L. Bertero, P. Cassoni, and M. Grangetto, “Unitopatho, a labeled histopathological dataset for colorectal polyps classification and adenoma dysplasia grading,” in 2021 IEEE International Conference on Image Processing (ICIP).   IEEE, 2021, pp. 76–80.
  • [33] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning.   PMLR, 2020, pp. 1597–1607.
  • [34] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” arXiv preprint arXiv:2003.04297, 2020.
  • [35] Y. Wu and K. He, “Group normalization,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19.
  • [36] M. Kang, H. Song, S. Park, D. Yoo, and S. Pereira, “Benchmarking self-supervised learning on diverse pathology datasets,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 3344–3354.