Semi-supervised variational autoencoder for cell feature extraction in multiplexed immunofluorescence images

Abstract

Advancements in digital imaging technologies have sparked increased interest in using multiplexed immunofluorescence (mIF) images to visualise and identify the interactions between specific immunophenotypes and the tumour microenvironment at the cellular level. Current state-of-the-art mIF image analysis pipelines depend on cell feature representations characterised by morphological and stain intensity-based metrics generated using simple statistical and machine learning-based tools. However, these methods are not capable of generating complex representations of cells. We propose a deep learning-based cell feature extraction model that uses a variational autoencoder, supervised through a latent subspace, to extract cell features in mIF images. We perform cell phenotype classification using a cohort of more than 44,000 mIF cell image patches extracted across 1,093 tissue microarray cores of breast cancer patients, to demonstrate the success of our model against current and alternative methods.

Index Terms— Multiplexed immunofluorescence, cell feature extraction, semi-supervised variational autoencoder, tumour microenvironment.

1 Introduction

Rapid developments in digital imaging technologies are contributing to growing interest in multiplexed immunofluorescence (mIF) imaging as a tool with potential to revolutionise diagnostics, drug development, and personalised medicine. This technique uses fluorescently labelled antibodies or markers that bind to specific target molecules in formalin-fixed paraffin-embedded (FFPE) tissue sections, allowing in-situ visualisation of multiple immunophenotypes within the tumour microenvironment (TME) and leading to accurate assessment of complex biological processes at single-cell resolution. However, mIF images present unique challenges such as spectral overlap, noisy backgrounds, extreme stain variation, complexity, and cost associated with the analysis of spatially resolved high-resolution images with multiple stains.

Currently, mIF-based TME analysis [1, 2, 3] involves several steps: spectral unmixing, tissue and cell segmentation [4, 5, 6], computation of morphological characteristics (such as shape and size measurements) and biomarker expression statistics in individual cells, nuclei and cytoplasm (using medical image analysis tools), cell clustering, and patient-level analysis carried out using machine learning or statistical approaches [7, 8].

Due to the noisy background and high stain variability in mIF images, simple segmentation methods based on image processing and basic machine learning tools do not reliably delineate nuclei and cell boundaries. Since the cell features used in current methods [1, 2, 3] are calculated directly from these segmentation masks, they may not accurately represent the complex morphological characteristics and biomarker expression patterns in mIF images. However, training advanced segmentation models requires a large number of manual annotations, which incurs substantial resource costs and is subject to inter- and intra-observer variability. Therefore, we believe that extracting more comprehensive cell features without requiring precise segmentation can have a direct impact on the subsequent TME analysis stage. Inspired by the success of neural networks, we propose a novel method for cell feature extraction in mIF images that employs a variational autoencoder (VAE) [9] with supervision via latent subspace representation.

Neural networks are widely used for extraction of high-dimensional representations from medical images [10, 11, 12, 13]. An autoencoder is a neural network that maps input images to a lower-dimensional representation using an encoder, and is often employed as a feature extraction model. VAEs improve on the traditional autoencoder by constraining the latent representation to a probability distribution, thereby forcing the model to learn complex and continuous variations within images. Moreover, supervised autoencoders [14] have recently been introduced as networks that can produce more generalisable representations by using target label prediction as an auxiliary task. However, these models were introduced for general computer vision tasks involving natural images, which differ significantly from mIF images.

To achieve a more generalised feature extraction of mIF images, our proposed method uses a VAE as the base model. Furthermore, our model uses cell phenotype label as a supervision signal to enhance generalisability and facilitate learning of cell phenotype-related features. However, obtaining cell phenotype labels annotated by expert pathologists is a costly and time consuming task with potentially high inter- and intra-observer variability. In addition, cell phenotype labels can be spuriously correlated with the presence (or absence) of respective biomarkers which can limit the learning of meaningful feature representations due to the inherent inductive biases in neural networks [15, 16]. To overcome these challenges, we use the labels generated by QuPath [4] and propose to use the full latent representation for image reconstruction and a latent subspace for the joint supervision task. We demonstrate the effectiveness of our approach in generating robust cell feature representations compared to current and alternative methods using a dataset of $n = 44,400$ mIF cell images stained using 9 fluorescently labelled antibodies, extracted from 1,093 tissue microarray (TMA) cores belonging to a cohort of 450 breast cancer patients.

2 Methodology

Figure 1 illustrates the proposed method where we follow the encoder-decoder architecture of a standard VAE [9] as our baseline model. Let $X=\{x,y\}$ where $x$ is any image of dimension $h\times w\times c$ and $y$ is the corresponding label. In our experiments, we use mIF image patches of dimension $48\times 48\times 9$ as input images and cell phenotype as labels. Let the encoder and decoder networks parameterised by $\phi$ and $\theta$ respectively be denoted as $E_{\phi}$ and $D_{\theta}$. Then the encoder network can be written as

$$E_{\phi}=\Bigl(\mu_{\phi}(x),\ \text{diag}\bigl(\sigma_{\phi}^{2}(x)\bigr)\Bigr) \tag{1}$$

which models the distribution $q_{\phi}(z|x)$ that maps input $x$ to a latent representation $z$. In Eq. (1), $\mu_{\phi}(x)$ and $\text{diag}(\sigma_{\phi}^{2}(x))$ represent the latent mean and (diagonal) covariance matrix respectively. The decoder network can be written as

$$D_{\theta}=\Bigl(\mu_{\theta}(z),\ \text{diag}\bigl(\sigma_{\theta}^{2}(z)\bigr)\Bigr) \tag{2}$$

and is used to model the likelihood distribution $p_{\theta}(x|z)$ which maps the latent representation $z$ back to data space.
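As an illustration of how a latent vector can be drawn from the diagonal Gaussian in Eq. (1), the following minimal sketch uses the reparameterisation trick of [9]. The function name `sample_latent`, the log-variance parameterisation, and the toy values are assumptions made for this example; they are not taken from the authors' implementation.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, diag(sigma^2)) as z = mu + sigma * eps, with eps ~ N(0, I)."""
    if len(mu) != len(log_var):
        raise ValueError("mu and log_var must have the same length")
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)   # sigma = sqrt(exp(log sigma^2))
        eps = rng.gauss(0.0, 1.0)    # noise that does not depend on the encoder
        z.append(m + sigma * eps)
    return z


rng = random.Random(0)                 # fixed seed, for reproducibility only
mu = [0.5, -1.0, 2.0]                  # toy latent mean (hypothetical values)
log_var = [-50.0, -50.0, -50.0]        # near-zero variance, so z stays close to mu
z = sample_latent(mu, log_var, rng)
```

Writing the sample as a deterministic function of the mean, the variance, and independent noise is what allows gradients to flow back to the encoder parameters during training.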


Fig. 1: Proposed VAE framework for cell feature extraction in mIF image data.

We can decompose the joint probability distribution of the VAE $p(x,z)$ into likelihood and prior as $p(x,z)=p(x|z)p(z)$. To infer the true posterior $p(z|x)$, from Bayes theorem we can write $p(z|x)=p(x|z)p(z)/p(x)$ where $p(x)=\int_{z}p(x,z)\,dz$. However, calculating $p(x)$ is intractable as it requires integration over all possible configurations of the latent space $z$. Therefore, we maximise the evidence lower bound (ELBO) of the log-likelihood as

$$\text{ELBO}(\theta,\phi;x)\coloneqq\mathbb{E}_{z\sim q_{\phi}}\left[\log\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\right]. \tag{3}$$
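For reference, Eq. (3) can be rewritten in the familiar two-term form by substituting $p_{\theta}(x,z)=p_{\theta}(x|z)p(z)$. The closed-form expression in the second line assumes a standard normal prior $p(z)=\mathcal{N}(0,I)$, as in [9]; the prior is not stated explicitly in this section, and the latent dimension $d$ is a symbol introduced here for illustration.

```latex
\begin{aligned}
\text{ELBO}(\theta,\phi;x)
  &= \mathbb{E}_{z\sim q_{\phi}(z|x)}\bigl[\log p_{\theta}(x|z)\bigr]
     - \mathrm{KL}\bigl(q_{\phi}(z|x)\,\|\,p(z)\bigr), \\
\mathrm{KL}\bigl(q_{\phi}(z|x)\,\|\,\mathcal{N}(0,I)\bigr)
  &= \tfrac{1}{2}\sum_{j=1}^{d}\Bigl(\mu_{\phi,j}(x)^{2}+\sigma_{\phi,j}^{2}(x)-1-\log\sigma_{\phi,j}^{2}(x)\Bigr).
\end{aligned}
```

The first term rewards faithful reconstruction and the second regularises the latent distribution towards the prior.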

Since mIF cell images exhibit high variability, we integrate a classifier $C_{\gamma}$ to increase the generalisability of our model [14]. However, the cell phenotype labels generated using QuPath can be spuriously correlated with the biomarker stains in the mIF cell image which can reduce the complexity of the latent feature space. Therefore, we propose to extract a subspace $z^{\prime}$ from the latent space $z$ (Fig. 1) where the latent space is now represented as $z=(z-z^{\prime},z^{\prime})$. The classifier $C_{\gamma}$ uses the subspace $\mu_{\phi}(x)^{\prime}$ for classification while the full latent space is used for reconstruction. The subspace $\mu_{\phi}(x)^{\prime}$ generated by $E_{\phi}$ can be used to predict the class label probabilities using cross-entropy loss as $p(y|x)=l_{\text{CE}}\bigl(y,C_{\gamma}(\mu_{\phi}(x)^{\prime})\bigr)$ and the combined objective function $L(\theta,\phi,\gamma)$ is defined as

$$L(\theta,\phi,\gamma)\coloneqq-\sum_{(x,y)}\Bigl[\text{ELBO}(\theta,\phi;x)+\alpha\cdot\log\left(l_{\text{CE}}\left(y,C_{\gamma}\right)\right)\Bigr] \tag{4}$$

where $\text{ELBO}(\theta,\phi;x)$ is a non-positive quantity and $\alpha$ is a scalar hyperparameter that controls the trade-off between the reconstruction and classification components.

Through this method, we can effectively limit the learning of spurious correlations between labels and input to only a subspace of the full latent representations $z$ and still maintain high reconstruction quality. The size of $z^{\prime}$ is fixed and dependent on the complexity of the dataset and the level of correlation between the label and input. In our study, we used a proportion of 1/8 for $z^{\prime}/z$, where $z$ and $z^{\prime}$ are vectors of size 9,216 and 1,152 respectively.
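The split between the full latent vector and the supervised subspace can be sketched as follows. The text does not say which coordinates form $z^{\prime}$; taking the last block of coordinates is an assumption that mirrors the ordering in $z=(z-z^{\prime},z^{\prime})$, and the helper name `split_latent` is hypothetical.

```python
def split_latent(mu, denominator=8):
    """Split a latent mean vector into (remainder, subspace).

    The subspace holds len(mu) / denominator coordinates. Which coordinates
    are used is an assumption of this sketch: the last block is taken.
    """
    if denominator < 1 or len(mu) % denominator != 0:
        raise ValueError("latent size must be divisible by the denominator")
    k = len(mu) // denominator
    return mu[:-k], mu[-k:]


mu = [0.0] * 9216                  # full latent mean, size reported in the text
rest, sub = split_latent(mu, 8)    # proportion z'/z = 1/8

# The classifier would consume `sub` (1,152 values), while the decoder
# would still receive the full 9,216-dimensional latent vector.
subspace_size = len(sub)
```

With a latent size of 9,216 and a proportion of 1/8, this yields the 1,152-dimensional subspace quoted above.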

The encoder of our network follows the standard convolutional neural network (CNN) architecture [17] while the decoder consists of transposed convolutional blocks followed by batch normalisation layers and leaky rectified linear unit (ReLU) activation. The loss function of our network incorporates the image reconstruction error (cross entropy loss), a Kullback-Leibler (KL) divergence term for regularisation of the latent space [9], and the classification error (cross entropy loss). The classifier network comprises linear layers followed by batch normalisation and leaky ReLU.
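The three loss components listed above can be illustrated for a single sample with the sketch below. It uses a plain weighted sum of the terms, which is an assumption and may differ in detail from Eq. (4). The standard normal prior, the helper names, and the value of `alpha` are likewise assumptions; the value of $\alpha$ used in the experiments is not reported.

```python
import math


def binary_cross_entropy(x, x_hat, eps=1e-7):
    """Summed per-pixel reconstruction cross entropy; inputs scaled to [0, 1]."""
    total = 0.0
    for target, pred in zip(x, x_hat):
        pred = min(max(pred, eps), 1.0 - eps)   # avoid log(0)
        total -= target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred)
    return total


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def class_cross_entropy(logits, label):
    """Softmax cross entropy for one sample, computed via log-sum-exp."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_sum - logits[label]


def combined_loss(x, x_hat, mu, log_var, logits, label, alpha=1.0):
    """Reconstruction + KL + alpha * classification (assumed weighted sum)."""
    recon = binary_cross_entropy(x, x_hat)
    kl = kl_to_standard_normal(mu, log_var)
    ce = class_cross_entropy(logits, label)
    return recon + kl + alpha * ce


# Toy example: two pixels, one latent dimension, two classes.
loss = combined_loss(
    x=[1.0, 0.0], x_hat=[0.5, 0.5],
    mu=[0.0], log_var=[0.0],
    logits=[0.0, 0.0], label=0, alpha=2.0,
)
```

In the proposed model the `logits` would be produced by the classifier from the latent subspace only, whereas `x_hat` would be decoded from the full latent vector.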

3 Experiments and results

3.1 Dataset and experimental details

Our dataset consists of 18 TMAs, each with 9 registered channels corresponding to 9 biomarker stains (PD1, CD140b, CD146, Thy1, PanCK, CD8, $\alpha$-SMA, CD31, DAPI) plus an additional registered autofluorescence channel. We also observe that PanCK and CD140b are the dominant stains in this dataset. Each TMA contains cores of approximately 1.25 µm in diameter scanned at 0.5 µm/pixel resolution. Our dataset consists of 1,093 TMA cores collected from a cohort of 450 breast cancer patients.

First, we use QuPath’s [4] built-in tools based on the watershed algorithm to segment cells and predict cell centres. Following this step, we generate a dataset of $>4.1$ million predictions of cell centres. We depend on image labels generated from QuPath to label the cell phenotypes. Initially, a small number of cell detections are annotated from a single slide for biomarker positivity. Then, we train a classifier using QuPath’s built-in tools and use it to classify all cells in the dataset.


Fig. 2: Example mIF image patches. (a) Expanded views of cells in a mIF TMA core. (b) Each channel of the mIF image captures the presence of a respective biomarker.

To clean the labels, we first remove cells that do not have any biomarker positivity. Then, we remove positive labels for cells whose maximum signal detection (within a cell region segmented using QuPath) is less than 0.5, which can be assumed to be a noisy label. We make further adjustments to narrow down the positive labels for the dominant stains PanCK and CD140b as follows. Since the PanCK stain can be found relatively evenly distributed across the cytoplasm of tumour cells, we remove any cells with positive PanCK detections where the mean expression of PanCK within the cytoplasm region is less than 0.5. Then, we remove positive PanCK and CD140b detections where the maximum signal within the cell region is less than 1% of the maximum of the stain for the respective slide. Finally, the cell phenotypes are categorised based on biomarker positivity predicted by QuPath. The resulting classes and their share of the dataset are: 62.55% tumour (PanCK+), 22.59% iCAFs (CD140b+), 7.78% myCAFs ($\alpha$-SMA+), 3.44% T-cells (CD8+), 2.50% dPVLs (CD140+/CD146+), and 0.92% exhausted T-cells (PD1+). Due to the significant class imbalance in the dataset, we remove the two classes corresponding to the lowest cell populations, which are blood vessels (CD31+) at 0.18% and imPVLs (Thy1+) at 0.03%.
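The first three cleaning rules described above can be sketched as a small filter. The record layout (`positive`, `max_signal`, `panck_cyto_mean`) and the function name are hypothetical, the slide-level 1% rule is omitted, and dropping a cell that is left with no positive label is an assumption of this sketch rather than something stated in the text.

```python
def clean_cell(cell, min_max_signal=0.5, min_panck_cyto_mean=0.5):
    """Return the cleaned set of positive markers, or None to drop the cell.

    `cell` is a hypothetical record with the keys:
      "positive":        set of marker names called positive by the classifier
      "max_signal":      dict of marker -> maximum intensity within the cell
      "panck_cyto_mean": mean PanCK intensity within the cytoplasm region
    """
    positive = set(cell["positive"])

    # Rule 1: drop cells without any biomarker positivity.
    if not positive:
        return None

    # Rule 2: weak positives are treated as noisy labels and cleared.
    positive = {
        marker for marker in positive
        if cell["max_signal"].get(marker, 0.0) >= min_max_signal
    }

    # Rule 3: PanCK is expected to be spread across the tumour cell cytoplasm,
    # so PanCK+ cells with a low cytoplasmic mean are dropped.
    if "PanCK" in positive and cell["panck_cyto_mean"] < min_panck_cyto_mean:
        return None

    # Assumption: a cell left with no positive label is also dropped.
    return positive or None


example = {
    "positive": {"CD8", "PD1"},
    "max_signal": {"CD8": 0.8, "PD1": 0.2},
    "panck_cyto_mean": 0.0,
}
cleaned = clean_cell(example)   # PD1 is cleared because its maximum is below 0.5
```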

The final dataset consists of $n = 44,400$ randomly selected cell detections representing six cell phenotypes (tumour, iCAFs, myCAFs, T-cells, dPVLs, and exhausted T-cells) in equal proportion across all TMA slides. We use the predicted cell centres as approximations of the actual cell centres and extract cell patches of size $48\times 48\times 9$ pixels (Fig. 2). We assume that a patch size of $48\times 48$ pixels, which corresponds to a tissue area of $24\times 24$ µm$^2$, is adequate to capture the complete cell image for all cell phenotypes based on the average size of cells in the TME [18]. Due to the extreme variability in biomarker intensity within the TMA and across different slides, it is necessary to normalise the images before feeding them as input to our model. We set a lower threshold of 0.5 for the dominant stains PanCK and CD140b while setting a lower threshold of 0.3 for the remaining 7 stains. The upper threshold is set to the maximum intensity observed for each biomarker at slide level. Subsequently, biomarker intensities are scaled using the min-max normalisation technique.
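The per-channel normalisation described above can be sketched as follows. The text gives the thresholds but not how values outside them are handled; clipping to the thresholds before min-max scaling is an assumption here, and the function names and toy intensities are hypothetical.

```python
# Lower thresholds from the text: 0.5 for the dominant stains, 0.3 otherwise.
LOWER_THRESHOLD = {"PanCK": 0.5, "CD140b": 0.5}
DEFAULT_LOWER = 0.3


def normalise_channel(pixels, lower, upper):
    """Clip intensities to [lower, upper], then min-max scale them to [0, 1]."""
    if upper <= lower:
        raise ValueError("upper threshold must exceed lower threshold")
    span = upper - lower
    return [(min(max(p, lower), upper) - lower) / span for p in pixels]


def normalise_patch(patch, slide_max):
    """Normalise each marker channel of a patch.

    patch:     dict of marker -> flat list of pixel intensities
    slide_max: dict of marker -> maximum intensity observed on the slide
    """
    return {
        marker: normalise_channel(
            values, LOWER_THRESHOLD.get(marker, DEFAULT_LOWER), slide_max[marker]
        )
        for marker, values in patch.items()
    }


patch = {"PanCK": [0.2, 0.5, 1.5, 2.5], "CD8": [0.3, 1.3]}   # toy intensities
slide_max = {"PanCK": 2.5, "CD8": 2.3}                        # toy slide maxima
normalised = normalise_patch(patch, slide_max)
```

Under this reading, intensities at or below the lower threshold map to 0 and the slide-level maximum maps to 1.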

We split the dataset with stratification to allocate 20% as the test set and 80% as the training set, further split as 85% for actual training and 15% for validation. All models are trained with a learning rate of $2\times 10^{-5}$ for 1,000 epochs with a batch size of 128 using Tesla V100 GPUs.
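The split sizes implied by these percentages can be checked with a small stratified split over a balanced label list. The helper `stratified_split`, the seed, and the rounding rule are assumptions for illustration; the authors' splitting code is not described.

```python
import random
from collections import defaultdict


def stratified_split(labels, fraction, rng):
    """Return (held_out, kept) index lists, holding out `fraction` of each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    held_out, kept = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_held = round(len(indices) * fraction)
        held_out.extend(indices[:n_held])
        kept.extend(indices[n_held:])
    return held_out, kept


# 44,400 cells in six balanced classes (7,400 each), as in the final dataset.
labels = [c for c in range(6) for _ in range(7400)]
rng = random.Random(0)

# 20% test, then 15% of the remaining 80% for validation.
test_idx, trainval_idx = stratified_split(labels, 0.20, rng)
trainval_labels = [labels[i] for i in trainval_idx]
val_pos, train_pos = stratified_split(trainval_labels, 0.15, rng)
val_idx = [trainval_idx[i] for i in val_pos]
train_idx = [trainval_idx[i] for i in train_pos]

sizes = (len(train_idx), len(val_idx), len(test_idx))
```

For a balanced dataset of 44,400 cells this gives 30,192 training, 5,328 validation, and 8,880 test samples, the last of which matches the test-set size reported in Tables 1 and 2.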

Table 1: Comparison of cell classification results for different models. Accuracy, precision, and recall are test results (n = 8,880).

| Method | Image Size | Embedding Size | Accuracy | Precision | Recall |
|---|---|---|---|---|---|
| ResNet50 Pretrained on ImageNet with PCA | $48\times 48\times 9^{*}$ | 1,152 | 0.7191 | 0.7180 | 0.7215 |
| Standard VAE [9] | $48\times 48\times 9$ | 9,216 | 0.8074 | 0.8147 | 0.8125 |
| Morphological Features QuPath [4] | - | 156 | 0.8084 | 0.8136 | 0.8088 |
| Semi-Supervised Autoencoder [14] | $48\times 48\times 9$ | 9,216 | 0.8201 | 0.8464 | 0.8257 |
| Proposed Model | $48\times 48\times 9$ | 1,152 | 0.8486 | 0.8654 | 0.8507 |

$^{*}$ Features extracted from image patches using pretrained model used as input in the classification task.


Fig. 3: Reconstruction and classification results. (a,b) Reconstruction results of our model accurately capture cellular features while minimising noise. (c) The confusion matrix shows that our model is capable of classifying all cell phenotypes with high accuracy. Exhausted T-cells are more prone to be categorised as T-cells, which may be due to the variation in staining strength of CD8+ and PD1+ biomarkers.

Table 2: Comparison of results for the proportion of latent subspace used in the classification network. Accuracy, precision, and recall are test results (n = 8,880).

| Proportion ($z^{\prime}/z$) | Feature Size ($z^{\prime}$) | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 1 | 9,216 | 0.8377 | 0.8571 | 0.8420 |
| 1/2 | 4,608 | 0.8383 | 0.8645 | 0.8429 |
| 1/8 | 1,152 | 0.8486 | 0.8654 | 0.8507 |
| 1/16 | 576 | 0.8438 | 0.8641 | 0.8462 |

3.2 Results

To evaluate the performance of our method, we compare the cell phenotype classification accuracy using the feature representations of cell patches extracted using different models (Table 1). Additionally, we present the confusion matrix for the 6-class classification as well as some qualitative results in the form of cell image reconstructions generated using our model (Fig. 3). For all comparisons we keep the classification network the same except for the size of the input layer, which corresponds to the size of the feature vector. We train a standard VAE [9] and a semi-supervised autoencoder [14] with the same dataset split and use their latent representations for our evaluation. In addition, the results obtained using a ResNet50 pretrained on the ImageNet dataset are presented as a benchmark experiment. We extract a vector of size 2,048 for each channel using the pretrained ResNet50 ($2,048\times 9$ features per cell patch) and reduce the dimensionality of each channel to 128 using principal component analysis (PCA) ($128\times 9$ features per cell patch). The features obtained from the pretrained ResNet50 with PCA exhibit considerably lower performance, possibly owing to the substantial dissimilarity between the image domains. To compare the performance of our model to the handcrafted features used in current state-of-the-art methods [1, 2, 3], we extract 156 important cell features based on the morphological and intensity features of the nuclear segmentation masks. These include 6 contour features for nucleus and cell (such as area, circularity, and eccentricity), nucleus/cell ratio, distance to annotations, and 140 intensity features of each biomarker (such as mean, range, and standard deviation) for nucleus, cell, and cytoplasm regions.
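The reported metrics can be computed from predicted and true labels as in the sketch below. The averaging scheme behind the precision and recall columns is not stated in the text; macro averaging over classes is assumed here, and the function name and toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision and recall from a confusion matrix."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    # confusion[t][p] counts samples of true class t predicted as class p.
    confusion = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        confusion[t][p] += 1

    correct = sum(confusion[c][c] for c in range(n_classes))
    precisions, recalls = [], []
    for c in range(n_classes):
        predicted_c = sum(confusion[r][c] for r in range(n_classes))
        actual_c = sum(confusion[c])
        precisions.append(confusion[c][c] / predicted_c if predicted_c else 0.0)
        recalls.append(confusion[c][c] / actual_c if actual_c else 0.0)

    return {
        "accuracy": correct / len(y_true),
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "confusion": confusion,
    }


# Toy example with three classes (the paper uses six phenotypes).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = classification_metrics(y_true, y_pred, n_classes=3)
```

The same confusion matrix underlies panel (c) of Fig. 3, where off-diagonal entries show which phenotypes are mistaken for one another.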

The results of our experiments (Table 1) confirm that the use of labelled data as a supervisory signal to train a latent subspace of the VAE can be useful in retaining cellular-level features that are relevant to the subsequent tasks in cellular-level analysis of the tumour microenvironment. However, it is important to note that the latent subspace sampling approach is more effective if an optimal subspace size is selected (Table 2). The selection of this size depends on the correlation between the labels and the input as well as on the quality of the labels available. If the labels are highly correlated with the images, it is recommended to keep the subspace smaller, as less information is required for prediction. We found that the optimal proportion for $z^{\prime}/z$ in this experiment is 1/8.

4 Conclusion

We present a semi-supervised VAE for cell feature extraction in mIF images. We use labels generated by QuPath [4] in order to reduce the reliance on large, exhaustively annotated datasets of cell images. We propose to use a subsection of the latent space for classification and the full latent space for reconstruction, thereby limiting the model from learning spurious correlations between the input and labels. By comparing the results of our method against the current state-of-the-art work, we show that the proposed model is capable of extracting more robust representations of cells. In future work, we aim to expand our experiments to larger datasets and carry out further experiments to evaluate the importance of cellular structures towards patient outcome. We believe our method holds significant potential for advancing research in the analysis of cellular interaction within the TME using mIF images.

5 Acknowledgments

This research was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), supported by the Australian Government.

6 Ethics approval

Ethical approval for this study was provided by the South Eastern Sydney Local Health District Human Research Ethics Committee at Prince of Wales Hospital (2018/ETH00138 and HREC 96/16) who granted a waiver of consent to perform research analyses on the tissue blocks. All methods were performed in accordance with the relevant institutional guidelines and regulations.

References

  • [1] J. Wang et al., “Multiplexed immunofluorescence identifies high stromal CD68+PD-L1+ macrophages as a predictor of improved survival in triple negative breast cancer,” Scientific Reports, vol. 11, no. 1, pp. 1–12, 2021.
  • [2] J. Kim et al., “Unsupervised discovery of tissue architecture in multiplexed imaging,” Nature Methods, vol. 19, no. 12, pp. 1653–1661, 2022.
  • [3] A. Viratham Pulsawatdi et al., “A robust multiplex immunofluorescence and digital pathology workflow for the characterisation of the tumour immune microenvironment,” Molecular Oncology, vol. 14, no. 10, pp. 2384–2402, 2020.
  • [4] P. Bankhead et al., “QuPath: Open source software for digital pathology image analysis,” Scientific Reports, vol. 7, no. 1, pp. 1–7, 2017.
  • [5] C. Stringer, T. Wang, M. Michaelos, and M. Pachitariu, “Cellpose: a generalist algorithm for cellular segmentation,” Nature Methods, vol. 18, no. 1, pp. 100–106, 2021.
  • [6] M. Pachitariu and C. Stringer, “Cellpose 2.0: how to train your own model,” Nature Methods, vol. 19, no. 12, pp. 1634–1641, 2022.
  • [7] J. H. Levine et al., “Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis,” Cell, vol. 162, no. 1, pp. 184–197, 2015.
  • [8] S. van Gassen et al., “FlowSOM: Using self-organizing maps for visualization and interpretation of cytometry data,” Cytometry, vol. 87, no. 7, pp. 636–645, 2015.
  • [9] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” Proceedings of the International Conference on Learning Representations (ICLR), pp. 1–14, 2014.
  • [10] D. Tellez, G. Litjens, J. Van Der Laak, and F. Ciompi, “Neural image compression for gigapixel histopathology image analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 2, pp. 567–578, 2021.
  • [11] B. Li, Y. Li, and K. W. Eliceiri, “Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14313–14323, 2021.
  • [12] H. Zhang et al., “DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18802–18812, 2022.
  • [13] E. Wulczyn et al., “Interpretable survival prediction for colorectal cancer using deep learning,” npj Digital Medicine, vol. 4, p. 71, 2021.
  • [14] L. Le, A. Patterson, and M. White, “Supervised autoencoders: Improving generalization performance with unsupervised regularizers,” Advances in Neural Information Processing Systems, vol. 31, pp. 107–117, 2018.
  • [15] R. Geirhos et al., “Shortcut learning in deep neural networks,” Nature Machine Intelligence, vol. 2, no. 11, pp. 665–673, 2020.
  • [16] M. Moayeri, P. Pope, Y. Balaji, and S. Feizi, “A comprehensive study of image classification model sensitivity to foregrounds, backgrounds, and visual attributes,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19065–19075, 2022.
  • [17] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • [18] B. Shashni et al., “Size-based differentiation of cancer and normal cells by a particle size analyzer assisted by a cell-recognition pc software,” Biological and Pharmaceutical Bulletin, vol. 41, no. 4, pp. 487–503, 2018.