Semi-supervised variational autoencoder for cell feature extraction in multiplexed immunofluorescence images

Abstract

Advancements in digital imaging technologies have sparked increased interest in using multiplexed immunofluorescence (mIF) images to visualise and identify the interactions between specific immunophenotypes and the tumour microenvironment at the cellular level. Current state-of-the-art mIF image analysis pipelines depend on cell feature representations characterised by morphological and stain intensity-based metrics generated using simple statistical and machine learning-based tools. However, these methods are not capable of generating complex representations of cells. We propose a deep learning-based cell feature extraction model that uses a variational autoencoder, supervised through a latent subspace, to extract cell features in mIF images. We perform cell phenotype classification using a cohort of more than 44,000 mIF cell image patches extracted across 1,093 tissue microarray cores of breast cancer patients to demonstrate the success of our model against current and alternative methods.

Index Terms— Multiplexed immunofluorescence, cell feature extraction, semi-supervised variational autoencoder, tumour microenvironment.

1 Introduction

Rapid developments in digital imaging technologies are contributing to growing interest in multiplexed immunofluorescence (mIF) imaging as a tool with potential to revolutionise diagnostics, drug development, and personalised medicine. This technique uses fluorescently labelled antibodies or markers that bind to specific target molecules in formalin-fixed paraffin-embedded (FFPE) tissue sections, allowing in-situ visualisation of multiple immunophenotypes within the tumour microenvironment (TME) and leading to accurate assessment of complex biological processes at single-cell resolution. However, mIF images present unique challenges such as spectral overlap, noisy backgrounds, and extreme stain variation, as well as the complexity and cost associated with the analysis of spatially resolved high-resolution images with multiple stains.

Currently, mIF-based TME analysis [1, 2, 3] involves several steps: spectral unmixing, tissue and cell segmentation [4, 5, 6], computation of morphological characteristics (such as shape and size measurements) and biomarker expression statistics in individual cells, nuclei and cytoplasm (using medical image analysis tools), cell clustering, and patient-level analysis carried out using machine learning or statistical approaches [7, 8].

Due to the noisy background and high stain variability in mIF images, simple segmentation methods based on image processing and basic machine learning tools do not reliably describe nuclei and cell boundaries. Since the cell features used in current methods [1, 2, 3] are calculated directly from these segmentation masks, it is possible that these features cannot accurately represent the complex morphological characteristics and biomarker expression patterns in mIF images. However, training advanced segmentation models requires a large number of manual annotations, incurring substantial resource costs while being subject to inter- and intra-observer variability. Therefore, we believe that extraction of more comprehensive cell features without requiring precise segmentation can have a direct impact on the subsequent TME analysis stage. Inspired by the success of neural networks, we propose a novel method for cell feature extraction in mIF images that employs a variational autoencoder (VAE) [9] with supervision via latent subspace representation.

Neural networks are widely used for extraction of high-dimensional representations from medical images [10, 11, 12, 13]. An autoencoder is a neural network that maps input images to a lower-dimensional representation using an encoder, and is often employed as a feature extraction model. VAEs improve on the traditional autoencoder by encoding the latent representation as a probability distribution, thereby forcing the model to learn complex and continuous variations within images. Moreover, supervised autoencoders [14] were recently introduced as networks that can produce more generalisable representations by using target label prediction as an auxiliary task. However, these models were introduced for general computer vision tasks involving natural images, which differ significantly from mIF images.

To achieve a more generalised feature extraction of mIF images, our proposed method uses a VAE as the base model. Furthermore, our model uses cell phenotype labels as a supervision signal to enhance generalisability and facilitate learning of cell phenotype-related features. However, obtaining cell phenotype labels annotated by expert pathologists is a costly and time-consuming task with potentially high inter- and intra-observer variability. In addition, cell phenotype labels can be spuriously correlated with the presence (or absence) of the respective biomarkers, which can limit the learning of meaningful feature representations due to the inherent inductive biases in neural networks [15, 16]. To overcome these challenges, we use the labels generated by QuPath [4] and propose to use the full latent representation for image reconstruction and a latent subspace for the joint supervision task. We demonstrate the effectiveness of our approach in generating robust cell feature representations compared to current and alternative methods using a dataset of $n=44,400$ mIF cell images stained using 9 fluorescently labelled antibodies, extracted from 1,093 tissue microarray (TMA) cores belonging to a cohort of 450 breast cancer patients.

2 Methodology

Figure 1 illustrates the proposed method, in which we follow the encoder-decoder architecture of a standard VAE [9] as our baseline model. Let $X=\{x,y\}$ where $x$ is any image of dimension $h\times w\times c$ and $y$ is the corresponding label. In our experiments, we use mIF image patches of dimension $48\times 48\times 9$ as input images and cell phenotype as labels. Let the encoder and decoder networks parameterised by $\phi$ and $\theta$ respectively be denoted as $E_{\phi}$ and $D_{\theta}$. Then the encoder network can be written as

$$E_{\phi}=\Bigl(\mu_{\phi}(x),\ \text{diag}\bigl(\sigma_{\phi}^{2}(x)\bigr)\Bigr) \tag{1}$$

which models the distribution $q_{\phi}(z|x)$ that maps input $x$ to a latent representation $z$. In Eq. (1), $\mu_{\phi}(x)$ and $\text{diag}(\sigma_{\phi}^{2}(x))$ represent the latent mean and (diagonal) covariance matrix respectively. The decoder network can be written as

$$D_{\theta}=\Bigl(\mu_{\theta}(z),\ \text{diag}\bigl(\sigma_{\theta}^{2}(z)\bigr)\Bigr) \tag{2}$$

and is used to model the likelihood distribution $p_{\theta}(x|z)$ which maps the latent representation $z$ back to data space.

Fig. 1: Proposed VAE framework for cell feature extraction in mIF image data.

We can decompose the joint probability distribution of the VAE $p(x,z)$ into likelihood and prior as $p(x,z)=p(x|z)p(z)$. To infer the true posterior $p(z|x)$, Bayes' theorem gives $p(z|x)=p(x|z)p(z)/p(x)$, where $p(x)=\int_{z}p(x,z)dz$. However, calculating $p(x)$ is intractable as it requires integration over all possible configurations of the latent variable $z$. Therefore, we maximise the evidence lower bound (ELBO) of the log-likelihood as

$$\text{ELBO}(\theta,\phi;x)\coloneqq\mathbb{E}_{z\sim q_{\phi}}\left[\log\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\right]. \tag{3}$$
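For reference, the bound in Eq. (3) can be split into the two terms that are usually optimised in practice. The first line is the standard decomposition from [9] and follows from $p_{\theta}(x,z)=p_{\theta}(x|z)p(z)$. The closed-form expression in the second line additionally assumes a standard normal prior $p(z)=\mathcal{N}(0,I)$ and writes $d$ for the latent dimension; neither assumption is stated explicitly in the text above.

$$
\begin{aligned}
\text{ELBO}(\theta,\phi;x)
  &= \mathbb{E}_{z\sim q_{\phi}(z|x)}\bigl[\log p_{\theta}(x|z)\bigr]
     - \text{KL}\bigl(q_{\phi}(z|x)\,\|\,p(z)\bigr), \\
\text{KL}\bigl(q_{\phi}(z|x)\,\|\,\mathcal{N}(0,I)\bigr)
  &= \tfrac{1}{2}\sum_{j=1}^{d}\Bigl(\mu_{\phi,j}(x)^{2}+\sigma_{\phi,j}^{2}(x)
     -\log\sigma_{\phi,j}^{2}(x)-1\Bigr).
\end{aligned}
$$

The first term rewards accurate reconstruction of $x$ from $z$, while the KL term regularises the approximate posterior towards the prior.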

Since mIF cell images exhibit high variability, we integrate a classifier $C_{\gamma}$ to increase the generalisability of our model [14]. However, the cell phenotype labels generated using QuPath can be spuriously correlated with the biomarker stains in the mIF cell image, which can reduce the complexity of the latent feature space. Therefore, we propose to extract a subspace $z^{\prime}$ from the latent space $z$ (Fig. 1), where the latent space is now represented as $z=(z-z^{\prime},z^{\prime})$. The classifier $C_{\gamma}$ uses the subspace $\mu_{\phi}(x)^{\prime}$ for classification, while the full latent space is used for reconstruction. The subspace $\mu_{\phi}(x)^{\prime}$ generated by $E_{\phi}$ can be used to predict the class label probabilities using cross-entropy loss as $p(y|x)=l_{\text{CE}}\bigl(y,C_{\gamma}(\mu_{\phi}(x)^{\prime})\bigr)$, and the combined objective function $L(\theta,\phi,\gamma)$ is defined as

$$L(\theta,\phi,\gamma)\coloneqq-\sum_{(x,y)}\Bigl[\text{ELBO}(\theta,\phi;x)+\alpha\cdot\log\bigl(l_{\text{CE}}(y,C_{\gamma})\bigr)\Bigr] \tag{4}$$

where $\text{ELBO}(\theta,\phi;x)$ is a non-positive quantity and $\alpha$ is a scalar hyperparameter that controls the trade-off between the reconstruction and classification components.

Through this method, we can effectively limit the learning of spurious correlations between labels and input to only a subspace of the full latent representation $z$ and still maintain high reconstruction quality. The size of $z^{\prime}$ is fixed and depends on the complexity of the dataset and the level of correlation between the label and input. In our study, we used a proportion of 1/8 for $z^{\prime}/z$, where $z$ and $z^{\prime}$ are vectors of size $9,216$ and $1,152$ respectively.
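The subspace split can be illustrated with a minimal NumPy sketch. It assumes that $z^{\prime}$ is the trailing block of the latent mean, which is one possible reading of $z=(z-z^{\prime},z^{\prime})$; the text does not say which entries are selected. The single linear softmax layer is only a stand-in for $C_{\gamma}$, and the names `split_latent` and `classify_from_subspace` are hypothetical.

```python
import numpy as np

LATENT_DIM = 9216    # size of z reported in the text
SUBSPACE_DIM = 1152  # size of z' (1/8 of z)

def split_latent(mu):
    """Split latent means into (remainder, subspace).

    Assumption: z' is taken to be the trailing SUBSPACE_DIM entries.
    """
    return mu[:, :-SUBSPACE_DIM], mu[:, -SUBSPACE_DIM:]

def classify_from_subspace(mu, weights, bias):
    """Toy linear softmax classifier that only sees the subspace mu'."""
    _, mu_sub = split_latent(mu)
    logits = mu_sub @ weights + bias
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
mu = rng.normal(size=(4, LATENT_DIM))                     # stand-in for encoder output
weights = rng.normal(scale=0.01, size=(SUBSPACE_DIM, 6))  # six phenotype classes
bias = np.zeros(6)

rest, sub = split_latent(mu)
probs = classify_from_subspace(mu, weights, bias)
print(rest.shape, sub.shape, probs.shape)
```

Because the classifier never reads the remaining 8,064 entries, gradients from the classification loss reach only the subspace directly, while the decoder would still receive all 9,216 entries.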

The encoder of our network follows the standard convolutional neural network (CNN) architecture [17], while the decoder consists of transposed convolutional blocks followed by batch normalisation layers and leaky rectified linear unit (ReLU) activation. The loss function of our network incorporates the image reconstruction error (cross-entropy loss), a Kullback-Leibler (KL) divergence term for regularisation of the latent space [9], and the classification error (cross-entropy loss). The classifier network consists of linear layers followed by batch normalisation and leaky ReLU.
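The three loss terms listed above can be sketched in NumPy as follows. This is an illustrative reading of Eq. (4), not the authors' implementation: it assumes a standard normal prior for the KL term, inputs scaled to $[0,1]$ for the pixel-wise cross entropy, and interprets $\log(l_{\text{CE}})$ as the log-probability of the true class, so that its negative is the usual cross-entropy loss. All function names are hypothetical, and `alpha=1.0` is a placeholder because the value used in the experiments is not reported.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), one value per sample."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=1)

def reconstruction_cross_entropy(x, x_hat, eps=1e-7):
    """Pixel-wise cross entropy for inputs in [0, 1], summed per sample."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    per_pixel = -(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
    return per_pixel.reshape(len(x), -1).sum(axis=1)

def classification_cross_entropy(y, probs, eps=1e-7):
    """Cross entropy between integer labels y and predicted class probabilities."""
    return -np.log(np.clip(probs[np.arange(len(y)), y], eps, 1.0))

def combined_loss(x, x_hat, mu, log_var, y, probs, alpha=1.0):
    """Negative ELBO (reconstruction + KL) plus alpha-weighted classification loss."""
    neg_elbo = reconstruction_cross_entropy(x, x_hat) + kl_to_standard_normal(mu, log_var)
    return float(np.sum(neg_elbo + alpha * classification_cross_entropy(y, probs)))
```

In this form, a larger `alpha` shifts the optimisation towards the classification term, which matches the trade-off role that $\alpha$ plays in Eq. (4).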

3 Experiments and results

3.1 Dataset and experimental details

Our dataset consists of 18 TMAs, each with 9 channels corresponding to 9 biomarker stains (PD1, CD140b, CD146, Thy1, PanCK, CD8, $\alpha$-SMA, CD31, DAPI) and an additional autofluorescence channel, all of which are registered. We also observe that PanCK and CD140b are the dominant stains in this dataset. Each TMA contains cores of approximately 1.25 mm in diameter scanned at 0.5 µm/pixel resolution. In total, the dataset comprises 1,093 TMA cores collected from a cohort of 450 breast cancer patients.

First, we use QuPath’s [4] built-in tools based on the watershed algorithm to segment cells and predict cell centres. This step yields a dataset of $>4.1$ million predicted cell centres. We rely on labels generated by QuPath to assign the cell phenotypes. Initially, a small number of cell detections from a single slide are annotated for biomarker positivity. Then, we train a classifier using QuPath’s built-in tools and use it to classify all cells in the dataset.

To clean the labels, we first remove cells that do not have any biomarker positivity. Then, we remove positive labels for cells whose maximum signal (within a cell region segmented using QuPath) is less than 0.5, as these can be assumed to be noisy labels. We make further adjustments to narrow down the positive labels for the dominant stains PanCK and CD140b as follows. Since the PanCK stain is distributed relatively evenly across the cytoplasm of tumour cells, we remove any cells with positive PanCK detections where the mean expression of PanCK within the cytoplasm region is less than 0.5. Then, we remove positive PanCK and CD140b detections where the maximum within the cell region is less than 1% of the maximum of the stain for the respective slide. Finally, the cell phenotypes are categorised based on the biomarker positivity predicted by QuPath. The resulting classes and their share of the dataset are 62.55% tumour (PanCK+), 22.59% iCAFs (CD140b+), 7.78% myCAFs ($\alpha$-SMA+), 3.44% T-cells (CD8+), 2.50% dPVLs (CD140+/CD146+), and 0.92% exhausted T-cells (PD1+). Due to the significant class imbalance in the dataset, we remove the two classes with the smallest cell populations, which are blood vessels (CD31+, $0.18\%$) and imPVLs (Thy1+, $0.03\%$).

The final dataset consists of $n=44,400$ randomly selected cell detections representing six cell phenotypes (tumour, iCAFs, myCAFs, T-cells, dPVLs, and exhausted T-cells) in equal proportion across all TMA slides. We use the predicted cell centres as approximations of the actual cell centres and extract cell patches of size $48\times 48\times 9$ pixels (Fig. 2). We assume that a patch size of $48\times 48$ pixels, which corresponds to a tissue area of $24\times 24$ µm$^{2}$, is adequate to capture the complete cell image for all cell phenotypes based on the average size of cells in the TME [18]. Due to the extreme variability in biomarker intensity within each TMA and across different slides, it is necessary to normalise the images before feeding them as input to our model. We set a lower threshold of 0.5 for the dominant stains PanCK and CD140b and a lower threshold of 0.3 for the remaining 7 stains. The upper threshold is set to the maximum intensity observed for each biomarker at slide level. Subsequently, biomarker intensities are scaled using min-max normalisation.
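The thresholding and min-max scaling described above could look roughly like the following NumPy sketch. Several details are assumptions made for illustration: the channel order in `CHANNELS` is hypothetical, intensities below the lower threshold are clipped to it (the text does not say how they are handled), and the function name `normalise_patch` is not from the original work.

```python
import numpy as np

# Hypothetical channel order; the actual ordering in the dataset is not stated.
CHANNELS = ["PD1", "CD140b", "CD146", "Thy1", "PanCK", "CD8", "aSMA", "CD31", "DAPI"]
DOMINANT = {"PanCK", "CD140b"}

def normalise_patch(patch, slide_max):
    """Clip each channel to [lower, slide_max] and rescale it to [0, 1].

    patch: float array of shape (48, 48, 9).
    slide_max: per-channel maximum intensity of the slide, shape (9,).
    """
    lower = np.array([0.5 if name in DOMINANT else 0.3 for name in CHANNELS])
    # Guard against slides whose maximum falls below the lower threshold.
    upper = np.maximum(np.asarray(slide_max, dtype=float), lower)
    clipped = np.clip(patch, lower, upper)
    span = np.maximum(upper - lower, 1e-12)  # avoid division by zero
    return (clipped - lower) / span
```

With this reading, background pixels below the threshold map to 0 and the brightest pixel of a stain on a slide maps to 1, which keeps patches from different slides on a comparable scale.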

We split the dataset with stratification to allocate $20\%$ as the test set and $80\%$ as the training set, further split as $85\%$ for actual training and $15\%$ for validation. All models are trained with a learning rate of $2\times 10^{-5}$ for 1,000 epochs with a batch size of 128 using Tesla V100 GPUs.
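The split sizes implied by these percentages can be checked with a small stratified-split sketch in NumPy. The helper `stratified_split` is hypothetical and the random seed is arbitrary; only the class balance (six classes of 7,400 cells) and the percentages come from the text.

```python
import numpy as np

def stratified_split(labels, fraction, rng):
    """Return (held_out_idx, remaining_idx), holding out `fraction` of each class."""
    held_out, remaining = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_held = int(round(fraction * len(idx)))
        held_out.append(idx[:n_held])
        remaining.append(idx[n_held:])
    return np.concatenate(held_out), np.concatenate(remaining)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(6), 7400)  # 44,400 cells, six balanced classes

# 20% test, then 15% of the remaining 80% for validation.
test_idx, trainval_idx = stratified_split(labels, 0.20, rng)
val_rel, train_rel = stratified_split(labels[trainval_idx], 0.15, rng)
val_idx, train_idx = trainval_idx[val_rel], trainval_idx[train_rel]

print(len(train_idx), len(val_idx), len(test_idx))  # 30192 5328 8880
```

The test set size of 8,880 agrees with the value of n reported in Tables 1 and 2.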

Table 1: Comparison of cell classification results for different models. Accuracy, precision, and recall are test results (n = 8,880).

Method	Image Size	Embedding Size	Accuracy	Precision	Recall
ResNet50 Pretrained on ImageNet with PCA	$48\times 48\times 9^{*}$	1,152	0.7191	0.7180	0.7215
Standard VAE [9]	$48\times 48\times 9$	9,216	0.8074	0.8147	0.8125
Morphological Features QuPath [4]	-	156	0.8084	0.8136	0.8088
Semi-Supervised Autoencoder [14]	$48\times 48\times 9$	9,216	0.8201	0.8464	0.8257
Proposed Model	$48\times 48\times 9$	1,152	0.8486	0.8654	0.8507

$^{*}$Features extracted from image patches using the pretrained model are used as input in the classification task.

Table 2: Comparison of results for the proportion of latent subspace used in the classification network. Accuracy, precision, and recall are test results (n = 8,880).

Proportion ($z^{\prime}/z$)	Feature Size ($z^{\prime}$)	Accuracy	Precision	Recall
1	9,216	0.8377	0.8571	0.8420
1/2	4,608	0.8383	0.8645	0.8429
1/8	1,152	0.8486	0.8654	0.8507
1/16	576	0.8438	0.8641	0.8462

3.2 Results

To evaluate the performance of our method, we compare the cell phenotype classification accuracy using the feature representations of cell patches extracted using different models (Table 1). Additionally, we present the confusion matrix for six-class classification as well as some qualitative results in the form of cell image reconstructions generated using our model (Fig. 3). For all comparisons we keep the classification network the same, except for the size of the input layer, which corresponds to the size of the feature vector. We train a standard VAE [9] and a semi-supervised autoencoder [14] with the same dataset split and use their latent representations for our evaluation. In addition, the results obtained using a ResNet50 pretrained on the ImageNet dataset are presented as a benchmark experiment. We extract a vector of size 2,048 for each channel using the pretrained ResNet50 ($2,048\times 9$ features per cell patch) and reduce the dimensionality of each channel to 128 using principal component analysis (PCA) ($128\times 9$ features per cell patch). The features obtained from the pretrained ResNet50 with PCA exhibit considerably lower performance, possibly attributable to the substantial dissimilarity in the image domains. To compare the performance of our model to the handcrafted features used in current state-of-the-art methods [1, 2, 3], we extract 156 important cell features based on the morphological and intensity features of the nuclear segmentation masks. These include 6 contour features for nucleus and cell (such as area, circularity, and eccentricity), nucleus/cell ratio, distance to annotations, and 140 intensity features of each biomarker (such as mean, range, and standard deviation) for nucleus, cell, and cytoplasm regions.
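The per-channel PCA reduction used for the ResNet50 benchmark can be sketched with NumPy as follows. Random numbers stand in for the ResNet50 features, and a feature dimension of 256 instead of 2,048 keeps the demonstration fast; the helper names are hypothetical. The sketch fits PCA on the same array it transforms, whereas in an actual experiment the projection would presumably be fitted on the training split only, which the text does not specify.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project rows of `features` onto their top principal components via SVD."""
    centred = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

def reduce_per_channel(channel_features, n_components=128):
    """(n_cells, n_channels, feat_dim) -> (n_cells, n_channels * n_components)."""
    reduced = [
        pca_reduce(channel_features[:, c, :], n_components)
        for c in range(channel_features.shape[1])
    ]
    return np.concatenate(reduced, axis=1)

rng = np.random.default_rng(0)
# Random stand-in for per-channel ResNet50 features of 200 cell patches.
demo = rng.normal(size=(200, 9, 256))
embedding = reduce_per_channel(demo, n_components=128)
print(embedding.shape)  # (200, 1152)
```

Concatenating 128 components from each of the 9 channels gives the embedding size of 1,152 listed for this baseline in Table 1.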

The results of our experiments (Table 1) confirm that the use of labelled data as a supervisory signal to train a latent subspace of the VAE can be useful in retaining cellular-level features that are relevant to the subsequent tasks in cellular-level analysis of the tumour microenvironment. However, it is important to note that the latent subspace approach is more effective if an optimal subspace size is selected (Table 2). The choice of this size depends on the correlation between the labels and the input as well as the quality of the available labels. If the labels are highly correlated with the images, it is recommended to keep the subspace smaller, as less information is required for prediction. We found that the optimal proportion for $z^{\prime}/z$ in this experiment is 1/8.

4 Conclusion

We present a semi-supervised VAE for cell feature extraction in mIF images. We use labels generated by QuPath [4] to reduce the reliance on large, exhaustively annotated datasets of cell images. We propose to use a subspace of the latent space for classification and the full latent space for reconstruction, thereby limiting the model from learning spurious correlations between the input and labels. By comparing the results of our method against the current state-of-the-art work, we show that the proposed model is capable of extracting more robust representations of cells. In future work, we aim to expand our experiments to larger datasets and carry out further experiments to evaluate the importance of cellular structures for patient outcome. We believe our method holds significant potential for advancing research in the analysis of cellular interactions within the TME using mIF images.

5 Acknowledgments

This research was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), supported by the Australian Government.

6 Ethics approval

Ethical approval for this study was provided by the South Eastern Sydney Local Health District Human Research Ethics Committee at Prince of Wales Hospital (2018/ETH00138 and HREC 96/16) who granted a waiver of consent to perform research analyses on the tissue blocks. All methods were performed in accordance with the relevant institutional guidelines and regulations.

References

[1] J. Wang et al., “Multiplexed immunofluorescence identifies high stromal CD68+PD-L1+ macrophages as a predictor of improved survival in triple negative breast cancer,” Scientific Reports, vol. 11, no. 1, pp. 1–12, 2021.
[2] J. Kim et al., “Unsupervised discovery of tissue architecture in multiplexed imaging,” Nature Methods, vol. 19, no. 12, pp. 1653–1661, 2022.
[3] A. Viratham Pulsawatdi et al., “A robust multiplex immunofluorescence and digital pathology workflow for the characterisation of the tumour immune microenvironment,” Molecular Oncology, vol. 14, no. 10, pp. 2384–2402, 2020.
[4] P. Bankhead et al., “QuPath: Open source software for digital pathology image analysis,” Scientific Reports, vol. 7, no. 1, pp. 1–7, 2017.
[5] C. Stringer, T. Wang, M. Michaelos, and M. Pachitariu, “Cellpose: a generalist algorithm for cellular segmentation,” Nature Methods, vol. 18, no. 1, pp. 100–106, 2021.
[6] M. Pachitariu and C. Stringer, “Cellpose 2.0: how to train your own model,” Nature Methods, vol. 19, no. 12, pp. 1634–1641, 2022.
[7] J. H. Levine et al., “Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis,” Cell, vol. 162, no. 1, pp. 184–197, 2015.
[8] S. van Gassen et al., “FlowSOM: Using self-organizing maps for visualization and interpretation of cytometry data,” Cytometry, vol. 87, no. 7, pp. 636–645, 2015.
[9] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” Proceedings of the International Conference on Learning Representations (ICLR), pp. 1–14, 2014.
[10] D. Tellez, G. Litjens, J. Van Der Laak, and F. Ciompi, “Neural image compression for gigapixel histopathology image analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 2, pp. 567–578, 2021.
[11] B. Li, Y. Li, and K. W. Eliceiri, “Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14313–14323, 2021.
[12] H. Zhang et al., “DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18802–18812, 2022.
[13] E. Wulczyn et al., “Interpretable survival prediction for colorectal cancer using deep learning,” npj Digital Medicine, vol. 4, p. 71, 2021.
[14] L. Le, A. Patterson, and M. White, “Supervised autoencoders: Improving generalization performance with unsupervised regularizers,” Advances in Neural Information Processing Systems, vol. 31, pp. 107–117, 2018.
[15] R. Geirhos et al., “Shortcut learning in deep neural networks,” Nature Machine Intelligence, vol. 2, no. 11, pp. 665–673, 2020.
[16] M. Moayeri, P. Pope, Y. Balaji, and S. Feizi, “A comprehensive study of image classification model sensitivity to foregrounds, backgrounds, and visual attributes,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19065–19075, 2022.
[17] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[18] B. Shashni et al., “Size-based differentiation of cancer and normal cells by a particle size analyzer assisted by a cell-recognition PC software,” Biological and Pharmaceutical Bulletin, vol. 41, no. 4, pp. 487–503, 2018.