-
A generalized framework to predict continuous scores from medical ordinal labels
Authors:
Katharina V. Hoebel,
Andreanne Lemay,
John Peter Campbell,
Susan Ostmo,
Michael F. Chiang,
Christopher P. Bridge,
Matthew D. Li,
Praveer Singh,
Aaron S. Coyner,
Jayashree Kalpathy-Cramer
Abstract:
Many variables of interest in clinical medicine, like disease severity, are recorded using discrete ordinal categories such as normal/mild/moderate/severe. These labels are used to train and evaluate disease severity prediction models. However, ordinal categories represent a simplification of an underlying continuous severity spectrum. Using continuous scores instead of ordinal categories is more sensitive to detecting small changes in disease severity over time. Here, we present a generalized framework that accurately predicts continuously valued variables using only discrete ordinal labels during model development. We found that for three clinical prediction tasks, models that take the ordinal relationship of the training labels into account outperformed conventional multi-class classification models. In particular, the continuous scores generated by ordinal classification and regression models showed a significantly higher correlation with expert rankings of disease severity and lower mean squared errors compared to the multi-class classification models. Furthermore, the use of MC dropout significantly improved the ability of all evaluated deep learning approaches to predict continuously valued scores that truthfully reflect the underlying continuous target variable. We showed that accurate continuously valued predictions can be generated even if the model development only involves discrete ordinal labels. The novel framework has been validated on three different clinical prediction tasks and has proven to bridge the gap between discrete ordinal labels and the underlying continuously valued variables.
Submitted 30 May, 2023;
originally announced May 2023.
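The abstract above does not say how a continuous score is read out of each model type. As a rough sketch only, the block below shows two common read-outs: the probability-weighted class index for a multi-class model, and the sum of threshold probabilities for a cumulative ordinal model. The function names and both read-out rules are illustrative assumptions, not the authors' published method.

```python
from math import exp


def softmax(logits):
    """Convert raw class logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    return exp(z) / (1.0 + exp(z))


def expected_class_score(class_logits):
    """Assumed read-out for a multi-class model: sum over k of k * P(class k)."""
    probs = softmax(class_logits)
    return sum(k * p for k, p in enumerate(probs))


def cumulative_ordinal_score(threshold_logits):
    """Assumed read-out for an ordinal model with K-1 outputs P(y > k):
    the sum of those probabilities, which lies between 0 and K-1."""
    return sum(sigmoid(z) for z in threshold_logits)
```

With four classes (normal/mild/moderate/severe mapped to 0..3), both functions return a value between 0 and 3, so a model trained only on the four labels can still place a case between two categories.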
-
Improving the repeatability of deep learning models with Monte Carlo dropout
Authors:
Andreanne Lemay,
Katharina Hoebel,
Christopher P. Bridge,
Brian Befano,
Silvia De Sanjosé,
Didem Egemen,
Ana Cecilia Rodriguez,
Mark Schiffman,
John Peter Campbell,
Jayashree Kalpathy-Cramer
Abstract:
The integration of artificial intelligence into clinical workflows requires reliable and robust models. Repeatability is a key attribute of model robustness. Repeatable models output predictions with low variation during independent tests carried out under similar conditions. During model development and evaluation, much attention is given to classification performance while model repeatability is rarely assessed, leading to the development of models that are unusable in clinical practice. In this work, we evaluate the repeatability of four model types (binary classification, multi-class classification, ordinal classification, and regression) on images that were acquired from the same patient during the same visit. We study the performance of binary, multi-class, ordinal, and regression models on four medical image classification tasks from public and private datasets: knee osteoarthritis, cervical cancer screening, breast density estimation, and retinopathy of prematurity. Repeatability is measured and compared on ResNet and DenseNet architectures. Moreover, we assess the impact of sampling Monte Carlo dropout predictions at test time on classification performance and repeatability. Leveraging Monte Carlo predictions significantly increased repeatability for all tasks on the binary, multi-class, and ordinal models, leading to an average reduction of the 95% limits of agreement by 16% points and of the disagreement rate by 7% points. The classification accuracy improved in most settings along with the repeatability. Our results suggest that beyond about 20 Monte Carlo iterations, there is no further gain in repeatability. In addition to the higher test-retest agreement, Monte Carlo predictions were better calibrated, which leads to output probabilities reflecting more accurately the true likelihood of being correctly classified.
Submitted 15 February, 2022;
originally announced February 2022.
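The test-time sampling described in the abstract above can be illustrated with a toy model. The sketch below keeps dropout active at prediction time and averages 20 stochastic passes, the iteration count the abstract reports as sufficient. The four-weight linear "model", its weights, and the 0.5 dropout probability are hypothetical stand-ins for a real network such as a ResNet or DenseNet.

```python
import random
from statistics import mean, pstdev

# Hypothetical weights of a toy linear model (illustration only).
WEIGHTS = [0.8, -0.5, 0.3, 1.2]


def stochastic_forward(features, rng, drop_prob=0.5):
    """One forward pass with dropout left switched on (inverted-dropout scaling)."""
    keep = 1.0 - drop_prob
    total = 0.0
    for weight, value in zip(WEIGHTS, features):
        if rng.random() < keep:  # this unit survives the dropout mask
            total += weight * value / keep
    return total


def mc_dropout_predict(features, rng, n_iterations=20, drop_prob=0.5):
    """Monte Carlo dropout prediction: the mean of several stochastic passes."""
    return mean(
        stochastic_forward(features, rng, drop_prob) for _ in range(n_iterations)
    )


def prediction_spread(predict_once, n_repeats=500):
    """Standard deviation of repeated predictions for the same input."""
    return pstdev(predict_once() for _ in range(n_repeats))
```

Comparing `prediction_spread` for a single stochastic pass against the averaged prediction shows the effect in miniature: averaging reduces the variation between repeated predictions of the same input.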
-
Monte Carlo dropout increases model repeatability
Authors:
Andreanne Lemay,
Katharina Hoebel,
Christopher P. Bridge,
Didem Egemen,
Ana Cecilia Rodriguez,
Mark Schiffman,
John Peter Campbell,
Jayashree Kalpathy-Cramer
Abstract:
The integration of artificial intelligence into clinical workflows requires reliable and robust models. Among the main features of robustness is repeatability. Much attention is given to classification performance without assessing the model repeatability, leading to the development of models that turn out to be unusable in practice. In this work, we evaluate the repeatability of four model types on images from the same patient that were acquired during the same visit. We study the performance of binary, multi-class, ordinal, and regression models on three medical image analysis tasks: cervical cancer screening, breast density estimation, and retinopathy of prematurity classification. Moreover, we assess the impact of sampling Monte Carlo dropout predictions at test time on classification performance and repeatability. Leveraging Monte Carlo predictions significantly increased repeatability for all tasks on the binary, multi-class, and ordinal models leading to an average reduction of the 95% limits of agreement by 17% points.
Submitted 12 November, 2021;
originally announced November 2021.
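The two repeatability measures named in these abstracts, the 95% limits of agreement and the disagreement rate, can be computed from paired predictions on two images of the same patient. The sketch below assumes the usual Bland-Altman definition (mean difference plus or minus 1.96 standard deviations); the abstracts do not spell out the exact formula or any normalization, so treat this as an illustration.

```python
from statistics import mean, stdev


def limits_of_agreement(first_scores, second_scores):
    """Bland-Altman 95% limits of agreement for paired test-retest scores.

    Returns (lower, upper) = mean difference -/+ 1.96 * sample standard deviation.
    Needs at least two pairs.
    """
    diffs = [a - b for a, b in zip(first_scores, second_scores)]
    centre = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return centre - half_width, centre + half_width


def disagreement_rate(first_labels, second_labels):
    """Fraction of test-retest pairs whose predicted classes differ."""
    pairs = list(zip(first_labels, second_labels))
    return sum(a != b for a, b in pairs) / len(pairs)
```

Narrower limits and a lower disagreement rate both indicate a more repeatable model.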
-
Not Color Blind: AI Predicts Racial Identity from Black and White Retinal Vessel Segmentations
Authors:
Aaron S. Coyner,
Praveer Singh,
James M. Brown,
Susan Ostmo,
R. V. Paul Chan,
Michael F. Chiang,
Jayashree Kalpathy-Cramer,
J. Peter Campbell
Abstract:
Background: Artificial intelligence (AI) may demonstrate racial bias when skin or choroidal pigmentation is present in medical images. Recent studies have shown that convolutional neural networks (CNNs) can predict race from images that were not previously thought to contain race-specific features. We evaluate whether grayscale retinal vessel maps (RVMs) of patients screened for retinopathy of prematurity (ROP) contain race-specific features.
Methods: 4095 retinal fundus images (RFIs) were collected from 245 Black and White infants. A U-Net generated RVMs from RFIs, which were subsequently thresholded, binarized, or skeletonized. To determine whether RVM differences between Black and White eyes were physiological, CNNs were trained to predict race from color RFIs, raw RVMs, and thresholded, binarized, or skeletonized RVMs. Area under the precision-recall curve (AUC-PR) was evaluated.
Findings: CNNs predicted race from RFIs near perfectly (image-level AUC-PR: 0.999, subject-level AUC-PR: 1.000). Raw RVMs were almost as informative as color RFIs (image-level AUC-PR: 0.938, subject-level AUC-PR: 0.995). Ultimately, CNNs were able to detect whether RFIs or RVMs were from Black or White babies, regardless of whether images contained color, vessel segmentation brightness differences were nullified, or vessel segmentation widths were normalized.
Interpretation: AI can detect race from grayscale RVMs that were not thought to contain racial information. Two potential explanations for these findings are that: retinal vessels physiologically differ between Black and White babies or the U-Net segments the retinal vasculature differently for various fundus pigmentations. Either way, the implications remain the same: AI algorithms have potential to demonstrate racial bias in practice, even when preliminary attempts to remove such information from the underlying images appear to be successful.
Submitted 28 September, 2021;
originally announced September 2021.
-
CaRENets: Compact and Resource-Efficient CNN for Homomorphic Inference on Encrypted Medical Images
Authors:
** Chao,
Ahmad Al Badawi,
Balagopal Unnikrishnan,
Jie Lin,
Chan Fook Mun,
James M. Brown,
J. Peter Campbell,
Michael Chiang,
Jayashree Kalpathy-Cramer,
Vijay Ramaseshan Chandrasekhar,
Pavitra Krishnaswamy,
Khin Mi Mi Aung
Abstract:
Convolutional neural networks (CNNs) have enabled significant performance leaps in medical image classification tasks. However, translating neural network models for clinical applications remains challenging due to data privacy issues. Fully Homomorphic Encryption (FHE) has the potential to address this challenge as it enables the use of CNNs on encrypted images. However, current HE technology poses immense computational and memory overheads, particularly for high-resolution images such as those seen in the clinical context. We present CaRENets: Compact and Resource-Efficient CNNs for high performance and resource-efficient inference on high-resolution encrypted images in practical applications. At the core, CaRENets comprises a new FHE compact packing scheme that is tightly integrated with CNN functions. CaRENets offers dual advantages of memory efficiency (due to compact packing of images and CNN activations) and inference speed (due to the reduction in the number of ciphertexts created and the associated mathematical operations) over standard interleaved packing schemes. We apply CaRENets to perform homomorphic abnormality detection with 80-bit security level in two clinical conditions - Retinopathy of Prematurity (ROP) and Diabetic Retinopathy (DR). The ROP dataset comprises 96 x 96 grayscale images, while the DR dataset comprises 256 x 256 RGB images. We demonstrate over 45x improvement in memory efficiency and 4-5x speedup in inference over the interleaved packing schemes. As our approach enables memory-efficient low-latency HE inference without imposing additional communication burden, it has implications for practical and secure deep learning inference in clinical imaging.
Submitted 28 January, 2019;
originally announced January 2019.
-
Accelerated Experimental Design for Pairwise Comparisons
Authors:
Yuan Guo,
Jennifer Dy,
Deniz Erdogmus,
Jayashree Kalpathy-Cramer,
Susan Ostmo,
J. Peter Campbell,
Michael F. Chiang,
Stratis Ioannidis
Abstract:
Pairwise comparison labels are more informative and less variable than class labels, but generating them poses a challenge: their number grows quadratically in the dataset size. We study a natural experimental design objective, namely, D-optimality, that can be used to identify which $K$ pairwise comparisons to generate. This objective is known to perform well in practice, and is submodular, making the selection approximable via the greedy algorithm. A naïve greedy implementation has $O(N^2d^2K)$ complexity, where $N$ is the dataset size, $d$ is the feature space dimension, and $K$ is the number of generated comparisons. We show that, by exploiting the inherent geometry of the dataset--namely, that it consists of pairwise comparisons--the greedy algorithm's complexity can be reduced to $O(N^2(K+d)+N(dK+d^2)+d^2K)$. We apply the same acceleration also to the so-called lazy greedy algorithm. When combined, the above improvements lead to an execution time of less than 1 hour for a dataset with $10^8$ comparisons; the naïve greedy algorithm on the same dataset would require more than 10 days to terminate.
Submitted 17 January, 2019;
originally announced January 2019.
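For orientation, the naïve greedy baseline mentioned in the abstract above might look like the sketch below. It is not the accelerated algorithm the paper contributes. It assumes each comparison (i, j) is represented by the feature difference x_i - x_j and that the D-optimality objective is the log-determinant of a ridge-regularized information matrix; the `ridge` parameter and all function names are illustrative assumptions.

```python
from itertools import combinations
from math import log


def mat_vec(matrix, vector):
    """Multiply a d x d matrix (list of rows) by a length-d vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def greedy_d_optimal(features, k, ridge=1.0):
    """Naive greedy selection of k pairwise comparisons.

    Objective (assumed): log det(ridge * I + sum of x x^T over chosen pairs),
    where x = features[i] - features[j]. By the matrix determinant lemma the
    gain of adding x is log(1 + x^T A^{-1} x).
    """
    d = len(features[0])
    # Inverse of the initial matrix A = ridge * I.
    a_inv = [[(1.0 / ridge if r == c else 0.0) for c in range(d)] for r in range(d)]
    candidates = {
        (i, j): [a - b for a, b in zip(features[i], features[j])]
        for i, j in combinations(range(len(features)), 2)
    }
    chosen = []
    for _ in range(min(k, len(candidates))):
        best_pair, best_gain, best_av = None, -1.0, None
        for pair, x in candidates.items():  # scan all O(N^2) remaining pairs
            av = mat_vec(a_inv, x)  # O(d^2) per candidate
            gain = log(1.0 + sum(xi * ai for xi, ai in zip(x, av)))
            if gain > best_gain:
                best_pair, best_gain, best_av = pair, gain, av
        x = candidates.pop(best_pair)
        denom = 1.0 + sum(xi * ai for xi, ai in zip(x, best_av))
        # Sherman-Morrison rank-one update of A^{-1} after adding x x^T.
        for r in range(d):
            for c in range(d):
                a_inv[r][c] -= best_av[r] * best_av[c] / denom
        chosen.append(best_pair)
    return chosen
```

Each of the K rounds scans all remaining pairs and spends O(d^2) per pair, which is where the $O(N^2d^2K)$ cost quoted in the abstract comes from.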
-
Deep feature transfer between localization and segmentation tasks
Authors:
Szu-Yeu Hu,
Andrew Beers,
Ken Chang,
Kathi Höbel,
J. Peter Campbell,
Deniz Erdogmus,
Stratis Ioannidis,
Jennifer Dy,
Michael F. Chiang,
Jayashree Kalpathy-Cramer,
James M. Brown
Abstract:
In this paper, we propose a new pre-training scheme for U-net based image segmentation. We first train the encoding arm as a localization network to predict the center of the target, before extending it into a U-net architecture for segmentation. We apply our proposed method to the problem of segmenting the optic disc from fundus photographs. Our work shows that the features learned by the encoding arm can be transferred to the segmentation network to reduce the annotation burden. We propose that such an approach could have broad utility for medical image segmentation, and alleviate the burden of delineating complex structures by pre-training on annotations that are much easier to acquire.
Submitted 10 November, 2018; v1 submitted 6 November, 2018;
originally announced November 2018.
-
High-resolution medical image synthesis using progressively grown generative adversarial networks
Authors:
Andrew Beers,
James Brown,
Ken Chang,
J. Peter Campbell,
Susan Ostmo,
Michael F. Chiang,
Jayashree Kalpathy-Cramer
Abstract:
Generative adversarial networks (GANs) are a class of unsupervised machine learning algorithms that can produce realistic images from randomly-sampled vectors in a multi-dimensional space. Until recently, it was not possible to generate realistic high-resolution images using GANs, which has limited their applicability to medical images that contain biomarkers only detectable at native resolution. Progressive growing of GANs is an approach wherein an image generator is trained to initially synthesize low-resolution synthetic images (8x8 pixels), which are then fed to a discriminator that distinguishes these synthetic images from real downsampled images. Additional convolutional layers are then iteratively introduced to produce images at twice the previous resolution until the desired resolution is reached. In this work, we demonstrate that this approach can produce realistic medical images in two different domains: fundus photographs exhibiting vascular pathology associated with retinopathy of prematurity (ROP), and multi-modal magnetic resonance images of glioma. We also show that fine-grained details associated with pathology, such as retinal vessels or tumor heterogeneity, can be preserved and enhanced by including segmentation maps as additional channels. We envisage several applications of the approach, including image augmentation and unsupervised classification of pathology.
Submitted 9 May, 2018; v1 submitted 8 May, 2018;
originally announced May 2018.
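The growth schedule described in the abstract above (start at 8x8 pixels, double until the desired resolution is reached) can be written down in a few lines. This sketch covers only the sequence of training resolutions, not the GAN itself. The default target of 512 is a hypothetical value, since the abstract does not state the final resolution.

```python
def resolution_schedule(start=8, target=512):
    """List the square image sizes visited when the resolution doubles each stage.

    `target` must equal `start` times a power of two; 512 is an assumed example.
    """
    if start <= 0 or target < start:
        raise ValueError("need 0 < start <= target")
    sizes = [start]
    while sizes[-1] < target:
        sizes.append(sizes[-1] * 2)  # each new stage doubles the resolution
    if sizes[-1] != target:
        raise ValueError("target must be start times a power of two")
    return sizes
```

Under this schedule, a generator starting at 8x8 needs six growth stages to reach 512x512.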
-
Making Sense of Unstructured Text Data
Authors:
Lin Li,
William M. Campbell,
Cagri Dagli,
Joseph P. Campbell
Abstract:
Many network analysis tasks in social sciences rely on pre-existing data sources that were created with explicit relations or interactions between entities under consideration. Examples include email logs, friends and followers networks on social media, communication networks, etc. In these data, it is relatively easy to identify who is connected to whom and how they are connected. However, most of the data that we encounter on a daily basis are unstructured free-text data, e.g., forums, online marketplaces, etc. It is considerably more difficult to extract network data from unstructured text. In this work, we present an end-to-end system for analyzing unstructured text data and transforming the data into structured graphs that are directly applicable to a downstream application. Specifically, we look at social media data and attempt to predict the most indicative words from users' posts. The resulting keywords can be used to construct a context+content network for downstream processing such as graph-based analysis and learning. With that goal in mind, we apply our methods to the application of cross-domain entity resolution. The performance of the resulting system with automatic keywords shows improvement over the system with user-annotated hashtags.
Submitted 18 April, 2017;
originally announced April 2017.
-
Cross-Domain Entity Resolution in Social Media
Authors:
W. M. Campbell,
Lin Li,
C. Dagli,
J. Acevedo-Aviles,
K. Geyer,
J. P. Campbell,
C. Priebe
Abstract:
The challenge of associating entities across multiple domains is a key problem in social media understanding. Successful cross-domain entity resolution provides integration of information from multiple sites to create a complete picture of user and community activities, characteristics, and trends. In this work, we examine the problem of entity resolution across Twitter and Instagram using general techniques. Our methods fall into three categories: profile, content, and graph based. For the profile-based methods, we consider techniques based on approximate string matching. For content-based methods, we perform author identification. Finally, for graph-based methods, we apply novel cross-domain community detection methods and generate neighborhood-based features. The three categories of methods are applied to a large graph of users in Twitter and Instagram to understand challenges, determine performance, and understand fusion of multiple methods. Final results demonstrate an equal error rate less than 1%.
Submitted 3 August, 2016;
originally announced August 2016.
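The profile-based methods in the abstract above rely on approximate string matching between account names. The abstract does not name the similarity measure, so the sketch below substitutes the ratio from Python's standard `difflib.SequenceMatcher`. The function names, the example handles, and the 0.8 acceptance threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(first, second):
    """Case-insensitive similarity in [0, 1] between two profile names."""
    return SequenceMatcher(None, first.lower(), second.lower()).ratio()


def best_match(query, candidates, threshold=0.8):
    """Return the candidate most similar to `query`, or None if none is close enough."""
    if not candidates:
        return None
    score, name = max((name_similarity(query, c), c) for c in candidates)
    return name if score >= threshold else None
```

For example, `best_match("jane.doe", ["bob_smith", "jane_doe", "j_doe99"])` links the handle to `"jane_doe"`, while a query with no close candidate returns `None` instead of forcing a match.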