-
Development and Validation of a Deep Learning-Based Microsatellite Instability Predictor from Prostate Cancer Whole-Slide Images
Authors:
Qiyuan Hu,
Abbas A. Rizvi,
Geoffery Schau,
Kshitij Ingale,
Yoni Muller,
Rachel Baits,
Sebastian Pretzer,
Aïcha BenTaieb,
Abigail Gordhamer,
Roberto Nussenzveig,
Adam Cole,
Matthew O. Leavitt,
Rohan P. Joshi,
Nike Beaubier,
Martin C. Stumpe,
Kunal Nagpal
Abstract:
Microsatellite instability-high (MSI-H) is a tumor agnostic biomarker for immune checkpoint inhibitor therapy. However, MSI status is not routinely tested in prostate cancer, in part due to low prevalence and assay cost. As such, prediction of MSI status from hematoxylin and eosin (H&E) stained whole-slide images (WSIs) could identify prostate cancer patients most likely to benefit from confirmatory testing and becoming eligible for immunotherapy. Prostate biopsies and surgical resections from de-identified records of consecutive prostate cancer patients referred to our institution were analyzed. Their MSI status was determined by next generation sequencing. Patients before a cutoff date were split into an algorithm development set (n=4015, MSI-H 1.8%) and a paired validation set (n=173, MSI-H 19.7%) that consisted of two serial sections from each sample, one stained and scanned internally and the other at an external site. Patients after the cutoff date formed the temporal validation set (n=1350, MSI-H 2.3%). Attention-based multiple instance learning models were trained to predict MSI-H from H&E WSIs. The MSI-H predictor achieved area under the receiver operating characteristic curve values of 0.78 (95% CI [0.69-0.86]), 0.72 (95% CI [0.63-0.81]), and 0.72 (95% CI [0.62-0.82]) on the internally prepared, externally prepared, and temporal validation sets, respectively. While MSI-H status is significantly correlated with Gleason score, the model remained predictive within each Gleason score subgroup. In summary, we developed and validated an AI-based MSI-H diagnostic model on a large real-world cohort of routine H&E slides, which effectively generalized to externally stained and scanned samples and a temporally independent validation cohort. This algorithm has the potential to direct prostate cancer patients toward immunotherapy and to identify MSI-H cases secondary to Lynch syndrome.
Submitted 12 October, 2023;
originally announced October 2023.
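The abstract above names attention-based multiple instance learning but gives no architectural details. The following is a minimal sketch of one common form of attention pooling (a tanh scoring layer followed by a softmax over tiles), not the authors' model. The function name `attention_mil_pool`, the weight shapes, and the toy inputs are all assumptions made for illustration.

```python
import math

def attention_mil_pool(instances, V, w):
    """Pool tile-level feature vectors into one slide-level vector.

    instances: list of feature vectors, one per tile (hypothetical inputs)
    V: hidden-by-dim weight matrix of the attention scoring layer (assumed)
    w: hidden-length weight vector producing one score per tile (assumed)
    Returns (bag_vector, attention_weights).
    """
    scores = []
    for h in instances:
        # Hidden activation tanh(V h), then a scalar score w . tanh(V h).
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(a * b for a, b in zip(w, hidden)))

    # Numerically stable softmax over tiles gives the attention weights.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The slide-level embedding is the attention-weighted sum of tile features.
    dim = len(instances[0])
    bag = [sum(a * h[j] for a, h in zip(weights, instances)) for j in range(dim)]
    return bag, weights
```

In a full pipeline the pooled vector would feed a classifier head that outputs an MSI-H score, and the attention weights indicate which tiles contributed most to that score.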
-
My Science Tutor (MyST) -- A Large Corpus of Children's Conversational Speech
Authors:
Sameer S. Pradhan,
Ronald A. Cole,
Wayne H. Ward
Abstract:
This article describes the MyST corpus developed as part of the My Science Tutor project -- one of the largest collections of children's conversational speech comprising approximately 400 hours, spanning some 230K utterances across about 10.5K virtual tutor sessions by around 1.3K third, fourth and fifth grade students. Of these utterances, 100K have been transcribed thus far. The corpus is freely available (https://myst.cemantix.org) for non-commercial use under a Creative Commons license. It is also available for commercial use (https://boulderlearning.com/resources/myst-corpus/). To date, ten organizations have licensed the corpus for commercial use, and approximately 40 university and other not-for-profit research groups have downloaded the corpus. It is our hope that the corpus can be used to improve automatic speech recognition algorithms, build and evaluate conversational AI agents for education, and together help accelerate development of multimodal applications to improve children's excitement and learning about science, and help them learn remotely.
Submitted 23 September, 2023;
originally announced September 2023.
-
Fast and Credible Likelihood-Free Cosmology with Truncated Marginal Neural Ratio Estimation
Authors:
Alex Cole,
Benjamin Kurt Miller,
Samuel J. Witte,
Maxwell X. Cai,
Meiert W. Grootes,
Francesco Nattino,
Christoph Weniger
Abstract:
Sampling-based inference techniques are central to modern cosmological data analysis; these methods, however, scale poorly with dimensionality and typically require approximate or intractable likelihoods. In this paper we describe how Truncated Marginal Neural Ratio Estimation (TMNRE) (a new approach in so-called simulation-based inference) naturally evades these issues, improving the $(i)$ efficiency, $(ii)$ scalability, and $(iii)$ trustworthiness of the inferred posteriors. Using measurements of the Cosmic Microwave Background (CMB), we show that TMNRE can achieve converged posteriors using orders of magnitude fewer simulator calls than conventional Markov Chain Monte Carlo (MCMC) methods. Remarkably, the required number of samples is effectively independent of the number of nuisance parameters. In addition, a property called \emph{local amortization} allows the performance of rigorous statistical consistency checks that are not accessible to sampling-based methods. TMNRE promises to become a powerful tool for cosmological data analysis, particularly in the context of extended cosmologies, where the timescale required for conventional sampling-based inference methods to converge can greatly exceed that of simple cosmological models such as $Λ$CDM. To perform these computations, we use an implementation of TMNRE via the open-source code \texttt{swyft}.
Submitted 8 November, 2022; v1 submitted 15 November, 2021;
originally announced November 2021.
-
Truncated Marginal Neural Ratio Estimation
Authors:
Benjamin Kurt Miller,
Alex Cole,
Patrick Forré,
Gilles Louppe,
Christoph Weniger
Abstract:
Parametric stochastic simulators are ubiquitous in science, often featuring high-dimensional input parameters and/or an intractable likelihood. Performing Bayesian parameter inference in this context can be challenging. We present a neural simulation-based inference algorithm which simultaneously offers simulation efficiency and fast empirical posterior testability, which is unique among modern algorithms. Our approach is simulation efficient by simultaneously estimating low-dimensional marginal posteriors instead of the joint posterior and by proposing simulations targeted to an observation of interest via a prior suitably truncated by an indicator function. Furthermore, by estimating a locally amortized posterior our algorithm enables efficient empirical tests of the robustness of the inference results. Since scientists cannot access the ground truth, these tests are necessary for trusting inference in real-world applications. We perform experiments on a marginalized version of the simulation-based inference benchmark and two complex and narrow posteriors, highlighting the simulator efficiency of our algorithm as well as the quality of the estimated marginal posteriors.
Submitted 26 October, 2021; v1 submitted 2 July, 2021;
originally announced July 2021.
-
Topological Echoes of Primordial Physics in the Universe at Large Scales
Authors:
Alex Cole,
Matteo Biagetti,
Gary Shiu
Abstract:
We present a pipeline for characterizing and constraining initial conditions in cosmology via persistent homology. The cosmological observable of interest is the cosmic web of large scale structure, and the initial conditions in question are non-Gaussianities (NG) of primordial density perturbations. We compute persistence diagrams and derived statistics for simulations of dark matter halos with Gaussian and non-Gaussian initial conditions. For computational reasons and to make contact with experimental observations, our pipeline computes persistence in sub-boxes of full simulations and simulations are subsampled to uniform halo number. We use simulations with large NG ($f_{\rm NL}^{\rm loc}=250$) as templates for identifying data with mild NG ($f_{\rm NL}^{\rm loc}=10$), and running the pipeline on several cubic volumes of size $40~(\textrm{Gpc/h})^{3}$, we detect $f_{\rm NL}^{\rm loc}=10$ at $97.5\%$ confidence on $\sim 85\%$ of the volumes for our best single statistic. Throughout we benefit from the interpretability of topological features as input for statistical inference, which allows us to make contact with previous first-principles calculations and make new predictions.
Submitted 7 December, 2020;
originally announced December 2020.
-
Interpretable Phase Detection and Classification with Persistent Homology
Authors:
Alex Cole,
Gregory J. Loges,
Gary Shiu
Abstract:
We apply persistent homology to the task of discovering and characterizing phase transitions, using lattice spin models from statistical physics for working examples. Persistence images provide a useful representation of the homological data for conducting statistical tasks. To identify the phase transitions, a simple logistic regression on these images is sufficient for the models we consider, and interpretable order parameters are then read from the weights of the regression. Magnetization, frustration and vortex-antivortex structure are identified as relevant features for characterizing phase transitions.
Submitted 1 December, 2020;
originally announced December 2020.
-
Simulation-efficient marginal posterior estimation with swyft: stop wasting your precious time
Authors:
Benjamin Kurt Miller,
Alex Cole,
Gilles Louppe,
Christoph Weniger
Abstract:
We present algorithms (a) for nested neural likelihood-to-evidence ratio estimation, and (b) for simulation reuse via an inhomogeneous Poisson point process cache of parameters and corresponding simulations. Together, these algorithms enable automatic and extremely simulator-efficient estimation of marginal and joint posteriors. The algorithms are applicable to a wide range of physics and astronomy problems and typically offer an order of magnitude better simulator efficiency than traditional likelihood-based sampling methods. Our approach is an example of likelihood-free inference, so it is also applicable to simulators that do not offer a tractable likelihood function. Simulator runs are never rejected and can be automatically reused in future analyses. As a functional prototype implementation, we provide the open-source software package swyft.
Submitted 27 November, 2020;
originally announced November 2020.
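The abstract above refers to likelihood-to-evidence ratio estimation without stating the underlying identity. As a brief sketch, assuming the usual classifier-based formulation with balanced classes (a classifier $d$ trained to distinguish jointly drawn pairs $(x,\theta)$ from pairs drawn from the product of marginals), the optimal classifier, the ratio, and the posterior are related as follows. The symbols $d^{*}$ and $r$ are notation chosen here and do not come from the abstract, and the nested and marginal aspects of the method are not covered.

```latex
d^{*}(x,\theta) = \frac{p(x,\theta)}{p(x,\theta) + p(x)\,p(\theta)},
\qquad
r(x,\theta) \equiv \frac{p(x \mid \theta)}{p(x)}
            = \frac{d^{*}(x,\theta)}{1 - d^{*}(x,\theta)},
\qquad
p(\theta \mid x) = r(x,\theta)\,p(\theta).
```

Because only samples of $(x,\theta)$ are needed to train $d$, the simulator's likelihood never has to be evaluated, which is why such methods apply to simulators without a tractable likelihood.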
-
A Preliminary Investigation in the Molecular Basis of Host Shutoff Mechanism in SARS-CoV
Authors:
Niharika Pandala,
Casey A. Cole,
Devaun McFarland,
Anita Nag,
Homayoun Valafar
Abstract:
Recent events leading to the worldwide pandemic of COVID-19 have demonstrated the effective use of genomic sequencing technologies to establish the genetic sequence of this virus. In contrast, the COVID-19 pandemic has demonstrated the absence of computational approaches to understand the molecular basis of this infection rapidly. Here we present an integrated approach to the study of the nsp1 protein in SARS-CoV-1, which plays an essential role in maintaining the expression of viral proteins and further disabling the host protein expression, also known as the host shutoff mechanism. We present three independent methods of evaluating two potential binding sites speculated to participate in host shutoff by nsp1. We have combined results from computed models of nsp1, with deep mining of all existing protein structures (using PDBMine), and binding site recognition (using msTALI) to examine the two sites consisting of residues 55-59 and 73-80. Based on our preliminary results, we conclude that the residues 73-80 appear as the regions that facilitate the critical initial steps in the function of nsp1. Given the 90% sequence identity between nsp1 from SARS-CoV-1 and SARS-CoV-2, we conjecture the same critical initiation step in the function of COVID-19 nsp1.
Submitted 23 July, 2020;
originally announced July 2020.
-
Recognition of Smoking Gesture Using Smart Watch Technology
Authors:
Casey A. Cole,
Bethany Janos,
Dien Anshari,
James F. Thrasher,
Scott Strayer,
Homayoun Valafar
Abstract:
Diseases resulting from prolonged smoking are the most common preventable causes of death in the world today. In this report we investigate the success of utilizing accelerometer sensors in smart watches to identify smoking gestures. Early identification of smoking gestures can help to initiate the appropriate intervention method and prevent relapses in smoking. Our experiments indicate 85%-95% success rates in identifying the smoking gesture among other similar gestures using Artificial Neural Networks (ANNs). Our investigations concluded that information obtained from the x-dimension of accelerometers is the best means of identifying the smoking gesture, while the y and z dimensions are helpful in eliminating other gestures such as eating, drinking, and scratching the nose. We utilized sensor data from the Apple Watch during the training of the ANN. Using sensor data from another participant collected on a Pebble Steel, we obtained a smoking identification accuracy of greater than 90% when using an ANN trained on data previously collected from the Apple Watch. Finally, we have demonstrated the possibility of using smart watches to perform continuous monitoring of daily activities.
Submitted 5 March, 2020;
originally announced March 2020.
-
State Transition Modeling of the Smoking Behavior using LSTM Recurrent Neural Networks
Authors:
Chrisogonas O. Odhiambo,
Casey A. Cole,
Alaleh Torkjazi,
Homayoun Valafar
Abstract:
The use of sensors has pervaded everyday life in several applications including human activity monitoring, healthcare, and social networks. In this study, we focus on the use of smartwatch sensors to recognize smoking activity. More specifically, we have reformulated the previous work in detection of smoking to include in-context recognition of smoking. Our reformulation of the smoking gesture as a state-transition model that consists of the mini-gestures hand-to-lip, hand-on-lip, and hand-off-lip has demonstrated improvement in detection rates nearing 100% using conventional neural networks. In addition, we have begun the utilization of Long Short-Term Memory (LSTM) neural networks to allow for in-context detection of gestures with accuracy nearing 97%.
Submitted 7 January, 2020;
originally announced January 2020.
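The abstract above describes the smoking gesture as a state-transition model over three mini-gestures. One way to picture this is a small state machine that counts a puff only when the mini-gestures occur in order. This is an illustrative sketch, not the authors' model: the function `count_puffs`, the rule that a repeated label keeps the current state, and the reset on any other label are assumptions, and the per-window labels are taken as given (in practice they would come from a classifier).

```python
# Ordered mini-gestures named in the abstract.
PUFF_SEQUENCE = ("hand-to-lip", "hand-on-lip", "hand-off-lip")

def count_puffs(labels):
    """Count completed hand-to-lip -> hand-on-lip -> hand-off-lip sequences.

    labels: per-window mini-gesture labels (hypothetical classifier output).
    """
    state = 0   # index of the next mini-gesture expected
    puffs = 0
    for label in labels:
        if label == PUFF_SEQUENCE[state]:
            state += 1
            if state == len(PUFF_SEQUENCE):
                puffs += 1      # full sequence observed
                state = 0
        elif label == PUFF_SEQUENCE[0]:
            state = 1           # a new gesture starts over
        elif state > 0 and label == PUFF_SEQUENCE[state - 1]:
            continue            # same mini-gesture spans several windows
        else:
            state = 0           # unrelated activity breaks the sequence
    return puffs
```

Requiring the ordered sequence is what makes the recognition in-context: an isolated hand-on-lip window, for example while scratching the nose, does not count unless it is preceded and followed by the expected transitions.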
-
PDBMine: A Reformulation of the Protein Data Bank to Facilitate Structural Data Mining
Authors:
Casey A Cole,
Christopher Ott,
Diego Valdes,
Homayoun Valafar
Abstract:
Large-scale initiatives such as the Human Genome Project, Structural Genomics, and individual research teams have provided large deposits of genomic and proteomic data. The transfer of data to knowledge has become one of the existing challenges, which is a consequence of capturing data in databases that are optimally designed for archiving and not mining. In this research, we have targeted the Protein Data Bank (PDB) and demonstrated a transformation of its content, named PDBMine, that reduces storage space by an order of magnitude and allows for powerful mining in relation to the topic of protein structure determination. We have demonstrated the utility of PDBMine in exploring the prevalence of dimeric and trimeric amino acid sequences and provided a mechanism of predicting protein structure.
Submitted 19 November, 2019;
originally announced November 2019.
-
Improvements of the REDCRAFT Software Package
Authors:
Casey A Cole,
Caleb Parks,
Julian Rachele,
Homayoun Valafar
Abstract:
Traditional approaches to elucidation of protein structures by NMR spectroscopy rely on distance restraints also known as nuclear Overhauser effects (NOEs). The use of NOEs as the primary source of structure determination by NMR spectroscopy is time consuming and expensive. Residual Dipolar Couplings (RDCs) have become an alternate approach for structure calculation by NMR spectroscopy. In previous works, the software package REDCRAFT has been presented as a means of harnessing the information contained in RDCs for structure calculation of proteins. In this work, we present significant improvements to the REDCRAFT package including: refinement of the decimation procedure, the inclusion of a graphical user interface, adoption of NEF standards, and addition of scripts for enhanced protein modeling options. The improvements to REDCRAFT have resulted in the ability to fold proteins that the previous versions were unable to fold. For instance, we report the results of folding of the protein 1A1Z in the presence of highly erroneous data.
Submitted 19 November, 2019;
originally announced November 2019.
-
Searching the Landscape of Flux Vacua with Genetic Algorithms
Authors:
Alex Cole,
Andreas Schachner,
Gary Shiu
Abstract:
In this paper, we employ genetic algorithms to explore the landscape of type IIB flux vacua. We show that genetic algorithms can efficiently scan the landscape for viable solutions satisfying various criteria. More specifically, we consider a symmetric $T^{6}$ as well as the conifold region of a Calabi-Yau hypersurface. We argue that in both cases genetic algorithms are powerful tools for finding flux vacua with interesting phenomenological properties. We also compare genetic algorithms to algorithms based on different breeding mechanisms as well as random walk approaches.
Submitted 26 November, 2019; v1 submitted 23 July, 2019;
originally announced July 2019.
-
A General Method for Finding Low Error Rates of LDPC Codes
Authors:
Chad A. Cole,
Stephen G. Wilson,
Eric K. Hall,
Thomas R. Giallorenzi
Abstract:
This paper outlines a three-step procedure for determining the low bit error rate performance curve of a wide class of LDPC codes of moderate length. The traditional method to estimate code performance in the higher SNR region is to use a sum of the contributions of the most dominant error events to the probability of error. These dominant error events will be both code and decoder dependent, consisting of low-weight codewords as well as non-codeword events if ML decoding is not used. For even moderate length codes, it is not feasible to find all of these dominant error events with a brute force search. The proposed method provides a convenient way to evaluate very low bit error rate performance of an LDPC code without requiring knowledge of the complete error event weight spectrum or resorting to a Monte Carlo simulation. This new method can be applied to various types of decoding such as the full belief propagation version of the message passing algorithm or the commonly used min-sum approximation to belief propagation. The proposed method allows one to efficiently see error performance at bit error rates that were previously out of reach of Monte Carlo methods. This result will provide a solid foundation for the analysis and design of LDPC codes and decoders that are required to provide a guaranteed very low bit error rate performance at certain SNRs.
Submitted 11 May, 2006;
originally announced May 2006.
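The abstract above mentions the min-sum approximation to belief propagation as one of the decoders the method covers. For orientation, here is a minimal sketch of the standard min-sum check-node update on log-likelihood ratios (LLRs); it illustrates the decoder step only and is not the paper's three-step procedure. The function name `min_sum_check_update` and the example values are assumptions.

```python
def min_sum_check_update(incoming):
    """Min-sum check-node update for one parity check.

    incoming: LLR messages from the variable nodes attached to this check
              (at least two). Returns one outgoing message per edge, each
              computed from all the OTHER incoming messages.
    """
    outgoing = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]
        # Sign is the product of the signs of the other messages.
        sign = 1.0
        for m in others:
            if m < 0:
                sign = -sign
        # Magnitude is the smallest reliability among the other messages;
        # full belief propagation would use a tanh-based combination here.
        outgoing.append(sign * min(abs(m) for m in others))
    return outgoing

print(min_sum_check_update([2.0, -0.5, 3.0]))  # [-0.5, 2.0, -0.5]
```

The least reliable incoming message dominates every other edge's output, which is why min-sum is cheap to compute yet differs from full belief propagation in the error events it produces.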