-
Multilevel Regression and Poststratification Interface: Application to Track Community-level COVID-19 Viral Transmission
Authors:
Yajuan Si,
Toan Tran,
Jonah Gabry,
Mitzi Morris,
Andrew Gelman
Abstract:
In the absence of comprehensive or random testing throughout the COVID-19 pandemic, we have developed a proxy method for synthetic random sampling to estimate the actual viral incidence in the community, based on viral RNA testing of asymptomatic patients who present for elective procedures within a hospital system. The approach collects routine testing data on SARS-CoV-2 exposure among outpatients and performs statistical adjustments of sample representation using multilevel regression and poststratification (MRP). MRP adjusts for selection bias and yields stable small area estimates. We have developed an open-source, user-friendly MRP interface for public implementation of the statistical workflow. We illustrate the MRP interface with an application to track community-level COVID-19 viral transmission in the state of Michigan.
Submitted 9 May, 2024;
originally announced May 2024.
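The poststratification step named in this abstract can be illustrated with a minimal sketch. It assumes the multilevel model has already produced one estimate per demographic cell; the function name `poststratify`, the cell labels, and all numbers are hypothetical and are not taken from the paper or its interface.

```python
def poststratify(cell_estimates, cell_counts):
    """Combine per-cell estimates into one population-level estimate.

    cell_estimates: dict mapping cell label -> model-based estimate
                    (for example, a predicted positivity rate).
    cell_counts:    dict mapping cell label -> population count for that
                    cell (for example, from census data).
    Both dictionaries and their keys are illustrative assumptions.
    """
    total = sum(cell_counts.values())
    if total <= 0:
        raise ValueError("population counts must sum to a positive number")
    # Weight each cell's estimate by its share of the target population.
    return sum(cell_estimates[cell] * count
               for cell, count in cell_counts.items()) / total


# Hypothetical example: two cells with different population sizes.
estimates = {"age_18_44": 0.02, "age_45_plus": 0.05}
counts = {"age_18_44": 300, "age_45_plus": 100}
overall = poststratify(estimates, counts)
```

Weighting by population counts rather than by sample counts is what lets the estimate correct for a sample whose composition differs from the community's.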
-
Self-supervised Multidimensional Scaling with $F$-ratio: Improving Microbiome Visualization
Authors:
Hyungseok Kim,
Soobin Kim,
Megan M. Morris,
Jeffrey A. Kimbrel,
Xavier Mayali,
Cullen R. Buie
Abstract:
Multidimensional scaling (MDS) is an unsupervised learning technique that preserves pairwise distances between observations and is commonly used for analyzing multivariate biological datasets. Recent advances in MDS have achieved successful classification results, but the configurations heavily depend on the choice of hyperparameters, limiting its broader application. Here, we present a self-supervised MDS approach informed by the dispersions of observations that share a common binary label ($F$-ratio). Our visualization accurately configures the $F$-ratio while consistently preserving the global structure with a low data distortion compared to existing dimensionality reduction tools. Using an algal microbiome dataset, we show that this new method better illustrates the community's response to the host, suggesting its potential impact on microbiology and ecology data analysis.
Submitted 1 August, 2023;
originally announced August 2023.
-
Drug Discovery under Covariate Shift with Domain-Informed Prior Distributions over Functions
Authors:
Leo Klarner,
Tim G. J. Rudner,
Michael Reutlinger,
Torsten Schindler,
Garrett M. Morris,
Charlotte Deane,
Yee Whye Teh
Abstract:
Accelerating the discovery of novel and more effective therapeutics is an important pharmaceutical problem in which deep learning is playing an increasingly significant role. However, real-world drug discovery tasks are often characterized by a scarcity of labeled data and significant covariate shift, a setting that poses a challenge to standard deep learning methods. In this paper, we present Q-SAVI, a probabilistic model able to address these challenges by encoding explicit prior knowledge of the data-generating process into a prior distribution over functions, presenting researchers with a transparent and probabilistically principled way to encode data-driven modeling preferences. Building on a novel, gold-standard bioactivity dataset that facilitates a meaningful comparison of models in an extrapolative regime, we explore different approaches to induce data shift and construct a challenging evaluation setup. We then demonstrate that using Q-SAVI to integrate contextualized prior knowledge of drug-like chemical space into the modeling process affords substantial gains in predictive accuracy and calibration, outperforming a broad range of state-of-the-art self-supervised pre-training and domain adaptation techniques.
Submitted 14 July, 2023;
originally announced July 2023.
-
Exploring QSAR Models for Activity-Cliff Prediction
Authors:
Markus Dablander,
Thierry Hanser,
Renaud Lambiotte,
Garrett M. Morris
Abstract:
Pairs of similar compounds that only differ by a small structural modification but exhibit a large difference in their binding affinity for a given target are known as activity cliffs (ACs). It has been hypothesised that quantitative structure-activity relationship (QSAR) models struggle to predict ACs and that ACs thus form a major source of prediction error. However, a study to explore the AC-prediction power of modern QSAR methods and its relationship to general QSAR-prediction performance is lacking. We systematically construct nine distinct QSAR models by combining three molecular representation methods (extended-connectivity fingerprints, physicochemical-descriptor vectors and graph isomorphism networks) with three regression techniques (random forests, k-nearest neighbours and multilayer perceptrons); we then use each resulting model to classify pairs of similar compounds as ACs or non-ACs and to predict the activities of individual molecules in three case studies: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. We observe low AC-sensitivity amongst the tested models when the activities of both compounds are unknown, but a substantial increase in AC-sensitivity when the actual activity of one of the compounds is given. Graph isomorphism features are found to be competitive with or superior to classical molecular representations for AC-classification and can thus be employed as baseline AC-prediction models or simple compound-optimisation tools. For general QSAR-prediction, however, extended-connectivity fingerprints still consistently deliver the best performance. Our results provide strong support for the hypothesis that indeed QSAR methods frequently fail to predict ACs. We propose twin-network training for deep learning models as a potential future pathway to increase AC-sensitivity and thus overall QSAR performance.
Submitted 31 January, 2023;
originally announced January 2023.
-
Automatic Dynamic Relevance Determination for Gaussian process regression with high-dimensional functional inputs
Authors:
Luis Damiano,
Margaret Johnson,
Joaquim Teixeira,
Max D. Morris,
Jarad Niemi
Abstract:
In the context of Gaussian process regression with functional inputs, it is common to treat the input as a vector. The parameter space becomes prohibitively complex as the number of functional points increases, effectively becoming a hindrance for automatic relevance determination in high-dimensional problems. Generalizing a framework for time-varying inputs, we introduce the asymmetric Laplace functional weight (ALF): a flexible, parametric function that drives predictive relevance over the index space. Automatic dynamic relevance determination (ADRD) is achieved with three unknowns per input variable and enforces smoothness over the index space. Additionally, we discuss a screening technique to assess, in the complete absence of prior and model information, whether ADRD is reasonably consistent with the data. Such a tool may serve for exploratory analyses and model diagnostics. ADRD is applied to remote sensing data and predictions are generated in response to atmospheric functional inputs. Fully Bayesian estimation is carried out to identify relevant regions of the functional input space. Validation is performed to benchmark against traditional vector-input model specifications. We find that ADRD outperforms models with input dimension reduction via functional principal component analysis. Furthermore, the predictive power is comparable to high-dimensional models, in terms of both mean prediction and uncertainty, with 10 times fewer tuning parameters. Enforcing smoothness on the predictive relevance profile rules out erratic patterns associated with vector-input models.
Submitted 31 August, 2022;
originally announced September 2022.
-
Prediction for Distributional Outcomes in High-Performance Computing I/O Variability
Authors:
Li Xu,
Yili Hong,
Max D. Morris,
Kirk W. Cameron
Abstract:
Although high-performance computing (HPC) systems have been scaled to meet the exponentially-growing demand for scientific computing, HPC performance variability remains a major challenge and has become a critical research topic in computer science. Statistically, performance variability can be characterized by a distribution. Predicting performance variability is a critical step in HPC performance variability management and is nontrivial because one needs to predict a distribution function based on system factors. In this paper, we propose a new framework to predict performance distributions. The proposed model is a modified Gaussian process that can predict the distribution function of the input/output (I/O) throughput under a specific HPC system configuration. We also impose a monotonic constraint so that the predicted function is nondecreasing, which is a property of the cumulative distribution function. Additionally, the proposed model can incorporate both quantitative and qualitative input variables. We evaluate the performance of the proposed method by using the IOzone variability data based on various prediction tasks. Results show that the proposed method can generate accurate predictions, and outperform existing methods. We also show how the predicted functional output can be used to generate predictions for a scalar summary of the performance distribution, such as the mean, standard deviation, and quantiles. Our methods can be further used as a surrogate model for HPC system variability monitoring and optimization.
Submitted 19 May, 2022;
originally announced May 2022.
-
ergm 4: Computational Improvements
Authors:
Pavel N. Krivitsky,
David R. Hunter,
Martina Morris,
Chad Klumb
Abstract:
The ergm package supports the statistical analysis and simulation of network data. It anchors the statnet suite of packages for network analysis in R introduced in a special issue in Journal of Statistical Software in 2008. This article provides an overview of the performance improvements in the 2021 release of ergm version 4. These include performance enhancements to the Markov chain Monte Carlo and maximum likelihood estimation algorithms as well as broader and faster searching for networks with certain target statistics using simulated annealing.
Submitted 15 March, 2022;
originally announced March 2022.
-
Approximations for STERGMs Based on Cross-Sectional Data
Authors:
Chad Klumb,
Martina Morris,
Steven M. Goodreau,
Samuel M. Jenness
Abstract:
Temporal exponential-family random graph models (TERGMs) are a flexible class of network models for the dynamics of tie formation and dissolution. In practice, separable TERGMs (STERGMs) are the subclass most often used, as these permit estimation from inexpensive cross-sectional study designs, and benefit from approximations designed to reduce the computational burden. Improving these approximations is the focus of this paper. We extend the work of Carnegie et al., which addressed the problem of constructing a STERGM with two specific equilibrium properties: a cross-sectional distribution defined by a given exponential-family random graph model (ERGM), and tie durations defined by given constant hazards of dissolution. We start with Carnegie et al.'s observation that the exact result is tractable in the dyad-independent case, and then show that taking the sparse limit of the exact result leads to a different approximation than the one they presented. We show that the new approximation outperforms theirs for sparse, dyad-independent models, and that for dyad-dependent models the errors tend to increase with the level of dependence for both approximations. We then extend the theoretical results of Carnegie et al. to the dyad-dependent case, proving that both the old and new approximations are asymptotically exact as the STERGM time step size goes to zero, for arbitrary dyad-dependent terms and some dyad-dependent constraints. We also show that the continuous-time limit of the discrete-time approximations has exactly the combination of cross-sectional and durational equilibrium behavior that we seek.
Submitted 19 March, 2022; v1 submitted 6 December, 2021;
originally announced December 2021.
-
Accuracy, Repeatability, and Reproducibility of Firearm Comparisons Part 1: Accuracy
Authors:
L. Scott Chumbley,
Max D. Morris,
Stanley J. Bajic,
Daniel Zamzow,
Erich Smith,
Keith Monson,
Gene Peters
Abstract:
Researchers at the Ames Laboratory-USDOE and the Federal Bureau of Investigation (FBI) conducted a study to assess the performance of forensic examiners in firearm investigations. The study involved three different types of firearms and 173 volunteers who compared both bullets and cartridge cases. The total number of comparisons reported is 20,130, allocated to assess accuracy (8,640), repeatability (5,700), and reproducibility (5,790) of the evaluations made by participating examiners. The overall false positive error rate was estimated as 0.656% and 0.933% for bullets and cartridge cases, respectively, while the rate of false negatives was estimated as 2.87% and 1.87% for bullets and cartridge cases, respectively. Because chi-square tests of independence strongly suggest that error probabilities are not the same for each examiner, these are maximum likelihood estimates based on the beta-binomial probability model and do not depend on an assumption of equal examiner-specific error rates. Corresponding 95% confidence intervals are (0.305%, 1.42%) and (0.548%, 1.57%) for false positives for bullets and cartridge cases, respectively, and (1.89%, 4.26%) and (1.16%, 2.99%) for false negatives for bullets and cartridge cases, respectively. These results are based on data representing all controlled conditions considered, including different firearm manufacturers, sequence of manufacture, and firing separation between unknown and known comparison specimens. The results are consistent with those of prior studies, despite this study's more robust design and challenging specimens.
Submitted 30 July, 2021;
originally announced August 2021.
-
ergm 4: New features
Authors:
Pavel N. Krivitsky,
David R. Hunter,
Martina Morris,
Chad Klumb
Abstract:
The ergm package supports the statistical analysis and simulation of network data. It anchors the statnet suite of packages for network analysis in R introduced in a special issue in Journal of Statistical Software in 2008. This article provides an overview of the new functionality in the 2021 release of ergm version 4. These include more flexible handling of nodal covariates, term operators that extend and simplify model specification, new models for networks with valued edges, improved handling of constraints on the sample space of networks, and estimation with missing edge data. We also identify the new packages in the statnet suite that extend ergm's functionality to other network data types and structural features and the robust set of online resources that support the statnet development process and applications.
Submitted 15 March, 2022; v1 submitted 9 June, 2021;
originally announced June 2021.
-
The prospects of quantum computing in computational molecular biology
Authors:
Carlos Outeiral,
Martin Strahm,
Jiye Shi,
Garrett M. Morris,
Simon C. Benjamin,
Charlotte M. Deane
Abstract:
Quantum computers can in principle solve certain problems exponentially more quickly than their classical counterparts. We have not yet reached the advent of useful quantum computation, but when we do, it will affect nearly all scientific disciplines. In this review, we examine how current quantum algorithms could revolutionize computational biology and bioinformatics. There are potential benefits across the entire field, from the ability to process vast amounts of information and run machine learning algorithms far more efficiently, to algorithms for quantum simulation that are poised to improve computational calculations in drug discovery, to quantum algorithms for optimization that may advance fields from protein structure prediction to network analysis. However, these exciting prospects are susceptible to "hype", and it is also important to recognize the caveats and challenges in this new technology. Our aim is to introduce the promise and limitations of emerging quantum computing technologies in the areas of computational molecular biology and bioinformatics.
Submitted 26 May, 2020;
originally announced May 2020.
-
Functional ANOVA with Multiple Distributions: Implications for the Sensitivity Analysis of Computer Experiments
Authors:
Emanuele Borgonovo,
Max D. Morris,
Elmar Plischke
Abstract:
The functional ANOVA expansion of a multivariate mapping plays a fundamental role in statistics. The expansion is unique once a unique distribution is assigned to the covariates. Recent investigations in the environmental and climate sciences show that analysts may not be in a position to assign a unique distribution in realistic applications. We offer a systematic investigation of existence, uniqueness, orthogonality, monotonicity and ultramodularity of the functional ANOVA expansion of a multivariate mapping when a multiplicity of distributions is assigned to the covariates. In particular, we show that a multivariate mapping can be associated with a core of probability measures that guarantee uniqueness. We obtain new results for variance decomposition and dimension distribution under mixtures. Implications for the global sensitivity analysis of computer experiments are also discussed.
Submitted 16 January, 2018;
originally announced January 2018.
-
Predictive modelling of training loads and injury in Australian football
Authors:
David L. Carey,
Kok-Leong Ong,
Rod Whiteley,
Kay M. Crossley,
Justin Crow,
Meg E. Morris
Abstract:
To investigate whether training load monitoring data could be used to predict injuries in elite Australian football players, data were collected from elite athletes over 3 seasons at an Australian football club. Loads were quantified using GPS devices, accelerometers and player perceived exertion ratings. Absolute and relative training load metrics were calculated for each player each day (rolling average, exponentially weighted moving average, acute:chronic workload ratio, monotony and strain). Injury prediction models (regularised logistic regression, generalised estimating equations, random forests and support vector machines) were built for non-contact, non-contact time-loss and hamstring-specific injuries using the first two seasons of data. Injury predictions were generated for the third season and evaluated using the area under the receiver operating characteristic curve (AUC). Predictive performance was only marginally better than chance for models of non-contact and non-contact time-loss injuries (AUC < 0.65). The best performing model was a multivariate logistic regression for hamstring injuries (best AUC = 0.76). Learning curves suggested logistic regression was underfitting the load-injury relationship and that using a more complex model or increasing the amount of model building data may lead to future improvements. Injury prediction models built using training load data from a single club showed poor ability to predict injuries when tested on previously unseen data, suggesting they are limited as a daily decision tool for practitioners. Focusing the modelling approach on specific injury types and increasing the amount of training data may lead to the development of improved predictive models for injury prevention.
Submitted 14 June, 2017;
originally announced June 2017.
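Two of the load metrics listed in this abstract, the acute:chronic workload ratio and the exponentially weighted moving average, can be sketched as follows. The abstract does not give window lengths or a decay rule; the 7-day and 28-day windows and the smoothing factor `2 / (span + 1)` used here are common conventions adopted as assumptions, and the function names are hypothetical.

```python
def rolling_mean(loads, window):
    """Mean of the most recent `window` daily loads (fewer if history is short)."""
    recent = loads[-window:]
    return sum(recent) / len(recent)


def acute_chronic_ratio(loads, acute_days=7, chronic_days=28):
    """Ratio of the short-term average load to the long-term average load.

    The 7-day and 28-day defaults are assumed conventions, not values
    stated in the abstract. Returns None when the chronic load is zero.
    """
    chronic = rolling_mean(loads, chronic_days)
    if chronic == 0:
        return None
    return rolling_mean(loads, acute_days) / chronic


def ewma(loads, span):
    """Exponentially weighted moving average, seeded with the first load.

    Uses the assumed smoothing factor alpha = 2 / (span + 1).
    """
    alpha = 2 / (span + 1)
    value = loads[0]
    for load in loads[1:]:
        value = alpha * load + (1 - alpha) * value
    return value


# Hypothetical history: three steady weeks followed by one doubled week.
daily_loads = [100] * 21 + [200] * 7
ratio = acute_chronic_ratio(daily_loads)
```

A ratio well above 1 flags a recent spike relative to the load the athlete has been accustomed to, which is the kind of feature such injury models consume.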
-
Adjusting for Network Size and Composition Effects in Exponential-Family Random Graph Models
Authors:
Pavel N. Krivitsky,
Mark S. Handcock,
Martina Morris
Abstract:
Exponential-family random graph models (ERGMs) provide a principled way to model and simulate features common in human social networks, such as propensities for homophily and friend-of-a-friend triad closure. We show that, without adjustment, ERGMs preserve density as network size increases. Density invariance is often not appropriate for social networks. We suggest a simple modification based on an offset which instead preserves the mean degree and accommodates changes in network composition asymptotically. We demonstrate that this approach allows ERGMs to be applied to the important situation of egocentrically sampled data. We analyze data from the National Health and Social Life Survey (NHSLS).
Submitted 27 December, 2010; v1 submitted 29 April, 2010;
originally announced April 2010.