Search | arXiv e-print repository

arXiv:2405.05909 [pdf, other]

Multilevel Regression and Poststratification Interface: Application to Track Community-level COVID-19 Viral Transmission

Authors: Yajuan Si, Toan Tran, Jonah Gabry, Mitzi Morris, Andrew Gelman

Abstract: In the absence of comprehensive or random testing throughout the COVID-19 pandemic, we have developed a proxy method for synthetic random sampling to estimate the actual viral incidence in the community, based on viral RNA testing of asymptomatic patients who present for elective procedures within a hospital system. The approach collects routine testing data on SARS-CoV-2 exposure among outpatients and performs statistical adjustments of sample representation using multilevel regression and poststratification (MRP). MRP adjusts for selection bias and yields stable small area estimates. We have developed an open-source, user-friendly MRP interface for public implementation of the statistical workflow. We illustrate the MRP interface with an application to track community-level COVID-19 viral transmission in the state of Michigan.

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2308.00354 [pdf, other]

Self-supervised Multidimensional Scaling with $F$-ratio: Improving Microbiome Visualization

Authors: Hyungseok Kim, Soobin Kim, Megan M. Morris, Jeffrey A. Kimbrel, Xavier Mayali, Cullen R. Buie

Abstract: Multidimensional scaling (MDS) is an unsupervised learning technique that preserves pairwise distances between observations and is commonly used for analyzing multivariate biological datasets. Recent advances in MDS have achieved successful classification results, but the configurations heavily depend on the choice of hyperparameters, limiting its broader application. Here, we present a self-supervised MDS approach informed by the dispersions of observations that share a common binary label ($F$-ratio). Our visualization accurately configures the $F$-ratio while consistently preserving the global structure with a low data distortion compared to existing dimensionality reduction tools. Using an algal microbiome dataset, we show that this new method better illustrates the community's response to the host, suggesting its potential impact on microbiology and ecology data analysis.

Submitted 1 August, 2023; originally announced August 2023.

arXiv:2307.15073 [pdf, other]

Drug Discovery under Covariate Shift with Domain-Informed Prior Distributions over Functions

Authors: Leo Klarner, Tim G. J. Rudner, Michael Reutlinger, Torsten Schindler, Garrett M. Morris, Charlotte Deane, Yee Whye Teh

Abstract: Accelerating the discovery of novel and more effective therapeutics is an important pharmaceutical problem in which deep learning is playing an increasingly significant role. However, real-world drug discovery tasks are often characterized by a scarcity of labeled data and significant covariate shift, a setting that poses a challenge to standard deep learning methods. In this paper, we present Q-SAVI, a probabilistic model able to address these challenges by encoding explicit prior knowledge of the data-generating process into a prior distribution over functions, presenting researchers with a transparent and probabilistically principled way to encode data-driven modeling preferences. Building on a novel, gold-standard bioactivity dataset that facilitates a meaningful comparison of models in an extrapolative regime, we explore different approaches to induce data shift and construct a challenging evaluation setup. We then demonstrate that using Q-SAVI to integrate contextualized prior knowledge of drug-like chemical space into the modeling process affords substantial gains in predictive accuracy and calibration, outperforming a broad range of state-of-the-art self-supervised pre-training and domain adaptation techniques.

Submitted 14 July, 2023; originally announced July 2023.

Comments: Published in the Proceedings of the 40th International Conference on Machine Learning (ICML 2023)

arXiv:2301.13644 [pdf, other]

doi 10.1186/s13321-023-00708-w

Exploring QSAR Models for Activity-Cliff Prediction

Authors: Markus Dablander, Thierry Hanser, Renaud Lambiotte, Garrett M. Morris

Abstract: Pairs of similar compounds that only differ by a small structural modification but exhibit a large difference in their binding affinity for a given target are known as activity cliffs (ACs). It has been hypothesised that quantitative structure-activity relationship (QSAR) models struggle to predict ACs and that ACs thus form a major source of prediction error. However, a study to explore the AC-prediction power of modern QSAR methods and its relationship to general QSAR-prediction performance is lacking. We systematically construct nine distinct QSAR models by combining three molecular representation methods (extended-connectivity fingerprints, physicochemical-descriptor vectors and graph isomorphism networks) with three regression techniques (random forests, k-nearest neighbours and multilayer perceptrons); we then use each resulting model to classify pairs of similar compounds as ACs or non-ACs and to predict the activities of individual molecules in three case studies: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. We observe low AC-sensitivity amongst the tested models when the activities of both compounds are unknown, but a substantial increase in AC-sensitivity when the actual activity of one of the compounds is given. Graph isomorphism features are found to be competitive with or superior to classical molecular representations for AC-classification and can thus be employed as baseline AC-prediction models or simple compound-optimisation tools. For general QSAR-prediction, however, extended-connectivity fingerprints still consistently deliver the best performance. Our results provide strong support for the hypothesis that indeed QSAR methods frequently fail to predict ACs. We propose twin-network training for deep learning models as a potential future pathway to increase AC-sensitivity and thus overall QSAR performance.

Submitted 31 January, 2023; originally announced January 2023.

Comments: Submitted to Journal of Cheminformatics

Journal ref: Journal of Cheminformatics 15.1 (2023): 47

arXiv:2209.00044 [pdf, other]

Automatic Dynamic Relevance Determination for Gaussian process regression with high-dimensional functional inputs

Authors: Luis Damiano, Margaret Johnson, Joaquim Teixeira, Max D. Morris, Jarad Niemi

Abstract: In the context of Gaussian process regression with functional inputs, it is common to treat the input as a vector. The parameter space becomes prohibitively complex as the number of functional points increases, effectively becoming a hindrance for automatic relevance determination in high-dimensional problems. Generalizing a framework for time-varying inputs, we introduce the asymmetric Laplace functional weight (ALF): a flexible, parametric function that drives predictive relevance over the index space. Automatic dynamic relevance determination (ADRD) is achieved with three unknowns per input variable and enforces smoothness over the index space. Additionally, we discuss a screening technique to assess, under complete absence of prior and model information, whether ADRD is reasonably consistent with the data. Such a tool may serve for exploratory analyses and model diagnostics. ADRD is applied to remote sensing data and predictions are generated in response to atmospheric functional inputs. Fully Bayesian estimation is carried out to identify relevant regions of the functional input space. Validation is performed to benchmark against traditional vector-input model specifications. We find that ADRD outperforms models with input dimension reduction via functional principal component analysis. Furthermore, the predictive power is comparable to high-dimensional models, in terms of both mean prediction and uncertainty, with 10 times fewer tuning parameters. Enforcing smoothness on the predictive relevance profile rules out erratic patterns associated with vector-input models.

Submitted 31 August, 2022; originally announced September 2022.

Comments: Submitted to Technometrics. 34 pages, 5 figures, 2 tables

arXiv:2205.09879 [pdf, other]

Prediction for Distributional Outcomes in High-Performance Computing I/O Variability

Authors: Li Xu, Yili Hong, Max D. Morris, Kirk W. Cameron

Abstract: Although high-performance computing (HPC) systems have been scaled to meet the exponentially-growing demand for scientific computing, HPC performance variability remains a major challenge and has become a critical research topic in computer science. Statistically, performance variability can be characterized by a distribution. Predicting performance variability is a critical step in HPC performance variability management and is nontrivial because one needs to predict a distribution function based on system factors. In this paper, we propose a new framework to predict performance distributions. The proposed model is a modified Gaussian process that can predict the distribution function of the input/output (I/O) throughput under a specific HPC system configuration. We also impose a monotonic constraint so that the predicted function is nondecreasing, which is a property of the cumulative distribution function. Additionally, the proposed model can incorporate both quantitative and qualitative input variables. We evaluate the performance of the proposed method by using the IOzone variability data based on various prediction tasks. Results show that the proposed method can generate accurate predictions, and outperform existing methods. We also show how the predicted functional output can be used to generate predictions for a scalar summary of the performance distribution, such as the mean, standard deviation, and quantiles. Our methods can be further used as a surrogate model for HPC system variability monitoring and optimization.

Submitted 19 May, 2022; originally announced May 2022.

Comments: 31 pages, 10 figures

arXiv:2203.08198 [pdf, other]

ergm 4: Computational Improvements

Authors: Pavel N. Krivitsky, David R. Hunter, Martina Morris, Chad Klumb

Abstract: The ergm package supports the statistical analysis and simulation of network data. It anchors the statnet suite of packages for network analysis in R introduced in a special issue in Journal of Statistical Software in 2008. This article provides an overview of the performance improvements in the 2021 release of ergm version 4. These include performance enhancements to the Markov chain Monte Carlo and maximum likelihood estimation algorithms as well as broader and faster searching for networks with certain target statistics using simulated annealing.

Submitted 15 March, 2022; originally announced March 2022.

Comments: Computational improvements discussion originally in arXiv:2106.04997v1, extracted into its own preprint; 23 pages, 2 figures, 3 tables

arXiv:2112.03239 [pdf, ps, other]

Approximations for STERGMs Based on Cross-Sectional Data

Authors: Chad Klumb, Martina Morris, Steven M. Goodreau, Samuel M. Jenness

Abstract: Temporal exponential-family random graph models (TERGMs) are a flexible class of network models for the dynamics of tie formation and dissolution. In practice, separable TERGMs (STERGMs) are the subclass most often used, as these permit estimation from inexpensive cross-sectional study designs, and benefit from approximations designed to reduce the computational burden. Improving the approximations is the focus of this paper. We extend the work of Carnegie et al., which addressed the problem of constructing a STERGM with two specific equilibrium properties: a cross-sectional distribution defined by a given exponential-family random graph model (ERGM), and tie durations defined by given constant hazards of dissolution. We start with Carnegie et al.'s observation that the exact result is tractable in the dyad-independent case, and then show that taking the sparse limit of the exact result leads to a different approximation than the one they presented. We show that the new approximation outperforms theirs for sparse, dyad-independent models, and that for dyad-dependent models the errors tend to increase with the level of dependence for both approximations. We then extend the theoretical results of Carnegie et al. to the dyad-dependent case, proving that both the old and new approximations are asymptotically exact as the STERGM time step size goes to zero, for arbitrary dyad-dependent terms and some dyad-dependent constraints. We also show that the continuous-time limit of the discrete-time approximations has exactly the combination of cross-sectional and durational equilibrium behavior that we seek.

Submitted 19 March, 2022; v1 submitted 6 December, 2021; originally announced December 2021.

Comments: 35 pages, 2 figures

arXiv:2108.04030 [pdf]

Accuracy, Repeatability, and Reproducibility of Firearm Comparisons Part 1: Accuracy

Authors: L. Scott Chumbley, Max D. Morris, Stanley J. Bajic, Daniel Zamzow, Erich Smith, Keith Monson, Gene Peters

Abstract: Researchers at the Ames Laboratory-USDOE and the Federal Bureau of Investigation (FBI) conducted a study to assess the performance of forensic examiners in firearm investigations. The study involved three different types of firearms and 173 volunteers who compared both bullets and cartridge cases. The total number of comparisons reported is 20,130, allocated to assess accuracy (8,640), repeatability (5,700), and reproducibility (5,790) of the evaluations made by participating examiners. The overall false positive error rate was estimated as 0.656% and 0.933% for bullets and cartridge cases, respectively, while the rate of false negatives was estimated as 2.87% and 1.87% for bullets and cartridge cases, respectively. Because chi-square tests of independence strongly suggest that error probabilities are not the same for each examiner, these are maximum likelihood estimates based on the beta-binomial probability model and do not depend on an assumption of equal examiner-specific error rates. Corresponding 95% confidence intervals are (0.305%, 1.42%) and (0.548%, 1.57%) for false positives for bullets and cartridge cases, respectively, and (1.89%, 4.26%) and (1.16%, 2.99%) for false negatives for bullets and cartridge cases, respectively. These results are based on data representing all controlled conditions considered, including different firearm manufacturers, sequence of manufacture, and firing separation between unknown and known comparison specimens. The results are consistent with those of prior studies, despite this study's more robust design and challenging specimens.

Submitted 30 July, 2021; originally announced August 2021.

arXiv:2106.04997 [pdf, other]

doi 10.18637/jss.v105.i06

ergm 4: New features

Authors: Pavel N. Krivitsky, David R. Hunter, Martina Morris, Chad Klumb

Abstract: The ergm package supports the statistical analysis and simulation of network data. It anchors the statnet suite of packages for network analysis in R introduced in a special issue in Journal of Statistical Software in 2008. This article provides an overview of the new functionality in the 2021 release of ergm version 4. These include more flexible handling of nodal covariates, term operators that extend and simplify model specification, new models for networks with valued edges, improved handling of constraints on the sample space of networks, and estimation with missing edge data. We also identify the new packages in the statnet suite that extend ergm's functionality to other network data types and structural features and the robust set of online resources that support the statnet development process and applications.

Submitted 15 March, 2022; v1 submitted 9 June, 2021; originally announced June 2021.

Comments: Computational improvements discussion in the previous version was split out into another preprint; 30 pages, 2 figures

Journal ref: Journal of Statistical Software, 105(1), 1-44 (2023)

arXiv:2005.12792 [pdf]

doi 10.1002/wcms.1481

The prospects of quantum computing in computational molecular biology

Authors: Carlos Outeiral, Martin Strahm, Jiye Shi, Garrett M. Morris, Simon C. Benjamin, Charlotte M. Deane

Abstract: Quantum computers can in principle solve certain problems exponentially more quickly than their classical counterparts. We have not yet reached the advent of useful quantum computation, but when we do, it will affect nearly all scientific disciplines. In this review, we examine how current quantum algorithms could revolutionize computational biology and bioinformatics. There are potential benefits across the entire field, from the ability to process vast amounts of information and run machine learning algorithms far more efficiently, to algorithms for quantum simulation that are poised to improve computational calculations in drug discovery, to quantum algorithms for optimization that may advance fields from protein structure prediction to network analysis. However, these exciting prospects are susceptible to "hype", and it is also important to recognize the caveats and challenges in this new technology. Our aim is to introduce the promise and limitations of emerging quantum computing technologies in the areas of computational molecular biology and bioinformatics.

Submitted 26 May, 2020; originally announced May 2020.

Comments: 23 pages, 3 figures

Journal ref: WIREs Computational Molecular Science, 2020

arXiv:1801.05186 [pdf, ps, other]

Functional ANOVA with Multiple Distributions: Implications for the Sensitivity Analysis of Computer Experiments

Authors: Emanuele Borgonovo, Max D. Morris, Elmar Plischke

Abstract: The functional ANOVA expansion of a multivariate mapping plays a fundamental role in statistics. The expansion is unique once a unique distribution is assigned to the covariates. Recent investigations in the environmental and climate sciences show that analysts may not be in a position to assign a unique distribution in realistic applications. We offer a systematic investigation of existence, uniqueness, orthogonality, monotonicity and ultramodularity of the functional ANOVA expansion of a multivariate mapping when a multiplicity of distributions is assigned to the covariates. In particular, we show that a multivariate mapping can be associated with a core of probability measures that guarantee uniqueness. We obtain new results for variance decomposition and dimension distribution under mixtures. Implications for the global sensitivity analysis of computer experiments are also discussed.

Submitted 16 January, 2018; originally announced January 2018.

Comments: To appear in the SIAM/ASA Journal on Uncertainty Quantification, 2018

arXiv:1706.04336 [pdf, other]

doi 10.2478/ijcss-2018-0002

Predictive modelling of training loads and injury in Australian football

Authors: David L. Carey, Kok-Leong Ong, Rod Whiteley, Kay M. Crossley, Justin Crow, Meg E. Morris

Abstract: To investigate whether training load monitoring data could be used to predict injuries in elite Australian football players, data were collected from elite athletes over 3 seasons at an Australian football club. Loads were quantified using GPS devices, accelerometers and player perceived exertion ratings. Absolute and relative training load metrics were calculated for each player each day (rolling average, exponentially weighted moving average, acute:chronic workload ratio, monotony and strain). Injury prediction models (regularised logistic regression, generalised estimating equations, random forests and support vector machines) were built for non-contact, non-contact time-loss and hamstring specific injuries using the first two seasons of data. Injury predictions were generated for the third season and evaluated using the area under the receiver operator characteristic (AUC). Predictive performance was only marginally better than chance for models of non-contact and non-contact time-loss injuries (AUC$<$0.65). The best performing model was a multivariate logistic regression for hamstring injuries (best AUC=0.76). Learning curves suggested logistic regression was underfitting the load-injury relationship and that using a more complex model or increasing the amount of model building data may lead to future improvements. Injury prediction models built using training load data from a single club showed poor ability to predict injuries when tested on previously unseen data, suggesting they are limited as a daily decision tool for practitioners. Focusing the modelling approach on specific injury types and increasing the amount of training data may lead to the development of improved predictive models for injury prevention.

Submitted 14 June, 2017; originally announced June 2017.

Comments: 15 pages, 5 figures

arXiv:1004.5328 [pdf, other]

doi 10.1016/j.stamet.2011.01.005

Adjusting for Network Size and Composition Effects in Exponential-Family Random Graph Models

Authors: Pavel N. Krivitsky, Mark S. Handcock, Martina Morris

Abstract: Exponential-family random graph models (ERGMs) provide a principled way to model and simulate features common in human social networks, such as propensities for homophily and friend-of-a-friend triad closure. We show that, without adjustment, ERGMs preserve density as network size increases. Density invariance is often not appropriate for social networks. We suggest a simple modification based on an offset which instead preserves the mean degree and accommodates changes in network composition asymptotically. We demonstrate that this approach allows ERGMs to be applied to the important situation of egocentrically sampled data. We analyze data from the National Health and Social Life Survey (NHSLS).

Submitted 27 December, 2010; v1 submitted 29 April, 2010; originally announced April 2010.

Comments: 37 pages, 2 figures, 5 tables; notation revised and clarified, some sections (particularly 4.3 and 5) made more rigorous, some derivations moved into the appendix, typos fixed, some wording changed

MSC Class: 91D30 (Primary) 62D; 62F12; 62F40; 62P25; 62M40 (Secondary)

Journal ref: Statistical Methodology 8 (2011) 319-339

Showing 1–14 of 14 results for author: Morris, M