-
Computationally Efficient and Error Aware Surrogate Construction for Numerical Solutions of Subsurface Flow Through Porous Media
Authors:
Aleksei G. Sorokin,
Aleksandra Pachalieva,
Daniel O'Malley,
James M. Hyman,
Fred J. Hickernell,
Nicolas W. Hengartner
Abstract:
Limiting the injection rate to restrict the pressure below a threshold at a critical location can be an important goal of simulations that model the subsurface pressure between injection and extraction wells. The pressure is approximated by the solution of Darcy's partial differential equation (PDE) for a given permeability field. The subsurface permeability is modeled as a random field since it is known only up to statistical properties. This induces uncertainty in the computed pressure. Solving the PDE for an ensemble of random permeability simulations enables estimating a probability distribution for the pressure at the critical location. These simulations are computationally expensive, and practitioners often need rapid online guidance for real-time pressure management. An ensemble of numerical PDE solutions is used to construct a Gaussian process regression model that can quickly predict the pressure at the critical location as a function of the extraction rate and permeability realization.
Our first novel contribution is to identify a sampling methodology for the random environment and matching kernel technology for which fitting the Gaussian process regression model scales as O(n log n) instead of the typical O(n^3) rate in the number of samples n used to fit the surrogate. The surrogate model allows almost instantaneous predictions for the pressure at the critical location as a function of the extraction rate and permeability realization. Our second contribution is a novel algorithm to calibrate the uncertainty in the surrogate model to the discrepancy between the true pressure solution of Darcy's equation and the numerical solution. Although our method is derived for building a surrogate for the solution of Darcy's equation with a random permeability field, the framework broadly applies to solutions of other PDEs with random coefficients.
Submitted 20 October, 2023;
originally announced October 2023.
-
Why I'm not Answering: Understanding Determinants of Classification of an Abstaining Classifier for Cancer Pathology Reports
Authors:
Sayera Dhaubhadel,
Jamaludin Mohd-Yusof,
Kumkum Ganguly,
Gopinath Chennupati,
Sunil Thulasidasan,
Nicolas W. Hengartner,
Brent J. Mumphrey,
Eric B. Durbin,
Jennifer A. Doherty,
Mireille Lemieux,
Noah Schaefferkoetter,
Georgia Tourassi,
Linda Coyle,
Lynne Penberthy,
Benjamin H. McMahon,
Tanmoy Bhattacharya
Abstract:
Safe deployment of deep learning systems in critical real-world applications requires models to make very few mistakes, and only under predictable circumstances. In this work, we address this problem using an abstaining classifier that is tuned to have $>$95% accuracy, and then identify the determinants of abstention using LIME. Essentially, we are training our model to learn the attributes of pathology reports that are likely to lead to incorrect classifications, albeit at the cost of reduced sensitivity. We demonstrate an abstaining classifier in a multitask setting for classifying cancer pathology reports from the NCI SEER cancer registries on six tasks of interest. For these tasks, we reduce the classification error rate by factors of 2--5 by abstaining on 25--45% of the reports. For the specific task of classifying cancer site, we are able to identify metastasis, reports involving lymph nodes, and discussion of multiple cancer sites as responsible for many of the classification mistakes, and observe that the extent and types of mistakes vary systematically with cancer site (e.g., breast, lung, and prostate). When combining across three of the tasks, our model classifies 50% of the reports with an accuracy greater than 95% for three of the six tasks, and greater than 85% for all six tasks on the retained samples. Furthermore, we show that LIME provides a better determinant of classification than measures of word occurrence alone. By combining a deep abstaining classifier with feature identification using LIME, we are able to identify concepts responsible for both correctness and abstention when classifying cancer sites from pathology reports. The improvement of LIME over keyword searches is statistically significant, presumably because words are assessed in context and have been identified as a local determinant of classification.
Submitted 21 April, 2022; v1 submitted 10 September, 2020;
originally announced September 2020.
-
A Note on Using Discretized Simulated Data to Estimate Implicit Likelihoods in Bayesian Analyses
Authors:
M. S. Hamada,
T. L. Graves,
N. W. Hengartner,
D. M. Higdon,
A. V. Huzurbazar,
E. C. Lawrence,
C. D. Linkletter,
C. S. Reese,
D. W. Scott,
R. R. Sitter,
R. L. Warr,
B. J. Williams
Abstract:
This article presents a Bayesian inferential method where the likelihood for a model is unknown but where data can easily be simulated from the model. We discretize simulated (continuous) data to estimate the implicit likelihood in a Bayesian analysis employing a Markov chain Monte Carlo algorithm. Three examples are presented as well as a small study on some of the method's properties.
Submitted 6 August, 2020;
originally announced August 2020.
-
What needles do sparse neural networks find in nonlinear haystacks
Authors:
Sylvain Sardy,
Nicolas W Hengartner,
Nikolai Bonenko,
Yen Ting Lin
Abstract:
Using a sparsity-inducing penalty in artificial neural networks (ANNs) avoids over-fitting, especially in situations where noise is high and the training set is small in comparison to the number of features. For linear models, such an approach provably also recovers the important features with high probability in regimes for a well-chosen penalty parameter. The typical way of setting the penalty parameter is by splitting the data set and performing cross-validation, which is (1) computationally expensive and (2) not desirable when the data set is already too small to be split further (for example, whole-genome sequence data). In this study, we establish the theoretical foundation to select the penalty parameter without cross-validation based on bounding with a high probability the infinite norm of the gradient of the loss function at zero under the zero-feature assumption. Our approach is a generalization of the universal threshold of Donoho and Johnstone (1994) to nonlinear ANN learning. We perform a set of comprehensive Monte Carlo simulations on a simple model, and the numerical results show the effectiveness of the proposed approach.
Submitted 7 June, 2020;
originally announced June 2020.
-
The Novel Coronavirus, 2019-nCoV, is Highly Contagious and More Infectious Than Initially Estimated
Authors:
Steven Sanche,
Yen Ting Lin,
Chonggang Xu,
Ethan Romero-Severson,
Nicolas W. Hengartner,
Ruian Ke
Abstract:
The novel coronavirus (2019-nCoV) is a recently emerged human pathogen that has spread widely since January 2020. Initially, the basic reproductive number, R0, was estimated to be 2.2 to 2.7. Here we provide a new estimate of this quantity. We collected extensive individual case reports and estimated key epidemiology parameters, including the incubation period. Integrating these estimates and high-resolution real-time human travel and infection data with mathematical models, we estimated that the number of infected individuals during the early epidemic doubles every 2.4 days, and the R0 value is likely to be between 4.7 and 6.6. We further show that quarantine and contact tracing of symptomatic individuals alone may not be effective and that early, strong control measures are needed to stop transmission of the virus.
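As a small illustration of the doubling time quoted in this abstract, the sketch below converts a doubling time into an exponential growth rate using the identity exp(r * T_d) = 2. It assumes pure exponential growth in the early epidemic; the function name is hypothetical, and the sketch does not reproduce the paper's R0 estimate, which needs further assumptions about the incubation and infectious periods.

```python
import math


def growth_rate_from_doubling_time(doubling_time_days: float) -> float:
    """Return the exponential growth rate r (per day) with exp(r * T_d) = 2."""
    if doubling_time_days <= 0:
        raise ValueError("doubling time must be positive")
    return math.log(2.0) / doubling_time_days


# Doubling time of 2.4 days, as stated in the abstract.
rate = growth_rate_from_doubling_time(2.4)
print(f"growth rate: {rate:.3f} per day")
print(f"growth factor over one week: {math.exp(7 * rate):.2f}")
```

Turning the growth rate into R0 depends on the assumed generation-interval distribution, which is why that step is left out here.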
Submitted 8 February, 2020;
originally announced February 2020.
-
Development of a Fragment-Based Machine Learning Algorithm for Designing Hybrid Drugs Optimized for Permeating Gram-Negative Bacteria
Authors:
Rachael A. Mansbach,
Inga V. Leus,
Jitender Mehla,
Cesar A. Lopez,
John K. Walker,
Valentin V. Rybenkov,
Nicolas W. Hengartner,
Helen I. Zgurskaya,
S. Gnanakaran
Abstract:
Gram-negative bacteria are a serious health concern due to the strong multidrug resistance that they display, partly due to the presence of a permeability barrier comprising two membranes with active efflux. New approaches are urgently needed to design antibiotics effective against these pathogens. In this work, we present a novel topological fragment-based approach ("Hunting Fragments Of X" or "Hunting FOX") to rationally "hunt for" chemical fragments that promote compound ability to permeate the outer membrane. Our approach generalizes to other drug design applications. We measure minimum inhibitory concentrations of compounds in two strains of Pseudomonas aeruginosa with variable permeability barriers and use them as an input to the Hunting FOX algorithm to identify molecular fragments responsible for enhanced outer membrane permeation properties and candidate molecules from an external library that demonstrate good permeation ability. Overall, we present proof of concept for a novel method that is expected to be valuable for rational design of hybrid drugs.
Submitted 29 July, 2019;
originally announced July 2019.
-
Computing Long Timescale Biomolecular Dynamics using Quasi-Stationary Distribution Kinetic Monte Carlo (QSD-KMC)
Authors:
Animesh Agarwal,
Nicolas W. Hengartner,
S. Gnanakaran,
Arthur F. Voter
Abstract:
It is a challenge to obtain an accurate model of the state-to-state dynamics of a complex biological system from molecular dynamics (MD) simulations. In recent years, Markov State Models have gained immense popularity for computing state-to-state dynamics from a pool of short MD simulations. However, the assumption that the underlying dynamics on the reduced space is Markovian induces a systematic bias in the model, especially in biomolecular systems with complicated energy landscapes. To address this problem, we have devised a new approach we call quasi-stationary distribution kinetic Monte Carlo (QSD-KMC) that gives accurate long time state-to-state evolution while retaining the entire time resolution even when the dynamics is highly non-Markovian. The proposed method is a kinetic Monte Carlo approach that takes advantage of two concepts: (i) the quasi-stationary distribution and (ii) dynamical corrections theory. Implementation of QSD-KMC imposes stricter requirements on the lengths of the trajectories than in a Markov State Model approach, as the trajectories must be long enough to dephase. However, the QSD-KMC model produces state-to-state trajectories that are statistically indistinguishable from an MD trajectory mapped onto the discrete set of states, for an arbitrary choice of state decomposition. Furthermore, the aforementioned concepts can be used to construct a Monte Carlo approach to optimize the state boundaries regardless of the initial choice of states. We demonstrate the QSD-KMC method on two one-dimensional model systems, one of which is a driven nonequilibrium system, and on two well-characterized biomolecular systems.
Submitted 12 July, 2019; v1 submitted 23 May, 2019;
originally announced May 2019.
-
The phase transition in inhomogeneous random intersection graphs
Authors:
Milan Bradonjić,
Aric Hagberg,
Nicolas W. Hengartner,
Nathan Lemons,
Allon G. Percus
Abstract:
We analyze the component evolution in inhomogeneous random intersection graphs when the average degree is close to 1. As the average degree increases, the size of the largest component in the random intersection graph goes through a phase transition. We give bounds on the size of the largest components before and after this transition. We also prove that the largest component after the transition is unique. These results are similar to the phase transition in Erdős-Rényi random graphs; one notable difference is that the jump in the size of the largest component varies in size depending on the parameters of the random intersection graph.
Submitted 30 January, 2013;
originally announced January 2013.
-
Randomness in Competitions
Authors:
E. Ben-Naim,
N. W. Hengartner,
S. Redner,
F. Vazquez
Abstract:
We study the effects of randomness on competitions based on an elementary random process in which there is a finite probability that a weaker team upsets a stronger team. We apply this model to sports leagues and sports tournaments, and compare the theoretical results with empirical data. Our model shows that single-elimination tournaments are efficient but unfair: the number of games is proportional to the number of teams $N$, but the probability that the weakest team wins decays only algebraically with $N$. In contrast, leagues, where every team plays every other team, are fair but inefficient: the top $\sqrt{N}$ teams remain in contention for the championship, while the probability that the weakest team becomes champion is exponentially small. We also propose a gradual elimination schedule that consists of a preliminary round and a championship round. Initially, teams play a small number of preliminary games, and subsequently, a few teams qualify for the championship round. This algorithm is fair and efficient: the best team wins with a high probability and the number of games scales as $N^{9/5}$, whereas traditional leagues require $N^3$ games to fairly determine a champion.
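The upset model described in this abstract can be sketched as a short simulation of a single-elimination tournament. This is a minimal illustration, not the authors' code: the function name, the random bracket seeding, and the bye for an odd team out are assumptions made here. Teams are labeled by rank, with 0 the strongest, and in each game the weaker team wins with probability `upset_prob`.

```python
import random


def single_elimination(n_teams: int, upset_prob: float, rng: random.Random):
    """Simulate one tournament; return (champion_rank, games_played).

    Rank 0 is the strongest team. The bracket is seeded at random
    (an assumption of this sketch).
    """
    teams = list(range(n_teams))
    rng.shuffle(teams)
    games = 0
    while len(teams) > 1:
        winners = []
        # Pair up adjacent teams; the lower rank number is the stronger team.
        for a, b in zip(teams[0::2], teams[1::2]):
            strong, weak = min(a, b), max(a, b)
            winners.append(weak if rng.random() < upset_prob else strong)
            games += 1
        if len(teams) % 2 == 1:
            winners.append(teams[-1])  # odd team out advances with a bye
        teams = winners
    return teams[0], games


rng = random.Random(0)
champion, games = single_elimination(64, upset_prob=0.25, rng=rng)
print(f"champion rank: {champion}, games played: {games}")
```

Every game eliminates exactly one team, so the number of games is always N - 1, which is the linear scaling mentioned in the abstract.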
Submitted 21 September, 2012;
originally announced September 2012.
-
Component Evolution in General Random Intersection Graphs
Authors:
Milan Bradonjic,
Aric Hagberg,
Nicolas W. Hengartner,
Allon G. Percus
Abstract:
Random intersection graphs (RIGs) are an important random structure with applications in social networks, epidemic networks, blog readership, and wireless sensor networks. RIGs can be interpreted as a model for large randomly formed non-metric data sets. We analyze the component evolution in general RIGs, and give conditions on existence and uniqueness of the giant component. Our techniques generalize existing methods for analysis of component evolution: we analyze survival and extinction properties of a dependent, inhomogeneous Galton-Watson branching process on general RIGs. Our analysis relies on bounding the branching processes and inherits the fundamental concepts of the study of component evolution in Erdős-Rényi graphs. The major challenge comes from the underlying structure of RIGs, which involves both the set of nodes and the set of attributes, as well as the set of different probabilities among the nodes and attributes.
Submitted 29 May, 2010;
originally announced May 2010.
-
How to Choose a Champion
Authors:
E. Ben-Naim,
N. W. Hengartner
Abstract:
League competition is investigated using random processes and scaling techniques. In our model, a weak team can upset a strong team with a fixed probability. Teams play an equal number of head-to-head matches and the team with the largest number of wins is declared to be the champion. The total number of games needed for the best team to win the championship with high certainty, $T$, grows as the cube of the number of teams, $N$, i.e., $T \sim N^3$. This number can be substantially reduced using preliminary rounds where teams play a small number of games and subsequently, only the top teams advance to the next round. When there are $k$ rounds, the total number of games needed for the best team to emerge as champion, $T_k$, scales as $T_k \sim N^{\gamma_k}$ with $\gamma_k = 1/[1-(2/3)^{k+1}]$. For example, $\gamma_k = 9/5, 27/19, 81/65$ for $k = 1, 2, 3$. These results suggest an algorithm for how to infer the best team using a schedule that is linear in $N$. We conclude that league format is an ineffective method of determining the best team, and that sequential elimination from the bottom up is fair and efficient.
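The exponent formula quoted in this abstract can be checked with exact rational arithmetic. The sketch below is only an illustration of that formula; the function name `scaling_exponent` is hypothetical.

```python
from fractions import Fraction


def scaling_exponent(k: int) -> Fraction:
    """Exponent gamma_k = 1 / [1 - (2/3)^(k+1)] for k preliminary rounds."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1 / (1 - Fraction(2, 3) ** (k + 1))


for k in range(0, 4):
    print(f"k = {k}: gamma_k = {scaling_exponent(k)}")
```

Setting k = 0 gives an exponent of 3, matching the cubic scaling of a plain league, and the exponent decreases toward 1 as k grows, consistent with the abstract's remark about a schedule that is linear in N.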
Submitted 21 December, 2006;
originally announced December 2006.
-
The Basic Reproductive Number of Ebola and the Effects of Public Health Measures: The Cases of Congo and Uganda
Authors:
Gerardo Chowell,
Nick W. Hengartner,
Carlos Castillo-Chavez,
Paul W. Fenimore,
J. M. Hyman
Abstract:
Despite improved control measures, Ebola remains a serious public health risk in African regions where recurrent outbreaks have been observed since the initial epidemic in 1976. Using epidemic modeling and data from two well-documented Ebola outbreaks (Congo 1995 and Uganda 2000), we estimate the number of secondary cases generated by an index case in the absence of control interventions ($R_0$). Our estimate of $R_0$ is 1.83 (SD 0.06) for Congo (1995) and 1.34 (SD 0.03) for Uganda (2000). We model the course of the outbreaks via an SEIR (susceptible-exposed-infectious-removed) epidemic model that includes a smooth transition in the transmission rate after control interventions are put in place. We perform an uncertainty analysis of the basic reproductive number $R_0$ to quantify its sensitivity to other disease-related parameters. We also analyze the sensitivity of the final epidemic size to the time interventions begin and provide a distribution for the final epidemic size. The control measures implemented during these two outbreaks (including education and contact tracing followed by quarantine) reduce the final epidemic size by a factor of 2 relative to the final size with a two-week delay in their implementation.
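The kind of model described in this abstract can be sketched as an SEIR system whose transmission rate declines smoothly once interventions begin. This is an illustrative sketch only: the exponential form of the transition, the forward-Euler integrator, and every parameter value below (population size, incubation and removal rates, post-control transmission rate, decay rate `q`, intervention times) are assumptions chosen for demonstration, not values taken from the paper. Only the pre-control ratio of transmission rate to removal rate is set to the quoted $R_0$ of 1.83, using the standard relation R0 = beta / gamma for this SEIR form.

```python
import math


def transmission_rate(t, beta0, beta1, tau, q):
    """beta0 before control starts at time tau, then a smooth exponential
    decay toward beta1 (assumed functional form)."""
    if t < tau:
        return beta0
    return beta1 + (beta0 - beta1) * math.exp(-q * (t - tau))


def simulate_seir(tau, beta0, beta1, q=0.5, k=0.2, removal_rate=0.2,
                  n=5000.0, days=400.0, dt=0.05):
    """Forward-Euler SEIR run starting from one infectious case.

    Returns (s, e, i, r, cumulative_cases) at the final time.
    """
    s, e, i, r = n - 1.0, 0.0, 1.0, 0.0
    cumulative = 1.0  # cases that have become infectious so far
    for step in range(int(round(days / dt))):
        t = step * dt
        new_exposed = transmission_rate(t, beta0, beta1, tau, q) * s * i / n * dt
        new_infectious = k * e * dt
        new_removed = removal_rate * i * dt
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
        cumulative += new_infectious
    return s, e, i, r, cumulative


REMOVAL_RATE = 0.2             # hypothetical, per day
BETA0 = 1.83 * REMOVAL_RATE    # pre-control rate, so beta0 / gamma = 1.83
BETA1 = 0.5 * REMOVAL_RATE     # hypothetical post-control rate

on_time = simulate_seir(tau=50.0, beta0=BETA0, beta1=BETA1, removal_rate=REMOVAL_RATE)
delayed = simulate_seir(tau=64.0, beta0=BETA0, beta1=BETA1, removal_rate=REMOVAL_RATE)
print(f"final size, control at day 50: {on_time[4]:.1f}")
print(f"final size, control at day 64: {delayed[4]:.1f}")
```

Comparing the two runs shows qualitatively how a two-week delay in interventions enlarges the final epidemic size; the specific ratio depends entirely on the hypothetical parameters above.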
Submitted 1 March, 2005;
originally announced March 2005.
-
Gradient Networks
Authors:
Zoltan Toroczkai,
Balazs Kozma,
Kevin E. Bassler,
N. W. Hengartner,
G. Korniss
Abstract:
We define gradient networks as directed graphs formed by local gradients of a scalar field distributed on the nodes of a substrate network G. We derive an exact expression for the in-degree distribution of the gradient network when the substrate is a binomial (Erdős-Rényi) random graph, G(N,p). Using this expression we show that the in-degree distribution R(l) of gradient graphs on G(N,p) obeys the power law R(l)~1/l for arbitrary, i.i.d. random scalar fields. We then relate gradient graphs to congestion tendency in network flows and show that while random graphs become maximally congested in the large network size limit, scale-free networks do not, forming fairly efficient substrates for transport. Combining this with other constraints, such as uniform edge cost, we obtain a plausible argument, in the form of a selection principle, for why a number of spontaneously evolved massive networks are scale-free. This paper also presents detailed derivations of the results recently reported in Nature, vol. 428, pp. 716 (2004).
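The construction described in this abstract can be sketched in a few lines: build a G(N,p) substrate, assign i.i.d. random scalars to the nodes, and let each node point to the node with the largest scalar among itself and its neighbors. The sketch is an illustration under assumptions, not the authors' code: the function name is hypothetical, and the convention that a local maximum points to itself is assumed here.

```python
import random
from collections import Counter


def gradient_network_in_degrees(n: int, p: float, rng: random.Random):
    """Return the in-degree of every node in the gradient network on G(n, p)."""
    # Substrate: binomial random graph, each pair linked with probability p.
    neighbors = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                neighbors[u].add(v)
                neighbors[v].add(u)
    # i.i.d. uniform scalar field on the nodes.
    field = [rng.random() for _ in range(n)]
    in_degree = [0] * n
    for u in range(n):
        # Gradient link: toward the largest scalar in the closed neighborhood
        # (a local maximum points to itself, by assumption).
        candidates = neighbors[u] | {u}
        target = max(candidates, key=lambda w: field[w])
        in_degree[target] += 1
    return in_degree


rng = random.Random(0)
degrees = gradient_network_in_degrees(n=500, p=0.05, rng=rng)
histogram = Counter(degrees)
for l in sorted(histogram)[:10]:
    print(f"in-degree {l}: {histogram[l]} nodes")
```

Each node has exactly one outgoing gradient link, so the in-degrees always sum to N; the printed histogram gives an empirical in-degree distribution to compare against the predicted 1/l behavior.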
Submitted 12 August, 2004;
originally announced August 2004.