-
Perfecting Liquid-State Theories with Machine Intelligence
Authors:
Jianzhong Wu,
Mengyang Gu
Abstract:
Recent years have seen a significant increase in the use of machine intelligence for predicting electronic structure, molecular force fields, and the physicochemical properties of various condensed systems. However, substantial challenges remain in developing a comprehensive framework capable of handling a wide range of atomic compositions and thermodynamic conditions. This perspective discusses potential future developments in liquid-state theories leveraging recent advances in functional machine learning. By harnessing the strengths of theoretical analysis and machine learning techniques including surrogate models, dimension reduction and uncertainty quantification, we envision that liquid-state theories will gain significant improvements in accuracy, scalability and computational efficiency, enabling their broader applications across diverse materials and chemical systems.
Submitted 9 November, 2023;
originally announced November 2023.
-
Sequential Kalman filter for fast online changepoint detection in longitudinal health records
Authors:
Hanmo Li,
Yuedong Wang,
Mengyang Gu
Abstract:
This article introduces the sequential Kalman filter, a computationally scalable approach for online changepoint detection with temporally correlated data. The temporal correlation was not considered in the Bayesian online changepoint detection approach due to the large computational cost. Motivated by detecting COVID-19 infections for dialysis patients from massive longitudinal health records with a large number of covariates, we develop a scalable approach to detect multiple changepoints from correlated data by sequentially stitching Kalman filters of subsequences to compute the joint distribution of the observations, which has linear computational complexity with respect to the number of observations between the last detected changepoint and the current observation at each time point, without approximating the likelihood function. Compared to other online changepoint detection methods, simulated experiments show that our approach is more precise in detecting single or multiple changes in mean, variance, or correlation for temporally correlated data. Furthermore, we propose a new way to integrate classification and changepoint detection approaches that improve the detection delay and accuracy for detecting COVID-19 infection compared to other alternatives.
Submitted 1 January, 2024; v1 submitted 28 October, 2023;
originally announced October 2023.
-
Analyzing Disparity and Temporal Progression of Internet Quality through Crowdsourced Measurements with Bias-Correction
Authors:
Hyeongseong Lee,
Udit Paul,
Arpit Gupta,
Elizabeth Belding,
Mengyang Gu
Abstract:
Crowdsourced speedtest measurements are an important tool for studying internet performance from the end user perspective. Nevertheless, despite the accuracy of individual measurements, simplistic aggregation of these data points is problematic due to their intrinsic sampling bias. In this work, we utilize a dataset of nearly 1 million individual Ookla Speedtest measurements, correlate each datapoint with 2019 Census demographic data, and develop new methods to present a novel analysis to quantify regional sampling bias and the relationship of internet performance to demographic profile. We find that the crowdsourced Ookla Speedtest data points contain significant sampling bias across different census block groups based on a statistical test of homogeneity. We introduce two methods to correct the regional bias by the population of each census block group. Whereas the sampling bias leads to a small discrepancy in the overall cumulative distribution function of internet speed in a city between estimation from original samples and bias-corrected estimation, the discrepancy is much smaller compared to the size of the sampling heterogeneity across regions. Further, we show that the sampling bias is strongly associated with a few demographic variables, such as income, education level, age, and ethnic distribution. Through regression analysis, we find that regions with higher income, younger populations, and lower representation of Hispanic residents tend to measure faster internet speeds along with substantial collinearity amongst socioeconomic attributes and ethnic composition. Finally, we find that average internet speed increases over time based on both linear and nonlinear analysis from state space models, though the regional sampling bias may result in a small overestimation of the temporal increase of internet speed.
Submitted 7 December, 2023; v1 submitted 24 October, 2023;
originally announced October 2023.
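The abstract above does not spell out its two bias-correction methods. As a rough illustration of the general idea only, the sketch below reweights per-region sample means by each region's population share (a standard post-stratification estimator). The function names, the region labels, and the numbers are hypothetical and are not taken from the paper.

```python
# Minimal sketch of population-weighted bias correction (post-stratification).
# Assumption: each region's sample mean is reweighted by that region's share
# of the total population; this is NOT claimed to be the paper's exact method.

def naive_mean(samples_by_region):
    """Pool all measurements, ignoring which region they came from."""
    values = [v for vals in samples_by_region.values() for v in vals]
    return sum(values) / len(values)

def population_weighted_mean(samples_by_region, population):
    """Weight each region's sample mean by its population share."""
    total_pop = sum(population[r] for r in samples_by_region)
    estimate = 0.0
    for region, vals in samples_by_region.items():
        region_mean = sum(vals) / len(vals)
        estimate += (population[region] / total_pop) * region_mean
    return estimate

# Hypothetical data: region "A" is heavily over-sampled relative to its population.
speeds_mbps = {"A": [100.0, 120.0, 110.0, 90.0], "B": [20.0]}
population = {"A": 1000, "B": 3000}

pooled = naive_mean(speeds_mbps)
corrected = population_weighted_mean(speeds_mbps, population)
print(pooled, corrected)
```

In this toy case the pooled mean is dominated by the over-sampled region, while the weighted estimate shifts toward the under-sampled but more populous one.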
-
Ab initio uncertainty quantification in scattering analysis of microscopy
Authors:
Mengyang Gu,
Yue He,
Xubo Liu,
Yimin Luo
Abstract:
Estimating parameters from data is a fundamental problem in physics, customarily done by minimizing a loss function between a model and observed statistics. In scattering-based analysis, researchers often employ their domain expertise to select a specific range of wavevectors for analysis, a choice that can vary depending on the specific case. We introduce another paradigm that defines a probabilistic generative model from the beginning of data processing and propagates the uncertainty for parameter estimation, termed ab initio uncertainty quantification (AIUQ). As an illustrative example, we demonstrate this approach with differential dynamic microscopy (DDM) that extracts dynamical information through Fourier analysis at a selected range of wavevectors. We first show that DDM is equivalent to fitting a temporal variogram in the reciprocal space using a latent factor model as the generative model. Then we derive the maximum marginal likelihood estimator, which optimally weighs information at all wavevectors, therefore eliminating the need to select the range of wavevectors. Furthermore, we substantially reduce the computational cost by utilizing the generalized Schur algorithm for Toeplitz covariances without approximation. Simulated studies validate that AIUQ significantly improves estimation accuracy and enables model selection with automated analysis. The utility of AIUQ is also demonstrated by three distinct sets of experiments: first in an isotropic Newtonian fluid, pushing limits of optically dense systems compared to multiple particle tracking; next in a system undergoing a sol-gel transition, automating the determination of gelling points and critical exponent; and lastly, in discerning anisotropic diffusive behavior of colloids in a liquid crystal. These outcomes collectively underscore AIUQ's versatility to capture system dynamics in an efficient and automated manner.
Submitted 19 February, 2024; v1 submitted 5 September, 2023;
originally announced September 2023.
-
Probabilistic forecast of nonlinear dynamical systems with uncertainty quantification
Authors:
Mengyang Gu,
Yizi Lin,
Victor Chang Lee,
Diana Qiu
Abstract:
Data-driven modeling is useful for reconstructing nonlinear dynamical systems when the underlying process is unknown or too expensive to compute. Having reliable uncertainty assessment of the forecast enables tools to be deployed to predict new scenarios unobserved before. In this work, we first extend parallel partial Gaussian processes for predicting the vector-valued transition function that links the observations between the current and next time points, and quantify the uncertainty of predictions by posterior sampling. Second, we show the equivalence between the dynamic mode decomposition and the maximum likelihood estimator of the linear mapping matrix in the linear state space model. The connection provides a probabilistic generative model of dynamic mode decomposition and thus, uncertainty of predictions can be obtained. Furthermore, we draw close connections between different data-driven models for approximating nonlinear dynamics, through a unified view of generative models. We study two numerical examples, where the inputs of the dynamics are assumed to be known in the first example and the inputs are unknown in the second example. The examples indicate that uncertainty of forecast can be properly quantified, whereas model or input misspecification can degrade the accuracy of uncertainty quantification.
Submitted 30 October, 2023; v1 submitted 15 May, 2023;
originally announced May 2023.
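The link between dynamic mode decomposition and maximum likelihood mentioned in the abstract above can be illustrated in a simplified setting. The sketch assumes a model x_{t+1} = A x_t + noise with isotropic Gaussian noise and no rank truncation, in which case the maximum likelihood estimate of A is the least-squares solution Y X^+. This setup, the helper names `simulate` and `fit_linear_map`, and the matrix `A_true` are illustrative assumptions rather than the paper's formulation.

```python
import numpy as np

def simulate(A, x0, steps, noise_sd=0.0, seed=0):
    """Generate snapshots x_0..x_steps of x_{t+1} = A x_t + noise (columns are time points)."""
    rng = np.random.default_rng(seed)
    states = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        noise = noise_sd * rng.standard_normal(len(x0))
        states.append(A @ states[-1] + noise)
    return np.stack(states, axis=1)

def fit_linear_map(snapshots):
    """Least-squares estimate of A from consecutive snapshot pairs.

    Under isotropic Gaussian noise this coincides with the maximum likelihood
    estimate; it is also the untruncated DMD operator Y X^+.
    """
    X = snapshots[:, :-1]  # states at times 0..T-1
    Y = snapshots[:, 1:]   # states at times 1..T
    return Y @ np.linalg.pinv(X)

# Hypothetical 2-D damped rotation used only for illustration.
A_true = np.array([[0.9, -0.2],
                   [0.2, 0.9]])
snapshots = simulate(A_true, [1.0, 0.0], steps=50)
A_hat = fit_linear_map(snapshots)
dmd_eigenvalues = np.linalg.eigvals(A_hat)
print(A_hat)
print(dmd_eigenvalues)
```

With noise-free data the estimate recovers `A_true` up to floating-point error; setting `noise_sd` above zero gives a noisy estimate whose sampling variability is what a generative model would let one quantify.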
-
A Nonparametric Mixed-Effects Mixture Model for Patterns of Clinical Measurements Associated with COVID-19
Authors:
Xiaoran Ma,
Wensheng Guo,
Mengyang Gu,
Len Usvyat,
Peter Kotanko,
Yuedong Wang
Abstract:
Some patients with COVID-19 show changes in signs and symptoms such as temperature and oxygen saturation days before being positively tested for SARS-CoV-2, while others remain asymptomatic. It is important to identify these subgroups and to understand what biological and clinical predictors are related to these subgroups. This information will provide insights into how the immune system may respond differently to infection and can further be used to identify infected individuals. We propose a flexible nonparametric mixed-effects mixture model that identifies risk factors and classifies patients with biological changes. We model the latent probability of biological changes using a logistic regression model and trajectories in the latent groups using smoothing splines. We developed an EM algorithm to maximize the penalized likelihood for estimating all parameters and mean functions. We evaluate our methods by simulations and apply the proposed model to investigate changes in temperature in a cohort of COVID-19-infected hemodialysis patients.
Submitted 31 May, 2024; v1 submitted 6 May, 2023;
originally announced May 2023.
-
Data-driven model construction for anisotropic dynamics of active matter
Authors:
Mengyang Gu,
Xinyi Fang,
Yimin Luo
Abstract:
The dynamics of cellular pattern formation is crucial for understanding embryonic development and tissue morphogenesis. Recent studies have shown that human dermal fibroblasts cultured on liquid crystal elastomers can exhibit an increase in orientational alignment over time, accompanied by cell proliferation, under the influence of the weak guidance of a molecularly aligned substrate. However, a comprehensive understanding of how this order arises remains largely unknown. This knowledge gap may be attributed, in part, to a scarcity of mechanistic models that can capture the temporal progression of the complex nonequilibrium dynamics during the cellular alignment process. The orientational alignment occurs primarily when cells reach a high density near confluence. Therefore, for accurate modeling, it is crucial to take into account both the cell-cell interaction term and the influence from the substrate, acting as a one-body external potential term. To fill in this gap, we develop a hybrid procedure that utilizes statistical learning approaches to extend the state-of-the-art physics models for quantifying both effects. We develop a more efficient way to perform feature selection that avoids testing all feature combinations through simulation. The maximum likelihood estimator of the model was derived and implemented in computationally scalable algorithms for model calibration and simulation. By including these features, such as the non-Gaussian, anisotropic fluctuations, and limiting alignment interaction only to neighboring cells with the same velocity direction, this model quantitatively reproduces the key system-level parameters--the temporal progression of the velocity orientational order parameters and the variability of velocity vectors, whereas models missing any of the features fail to capture these temporally dependent parameters.
Submitted 23 August, 2023; v1 submitted 6 March, 2023;
originally announced March 2023.
-
Reliable emulation of complex functionals by active learning with error control
Authors:
Xinyi Fang,
Mengyang Gu,
Jianzhong Wu
Abstract:
A statistical emulator can be used as a surrogate of complex physics-based calculations to drastically reduce the computational cost. Its successful implementation hinges on an accurate representation of the nonlinear response surface with a high-dimensional input space. Conventional "space-filling" designs, including random sampling and Latin hypercube sampling, become inefficient as the dimensionality of the input variables increases, and the predictive accuracy of the emulator can degrade substantially for a test input distant from the training input set. To address this fundamental challenge, we develop a reliable emulator for predicting complex functionals by active learning with error control (ALEC). The algorithm is applicable to infinite-dimensional mapping with high-fidelity predictions and a controlled predictive error. The computational efficiency has been demonstrated by emulating the classical density functional theory (cDFT) calculations, a statistical-mechanical method widely used in modeling the equilibrium properties of complex molecular systems. We show that ALEC is much more accurate than conventional emulators based on the Gaussian processes with "space-filling" designs and alternative active learning methods. Besides, it is computationally more efficient than direct cDFT calculations. ALEC can be a reliable building block for emulating expensive functionals owing to its minimal computational cost, controllable predictive error, and fully automatic features.
Submitted 30 January, 2024; v1 submitted 13 August, 2022;
originally announced August 2022.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza
, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
Scalable marginalization of correlated latent variables with applications to learning particle interaction kernels
Authors:
Mengyang Gu,
Xubo Liu,
Xinyi Fang,
Sui Tang
Abstract:
Marginalization of latent variables or nuisance parameters is a fundamental aspect of Bayesian inference and uncertainty quantification. In this work, we focus on scalable marginalization of latent variables in modeling correlated data, such as spatio-temporal or functional observations. We first introduce Gaussian processes (GPs) for modeling correlated data and highlight the computational challenge, where the computational complexity increases cubically fast along with the number of observations. We then review the connection between the state space model and GPs with Matérn covariance for temporal inputs. The Kalman filter and Rauch-Tung-Striebel smoother were introduced as a scalable marginalization technique for computing the likelihood and making predictions of GPs without approximation. We then introduce recent efforts on extending the scalable marginalization idea to the linear model of coregionalization for multivariate correlated output and spatio-temporal observations. In the final part of this work, we introduce a novel marginalization technique to estimate interaction kernels and forecast particle trajectories. The achievement lies in the sparse representation of covariance function, then applying conjugate gradient for solving the computational challenges and improving predictive accuracy. The computational advances achieved in this work outline a wide range of applications in molecular dynamic simulation, cellular migration, and agent-based models.
Submitted 9 October, 2022; v1 submitted 16 March, 2022;
originally announced March 2022.
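The state-space connection described in the abstract above can be checked numerically in the simplest case. The sketch below uses the exponential kernel (Matérn with smoothness 1/2), whose state-space form is a scalar Ornstein-Uhlenbeck process, and compares the linear-time Kalman filter log-likelihood with a dense cubic-time Cholesky computation. The function names and test values are illustrative assumptions, and higher-order Matérn kernels would need a vector-valued state not shown here.

```python
import math

def kalman_loglik(times, y, sigma2, ell, tau2):
    """O(n) log-likelihood of y_i = f(t_i) + noise, with f a GP with kernel
    sigma2 * exp(-|t - t'| / ell) and independent N(0, tau2) noise."""
    mean, var = 0.0, sigma2  # stationary prior on the latent state
    loglik = 0.0
    for i, (t, obs) in enumerate(zip(times, y)):
        if i > 0:
            rho = math.exp(-(t - times[i - 1]) / ell)  # one-step correlation
            mean = rho * mean
            var = rho * rho * var + sigma2 * (1.0 - rho * rho)
        innov_var = var + tau2
        innov = obs - mean
        loglik += -0.5 * (math.log(2.0 * math.pi * innov_var) + innov ** 2 / innov_var)
        gain = var / innov_var
        mean += gain * innov
        var *= (1.0 - gain)
    return loglik

def dense_loglik(times, y, sigma2, ell, tau2):
    """O(n^3) reference: Cholesky factorization of the full covariance matrix."""
    n = len(times)
    K = [[sigma2 * math.exp(-abs(times[i] - times[j]) / ell) + (tau2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    z = []  # forward substitution: solve L z = y
    for i in range(n):
        z.append((y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (sum(v * v for v in z) + logdet + n * math.log(2.0 * math.pi))

# Hypothetical irregularly spaced observations.
times = [0.0, 0.4, 1.1, 1.5, 3.0]
y = [0.3, -0.1, 0.8, 0.5, -1.2]
fast = kalman_loglik(times, y, sigma2=1.5, ell=0.7, tau2=0.1)
slow = dense_loglik(times, y, sigma2=1.5, ell=0.7, tau2=0.1)
print(fast, slow)
```

The two values agree to floating-point precision because the filter computes the same Gaussian likelihood through its one-step-ahead predictive factorization rather than approximating it.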
-
RobustCalibration: Robust Calibration of Computer Models in R
Authors:
Mengyang Gu
Abstract:
Two fundamental research tasks in science and engineering are forward predictions and data inversion. This article introduces a recent R package RobustCalibration for Bayesian data inversion and model calibration by experiments and field observations. Mathematical models for forward predictions are often written in computer code, and they can be computationally expensive and slow to run. To overcome the computational bottleneck from the simulator, we implemented a statistical emulator from the RobustGaSP package for emulating either scalar-valued or vector-valued computer model outputs. Both posterior sampling and maximum likelihood approaches are implemented in the RobustCalibration package for parameter estimation. For imperfect computer models, we implement the Gaussian stochastic process and the scaled Gaussian stochastic process for modeling the discrepancy function between the reality and mathematical model. This package is applicable to various types of field observations, such as repeated experiments and multiple sources of measurements. We discuss numerical examples of calibrating mathematical models that have closed-form expressions, and differential equations solved by numerical methods.
Submitted 18 February, 2024; v1 submitted 5 January, 2022;
originally announced January 2022.
-
Efficient force field and energy emulation through partition of permutationally equivalent atoms
Authors:
Hao Li,
Musen Zhou,
Jessalyn Sebastian,
Jianzhong Wu,
Mengyang Gu
Abstract:
Gaussian process (GP) emulator has been used as a surrogate model for predicting force field and molecular potential, to overcome the computational bottleneck of molecular dynamics simulation. Integrating both atomic force and energy in predictions was found to be more accurate than using energy alone, yet it requires $O((NM)^3)$ computational operations for computing the likelihood function and making predictions, where $N$ is the number of atoms and $M$ is the number of simulated configurations in the training sample, due to the inversion of a large covariance matrix. The large computational need limits its applications to emulating simulation of small molecules. The computational challenge of using both gradient information and function values in GPs was recently noticed in statistics and machine learning communities, where conventional approximation methods, such as the low rank decomposition or sparse approximation, may not work well. Here we introduce a new approach, the atomized force field (AFF) model, that integrates both force and energy in the emulator with many fewer computational operations. The drastic reduction on computation is achieved by utilizing the naturally sparse structure of the covariance satisfying the constraints of the energy conservation and permutation symmetry of atoms. The efficient machine learning algorithm extends the limits of its applications on larger molecules under the same computational budget, with nearly no loss of predictive accuracy. Furthermore, our approach contains uncertainty assessment of predictions of atomic forces and potentials, useful for developing a sequential design over the chemical input space, with almost no increase in computational cost.
Submitted 9 May, 2022; v1 submitted 13 August, 2021;
originally announced August 2021.
-
Uncertainty quantification and estimation in differential dynamic microscopy
Authors:
Mengyang Gu,
Yimin Luo,
Yue He,
Matthew E. Helgeson,
Megan T. Valentine
Abstract:
Differential dynamic microscopy (DDM) is a form of video image analysis that combines the sensitivity of scattering and the direct visualization benefits of microscopy. DDM is broadly useful in determining dynamical properties including the intermediate scattering function for many spatiotemporally correlated systems. Despite its straightforward analysis, DDM has not been fully adopted as a routine characterization tool, largely due to computational cost and lack of algorithmic robustness. We present statistical analysis that quantifies the noise, reduces the computational order and enhances the robustness of DDM analysis. We propagate the image noise through the Fourier analysis, which allows us to comprehensively study the bias in different estimators of model parameters, and we derive a different way to detect whether the bias is negligible. Furthermore, through use of Gaussian process regression (GPR), we find that predictive samples of the image structure function require only around 0.5%-5% of the Fourier transforms of the observed quantities. This vastly reduces computational cost, while preserving information of the quantities of interest, such as quantiles of the image scattering function, for subsequent analysis. The approach, which we call DDM with uncertainty quantification (DDM-UQ), is validated using both simulations and experiments with respect to accuracy and computational efficiency, as compared with conventional DDM and multiple particle tracking. Overall, we propose that DDM-UQ lays the foundation for important new applications of DDM, as well as to high-throughput characterization. We implement the fast computation tool in a new, publicly available MATLAB software package.
Submitted 11 April, 2022; v1 submitted 3 May, 2021;
originally announced May 2021.
-
Gaussian orthogonal latent factor processes for large incomplete matrices of correlated data
Authors:
Mengyang Gu,
Hanmo Li
Abstract:
We introduce Gaussian orthogonal latent factor processes for modeling and predicting large correlated data. To handle the computational challenge, we first decompose the likelihood function of the Gaussian random field with a multi-dimensional input domain into a product of densities at the orthogonal components with lower-dimensional inputs. The continuous-time Kalman filter is implemented to compute the likelihood function efficiently without making approximations. We also show that the posterior distribution of the factor processes is independent, as a consequence of prior independence of factor processes and orthogonal factor loading matrix. For studies with large sample sizes, we propose a flexible way to model the mean, and we derive the marginal posterior distribution to solve identifiability issues in sampling these parameters. Both simulated and real data applications confirm the outstanding performance of this method.
Submitted 26 November, 2021; v1 submitted 21 November, 2020;
originally announced November 2020.
-
Robust estimation of SARS-CoV-2 epidemic in US counties
Authors:
Hanmo Li,
Mengyang Gu
Abstract:
The COVID-19 outbreak is asynchronous in US counties. Mitigating the COVID-19 transmission requires not only the state and federal level order of protective measures such as social distancing and testing, but also public awareness of time-dependent risk and reactions at county and community levels. We propose a robust approach to estimate the heterogeneous progression of SARS-CoV-2 at all US counties having no less than 2 COVID-19 associated deaths, and we use the daily probability of contracting (PoC) SARS-CoV-2 for a susceptible individual to quantify the risk of SARS-CoV-2 transmission in a community. We found that shortening by $5\%$ of the infectious period of SARS-CoV-2 can reduce around $39\%$ (or $78$K, $95\%$ CI: $[66$K $, 89$K $]$) of the COVID-19 associated deaths in the US as of 20 September 2020. Our findings also indicate that reducing infection and deaths by a shortened infectious period is more pronounced for areas with the effective reproduction number close to 1, suggesting that testing should be used along with other mitigation measures, such as social distancing and facial mask-wearing, to reduce the transmission rate. Our deliverable includes a dynamic county-level map for local officials to determine optimal policy responses and for the public to better understand the risk of contracting SARS-CoV-2 on each day.
△ Less
Submitted 29 April, 2021; v1 submitted 22 October, 2020;
originally announced October 2020.
-
Emulating the First Principles of Matter: A Probabilistic Roadmap
Authors:
Jianzhong Wu,
Mengyang Gu
Abstract:
This chapter provides a tutorial overview of first principles methods to describe the properties of matter at the ground state or equilibrium. It begins with a brief introduction to quantum and statistical mechanics for predicting the electronic structure and diverse static properties of many-particle systems useful for practical applications. Pedagogical examples are given to illustrate the basic concepts and simple applications of quantum Monte Carlo and density functional theory -- two representative methods commonly used in the literature of first principles modeling. In addition, this chapter highlights the practical needs for the integration of physics-based modeling and data-science approaches to reduce the computational cost and expand the scope of applicability. A special emphasis is placed on recent developments of statistical surrogate models to emulate first principles calculation from a probabilistic point of view. The probabilistic approach provides an internal assessment of the approximation accuracy of emulation that quantifies the uncertainty in predictions. Various recent advances toward this direction establish a new marriage between Gaussian processes and first principles calculation, with physical properties, such as translational, rotational, and permutation symmetry, naturally encoded in new kernel functions. Finally, it concludes with some prospects on future advances in the field toward faster yet more accurate computation leveraging a synergetic combination of novel theoretical concepts and efficient numerical algorithms.
Submitted 30 September, 2020;
originally announced October 2020.
-
Thermodynamic Machine Learning through Maximum Work Production
Authors:
A. B. Boyd,
J. P. Crutchfield,
M. Gu
Abstract:
Adaptive systems -- such as a biological organism gaining survival advantage, an autonomous robot executing a functional task, or a motor protein transporting intracellular nutrients -- must model the regularities and stochasticity in their environments to take full advantage of thermodynamic resources. Analogously, but in a purely computational realm, machine learning algorithms estimate models to capture predictable structure and identify irrelevant noise in training data. This happens through optimization of performance metrics, such as model likelihood. If physically implemented, is there a sense in which computational models estimated through machine learning are physically preferred? We introduce the thermodynamic principle that work production is the most relevant performance metric for an adaptive physical agent and compare the results to the maximum-likelihood principle that guides machine learning. Within the class of physical agents that most efficiently harvest energy from their environment, we demonstrate that an efficient agent's model explicitly determines its architecture and how much useful work it harvests from the environment. We then show that selecting the maximum-work agent for given environmental data corresponds to finding the maximum-likelihood model. This establishes an equivalence between nonequilibrium thermodynamics and dynamic learning. In this way, work maximization emerges as an organizing principle that underlies learning in adaptive thermodynamic systems.
Submitted 12 April, 2021; v1 submitted 27 June, 2020;
originally announced June 2020.
-
Boosting on the shoulders of giants in quantum device calibration
Authors:
Alex Wozniakowski,
Jayne Thompson,
Mile Gu,
Felix Binder
Abstract:
Traditional machine learning applications, such as optical character recognition, arose from the inability to explicitly program a computer to perform a routine task. In this context, learning algorithms usually derive a model exclusively from the evidence present in a massive dataset. Yet in some scientific disciplines, obtaining an abundance of data is an impractical luxury; however, there is an explicit model of the domain based upon previous scientific discoveries. Here we introduce a new approach to machine learning that is able to leverage prior scientific discoveries in order to improve generalizability over a scientific model. We show its efficacy in predicting the entire energy spectrum of a Hamiltonian on a superconducting quantum device, a key task in present quantum computer calibration. Our accuracy surpasses the current state-of-the-art by over $20\%$. Our approach thus demonstrates how artificial intelligence can be further enhanced by "standing on the shoulders of giants."
Submitted 13 May, 2020;
originally announced May 2020.
-
Fast Low-rank Metric Learning for Large-scale and High-dimensional Data
Authors:
Han Liu,
Zhizhong Han,
Yu-Shen Liu,
Ming Gu
Abstract:
Low-rank metric learning aims to learn better discrimination of data subject to low-rank constraints. It keeps the intrinsic low-rank structure of datasets and reduces the time cost and memory usage in metric learning. However, it is still a challenge for current methods to handle datasets with both high dimensions and large numbers of samples. To address this issue, we present a novel fast low-rank metric learning (FLRML) method. FLRML casts the low-rank metric learning problem into an unconstrained optimization on the Stiefel manifold, which can be efficiently solved by searching along the descent curves of the manifold. FLRML significantly reduces the complexity and memory usage in optimization, which makes the method scalable to both high dimensions and large numbers of samples. Furthermore, we introduce a mini-batch version of FLRML to make the method scalable to larger datasets which are hard to load and decompose in limited memory. Experimental results show that our method achieves high accuracy and is much faster than the state-of-the-art methods on several benchmarks with large numbers of high-dimensional data. Code has been made available at https://github.com/highan911/FLRML
Submitted 13 September, 2019;
originally announced September 2019.
-
Variational Langevin Hamiltonian Monte Carlo for Distant Multi-modal Sampling
Authors:
Minghao Gu,
Shiliang Sun
Abstract:
The Hamiltonian Monte Carlo (HMC) sampling algorithm exploits Hamiltonian dynamics to construct efficient Markov Chain Monte Carlo (MCMC), which has become increasingly popular in machine learning and statistics. Since HMC uses the gradient information of the target distribution, it can explore the state space much more efficiently than random-walk proposals. However, probabilistic inference involving multi-modal distributions is very difficult for the standard HMC method, especially when the modes are far away from each other. Sampling algorithms are then often incapable of traveling across regions of low probability. In this paper, we propose a novel MCMC algorithm which aims to sample from multi-modal distributions effectively. The method improves Hamiltonian dynamics to reduce the autocorrelation of the samples and uses a variational distribution to explore the phase space and find new modes. A formal proof is provided which shows that the proposed method can converge to target distributions. Both synthetic and real datasets are used to evaluate its properties and performance. The experimental results verify the theory and show superior performance in multi-modal sampling.
Submitted 1 June, 2019;
originally announced June 2019.
-
Surveying structural complexity in quantum many-body systems
Authors:
Whei Yeap Suen,
Thomas J. Elliott,
Jayne Thompson,
Andrew J. P. Garner,
John R. Mahoney,
Vlatko Vedral,
Mile Gu
Abstract:
Quantum many-body systems exhibit a rich and diverse range of exotic behaviours, owing to their underlying non-classical structure. These systems present a deep structure beyond those that can be captured by measures of correlation and entanglement alone. Using tools from complexity science, we characterise such structure. We investigate the structural complexities that can be found within the patterns that manifest from the observational data of these systems. In particular, using two prototypical quantum many-body systems as test cases - the one-dimensional quantum Ising and Bose-Hubbard models - we explore how different information-theoretic measures of complexity are able to identify different features of such patterns. This work furthers the understanding of fully-quantum notions of structure and complexity in quantum systems and dynamics.
Submitted 18 March, 2022; v1 submitted 23 December, 2018;
originally announced December 2018.
-
Calibration of imperfect geophysical models by multiple satellite interferograms with measurement bias
Authors:
Mengyang Gu,
Kyle Anderson,
Erika McPhillips
Abstract:
Model calibration consists of using experimental or field data to estimate the unknown parameters of a mathematical model. The presence of model discrepancy and measurement bias in the data complicates this task. Satellite interferograms, for instance, are widely used for calibrating geophysical models in geological hazard quantification. In this work, we used satellite interferograms to relate ground deformation observations to the properties of the magma chamber at Kīlauea Volcano in Hawaiʻi. We derived closed-form marginal likelihoods and implemented posterior sampling procedures that simultaneously estimate the model discrepancy of physical models and the measurement bias from the atmospheric error in satellite interferograms. We found that model calibration by aggregating multiple interferograms and downsampling the pixels in the interferograms can reduce the computational complexity compared to calibration approaches based on multiple data sets. The conditions that lead to no loss of information from data aggregation and downsampling are studied. Simulation illustrates that both discrepancy and measurement bias can be estimated, and real applications demonstrate that modeling both effects helps obtain a reliable estimation of a physical model's unobserved parameters and enhance its predictive accuracy. We implement the computational tools in the RobustCalibration package available on CRAN.
Submitted 25 February, 2023; v1 submitted 27 October, 2018;
originally announced October 2018.
-
Generalized probabilistic principal component analysis of correlated data
Authors:
Mengyang Gu,
Weining Shen
Abstract:
Principal component analysis (PCA) is a well-established tool in machine learning and data processing. The principal axes in PCA were shown to be equivalent to the maximum marginal likelihood estimator of the factor loading matrix in a latent factor model for the observed data, assuming that the latent factors are independently distributed as standard normal distributions. However, the independence assumption may be unrealistic for many scenarios such as modeling multiple time series, spatial processes, and functional data, where the outcomes are correlated. In this paper, we introduce the generalized probabilistic principal component analysis (GPPCA) to study the latent factor model for multiple correlated outcomes, where each factor is modeled by a Gaussian process. Our method generalizes the previous probabilistic formulation of PCA (PPCA) by providing the closed-form maximum marginal likelihood estimator of the factor loadings and other parameters. Based on the explicit expression of the precision matrix in the marginal likelihood that we derived, the number of computational operations is linear in the number of output variables. Furthermore, we also provide the closed-form expression of the marginal likelihood when other covariates are included in the mean structure. We highlight the advantage of GPPCA in terms of practical relevance, estimation accuracy and computational convenience. Numerical studies of simulated and real data confirm the excellent finite-sample performance of the proposed approach.
Submitted 23 October, 2019; v1 submitted 31 August, 2018;
originally announced August 2018.
-
Nonparametric estimation of utility functions
Authors:
Mengyang Gu,
Debarun Bhattacharjya,
Dharmashankar Subramanian
Abstract:
Inferring a decision maker's utility function typically involves an elicitation phase where the decision maker responds to a series of elicitation queries, followed by an estimation phase where the state-of-the-art is to either fit the response data to a parametric form (such as the exponential or power function) or perform linear interpolation. We introduce a Bayesian nonparametric method involving Gaussian stochastic processes for estimating a utility function. Advantages include the flexibility to fit a large class of functions, favorable theoretical properties, and a fully probabilistic view of the decision maker's preference properties including risk attitude. Using extensive simulation experiments as well as two real datasets from the literature, we demonstrate that the proposed approach yields estimates with lower mean squared errors. While our focus is primarily on single-attribute utility functions, one of the real datasets involves three attributes; the results indicate that nonparametric methods also seem promising for multi-attribute utility function estimation.
Submitted 27 July, 2018;
originally announced July 2018.
-
Jointly Robust Prior for Gaussian Stochastic Process in Emulation, Calibration and Variable Selection
Authors:
Mengyang Gu
Abstract:
Gaussian stochastic process (GaSP) has been widely used in two fundamental problems in uncertainty quantification, namely the emulation and calibration of mathematical models. Some objective priors, such as the reference prior, are studied in the context of emulating (approximating) computationally expensive mathematical models. In this work, we introduce a new class of priors, called the jointly robust prior, for both the emulation and calibration. This prior is designed to maintain various advantages from the reference prior. In emulation, the jointly robust prior has an appropriate tail decay rate as the reference prior, and is computationally simpler than the reference prior in parameter estimation. Moreover, the marginal posterior mode estimation with the jointly robust prior can separate the influential and inert inputs in mathematical models, while the reference prior does not have this property. We establish the posterior propriety for a large class of priors in calibration, including the reference prior and jointly robust prior in general scenarios, but the jointly robust prior is preferred because the calibrated mathematical model typically predicts the reality well. The jointly robust prior is used as the default prior in two new R packages, called "RobustGaSP" and "RobustCalibration", available on CRAN for emulation and calibration, respectively.
Submitted 7 September, 2018; v1 submitted 24 April, 2018;
originally announced April 2018.
-
RobustGaSP: Robust Gaussian Stochastic Process Emulation in R
Authors:
Mengyang Gu,
Jesús Palomo,
James O. Berger
Abstract:
Gaussian stochastic process emulation is a powerful tool for approximating computationally intensive computer models. However, estimation of parameters in the GaSP emulator is a challenging task. No closed-form estimator is available and many numerical problems arise with standard estimates, e.g., the maximum likelihood estimator. In this package, we implement a marginal posterior mode estimator, for special priors and parameterizations, an estimation method that meets the robust parameter estimation criteria discussed in \cite{Gu2018robustness}; mathematical reasons are provided therein to explain why robust parameter estimation can greatly improve predictive performance of the emulator. In addition, inert inputs (inputs that have almost no effect on the variability of a function) can be identified from the marginal posterior mode estimation, at no extra computational cost. The package also implements the parallel partial Gaussian stochastic process (PP GaSP) emulator (\cite{gu2016parallel}) for the scenario where the computer model has multiple outputs on, e.g., spatial-temporal coordinates. The package can be operated in a default mode, but also allows numerous user specifications, such as the capability of specifying trend functions and noise terms. Examples are studied herein to highlight the performance of the package in terms of out-of-sample prediction.
Submitted 14 June, 2019; v1 submitted 5 January, 2018;
originally announced January 2018.
-
Fast Nonseparable Gaussian Stochastic Process with Application to Methylation Level Interpolation
Authors:
Mengyang Gu,
Yanxun Xu
Abstract:
Gaussian stochastic process (GaSP) has been widely used as a prior over functions due to its flexibility and tractability in modeling. However, the computational cost in evaluating the likelihood is $O(n^3)$, where $n$ is the number of observed points in the process, as it requires inverting the covariance matrix. This bottleneck prevents GaSP from being widely used in large-scale data. We propose a general class of nonseparable GaSP models for multiple functional observations with a fast and exact algorithm, in which the computation is linear ($O(n)$) and exact, requiring no approximation to compute the likelihood. We show that the commonly used linear regression and separable models are special cases of the proposed nonseparable GaSP model. Through the study of an epigenetic application, we show that the proposed nonseparable GaSP model can accurately predict the genome-wide DNA methylation levels and compares favorably to alternative methods, such as linear regression, random forest and localized Kriging method. The algorithm for fast computation is implemented in the FastGaSP R package on CRAN.
Submitted 22 November, 2021; v1 submitted 30 November, 2017;
originally announced November 2017.
-
Scaled Gaussian Stochastic Process for Computer Model Calibration and Prediction
Authors:
Mengyang Gu,
Long Wang
Abstract:
We consider the problem of calibrating an imperfect computer model using experimental data. To compensate for the misspecification of the computer model and make more accurate predictions, a discrepancy function is often included and modeled via a Gaussian stochastic process (GaSP). The calibrated computer model alone, however, sometimes fits the experimental data poorly, as the calibration parameters become unidentifiable. In this work, we propose the scaled Gaussian stochastic process (S-GaSP), a novel stochastic process that bridges the gap between two predominant methods, namely the $L_2$ calibration and the GaSP calibration. It is shown that our approach performs well in both calibration and prediction. A computationally feasible approach is introduced for this new model under the Bayesian paradigm. Compared with the GaSP calibration, the S-GaSP calibration enables the calibrated computer model itself to predict the reality well, based on the posterior distribution of the calibration parameters. Numerical comparisons of the simulated and real data are provided to illustrate the connections and differences between the proposed S-GaSP and other alternative approaches.
Submitted 3 May, 2018; v1 submitted 25 July, 2017;
originally announced July 2017.
-
Gaussian Elimination with Randomized Complete Pivoting
Authors:
Christopher Melgaard,
Ming Gu
Abstract:
Gaussian elimination with partial pivoting (GEPP) has long been among the most widely used methods for computing the LU factorization of a given matrix. However, this method is also known to fail for matrices that induce large element growth during the factorization process. In this paper, we propose a new scheme, Gaussian elimination with randomized complete pivoting (GERCP), for the efficient and reliable LU factorization of a given matrix. GERCP satisfies GECP (Gaussian elimination with complete pivoting) style element growth bounds with high probability, yet costs only marginally more than GEPP. Our numerical experimental results strongly suggest that GERCP is as reliable as GECP and as efficient as GEPP for computing the LU factorization.
Submitted 26 November, 2015;
originally announced November 2015.