-
Density Estimation via Binless Multidimensional Integration
Authors:
Matteo Carli,
Aldo Glielmo,
Alex Rodriguez,
Alessandro Laio
Abstract:
We introduce the Binless Multidimensional Thermodynamic Integration (BMTI) method for nonparametric, robust, and data-efficient density estimation. BMTI estimates the logarithm of the density by initially computing log-density differences between neighbouring data points. Subsequently, such differences are integrated, weighted by their associated uncertainties, using a maximum-likelihood formulation. This procedure can be seen as an extension to a multidimensional setting of the thermodynamic integration, a technique developed in statistical physics. The method leverages the manifold hypothesis, estimating quantities within the intrinsic data manifold without defining an explicit coordinate map. It does not rely on any binning or space partitioning, but rather on the construction of a neighbourhood graph based on an adaptive bandwidth selection procedure. BMTI mitigates the limitations commonly associated with traditional nonparametric density estimators, effectively reconstructing smooth profiles even in high-dimensional embedding spaces. The method is tested on a variety of complex synthetic high-dimensional datasets, where it is shown to outperform traditional estimators, and is benchmarked on realistic datasets from the chemical physics literature.
Submitted 10 July, 2024;
originally announced July 2024.
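The integration step described in this abstract (combining log-density differences between neighbouring points, weighted by their uncertainties, under a maximum-likelihood criterion) can be illustrated with a minimal sketch. It assumes independent Gaussian errors on each difference, so that maximum likelihood reduces to weighted least squares on the neighbourhood graph. The function name `integrate_log_density` and the toy graph are hypothetical; this is not the BMTI likelihood itself, nor code from the paper.

```python
import numpy as np

def integrate_log_density(n_points, edges, deltas, variances):
    """Recover log-densities, up to an additive constant, from pairwise differences.

    edges[k] = (i, j) means deltas[k] estimates log_rho[j] - log_rho[i],
    with (assumed Gaussian, independent) error variance variances[k].
    """
    edges = np.asarray(edges)
    weights = 1.0 / np.sqrt(np.asarray(variances, dtype=float))

    # Incidence matrix: one row per edge, -1 at the source and +1 at the target.
    incidence = np.zeros((len(edges), n_points))
    rows = np.arange(len(edges))
    incidence[rows, edges[:, 0]] = -1.0
    incidence[rows, edges[:, 1]] = 1.0

    # Weighted least squares is the Gaussian maximum-likelihood solution.
    solution, *_ = np.linalg.lstsq(
        incidence * weights[:, None],
        np.asarray(deltas, dtype=float) * weights,
        rcond=None,
    )
    # Differences fix the log-density only up to a constant: centre it at zero.
    return solution - solution.mean()

# Toy neighbourhood graph (hypothetical data): a 4-point cycle.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
true_log_rho = np.array([0.0, 1.0, 3.0, 2.0])
deltas = [true_log_rho[j] - true_log_rho[i] for i, j in edges]
variances = [0.1, 0.2, 0.1, 0.5]

estimate = integrate_log_density(4, edges, deltas, variances)
print(estimate)
```

With noise-free differences the estimate matches the true profile after centring; with noisy differences, edges with smaller variance pull the solution more strongly.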
-
Intrinsic dimension estimation for discrete metrics
Authors:
Iuri Macocco,
Aldo Glielmo,
Jacopo Grilli,
Alessandro Laio
Abstract:
Real-world datasets characterized by discrete features are ubiquitous: from categorical surveys to clinical questionnaires, from unweighted networks to DNA sequences. Nevertheless, the most common unsupervised dimensional reduction methods are designed for continuous spaces, and their use for discrete spaces can lead to errors and biases. In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces. We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting, finding a surprisingly small ID, of order 2. This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of the sequence space.
Submitted 12 March, 2023; v1 submitted 20 July, 2022;
originally announced July 2022.
-
DADApy: Distance-based Analysis of DAta-manifolds in Python
Authors:
Aldo Glielmo,
Iuri Macocco,
Diego Doimo,
Matteo Carli,
Claudio Zeni,
Romina Wild,
Maria d'Errico,
Alex Rodriguez,
Alessandro Laio
Abstract:
DADApy is a Python software package for analysing and characterising high-dimensional data manifolds. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics. We review the main functionalities of the package and exemplify its usage in toy cases and in a real-world application. DADApy is freely available under the open-source Apache 2.0 license.
Submitted 19 September, 2022; v1 submitted 4 May, 2022;
originally announced May 2022.
-
Exploring the robust extrapolation of high-dimensional machine learning potentials
Authors:
Claudio Zeni,
Andrea Anelli,
Aldo Glielmo,
Kevin Rossi
Abstract:
We show that, contrary to popular assumptions, predictions from machine learning potentials built upon high-dimensional atom-density representations almost exclusively occur in regions of the representation space which lie outside the convex hull defined by the training set points. We then propose a perspective to rationalize the domain of robust extrapolation and accurate prediction of atomistic machine learning potentials in terms of the probability density induced by training points in the representation space.
Submitted 22 April, 2022; v1 submitted 20 December, 2021;
originally announced December 2021.
-
Compact atomic descriptors enable accurate predictions via linear models
Authors:
Claudio Zeni,
Kevin Rossi,
Aldo Glielmo,
Stefano De Gironcoli
Abstract:
We probe the accuracy of linear ridge regression employing a three-body local density representation derived from the atomic cluster expansion. We benchmark the accuracy of this framework in the prediction of formation energies and atomic forces in molecules and solids. We find that such a simple regression framework performs on par with state-of-the-art machine learning methods which are, in most cases, more complex and more computationally demanding. Subsequently, we look for ways to sparsify the descriptor and further improve the computational efficiency of the method. To this aim, we use both principal component analysis and least absolute shrinkage and selection operator (LASSO) regression for energy fitting on six single-element datasets. Both methods highlight the possibility of constructing a descriptor that is four times smaller than the original with a similar or even improved accuracy. Furthermore, we find that the reduced descriptors share a sizable fraction of their features across the six independent datasets, hinting at the possibility of designing material-agnostic, optimally compressed, and accurate descriptors.
Submitted 22 April, 2022; v1 submitted 24 May, 2021;
originally announced May 2021.
-
A Bayesian Inference Framework for Compression and Prediction of Quantum States
Authors:
Yannic Rath,
Aldo Glielmo,
George H. Booth
Abstract:
The recently introduced Gaussian Process State (GPS) provides a highly flexible, compact and physically insightful representation of quantum many-body states based on ideas from the zoo of machine learning approaches. In this work, we give a comprehensive description of how such a state can be learned from given samples of a potentially unknown target state and show how regression approaches based on Bayesian inference can be used to compress a target state into a highly compact and accurate GPS representation. By application of a type II maximum likelihood method based on Relevance Vector Machines (RVM), we are able to extract many-body configurations from the underlying Hilbert space which are particularly relevant for the description of the target state, as support points to define the GPS. Together with an introduced optimization scheme for the hyperparameters of the model characterizing the weighting of modelled correlation features, this makes it possible to easily extract physical characteristics of the state such as the relative importance of particular correlation properties. We apply the Bayesian learning scheme to the problem of modelling ground states of small Fermi-Hubbard chains and show that the solutions found represent a systematically improvable trade-off between sparsity and accuracy of the model. Moreover, we show how the learned hyperparameters and the extracted relevant configurations, characterizing the correlation of the wavefunction, depend on the interaction strength of the Hubbard model as well as the target accuracy of the representation.
Submitted 18 September, 2020; v1 submitted 8 August, 2020;
originally announced August 2020.
-
Gaussian Process States: A data-driven representation of quantum many-body physics
Authors:
Aldo Glielmo,
Yannic Rath,
Gabor Csanyi,
Alessandro De Vita,
George H. Booth
Abstract:
We present a novel, non-parametric form for compactly representing entangled many-body quantum states, which we call a `Gaussian Process State'. In contrast to other approaches, we define this state explicitly in terms of a configurational data set, with the probability amplitudes statistically inferred from this data according to Bayesian statistics. In this way the non-local physical correlated features of the state can be analytically resummed, allowing for exponential complexity to underpin the ansatz, but efficiently represented in a small data set. The state is found to be highly compact, systematically improvable and efficient to sample, representing a large number of known variational states within its span. It is also proven to be a `universal approximator' for quantum states, able to capture any entangled many-body state with increasing data set size. We develop two numerical approaches which can learn this form directly: a fragmentation approach, and direct variational optimization, and apply these schemes to the Fermionic Hubbard model. We find competitive or superior descriptions of correlated quantum problems compared to existing state-of-the-art variational ansatzes, as well as other numerical methods.
Submitted 17 September, 2020; v1 submitted 27 February, 2020;
originally announced February 2020.
-
On Machine Learning Force Fields for Metallic Nanoparticles
Authors:
Claudio Zeni,
Kevin Rossi,
Aldo Glielmo,
Francesca Baletto
Abstract:
Machine learning algorithms have recently emerged as a tool to generate force fields which display accuracies approaching those of the ab-initio calculations they are trained on, but are much faster to compute. The enhanced computational speed of machine learning force fields is key for modelling metallic nanoparticles, as their fluxionality and multi-funneled energy landscape need to be sampled over long time scales. In this review, we first formally introduce the most commonly used machine learning algorithms for force field generation, briefly outlining their structure and properties. We then address the core issue of training database selection, reporting methodologies both already used and yet unused in the literature. We finally report and discuss the recent literature regarding machine learning force fields to sample the energy landscape and study the catalytic activity of metallic nanoparticles.
Submitted 16 September, 2019;
originally announced September 2019.
-
Building nonparametric $n$-body force fields using Gaussian process regression
Authors:
Aldo Glielmo,
Claudio Zeni,
Ádám Fekete,
Alessandro De Vita
Abstract:
Constructing a classical potential suited to simulate a given atomic system is a remarkably difficult task. This chapter presents a framework under which this problem can be tackled, based on the Bayesian construction of nonparametric force fields of a given order using Gaussian process (GP) priors. The formalism of GP regression is first reviewed, particularly in relation to its application in learning local atomic energies and forces. For accurate regression it is fundamental to incorporate prior knowledge into the GP kernel function. To this end, this chapter details how properties of smoothness, invariance and interaction order of a force field can be encoded into corresponding kernel properties. A range of kernels is then proposed, possessing all the required properties and an adjustable parameter $n$ governing the interaction order modelled. The order $n$ best suited to describe a given system can be found automatically within the Bayesian framework by maximisation of the marginal likelihood. The procedure is first tested on a toy model of known interaction and later applied to two real materials described at the DFT level of accuracy. The models automatically selected for the two materials were found to be in agreement with physical intuition. More generally, it was found that lower order (simpler) models should be chosen when the data are not sufficient to resolve more complex interactions. Low $n$ GPs can be further sped up by orders of magnitude by constructing the corresponding tabulated force field, here named "MFF".
Submitted 18 May, 2019;
originally announced May 2019.
-
Building machine learning force fields for nanoclusters
Authors:
Claudio Zeni,
Kevin Rossi,
Aldo Glielmo,
Ádám Fekete,
Nicola Gaston,
Francesca Baletto,
Alessandro De Vita
Abstract:
We assess Gaussian process (GP) regression as a technique to model interatomic forces in metal nanoclusters by analysing the performance of 2-body, 3-body and many-body kernel functions on a set of 19-atom Ni cluster structures. We find that 2-body GP kernels fail to provide faithful force estimates, despite succeeding in bulk Ni systems. However, both 3- and many-body kernels predict forces within a $\sim$0.1 eV/Å average error even for small training datasets, and achieve high accuracy even on out-of-sample, high-temperature structures. While training and testing on the same structure always provides satisfactory accuracy, cross-testing on dissimilar structures leads to higher prediction errors, posing an extrapolation problem. This can be cured using heterogeneous training on databases that contain more than one structure, which results in a good trade-off between versatility and overall accuracy. Starting from a 3-body kernel trained this way, we build an efficient non-parametric 3-body force field that allows accurate prediction of structural properties at finite temperatures, following a newly developed scheme [Glielmo et al. PRB 97, 184307 (2018)]. We use this to assess the thermal stability of Ni$_{19}$ nanoclusters at a fraction of the cost of full ab initio calculations.
Submitted 10 July, 2018; v1 submitted 5 February, 2018;
originally announced February 2018.
-
Efficient nonparametric $n$-body force fields from machine learning
Authors:
Aldo Glielmo,
Claudio Zeni,
Alessandro De Vita
Abstract:
We provide a definition and explicit expressions for $n$-body Gaussian Process (GP) kernels which can learn any interatomic interaction occurring in a physical system, up to $n$-body contributions, for any value of $n$. The series is complete, as it can be shown that the "universal approximator" squared exponential kernel can be written as a sum of $n$-body kernels. These recipes enable the choice of optimally efficient force models for each target system, as confirmed by extensive testing on various materials. We furthermore describe how the $n$-body kernels can be "mapped" on equivalent representations that provide database-size-independent predictions and are thus crucially more efficient. We explicitly carry out this mapping procedure for the first non-trivial (3-body) kernel of the series, and show that this reproduces the GP-predicted forces with meV/Å accuracy while being orders of magnitude faster. These results open the way to using novel force models (here named "M-FFs") that are computationally as fast as their corresponding standard parametrised $n$-body force fields, while retaining the nonparametric character, the ease of training and validation, and the accuracy of the best recently proposed machine learning potentials.
Submitted 25 May, 2018; v1 submitted 15 January, 2018;
originally announced January 2018.
-
Accurate Interatomic Force Fields via Machine Learning with Covariant Kernels
Authors:
Aldo Glielmo,
Peter Sollich,
Alessandro De Vita
Abstract:
We present a novel scheme to accurately predict atomic forces as vector quantities, rather than sets of scalar components, by Gaussian Process (GP) Regression. This is based on matrix-valued kernel functions, on which we impose the requirements that the predicted force rotates with the target configuration and is independent of any rotations applied to the configuration database entries. We show that such covariant GP kernels can be obtained by integration over the elements of the rotation group SO(d) for the relevant dimensionality d. Remarkably, in specific cases the integration can be carried out analytically and yields a conservative force field that can be recast into a pair interaction form. Finally, we show that restricting the integration to a summation over the elements of a finite point group relevant to the target system is sufficient to recover an accurate GP. The accuracy of our kernels in predicting quantum-mechanical forces in real materials is investigated by tests on pure and defective Ni, Fe and Si crystalline systems.
Submitted 8 June, 2017; v1 submitted 10 November, 2016;
originally announced November 2016.
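The finite point-group construction mentioned in this abstract can be sketched in a few lines. The example below assumes d = 2, the four-fold rotation group C4, and a simple Gaussian pair-overlap as the scalar base kernel; all names (`rotation`, `C4`, `base_kernel`, `covariant_kernel`) are hypothetical illustrations rather than the authors' implementation. It builds a matrix-valued kernel as a group average and checks numerically that the kernel transforms covariantly when either configuration is rotated by a group element.

```python
import numpy as np

def rotation(theta):
    """2x2 rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Assumed finite point group: the four rotations of C4.
C4 = [rotation(k * np.pi / 2) for k in range(4)]

def base_kernel(conf_a, conf_b, sigma=1.0):
    """Scalar kernel between two configurations (rows are neighbour vectors).

    It depends only on distances between vectors, so it is unchanged when
    both configurations are rotated together.
    """
    diff = conf_a[:, None, :] - conf_b[None, :, :]
    return np.exp(-np.sum(diff**2, axis=-1) / (2.0 * sigma**2)).sum()

def covariant_kernel(conf_a, conf_b, group=C4):
    """Matrix-valued kernel: group average of R * k(conf_a, R conf_b)."""
    return sum(R * base_kernel(conf_a, conf_b @ R.T) for R in group) / len(group)

# Toy configurations (random, hypothetical data).
rng = np.random.default_rng(0)
conf_a = rng.normal(size=(5, 2))
conf_b = rng.normal(size=(4, 2))

K = covariant_kernel(conf_a, conf_b)

# Rotate each configuration by a different group element.
S, S_prime = C4[1], C4[3]
K_rotated = covariant_kernel(conf_a @ S.T, conf_b @ S_prime.T)

# Covariance check: K(S a, S' b) should equal S K(a, b) S'^T.
print(np.allclose(K_rotated, S @ K @ S_prime.T))
```

In this sketch the covariance property holds only for rotations belonging to the chosen finite group; the abstract's integral over the full rotation group SO(d) would extend it to arbitrary rotations.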