-
Ranking the Synthesizability of Hypothetical Zeolites with the Sorting Hat
Authors:
Benjamin A. Helfrecht,
Giovanni Pireddu,
Rocio Semino,
Scott M. Auerbach,
Michele Ceriotti
Abstract:
Zeolites are nanoporous alumino-silicate frameworks widely used as catalysts and adsorbents. Even though millions of distinct siliceous networks can be generated by computer-aided searches, no new hypothetical framework has yet been synthesized. The needle-in-a-haystack problem of finding promising candidates among large databases of predicted structures has intrigued materials scientists for decades; most work to date on the zeolite problem has been limited to intuitive structural descriptors. Here, we tackle this problem through a rigorous data science scheme, the "zeolite sorting hat", that exploits interatomic correlations to classify real versus theoretical zeolites with 95% accuracy. The hypothetical frameworks that are grouped together with known zeolites are promising candidates for synthesis, which can be further ranked by estimating their thermodynamic stability. A critical analysis of the classifier reveals the decisive structural features. Further partitioning into compositional classes provides guidance in the design of synthetic strategies.
Submitted 26 October, 2021;
originally announced October 2021.
-
Improving Sample and Feature Selection with Principal Covariates Regression
Authors:
Rose K. Cersonsky,
Benjamin A. Helfrecht,
Edgar A. Engel,
Michele Ceriotti
Abstract:
Selecting the most relevant features and samples out of a large set of candidates is a task that occurs very often in the context of automated data analysis, where it can be used to improve the computational performance, and also often the transferability, of a model. Here we focus on two popular sub-selection schemes which have been applied to this end: CUR decomposition, which is based on a low-rank approximation of the feature matrix, and Farthest Point Sampling, which relies on the iterative identification of the most diverse samples and discriminating features. We modify these unsupervised approaches, incorporating a supervised component following the same spirit as the Principal Covariates Regression (PCovR) method. We show that incorporating target information provides selections that perform better in supervised tasks, which we demonstrate with ridge regression, kernel ridge regression, and sparse kernel regression. We also show that incorporating aspects of simple supervised learning models can improve the accuracy of more complex models, such as feed-forward neural networks. We present adjustments to minimize the impact that any subselection may incur when performing unsupervised tasks. We demonstrate the significant improvements associated with the use of PCov-CUR and PCov-FPS selections for applications to chemistry and materials science, typically reducing by a factor of two the number of features and samples which are required to achieve a given level of regression accuracy.
Submitted 22 December, 2020;
originally announced December 2020.
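As an illustration of the iterative selection the abstract above describes, the following is a minimal sketch of plain, unsupervised Farthest Point Sampling over the rows of a data matrix. It is not the supervised PCov-FPS variant introduced in the paper. The function name `farthest_point_sampling`, the choice of Euclidean distance, the fixed starting index, and the toy data are all assumptions made for this example.

```python
import numpy as np


def farthest_point_sampling(X, n_select, start=0):
    """Greedy FPS over the rows of X using Euclidean distance (assumed metric)."""
    X = np.asarray(X, dtype=float)
    selected = [start]
    # Distance from every row to its nearest already-selected row.
    min_dist = np.linalg.norm(X - X[start], axis=1)
    for _ in range(n_select - 1):
        # Pick the row that is farthest from everything selected so far.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Update the nearest-selected distances with the new pick.
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected


# Hypothetical toy data: four samples in two dimensions.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 2.0]])
fps_indices = farthest_point_sampling(X, 3)
```

Passing `X.T` instead of `X` would select features rather than samples under the same scheme.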
-
Structure-Property Maps with Kernel Principal Covariates Regression
Authors:
Benjamin A. Helfrecht,
Rose K. Cersonsky,
Guillaume Fraux,
Michele Ceriotti
Abstract:
Data analyses based on linear methods constitute the simplest, most robust, and transparent approaches to the automatic processing of large amounts of data for building supervised or unsupervised machine learning models. Principal covariates regression (PCovR) is an underappreciated method that interpolates between principal component analysis and linear regression, and can be used to conveniently reveal structure-property relations in terms of simple-to-interpret, low-dimensional maps. Here we provide a pedagogic overview of these data analysis schemes, including the use of the kernel trick to introduce an element of non-linearity, while maintaining most of the convenience and the simplicity of linear approaches. We then introduce a kernelized version of PCovR and a sparsified extension, and demonstrate the performance of this approach in revealing and predicting structure-property relations in chemistry and materials science, showing a variety of examples including elemental carbon, porous silicate frameworks, organic molecules, amino acid conformers, and molecular materials.
Submitted 21 May, 2020; v1 submitted 12 February, 2020;
originally announced February 2020.
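To make the interpolation between principal component analysis and linear regression concrete, here is a minimal sketch of a sample-space, linear PCovR projection. It assumes a mixing parameter `alpha` (1 gives PCA-like scores, 0 gives scores driven purely by the least-squares prediction) and normalizes each term by a squared Frobenius norm; that normalization, the function name `pcovr_scores`, and the synthetic data are assumptions for illustration and may differ from the paper's conventions. The kernelized and sparsified versions mentioned in the abstract are not shown.

```python
import numpy as np


def pcovr_scores(X, Y, alpha, n_components):
    """Latent-space scores from a mixed Gram matrix (illustrative sketch)."""
    # Center features and targets.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Ordinary least-squares prediction of the targets from the features.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Y_hat = X @ W
    # Mix the feature Gram matrix with the Gram matrix of the predictions.
    K = (alpha * (X @ X.T) / np.linalg.norm(X) ** 2
         + (1.0 - alpha) * (Y_hat @ Y_hat.T) / np.linalg.norm(Y) ** 2)
    # Keep the leading eigenpairs of the symmetric mixed matrix.
    evals, evecs = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1][:n_components]
    top = np.clip(evals[order], 0.0, None)  # guard against tiny negatives
    return evecs[:, order] * np.sqrt(top)


# Hypothetical synthetic data: 50 samples, 5 features, 1 target.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Y = X @ rng.normal(size=(5, 1)) + 0.1 * rng.normal(size=(50, 1))

T_pca_like = pcovr_scores(X, Y, alpha=1.0, n_components=2)
T_regression_like = pcovr_scores(X, Y, alpha=0.0, n_components=1)
```

With `alpha=1.0` the scores match the PCA scores of the centered features up to sign and an overall scale; with `alpha=0.0` and a single target, the one retained component is proportional to the least-squares prediction.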
-
A New Kind of Atlas of Zeolite Building Blocks
Authors:
Benjamin A. Helfrecht,
Rocio Semino,
Giovanni Pireddu,
Scott M. Auerbach,
Michele Ceriotti
Abstract:
We have analysed structural motifs in the Deem database of hypothetical zeolites, to investigate whether the structural diversity found in this database can be well-represented by classical descriptors such as distances, angles, and ring sizes, or whether a more general representation of atomic structure, furnished by the smooth overlap of atomic positions (SOAP) method, is required to accurately capture structure-property relations. We assessed the quality of each descriptor by machine-learning the molar energy and volume for each hypothetical framework in the dataset. We have found that SOAP with a cutoff length of 6 Å, which goes beyond near-neighbor tetrahedra, best describes the structural diversity in the Deem database by capturing relevant inter-atomic correlations. Kernel principal component analysis shows that SOAP maintains its superior performance even when reducing its dimensionality to those of the classical descriptors, and that the first three kernel principal components capture the main variability in the data set, allowing a 3D point cloud visualization of local environments in the Deem database. This "cloud atlas" of local environments was found to show good correlations with the contribution of a given motif to the density and stability of its parent framework. Local volume and energy maps constructed from the SOAP/machine-learning analyses provide new images of zeolites that reveal smooth variations of local volumes and energies across a given framework, and correlations between local volume and energy in a given framework.
Submitted 8 July, 2019;
originally announced July 2019.