From Prediction to Action: Critical Role of Performance Estimation for Machine-Learning-Driven Materials Discovery

Authors: Mario Boley, Felix Luong, Simon Teshuva, Daniel F Schmidt, Lucas Foppa, Matthias Scheffler

Abstract: Materials discovery driven by statistical property models is an iterative decision process, during which an initial data collection is extended with new data proposed by a model-informed acquisition function--with the goal to maximize a certain "reward" over time, such as the maximum property value discovered so far. While the materials science community achieved much progress in developing property models that predict well on average with respect to the training distribution, this form of in-distribution performance measurement is not directly coupled with the discovery reward. This is because an iterative discovery process has a shifting reward distribution that is over-proportionally determined by the model performance for exceptional materials. We demonstrate this problem using the example of bulk modulus maximization among double perovskite oxides. We find that the in-distribution predictive performance suggests random forests as superior to Gaussian process regression, while the results are inverse in terms of the discovery rewards. We argue that the lack of proper performance estimation methods from pre-computed data collections is a fundamental problem for improving data-driven materials discovery, and we propose a novel such estimator that, in contrast to naïve reward estimation, successfully predicts Gaussian processes with the "expected improvement" acquisition function as the best out of four options in our demonstrational study for double perovskites. Importantly, it does so without requiring the over thousand ab initio computations that were needed to confirm this prediction.

Submitted 6 December, 2023; v1 submitted 27 November, 2023; originally announced November 2023.

Comments: Simplified notation

arXiv:2303.14434

doi 10.1103/PhysRevB.108.L100302

Heat flux for semi-local machine-learning potentials

Authors: Marcel F. Langer, Florian Knoop, Christian Carbogno, Matthias Scheffler, Matthias Rupp

Abstract: The Green-Kubo (GK) method is a rigorous framework for heat transport simulations in materials. However, it requires an accurate description of the potential-energy surface and carefully converged statistics. Machine-learning potentials can achieve the accuracy of first-principles simulations while allowing to reach well beyond their simulation time and length scales at a fraction of the cost. In this paper, we explain how to apply the GK approach to the recent class of message-passing machine-learning potentials, which iteratively consider semi-local interactions beyond the initial interaction cutoff. We derive an adapted heat flux formulation that can be implemented using automatic differentiation without compromising computational efficiency. The approach is demonstrated and validated by calculating the thermal conductivity of zirconium dioxide across temperatures.

Submitted 28 March, 2023; v1 submitted 25 March, 2023; originally announced March 2023.

Comments: 6 pages, 3 figures, excluding supplement (12 pages, 10 figures), v2: fixed figures. Additional information at https://marcel.science/gknet

arXiv:2001.11212

doi 10.1007/s10618-022-00847-y

TCMI: a non-parametric mutual-dependence estimator for multivariate continuous distributions

Authors: Benjamin Regler, Matthias Scheffler, Luca M. Ghiringhelli

Abstract: The identification of relevant features, i.e., the driving variables that determine a process or the properties of a system, is an essential part of the analysis of data sets with a large number of variables. A mathematical rigorous approach to quantifying the relevance of these features is mutual information. Mutual information determines the relevance of features in terms of their joint mutual dependence to the property of interest. However, mutual information requires as input probability distributions, which cannot be reliably estimated from continuous distributions such as physical quantities like lengths or energies. Here, we introduce total cumulative mutual information (TCMI), a measure of the relevance of mutual dependences that extends mutual information to random variables of continuous distribution based on cumulative probability distributions. TCMI is a non-parametric, robust, and deterministic measure that facilitates comparisons and rankings between feature sets with different cardinality. The ranking induced by TCMI allows for feature selection, i.e., the identification of variable sets that are nonlinear statistically related to a property of interest, taking into account the number of data samples as well as the cardinality of the set of variables. We evaluate the performance of our measure with simulated data, compare its performance with similar multivariate-dependence measures, and demonstrate the effectiveness of our feature-selection method on a set of standard data sets and a typical scenario in materials science.

Submitted 30 July, 2022; v1 submitted 30 January, 2020; originally announced January 2020.

Comments: 28 pages, 8 figures, 8 tables

Journal ref: Data Mining and Knowledge Discovery (2022)

arXiv:1811.01277

doi 10.1016/j.parco.2019.04.003

Optimizations of the Eigensolvers in the ELPA Library

Authors: P. Kus, A. Marek, S. S. Koecher, H. -H. Kowalski, C. Carbogno, Ch. Scheurer, K. Reuter, M. Scheffler, H. Lederer

Abstract: The solution of (generalized) eigenvalue problems for symmetric or Hermitian matrices is a common subtask of many numerical calculations in electronic structure theory or materials science. Solving the eigenvalue problem can easily amount to a sizeable fraction of the whole numerical calculation. For researchers in the field of computational materials science, an efficient and scalable solution of the eigenvalue problem is thus of major importance. The ELPA-library is a well-established dense direct eigenvalue solver library, which has proven to be very efficient and scalable up to very large core counts. In this paper, we describe the latest optimizations of the ELPA-library for new HPC architectures of the Intel Skylake processor family with an AVX-512 SIMD instruction set, or for HPC systems accelerated with recent GPUs. We also describe a complete redesign of the API in a modern modular way, which, apart from a much simpler and more flexible usability, leads to a new path to access system-specific performance optimizations. In order to ensure optimal performance for a particular scientific setting or a specific HPC system, the new API allows the user to influence in straightforward way the internal details of the algorithms and of performance-critical parameters used in the ELPA-library. On top of that, we introduced an autotuning functionality, which allows for finding the best settings in a self-contained automated way. In situations where many eigenvalue problems with similar settings have to be solved consecutively, the autotuning process of the ELPA-library can be done "on-the-fly". Practical applications from materials science which rely on so-called self-consistency iterations can profit from the autotuning. On some examples of scientific interest, simulated with the FHI-aims application, the advantages of the latest optimizations of the ELPA-library are demonstrated.

Submitted 3 November, 2018; originally announced November 2018.

Journal ref: Parallel Computing 85, pp 167-177 (2019)

arXiv:1206.0603

The COMICS Tool - Computing Minimal Counterexamples for Discrete-time Markov Chains

Authors: Nils Jansen, Erika Ábrahám, Maik Scheffler, Matthias Volk, Andreas Vorpahl, Ralf Wimmer, Joost-Pieter Katoen, Bernd Becker

Abstract: This report presents the tool COMICS, which performs model checking and generates counterexamples for DTMCs. For an input DTMC, COMICS computes an abstract system that carries the model checking information and uses this result to compute a critical subsystem, which induces a counterexample. This abstract subsystem can be refined and concretized hierarchically. The tool comes with a command-line version as well as a graphical user interface that allows the user to interactively influence the refinement process of the counterexample.

Submitted 4 June, 2012; originally announced June 2012.

Showing 1–5 of 5 results for author: Scheffler, M