
doi 10.1063/5.0195232

Simulation-Free Determination of Microstructure Representative Volume Element Size via Fisher Scores

Authors: Wei Liu, Satyajit Mojumder, Wing Kam Liu, Wei Chen, Daniel W. Apley

Abstract: A representative volume element (RVE) is a reasonably small unit of microstructure that can be simulated to obtain the same effective properties as the entire microstructure sample. Finite element (FE) simulation of RVEs, as opposed to much larger samples, saves computational expense, especially in multiscale modeling. Therefore, it is desirable to have a framework that determines RVE size prior to FE simulations. Existing methods select the RVE size based on when the FE-simulated properties of samples of increasing size converge with insignificant statistical variations, with the drawback that many samples must be simulated. We propose a simulation-free alternative that determines RVE size based only on a micrograph. The approach utilizes a machine learning model trained to implicitly characterize the stochastic nature of the input micrograph. The underlying rationale is to view RVE size as the smallest moving window size for which the stochastic nature of the microstructure within the window is stationary as the window moves across a large micrograph. For this purpose, we adapt a recently developed Fisher score-based framework for microstructure nonstationarity monitoring. Because the resulting RVE size is based solely on the micrograph and does not involve any FE simulation of specific properties, it constitutes an RVE for any property of interest that solely depends on the microstructure characteristics. Through numerical experiments of simple and complex microstructures, we validate our approach and show that our selected RVE sizes are consistent with when the chosen FE-simulated properties converge.

Submitted 7 April, 2024; originally announced April 2024.

Journal ref: APL Mach. Learn. 2(2): 026101 (2024)

arXiv:2303.03393 [pdf, other]

Interpretable Architecture Neural Networks for Function Visualization

Authors: Shengtong Zhang, Daniel W. Apley

Abstract: In many scientific research fields, understanding and visualizing a black-box function in terms of the effects of all the input variables is of great importance. Existing visualization tools do not allow one to visualize the effects of all the input variables simultaneously. Although one can select one or two of the input variables to visualize via a 2D or 3D plot while holding other variables fixed, this presents an oversimplified and incomplete picture of the model. To overcome this shortcoming, we present a new visualization approach using an interpretable architecture neural network (IANN) to visualize the effects of all the input variables directly and simultaneously. We propose two interpretable structures, each of which can be conveniently represented by a specific IANN, and we discuss a number of possible extensions. We also provide a Python package to implement our proposed method. The supplemental materials are available online.

Submitted 3 March, 2023; originally announced March 2023.

arXiv:2211.02218 [pdf, other]

Fully Bayesian inference for latent variable Gaussian process models

Authors: Suraj Yerramilli, Akshay Iyer, Wei Chen, Daniel W. Apley

Abstract: Real engineering and scientific applications often involve one or more qualitative inputs. Standard Gaussian processes (GPs), however, cannot directly accommodate qualitative inputs. The recently introduced latent variable Gaussian process (LVGP) overcomes this issue by first mapping each qualitative factor to underlying latent variables (LVs), and then using any standard GP covariance function over these LVs. The LVs are estimated similarly to the other GP hyperparameters through maximum likelihood estimation, and then plugged into the prediction expressions. However, this plug-in approach will not account for uncertainty in estimation of the LVs, which can be significant especially with limited training data. In this work, we develop a fully Bayesian approach for the LVGP model and for visualizing the effects of the qualitative inputs via their LVs. We also develop approximations for scaling up LVGPs and fully Bayesian inference for the LVGP hyperparameters. We conduct numerical studies comparing plug-in inference against fully Bayesian inference over a few engineering models and material design applications. In contrast to previous studies on standard GP modeling that have largely concluded that a fully Bayesian treatment offers limited improvements, our results show that for LVGP modeling it offers significant improvements in prediction accuracy and uncertainty quantification over the plug-in approach.

Submitted 19 March, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

arXiv:2207.04994 [pdf, other]

doi 10.1038/s41598-022-23431-2

Uncertainty-Aware Mixed-Variable Machine Learning for Materials Design

Authors: Hengrui Zhang, Wei Wayne Chen, Akshay Iyer, Daniel W. Apley, Wei Chen

Abstract: Data-driven design shows the promise of accelerating materials discovery but is challenging due to the prohibitive cost of searching the vast design space of chemistry, structure, and synthesis methods. Bayesian Optimization (BO) employs uncertainty-aware machine learning models to select promising designs to evaluate, hence reducing the cost. However, BO with mixed numerical and categorical variables, which is of particular interest in materials design, has not been well studied. In this work, we survey frequentist and Bayesian approaches to uncertainty quantification of machine learning with mixed variables. We then conduct a systematic comparative study of their performances in BO using a popular representative model from each group, the random forest-based Lolo model (frequentist) and the latent variable Gaussian process model (Bayesian). We examine the efficacy of the two models in the optimization of mathematical functions, as well as properties of structural and functional materials, where we observe performance differences as related to problem dimensionality and complexity. By investigating the machine learning models' predictive and uncertainty estimation capabilities, we provide interpretations of the observed performance differences. Our results provide practical guidance on choosing between frequentist and Bayesian uncertainty-aware machine learning models for mixed-variable BO in materials design.

Submitted 4 October, 2022; v1 submitted 11 July, 2022; originally announced July 2022.

Journal ref: Scientific Reports 12, 19760 (2022)

arXiv:2206.10812 [pdf, other]

doi 10.1287/ijds.2022.00017

Diversity Subsampling: Custom Subsamples from Large Data Sets

Authors: Boyang Shang, Daniel W. Apley, Sanjay Mehrotra

Abstract: Subsampling from a large data set is useful in many supervised learning contexts to provide a global view of the data based on only a fraction of the observations. Diverse (or space-filling) subsampling is an appealing subsampling approach when no prior knowledge of the data is available. In this paper, we propose a diversity subsampling approach that selects a subsample from the original data such that the subsample is independently and uniformly distributed over the support of the distribution from which the data are drawn, to the maximum extent possible. We give an asymptotic performance guarantee of the proposed method and provide experimental results to show that the proposed method performs well for typical finite-size data. We also compare the proposed method with competing diversity subsampling algorithms and demonstrate numerically that subsamples selected by the proposed method are closer to a uniform sample than subsamples selected by other methods. The proposed DS algorithm is shown to be more efficient than known methods and takes only a few minutes to select tens of thousands of subsample points from a data set of size one million. Our DS algorithm easily generalizes to select subsamples following distributions other than uniform. We provide the FADS Python package to implement the proposed methods.

Submitted 21 June, 2022; originally announced June 2022.

arXiv:2106.15356 [pdf]

Scalable Gaussian Processes for Data-Driven Design using Big Data with Categorical Factors

Authors: Liwei Wang, Suraj Yerramilli, Akshay Iyer, Daniel Apley, Ping Zhu, Wei Chen

Abstract: Scientific and engineering problems often require the use of artificial intelligence to aid understanding and the search for promising designs. While Gaussian processes (GP) stand out as easy-to-use and interpretable learners, they have difficulties in accommodating big datasets, categorical inputs, and multiple responses, which has become a common challenge for a growing number of data-driven design applications. In this paper, we propose a GP model that utilizes latent variables and functions obtained through variational inference to address the aforementioned challenges simultaneously. The method is built upon the latent variable Gaussian process (LVGP) model where categorical factors are mapped into a continuous latent space to enable GP modeling of mixed-variable datasets. By extending variational inference to LVGP models, the large training dataset is replaced by a small set of inducing points to address the scalability issue. Output response vectors are represented by a linear combination of independent latent functions, forming a flexible kernel structure to handle multiple responses that might have distinct behaviors. Comparative studies demonstrate that the proposed method scales well for large datasets with over 10^4 data points, while outperforming state-of-the-art machine learning methods without requiring much hyperparameter tuning. In addition, an interpretable latent space is obtained to draw insights into the effect of categorical factors, such as those associated with building blocks of architectures and element choices in metamaterial and materials design. Our approach is demonstrated for machine learning of ternary oxide materials and topology optimization of a multiscale compliant mechanism with aperiodic microstructures and multiple materials.

Submitted 29 June, 2021; v1 submitted 25 June, 2021; originally announced June 2021.

Comments: Preprint submitted to Journal of Mechanical Design

arXiv:2012.11135 [pdf, other]

Nonstationarity Analysis of Materials Microstructures via Fisher Score Vectors

Authors: Kungang Zhang, Daniel W. Apley, Wei Chen

Abstract: Microstructures are critical to the physical properties of materials. Stochastic microstructures are commonly observed in many kinds of materials and traditional descriptor-based image analysis of them can be challenging. In this paper, we introduce a powerful and versatile score-based framework for analyzing nonstationarity in stochastic materials microstructures. The framework involves training a parametric supervised learning model to predict a pixel value using neighboring pixels in images of microstructures (also known as micrographs), and this predictive model provides an implicit characterization of the stochastic nature of the microstructure. The basis for our approach is the Fisher score vector, defined as the gradient of the log-likelihood with respect to the parameters of the predictive model, at each micrograph pixel. A fundamental property of the score vector is that it is zero-mean if the predictive relationship in the vicinity of that pixel remains unchanged, which we equate with the local stochastic nature of the microstructure remaining unchanged. Conversely, if the local stochastic nature changes, then the mean of the score vector generally differs from zero. Our framework analyzes how the local mean of the score vector varies across one or more image samples to: (1) monitor for nonstationarity by indicating whether new samples are statistically different than reference samples and where they may differ and (2) diagnose nonstationarity by identifying the distinct types of stochastic microstructures and labeling accordingly the corresponding regions of the samples. Unlike feature-based methods, our approach is almost completely general and requires no prior knowledge of the nature of the nonstationarities. Using a number of real and simulated micrographs, including polymer composites and multiphase alloys, we demonstrate the power and versatility of the approach.

Submitted 21 December, 2020; originally announced December 2020.

arXiv:2012.06916 [pdf, other]

Concept Drift Monitoring and Diagnostics of Supervised Learning Models via Score Vectors

Authors: Kungang Zhang, Anh T. Bui, Daniel W. Apley

Abstract: Supervised learning models are one of the most fundamental classes of models. Viewing supervised learning from a probabilistic perspective, the set of training data to which the model is fitted is usually assumed to follow a stationary distribution. However, this stationarity assumption is often violated in a phenomenon called concept drift, which refers to changes over time in the predictive relationship between covariates $\mathbf{X}$ and a response variable $Y$ and can render trained models suboptimal or obsolete. We develop a comprehensive and computationally efficient framework for detecting, monitoring, and diagnosing concept drift. Specifically, we monitor the Fisher score vector, defined as the gradient of the log-likelihood for the fitted model, using a form of multivariate exponentially weighted moving average, which monitors for general changes in the mean of a random vector. In spite of the substantial performance advantages that we demonstrate over popular error-based methods, a score-based approach has not been previously considered for concept drift monitoring. Advantages of the proposed score-based framework include applicability to any parametric model, more powerful detection of changes as shown in theory and experiments, and inherent diagnostic capabilities for helping to identify the nature of the changes.
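A minimal sketch of the score-based monitoring idea described in this abstract, assuming a simple linear-Gaussian model with known unit noise variance and a plain component-wise EWMA rather than the multivariate EWMA statistic the authors describe. The function names, the smoothing weight `lam`, and the simulated stream are all illustrative assumptions, not the paper's implementation.

```python
import random

def score(theta, x, y, sigma2=1.0):
    """Fisher score of one observation for y ~ N(b0 + b1*x, sigma2).

    This is the gradient of the Gaussian log-likelihood with respect to (b0, b1).
    """
    b0, b1 = theta
    resid = y - (b0 + b1 * x)
    return (resid / sigma2, resid * x / sigma2)

def ewma_scores(theta, data, lam=0.02):
    """Return the component-wise EWMA path of the score vectors over a stream."""
    z = (0.0, 0.0)
    path = []
    for x, y in data:
        s = score(theta, x, y)
        z = tuple(lam * si + (1.0 - lam) * zi for si, zi in zip(s, z))
        path.append(z)
    return path

def simulate(n, b0, b1, rng):
    """Draw n (x, y) pairs from y = b0 + b1*x + N(0, 1) noise."""
    out = []
    for _ in range(n):
        x = rng.uniform(0.0, 2.0)
        out.append((x, b0 + b1 * x + rng.gauss(0.0, 1.0)))
    return out

rng = random.Random(0)
theta_hat = (1.0, 2.0)  # assumed to be the fitted parameters
# Hypothetical stream: the slope drifts from 2 to 3 after the first 500 points.
stream = simulate(500, 1.0, 2.0, rng) + simulate(500, 1.0, 3.0, rng)
path = ewma_scores(theta_hat, stream)
before = path[499]  # smoothed score just before the drift
after = path[999]   # smoothed score well after the drift
print("before drift:", before)
print("after drift: ", after)
```

While the stream matches the fitted model, the smoothed score stays near zero; once the slope changes, its mean shifts away from zero, which is the signal a control limit would flag.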

Submitted 12 September, 2022; v1 submitted 12 December, 2020; originally announced December 2020.

arXiv:2010.13306 [pdf, other]

doi 10.1021/acs.chemmater.1c00905

Database, Features, and Machine Learning Model to Identify Thermally Driven Metal-Insulator Transition Compounds

Authors: Alexandru B. Georgescu, Peiwen Ren, Aubrey R. Toland, Shengtong Zhang, Kyle D. Miller, Daniel W. Apley, Elsa A. Olivetti, Nicholas Wagner, James M. Rondinelli

Abstract: Metal-insulator transition (MIT) compounds are materials that may exhibit insulating or metallic behavior, depending on the physical conditions, and are of immense fundamental interest owing to their potential applications in emerging microelectronics. There is a dearth of thermally-driven MIT materials, however, which makes delineating these compounds from those that are exclusively insulating or metallic challenging. Here we report a material database comprising temperature-controlled MITs (and metals and insulators with similar chemical composition and stoichiometries to the MIT compounds) from high quality experimental literature, built through a combination of materials-domain knowledge and natural language processing. We featurize the dataset using compositional, structural, and energetic descriptors, including two MIT relevant energy scales, an estimated Hubbard interaction and the charge transfer energy, as well as the structure-bond-stress metric referred to as the global-instability index (GII). We then perform supervised classification, constructing three electronic-state classifiers: metal vs non-metal (M), insulator vs non-insulator (I), and MIT vs non-MIT (T). We identify two important descriptors that separate metals, insulators, and MIT materials in a 2D feature space: the average deviation of the covalent radius and the range of the Mendeleev number. We further elaborate on other important features (GII and Ewald energy), and examine how they affect classification of binary vanadium and titanium oxides. We discuss the relationship of these atomic features to the physical interactions underlying MITs in the rare-earth nickelate family. Last, we implement an online version of the classifiers, enabling quick probabilistic class predictions by uploading a crystallographic structure file.

Submitted 21 July, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

Journal ref: Chem. Mater. 33, 14, 5591-5605 (2021)

arXiv:1910.01688 [pdf]

Bayesian Optimization for Materials Design with Mixed Quantitative and Qualitative Variables

Authors: Yichi Zhang, Daniel Apley, Wei Chen

Abstract: Although Bayesian Optimization (BO) has been employed for accelerating materials design in computational materials engineering, existing works are restricted to problems with quantitative variables. However, real designs of materials systems involve both qualitative and quantitative design variables representing material compositions, microstructure morphology, and processing conditions. For mixed-variable problems, existing BO approaches represent qualitative factors by dummy variables first and then fit a standard Gaussian process (GP) model with numerical variables as the surrogate model. This approach is restrictive theoretically and fails to capture complex correlations between qualitative levels. We present in this paper the integration of a novel latent-variable (LV) approach for mixed-variable GP modeling with the BO framework for materials design. LVGP is a fundamentally different approach that maps qualitative design variables to underlying numerical LVs in the GP, which has strong physical justification. It provides flexible parameterization and representation of qualitative factors and shows superior modeling accuracy compared to the existing methods. We demonstrate our approach through testing with numerical examples and materials design examples. It is found that in all test examples the mapped LVs provide intuitive visualization and substantial insight into the nature and effects of the qualitative factors. Though materials designs are used as examples, the method presented is generic and can be utilized for other mixed variable design optimization problems that involve expensive physics-based simulations.

Submitted 3 October, 2019; originally announced October 2019.

Comments: 29 pages, 9 figures, 3 tables

arXiv:1812.01786 [pdf, other]

Density Deconvolution with Additive Measurement Errors using Quadratic Programming

Authors: Ran Yang, Daniel Apley, Jeremy Staum, David Ruppert

Abstract: Distribution estimation for noisy data via density deconvolution is a notoriously difficult problem for typical noise distributions like Gaussian. We develop a density deconvolution estimator based on quadratic programming (QP) that can achieve better estimation than kernel density deconvolution methods. The QP approach appears to have a more favorable regularization tradeoff between oversmoothing and oscillation, especially at the tails of the distribution. An additional advantage is that it is straightforward to incorporate a number of common density constraints such as nonnegativity, integration-to-one, unimodality, tail convexity, tail monotonicity, and support constraints. We demonstrate that the QP approach has outstanding estimation performance relative to existing methods. Its performance is superior when only the universally applicable nonnegativity and integration-to-one constraints are incorporated, and incorporating additional common constraints when applicable (e.g., nonnegative support, unimodality, tail monotonicity or convexity, etc.) can further substantially improve the estimation.

Submitted 4 December, 2018; originally announced December 2018.

arXiv:1806.07504 [pdf]

A Latent Variable Approach to Gaussian Process Modeling with Qualitative and Quantitative Factors

Authors: Yichi Zhang, Siyu Tao, Wei Chen, Daniel W. Apley

Abstract: Computer simulations often involve both qualitative and numerical inputs. Existing Gaussian process (GP) methods for handling this mainly assume a different response surface for each combination of levels of the qualitative factors and relate them via a multiresponse cross-covariance matrix. We introduce a substantially different approach that maps each qualitative factor to an underlying numerical latent variable (LV), with the mapped value for each level estimated similarly to the correlation parameters. This provides a parsimonious GP parameterization that treats qualitative factors the same as numerical variables and views them as affecting the response via similar physical mechanisms. This has strong physical justification, as the effects of a qualitative factor in any physics-based simulation model must always be due to some underlying numerical variables. Even when the underlying variables are many, sufficient dimension reduction arguments imply that their effects can be represented by a low-dimensional LV. This conjecture is supported by the superior predictive performance observed across a variety of examples. Moreover, the mapped LVs provide substantial insight into the nature and effects of the qualitative factors.
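To make the latent-variable mapping in this abstract concrete, here is a small sketch that assumes a Gaussian (squared-exponential) correlation function over one numerical input and a two-dimensional latent position for each level of one qualitative factor. The latent coordinates in `LATENT`, the level names, and the roughness parameter `phi` are made up for illustration; in the approach described they would be estimated from data alongside the correlation parameters.

```python
import math

# Hypothetical 2-D latent positions for three levels of one qualitative factor.
# These values are invented for illustration, not estimated from any data.
LATENT = {"A": (0.0, 0.0), "B": (0.1, 0.0), "C": (1.5, 0.8)}

def correlation(x1, t1, x2, t2, phi=1.0):
    """Gaussian correlation over a numerical input x and a qualitative level t.

    The qualitative level enters only through its latent position, so it is
    treated like any other numerical input of the correlation function.
    """
    z1, z2 = LATENT[t1], LATENT[t2]
    d_num = phi * (x1 - x2) ** 2
    d_lat = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return math.exp(-(d_num + d_lat))

# Levels whose latent positions are close behave as highly correlated responses.
r_ab = correlation(0.5, "A", 0.5, "B")
r_ac = correlation(0.5, "A", 0.5, "C")
print("corr(A, B) at equal x:", r_ab)
print("corr(A, C) at equal x:", r_ac)
```

In this toy setup, levels A and B sit close together in the latent space and so are treated as nearly interchangeable, while level C is far away and almost uncorrelated with A. Plotting such latent positions is one way the mapped values could be read for insight into the qualitative factor.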

Submitted 30 January, 2019; v1 submitted 19 June, 2018; originally announced June 2018.

arXiv:1702.02966 [pdf]

doi 10.1080/00401706.2017.1302362

A monitoring and diagnostic approach for stochastic textured surfaces

Authors: Anh Tuan Bui, Daniel W. Apley

Abstract: We develop a supervised-learning-based approach for monitoring and diagnosing texture-related defects in manufactured products characterized by stochastic textured surfaces that satisfy the locality and stationarity properties of Markov random fields. Examples of stochastic textured surface data include images of woven textiles; image or surface metrology data for machined, cast, or formed metal parts; microscopy images of material microstructure samples; etc. To characterize the complex spatial statistical dependencies of in-control samples of the stochastic textured surface, we use rather generic supervised learning methods, which provide an implicit characterization of the joint distribution of the surface texture. We propose two spatial moving statistics, which are computed from residual errors of the fitted supervised learning model, for monitoring and diagnosing local aberrations in the general spatial statistical behavior of newly manufactured stochastic textured surface samples in a statistical process control context. We illustrate the approach using images of textile fabric samples and simulated 2-D stochastic processes, for which the algorithm successfully detects local defects of various natures. Supplemental discussions, results, data and computer codes are available online.

Submitted 21 July, 2017; v1 submitted 9 February, 2017; originally announced February 2017.

arXiv:1701.06655 [pdf, other]

Patchwork Kriging for Large-scale Gaussian Process Regression

Authors: Chiwoo Park, Daniel Apley

Abstract: This paper presents a new approach for Gaussian process (GP) regression for large datasets. The approach involves partitioning the regression input domain into multiple local regions with a different local GP model fitted in each region. Unlike existing local partitioned GP approaches, we introduce a technique for patching together the local GP models nearly seamlessly to ensure that the local GP models for two neighboring regions produce nearly the same response prediction and prediction error variance on the boundary between the two regions. This largely mitigates the well-known discontinuity problem that degrades the boundary accuracy of existing local partitioned GP methods. Our main innovation is to represent the continuity conditions as additional pseudo-observations that the differences between neighboring GP responses are identically zero at an appropriately chosen set of boundary input locations. To predict the response at any input location, we simply augment the actual response observations with the pseudo-observations and apply standard GP prediction methods to the augmented data. In contrast to heuristic continuity adjustments, this has an advantage of working within a formal GP framework, so that the GP-based predictive uncertainty quantification remains valid. Our approach also inherits a sparse block-like structure for the sample covariance matrix, which results in computationally efficient closed-form expressions for the predictive mean and variance. In addition, we provide a new spatial partitioning scheme based on a recursive space partitioning along local principal component directions, which makes the proposed approach applicable for regression domains having more than two dimensions. Using three spatial datasets and three higher dimensional datasets, we investigate the numerical performance of the approach and compare it to several state-of-the-art approaches.

Submitted 7 July, 2018; v1 submitted 23 January, 2017; originally announced January 2017.

MSC Class: 68T01; ACM Class: G.3

arXiv:1612.08468 [pdf, other]

Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models

Authors: Daniel W. Apley, Jingyu Zhu

Abstract: When fitting black box supervised learning models (e.g., complex trees, neural networks, boosted trees, random forests, nearest neighbors, local kernel-weighted methods, etc.), visualizing the main effects of the individual predictor variables and their low-order interaction effects is often important, and partial dependence (PD) plots are the most popular approach for accomplishing this. However, PD plots involve a serious pitfall if the predictor variables are far from independent, which is quite common with large observational data sets. Namely, PD plots require extrapolation of the response at predictor values that are far outside the multivariate envelope of the training data, which can render the PD plots unreliable. Although marginal plots (M plots) do not require such extrapolation, they produce substantially biased and misleading results when the predictors are dependent, analogous to the omitted variable bias in regression. We present a new visualization approach that we term accumulated local effects (ALE) plots, which inherits the desirable characteristics of PD and M plots, without inheriting their preceding shortcomings. Like M plots, ALE plots do not require extrapolation; and like PD plots, they are not biased by the omitted variable phenomenon. Moreover, ALE plots are far less computationally expensive than PD plots.
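The accumulated-local-effects idea in this abstract can be sketched in a few lines. The version below is a simplified first-order estimator under stated assumptions: quantile-based bin edges, local differences averaged within each bin, and a count-weighted centering step. The function name `ale_1d`, the toy model `f`, and the simulated data are illustrative and are not taken from the ALEPlot package.

```python
import random

def ale_1d(f, data, j, num_bins=10):
    """Simplified first-order ALE estimate for predictor j of model f.

    Returns (edges, ale), where ale[k] is the centered accumulated effect
    at bin edge edges[k].
    """
    xs = sorted(row[j] for row in data)
    n = len(xs)
    # Bin edges at empirical quantiles of predictor j.
    edges = [xs[(k * (n - 1)) // num_bins] for k in range(num_bins + 1)]
    sums = [0.0] * (num_bins + 1)
    counts = [0] * (num_bins + 1)
    for row in data:
        # Find the bin (edges[k-1], edges[k]] containing this observation.
        k = 1
        while k < num_bins and row[j] > edges[k]:
            k += 1
        hi = list(row)
        hi[j] = edges[k]
        lo = list(row)
        lo[j] = edges[k - 1]
        # Local effect: change in f across the bin, other predictors held
        # at their observed values, so no far extrapolation is needed.
        sums[k] += f(hi) - f(lo)
        counts[k] += 1
    ale = [0.0]
    for k in range(1, num_bins + 1):
        local = sums[k] / counts[k] if counts[k] else 0.0
        ale.append(ale[-1] + local)
    # Center so the count-weighted average effect is zero.
    mean = sum(ale[k] * counts[k] for k in range(1, num_bins + 1)) / len(data)
    return edges, [a - mean for a in ale]

rng = random.Random(1)
data = []
for _ in range(200):
    x0 = rng.uniform(0.0, 1.0)
    data.append((x0, x0 + rng.gauss(0.0, 0.1)))  # x1 strongly correlated with x0

f = lambda row: 3.0 * row[0] + row[1] ** 2  # toy "black box" model
edges, ale = ale_1d(f, data, j=0)
for e, a in zip(edges, ale):
    print(f"x0 = {e:.3f}  ALE = {a:+.3f}")
```

Because the toy model is additive, the estimated effect of `x0` rises with slope 3 between bin edges even though `x1` is strongly correlated with `x0`. The quadratic `x1` term cancels within each local difference, so it does not leak into the `x0` effect.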

Submitted 19 August, 2019; v1 submitted 26 December, 2016; originally announced December 2016.

Comments: The R package ALEPlot is available on CRAN. The new version contains refined definitions of ALE effects, a new illustrative example, theorems and proofs of asymptotic properties of ALE effects and estimators, and extra implementation details

arXiv:1509.06721 [pdf]

Designed Sampling from Large Databases for Controlled Trials

Authors: Liwen Ouyang, Daniel W. Apley, Sanjay Mehrotra

Abstract: The increasing prevalence of rich sources of data and the availability of electronic medical record databases and electronic registries opens tremendous opportunities for enhancing medical research. For example, controlled trials are ubiquitously used to investigate the effect of a medical treatment, perhaps dependent on a set of patient covariates, and traditional approaches have relied primarily on randomized patient sampling and allocation to treatment and control group. However, when covariate data for a large cohort group of patients have already been collected and are available in a database, one can potentially design a treatment/control sample and allocation that provides far better estimates of the covariate-dependent effects of the treatment. In this paper, we develop a new approach that uses optimal design of experiments (DOE) concepts to accomplish this objective. The approach selects the patients for the treatment and control samples upfront, based on their covariate values, in a manner that optimizes the information content in the data. For the optimal sample selection, we develop simple guidelines and an optimization algorithm that provides solutions that are substantially better than random sampling. Moreover, our approach causes no sampling bias in the estimated effects, for the same reason that DOE principles do not bias estimated effects. We test our method with a simulation study based on a testbed data set containing information on the effect of statins on low-density lipoprotein (LDL) cholesterol.

Submitted 22 September, 2015; originally announced September 2015.

arXiv:1303.0383 [pdf, other]

Local Gaussian process approximation for large computer experiments

Authors: Robert B. Gramacy, Daniel W. Apley

Abstract: We provide a new approach to approximate emulation of large computer experiments. By focusing expressly on desirable properties of the predictive equations, we derive a family of local sequential design schemes that dynamically define the support of a Gaussian process predictor based on a local subset of the data. We further derive expressions for fast sequential updating of all needed quantities as the local designs are built-up iteratively. Then we show how independent application of our local design strategy across the elements of a vast predictive grid facilitates a trivially parallel implementation. The end result is a global predictor able to take advantage of modern multicore architectures, while at the same time allowing for a nonstationary modeling feature as a bonus. We demonstrate our method on two examples utilizing designs sized in the thousands, and tens of thousands of data points. Comparisons are made to the method of compactly supported covariances.

Submitted 10 October, 2014; v1 submitted 2 March, 2013; originally announced March 2013.

Comments: 29 pages, 5 figures, 2 tables

Showing 1–17 of 17 results for author: Apley, D