Factor Importance Ranking and Selection using Total Indices

Authors: Chaofan Huang, V. Roshan Joseph

Abstract: Factor importance measures the impact of each feature on output prediction accuracy. Many existing works focus on the model-based importance, but an important feature in one learning algorithm may hold little significance in another model. Hence, a factor importance measure ought to characterize the feature's predictive potential without relying on a specific prediction algorithm. Such algorithm-agnostic importance is termed intrinsic importance in Williamson et al. (2023), but their estimator again requires model fitting. To bypass the modeling step, we present the equivalence between predictiveness potential and total Sobol' indices from global sensitivity analysis, and introduce a novel consistent estimator that can be directly estimated from noisy data. Integrating with forward selection and backward elimination gives rise to FIRST, Factor Importance Ranking and Selection using Total (Sobol') indices. Extensive simulations are provided to demonstrate the effectiveness of FIRST on regression and binary classification problems, and a clear advantage over the state-of-the-art methods.

Submitted 11 January, 2024; v1 submitted 1 January, 2024; originally announced January 2024.

arXiv:2312.05372 [pdf, other]

Rational Kriging

Authors: V. Roshan Joseph

Abstract: This article proposes a new kriging that has a rational form. It is shown that the generalized least squares estimate of the mean from rational kriging is much more well behaved than that from ordinary kriging. Parameter estimation and uncertainty quantification for rational kriging are proposed using a Gaussian process framework. Its potential applications in emulation and calibration of computer models are also discussed.

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2310.07953 [pdf, other]

Enhancing Sample Quality through Minimum Energy Importance Weights

Authors: Chaofan Huang, V. Roshan Joseph

Abstract: Importance sampling is a powerful tool for correcting the distributional mismatch in many statistical and machine learning problems, but in practice its performance is limited by the usage of simple proposals whose importance weights can be computed analytically. To address this limitation, Liu and Lee (2017) proposed a Black-Box Importance Sampling (BBIS) algorithm that computes the importance weights for arbitrary simulated samples by minimizing the kernelized Stein discrepancy. However, this requires knowing the score function of the target distribution, which is not easy to compute for many Bayesian problems. Hence, in this paper we propose another novel BBIS algorithm using minimum energy design, BBIS-MED, that requires only the unnormalized density function, which can be utilized as a post-processing step to improve the quality of Markov Chain Monte Carlo samples. We demonstrate the effectiveness and wide applicability of our proposed BBIS-MED algorithm on extensive simulations and a real-world Bayesian model calibration problem where the score function cannot be derived analytically.

Submitted 31 December, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

arXiv:2310.07016 [pdf, other]

Discovering the Unknowns: A First Step

Authors: V. Roshan Joseph, William E. Lewis, Henry S. Yuchi, Kathryn A. Maupin

Abstract: This article aims at discovering the unknown variables in the system through data analysis. The main idea is to use the time of data collection as a surrogate variable and try to identify the unknown variables by modeling gradual and sudden changes in the data. We use Gaussian process modeling and a sparse representation of the sudden changes to efficiently estimate the large number of parameters in the proposed statistical model. The method is tested on a realistic dataset generated using a one-dimensional implementation of a Magnetized Liner Inertial Fusion (MagLIF) simulation model and encouraging results are obtained.

Submitted 10 October, 2023; originally announced October 2023.

arXiv:2309.16492 [pdf, other]

Asset Bundling for Wind Power Forecasting

Authors: Hanyu Zhang, Mathieu Tanneau, Chaofan Huang, V. Roshan Joseph, Shangkun Wang, Pascal Van Hentenryck

Abstract: The growing penetration of intermittent, renewable generation in US power grids, especially wind and solar generation, results in increased operational uncertainty. In that context, accurate forecasts are critical, especially for wind generation, which exhibits large variability and is historically harder to predict. To overcome this challenge, this work proposes a novel Bundle-Predict-Reconcile (BPR) framework that integrates asset bundling, machine learning, and forecast reconciliation techniques. The BPR framework first learns an intermediate hierarchy level (the bundles), then predicts wind power at the asset, bundle, and fleet level, and finally reconciles all forecasts to ensure consistency. This approach effectively introduces an auxiliary learning task (predicting the bundle-level time series) to help the main learning tasks. The paper also introduces new asset-bundling criteria that capture the spatio-temporal dynamics of wind power time series. Extensive numerical experiments are conducted on an industry-size dataset of 283 wind farms in the MISO footprint. The experiments consider short-term and day-ahead forecasts, and evaluate a large variety of forecasting models that include weather predictions as covariates. The results demonstrate the benefits of BPR, which consistently and significantly improves forecast accuracy over baselines, especially at the fleet level.

Submitted 28 September, 2023; originally announced September 2023.

arXiv:2305.07202 [pdf, other]

doi 10.1080/00401706.2023.2231042

Sequential Designs for Filling Output Spaces

Authors: Shangkun Wang, Adam P. Generale, Surya R. Kalidindi, V. Roshan Joseph

Abstract: Space-filling designs are commonly used in computer experiments to fill the space of inputs so that the input-output relationship can be accurately estimated. However, in certain applications such as inverse design or feature-based modeling, the aim is to fill the response or feature space. In this article, we propose a new experimental design framework that aims to fill the space of the outputs (responses or features). The design is adaptive and model-free, and therefore is expected to be robust to different kinds of modeling choices and input-output relationships. Several examples are given to show the advantages of the proposed method over the traditional input space-filling designs.

Submitted 11 May, 2023; originally announced May 2023.

Comments: 36 pages, 12 figures

Journal ref: Technometrics (2023)

arXiv:2212.00941 [pdf, other]

doi 10.1287/ijds.2023.0028

Adaptive Exploration and Optimization of Materials Crystal Structures

Authors: Arvind Krishna, Huan Tran, Chaofan Huang, Rampi Ramprasad, V. Roshan Joseph

Abstract: A central problem of materials science is to determine whether a hypothetical material is stable without being synthesized, which is mathematically equivalent to a global optimization problem on a highly non-linear and multi-modal potential energy surface (PES). This optimization problem poses multiple outstanding challenges, including the exceedingly high dimensionality of the PES and the fact that the PES must be constructed from a reliable, sophisticated, parameter-free, and thus very expensive computational method, of which density functional theory (DFT) is an example. DFT is a quantum mechanics based method that can predict, among other things, the total potential energy of a given configuration of atoms. DFT, while accurate, is computationally expensive. In this work, we propose a novel expansion-exploration-exploitation framework to find the global minimum of the PES. Starting from a few atomic configurations, this "known" space is expanded to construct a big candidate set. The expansion begins in a non-adaptive manner, where new configurations are added without considering their potential energy. A novel feature of this step is that it tends to generate a space-filling design without the knowledge of the boundaries of the domain space. If needed, the non-adaptive expansion of the space of configurations is followed by adaptive expansion, where "promising regions" of the domain space (those with low energy configurations) are further expanded. Once a candidate set of configurations is obtained, it is simultaneously explored and exploited using Bayesian optimization to find the global minimum. The methodology is demonstrated using a problem of finding the most stable crystal structure of aluminum.

Submitted 1 December, 2022; originally announced December 2022.

Journal ref: INFORMS Journal on Data Science, 2023

arXiv:2209.13748 [pdf, other]

Conglomerate Multi-Fidelity Gaussian Process Modeling, with Application to Heavy-Ion Collisions

Authors: Yi Ji, Henry Shaowu Yuchi, Derek Soeder, J. -F. Paquet, Steffen A. Bass, V. Roshan Joseph, C. F. Jeff Wu, Simon Mak

Abstract: In an era where scientific experimentation is often costly, multi-fidelity emulation provides a powerful tool for predictive scientific computing. While there has been notable work on multi-fidelity modeling, existing models do not incorporate an important "conglomerate" property of multi-fidelity simulators, where the accuracies of different simulator components are controlled by different fidelity parameters. Such conglomerate simulators are widely encountered in complex nuclear physics and astrophysics applications. We thus propose a new CONglomerate multi-FIdelity Gaussian process (CONFIG) model, which embeds this conglomerate structure within a novel non-stationary covariance function. We show that the proposed CONFIG model can capture prior knowledge on the numerical convergence of conglomerate simulators, which allows for cost-efficient emulation of multi-fidelity systems. We demonstrate the improved predictive performance of CONFIG over state-of-the-art models in a suite of numerical experiments and two applications, the first for emulation of cantilever beam deflection and the second for emulating the evolution of the quark-gluon plasma, which was theorized to have filled the Universe shortly after the Big Bang.

Submitted 28 September, 2023; v1 submitted 27 September, 2022; originally announced September 2022.

arXiv:2202.03326 [pdf, other]

doi 10.1002/sam.11583

Optimal Ratio for Data Splitting

Authors: V. Roshan Joseph

Abstract: It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article we show that the optimal splitting ratio is $\sqrt{p}:1$, where $p$ is the number of parameters in a linear regression model that explains the data well.

Submitted 7 February, 2022; originally announced February 2022.

Journal ref: Statistical Analysis and Data Mining: The ASA Data Science Journal, 2022
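
As a rough illustration of the $\sqrt{p}:1$ ratio stated in the abstract above, the sketch below converts it into training and testing set sizes. The function name `split_sizes`, its arguments, and the rounding and clamping choices are assumptions made for this example only; they are not taken from the paper.

```python
import math


def split_sizes(n, p):
    """Return (n_train, n_test) for a train:test ratio of sqrt(p):1.

    n: total number of observations (hypothetical argument name).
    p: number of parameters in a linear regression model that
       explains the data well, as described in the abstract.
    """
    if n < 2 or p < 1:
        raise ValueError("need n >= 2 and p >= 1")
    root = math.sqrt(p)
    # A ratio of sqrt(p):1 means the training share is sqrt(p) / (sqrt(p) + 1).
    train_fraction = root / (root + 1.0)
    # Rounding and clamping are choices of this sketch, so that both
    # sets always contain at least one observation.
    n_train = min(max(round(n * train_fraction), 1), n - 1)
    return n_train, n - n_train


if __name__ == "__main__":
    print(split_sizes(1000, 9))
```

With `p = 9` the ratio is 3:1, so three quarters of the observations go to training.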

arXiv:2110.02927 [pdf, other]

doi 10.1002/sam.11574

Data Twinning

Authors: Akhil Vakayil, V. Roshan Joseph

Abstract: In this work, we develop a method named Twinning, for partitioning a dataset into statistically similar twin sets. Twinning is based on SPlit, a recently proposed model-independent method for optimally splitting a dataset into training and testing sets. Twinning is orders of magnitude faster than the SPlit algorithm, which makes it applicable to Big Data problems such as data compression. Twinning can also be used for generating multiple splits of a given dataset to aid divide-and-conquer procedures and $k$-fold cross validation.

Submitted 6 October, 2021; originally announced October 2021.

arXiv:2104.11963 [pdf, other]

doi 10.1007/s11222-021-10054-2

Constrained Minimum Energy Designs

Authors: Chaofan Huang, V. Roshan Joseph, Douglas M. Ray

Abstract: Space-filling designs are important in computer experiments, which are critical for building a cheap surrogate model that adequately approximates an expensive computer code. Many design construction techniques in the existing literature are only applicable for rectangular bounded space, but in real world applications, the input space can often be non-rectangular because of constraints on the input variables. One solution to generate designs in a constrained space is to first generate uniformly distributed samples in the feasible region, and then use them as the candidate set to construct the designs. Sequentially Constrained Monte Carlo (SCMC) is the state-of-the-art technique for candidate generation, but it still requires a large number of constraint evaluations, which is problematic especially when the constraints are expensive to evaluate. Thus, to reduce constraint evaluations and improve efficiency, we propose the Constrained Minimum Energy Design (CoMinED) that utilizes recent advances in deterministic sampling methods. Extensive simulation results on 15 benchmark problems with dimensions ranging from 2 to 13 are provided for demonstrating the improved performance of CoMinED over the existing methods.

Submitted 24 April, 2021; originally announced April 2021.

Comments: Submitted to Statistics and Computing

Journal ref: Stat Comput 31, 80 (2021)

arXiv:2012.13769 [pdf, other]

doi 10.1080/10618600.2022.2034637

Population Quasi-Monte Carlo

Authors: Chaofan Huang, V. Roshan Joseph, Simon Mak

Abstract: Monte Carlo methods are widely used for approximating complicated, multidimensional integrals for Bayesian inference. Population Monte Carlo (PMC) is an important class of Monte Carlo methods, which utilizes a population of proposals to generate weighted samples that approximate the target distribution. The generic PMC framework iterates over three steps: samples are simulated from a set of proposals, weights are assigned to such samples to correct for mismatch between the proposal and target distributions, and the proposals are then adapted via resampling from the weighted samples. When the target distribution is expensive to evaluate, the PMC has its computational limitation since the convergence rate is $\mathcal{O}(N^{-1/2})$. To address this, we propose in this paper a new Population Quasi-Monte Carlo (PQMC) framework, which integrates Quasi-Monte Carlo ideas within the sampling and adaptation steps of PMC. A key novelty in PQMC is the idea of importance support points resampling, a deterministic method for finding an "optimal" subsample from the weighted proposal samples. Moreover, within the PQMC framework, we develop an efficient covariance adaptation strategy for multivariate normal proposals. Lastly, a new set of correction weights is introduced for the weighted PMC estimator to improve the efficiency from the standard PMC estimator. We demonstrate the improved empirical convergence of PQMC over PMC in extensive numerical simulations and a friction drilling application.

Submitted 26 December, 2020; originally announced December 2020.

Comments: Submitted to Journal of Computational and Graphical Statistics

Journal ref: Journal of Computational and Graphical Statistics (2022)

arXiv:2012.10945 [pdf, other]

doi 10.1080/00401706.2021.1921037

SPlit: An Optimal Method for Data Splitting

Authors: V. Roshan Joseph, Akhil Vakayil

Abstract: In this article we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of Support Points (SP), which was initially developed for finding the optimal representative points of a continuous distribution. We adapt SP for subsampling from a dataset using a sequential nearest neighbor algorithm. We also extend SP to deal with categorical variables so that SPlit can be applied to both regression and classification problems. The implementation of SPlit on real datasets shows substantial improvement in the worst-case testing performance for several modeling methods compared to the commonly used random splitting procedure.

Submitted 19 March, 2021; v1 submitted 20 December, 2020; originally announced December 2020.

arXiv:2008.00547 [pdf, other]

doi 10.1080/00224065.2021.1930618

Robust Experimental Designs for Model Calibration

Authors: Arvind Krishna, V. Roshan Joseph, Shan Ba, William A. Brenneman, William R. Myers

Abstract: A computer model can be used for predicting an output only after specifying the values of some unknown physical constants known as calibration parameters. The unknown calibration parameters can be estimated from real data by conducting physical experiments. This paper presents an approach to optimally design such a physical experiment. The problem of optimally designing a physical experiment, using a computer model, is similar to the problem of finding an optimal design for fitting nonlinear models. However, the problem is more challenging than the existing work on nonlinear optimal design because of the possibility of model discrepancy, that is, the computer model may not be an accurate representation of the true underlying model. Therefore, we propose an optimal design approach that is robust to potential model discrepancies. We show that our designs are better than the commonly used physical experimental designs that do not make use of the information contained in the computer model and other nonlinear optimal designs that ignore potential model discrepancies. We illustrate our approach using a toy example and a real example from industry.

Submitted 2 August, 2020; originally announced August 2020.

Comments: 25 pages, 10 figures

arXiv:1910.05452 [pdf, other]

Adaptive design for Gaussian process regression under censoring

Authors: Jialei Chen, Simon Mak, V. Roshan Joseph, Chuck Zhang

Abstract: A key objective in engineering problems is to predict an unknown experimental surface over an input domain. In complex physical experiments, this may be hampered by response censoring, which results in a significant loss of information. For such problems, experimental design is paramount for maximizing predictive power using a small number of expensive experimental runs. To tackle this, we propose a novel adaptive design method, called the integrated censored mean-squared error (ICMSE) method. The ICMSE method first estimates the posterior probability of a new observation being censored, then adaptively chooses design points that minimize predictive uncertainty under censoring. Adopting a Gaussian process regression model with product correlation function, the proposed ICMSE criterion is easy to evaluate, which allows for efficient design optimization. We demonstrate the effectiveness of the ICMSE design in two real-world applications on surgical planning and wafer manufacturing.

Submitted 25 June, 2021; v1 submitted 11 October, 2019; originally announced October 2019.

Journal ref: Annals of Applied Statistics, 2021

arXiv:1910.01754 [pdf, other]

doi 10.1080/00401706.2020.1801255

Function-on-function kriging, with applications to 3D printing of aortic tissues

Authors: Jialei Chen, Simon Mak, V. Roshan Joseph, Chuck Zhang

Abstract: 3D-printed medical prototypes, which use synthetic metamaterials to mimic biological tissue, are becoming increasingly important in urgent surgical applications. However, the mimicking of tissue mechanical properties via 3D-printed metamaterial can be difficult and time-consuming, due to the functional nature of both inputs (metamaterial structure) and outputs (mechanical response curve). To deal with this, we propose a novel function-on-function kriging model for efficient emulation and tissue-mimicking optimization. For functional inputs, a key novelty of our model is the spectral-distance (SpeD) correlation function, which captures important spectral differences between two functional inputs. Dependencies for functional outputs are then modeled via a co-kriging framework. We further adopt shrinkage priors on both the input spectra and the output co-kriging covariance matrix, which allows the emulator to learn and incorporate important physics (e.g., dominant input frequencies, output curve properties). Finally, we demonstrate the effectiveness of the proposed SpeD emulator in a real-world study on mimicking human aortic tissue, and show that it can provide quicker and more accurate tissue-mimicking performance compared to existing methods in the medical literature.

Submitted 1 July, 2020; v1 submitted 3 October, 2019; originally announced October 2019.

Journal ref: Technometrics, 2020

arXiv:1712.09074 [pdf, other]

doi 10.1080/00401706.2018.1451390

Space-Filling Designs for Robustness Experiments

Authors: V. Roshan Joseph, Li Gu, Shan Ba, William R. Myers

Abstract: To identify the robust settings of the control factors, it is very important to understand how they interact with the noise factors. In this article, we propose space-filling designs for computer experiments that are more capable of accurately estimating the control-by-noise interactions. Moreover, the existing space-filling designs focus on uniformly distributing the points in the design space, which are not suitable for noise factors because they usually follow non-uniform distributions such as the normal distribution. This would suggest placing more points in the regions with high probability mass. However, noise factors also tend to have a smooth relationship with the response and therefore, placing more points towards the tails of the distribution is also useful for accurately estimating the relationship. These two opposing effects make the experimental design methodology a challenging problem. We propose optimal and computationally efficient solutions to this problem and demonstrate their advantages using simulated examples and a real industry example involving a manufacturing packing line.

Submitted 25 December, 2017; originally announced December 2017.

MSC Class: 62K25

arXiv:1712.08929 [pdf, other]

doi 10.1080/00401706.2018.1552203

Deterministic Sampling of Expensive Posteriors Using Minimum Energy Designs

Authors: V. Roshan Joseph, Dianpeng Wang, Li Gu, Shiji Lv, Rui Tuo

Abstract: Markov chain Monte Carlo (MCMC) methods require a large number of samples to approximate a posterior distribution, which can be costly when the likelihood or prior is expensive to evaluate. The number of samples can be reduced if we can avoid repeated samples and those that are close to each other. This is the idea behind deterministic sampling methods such as Quasi-Monte Carlo (QMC). However, the existing QMC methods aim at sampling from a uniform hypercube, which can miss the high probability regions of the posterior distribution and thus the approximation can be poor. Minimum energy design (MED) is a recently proposed deterministic sampling method, which makes use of the posterior evaluations to obtain a weighted space-filling design in the region of interest. However, the existing implementation of MED is inefficient because it requires several global optimizations and thus numerous evaluations of the posterior. In this article, we develop an efficient algorithm that can generate MED with few posterior evaluations. We also make several improvements to the MED criterion to make it perform better in high dimensions. The advantages of MED over MCMC and QMC are illustrated using an example of calibrating a friction drilling process.

Submitted 24 December, 2017; originally announced December 2017.

MSC Class: 62K99

arXiv:1708.06897 [pdf, other]

Projected support points: a new method for high-dimensional data reduction

Authors: Simon Mak, V. Roshan Joseph

Abstract: In an era where big and high-dimensional data is readily available, data scientists are inevitably faced with the challenge of reducing this data for expensive downstream computation or analysis. To this end, we present here a new method for reducing high-dimensional big data into a representative point set, called projected support points (PSPs). A key ingredient in our method is the so-called sparsity-inducing (SpIn) kernel, which encourages the preservation of low-dimensional features when reducing high-dimensional data. We begin by introducing a unifying theoretical framework for data reduction, connecting PSPs with fundamental sampling principles from experimental design and Quasi-Monte Carlo. Through this framework, we then derive sparsity conditions under which the curse-of-dimensionality in data reduction can be lifted for our method. Next, we propose two algorithms for one-shot and sequential reduction via PSPs, both of which exploit big data subsampling and majorization-minimization for efficient optimization. Finally, we demonstrate the practical usefulness of PSPs in two real-world applications, the first for data reduction in kernel learning, and the second for reducing Markov Chain Monte Carlo (MCMC) chains.

Submitted 2 June, 2018; v1 submitted 23 August, 2017; originally announced August 2017.

arXiv:1611.07911 [pdf, other]

An efficient surrogate model for emulation and physics extraction of large eddy simulations

Authors: Simon Mak, Chih-Li Sung, Xingjian Wang, Shiang-Ting Yeh, Yu-Hung Chang, V. Roshan Joseph, Vigor Yang, C. F. Jeff Wu

Abstract: In the quest for advanced propulsion and power-generation systems, high-fidelity simulations are too computationally expensive to survey the desired design space, and a new design methodology is needed that combines engineering physics, computer simulations and statistical modeling. In this paper, we propose a new surrogate model that provides efficient prediction and uncertainty quantification of turbulent flows in swirl injectors with varying geometries, devices commonly used in many engineering applications. The novelty of the proposed method lies in the incorporation of known physical properties of the fluid flow as simplifying assumptions for the statistical model. In view of the massive simulation data at hand, which is on the order of hundreds of gigabytes, these assumptions allow for accurate flow predictions in around an hour of computation time. In contrast, existing flow emulators which forgo such simplifications may require more computation time for training and prediction than is needed for conducting the simulation itself. Moreover, by accounting for coupling mechanisms between flow variables, the proposed model can jointly reduce prediction uncertainty and extract useful flow physics, which can then be used to guide further investigations.

Submitted 26 May, 2017; v1 submitted 23 November, 2016; originally announced November 2016.

Comments: Submitted to JASA A&CS

arXiv:1611.00203 [pdf, ps, other]

Orthogonal Gaussian process models

Authors: Matthew Plumlee, V. Roshan Joseph

Abstract: Gaussian process models are widely adopted for nonparametric/semi-parametric modeling. Identifiability issues occur when the mean model contains polynomials with unknown coefficients. Though the resulting prediction is unaffected, this leads to poor estimation of the coefficients in the mean model, and thus the estimated mean model loses interpretability. This paper introduces a new Gaussian process model whose stochastic part is orthogonal to the mean part to address this issue. This paper also discusses applications to multi-fidelity simulations using data examples.

Submitted 1 November, 2016; originally announced November 2016.

arXiv:1609.01811 [pdf, other]

Support points

Authors: Simon Mak, V. Roshan Joseph

Abstract: This paper introduces a new way to compact a continuous probability distribution $F$ into a set of representative points called support points. These points are obtained by minimizing the energy distance, a statistical potential measure initially proposed by Székely and Rizzo (2004) for testing goodness-of-fit. The energy distance has two appealing features. First, its distance-based structure allows us to exploit the duality between powers of the Euclidean distance and its Fourier transform for theoretical analysis. Using this duality, we show that support points converge in distribution to $F$, and enjoy an improved error rate to Monte Carlo for integrating a large class of functions. Second, the minimization of the energy distance can be formulated as a difference-of-convex program, which we manipulate using two algorithms to efficiently generate representative point sets. In simulation studies, support points provide improved integration performance to both Monte Carlo and a specific Quasi-Monte Carlo method. Two important applications of support points are then highlighted: (a) as a way to quantify the propagation of uncertainty in expensive simulations, and (b) as a method to optimally compact Markov chain Monte Carlo (MCMC) samples in Bayesian computation.

Submitted 9 September, 2018; v1 submitted 6 September, 2016; originally announced September 2016.

Comments: Accepted, Annals of Statistics

MSC Class: 62E17
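
Illustration for the entry above (not taken from the paper): the abstract describes support points as minimizers of the energy distance of Székely and Rizzo. The sketch below only evaluates the standard sample version of that distance between a candidate point set and a reference sample; it does not implement the paper's difference-of-convex optimization. The function names `pairwise_mean_distance` and `energy_distance` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def pairwise_mean_distance(a, b):
    """Mean Euclidean distance over all pairs (a_i, b_j); a and b are (n, p) and (m, p) arrays."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).mean()


def energy_distance(points, sample):
    """Sample energy distance: 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||.

    `points` is the candidate representative set and `sample` is a reference
    sample standing in for the target distribution F (an assumption of this
    sketch; the paper works with F directly).
    """
    points = np.asarray(points, dtype=float)
    sample = np.asarray(sample, dtype=float)
    cross = pairwise_mean_distance(points, sample)
    within_points = pairwise_mean_distance(points, points)
    within_sample = pairwise_mean_distance(sample, sample)
    return 2.0 * cross - within_points - within_sample
```

A smaller value means the point set mimics the reference sample more closely; the value is zero when the two sets coincide.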

arXiv:1602.03938 [pdf, other]

Minimax and minimax projection designs using clustering

Authors: Simon Mak, V. Roshan Joseph

Abstract: Minimax designs provide a uniform coverage of a design space $\mathcal{X} \subseteq \mathbb{R}^p$ by minimizing the maximum distance from any point in this space to its nearest design point. Although minimax designs have many useful applications, e.g., for optimal sensor allocation or as space-filling designs for computer experiments, there has been little work in developing algorithms for generating these designs, due to its computational complexity. In this paper, a new hybrid algorithm combining particle swarm optimization and clustering is proposed for generating minimax designs on any convex and bounded design space. The computation time of this algorithm scales linearly in dimension $p$, meaning our method can generate minimax designs efficiently for high-dimensional regions. Simulation studies and a real-world example show that the proposed algorithm provides improved minimax performance over existing methods on a variety of design spaces. Finally, we introduce a new type of experimental design called a minimax projection design, and show that this proposed design provides better minimax performance on projected subspaces of $\mathcal{X}$ compared to existing designs. An efficient implementation of these algorithms can be found in the R package minimaxdesign.

Submitted 28 October, 2016; v1 submitted 11 February, 2016; originally announced February 2016.

Comments: Under revision, Journal of Computational and Graphical Statistics (JCGS)
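
Illustration for the entry above (not taken from the paper): the abstract defines the minimax criterion as the maximum distance from any point in the design space to its nearest design point. The sketch below approximates that criterion on the unit hypercube by brute force over a regular grid, which is only practical in low dimensions; it does not implement the paper's particle swarm and clustering algorithm, and it is unrelated to the API of the R package minimaxdesign. The function name `minimax_distance` and the `grid_size` parameter are hypothetical.

```python
import itertools

import numpy as np


def minimax_distance(design, grid_size=21):
    """Approximate max over x in [0, 1]^p of the distance from x to its nearest design point.

    The maximum is taken over a regular grid with `grid_size` points per axis,
    so the result is a lower bound on the true criterion.
    """
    design = np.asarray(design, dtype=float)
    p = design.shape[1]
    axis = np.linspace(0.0, 1.0, grid_size)
    # All grid points of the unit hypercube, shape (grid_size**p, p).
    grid = np.array(list(itertools.product(axis, repeat=p)))
    # Distance from every grid point to every design point.
    dists = np.sqrt(((grid[:, None, :] - design[None, :, :]) ** 2).sum(axis=-1))
    # Nearest design point per grid point, then the worst-covered grid point.
    return dists.min(axis=1).max()
```

A design with a smaller value covers the space more uniformly, which is the quantity a minimax design seeks to minimize.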

arXiv:1301.2503 [pdf, ps, other]

doi 10.1214/12-AOAS570

Composite Gaussian process models for emulating expensive functions

Authors: Shan Ba, V. Roshan Joseph

Abstract: A new type of nonstationary Gaussian process model is developed for approximating computationally expensive functions. The new model is a composite of two Gaussian processes, where the first one captures the smooth global trend and the second one models local details. The new predictor also incorporates a flexible variance model, which makes it more capable of approximating surfaces with varying volatility. Compared to the commonly used stationary Gaussian process model, the new predictor is numerically more stable and can more accurately approximate complex surfaces when the experimental design is sparse. In addition, the new model can also improve the prediction intervals by quantifying the change of local variability associated with the response. Advantages of the new predictor are demonstrated using several examples.

Submitted 11 January, 2013; originally announced January 2013.

Comments: Published at http://dx.doi.org/10.1214/12-AOAS570 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS570

Journal ref: Annals of Applied Statistics 2012, Vol. 6, No. 4, 1838-1860

arXiv:1211.1592 [pdf, other]

Analysis of Computer Experiments with Functional Response

Authors: Ying Hung, V. Roshan Joseph, Shreyes N. Melkote

Abstract: This paper is motivated by a computer experiment conducted for optimizing residual stresses in the machining of metals. Although kriging is widely used in the analysis of computer experiments, it cannot be easily applied to model the residual stresses because they are obtained as a profile. The high dimensionality caused by this functional response introduces severe computational challenges in kriging. It is well known that if the functional data are observed on a regular grid, the computations can be simplified using an application of Kronecker products. However, the case of an irregular grid is quite complex. In this paper, we develop a Gibbs sampling-based expectation maximization algorithm, which converts the irregularly spaced data into a regular grid so that the Kronecker product-based approach can be employed for efficiently fitting a kriging model to the functional data.

Submitted 7 November, 2012; originally announced November 2012.

arXiv:1011.0610 [pdf, ps, other]

doi 10.1214/09-AOAS254

Structured variable selection and estimation

Authors: Ming Yuan, V. Roshan Joseph, Hui Zou

Abstract: In linear regression problems with related predictors, it is desirable to do variable selection and estimation by maintaining the hierarchical or structural relationships among predictors. In this paper we propose non-negative garrote methods that can naturally incorporate such relationships defined through effect heredity principles or marginality principles. We show that the methods are very easy to compute and enjoy nice theoretical properties. We also show that the methods can be easily extended to deal with more general regression problems such as generalized linear models. Simulations and real examples are used to illustrate the merits of the proposed methods.

Submitted 2 November, 2010; originally announced November 2010.

Comments: Published at http://dx.doi.org/10.1214/09-AOAS254 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS254

Journal ref: Annals of Applied Statistics 2009, Vol. 3, No. 4, 1738-1757

Showing 1–26 of 26 results for author: Joseph, V R