-
High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates
Authors:
Janick Weberpals,
Pamela A. Shaw,
Kueiyu Joshua Lin,
Richard Wyss,
Joseph M Plasek,
Li Zhou,
Kerry Ngan,
Thomas DeRamus,
Sudha R. Raman,
Bradley G. Hammill,
Hana Lee,
Sengwee Toh,
John G. Connolly,
Kimberly J. Dandreo,
Fang Tian,
Wei Liu,
Jie Li,
José J. Hernández-Muñoz,
Sebastian Schneeweiss,
Rishi J. Desai
Abstract:
Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets and we benchmarked HDMI approaches against a baseline imputation and complete case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings led to an improvement in the efficiency displaying the lowest root-mean-squared-error (0.173) and coverage (94%). NLP-derived AC alone did not perform better than baseline MI. HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors.
Submitted 17 May, 2024;
originally announced May 2024.
-
Sequential good lattice point sets for computer experiments
Authors:
Xue-Ru Zhang,
Min-Qian Liu,
Dennis K. J. Lin,
Yong-Dao Zhou
Abstract:
Sequential Latin hypercube designs have recently received great attention for computer experiments. Much of the work has been restricted to invariant spaces. The related systematic construction methods are inflexible while algorithmic methods are ineffective for large designs. For such designs in space contraction, systematic construction methods have not been investigated yet. This paper proposes a new method for constructing sequential Latin hypercube designs via good lattice point sets in a variety of experimental spaces. These designs are called sequential good lattice point sets. Moreover, we provide fast and efficient approaches for identifying the (nearly) optimal sequential good lattice point sets under a given criterion. Combining with the linear level permutation technique, we obtain a class of asymptotically optimal sequential Latin hypercube designs in invariant spaces where the $L_1$-distance in each stage is either optimal or asymptotically optimal. Numerical results demonstrate that the sequential good lattice point set has a better space-filling property than the existing sequential Latin hypercube designs in the invariant space. It is also shown that the sequential good lattice point sets have less computational complexity and more adaptability.
Submitted 16 May, 2023; v1 submitted 11 May, 2023;
originally announced May 2023.
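As a rough illustration of the building block named in this abstract, the following Python sketch generates a good lattice point set and applies a linear level permutation. It assumes the standard construction x_ij = (i * h_j) mod n with every generator h_j coprime to n. The function names are hypothetical, and this is not the authors' sequential construction or their search for (nearly) optimal sets.

```python
from itertools import combinations
from math import gcd


def good_lattice_point_set(n, generators):
    """Return an n-run design whose entry (i, j) is (i * h_j) mod n, i = 1..n."""
    for h in generators:
        if gcd(h, n) != 1:
            # A generator sharing a factor with n would repeat levels in its column.
            raise ValueError(f"generator {h} is not coprime to n={n}")
    return [[(i * h) % n for h in generators] for i in range(1, n + 1)]


def linear_level_permutation(design, n, shifts):
    """Shift column j by shifts[j] modulo n; each column stays a permutation."""
    return [[(x + u) % n for x, u in zip(row, shifts)] for row in design]


def min_l1_distance(design):
    """Smallest L1-distance between any two distinct runs of the design."""
    return min(
        sum(abs(a - b) for a, b in zip(r1, r2))
        for r1, r2 in combinations(design, 2)
    )


design = good_lattice_point_set(7, [1, 2, 3])
shifted = linear_level_permutation(design, 7, [0, 3, 5])
```

Each column of `design` and `shifted` is a permutation of 0, ..., 6, so both are Latin hypercube designs; `min_l1_distance` can be used to compare candidate shifts under an $L_1$-distance criterion.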
-
A Bayesian Robust Regression Method for Corrupted Data Reconstruction
Authors:
Zheyi Fan,
Zhaohui Li,
Jingyan Wang,
Dennis K. J. Lin,
Xiao Xiong,
Qingpei Hu
Abstract:
Because of the widespread existence of noise and data corruption, recovering the true regression parameters with a certain proportion of corrupted response variables is an essential task. Methods to overcome this problem often involve robust least-squares regression, but few methods perform well when confronted with severe adaptive adversarial attacks. In many applications, prior knowledge is often available from historical data or engineering experience, and by incorporating prior information into a robust regression method, we develop an effective robust regression method that can resist adaptive adversarial attacks. First, we propose the novel TRIP (hard Thresholding approach to Robust regression with sImple Prior) algorithm, which improves the breakdown point when facing adaptive adversarial attacks. Then, to improve the robustness and reduce the estimation error caused by the inclusion of priors, we use the idea of Bayesian reweighting to construct the more robust BRHT (robust Bayesian Reweighting regression via Hard Thresholding) algorithm. We prove the theoretical convergence of the proposed algorithms under mild conditions, and extensive experiments show that under different types of dataset attacks, our algorithms outperform other benchmark ones. Finally, we apply our methods to a data-recovery problem in a real-world application involving a space solar array, demonstrating their good applicability.
Submitted 8 January, 2023; v1 submitted 24 December, 2022;
originally announced December 2022.
-
Statistical Monitoring of the Covariance Matrix in Multivariate Processes: A Literature Review
Authors:
Mohsen Ebadi,
Shoja'eddin Chenouri,
Dennis K. J. Lin,
Stefan H. Steiner
Abstract:
Monitoring several correlated quality characteristics of a process is common in modern manufacturing and service industries. Although a lot of attention has been paid to monitoring the multivariate process mean, not many control charts are available for monitoring the covariance matrix. This paper presents a comprehensive overview of the literature on control charts for monitoring the covariance matrix in a multivariate statistical process monitoring (MSPM) framework. It classifies the research that has previously appeared in the literature. We highlight the challenging areas for research and provide some directions for future research.
Submitted 14 April, 2021; v1 submitted 14 February, 2020;
originally announced February 2020.
-
Interval-valued Data Prediction via Regularized Artificial Neural Network
Authors:
Zebin Yang,
Dennis K. J. Lin,
Aijun Zhang
Abstract:
A regularized artificial neural network (RANN) is proposed for interval-valued data prediction. The ANN model is selected due to its powerful capability in fitting linear and nonlinear functions. To meet the mathematical coherence requirement for an interval (i.e., the predicted lower bounds should not cross over their upper bounds), a soft non-crossing regularizer is introduced to the interval-valued ANN model. We conduct extensive experiments based on both simulation datasets and real-life datasets, and compare the proposed RANN method with multiple traditional models, including the linear constrained center and range method (CCRM), the least absolute shrinkage and selection operator-based interval-valued regression method (Lasso-IR), the nonlinear interval kernel regression (IKR), the interval multi-layer perceptron (iMLP) and the multi-output support vector regression (MSVR). Experimental results show that the proposed RANN model is an effective tool for interval-valued prediction tasks with high prediction accuracy.
Submitted 21 August, 2018;
originally announced August 2018.
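The abstract does not state the form of the soft non-crossing regularizer. The Python sketch below shows one plausible version, assuming a squared hinge penalty on the amount by which a predicted lower bound exceeds its upper bound. The function names, the squared hinge form, and the weight `lam` are illustrative assumptions, not the authors' definition.

```python
def non_crossing_penalty(lower, upper):
    """Mean squared amount by which predicted lower bounds exceed upper bounds."""
    if len(lower) != len(upper):
        raise ValueError("lower and upper must have the same length")
    # max(0, lo - up) is zero for a coherent interval and positive when it crosses.
    return sum(max(0.0, lo - up) ** 2 for lo, up in zip(lower, upper)) / len(lower)


def regularized_loss(fit_loss, lower, upper, lam=1.0):
    """Add the crossing penalty, weighted by lam (assumed), to a fitting loss."""
    return fit_loss + lam * non_crossing_penalty(lower, upper)


coherent = non_crossing_penalty([1.0, 2.0], [2.0, 3.0])
crossed = non_crossing_penalty([3.0], [1.0])
```

Because the penalty is zero whenever every lower bound sits at or below its upper bound, it only influences training on predictions that cross, which is what makes the constraint soft rather than hard.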
-
Design of Order-of-Addition Experiments
Authors:
Jiayu Peng,
Rahul Mukerjee,
Dennis K. J. Lin
Abstract:
In an order-of-addition experiment, each treatment is a permutation of m components. It is often unaffordable to test all the m! treatments, and the design problem arises. We consider a model that incorporates the order of each pair of components and can also account for the distance between the two components in every such pair. Under this model, the optimality of the uniform design measure is established, via the approximate theory, for a broad range of criteria. Coupled with an eigen-analysis, this result serves as a benchmark that paves the way for assessing the efficiency and robustness of any exact design. The closed-form construction of a class of robust optimal fractional designs is then explored and illustrated.
Submitted 12 May, 2018;
originally announced May 2018.
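To make the pairwise terms in this abstract concrete, the Python sketch below encodes one permutation of m components by the order of each pair and by the distance between the two components of each pair. The +1/-1 coding and the function name are assumptions chosen for illustration; the abstract does not specify how the model parameterizes these quantities.

```python
from itertools import combinations


def pairwise_order_and_distance(perm):
    """Map each pair (j, k), j < k, to (order sign, positional distance)."""
    # Position of each component in the order of addition.
    pos = {component: index for index, component in enumerate(perm)}
    features = {}
    for j, k in combinations(sorted(perm), 2):
        sign = 1 if pos[j] < pos[k] else -1  # +1 when j is added before k
        features[(j, k)] = (sign, abs(pos[j] - pos[k]))
    return features


features = pairwise_order_and_distance((2, 0, 1))
```

For the order (2, 0, 1), component 0 precedes component 1 by one position, while component 2 precedes both, so the pairs (0, 2) and (1, 2) receive sign -1 with distances 1 and 2. A permutation of m components yields m(m-1)/2 such pairs, far fewer than the m! possible treatments.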
-
Asymptotics of nonparametric L-1 regression models with dependent data
Authors:
Zhibiao Zhao,
Ying Wei,
Dennis K. J. Lin
Abstract:
We investigate asymptotic properties of least-absolute-deviation or median quantile estimates of the location and scale functions in nonparametric regression models with dependent data from multiple subjects. Under a general dependence structure that allows for longitudinal data and some spatially correlated data, we establish uniform Bahadur representations for the proposed median quantile estimates. The obtained Bahadur representations provide deep insights into the asymptotic behavior of the estimates. Our main theoretical development is based on studying the modulus of continuity of kernel weighted empirical process through a coupling argument. Progesterone data is used for an illustration.
Submitted 4 July, 2014;
originally announced July 2014.
-
Uniform fractional factorial designs
Authors:
Yu Tang,
Hongquan Xu,
Dennis K. J. Lin
Abstract:
The minimum aberration criterion has been frequently used in the selection of fractional factorial designs with nominal factors. For designs with quantitative factors, however, level permutation of factors could alter their geometrical structures and statistical properties. In this paper uniformity is used to further distinguish fractional factorial designs, besides the minimum aberration criterion. We show that minimum aberration designs have low discrepancies on average. An efficient method for constructing uniform minimum aberration designs is proposed and optimal designs with 27 and 81 runs are obtained for practical use. These designs have good uniformity and are effective for studying quantitative factors.
Submitted 5 June, 2012;
originally announced June 2012.
-
Profile control charts based on nonparametric $L$-1 regression methods
Authors:
Ying Wei,
Zhibiao Zhao,
Dennis K. J. Lin
Abstract:
Classical statistical process control often relies on univariate characteristics. In many contemporary applications, however, the quality of products must be characterized by some functional relation between a response variable and its explanatory variables. Monitoring such functional profiles has been a rapidly growing field due to increasing demands. This paper develops a novel nonparametric $L$-1 location-scale model to screen the shapes of profiles. The model is built on three basic elements: location shifts, local shape distortions, and overall shape deviations, which are quantified by three individual metrics. The proposed approach is applied to the previously analyzed vertical density profile data, leading to some interesting insights.
Submitted 21 March, 2012;
originally announced March 2012.
-
On construction of optimal mixed-level supersaturated designs
Authors:
Fasheng Sun,
Dennis K. J. Lin,
Min-Qian Liu
Abstract:
Supersaturated design (SSD) has received much recent interest because of its potential in factor screening experiments. In this paper, we provide equivalent conditions for two columns to be fully aliased and consequently propose methods for constructing $E(f_{\mathrm{NOD}})$- and $\chi^2$-optimal mixed-level SSDs without fully aliased columns, via equidistant designs and difference matrices. The methods can be easily performed and many new optimal mixed-level SSDs have been obtained. Furthermore, it is proved that the nonorthogonality between columns of the resulting design is well controlled by the source designs. A rather complete list of newly generated optimal mixed-level SSDs is tabulated for practical use.
Submitted 19 May, 2011;
originally announced May 2011.