Search | arXiv e-print repository

arXiv:2405.10925 [pdf]

High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates

Authors: Janick Weberpals, Pamela A. Shaw, Kueiyu Joshua Lin, Richard Wyss, Joseph M Plasek, Li Zhou, Kerry Ngan, Thomas DeRamus, Sudha R. Raman, Bradley G. Hammill, Hana Lee, Sengwee Toh, John G. Connolly, Kimberly J. Dandreo, Fang Tian, Wei Liu, Jie Li, José J. Hernández-Muñoz, Sebastian Schneeweiss, Rishi J. Desai

Abstract: Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from… ▽ More Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets and we benchmarked HDMI approaches against a baseline imputation and complete case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings led to an improvement in the efficiency displaying the lowest root-mean-squared-error (0.173) and coverage (94%). NLP-derived AC alone did not perform better than baseline MI. HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors. △ Less

Submitted 17 May, 2024; originally announced May 2024.

arXiv:2305.06578

doi 10.1007/s11425-021-2087-2

Sequential good lattice point sets for computer experiments

Authors: Xue-Ru Zhang, Min-Qian Liu, Dennis K. J. Lin, Yong-Dao Zhou

Abstract: Sequential Latin hypercube designs have recently received great attention for computer experiments. Much of the work has been restricted to invariant spaces. The related systematic construction methods are inflexible while algorithmic methods are ineffective for large designs. For such designs in space contraction, systematic construction methods have not been investigated yet. This paper proposes… ▽ More Sequential Latin hypercube designs have recently received great attention for computer experiments. Much of the work has been restricted to invariant spaces. The related systematic construction methods are inflexible while algorithmic methods are ineffective for large designs. For such designs in space contraction, systematic construction methods have not been investigated yet. This paper proposes a new method for constructing sequential Latin hypercube designs via good lattice point sets in a variety of experimental spaces. These designs are called sequential good lattice point sets. Moreover, we provide fast and efficient approaches for identifying the (nearly) optimal sequential good lattice point sets under a given criterion. Combining with the linear level permutation technique, we obtain a class of asymptotically optimal sequential Latin hypercube designs in invariant spaces where the $L_1$-distance in each stage is either optimal or asymptotically optimal. Numerical results demonstrate that the sequential good lattice point set has a better space-filling property than the existing sequential Latin hypercube designs in the invariant space. It is also shown that the sequential good lattice point sets have less computational complexity and more adaptability. △ Less

Submitted 16 May, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

Comments: there was an error for rank of authors

MSC Class: 62K15; 62K05

Journal ref: SCIENCE CHINA Mathematics (2023)

arXiv:2212.12787 [pdf, other]

A Bayesian Robust Regression Method for Corrupted Data Reconstruction

Authors: Zheyi Fan, Zhaohui Li, **gyan Wang, Dennis K. J. Lin, Xiao Xiong, Qingpei Hu

Abstract: Because of the widespread existence of noise and data corruption, recovering the true regression parameters with a certain proportion of corrupted response variables is an essential task. Methods to overcome this problem often involve robust least-squares regression, but few methods perform well when confronted with severe adaptive adversarial attacks. In many applications, prior knowledge is ofte… ▽ More Because of the widespread existence of noise and data corruption, recovering the true regression parameters with a certain proportion of corrupted response variables is an essential task. Methods to overcome this problem often involve robust least-squares regression, but few methods perform well when confronted with severe adaptive adversarial attacks. In many applications, prior knowledge is often available from historical data or engineering experience, and by incorporating prior information into a robust regression method, we develop an effective robust regression method that can resist adaptive adversarial attacks. First, we propose the novel TRIP (hard Thresholding approach to Robust regression with sImple Prior) algorithm, which improves the breakdown point when facing adaptive adversarial attacks. Then, to improve the robustness and reduce the estimation error caused by the inclusion of priors, we use the idea of Bayesian reweighting to construct the more robust BRHT (robust Bayesian Reweighting regression via Hard Thresholding) algorithm. We prove the theoretical convergence of the proposed algorithms under mild conditions, and extensive experiments show that under different types of dataset attacks, our algorithms outperform other benchmark ones. Finally, we apply our methods to a data-recovery problem in a real-world application involving a space solar array, demonstrating their good applicability. △ Less

Submitted 8 January, 2023; v1 submitted 24 December, 2022; originally announced December 2022.

Comments: 22 pages

arXiv:2002.06159 [pdf, ps, other]

Statistical Monitoring of the Covariance Matrix in Multivariate Processes: A Literature Review

Authors: Mohsen Ebadi, Shoja'eddin Chenouri, Dennis K. J. Lin, Stefan H. Steiner

Abstract: Monitoring several correlated quality characteristics of a process is common in modern manufacturing and service industries. Although a lot of attention has been paid to monitoring the multivariate process mean, not many control charts are available for monitoring the covariance matrix. This paper presents a comprehensive overview of the literature on control charts for monitoring the covariance m… ▽ More Monitoring several correlated quality characteristics of a process is common in modern manufacturing and service industries. Although a lot of attention has been paid to monitoring the multivariate process mean, not many control charts are available for monitoring the covariance matrix. This paper presents a comprehensive overview of the literature on control charts for monitoring the covariance matrix in a multivariate statistical process monitoring (MSPM) framework. It classifies the research that has previously appeared in the literature. We highlight the challenging areas for research and provide some directions for future research. △ Less

Submitted 14 April, 2021; v1 submitted 14 February, 2020; originally announced February 2020.

arXiv:1808.06943 [pdf, other]

Interval-valued Data Prediction via Regularized Artificial Neural Network

Authors: Zebin Yang, Dennis K. J. Lin, Aijun Zhang

Abstract: A regularized artificial neural network (RANN) is proposed for interval-valued data prediction. The ANN model is selected due to its powerful capability in fitting linear and nonlinear functions. To meet mathematical coherence requirement for an interval (i.e., the predicted lower bounds should not cross over their upper bounds), a soft non-crossing regularizer is introduced to the interval-valued… ▽ More A regularized artificial neural network (RANN) is proposed for interval-valued data prediction. The ANN model is selected due to its powerful capability in fitting linear and nonlinear functions. To meet mathematical coherence requirement for an interval (i.e., the predicted lower bounds should not cross over their upper bounds), a soft non-crossing regularizer is introduced to the interval-valued ANN model. We conduct extensive experiments based on both simulation datasets and real-life datasets, and compare the proposed RANN method with multiple traditional models, including the linear constrained center and range method (CCRM), the least absolute shrinkage and selection operator-based interval-valued regression method (Lasso-IR), the nonlinear interval kernel regression (IKR), the interval multi-layer perceptron (iMLP) and the multi-output support vector regression (MSVR). Experimental results show that the proposed RANN model is an effective tool for interval-valued prediction tasks with high prediction accuracy. △ Less

Submitted 21 August, 2018; originally announced August 2018.

arXiv:1805.04648 [pdf]

Design of Order-of-Addition Experiments

Authors: Jiayu Peng, Rahul Mukerjee, Dennis K. J. Lin

Abstract: In an order-of-addition experiment, each treatment is a permutation of m components. It is often unaffordable to test all the m! treatments, and the design problem arises. We consider a model that incorporates the order of each pair of components and can also account for the distance between the two components in every such pair. Under this model, the optimality of the uniform design measure is es… ▽ More In an order-of-addition experiment, each treatment is a permutation of m components. It is often unaffordable to test all the m! treatments, and the design problem arises. We consider a model that incorporates the order of each pair of components and can also account for the distance between the two components in every such pair. Under this model, the optimality of the uniform design measure is established, via the approximate theory, for a broad range of criteria. Coupled with an eigen-analysis, this result serves as a benchmark that paves the way for assessing the efficiency and robustness of any exact design. The closed-form construction of a class of robust optimal fractional designs is then explored and illustrated. △ Less

Submitted 12 May, 2018; originally announced May 2018.

arXiv:1407.1225 [pdf, ps, other]

doi 10.3150/13-BEJ532

Asymptotics of nonparametric L-1 regression models with dependent data

Authors: Zhibiao Zhao, Ying Wei, Dennis K. J. Lin

Abstract: We investigate asymptotic properties of least-absolute-deviation or median quantile estimates of the location and scale functions in nonparametric regression models with dependent data from multiple subjects. Under a general dependence structure that allows for longitudinal data and some spatially correlated data, we establish uniform Bahadur representations for the proposed median quantile estima… ▽ More We investigate asymptotic properties of least-absolute-deviation or median quantile estimates of the location and scale functions in nonparametric regression models with dependent data from multiple subjects. Under a general dependence structure that allows for longitudinal data and some spatially correlated data, we establish uniform Bahadur representations for the proposed median quantile estimates. The obtained Bahadur representations provide deep insights into the asymptotic behavior of the estimates. Our main theoretical development is based on studying the modulus of continuity of kernel weighted empirical process through a coupling argument. Progesterone data is used for an illustration. △ Less

Submitted 4 July, 2014; originally announced July 2014.

Comments: Published in at http://dx.doi.org/10.3150/13-BEJ532 the Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)

Report number: IMS-BEJ-BEJ532

Journal ref: Bernoulli 2014, Vol. 20, No. 3, 1532-1559

arXiv:1206.0897 [pdf, ps, other]

doi 10.1214/12-AOS987

Uniform fractional factorial designs

Authors: Yu Tang, Hongquan Xu, Dennis K. J. Lin

Abstract: The minimum aberration criterion has been frequently used in the selection of fractional factorial designs with nominal factors. For designs with quantitative factors, however, level permutation of factors could alter their geometrical structures and statistical properties. In this paper uniformity is used to further distinguish fractional factorial designs, besides the minimum aberration criterio… ▽ More The minimum aberration criterion has been frequently used in the selection of fractional factorial designs with nominal factors. For designs with quantitative factors, however, level permutation of factors could alter their geometrical structures and statistical properties. In this paper uniformity is used to further distinguish fractional factorial designs, besides the minimum aberration criterion. We show that minimum aberration designs have low discrepancies on average. An efficient method for constructing uniform minimum aberration designs is proposed and optimal designs with 27 and 81 runs are obtained for practical use. These designs have good uniformity and are effective for studying quantitative factors. △ Less

Submitted 5 June, 2012; originally announced June 2012.

Comments: Published in at http://dx.doi.org/10.1214/12-AOS987 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS987

Journal ref: Annals of Statistics 2012, Vol. 40, No. 2, 891-907

arXiv:1203.4661 [pdf, ps, other]

doi 10.1214/11-AOAS501

Profile control charts based on nonparametric $L$-1 regression methods

Authors: Ying Wei, Zhibiao Zhao, Dennis K. J. Lin

Abstract: Classical statistical process control often relies on univariate characteristics. In many contemporary applications, however, the quality of products must be characterized by some functional relation between a response variable and its explanatory variables. Monitoring such functional profiles has been a rapidly growing field due to increasing demands. This paper develops a novel nonparametric… ▽ More Classical statistical process control often relies on univariate characteristics. In many contemporary applications, however, the quality of products must be characterized by some functional relation between a response variable and its explanatory variables. Monitoring such functional profiles has been a rapidly growing field due to increasing demands. This paper develops a novel nonparametric $L$-1 location-scale model to screen the shapes of profiles. The model is built on three basic elements: location shifts, local shape distortions, and overall shape deviations, which are quantified by three individual metrics. The proposed approach is applied to the previously analyzed vertical density profile data, leading to some interesting insights. △ Less

Submitted 21 March, 2012; originally announced March 2012.

Comments: Published in at http://dx.doi.org/10.1214/11-AOAS501 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS501

Journal ref: Annals of Applied Statistics 2012, Vol. 6, No. 1, 409-427

arXiv:1105.3816 [pdf, ps, other]

doi 10.1214/11-AOS877

On construction of optimal mixed-level supersaturated designs

Authors: Fasheng Sun, Dennis K. J. Lin, Min-Qian Liu

Abstract: Supersaturated design (SSD) has received much recent interest because of its potential in factor screening experiments. In this paper, we provide equivalent conditions for two columns to be fully aliased and consequently propose methods for constructing $E(f_{\mathrm{NOD}})$- and $χ^2$-optimal mixed-level SSDs without fully aliased columns, via equidistant designs and difference matrices. The meth… ▽ More Supersaturated design (SSD) has received much recent interest because of its potential in factor screening experiments. In this paper, we provide equivalent conditions for two columns to be fully aliased and consequently propose methods for constructing $E(f_{\mathrm{NOD}})$- and $χ^2$-optimal mixed-level SSDs without fully aliased columns, via equidistant designs and difference matrices. The methods can be easily performed and many new optimal mixed-level SSDs have been obtained. Furthermore, it is proved that the nonorthogonality between columns of the resulting design is well controlled by the source designs. A rather complete list of newly generated optimal mixed-level SSDs are tabulated for practical use. △ Less

Submitted 19 May, 2011; originally announced May 2011.

Comments: Published in at http://dx.doi.org/10.1214/11-AOS877 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS877

Journal ref: Annals of Statistics 2011, Vol. 39, No. 2, 1310-1333

Showing 1–10 of 10 results for author: Lin, K J