Automating the Selection of Proxy Variables of Unmeasured Confounders

Authors: Feng Xie, Zhengming Chen, Shanshan Luo, Wang Miao, Ruichu Cai, Zhi Geng

Abstract: Recently, interest has grown in the use of proxy variables of unobserved confounding for inferring the causal effect in the presence of unmeasured confounders from observational data. One difficulty inhibiting the practical use is finding valid proxy variables of unobserved confounding to a target causal effect of interest. These proxy variables are typically justified by background knowledge. In this paper, we investigate the estimation of causal effects among multiple treatments and a single outcome, all of which are affected by unmeasured confounders, within a linear causal model, without prior knowledge of the validity of proxy variables. To be more specific, we first extend the existing proxy variable estimator, originally addressing a single unmeasured confounder, to accommodate scenarios where multiple unmeasured confounders exist between the treatments and the outcome. Subsequently, we present two different sets of precise identifiability conditions for selecting valid proxy variables of unmeasured confounders, based on the second-order statistics and higher-order statistics of the data, respectively. Moreover, we propose two data-driven methods for the selection of proxy variables and for the unbiased estimation of causal effects. Theoretical analysis demonstrates the correctness of our proposed algorithms. Experimental results on both synthetic and real-world data show the effectiveness of the proposed approach.

Submitted 25 May, 2024; originally announced May 2024.

arXiv:2312.10596 [pdf, other]

A maximin optimal approach for sampling designs in two-phase studies

Authors: Ruoyu Wang, Qihua Wang, Wang Miao

Abstract: Data collection costs can vary widely across variables in data science tasks. Two-phase designs are often employed to save data collection costs. In two-phase studies, inexpensive variables are collected for all subjects in the first phase, and expensive variables are measured for a subset of subjects in the second phase based on a predetermined sampling rule. The estimation efficiency under two-phase designs relies heavily on the sampling rule. Existing literature primarily focuses on designing sampling rules for estimating a scalar parameter in some parametric models or specific estimating problems. However, real-world scenarios are usually model-unknown and involve two-phase designs for model-free estimation of a scalar or multi-dimensional parameter. This paper proposes a maximin criterion to design an optimal sampling rule based on semiparametric efficiency bounds. The proposed method is model-free and applicable to general estimating problems. The resulting sampling rule can minimize the semiparametric efficiency bound when the parameter is scalar and improve the bound for every component when the parameter is multi-dimensional. Simulation studies demonstrate that the proposed designs reduce the variance of the resulting estimator in various settings. The implementation of the proposed design is illustrated in a real data analysis.

Submitted 25 May, 2024; v1 submitted 16 December, 2023; originally announced December 2023.

arXiv:2311.08691 [pdf, ps, other]

On Doubly Robust Estimation with Nonignorable Missing Data Using Instrumental Variables

Authors: Baoluo Sun, Wang Miao, Deshanee S. Wickramarachchi

Abstract: Suppose we are interested in the mean of an outcome that is subject to nonignorable nonresponse. This paper develops new semiparametric estimation methods with instrumental variables which affect nonresponse, but not the outcome. The proposed estimators remain consistent and asymptotically normal even under partial model misspecifications for two variation independent nuisance components. We evaluate the performance of the proposed estimators via a simulation study, and apply them in adjusting for missing data induced by HIV testing refusal in the evaluation of HIV seroprevalence in Mochudi, Botswana, using interviewer experience as an instrumental variable.

Submitted 13 May, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

Comments: 29 pages

arXiv:2303.10134 [pdf, ps, other]

doi 10.1016/j.spl.2023.109836

Proximal Causal Inference without Uniqueness Assumptions

Authors: Jeffrey Zhang, Wei Li, Wang Miao, Eric Tchetgen Tchetgen

Abstract: We consider identification and inference about a counterfactual outcome mean when there is unmeasured confounding using tools from proximal causal inference (Miao et al. [2018], Tchetgen Tchetgen et al. [2020]). Proximal causal inference requires existence of solutions to at least one of two integral equations. We motivate the existence of solutions to the integral equations from proximal causal inference by demonstrating that, assuming the existence of a solution to one of the integral equations, $\sqrt{n}$-estimability of a linear functional (such as its mean) of that solution requires the existence of a solution to the other integral equation. Solutions to the integral equations may not be unique, which complicates estimation and inference. We construct a consistent estimator for the solution set for one of the integral equations and then adapt the theory of extremum estimators to find from the estimated set a consistent estimator for a uniquely defined solution. A debiased estimator for the counterfactual mean is shown to be root-$n$ consistent, regular, and asymptotically semiparametrically locally efficient under additional regularity conditions.

Submitted 1 October, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

Comments: Fixed some errors and added to acknowledgements

Journal ref: Statistics & Probability Letters 198 (2023)

arXiv:2301.02225 [pdf, ps, other]

$l_{1-2}$ GLasso: $L_{1-2}$ Regularized Multi-task Graphical Lasso for Joint Estimation of eQTL Mapping and Gene Network

Authors: Wei Miao, Lan Yao

Abstract: A critical problem in genetics is to discover how gene expression is regulated within cells. Two major tasks of regulatory association learning are: (i) identifying SNP-gene relationships, known as eQTL mapping, and (ii) determining gene-gene relationships, known as gene network estimation. To share information between these two tasks, we focus on the unified model for joint estimation of eQTL mapping and gene network, and propose an $L_{1-2}$ regularized multi-task graphical lasso, named $L_{1-2}$ GLasso. Numerical experiments on artificial datasets demonstrate the competitive performance of $L_{1-2}$ GLasso in capturing the true sparse structure of eQTL mapping and gene network. $L_{1-2}$ GLasso is further applied to the real ADNI-1 dataset, and experimental results show that $L_{1-2}$ GLasso can obtain sparser and more accurate solutions than other commonly used methods.

Submitted 4 January, 2023; originally announced January 2023.

arXiv:2210.02014 [pdf, other]

Doubly Robust Proximal Synthetic Controls

Authors: Hongxiang Qiu, Xu Shi, Wang Miao, Edgar Dobriban, Eric Tchetgen Tchetgen

Abstract: To infer the treatment effect for a single treated unit using panel data, synthetic control methods construct a linear combination of control units' outcomes that mimics the treated unit's pre-treatment outcome trajectory. This linear combination is subsequently used to impute the counterfactual outcomes of the treated unit had it not been treated in the post-treatment period, and used to estimate the treatment effect. Existing synthetic control methods rely on correctly modeling certain aspects of the counterfactual outcome generating mechanism and may require near-perfect matching of the pre-treatment trajectory. Inspired by proximal causal inference, we obtain two novel nonparametric identifying formulas for the average treatment effect for the treated unit: one is based on weighting, and the other combines models for the counterfactual outcome and the weighting function. We introduce the concept of covariate shift to synthetic controls to obtain these identification results conditional on the treatment assignment. We also develop two treatment effect estimators based on these two formulas and the generalized method of moments. One new estimator is doubly robust: it is consistent and asymptotically normal if at least one of the outcome and weighting models is correctly specified. We demonstrate the performance of the methods via simulations and apply them to evaluate the effectiveness of a Pneumococcal conjugate vaccine on the risk of all-cause pneumonia in Brazil.

Submitted 6 May, 2024; v1 submitted 5 October, 2022; originally announced October 2022.

arXiv:2210.00200 [pdf, other]

Paradoxes and resolutions for semiparametric fusion of individual and summary data

Authors: Wenjie Hu, Ruoyu Wang, Wei Li, Wang Miao

Abstract: Suppose we have available individual data from an internal study and various types of summary statistics from relevant external studies. External summary statistics have been used as constraints on the internal data distribution, which promised to improve the statistical inference in the internal data; however, the additional use of external summary data may lead to paradoxical results: efficiency loss may occur if the uncertainty of summary statistics is not negligible and large estimation bias can emerge even if the bias of external summary statistics is small. We investigate these paradoxical results in a semiparametric framework. We establish the semiparametric efficiency bound for estimating a general functional of the internal data distribution, which is shown to be no larger than that using only internal data. We propose a data-fused efficient estimator that achieves this bound so that the efficiency paradox is resolved. Besides, a debiased estimator is further proposed which has selection consistency property by employing adaptive lasso penalty so that the resultant estimator can achieve the same asymptotic distribution as the oracle one that uses only unbiased summary statistics, which resolves the bias paradox. Simulations and application to a Helicobacter pylori infection dataset are used to illustrate the proposed methods.

Submitted 17 July, 2023; v1 submitted 1 October, 2022; originally announced October 2022.

Comments: 17 pages, 3 figures

arXiv:2208.01237 [pdf, ps, other]

Doubly Robust Proximal Causal Inference under Confounded Outcome-Dependent Sampling

Authors: Kendrick Qijun Li, Xu Shi, Wang Miao, Eric Tchetgen Tchetgen

Abstract: Unmeasured confounding and selection bias are often of concern in observational studies and may invalidate a causal analysis if not appropriately accounted for. Under outcome-dependent sampling, a latent factor that has causal effects on the treatment, outcome, and sample selection process may cause both unmeasured confounding and selection bias, rendering standard causal parameters unidentifiable without additional assumptions. Under an odds ratio model for the treatment effect, Li et al. 2022 established both proximal identification and estimation of causal effects by leveraging a pair of negative control variables as proxies of latent factors at the source of both confounding and selection bias. However, their approach relies exclusively on the existence and correct specification of a so-called treatment confounding bridge function, a model that restricts the treatment assignment mechanism. In this article, we propose doubly robust estimation under the odds ratio model with respect to two nuisance functions -- a treatment confounding bridge function and an outcome confounding bridge function that restricts the outcome law, such that our estimator is consistent and asymptotically normal if either bridge function model is correctly specified, without knowing which one is. Thus, our proposed doubly robust estimator is potentially more robust than that of Li et al. 2022.
Our simulations confirm that the proposed proximal estimators of an odds ratio causal effect can adequately account for both residual confounding and selection bias under stated conditions with well-calibrated confidence intervals in a wide range of scenarios, where standard methods generally fail to be consistent. In addition, the proposed doubly robust estimator is consistent if at least one confounding bridge function is correctly specified.

Submitted 2 August, 2022; originally announced August 2022.

Comments: 43 pages, 1 figure

arXiv:2207.08535 [pdf, other]

A self-censoring model for multivariate nonignorable nonmonotone missing data

Authors: Yilin Li, Wang Miao, Ilya Shpitser, Eric J. Tchetgen Tchetgen

Abstract: We introduce a self-censoring model for multivariate nonignorable nonmonotone missing data, where the missingness process of each outcome is affected by its own value and is associated with missingness indicators of other outcomes, while conditionally independent of the other outcomes. The self-censoring model complements previous graphical approaches for the analysis of multivariate nonignorable missing data. It is identified under a completeness condition stating that any variability in one outcome can be captured by variability in the other outcomes among complete cases. For estimation, we propose a suite of semiparametric estimators including doubly robust estimators that deliver valid inferences under partial misspecification of the full-data distribution. We evaluate the performance of the proposed estimators with simulations and apply them to analyze a study about the effect of highly active antiretroviral therapy on preterm delivery of HIV-positive mothers.

Submitted 30 September, 2022; v1 submitted 18 July, 2022; originally announced July 2022.

Comments: 28 pages, 6 figures

arXiv:2206.08228 [pdf, other]

Identification and estimation of causal effects in the presence of confounded principal strata

Authors: Shanshan Luo, Wei Li, Wang Miao, Yangbo He

Abstract: Principal stratification has become a popular tool to address a broad class of causal inference questions, particularly in dealing with non-compliance and truncation-by-death problems. The causal effects within principal strata, which are determined by joint potential values of the intermediate variable and are also known as the principal causal effects, are often of interest in these studies. Analyses of principal causal effects from observed data in the literature mostly rely on ignorability of the treatment assignment, which requires practitioners to accurately measure as many covariates as possible so that all possible confounding sources are captured. However, collecting all potential confounders in observational studies is often difficult and costly; the ignorability assumption may thus be questionable. In this paper, by leveraging available negative controls that have been increasingly used to deal with uncontrolled confounding, we consider identification and estimation of causal effects when the treatment and principal strata are confounded by unobserved variables. Specifically, we show that the principal causal effects can be nonparametrically identified by invoking a pair of negative controls that are both required not to directly affect the outcome. We then relax this assumption and establish identification of principal causal effects under various semiparametric or parametric models. We also propose an estimation method of principal causal effects.
Extensive simulation studies show good performance of the proposed approach and a real data application from the National Longitudinal Survey of Young Men is used for illustration.

Submitted 17 June, 2022; v1 submitted 16 June, 2022; originally announced June 2022.

Comments: Figure 1 updated

arXiv:2203.12509 [pdf, other]

Double Negative Control Inference in Test-Negative Design Studies of Vaccine Effectiveness

Authors: Kendrick Qijun Li, Xu Shi, Wang Miao, Eric Tchetgen Tchetgen

Abstract: The test-negative design (TND) has become a standard approach to evaluate vaccine effectiveness against the risk of acquiring infectious diseases in real-world settings, such as Influenza, Rotavirus, Dengue fever, and more recently COVID-19. In a TND study, individuals who experience symptoms and seek care are recruited and tested for the infectious disease which defines cases and controls. Despite TND's potential to reduce unobserved differences in healthcare seeking behavior (HSB) between vaccinated and unvaccinated subjects, it remains subject to various potential biases. First, residual confounding may remain due to unobserved HSB, occupation as healthcare worker, or previous infection history. Second, because selection into the TND sample is a common consequence of infection and HSB, collider stratification bias may exist when conditioning the analysis on tested samples, which further induces confounding by latent HSB. In this paper, we present a novel approach to identify and estimate vaccine effectiveness in the target population by carefully leveraging a pair of negative control exposure and outcome variables to account for potential hidden bias in TND studies. We illustrate our proposed method with extensive simulations and an application to study COVID-19 vaccine effectiveness using data from the University of Michigan Health System.

Submitted 8 March, 2023; v1 submitted 23 March, 2022; originally announced March 2022.

Comments: 78 pages, 4 figures, 5 tables

arXiv:2112.02822 [pdf, other]

A stableness of resistance model for nonresponse adjustment with callback data

Authors: Wang Miao, Xinyu Li, Baoluo Sun

Abstract: Nonresponse arises frequently in surveys and follow-ups are routinely made to increase the response rate. In order to monitor the follow-up process, callback data have been used in social sciences and survey studies for decades. In modern surveys, the availability of callback data is increasing because the response rate is decreasing and follow-ups are essential to collect maximum information. Although callback data are helpful to reduce the bias in surveys, such data have not been widely used in statistical analysis until recently. We propose a stableness of resistance assumption for nonresponse adjustment with callback data. We establish the identification and the semiparametric efficiency theory under this assumption, and propose a suite of semiparametric estimation methods including a doubly robust one, which generalize existing parametric approaches for callback data analysis. We apply the approach to a Consumer Expenditure Survey dataset. The results suggest an association between nonresponse and high housing expenditures.

Submitted 14 February, 2023; v1 submitted 6 December, 2021; originally announced December 2021.

arXiv:2110.05776 [pdf, other]

Nonparametric inference about mean functionals of nonignorable nonresponse data without identifying the joint distribution

Authors: Wei Li, Wang Miao, Eric Tchetgen Tchetgen

Abstract: We consider identification and inference about mean functionals of observed covariates and an outcome variable subject to nonignorable missingness. By leveraging a shadow variable, we establish a necessary and sufficient condition for identification of the mean functional even if the full data distribution is not identified. We further characterize a necessary condition for $\sqrt{n}$-estimability of the mean functional. This condition naturally strengthens the identifying condition, and it requires the existence of a function as a solution to a representer equation that connects the shadow variable to the mean functional. Solutions to the representer equation may not be unique, which presents substantial challenges for nonparametric estimation and standard theories for nonparametric sieve estimators are not applicable here. We construct a consistent estimator for the solution set and then adapt the theory of extremum estimators to find from the estimated set a consistent estimator for an appropriately chosen solution. The estimator is asymptotically normal, locally efficient and attains the semiparametric efficiency bound under certain regularity conditions. We illustrate the proposed approach via simulations and a real data application on home pricing.

Submitted 6 April, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

Comments: 23 pages, 1 figure and 3 tables

arXiv:2110.01106 [pdf, ps, other]

Data Integration in Causal Inference

Authors: Xu Shi, Ziyang Pan, Wang Miao

Abstract: Integrating data from multiple heterogeneous sources has become increasingly popular to achieve a large sample size and diverse study population. This paper reviews developments in causal inference methods that combine multiple datasets collected by potentially different designs from potentially heterogeneous populations. We summarize recent advances on combining randomized clinical trials with external information from observational studies or historical controls, combining samples when no single sample has all relevant variables with application to two-sample Mendelian randomization, distributed data setting under privacy concerns for comparative effectiveness and safety research using real-world data, Bayesian causal inference, and causal discovery methods.

Submitted 3 October, 2021; originally announced October 2021.

arXiv:2109.07030 [pdf, other]

Proximal Causal Inference for Complex Longitudinal Studies

Authors: Andrew Ying, Wang Miao, Xu Shi, Eric J. Tchetgen Tchetgen

Abstract: A standard assumption for causal inference about the joint effects of time-varying treatment is that one has measured sufficient covariates to ensure that within covariate strata, subjects are exchangeable across observed treatment values, also known as "sequential randomization assumption (SRA)". SRA is often criticized as it requires one to accurately measure all confounders. Realistically, measured covariates can rarely capture all confounders with certainty. Often covariate measurements are at best proxies of confounders, thus invalidating inferences under SRA. In this paper, we extend the proximal causal inference (PCI) framework of Miao et al. (2018) to the longitudinal setting under a semiparametric marginal structural mean model (MSMM). PCI offers an opportunity to learn about joint causal effects in settings where SRA based on measured time-varying covariates fails, by formally accounting for the covariate measurements as imperfect proxies of underlying confounding mechanisms. We establish nonparametric identification with a pair of time-varying proxies and provide a corresponding characterization of regular and asymptotically linear estimators of the parameter indexing the MSMM, including a rich class of doubly robust estimators, and establish the corresponding semiparametric efficiency bound for the MSMM. Extensive simulation studies and a data application illustrate the finite sample behavior of proposed methods.

Submitted 3 August, 2022; v1 submitted 14 September, 2021; originally announced September 2021.

arXiv:2108.13935 [pdf, other]

Theory for Identification and Inference with Synthetic Controls: A Proximal Causal Inference Framework

Authors: Xu Shi, Kendrick Li, Wang Miao, Mengtong Hu, Eric Tchetgen Tchetgen

Abstract: Synthetic control (SC) methods are commonly used to estimate the treatment effect on a single treated unit in panel data settings. An SC is a weighted average of control units built to match the treated unit, with weights typically estimated by regressing (summaries of) pre-treatment outcomes and measured covariates of the treated unit to those of the control units. However, it has been established that in the absence of a good fit, such regression estimator will generally perform poorly. In this paper, we introduce a proximal causal inference framework to formalize identification and inference for both the SC and ultimately the treatment effect on the treated, based on the observation that control units not contributing to the construction of an SC can be repurposed as proxies of latent confounders. We view the difference in the post-treatment outcomes between the treated unit and the SC as a time series, which opens the door to various time series methods for treatment effect estimation. The proposed framework can accommodate nonlinear models, which allows for binary and count outcomes that are understudied in the SC literature. We illustrate with simulation studies and an application to evaluation of the 1990 German Reunification.

Submitted 18 February, 2023; v1 submitted 31 August, 2021; originally announced August 2021.

Comments: 37 pages, 3 figures. The Supplementary Materials are attached

arXiv:2108.12600 [pdf, other]

A robust fusion-extraction procedure with summary statistics in the presence of biased sources

Authors: Ruoyu Wang, Qihua Wang, Wang Miao

Abstract: Information from various data sources is increasingly available nowadays. However, some of the data sources may produce biased estimation due to commonly encountered biased sampling, population heterogeneity, or model misspecification. This calls for statistical methods to combine information in the presence of biased sources. In this paper, a robust data fusion-extraction method is proposed. The method can produce a consistent estimator of the parameter of interest even if many of the data sources are biased. The proposed estimator is easy to compute and only employs summary statistics, and hence can be applied to many different fields, e.g. meta-analysis, Mendelian randomisation and distributed systems. Moreover, the proposed estimator is asymptotically equivalent to the oracle estimator that only uses data from unbiased sources under some mild conditions. Asymptotic normality of the proposed estimator is also established. In contrast to the existing meta-analysis methods, the theoretical properties are guaranteed even if both the number of data sources and the dimension of the parameter diverge as the sample size increases, which ensures the performance of the proposed method over a wide range. The robustness and oracle property are also evaluated via simulation studies. The proposed method is applied to a meta-analysis data set to evaluate the surgical treatment for moderate periodontal disease, and a Mendelian randomization data set to study the risk factors of head and neck cancer.

Submitted 5 February, 2023; v1 submitted 28 August, 2021; originally announced August 2021.

arXiv:2011.09829

Sharp bounds for variance of treatment effect estimators in the finite population in the presence of covariates

Authors: Ruoyu Wang, Qihua Wang, Wang Miao, Xiaohua Zhou

Abstract: In a completely randomized experiment, the variances of treatment effect estimators in the finite population are usually not identifiable and hence not estimable. Although some estimable bounds of the variances have been established in the literature, few of them are derived in the presence of covariates. In this paper, the difference-in-means estimator and the Wald estimator are considered in the completely randomized experiment with perfect compliance and noncompliance, respectively. Sharp bounds for the variances of these two estimators are established when covariates are available. Furthermore, consistent estimators for such bounds are obtained, which can be used to shorten the confidence intervals and improve the power of tests. Confidence intervals are constructed based on the consistent estimators of the upper bounds, whose coverage rates are uniformly asymptotically guaranteed. Simulations were conducted to evaluate the proposed methods. The proposed methods are also illustrated with two real data analyses.

Submitted 19 September, 2022; v1 submitted 19 November, 2020; originally announced November 2020.

Comments: Accepted by Statistica Sinica

arXiv:2011.08411

Semiparametric proximal causal inference

Authors: Yifan Cui, Hongming Pu, Xu Shi, Wang Miao, Eric Tchetgen Tchetgen

Abstract: Skepticism about the assumption of no unmeasured confounding, also known as exchangeability, is often warranted in making causal inferences from observational data; because exchangeability hinges on an investigator's ability to accurately measure covariates that capture all potential sources of confounding. In practice, the most one can hope for is that covariate measurements are at best proxies of the true underlying confounding mechanism operating in a given observational study. In this paper, we consider the framework of proximal causal inference introduced by Miao et al. (2018); Tchetgen Tchetgen et al. (2020), which while explicitly acknowledging covariate measurements as imperfect proxies of confounding mechanisms, offers an opportunity to learn about causal effects in settings where exchangeability on the basis of measured covariates fails. We make a number of contributions to proximal inference including (i) an alternative set of conditions for nonparametric proximal identification of the average treatment effect; (ii) general semiparametric theory for proximal estimation of the average treatment effect including efficiency bounds for key semiparametric models of interest; (iii) a characterization of proximal doubly robust and locally efficient estimators of the average treatment effect. Moreover, we provide analogous identification and efficiency results for the average treatment effect on the treated. Our approach is illustrated via simulation studies and a data application on evaluating the effectiveness of right heart catheterization in the intensive care unit of critically ill patients.

Submitted 21 February, 2023; v1 submitted 16 November, 2020; originally announced November 2020.

arXiv:2011.07234

doi 10.1111/biom.13583

Improving efficiency of inference in clinical trials with external control data

Authors: Xinyu Li, Wang Miao, Fang Lu, Xiao-Hua Zhou

Abstract: Suppose we are interested in the effect of a treatment in a clinical trial. The efficiency of inference may be limited due to small sample size. However, external control data are often available from historical studies. Motivated by an application to Helicobacter pylori infection, we show how to borrow strength from such data to improve efficiency of inference in the clinical trial. Under an exchangeability assumption about the potential outcome mean, we show that the semiparametric efficiency bound for estimating the average treatment effect can be reduced by incorporating both the clinical trial data and external controls. We then derive a doubly robust and locally efficient estimator. The improvement in efficiency is prominent especially when the external control dataset has a large sample size and small variability. Our method allows for a relaxed overlap assumption, and we illustrate with the case where the clinical trial only contains a treated group. We also develop doubly robust and locally efficient approaches that extrapolate the causal effect in the clinical trial to the external population and the overall population. Our results also offer a meaningful implication for trial design and data collection. We evaluate the finite-sample performance of the proposed estimators via simulation. In the Helicobacter pylori infection application, our approach shows that the combination treatment has potential efficacy advantages over the triple therapy.

Submitted 9 December, 2021; v1 submitted 14 November, 2020; originally announced November 2020.

Comments: Accepted for publication in Biometrics; 1 figure, 3 tables

arXiv:2011.04504

Identifying effects of multiple treatments in the presence of unmeasured confounding

Authors: Wang Miao, Wenjie Hu, Elizabeth L. Ogburn, Xiaohua Zhou

Abstract: Identification of treatment effects in the presence of unmeasured confounding is a persistent problem in the social, biological, and medical sciences. The problem of unmeasured confounding in settings with multiple treatments is most common in statistical genetics and bioinformatics settings, where researchers have developed many successful statistical strategies without engaging deeply with the causal aspects of the problem. Recently there have been a number of attempts to bridge the gap between these statistical approaches and causal inference, but these attempts have either been shown to be flawed or have relied on fully parametric assumptions. In this paper, we propose two strategies for identifying and estimating causal effects of multiple treatments in the presence of unmeasured confounding. The auxiliary variables approach leverages variables that are not causally associated with the outcome; in the case of a univariate confounder, our method only requires one auxiliary variable, unlike existing instrumental variable methods that would require as many instruments as there are treatments. An alternative null treatments approach relies on the assumption that at least half of the confounded treatments have no causal effect on the outcome, but does not require a priori knowledge of which treatments are null. Our identification strategies do not impose parametric assumptions on the outcome model and do not rest on estimation of the confounder. This paper extends and generalizes existing work on unmeasured confounding with a single treatment and models commonly used in bioinformatics.

Submitted 9 July, 2022; v1 submitted 9 November, 2020; originally announced November 2020.

arXiv:2009.10982

An Introduction to Proximal Causal Learning

Authors: Eric J Tchetgen Tchetgen, Andrew Ying, Yifan Cui, Xu Shi, Wang Miao

Abstract: A standard assumption for causal inference from observational data is that one has measured a sufficiently rich set of covariates to ensure that within covariate strata, subjects are exchangeable across observed treatment values. Skepticism about the exchangeability assumption in observational studies is often warranted because it hinges on investigators' ability to accurately measure covariates capturing all potential sources of confounding. Realistically, confounding mechanisms can rarely if ever, be learned with certainty from measured covariates. One can therefore only ever hope that covariate measurements are at best proxies of true underlying confounding mechanisms operating in an observational study, thus invalidating causal claims made on basis of standard exchangeability conditions. Causal learning from proxies is a challenging inverse problem which has to date remained unresolved. In this paper, we introduce a formal potential outcome framework for proximal causal learning, which while explicitly acknowledging covariate measurements as imperfect proxies of confounding mechanisms, offers an opportunity to learn about causal effects in settings where exchangeability on the basis of measured covariates fails. Sufficient conditions for nonparametric identification are given, leading to the proximal g-formula and corresponding proximal g-computation algorithm for estimation. These may be viewed as generalizations of Robins' foundational g-formula and g-computation algorithm, which account explicitly for bias due to unmeasured confounding. Both point treatment and time-varying treatment settings are considered, and an application of proximal g-computation of causal effects is given for illustration.

Submitted 23 September, 2020; originally announced September 2020.

Comments: This paper was originally presented by the first author at the 2020 Myrto Lefkopoulou Distinguished Lectureship at the Harvard T. H. Chan School of Public Health on September 17th 2020

MSC Class: 62A01

arXiv:2009.05641

A Selective Review of Negative Control Methods in Epidemiology

Authors: Xu Shi, Wang Miao, Eric Tchetgen Tchetgen

Abstract: Purpose of Review: Negative controls are a powerful tool to detect and adjust for bias in epidemiological research. This paper introduces negative controls to a broader audience and provides guidance on principled design and causal analysis based on a formal negative control framework. Recent Findings: We review and summarize causal and statistical assumptions, practical strategies, and validation criteria that can be combined with subject matter knowledge to perform negative control analyses. We also review existing statistical methodologies for detection, reduction, and correction of confounding bias, and briefly discuss recent advances towards nonparametric identification of causal effects in a double negative control design. Summary: There is great potential for valid and accurate causal inference leveraging contemporary healthcare data in which negative controls are routinely available. Design and analysis of observational data leveraging negative controls is an area of growing interest in health and social sciences. Despite these developments, further effort is needed to disseminate these novel methods to ensure they are adopted by practicing epidemiologists.

Submitted 19 July, 2022; v1 submitted 11 September, 2020; originally announced September 2020.

arXiv:1810.03353

doi 10.5705/ss.202020.0081

On Semiparametric Instrumental Variable Estimation of Average Treatment Effects through Data Fusion

Authors: BaoLuo Sun, Wang Miao

Abstract: Suppose one is interested in estimating causal effects in the presence of potentially unmeasured confounding with the aid of a valid instrumental variable. This paper investigates the problem of making inferences about the average treatment effect when data are fused from two separate sources, one of which contains information on the treatment and the other contains information on the outcome, while values for the instrument and a vector of baseline covariates are recorded in both. We provide a general set of sufficient conditions under which the average treatment effect is nonparametrically identified from the observed data law induced by data fusion, even when the data are from two heterogeneous populations, and derive the efficiency bound for estimating this causal parameter. For inference, we develop both parametric and semiparametric methods, including a multiply robust and locally efficient estimator that is consistent even under partial misspecification of the observed data model. We illustrate the methods through simulations and an application on public housing projects.

Submitted 13 February, 2020; v1 submitted 8 October, 2018; originally announced October 2018.

Comments: 34 pages

arXiv:1808.04945

A Confounding Bridge Approach for Double Negative Control Inference on Causal Effects

Authors: Wang Miao, Xu Shi, Eric Tchetgen Tchetgen

Abstract: Unmeasured confounding is a key challenge for causal inference. Negative control variables are widely available in observational studies. A negative control outcome is associated with the confounder but not causally affected by the exposure in view, and a negative control exposure is correlated with the primary exposure or the confounder but does not causally affect the outcome of interest. In this paper, we establish a framework to use them for unmeasured confounding adjustment. We introduce a confounding bridge function that links the potential outcome mean and the negative control outcome distribution, and we incorporate a negative control exposure to identify the bridge function and the average causal effect. Our approach can be used to repair an invalid instrumental variable in case it is correlated with the unmeasured confounder. We also extend our approach by allowing for a causal association between the primary exposure and the control outcome. We illustrate our approach with simulations and apply it to a study about the short-term effect of air pollution. Although a standard analysis shows a significant acute effect of PM2.5 on mortality, our analysis indicates that this effect may be confounded, and after double negative control adjustment, the effect is attenuated toward zero.

Submitted 18 September, 2020; v1 submitted 14 August, 2018; originally announced August 2018.

Comments: Supplement and Sample Codes are included

arXiv:1808.04906

Multiply Robust Causal Inference with Double Negative Control Adjustment for Categorical Unmeasured Confounding

Authors: Xu Shi, Wang Miao, Jennifer C. Nelson, Eric J. Tchetgen Tchetgen

Abstract: Unmeasured confounding is a threat to causal inference in observational studies. In recent years, use of negative controls to mitigate unmeasured confounding has gained increasing recognition and popularity. Negative controls have a longstanding tradition in laboratory sciences and epidemiology to rule out non-causal explanations, although they have been used primarily for bias detection. Recently, Miao et al. (2018) have described sufficient conditions under which a pair of negative control exposure and outcome variables can be used to nonparametrically identify the average treatment effect (ATE) from observational data subject to uncontrolled confounding. In this paper, we establish nonparametric identification of the ATE under weaker conditions in the case of categorical unmeasured confounding and negative control variables. We also provide a general semiparametric framework for obtaining inferences about the ATE while leveraging information about a possibly large number of measured covariates. In particular, we derive the semiparametric efficiency bound in the nonparametric model, and we propose multiply robust and locally efficient estimators when nonparametric estimation may not be feasible. We assess the finite sample performance of our methods in extensive simulation studies. Finally, we illustrate our methods with an application to the postlicensure surveillance of vaccine safety among children.

Submitted 4 September, 2019; v1 submitted 14 August, 2018; originally announced August 2018.

arXiv:1609.08816

Identifying Causal Effects With Proxy Variables of an Unmeasured Confounder

Authors: Wang Miao, Zhi Geng, Eric Tchetgen Tchetgen

Abstract: We consider a causal effect that is confounded by an unobserved variable, but with observed proxy variables of the confounder. We show that, with at least two independent proxy variables satisfying a certain rank condition, the causal effect is nonparametrically identified, even if the measurement error mechanism, i.e., the conditional distribution of the proxies given the confounder, may not be identified. Our result generalizes the identification strategy of Kuroki & Pearl (2014) that rests on identification of the measurement error mechanism. When only one proxy for the confounder is available, or the required rank condition is not met, we develop a strategy to test the null hypothesis of no causal effect.

Submitted 28 June, 2018; v1 submitted 28 September, 2016; originally announced September 2016.

arXiv:1607.03197

Semiparametric Estimation with Data Missing Not at Random Using an Instrumental Variable

Authors: BaoLuo Sun, Lan Liu, Wang Miao, Kathleen Wirth, James Robins, Eric Tchetgen Tchetgen

Abstract: Missing data occur frequently in empirical studies in health and social sciences, often compromising our ability to make accurate inferences. An outcome is said to be missing not at random (MNAR) if, conditional on the observed variables, the missing data mechanism still depends on the unobserved outcome. In such settings, identification is generally not possible without imposing additional assumptions. Identification is sometimes possible, however, if an instrumental variable (IV) is observed for all subjects which satisfies the exclusion restriction that the IV affects the missingness process without directly influencing the outcome. In this paper, we provide necessary and sufficient conditions for nonparametric identification of the full data distribution under MNAR with the aid of an IV. In addition, we give sufficient identification conditions that are more straightforward to verify in practice. For inference, we focus on estimation of a population outcome mean, for which we develop a suite of semiparametric estimators that extend methods previously developed for data missing at random. Specifically, we propose inverse probability weighted estimation, outcome regression-based estimation and doubly robust estimation of the mean of an outcome subject to MNAR. For illustration, the methods are used to account for selection bias induced by HIV testing refusal in the evaluation of HIV seroprevalence in Mochudi, Botswana, using interviewer characteristics such as gender, age and years of experience as IVs.

Submitted 17 January, 2017; v1 submitted 11 July, 2016; originally announced July 2016.

Journal ref: Statistica Sinica 28 (2018), 1965-1983

arXiv:1509.03860

Identifiability of Normal and Normal Mixture Models With Nonignorable Missing Data

Authors: Wang Miao, Peng Ding, Zhi Geng

Abstract: Missing data problems arise in many applied research studies. They may jeopardize statistical inference of the model of interest, if the missing mechanism is nonignorable, that is, the missing mechanism depends on the missing values themselves even conditional on the observed data. With a nonignorable missing mechanism, the model of interest is often not identifiable without imposing further assumptions. We find that even if the missing mechanism has a known parametric form, the model is not identifiable without specifying a parametric outcome distribution. Although it is fundamental for valid statistical inference, identifiability under nonignorable missing mechanisms is not established for many commonly-used models. In this paper, we first demonstrate identifiability of the normal distribution under monotone missing mechanisms. We then extend it to the normal mixture and $t$ mixture models with non-monotone missing mechanisms. We discover that models under the Logistic missing mechanism are less identifiable than those under the Probit missing mechanism. We give necessary and sufficient conditions for identifiability of models under the Logistic missing mechanism, which sometimes can be checked in real data analysis. We illustrate our methods using a series of simulations, and apply them to a real-life dataset.

Submitted 13 September, 2015; originally announced September 2015.

arXiv:1509.02556

Identification, Doubly Robust Estimation, and Semiparametric Efficiency Theory of Nonignorable Missing Data With a Shadow Variable

Authors: Wang Miao, Lan Liu, Eric Tchetgen Tchetgen, Zhi Geng

Abstract: We consider identification and estimation with an outcome missing not at random (MNAR). We study an identification strategy based on a so-called shadow variable. A shadow variable is assumed to be correlated with the outcome, but independent of the missingness process conditional on the outcome and fully observed covariates. We describe a general condition for nonparametric identification of the full data law under MNAR using a valid shadow variable. Our condition is satisfied by many commonly-used models; moreover, it is imposed on the complete cases, and therefore has testable implications with observed data only. We describe semiparametric estimation methods and evaluate their performance on both simulation data and a real data example. We characterize the semiparametric efficiency bound for the class of regular and asymptotically linear estimators, and derive a closed form for the efficient influence function.

Submitted 9 September, 2019; v1 submitted 8 September, 2015; originally announced September 2015.

arXiv:1506.08149

Identification and Inference for Marginal Average Treatment Effect on the Treated With an Instrumental Variable

Authors: Lan Liu, Wang Miao, Baoluo Sun, James Robins, Eric Tchetgen Tchetgen

Abstract: In observational studies, treatments are typically not randomized and therefore estimated treatment effects may be subject to confounding bias. The instrumental variable (IV) design plays the role of a quasi-experimental handle since the IV is associated with the treatment and only affects the outcome through the treatment. In this paper, we present a novel framework for identification and inference using an IV for the marginal average treatment effect amongst the treated (ETT) in the presence of unmeasured confounding. For inference, we propose three different semiparametric approaches: (i) inverse probability weighting (IPW), (ii) outcome regression (OR), and (iii) doubly robust (DR) estimation, which is consistent if either (i) or (ii) is consistent, but not necessarily both. A closed-form locally semiparametric efficient estimator is obtained in the simple case of binary IV and outcome and the efficiency bound is derived for the more general case.

Submitted 26 August, 2016; v1 submitted 26 June, 2015; originally announced June 2015.

arXiv:1210.3709

A Rank-Corrected Procedure for Matrix Completion with Fixed Basis Coefficients

Authors: Weimin Miao, Shaohua Pan, Defeng Sun

Abstract: For the problems of low-rank matrix completion, the efficiency of the widely-used nuclear norm technique may be challenged under many circumstances, especially when certain basis coefficients are fixed, for example, the low-rank correlation matrix completion in various fields such as the financial market and the low-rank density matrix completion from the quantum state tomography. To seek a solution of high recovery quality beyond the reach of the nuclear norm, in this paper, we propose a rank-corrected procedure using a nuclear semi-norm to generate a new estimator. For this new estimator, we establish a non-asymptotic recovery error bound. More importantly, we quantify the reduction of the recovery error bound for this rank-corrected procedure. Compared with the one obtained for the nuclear norm penalized least squares estimator, this reduction can be substantial (around 50%). We also provide necessary and sufficient conditions for rank consistency in the sense of Bach (2008). Very interestingly, these conditions are highly related to the concept of constraint nondegeneracy in matrix optimization. As a byproduct, our results provide a theoretical foundation for the majorized penalty method of Gao and Sun (2010) and Gao (2010) for structured low-rank matrix optimization problems. Extensive numerical experiments demonstrate that our proposed rank-corrected procedure can simultaneously achieve a high recovery accuracy and capture the low-rank structure.

Submitted 22 June, 2015; v1 submitted 13 October, 2012; originally announced October 2012.

Comments: 51 pages, 4 figures

arXiv:1010.0308

doi 10.1214/09-STS301

The Impact of Levene's Test of Equality of Variances on Statistical Theory and Practice

Authors: Joseph L. Gastwirth, Yulia R. Gel, Weiwen Miao

Abstract: In many applications, the underlying scientific question concerns whether the variances of $k$ samples are equal. There are a substantial number of tests for this problem. Many of them rely on the assumption of normality and are not robust to its violation. In 1960 Professor Howard Levene proposed a new approach to this problem by applying the $F$-test to the absolute deviations of the observations from their group means. Levene's approach is powerful and robust to nonnormality and became a very popular tool for checking the homogeneity of variances. This paper reviews the original method proposed by Levene and subsequent robust modifications. A modification of Levene-type tests to increase their power to detect monotonic trends in variances is discussed. This procedure is useful when one is concerned with an alternative of increasing or decreasing variability, for example, increasing volatility of stocks prices or "open or closed gramophones" in regression residual analysis. A major section of the paper is devoted to discussion of various scientific problems where Levene-type tests have been used, for example, economic anthropology, accuracy of medical measurements, volatility of the price of oil, studies of the consistency of jury awards in legal cases and the effect of hurricanes on ecological systems.

Submitted 2 October, 2010; originally announced October 2010.

Comments: Published in at http://dx.doi.org/10.1214/09-STS301 the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS301

Journal ref: Statistical Science 2009, Vol. 24, No. 3, 343-360
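
Illustrative note (not part of the arXiv record): the abstract above describes Levene's approach as applying the $F$-test to the absolute deviations of the observations from their group means. A minimal sketch of that statistic follows, assuming plain Python lists as input. The function name `levene_statistic` is hypothetical, and the code computes only the test statistic, not a p-value.

```python
from statistics import mean


def levene_statistic(groups):
    """One-way ANOVA F statistic computed on absolute deviations.

    `groups` is a list of k samples (lists of numbers). This is a sketch:
    it assumes k >= 2 and that the within-group sum of squares of the
    deviations is nonzero (otherwise the ratio is undefined).
    """
    # Z_ij = |Y_ij - mean of group i|
    z = [[abs(y - mean(g)) for y in g] for g in groups]

    k = len(z)
    n_total = sum(len(g) for g in z)
    z_means = [mean(g) for g in z]
    z_grand = sum(sum(g) for g in z) / n_total

    # Between-group and within-group sums of squares of the Z values
    between = sum(len(g) * (m - z_grand) ** 2 for g, m in zip(z, z_means))
    within = sum((v - m) ** 2 for g, m in zip(z, z_means) for v in g)

    return (between / (k - 1)) / (within / (n_total - k))


# Two samples with the same spread, then two with very different spread
same_spread = levene_statistic([[1, 2, 3], [11, 12, 13]])
diff_spread = levene_statistic([[1, 2, 3], [0, 10, 20]])
print(same_spread, diff_spread)
```

Under the null hypothesis of equal variances the statistic is referred to an $F$ distribution with $k-1$ and $N-k$ degrees of freedom. The robust modifications mentioned in the abstract are not implemented here.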

Showing 1–33 of 33 results for author: Miao, W