Weighted composite likelihood for linear mixed models in complex samples

Abstract: Fitting mixed models to complex survey data is a challenging problem. Most methods in the literature, including the most widely used one, require a close relationship between the model structure and the survey design. In this paper we present methods for fitting arbitrary mixed models to data from arbitrary survey designs. We support this with an implementation that allows for multilevel linear models and multistage designs without any assumptions about nesting of model and design, and that also allows for correlation structures such as those resulting from genetic relatedness. The estimation and inference approach uses weighted pairwise (composite) likelihood.

Submitted 21 November, 2023; originally announced November 2023.

arXiv:2307.04944 [pdf, ps, other]

Linear mixed models for complex survey data: implementing and evaluating pairwise likelihood

Authors: Thomas Lumley, Xudong Huang

Abstract: As complex-survey data becomes more widely used in health and social-science research, there is increasing interest in fitting a wider range of regression models. We describe an implementation of two-level linear mixed models in R using the pairwise composite likelihood approach of Rao and co-workers. We discuss the computational efficiency of pairwise composite likelihood and compare the estimator to the existing stagewise pseudolikelihood estimator in simulations and in data from the PISA educational survey.

Submitted 10 July, 2023; originally announced July 2023.

arXiv:2209.10061 [pdf, ps, other]

Practical considerations for sandwich variance estimation in two-stage regression settings

Authors: Lillian A. Boe, Thomas Lumley, Pamela A. Shaw

Abstract: We present a practical approach for computing the sandwich variance estimator in two-stage regression model settings. As a motivating example for two-stage regression, we consider regression calibration, a popular approach for addressing covariate measurement error. The sandwich variance approach has rarely been applied in regression calibration, even though it requires less computation time than popular resampling approaches for variance estimation, specifically the bootstrap. This is likely due to requiring specialized statistical coding. In practice, a simple bootstrap approach with Wald confidence intervals is often applied, but this approach can yield confidence intervals that do not achieve the nominal coverage level. We first outline the steps needed to compute the sandwich variance estimator. We then develop a convenient method of computation in R for sandwich variance estimation, which leverages standard regression model outputs and existing R functions and can be applied in the case of a simple random sample or complex survey design. We use a simulation study to compare the performance of the sandwich to a resampling variance approach for both data settings. Finally, we further compare these two variance estimation approaches for data examples from the Women's Health Initiative (WHI) and Hispanic Community Health Study/Study of Latinos (HCHS/SOL).

Submitted 20 September, 2022; originally announced September 2022.

Comments: 18 pages of main manuscript including 2 figures and 4 tables; 14 pages of supplementary materials and references (including 2 tables)

arXiv:2205.01743 [pdf, other]

Three-phase generalized raking and multiple imputation estimators to address error-prone data

Authors: Gustavo Amorim, Ran Tao, Sarah Lotspeich, Pamela A. Shaw, Thomas Lumley, Rena C. Patel, Bryan E. Shepherd

Abstract: Validation studies are often used to obtain more reliable information in settings with error-prone data. Validated data on a subsample of subjects can be used together with error-prone data on all subjects to improve estimation. In practice, more than one round of data validation may be required, and direct application of standard approaches for combining validation data into analyses may lead to inefficient estimators since the information available from intermediate validation steps is only partially considered or even completely ignored. In this paper, we present two novel extensions of multiple imputation and generalized raking estimators that make full use of all available data. We show through simulations that incorporating information from intermediate steps can lead to substantial gains in efficiency. This work is motivated by and illustrated in a study of contraceptive effectiveness among 82,957 women living with HIV whose data were originally extracted from electronic medical records, of whom 4855 had their charts reviewed, and a subsequent 1203 also had a telephone interview to validate key study variables.

Submitted 3 May, 2022; originally announced May 2022.

arXiv:2203.10701 [pdf, other]

Choosing good subsamples for regression modelling

Authors: Thomas Lumley, Tong Chen

Abstract: A common problem in health research is that we have a large database with many variables measured on a large number of individuals. We are interested in measuring additional variables on a subsample; these measurements may be newly available, or expensive, or simply not considered when the data were first collected. The intended use for the new measurements is to fit a regression model generalisable to the whole cohort (and to its source population). This is a two-phase sampling problem; it differs from some other two-phase sampling problems in the richness of the phase I data and in the goal of regression modelling. In particular, an important special case is measurement-error models, where a variable strongly correlated with the phase II measurements is available at phase I. We will explain how influence functions have been useful as a unifying concept for extending classical results to this setting, and describe the steps from designing for a simple weighted estimator at known parameter values through adaptive multiwave designs and the use of prior information. We will conclude with some comments on the information gap between design-based and model-based estimators in this setting.

Submitted 20 March, 2022; originally announced March 2022.

arXiv:2109.14001 [pdf, other]

Analysis of Error-prone Electronic Health Records with Multi-wave Validation Sampling: Association of Maternal Weight Gain during Pregnancy with Childhood Outcomes

Authors: Bryan E. Shepherd, Kyunghee Han, Tong Chen, Aihua Bian, Shannon Pugh, Stephany N. Duda, Thomas Lumley, William J. Heerman, Pamela A. Shaw

Abstract: Electronic health record (EHR) data are increasingly used for biomedical research, but these data have recognized data quality challenges. Data validation is necessary to use EHR data with confidence, but limited resources typically make complete data validation impossible. Using EHR data, we illustrate prospective, multi-wave, two-phase validation sampling to estimate the association between maternal weight gain during pregnancy and the risks of her child developing obesity or asthma. The optimal validation sampling design depends on the unknown efficient influence functions of regression coefficients of interest. In the first wave of our multi-wave validation design, we estimate the influence function using the unvalidated (phase 1) data to determine our validation sample; then in subsequent waves, we re-estimate the influence function using validated (phase 2) data and update our sampling. For efficiency, estimation combines obesity and asthma sampling frames while calibrating sampling weights using generalized raking. We validated 996 of 10,335 mother-child EHR dyads in 6 sampling waves. Estimated associations between childhood obesity/asthma and maternal weight gain, as well as other covariates, are compared to naive estimates that only use unvalidated data. In some cases, estimates markedly differ, underscoring the importance of efficient validation sampling to obtain accurate estimates incorporating validated data.

Submitted 28 September, 2021; originally announced September 2021.

arXiv:2106.09494 [pdf, other]

Optimum Allocation for Adaptive Multi-Wave Sampling in R: The R Package optimall

Authors: Jasper B. Yang, Bryan E. Shepherd, Thomas Lumley, Pamela A. Shaw

Abstract: The R package optimall offers a collection of functions that efficiently streamline the design process of sampling in surveys ranging from simple to complex. The package's main functions allow users to interactively define and adjust strata cut points based on values or quantiles of auxiliary covariates, adaptively calculate the optimum number of samples to allocate to each stratum using Neyman or Wright allocation, and select specific IDs to sample based on a stratified sampling design. Using real-life epidemiological study examples, we demonstrate how optimall facilitates an efficient workflow for the design and implementation of surveys in R. Although tailored towards multi-wave sampling under two- or three-phase designs, the R package optimall may be useful for any sampling survey.

Submitted 17 June, 2021; originally announced June 2021.

Comments: 31 pages, 7 figures
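
The Neyman allocation named in the abstract above can be illustrated with a short sketch. This is not the optimall API (which is an R package); the function name `neyman_allocation` and the example strata are hypothetical. It assumes the textbook rule that the sample size in stratum h is proportional to N_h * S_h (stratum size times stratum standard deviation) and returns fractional allocations, leaving integer rounding aside.

```python
def neyman_allocation(stratum_sizes, stratum_sds, n_total):
    """Split n_total across strata in proportion to N_h * S_h.

    stratum_sizes: population size N_h of each stratum
    stratum_sds:   (estimated) standard deviation S_h within each stratum
    Returns fractional sample sizes; rounding to integers is a separate step.
    """
    products = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    if total == 0:
        raise ValueError("all strata have zero size or zero variability")
    return [n_total * p / total for p in products]


# Hypothetical strata: the most variable stratum receives the most samples.
allocation = neyman_allocation([1000, 500, 500], [1.0, 2.0, 4.0], 400)
print(allocation)
```

In a multi-wave design the standard deviations would typically be re-estimated after each wave and the remaining budget reallocated, which is the adaptive use the abstract describes.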

arXiv:2106.08530 [pdf, other]

doi 10.1002/sim.9300

Optimal sampling for design-based estimators of regression models

Authors: Tong Chen, Thomas Lumley

Abstract: Two-phase designs measure variables of interest on a subcohort where the outcome and covariates are readily available or cheap to collect on all individuals in the cohort. Given limited resource availability, it is of interest to find an optimal design that includes more informative individuals in the final sample. We explore the optimal designs and efficiencies for analysis by design-based estimators. Generalized raking is an efficient design-based estimator that improves on the inverse-probability weighted (IPW) estimator by adjusting weights based on the auxiliary information. We derive a closed-form solution of the optimal design for estimating regression coefficients from generalized raking estimators. We compare it with the optimal design for analysis via the IPW estimator and other two-phase designs in measurement-error settings. We consider general two-phase designs where the outcome variable and variables of interest can be continuous or discrete. Our results show that the optimal designs for analysis by the two design-based estimators can be very different. The optimal design for IPW estimation is optimal for analysis via the IPW estimator and typically gives near-optimal efficiency for generalized raking, though we show there is potential improvement in some settings.

Submitted 15 June, 2021; originally announced June 2021.

Journal ref: Stat.Med. (2022) 1-16
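
The weight adjustment behind generalized raking, as described in the abstract above, can be sketched for the simplest case. This is an illustration under stated assumptions, not the paper's estimator: it uses a single scalar auxiliary variable with a known full-cohort total, the multiplicative (raking) form w_i * exp(lambda * x_i), and a Newton iteration for lambda. The function name `rake_weights` and the data are hypothetical.

```python
import math


def rake_weights(weights, aux, target_total, tol=1e-9, max_iter=100):
    """Tilt sampling weights so the weighted total of `aux` hits target_total.

    Calibrated weights have the form w_i * exp(lam * x_i); lam is found by
    Newton's method on the calibration equation
        sum_i w_i * exp(lam * x_i) * x_i = target_total.
    """
    lam = 0.0
    for _ in range(max_iter):
        tilted = [w * math.exp(lam * x) for w, x in zip(weights, aux)]
        gap = sum(t * x for t, x in zip(tilted, aux)) - target_total
        if abs(gap) < tol:
            break
        # Derivative of the weighted total with respect to lam.
        slope = sum(t * x * x for t, x in zip(tilted, aux))
        lam -= gap / slope
    return [w * math.exp(lam * x) for w, x in zip(weights, aux)]


# Hypothetical phase-two sample: design weights of 10, auxiliary values 1..4.
# The design-weighted total is 100, but the known cohort total is 110.
design_weights = [10.0, 10.0, 10.0, 10.0]
auxiliary = [1.0, 2.0, 3.0, 4.0]
calibrated = rake_weights(design_weights, auxiliary, 110.0)
print(calibrated)
```

The calibrated weights stay positive by construction, which is one reason the multiplicative form is often preferred over a linear adjustment.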

arXiv:2106.01574 [pdf, other]

Multiple Imputation Through XGBoost

Authors: Yongshi Deng, Thomas Lumley

Abstract: The use of multiple imputation (MI) is becoming increasingly popular for addressing missing data. Although some conventional MI approaches have been well studied and have shown empirical validity, they have limitations when processing large datasets with complex data structures. Their imputation performances usually rely on the proper specification of imputation models, which requires expert knowledge of the inherent relations among variables. Moreover, these standard approaches tend to be computationally inefficient for medium and large datasets. In this paper, we propose a scalable MI framework mixgb, which is based on XGBoost, subsampling, and predictive mean matching. Our approach leverages the power of XGBoost, a fast implementation of gradient boosted trees, to automatically capture interactions and non-linear relations while achieving high computational efficiency. In addition, we incorporate subsampling and predictive mean matching to reduce bias and better account for appropriate imputation variability. The proposed framework is implemented in an R package mixgb. Supplementary materials for this article are available online.

Submitted 27 July, 2023; v1 submitted 2 June, 2021; originally announced June 2021.

arXiv:2006.07480 [pdf, other]

Improved Generalized Raking Estimators to Address Dependent Covariate and Failure-Time Outcome Error

Authors: Eric J. Oh, Bryan E. Shepherd, Thomas Lumley, Pamela A. Shaw

Abstract: Biomedical studies that use electronic health records (EHR) data for inference are often subject to bias due to measurement error. The measurement error present in EHR data is typically complex, consisting of errors of unknown functional form in covariates and the outcome, which can be dependent. To address the bias resulting from such errors, generalized raking has recently been proposed as a robust method that yields consistent estimates without the need to model the error structure. We provide rationale for why these previously proposed raking estimators can be expected to be inefficient in failure-time outcome settings involving misclassification of the event indicator. We propose raking estimators that utilize multiple imputation, to impute either the target variables or auxiliary variables, to improve the efficiency. We also consider outcome-dependent sampling designs and investigate their impact on the efficiency of the raking estimators, either with or without multiple imputation. We present an extensive numerical study to examine the performance of the proposed estimators across various measurement error settings. We then apply the proposed methods to our motivating setting, in which we seek to analyze HIV outcomes in an observational cohort with electronic health records data from the Vanderbilt Comprehensive Care Clinic.

Submitted 12 June, 2020; originally announced June 2020.

arXiv:2005.13739 [pdf, ps, other]

doi 10.1002/sim.8760

Optimal multi-wave sampling for regression modelling in two-phase designs

Authors: Tong Chen, Thomas Lumley

Abstract: Two-phase designs involve measuring extra variables on a subset of the cohort where some variables are already measured. The goal of two-phase designs is to choose a subsample of individuals from the cohort and analyse that subsample efficiently. It is of interest to obtain an optimal design that gives the most efficient estimates of regression parameters. In this paper, we propose a multi-wave sampling design to approximate the optimal design for design-based estimators. Influence functions are used to compute the optimal sampling allocations. We propose to use informative priors on regression parameters to derive the wave-1 sampling probabilities because any pre-specified sampling probabilities may be far from optimal and decrease efficiency. Generalised raking is used in statistical analysis. We show that a two-wave sampling with reasonable informative priors will end up with higher precision for the parameter of interest and be close to the underlying optimal design.

Submitted 22 August, 2020; v1 submitted 27 May, 2020; originally announced May 2020.

arXiv:2005.05511 [pdf, other]

Two-phase analysis and study design for survival models with error-prone exposures

Authors: Kyunghee Han, Thomas Lumley, Bryan E. Shepherd, Pamela A. Shaw

Abstract: Increasingly, medical research is dependent on data collected for non-research purposes, such as electronic health records (EHR) data. EHR data and other large databases can be prone to measurement error in key exposures, and unadjusted analyses of error-prone data can bias study results. Validating a subset of records is a cost-effective way of gaining information on the error structure, which in turn can be used to adjust analyses for this error and improve inference. We extend the mean score method for the two-phase analysis of discrete-time survival models, which uses the unvalidated covariates as auxiliary variables that act as surrogates for the unobserved true exposures. This method relies on a two-phase sampling design and an estimation approach that preserves the consistency of complete case regression parameter estimates in the validated subset, with increased precision leveraged from the auxiliary data. Furthermore, we develop optimal sampling strategies which minimize the variance of the mean score estimator for a target exposure under a fixed cost constraint. We consider the setting where an internal pilot is necessary for the optimal design so that the phase two sample is split into a pilot and an adaptive optimal sample. Through simulations and a data example, we evaluate efficiency gains of the mean score estimator using the derived optimal validation design compared to balanced and simple random sampling for the phase two sample. We also empirically explore efficiency gains that the proposed discrete optimal design can provide for the Cox proportional hazards model in the setting of a continuous-time survival outcome.

Submitted 11 May, 2020; originally announced May 2020.

Comments: 22 pages, 2 figures, 3 tables, supplementary material

arXiv:1912.04435 [pdf, other]

Stylised Choropleth Maps for New Zealand Regions and District Health Boards

Authors: Thomas Lumley

Abstract: New Zealand has two top-level sets of administrative divisions: the District Health Boards and the Regions. In this note I describe a hexagonal layout for creating stylised maps of these divisions, and using colour, size, and triangular subdivisions to compare data between divisions and across multiple variables. I present an implementation in the DHBins package for R using both base graphics and ggplot2; the concepts and specific hexagonal layout could be used in any software.

Submitted 9 December, 2019; originally announced December 2019.

arXiv:1910.01162 [pdf, other]

Combining multiple imputation with raking of weights: An efficient and robust approach in the setting of nearly-true models

Authors: Kyunghee Han, Pamela A. Shaw, Thomas Lumley

Abstract: Multiple imputation provides us with efficient estimators in model-based methods for handling missing data under the true model. It is also well-understood that design-based estimators are robust methods that do not require accurately modeling the missing data; however, they can be inefficient. In any applied setting, it is difficult to know whether a missing data model may be good enough to win the bias-efficiency trade-off. Raking of weights is one approach that relies on constructing an auxiliary variable from data observed on the full cohort, which is then used to adjust the weights for the usual Horvitz-Thompson estimator. Computing the optimally efficient raking estimator requires evaluating the expectation of the efficient score given the full cohort data, which is generally infeasible. We demonstrate multiple imputation (MI) as a practical method to compute a raking estimator that will be optimal. We compare this estimator to common parametric and semi-parametric estimators, including standard multiple imputation. We show that while estimators, such as the semi-parametric maximum likelihood and MI estimator, obtain optimal performance under the true model, the proposed raking estimator utilizing MI maintains a better robustness-efficiency trade-off even under mild model misspecification. We also show that the standard raking estimator, without MI, is often competitive with the optimal raking estimator. We demonstrate these properties through several numerical examples and provide a theoretical discussion of conditions for asymptotically superior relative efficiency of the proposed raking estimator.

Submitted 9 June, 2020; v1 submitted 2 October, 2019; originally announced October 2019.

Comments: 24 pages, 3 figures

arXiv:1905.08330 [pdf, other]

Raking and Regression Calibration: Methods to Address Bias from Correlated Covariate and Time-to-Event Error

Authors: Eric J. Oh, Bryan E. Shepherd, Thomas Lumley, Pamela A. Shaw

Abstract: Medical studies that depend on electronic health records (EHR) data are often subject to measurement error, as the data are not collected to support research questions under study. These data errors, if not accounted for in study analyses, can obscure or cause spurious associations between patient exposures and disease risk. Methodology to address covariate measurement error has been well developed; however, time-to-event error has also been shown to cause significant bias but methods to address it are relatively underdeveloped. More generally, it is possible to observe errors in both the covariate and the time-to-event outcome that are correlated. We propose regression calibration (RC) estimators to simultaneously address correlated error in the covariates and the censored event time. Although RC can perform well in many settings with covariate measurement error, it is biased for nonlinear regression models, such as the Cox model. Thus, we additionally propose raking estimators which are consistent estimators of the parameter defined by the population estimating equation. Raking can improve upon RC in certain settings with failure-time data, requires no explicit modeling of the error structure, and can be utilized under outcome-dependent sampling designs. We discuss features of the underlying estimation problem that affect the degree of improvement the raking estimator has over the RC approach. Detailed simulation studies are presented to examine the performance of the proposed estimators under varying levels of signal, error, and censoring. The methodology is illustrated on observational EHR data on HIV outcomes from the Vanderbilt Comprehensive Care Clinic.

Submitted 9 March, 2020; v1 submitted 20 May, 2019; originally announced May 2019.

arXiv:1803.05165 [pdf, ps, other]

Fast generalised linear models by database sampling and one-step polishing

Authors: Thomas Lumley

Abstract: In this note, I show how to fit a generalised linear model to $N$ observations on $p$ variables stored in a relational database, using one sampling query and one aggregation query, as long as $N^{\frac{1}{2}+\delta}$ observations can be stored in memory. The resulting estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator, and so its variance can be estimated from the Fisher information in the usual way. A proof-of-concept implementation uses R with MonetDB and with SQLite, and could easily be adapted to other popular databases. I illustrate the approach with examples of taxi-trip data in New York City and factors related to car colour in New Zealand.

Submitted 14 March, 2018; originally announced March 2018.
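
The idea in the abstract above (fit on a subsample, then take a single Newton step using full-data aggregates) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: in-memory lists stand in for the database table, the model is a logistic regression with an intercept and one covariate, the data are simulated, and all function names are hypothetical. The full-data pass in `newton_step` plays the role of the aggregation query, since it only needs sums over rows.

```python
import math
import random


def score_and_information(beta, xs, ys):
    """Score vector and Fisher information (2x2, symmetric) for logistic regression."""
    u0 = u1 = i00 = i01 = i11 = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))
        resid, var = y - p, p * (1.0 - p)
        u0 += resid
        u1 += resid * x
        i00 += var
        i01 += var * x
        i11 += var * x * x
    return (u0, u1), (i00, i01, i11)


def newton_step(beta, xs, ys):
    """One Newton update: beta + I(beta)^{-1} U(beta), using a closed-form 2x2 inverse."""
    (u0, u1), (i00, i01, i11) = score_and_information(beta, xs, ys)
    det = i00 * i11 - i01 * i01
    return [beta[0] + (i11 * u0 - i01 * u1) / det,
            beta[1] + (i00 * u1 - i01 * u0) / det]


def fit_logistic(xs, ys, iterations=25):
    """Iterate Newton steps from zero; used for the subsample (pilot) fit."""
    beta = [0.0, 0.0]
    for _ in range(iterations):
        beta = newton_step(beta, xs, ys)
    return beta


# Simulated "table" of N rows (assumed true coefficients: -0.5 and 1.0).
rng = random.Random(1)
N = 5000
xs = [rng.gauss(0.0, 1.0) for _ in range(N)]
ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(0.5 - x)) else 0 for x in xs]

# Step 1 ("sampling query"): fit the model on a random subsample held in memory.
idx = rng.sample(range(N), 500)
pilot = fit_logistic([xs[i] for i in idx], [ys[i] for i in idx])

# Step 2 ("aggregation query"): one Newton step using sums over all N rows.
polished = newton_step(pilot, xs, ys)

# Full-data maximum likelihood fit, computed here only for comparison.
mle = fit_logistic(xs, ys)
print(pilot, polished, mle)
```

The comparison fit is exactly what the method is designed to avoid on large tables; it appears here only so the pilot and polished estimates can be checked against it.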

arXiv:1711.04877 [pdf, other]

Estimating prediction error for complex samples

Authors: Andrew Holbrook, Thomas Lumley, Daniel Gillen

Abstract: With a growing interest in using non-representative samples to train prediction models for numerous outcomes it is necessary to account for the sampling design that gives rise to the data in order to assess the generalized predictive utility of a proposed prediction rule. After learning a prediction rule based on a non-uniform sample, it is of interest to estimate the rule's error rate when applied to unobserved members of the population. Efron (1986) proposed a general class of covariance penalty inflated prediction error estimators that assume the available training data are representative of the target population for which the prediction rule is to be applied. We extend Efron's estimator to the complex sample context by incorporating Horvitz-Thompson sampling weights and show that it is consistent for the true generalization error rate when applied to the underlying superpopulation. The resulting Horvitz-Thompson-Efron (HTE) estimator is equivalent to dAIC, a recent extension of AIC to survey sampling data, but is more widely applicable. The proposed methodology is assessed with simulations and is applied to models predicting renal function obtained from the large-scale NHANES survey.

Submitted 14 September, 2019; v1 submitted 13 November, 2017; originally announced November 2017.

Comments: To appear in the Canadian Journal of Statistics

arXiv:1701.07745 [pdf, other]

Pseudo-$R^2$ statistics under complex sampling

Authors: Thomas Lumley

Abstract: Model summaries based on the ratio of fitted and null likelihoods have been proposed for generalised linear models, reducing to the familiar $R^2$ coefficient of determination in the Gaussian model with identity link. In this note I show how to define the Cox--Snell and Nagelkerke summaries under arbitrary probability sampling designs, giving a design-consistent estimator of the population model summary. I also show that for logistic regression models under case--control sampling the usual Cox--Snell and Nagelkerke $R^2$ are not design-consistent, but are systematically larger than would be obtained with a cross-sectional or cohort sample, even in settings where the weighted and unweighted logistic regression estimators are similar or identical.

Submitted 26 January, 2017; originally announced January 2017.
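
For reference, the two summaries named in the abstract above have the following standard (unweighted) definitions, where $\ell_1$ and $\ell_0$ are the fitted and null log-likelihoods and $n$ is the sample size. The weighted versions in the last line are one natural reading of a design-based analogue and are stated here as an assumption, not as the note's exact definition: $w_i$ are sampling weights and $\ell_{1,i}$, $\ell_{0,i}$ are per-observation log-likelihood contributions.

```latex
\begin{align*}
R^2_{\mathrm{CS}} &= 1 - \exp\!\left\{-\frac{2}{n}\,(\ell_1 - \ell_0)\right\},
&
R^2_{\mathrm{N}} &= \frac{R^2_{\mathrm{CS}}}{1 - \exp\!\left(\frac{2}{n}\,\ell_0\right)},
\\[4pt]
\hat\ell_k &= \sum_{i \in \text{sample}} w_i\,\ell_{k,i} \quad (k = 0, 1),
&
\hat N &= \sum_{i \in \text{sample}} w_i .
\end{align*}
```

Under this reading, $\hat\ell_k$ and $\hat N$ would replace $\ell_k$ and $n$ in the first line, so that the summary estimates the value that would be obtained from the whole population rather than from the sample as drawn.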

arXiv:1101.1402 [pdf, ps, other]

doi 10.1214/10-AOAS362

Model-robust regression and a Bayesian ``sandwich'' estimator

Authors: Adam A. Szpiro, Kenneth M. Rice, Thomas Lumley

Abstract: We present a new Bayesian approach to model-robust linear regression that leads to uncertainty estimates with the same robustness properties as the Huber--White sandwich estimator. The sandwich estimator is known to provide asymptotically correct frequentist inference, even when standard modeling assumptions such as linearity and homoscedasticity in the data-generating mechanism are violated. Our derivation provides a compelling Bayesian justification for using this simple and popular tool, and it also clarifies what is being estimated when the data-generating mechanism is not linear. We demonstrate the applicability of our approach using a simulation study and health care cost data from an evaluation of the Washington State Basic Health Plan.

Submitted 7 January, 2011; originally announced January 2011.

Comments: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/10-AOAS362

Report number: IMS-AOAS-AOAS362

Journal ref: Annals of Applied Statistics 2010, Vol. 4, No. 4, 2099-2113
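
As background to the abstract above, the frequentist Huber--White estimator it refers to can be written out for ordinary least squares. The notation is assumed here rather than taken from the paper: $x_i$ is the covariate vector for observation $i$, $X$ the design matrix with rows $x_i^\top$, and $\hat e_i = y_i - x_i^\top\hat\beta$ the residual. This is the basic (HC0) form without small-sample corrections.

```latex
\begin{align*}
\hat\beta &= (X^\top X)^{-1} X^\top y,
\\[4pt]
\widehat{\operatorname{Var}}_{\text{model}}(\hat\beta) &= \hat\sigma^2\,(X^\top X)^{-1},
\\[4pt]
\widehat{\operatorname{Var}}_{\text{sandwich}}(\hat\beta)
  &= (X^\top X)^{-1}
     \left(\sum_{i=1}^{n} \hat e_i^{\,2}\, x_i x_i^\top\right)
     (X^\top X)^{-1}.
\end{align*}
```

The two variance estimators agree asymptotically when the linear model with constant error variance holds; the sandwich form remains valid when it does not, which is the robustness property the abstract describes.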

Showing 1–19 of 19 results for author: Lumley, T