Showing 1–19 of 19 results for author: Lumley, T

Searching in archive stat.
  1. arXiv:2311.13048  [pdf, ps, other]

    stat.ME

    Weighted composite likelihood for linear mixed models in complex samples

    Authors: Thomas Lumley, Xudong Huang

    Abstract: Fitting mixed models to complex survey data is a challenging problem. Most methods in the literature, including the most widely used one, require a close relationship between the model structure and the survey design. In this paper we present methods for fitting arbitrary mixed models to data from arbitrary survey designs. We support this with an implementation that allows for multilevel linear mo…

    Submitted 21 November, 2023; originally announced November 2023.

  2. arXiv:2307.04944  [pdf, ps, other]

    stat.ME stat.CO

    Linear mixed models for complex survey data: implementing and evaluating pairwise likelihood

    Authors: Thomas Lumley, Xudong Huang

    Abstract: As complex-survey data becomes more widely used in health and social-science research, there is increasing interest in fitting a wider range of regression models. We describe an implementation of two-level linear mixed models in R using the pairwise composite likelihood approach of Rao and co-workers. We discuss the computational efficiency of pairwise composite likelihood and compare the estimato…

    Submitted 10 July, 2023; originally announced July 2023.

  3. arXiv:2209.10061  [pdf, ps, other]

    stat.ME stat.AP

    Practical considerations for sandwich variance estimation in two-stage regression settings

    Authors: Lillian A. Boe, Thomas Lumley, Pamela A. Shaw

    Abstract: We present a practical approach for computing the sandwich variance estimator in two-stage regression model settings. As a motivating example for two-stage regression, we consider regression calibration, a popular approach for addressing covariate measurement error. The sandwich variance approach has been rarely applied in regression calibration, despite that it requires less computation time than…

    Submitted 20 September, 2022; originally announced September 2022.

    Comments: 18 pages of main manuscript including 2 figures and 4 tables; 14 pages of supplementary materials and references (including 2 tables)

  4. arXiv:2205.01743  [pdf, other]

    stat.ME stat.AP

    Three-phase generalized raking and multiple imputation estimators to address error-prone data

    Authors: Gustavo Amorim, Ran Tao, Sarah Lotspeich, Pamela A. Shaw, Thomas Lumley, Rena C. Patel, Bryan E. Shepherd

    Abstract: Validation studies are often used to obtain more reliable information in settings with error-prone data. Validated data on a subsample of subjects can be used together with error-prone data on all subjects to improve estimation. In practice, more than one round of data validation may be required, and direct application of standard approaches for combining validation data into analyses may lead to…

    Submitted 3 May, 2022; originally announced May 2022.

  5. arXiv:2203.10701  [pdf, other]

    stat.ME

    Choosing good subsamples for regression modelling

    Authors: Thomas Lumley, Tong Chen

    Abstract: A common problem in health research is that we have a large database with many variables measured on a large number of individuals. We are interested in measuring additional variables on a subsample; these measurements may be newly available, or expensive, or simply not considered when the data were first collected. The intended use for the new measurements is to fit a regression model generalisab…

    Submitted 20 March, 2022; originally announced March 2022.

  6. arXiv:2109.14001  [pdf, other]

    stat.AP stat.ME

    Analysis of Error-prone Electronic Health Records with Multi-wave Validation Sampling: Association of Maternal Weight Gain during Pregnancy with Childhood Outcomes

    Authors: Bryan E. Shepherd, Kyunghee Han, Tong Chen, Aihua Bian, Shannon Pugh, Stephany N. Duda, Thomas Lumley, William J. Heerman, Pamela A. Shaw

    Abstract: Electronic health record (EHR) data are increasingly used for biomedical research, but these data have recognized data quality challenges. Data validation is necessary to use EHR data with confidence, but limited resources typically make complete data validation impossible. Using EHR data, we illustrate prospective, multi-wave, two-phase validation sampling to estimate the association between mate…

    Submitted 28 September, 2021; originally announced September 2021.

  7. arXiv:2106.09494  [pdf, other]

    stat.ME

    Optimum Allocation for Adaptive Multi-Wave Sampling in R: The R Package optimall

    Authors: Jasper B. Yang, Bryan E. Shepherd, Thomas Lumley, Pamela A. Shaw

    Abstract: The R package optimall offers a collection of functions that efficiently streamline the design process of sampling in surveys ranging from simple to complex. The package's main functions allow users to interactively define and adjust strata cut points based on values or quantiles of auxiliary covariates, adaptively calculate the optimum number of samples to allocate to each stratum using Neyman or…

    Submitted 17 June, 2021; originally announced June 2021.

    Comments: 31 pages, 7 figures

  8. Optimal sampling for design-based estimators of regression models

    Authors: Tong Chen, Thomas Lumley

    Abstract: Two-phase designs measure variables of interest on a subcohort where the outcome and covariates are readily available or cheap to collect on all individuals in the cohort. Given limited resource availability, it is of interest to find an optimal design that includes more informative individuals in the final sample. We explore the optimal designs and efficiencies for analysis by design-based estima…

    Submitted 15 June, 2021; originally announced June 2021.

    Journal ref: Stat.Med. (2022) 1-16

  9. arXiv:2106.01574  [pdf, other]

    stat.ME

    Multiple Imputation Through XGBoost

    Authors: Yongshi Deng, Thomas Lumley

    Abstract: The use of multiple imputation (MI) is becoming increasingly popular for addressing missing data. Although some conventional MI approaches have been well studied and have shown empirical validity, they have limitations when processing large datasets with complex data structures. Their imputation performances usually rely on the proper specification of imputation models, which requires expert knowl…

    Submitted 27 July, 2023; v1 submitted 2 June, 2021; originally announced June 2021.

  10. arXiv:2006.07480  [pdf, other]

    stat.ME

    Improved Generalized Raking Estimators to Address Dependent Covariate and Failure-Time Outcome Error

    Authors: Eric J. Oh, Bryan E. Shepherd, Thomas Lumley, Pamela A. Shaw

    Abstract: Biomedical studies that use electronic health records (EHR) data for inference are often subject to bias due to measurement error. The measurement error present in EHR data is typically complex, consisting of errors of unknown functional form in covariates and the outcome, which can be dependent. To address the bias resulting from such errors, generalized raking has recently been proposed as a rob…

    Submitted 12 June, 2020; originally announced June 2020.

  11. arXiv:2005.13739  [pdf, ps, other]

    stat.AP stat.ME

    Optimal multi-wave sampling for regression modelling in two-phase designs

    Authors: Tong Chen, Thomas Lumley

    Abstract: Two-phase designs involve measuring extra variables on a subset of the cohort where some variables are already measured. The goal of two-phase designs is to choose a subsample of individuals from the cohort and analyse that subsample efficiently. It is of interest to obtain an optimal design that gives the most efficient estimates of regression parameters. In this paper, we propose a multi-wave sa…

    Submitted 22 August, 2020; v1 submitted 27 May, 2020; originally announced May 2020.

  12. arXiv:2005.05511  [pdf, other]

    stat.ME

    Two-phase analysis and study design for survival models with error-prone exposures

    Authors: Kyunghee Han, Thomas Lumley, Bryan E. Shepherd, Pamela A. Shaw

    Abstract: Increasingly, medical research is dependent on data collected for non-research purposes, such as electronic health records data (EHR). EHR data and other large databases can be prone to measurement error in key exposures, and unadjusted analyses of error-prone data can bias study results. Validating a subset of records is a cost-effective way of gaining information on the error structure, which in…

    Submitted 11 May, 2020; originally announced May 2020.

    Comments: 22 pages, 2 figures, 3 tables, supplementary material

  13. arXiv:1912.04435  [pdf, other]

    stat.AP

    Stylised Choropleth Maps for New Zealand Regions and District Health Boards

    Authors: Thomas Lumley

    Abstract: New Zealand has two top-level sets of administrative divisions: the District Health Boards and the Regions. In this note I describe a hexagonal layout for creating stylised maps of these divisions, and using colour, size, and triangular subdivisions to compare data between divisions and across multiple variables. I present an implementation in the DHBins package for R using both base graphics and…

    Submitted 9 December, 2019; originally announced December 2019.

  14. arXiv:1910.01162  [pdf, other]

    stat.ME

    Combining multiple imputation with raking of weights: An efficient and robust approach in the setting of nearly-true models

    Authors: Kyunghee Han, Pamela A. Shaw, Thomas Lumley

    Abstract: Multiple imputation provides us with efficient estimators in model-based methods for handling missing data under the true model. It is also well-understood that design-based estimators are robust methods that do not require accurately modeling the missing data; however, they can be inefficient. In any applied setting, it is difficult to know whether a missing data model may be good enough to win t…

    Submitted 9 June, 2020; v1 submitted 2 October, 2019; originally announced October 2019.

    Comments: 24 pages, 3 figures

  15. arXiv:1905.08330  [pdf, other]

    stat.ME

    Raking and Regression Calibration: Methods to Address Bias from Correlated Covariate and Time-to-Event Error

    Authors: Eric J. Oh, Bryan E. Shepherd, Thomas Lumley, Pamela A. Shaw

    Abstract: Medical studies that depend on electronic health records (EHR) data are often subject to measurement error, as the data are not collected to support research questions under study. These data errors, if not accounted for in study analyses, can obscure or cause spurious associations between patient exposures and disease risk. Methodology to address covariate measurement error has been well develope…

    Submitted 9 March, 2020; v1 submitted 20 May, 2019; originally announced May 2019.

  16. arXiv:1803.05165  [pdf, ps, other]

    stat.CO

    Fast generalised linear models by database sampling and one-step polishing

    Authors: Thomas Lumley

    Abstract: In this note, I show how to fit a generalised linear model to $N$ observations on $p$ variables stored in a relational database, using one sampling query and one aggregation query, as long as $N^{\frac{1}{2}+\delta}$ observations can be stored in memory. The resulting estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator, and so its variance can be estimated f…

    Submitted 14 March, 2018; originally announced March 2018.

  17. arXiv:1711.04877  [pdf, other]

    stat.ME stat.ML

    Estimating prediction error for complex samples

    Authors: Andrew Holbrook, Thomas Lumley, Daniel Gillen

    Abstract: With a growing interest in using non-representative samples to train prediction models for numerous outcomes, it is necessary to account for the sampling design that gives rise to the data in order to assess the generalized predictive utility of a proposed prediction rule. After learning a prediction rule based on a non-uniform sample, it is of interest to estimate the rule's error rate when applie…

    Submitted 14 September, 2019; v1 submitted 13 November, 2017; originally announced November 2017.

    Comments: To appear in the Canadian Journal of Statistics

  18. arXiv:1701.07745  [pdf, other]

    stat.ME

    Pseudo-$R^2$ statistics under complex sampling

    Authors: Thomas Lumley

    Abstract: Model summaries based on the ratio of fitted and null likelihoods have been proposed for generalised linear models, reducing to the familiar $R^2$ coefficient of determination in the Gaussian model with identity link. In this note I show how to define the Cox--Snell and Nagelkerke summaries under arbitrary probability sampling designs, giving a design-consistent estimator of the population model s…

    Submitted 26 January, 2017; originally announced January 2017.

  19. Model-robust regression and a Bayesian ``sandwich'' estimator

    Authors: Adam A. Szpiro, Kenneth M. Rice, Thomas Lumley

    Abstract: We present a new Bayesian approach to model-robust linear regression that leads to uncertainty estimates with the same robustness properties as the Huber--White sandwich estimator. The sandwich estimator is known to provide asymptotically correct frequentist inference, even when standard modeling assumptions such as linearity and homoscedasticity in the data-generating mechanism are violated. Our…

    Submitted 7 January, 2011; originally announced January 2011.

    Comments: Published at http://dx.doi.org/10.1214/10-AOAS362 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-AOAS-AOAS362

    Journal ref: Annals of Applied Statistics 2010, Vol. 4, No. 4, 2099-2113