Review of Quasi-Randomization Approaches for Estimation from Non-probability Samples

Authors: Vladislav Beresovsky, Julie Gershunskaya, Terrance D. Savitsky

Abstract: The recent proliferation of computers and the internet has opened new opportunities for collecting and processing data. However, such data are often obtained without a well-planned probability survey design. Such non-probability based samples cannot be automatically regarded as representative of the population of interest. Several classes of methods for estimation and inference from non-probability samples have been developed in recent years. The quasi-randomization methods assume that non-probability sample selection is governed by an underlying latent random mechanism. The basic idea is to use information collected from a probability ("reference") sample to uncover latent non-probability survey participation probabilities (also known as "propensity scores") and use them in estimation of target finite population parameters. In this paper, we review and compare theoretical properties of recently developed methods for estimating survey participation probabilities and study their relative performance in simulations.

Submitted 26 June, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

Comments: 38 pages, 12 figures
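
The abstract above describes using estimated participation probabilities to estimate finite population parameters. One common way to do that, shown here only as a minimal sketch and not as the specific estimator reviewed in the paper, is a Hajek-type inverse-probability-weighted mean. The function name `ipw_mean` and its inputs are hypothetical, and the participation probabilities are assumed to have been estimated already.

```python
def ipw_mean(y, participation_prob):
    """Hajek-type weighted mean: sum(y_i / p_i) / sum(1 / p_i).

    y                  -- outcomes observed in the non-probability sample
    participation_prob -- estimated participation probability of each unit
                          (assumed given; estimating it is the hard part)
    """
    if len(y) != len(participation_prob):
        raise ValueError("y and participation_prob must have equal length")
    if any(not (0.0 < p <= 1.0) for p in participation_prob):
        raise ValueError("participation probabilities must lie in (0, 1]")
    # Each unit stands in for 1 / p_i units of the population.
    weights = [1.0 / p for p in participation_prob]
    weighted_total = sum(w * yi for w, yi in zip(weights, y))
    return weighted_total / sum(weights)
```

Units with a low participation probability receive a large weight, so the estimate leans toward the parts of the population that the convenience sample under-covers.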

arXiv:2310.01575

Derivation of outcome-dependent dietary patterns for low-income women obtained from survey data using a Supervised Weighted Overfitted Latent Class Analysis

Authors: Stephanie M. Wu, Matthew R. Williams, Terrance D. Savitsky, Briana J. K. Stephenson

Abstract: Poor diet quality is a key modifiable risk factor for hypertension and disproportionately impacts low-income women. Analyzing diet-driven hypertensive outcomes in this demographic is challenging due to the complexity of dietary data and selection bias when the data come from surveys, a main data source for understanding diet-disease relationships in understudied populations. Supervised Bayesian model-based clustering methods summarize dietary data into latent patterns that holistically capture relationships among foods and a known health outcome but do not sufficiently account for complex survey design. This leads to biased estimation and inference and lack of generalizability of the patterns. To address this, we propose a supervised weighted overfitted latent class analysis (SWOLCA) based on a Bayesian pseudo-likelihood approach that integrates sampling weights into an exposure-outcome model for discrete data. Our model adjusts for stratification, clustering, and informative sampling, and handles modifying effects via interaction terms within a Markov chain Monte Carlo Gibbs sampling algorithm. Simulation studies confirm that the SWOLCA model exhibits good performance in terms of bias, precision, and coverage. Using data from the National Health and Nutrition Examination Survey (2015-2018), we demonstrate the utility of our model by characterizing dietary patterns associated with hypertensive outcomes among low-income women in the United States.

Submitted 28 June, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: 16 pages, 8 tables, 7 figures

arXiv:2308.06845

csSampling: An R Package for Bayesian Models for Complex Survey Data

Authors: Ryan Hornby, Matthew R. Williams, Terrance D. Savitsky, Mahmoud Elkasabi

Abstract: We present csSampling, an R package for estimation of Bayesian models for data collected from complex survey samples. csSampling combines functionality from the probabilistic programming language Stan (via the rstan and brms R packages) and the handling of complex survey data from the survey R package. Under this approach, the user creates a survey-weighted model in brms or provides a custom weighted model via rstan. Survey design information is provided via the svydesign function of the survey package. The cs_sampling function of csSampling estimates the weighted Stan model and provides an asymptotic covariance correction for model mis-specification due to using survey sampling weights as plug-in values in the likelihood. This is often known as a "design effect", which is the ratio between the variance from a complex survey sample and a simple random sample of the same size. The resulting adjusted posterior draws can then be used for the usual Bayesian inference while also achieving frequentist properties of asymptotic consistency and correct uncertainty (e.g. coverage).

Submitted 13 August, 2023; originally announced August 2023.

Comments: 22 pages, 5 figures
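
The abstract above defines the design effect as a ratio of variances. As a rough illustration of that idea, and not of the covariance correction that csSampling performs, the sketch below computes Kish's well-known approximation of the design effect due to unequal weighting alone. It ignores clustering and stratification, and the function names are hypothetical.

```python
def kish_design_effect(weights):
    """Kish's approximate design effect from unequal weights.

    deff = n * sum(w_i^2) / (sum(w_i))^2
    Equals 1.0 when all weights are equal and grows as weights vary more.
    Assumption: only weight variability is accounted for here.
    """
    n = len(weights)
    if n == 0:
        raise ValueError("weights must be non-empty")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return n * sum(w * w for w in weights) / (total * total)


def effective_sample_size(weights):
    """Nominal sample size deflated by the approximate design effect."""
    return len(weights) / kish_design_effect(weights)
```

A design effect above 1 means the weighted sample carries less information than a simple random sample of the same nominal size.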

arXiv:2210.14366

Joint Point and Variance Estimation under a Hierarchical Bayesian model for Survey Count Data

Authors: Terrance D. Savitsky, Julie Gershunskaya, Mark Crankshaw

Abstract: We propose a novel Bayesian framework for the joint modeling of survey point and variance estimates for count data. The approach incorporates an induced prior distribution on the modeled true variance that sets it equal to the generating variance of the point estimate, a key property more readily achieved for continuous data response type models. Our count data model formulation allows the input of domains at multiple resolutions (e.g., states, regions, nation) and simultaneously benchmarks modeled estimates at higher resolutions (e.g., states) to those at lower resolutions (e.g., regions) in a fashion that borrows more strength to sharpen our domain estimates at higher resolutions. We conduct a simulation study that generates a population of units within domains to produce ground truth statistics to compare to direct and modeled estimates performed on samples taken from the population where we show improved reductions in error across domains. The model is applied to the job openings variable and other data items published in the Job Openings and Labor Turnover Survey administered by the U.S. Bureau of Labor Statistics.

Submitted 25 October, 2022; originally announced October 2022.

Comments: 5 figures, 3 tables

arXiv:2208.14541

Methods for Combining Probability and Nonprobability Samples Under Unknown Overlaps

Authors: Terrance D. Savitsky, Matthew R. Williams, Julie Gershunskaya, Vladislav Beresovsky, Nels G. Johnson

Abstract: Nonprobability (convenience) samples are increasingly sought to reduce the estimation variance for one or more population variables of interest that are estimated using a randomized survey (reference) sample by increasing the effective sample size. Estimation of a population quantity derived from a convenience sample will typically result in bias since the distribution of variables of interest in the convenience sample is different from the population distribution. A recent set of approaches estimates inclusion probabilities for convenience sample units by specifying reference sample-weighted pseudo likelihoods. This paper introduces a novel approach that derives the propensity score for the observed sample as a function of inclusion probabilities for the reference and convenience samples as our main result. Our approach allows specification of a likelihood directly for the observed sample as opposed to the approximate or pseudo likelihood. We construct a Bayesian hierarchical formulation that simultaneously estimates sample propensity scores and the convenience sample inclusion probabilities. We use a Monte Carlo simulation study to compare our likelihood based results with the pseudo likelihood based approaches considered in the literature.

Submitted 9 June, 2023; v1 submitted 30 August, 2022; originally announced August 2022.

Comments: 37 pages, 11 figures. arXiv admin note: substantial text overlap with arXiv:2204.02271

arXiv:2205.05003

Mechanisms for Global Differential Privacy under Bayesian Data Synthesis

Authors: Jingchen Hu, Matthew R. Williams, Terrance D. Savitsky

Abstract: This paper introduces a new method that embeds any Bayesian model used to generate synthetic data and converts it into a differentially private (DP) mechanism. We propose an alteration of the model synthesizer to utilize a censored likelihood that induces upper and lower bounds of [$\exp(-ε/ 2), \exp(ε/ 2)$], where $ε$ denotes the level of the DP guarantee. This censoring mechanism equipped with an $ε-$DP guarantee will induce distortion into the joint parameter posterior distribution by flattening or shifting the distribution towards a weakly informative prior. To minimize the distortion in the posterior distribution induced by likelihood censoring, we embed a vector-weighted pseudo posterior mechanism within the censoring mechanism. The pseudo posterior is formulated by selectively downweighting each likelihood contribution proportionally to its disclosure risk. On its own, the pseudo posterior mechanism produces a weaker asymptotic differential privacy (aDP) guarantee. After embedding in the censoring mechanism, the DP guarantee becomes strict such that it does not rely on asymptotics. We demonstrate that the pseudo posterior mechanism creates synthetic data with the highest utility at the price of a weaker, aDP guarantee, while embedding the pseudo posterior mechanism in the proposed censoring mechanism produces synthetic data with a stronger, non-asymptotic DP guarantee at the cost of slightly reduced utility. The perturbed histogram mechanism is included for comparison.

Submitted 3 August, 2023; v1 submitted 10 May, 2022; originally announced May 2022.

arXiv:2204.02271

Methods for Combining Probability and Nonprobability Samples Under Unknown Overlaps

Authors: Terrance D. Savitsky, Matthew R. Williams, Julie Gershunskaya, Vladislav Beresovsky, Nels G. Johnson

Abstract: Nonprobability (convenience) samples are increasingly sought to stabilize estimations for one or more population variables of interest that are performed using a randomized survey (reference) sample by increasing the effective sample size. Estimation of a population quantity derived from a convenience sample will typically result in bias since the distribution of variables of interest in the convenience sample is different from the population. A recent set of approaches estimates conditional (on sampling design predictors) inclusion probabilities for convenience sample units by specifying reference sample-weighted pseudo likelihoods. This paper introduces a novel approach that derives the propensity score for the observed sample as a function of conditional inclusion probabilities for the reference and convenience samples as our main result. Our approach allows specification of an exact likelihood for the observed sample. We construct a Bayesian hierarchical formulation that simultaneously estimates sample propensity scores and both conditional and reference sample inclusion probabilities for the convenience sample units. We compare our exact likelihood with the pseudo likelihoods in a Monte Carlo simulation study.

Submitted 9 June, 2023; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: Duplication with arXiv:2208.14541

arXiv:2101.06237

Fully Bayesian Estimation under Dependent and Informative Cluster Sampling

Authors: Luis G. Leon-Novelo, Terrance D. Savitsky

Abstract: Survey data are often collected under multistage sampling designs where units are binned to clusters that are sampled in a first stage. The unit-indexed population variables of interest are typically dependent within cluster. We propose a Fully Bayesian method that constructs an exact likelihood for the observed sample to incorporate unit-level marginal sampling weights for performing unbiased inference for population parameters while simultaneously accounting for the dependence induced by sampling clusters of units to produce correct uncertainty quantification. Our approach parameterizes cluster-indexed random effects in both a marginal model for the response and a conditional model for published, unit-level sampling weights. We compare our method to plug-in Bayesian and frequentist alternatives in a simulation study and demonstrate that our method most closely achieves correct uncertainty quantification for model parameters, including the generating variances for cluster-indexed random effects. We demonstrate our method in an application with NHANES data.

Submitted 24 August, 2021; v1 submitted 15 January, 2021; originally announced January 2021.

Comments: Total of 22 pages including 3 figures and 4 tables

arXiv:2101.06188

Private Tabular Survey Data Products through Synthetic Microdata Generation

Authors: Jingchen Hu, Terrance D. Savitsky, Matthew R. Williams

Abstract: We propose two synthetic microdata approaches to generate private tabular survey data products for public release. We adapt a pseudo posterior mechanism that downweights by-record likelihood contributions with weights $\in [0,1]$ based on their identification disclosure risks to producing tabular products for survey data. Our method applied to an observed survey database achieves an asymptotic global probabilistic differential privacy guarantee. Our two approaches synthesize the observed sample distribution of the outcome and survey weights, jointly, such that both quantities together possess a privacy guarantee. The privacy-protected outcome and survey weights are used to construct tabular cell estimates (where the cell inclusion indicators are treated as known and public) and associated standard errors to correct for survey sampling bias. Through a real data application to the Survey of Doctorate Recipients public use file and simulation studies motivated by the application, we demonstrate that our two microdata synthesis approaches to construct tabular products provide superior utility preservation as compared to the additive-noise approach of the Laplace Mechanism. Moreover, our approaches allow the release of microdata to the public, enabling additional analyses at no extra privacy cost.

Submitted 3 March, 2022; v1 submitted 15 January, 2021; originally announced January 2021.

arXiv:2006.01230

Re-weighting of Vector-weighted Mechanisms for Utility Maximization under Differential Privacy

Authors: Terrance D. Savitsky, Jingchen Hu, Matthew R. Williams

Abstract: We address practical implementation of a risk-weighted pseudo posterior synthesizer for microdata dissemination with a new re-weighting strategy that maximizes utility of released synthetic data at any level of formal privacy guarantee. Our re-weighting strategy applies to any vector-weighted pseudo posterior mechanism under which a vector of observation-indexed weights is used to downweight likelihood contributions for high disclosure risk records. We demonstrate our method on two different vector-weighted schemes that target high-risk records. Our new method for constructing record-indexed downweighting maximizes the data utility under any privacy budget for the vector-weighted synthesizers by adjusting the by-record weights, such that their individual Lipschitz bounds approach the bound for the entire database. Our method achieves an $(ε= 2 Δ_{\boldsymbolα})-$asymptotic differential privacy (aDP) guarantee, globally, over the space of databases. We illustrate our methods using simulated highly skewed count data and compare the results to a scalar-weighted synthesizer under the Exponential Mechanism (EM). We also apply our methods to a sample of the Survey of Doctorate Recipients and demonstrate the practicality of our methods.

Submitted 28 April, 2022; v1 submitted 1 June, 2020; originally announced June 2020.

arXiv:2006.00783

Distributed Bayesian Varying Coefficient Modeling Using a Gaussian Process Prior

Authors: Rajarshi Guhaniyogi, Cheng Li, Terrance D. Savitsky, Sanvesh Srivastava

Abstract: Varying coefficient models (VCMs) are widely used for estimating nonlinear regression functions for functional data. Their Bayesian variants using Gaussian process priors on the functional coefficients, however, have received limited attention in massive data applications, mainly due to the prohibitively slow posterior computations using Markov chain Monte Carlo (MCMC) algorithms. We address this problem using a divide-and-conquer Bayesian approach. We first create a large number of data subsamples with much smaller sizes. Then, we formulate the VCM as a linear mixed-effects model and develop a data augmentation algorithm for obtaining MCMC draws on all the subsets in parallel. Finally, we aggregate the MCMC-based estimates of subset posteriors into a single Aggregated Monte Carlo (AMC) posterior, which is used as a computationally efficient alternative to the true posterior distribution. Theoretically, we derive minimax optimal posterior convergence rates for the AMC posteriors of both the varying coefficients and the mean regression function. We provide quantification on the orders of subset sample sizes and the number of subsets. The empirical results show that the combination schemes that satisfy our theoretical assumptions, including the AMC posterior, have better estimation performance than their main competitors across diverse simulations and in a real data analysis.

Submitted 25 February, 2022; v1 submitted 1 June, 2020; originally announced June 2020.

arXiv:2004.06191

Pseudo Bayesian Estimation of One-way ANOVA Model in Complex Surveys

Authors: Terrance D. Savitsky, Matthew R. Williams, Sanvesh Srivastava

Abstract: We devise survey-weighted pseudo posterior distribution estimators under two-stage informative sampling of both primary clusters and secondary nested units for a one-way analysis of variance (ANOVA) population generating model as a simple canonical case where population model random effects are defined to be coincident with the primary clusters, for example student performance based on a survey of schools and students such as the 2000 OECD Programme for International Student Assessment (PISA). We consider estimation on an observed informative sample under both an augmented pseudo likelihood that co-samples the random effects, as well as an integrated likelihood that marginalizes out the random effects from the survey-weighted augmented pseudo likelihood. This paper includes a theoretical exposition that enumerates easily verified conditions for which estimation under the augmented pseudo posterior is guaranteed to be consistent at the true generating parameters. We reveal in simulation that both approaches produce asymptotically unbiased estimation of the generating hyperparameters for the random effects when a key condition on the sum of within cluster weighted residuals is met. We present a comparison with two frequentist alternatives, an expectation-maximization approach and a composite likelihood method that requires pairwise sampling weights.

Submitted 12 May, 2023; v1 submitted 13 April, 2020; originally announced April 2020.

Comments: 45 pages, 12 figures

MSC Class: 62D05; 62F15; 62J05

arXiv:1909.11796

Bayesian Pseudo Posterior Mechanism under Asymptotic Differential Privacy

Authors: Terrance D. Savitsky, Matthew R. Williams, Jingchen Hu

Abstract: We propose a Bayesian pseudo posterior mechanism to generate record-level synthetic databases equipped with an $(ε,δ)-$ probabilistic differential privacy (pDP) guarantee, where $δ$ denotes the probability that any observed database exceeds $ε$. The pseudo posterior mechanism employs a data record-indexed, risk-based weight vector with weight values $\in [0, 1]$ that surgically downweight the likelihood contributions for high-risk records for model estimation and the generation of record-level synthetic data for public release. The pseudo posterior synthesizer constructs a weight for each data record using the Lipschitz bound for that record under a log-pseudo likelihood utility function that generalizes the exponential mechanism (EM) used to construct a formally private data generating mechanism. By selecting weights to remove likelihood contributions with non-finite log-likelihood values, we guarantee a finite local privacy guarantee for our pseudo posterior mechanism at every sample size. Our results may be applied to \emph{any} synthesizing model envisioned by the data disseminator in a computationally tractable way that only involves estimation of a pseudo posterior distribution for parameters, $θ$, unlike recent approaches that use naturally-bounded utility functions implemented through the EM. We specify mild conditions that guarantee the asymptotic contraction of $δ$ to $0$ over the space of databases. We illustrate our pseudo posterior mechanism on the sensitive family income variable from the Consumer Expenditure Surveys database published by the U.S. Bureau of Labor Statistics. We show that utility is better preserved in the synthetic data for our pseudo posterior mechanism as compared to the EM, both estimated using the same non-private synthesizer, due to our use of targeted downweighting.

Submitted 13 August, 2021; v1 submitted 25 September, 2019; originally announced September 2019.

Comments: 35 pages, 7 figures, 2 tables

arXiv:1908.07639

doi 10.1093/jssam/smab013

Risk-Efficient Bayesian Data Synthesis for Privacy Protection

Authors: Jingchen Hu, Terrance D. Savitsky, Matthew R. Williams

Abstract: Statistical agencies utilize models to synthesize respondent-level data for release to the public for privacy protection. In this work, we efficiently induce privacy protection into any Bayesian synthesis model by employing a pseudo likelihood that exponentiates each likelihood contribution by an observation record-indexed weight in [0, 1], defined to be inversely proportional to the identification risk for that record. We start with the marginal probability of identification risk for a record, which is composed as the probability that the identity of the record may be disclosed. Our application to the Consumer Expenditure Surveys (CE) of the U.S. Bureau of Labor Statistics demonstrates that the marginally risk-adjusted synthesizer provides an overall improved privacy protection; however, the identification risks actually increase for some moderate-risk records after risk-adjusted pseudo posterior estimation synthesis due to increased isolation after weighting; a phenomenon we label "whack-a-mole". We proceed to construct a weight for each record from a collection of pairwise identification risk probabilities with other records, where each pairwise probability measures the joint probability of re-identification of the pair of records, which mitigates the whack-a-mole issue and produces a more efficient set of synthetic data with lower risk and higher utility for the CE data.

Submitted 8 February, 2021; v1 submitted 20 August, 2019; originally announced August 2019.

Journal ref: Journal of Survey Statistics and Methodology, 2021
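
The abstract above describes a pseudo likelihood that raises each record's likelihood contribution to a weight in [0, 1]. On the log scale this is a weighted sum of per-record log-likelihoods. The sketch below illustrates that arithmetic for a normal model, which is chosen purely for illustration. The paper's actual synthesizer and its risk-based weights are not reproduced here, and all names are hypothetical.

```python
import math


def normal_logpdf(y, mu, sigma):
    """Log density of a Normal(mu, sigma^2) distribution at y."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2.0 * sigma ** 2)


def log_pseudo_likelihood(y, weights, mu, sigma):
    """Sum of w_i * log p(y_i | mu, sigma), i.e. log of prod p(y_i)^(w_i).

    weights -- one value in [0, 1] per record; assumed to be supplied,
               for example lower values for higher-risk records.
    """
    if len(y) != len(weights):
        raise ValueError("y and weights must have equal length")
    if any(not (0.0 <= w <= 1.0) for w in weights):
        raise ValueError("weights must lie in [0, 1]")
    return sum(w * normal_logpdf(yi, mu, sigma) for yi, w in zip(y, weights))
```

A weight of 1 leaves a record's contribution unchanged, while a weight of 0 removes the record from estimation entirely.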

arXiv:1904.07680

Pseudo Bayesian Mixed Models under Informative Sampling

Authors: Terrance D. Savitsky, Matthew R. Williams

Abstract: When random effects are correlated with sample design variables, the usual approach of employing individual survey weights (constructed to be inversely proportional to the unit survey inclusion probabilities) to form a pseudo-likelihood no longer produces asymptotically unbiased inference. We construct a weight-exponentiated formulation for the random effects distribution that achieves unbiased inference for generating hyperparameters of the random effects. We contrast our approach with frequentist methods that rely on numerical integration to reveal that only the Bayesian method achieves both unbiased estimation with respect to the sampling design distribution and consistency with respect to the population generating distribution. Our simulations and real data example for a survey of business establishments demonstrate the utility of our approach across different modeling formulations and sampling designs. This work serves as a capstone for recent developmental efforts that combine traditional survey estimation approaches with the Bayesian modeling paradigm and provides a bridge across the two rich but disparate sub-fields.

Submitted 24 August, 2021; v1 submitted 16 April, 2019; originally announced April 2019.

Comments: 31 pages, 6 figures, 2 tables

MSC Class: 62F15; 62D05

arXiv:1901.06462

Bayesian Pseudo Posterior Synthesis for Data Privacy Protection

Authors: Jingchen Hu, Terrance D. Savitsky

Abstract: Statistical agencies utilize models to synthesize respondent-level data for release to the general public as an alternative to the actual data records. A Bayesian model synthesizer encodes privacy protection by employing a hierarchical prior construction that induces smoothing of the real data distribution. Synthetic respondent-level data records are often preferred to summary data tables due to the many possible uses by researchers and data analysts. Agencies balance a trade-off between utility of the synthetic data versus disclosure risks and hold a specific target threshold for disclosure risk before releasing synthetic datasets. We introduce a pseudo posterior likelihood that exponentiates each contribution by an observation record-indexed weight in (0, 1), defined to be inversely proportional to the disclosure risk for that record in the synthetic data. Our use of a vector of weights allows more precise downweighting of high risk records in a fashion that better preserves utility as compared with using a scalar weight. We illustrate our method with a simulation study and an application to the Consumer Expenditure Survey of the U.S. Bureau of Labor Statistics. We demonstrate how the frequentist consistency and uncertainty quantification are affected by the inverse risk-weighting.

Submitted 15 May, 2020; v1 submitted 18 January, 2019; originally announced January 2019.

Comments: This is to replace arXiv:1908.07639

arXiv:1901.03791 [pdf, other]

Optimization of Survey Weights under a Large Number of Conflicting Constraints

Authors: Matthew R. Williams, Terrance D. Savitsky

Abstract: In the analysis of survey data, sampling weights are needed for consistent estimation of the population. However, the original inverse probability weights from the survey sample design are typically modified to account for non-response, to increase efficiency by incorporating auxiliary population information, and to reduce the variability in estimates due to extreme weights. It is often the case that no single set of weights can be found which successfully incorporates all of these modifications because together they induce a large number of constraints and restrictions on the feasible solution space. For example, a unique combination of categorical variables may not be present in the sample data, even if the corresponding population level information is available. Additional requirements for weights to fall within specified ranges may also lead to fewer population level adjustments being incorporated. We present a framework and accompanying computational methods to address this issue of constraint achievement or selection within a restricted space that will produce revised weights with reasonable properties. By combining concepts from generalized raking, ridge and lasso regression, benchmarking of small area estimates, augmentation of state-space equations, path algorithms, and data-cloning, this framework simultaneously selects constraints and provides diagnostics suggesting why a fully constrained solution is not possible. Combinatoric operations such as brute force evaluations of all possible combinations of constraints and restrictions are avoided.
We demonstrate this framework by applying alternative methods to post-stratification for the National Survey on Drug Use and Health. We also discuss strategies for scaling up to even larger data sets. Computations were performed in R and code is available from the authors.

Submitted 11 January, 2019; originally announced January 2019.

Comments: 23 pages, 2 figures, 3 tables

arXiv:1809.10074 [pdf, other]

Bayesian Data Synthesis and Disclosure Risk Quantification: An Application to the Consumer Expenditure Surveys

Authors: Jingchen Hu, Terrance D. Savitsky

Abstract: The release of synthetic data generated from a model estimated on the data helps statistical agencies disseminate respondent-level data with high utility and privacy protection. Motivated by the challenge of disseminating sensitive variables containing geographic information in the Consumer Expenditure Surveys (CE) at the U.S. Bureau of Labor Statistics, we propose two non-parametric Bayesian models as data synthesizers for the county identifier of each data record: a Bayesian latent class model and a Bayesian areal model. Both data synthesizers use Dirichlet Process priors to cluster observations of similar characteristics and allow borrowing information across observations. We develop innovative disclosure risk measures to quantify inherent risks in the confidential CE data and how those data risks are ameliorated by our proposed synthesizers. By creating a lower bound and an upper bound of disclosure risks under minimum and maximum disclosure risk scenarios, respectively, our proposed inherent risk measures provide a range of acceptable disclosure risks for evaluating risk levels in the synthetic datasets.

Submitted 2 February, 2021; v1 submitted 26 September, 2018; originally announced September 2018.

arXiv:1807.11796 [pdf, other]

doi 10.1111/insr.12376

Bayesian Uncertainty Estimation Under Complex Sampling

Authors: Matthew R. Williams, Terrance D. Savitsky

Abstract: Social and economic studies are often implemented as complex survey designs. For example, multistage, unequal probability sampling designs utilized by federal statistical agencies are typically constructed to maximize the efficiency of the target domain level estimator (e.g., indexed by geographic area) within cost constraints for survey administration. Such designs may induce dependence between the sampled units; for example, with employment of a sampling step that selects geographically-indexed clusters of units. A sampling-weighted pseudo-posterior distribution may be used to estimate the population model on the observed sample. The dependence induced between co-clustered units inflates the scale of the resulting pseudo-posterior covariance matrix that has been shown to induce undercoverage of the credibility sets. By bridging results across Bayesian model misspecification and survey sampling, we demonstrate that the scale and shape of the asymptotic distributions are different between each of the pseudo-MLE, the pseudo-posterior and the MLE under simple random sampling. Through insights from survey sampling variance estimation and recent advances in computational methods, we devise a correction applied as a simple and fast post-processing step to MCMC draws of the pseudo-posterior distribution. This adjustment projects the pseudo-posterior covariance matrix such that the nominal coverage is approximately achieved.
We make an application to the National Survey on Drug Use and Health as a motivating example and we demonstrate the efficacy of our scale and shape projection procedure on synthetic data on several common archetypes of survey designs.

Submitted 29 July, 2019; v1 submitted 31 July, 2018; originally announced July 2018.

Comments: 45 pages, 4 figures, 1 table

MSC Class: 62D05; 62F15; 62F12

Journal ref: International Statistical Review 2020

arXiv:1807.05066 [pdf, other]

doi 10.1214/18-BA1143

Bayesian Estimation Under Informative Sampling with Unattenuated Dependence

Authors: Matthew R. Williams, Terrance D. Savitsky

Abstract: An informative sampling design leads to unit inclusion probabilities that are correlated with the response variable of interest. However, multistage sampling designs may also induce higher order dependencies, which are typically ignored in the literature when establishing consistency of estimators for survey data under a condition requiring asymptotic independence among the unit inclusion probabilities. We refine and relax this condition of asymptotic independence or asymptotic factorization and demonstrate that consistency is still achieved in the presence of residual sampling dependence. A popular approach for conducting inference on a population based on a survey sample is the use of a pseudo-posterior, which uses sampling weights based on first order inclusion probabilities to exponentiate the likelihood. We show that the pseudo-posterior is consistent not only for survey designs which have asymptotic factorization, but also for designs with residual or unattenuated dependence. Using the complex sampling design of the National Survey on Drug Use and Health, we explore the impact of multistage designs and order based sampling. The use of the survey-weighted pseudo-posterior together with our relaxed requirements for the survey design establish a broad class of analysis models that can be applied to a wide variety of survey data sets.

Submitted 12 July, 2018; originally announced July 2018.

Comments: 35 pages, 5 figures. arXiv admin note: text overlap with arXiv:1710.10102

Journal ref: Bayesian Anal., advance publication, 4 January 2019

arXiv:1712.09767 [pdf, other]

A Divide-and-Conquer Bayesian Approach to Large-Scale Kriging

Authors: Rajarshi Guhaniyogi, Cheng Li, Terrance D. Savitsky, Sanvesh Srivastava

Abstract: We propose a three-step divide-and-conquer strategy within the Bayesian paradigm that delivers massive scalability for any spatial process model. We partition the data into a large number of subsets, apply a readily available Bayesian spatial process model on every subset, in parallel, and optimally combine the posterior distributions estimated across all the subsets into a pseudo-posterior distribution that conditions on the entire data. The combined pseudo posterior distribution replaces the full data posterior distribution for predicting the responses at arbitrary locations and for inference on the model parameters and spatial surface. Based on distributed Bayesian inference, our approach is called "Distributed Kriging" (DISK) and offers significant advantages in massive data applications where the full data are stored across multiple machines. We show theoretically that the Bayes $L_2$-risk of the DISK posterior distribution achieves the near optimal convergence rate in estimating the true spatial surface with various types of covariance functions, and provide upper bounds for the number of subsets as a function of the full sample size. The model-free feature of DISK is demonstrated by scaling posterior computations in spatial process models with a stationary full-rank and a nonstationary low-rank Gaussian process (GP) prior. A variety of simulations and a geostatistical analysis of the Pacific Ocean sea surface temperature data validate our theoretical results.

Submitted 12 June, 2019; v1 submitted 28 December, 2017; originally announced December 2017.

Comments: 29 pages, including 4 figures and 5 tables

arXiv:1710.10102 [pdf, other]

doi 10.1214/18-EJS1435

Bayesian Pairwise Estimation Under Dependent Informative Sampling

Authors: Matthew R. Williams, Terrance D. Savitsky

Abstract: An informative sampling design leads to the selection of units whose inclusion probabilities are correlated with the response variable of interest. Model inference performed on the resulting observed sample will be biased for the population generative model. One approach that produces asymptotically unbiased inference employs marginal inclusion probabilities to form sampling weights used to exponentiate each likelihood contribution of a pseudo likelihood used to form a pseudo posterior distribution. Conditions for posterior consistency restrict applicable sampling designs to those under which pairwise inclusion dependencies asymptotically limit to 0. There are many sampling designs excluded by this restriction; for example, a multi-stage design that samples individuals within households. Viewing each household as a population, the dependence among individuals does not attenuate. We propose a more targeted approach in this paper for inference focused on pairs of individuals or sampled units; for example, the substance use of one spouse in a shared household, conditioned on the substance use of the other spouse. We formulate the pseudo likelihood with weights based on pairwise or second order probabilities and demonstrate consistency, removing the requirement for asymptotic independence and replacing it with restrictions on higher order selection probabilities. Our approach provides a nearly automated estimation procedure applicable to any model specified by the data analyst. We demonstrate our method on the National Survey on Drug Use and Health.

Submitted 27 October, 2017; originally announced October 2017.

Comments: 35 pages, 9 figures

MSC Class: 62D05; 62G20

Journal ref: Electron. J. Statist. Volume 12, Number 1 (2018), 1631-1661

arXiv:1710.00019 [pdf, other]

Fully Bayesian Estimation Under Informative Sampling

Authors: Luis G. Leon-Novelo, Terrance D. Savitsky

Abstract: Bayesian estimation is increasingly popular for performing model based inference to support policymaking. These data are often collected from surveys under informative sampling designs where subject inclusion probabilities are designed to be correlated with the response variable of interest. Sampling weights constructed from marginal inclusion probabilities are typically used to form an exponentiated pseudo likelihood that adjusts the population likelihood for estimation on the sample due to ease-of-estimation. We propose an alternative adjustment based on a Bayes rule construction that simultaneously performs weight smoothing and estimates the population model parameters in a fully Bayesian construction. We formulate conditions on known marginal and pairwise inclusion probabilities that define a class of sampling designs where $L_{1}$ consistency of the joint posterior is guaranteed. We compare performances between the two approaches on synthetic data, which reveals that our fully Bayesian approach better estimates posterior uncertainty without a requirement to calibrate the normalization of the sampling weights. We demonstrate our method on an application concerning the National Health and Nutrition Examination Survey exploring the relationship between caffeine consumption and systolic blood pressure.

Submitted 11 July, 2018; v1 submitted 29 September, 2017; originally announced October 2017.

Comments: Pages 1-29 form the main paper and include seven figures and three tables. Pages 30-36 contain Supplementary Material and pages 36-37 contain references

arXiv:1606.07488 [pdf, other]

Scalable Bayes under Informative Sampling

Authors: Terrance D. Savitsky, Sanvesh Srivastava

Abstract: The United States Bureau of Labor Statistics collects data using survey instruments under informative sampling designs that assign probabilities of inclusion to be correlated with the response. The bureau extensively uses Bayesian hierarchical models and posterior sampling to impute missing items in respondent-level data and to infer population parameters. Posterior sampling for survey data collected based on informative designs is computationally expensive and does not support production schedules of the bureau. Motivated by this problem, we propose a new method to scale Bayesian computations in informative sampling designs. Our method divides the data into smaller subsets, performs posterior sampling in parallel for every subset, and combines the collection of posterior samples from all the subsets through their mean in the Wasserstein space of order 2. Theoretically, we construct conditions on a class of sampling designs where posterior consistency of the proposed method is achieved. Empirically, we demonstrate that our method is competitive with traditional methods while being significantly faster in many simulations and in the Current Employment Statistics survey conducted by the bureau.

Submitted 24 October, 2017; v1 submitted 23 June, 2016; originally announced June 2016.

Comments: 34 pages, 6 figures, 2 tables

arXiv:1511.05360 [pdf, ps, other]

doi 10.1214/15-AOAS833

Inferring constructs of effective teaching from classroom observations: An application of Bayesian exploratory factor analysis without restrictions

Authors: J. R. Lockwood, Terrance D. Savitsky, Daniel F. McCaffrey

Abstract: Ratings of teachers' instructional practices using standardized classroom observation instruments are increasingly being used for both research and teacher accountability. There are multiple instruments in use, each attempting to evaluate many dimensions of teaching and classroom activities, and little is known about what underlying teaching quality attributes are being measured. We use data from multiple instruments collected from 458 middle school mathematics and English language arts teachers to inform research and practice on teacher performance measurement by modeling latent constructs of high-quality teaching. We make inferences about these constructs using a novel approach to Bayesian exploratory factor analysis (EFA) that, unlike commonly used approaches for identifying factor loadings in Bayesian EFA, is invariant to how the data dimensions are ordered. Applying this approach to ratings of lessons reveals two distinct teaching constructs in both mathematics and English language arts: (1) quality of instructional practices; and (2) quality of teacher management of classrooms. We demonstrate the relationships of these constructs to other indicators of teaching quality, including teacher content knowledge and student performance on standardized tests.

Submitted 17 November, 2015; originally announced November 2015.

Comments: Published at http://dx.doi.org/10.1214/15-AOAS833 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS833

Journal ref: Annals of Applied Statistics 2015, Vol. 9, No. 3, 1484-1509

arXiv:1508.00615 [pdf, other]

Bayesian Nonparametric Functional Mixture Estimation for Time-Series Data, With Application to Estimation of State Employment Totals

Authors: Terrance D. Savitsky

Abstract: The U.S. Bureau of Labor Statistics uses monthly, by-state employment totals from the Current Population Survey (CPS) as a key input to develop employment estimates for counties within the states. The monthly CPS by-state totals, however, express high levels of volatility that compromise the accuracy of resulting estimates composed for the counties. Typically-employed models for small area estimation produce de-noised, state-level employment estimates by borrowing information over the survey months, but assume independence among the collection of by-state time series, which is typically violated due to similarities in their underlying economies. We construct Gaussian process and Gaussian Markov random field alternative functional prior specifications, each in a mixture of multivariate Gaussian distributions with a Dirichlet process (DP) mixing measure over the parameters of their covariance or precision matrices. Our DP mixture of functions models allow the data to simultaneously estimate a dependence among the months and between states. A feature of our models is that those functions assigned to the same cluster are drawn from a distribution with the same covariance parameters, so that they are similar, but don't have to be identical. We compare the performances of our two alternatives on synthetic data and apply them to recover de-noised, by-state CPS employment totals for data from $2000-2013$.

Submitted 3 August, 2015; originally announced August 2015.

Comments: 30 pages, 9 figures

arXiv:1508.00604 [pdf, other]

Bayesian Nonparameteric Multiresolution Estimation for the American Community Survey

Authors: Terrance D. Savitsky

Abstract: Bayesian hierarchical methods implemented for small area estimation focus on reducing the noise variation in published government official statistics by borrowing information among dependent response values. Even the most flexible models confine parameters defined at the finest scale to link to each data observation in a one-to-one construction. We propose a Bayesian multiresolution formulation that utilizes an ensemble of observations at a variety of coarse scales in space and time to additively nest parameters we define at a finer scale, which serve as our focus for estimation. Our construction is motivated by and applied to the estimation of 1-year period employment levels, indexed by county, from statistics published at coarser areal domains and multi-year intervals in the American Community Survey (ACS). We construct a nonparametric mixture of Gaussian processes as the prior on a set of regression coefficients of county-indexed latent functions over multiple survey years. We evaluate a modified Dirichlet process prior that incorporates county-year predictors as the mixing measure. Each county-year parameter of a latent function is estimated from multiple coarse scale observations in space and time to which it links. The multiresolution formulation is evaluated on synthetic data and applied to the ACS.

Submitted 3 August, 2015; originally announced August 2015.

Comments: 35 pages, 11 figures

arXiv:1507.07050 [pdf, other]

Bayesian Estimation Under Informative Sampling

Authors: Terrance D. Savitsky, Daniell Toth

Abstract: Bayesian analysis is increasingly popular for use in social science and other application areas where the data are observations from an informative sample. An informative sampling design leads to inclusion probabilities that are correlated with the response variable of interest. Model inference performed on the observed sample taken from the population will be biased for the population generative model under informative sampling since the balance of information in the sample data is different from that for the population. Typical approaches to account for an informative sampling design under Bayesian estimation are often difficult to implement because they require re-parameterization of the hypothesized generating model, or focus on design, rather than model-based, inference. We propose to construct a pseudo-posterior distribution that utilizes sampling weights based on the marginal inclusion probabilities to exponentiate the likelihood contribution of each sampled unit, which weights the information in the sample back to the population. Our approach provides a nearly automated estimation procedure applicable to any model specified by the data analyst for the population and retains the population model parameterization and posterior sampling geometry. We construct conditions on known marginal and pairwise inclusion probabilities that define a class of sampling designs where $L_{1}$ consistency of the pseudo posterior is guaranteed. We demonstrate our method on an application concerning the Bureau of Labor Statistics Job Openings and Labor Turnover Survey.

Submitted 3 June, 2016; v1 submitted 24 July, 2015; originally announced July 2015.

Comments: 24 pages, 3 figures

arXiv:1312.1856 [pdf, ps, other]

doi 10.1214/12-AOAS620

Bayesian nonparametric hierarchical modeling for multiple membership data in grouped attendance interventions

Authors: Terrance D. Savitsky, Susan M. Paddock

Abstract: We develop a dependent Dirichlet process (DDP) model for repeated measures multiple membership (MM) data. This data structure arises in studies under which an intervention is delivered to each client through a sequence of elements which overlap with those of other clients on different occasions. Our interest concentrates on study designs for which the overlaps of sequences occur for clients who receive an intervention in a shared or grouped fashion whose memberships may change over multiple treatment events. Our motivating application focuses on evaluation of the effectiveness of a group therapy intervention with treatment delivered through a sequence of cognitive behavioral therapy session blocks, called modules. An open-enrollment protocol permits entry of clients at the beginning of any new module in a manner that may produce unique MM sequences across clients. We begin with a model that composes an addition of client and multiple membership module random effect terms, which are assumed independent. Our MM DDP model relaxes the assumption of conditionally independent client and module random effects by specifying a collection of random distributions for the client effect parameters that are indexed by the unique set of module attendances. We demonstrate how this construction facilitates examining heterogeneity in the relative effectiveness of group therapy modules over repeated measurement occasions.

Submitted 6 December, 2013; originally announced December 2013.

Comments: Published at http://dx.doi.org/10.1214/12-AOAS620 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS620

Journal ref: Annals of Applied Statistics 2013, Vol. 7, No. 2, 1074-1094

Showing 1–29 of 29 results for author: Savitsky, T D