-
Statistical Inference for Privatized Data with Unknown Sample Size
Authors:
Jordan Awan,
Andres Felipe Barrientos,
Nianqiao Ju
Abstract:
We develop both theory and algorithms to analyze privatized data in the unbounded differential privacy (DP) setting, where even the sample size is considered a sensitive quantity that requires privacy protection. We show that the distance between the sampling distributions under unbounded DP and bounded DP goes to zero as the sample size $n$ goes to infinity, provided that the noise used to privatize $n$ is at an appropriate rate; we also establish that ABC-type posterior distributions converge under similar assumptions. We further give asymptotic results in the regime where the privacy budget for $n$ goes to zero, establishing similarity of sampling distributions as well as showing that the MLE in the unbounded setting converges to the bounded-DP MLE. In order to facilitate valid, finite-sample Bayesian inference on privatized data in the unbounded DP setting, we propose a reversible jump MCMC algorithm which extends the data augmentation MCMC of Ju et al. (2022). We also propose a Monte Carlo EM algorithm to compute the MLE from privatized data in both bounded and unbounded DP. We apply our methodology to analyze a linear regression model as well as a 2019 American Time Use Survey Microdata File which we model using a Dirichlet distribution.
Submitted 10 June, 2024;
originally announced June 2024.
-
The Vela Pulsar Progenitor Was Most Likely a Binary Merger
Authors:
Jeremiah W. Murphy,
Andres F. Barrientos,
Rene Andrae,
Joseph Guzman,
Benjamin F. Williams,
Julianne J. Dalcanton,
Brad Koplitz
Abstract:
Stellar evolution theory restricted to single stars predicts a minimum mass for core-collapse supernovae (CCSNe) of around eight solar masses; this minimum mass corresponds to a maximum age of around 45 million years for the progenitor and the coeval population of stars. Binary evolution complicates this prediction. For example, an older stellar population around 100 million years could contain stellar mergers that reach the minimum mass for core collapse. Despite this clear prediction by binary evolution, there are few, if any, CCSNe associated with a distinctly older stellar population...until now. The stellar population within 150 pc of the Vela Pulsar is inconsistent with single-star evolution alone; instead, the most likely solution is that the stellar population is $\ge$80 Myr old, and the brightest stars are mass gainers and/or mergers, the result of binary evolution. The evidence is as follows. Even though the main sequence is clearly dominated by a $\ge$80-Myr-old population, a large fraction of the corresponding red giants is missing. The best-fitting single-star model expects 51.5 red giants, yet there are only 22; the Poisson probability of this is $1.7 \times 10^{-6}$. In addition, there is an overabundance of bright, young-looking stars (25-30 Myr old), yet there is not a corresponding young main sequence (MS). Upon closer inspection, the vast majority of the young-looking stars show either past or current signs of binary evolution. These new results are possible due to exquisite Gaia parallaxes and a new age-dating software called Stellar Ages.
Submitted 6 June, 2024;
originally announced June 2024.
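The Poisson figure quoted in the abstract above can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes the red-giant count follows a Poisson distribution with mean 51.5, and the function names are hypothetical. Under that assumption the point probability of observing exactly 22 comes out close to the quoted $1.7 \times 10^{-6}$, which suggests (as a reading, not a confirmed fact) that the abstract reports the point mass rather than a tail probability. The lower-tail probability is also computed for comparison.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam), evaluated in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam), by direct summation of the pmf."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


# Values quoted in the abstract.
expected_red_giants = 51.5
observed_red_giants = 22

# Probability of exactly 22 red giants when 51.5 are expected.
p_exactly_22 = poisson_pmf(observed_red_giants, expected_red_giants)

# Probability of 22 or fewer, shown for comparison (assumed alternative reading).
p_at_most_22 = poisson_cdf(observed_red_giants, expected_red_giants)

print(f"P(X = 22)  = {p_exactly_22:.2e}")
print(f"P(X <= 22) = {p_at_most_22:.2e}")
```

Either way the deficit of red giants is a rare event under the single-star model, which is the point the abstract makes.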
-
Design, Manufacturing and Open-Loop Control of a Soft Pneumatic Arm
Authors:
Jorge Francisco García-Samartín,
Adrián Rieker,
Antonio Barrientos
Abstract:
Soft robots distinguish themselves from traditional robots by embracing flexible kinematics. Because of their recent emergence, there exist numerous uncharted territories, including novel actuators, manufacturing processes, and advanced control methods. This research is centred on the design, fabrication, and control of a pneumatic soft robot. The principal objective is to develop a modular soft robot featuring multiple segments, each with three degrees of freedom. This yields a tubular structure with five independent degrees of freedom, enabling motion across three spatial dimensions. Physical construction leverages tin-cured silicone and a wax casting method, refined through iterative processes. 3D-printed PLA moulds, filled with silicone, yield the desired model, while bladder-like structures are formed within using solidified paraffin wax positive moulds. For control, an empirically fine-tuned open-loop system is adopted. The project culminates in rigorous testing of bending ability and weight-carrying capacity, and possible applications are discussed.
Submitted 2 January, 2024;
originally announced January 2024.
-
Incompatibilities Between Current Practices in Statistical Data Analysis and Differential Privacy
Authors:
Joshua Snoke,
Claire McKay Bowen,
Aaron R. Williams,
Andrés F. Barrientos
Abstract:
The authors discuss their experience applying differential privacy to a complex data set with the goal of enabling standard approaches to statistical data analysis. They highlight lessons learned and roadblocks encountered, distilling them into incompatibilities between current practices in statistical data analysis and differential privacy that go beyond issues which can be solved with a noisy measurements file. The authors discuss how overcoming these incompatibilities requires compromise and a change in either our approach to statistical data analysis or differential privacy that should be addressed head-on.
Submitted 16 August, 2023;
originally announced September 2023.
-
Long Distance GNSS-Denied Visual Inertial Navigation for Autonomous Fixed Wing Unmanned Air Vehicles: SO(3) Manifold Filter based on Virtual Vision Sensor
Authors:
Eduardo Gallo,
Antonio Barrientos
Abstract:
This article proposes a visual inertial navigation algorithm intended to diminish the horizontal position drift experienced by autonomous fixed wing UAVs (Unmanned Air Vehicles) in the absence of GNSS (Global Navigation Satellite System) signals. In addition to accelerometers, gyroscopes, and magnetometers, the proposed navigation filter relies on the accurate incremental displacement outputs generated by a VO (Visual Odometry) system, denoted here as a Virtual Vision Sensor or VVS, which relies on images of the Earth's surface taken by an onboard camera and is itself assisted by the filter inertial estimations. Although not a full replacement for a GNSS receiver since its position observations are relative instead of absolute, the proposed system enables major reductions in the GNSS-Denied attitude and position estimation errors. In order to minimize the accumulation of errors in the absence of absolute observations, the filter is implemented in the manifold of rigid body rotations or SO(3). Stochastic high fidelity simulations of two representative scenarios involving the loss of GNSS signals are employed to evaluate the results. The authors release the C++ implementation of both the visual inertial navigation filter and the high fidelity simulation as open-source software.
Submitted 7 March, 2023;
originally announced March 2023.
-
Differentially Private Methods for Compositional Data
Authors:
Qi Guo,
Andrés F. Barrientos,
Víctor Peña
Abstract:
Confidential data are increasingly common; some examples are electronic health records, activity data recorded from wearable devices, or geolocation. Differential privacy is a framework that enables statistical analyses while controlling the risk of leaking private information. Compositional data, which consists of vectors with positive components that add up to a constant, has received little attention in the differential privacy literature. In this article, we propose differentially private approaches for analyzing compositional data using the Dirichlet distribution. We consider several methods, including frequentist and Bayesian procedures.
Submitted 3 October, 2023; v1 submitted 11 November, 2022;
originally announced November 2022.
-
Differentially Private Hypothesis Testing with the Subsampled and Aggregated Randomized Response Mechanism
Authors:
Víctor Peña,
Andrés F. Barrientos
Abstract:
Randomized response is one of the oldest and most well-known methods for analyzing confidential data. However, its utility for differentially private hypothesis testing is limited because it cannot achieve high privacy levels and low type I error rates simultaneously. In this article, we show how to overcome this issue with the subsample and aggregate technique. The result is a general-purpose method that can be used for both frequentist and Bayesian testing. We illustrate the performance of our proposal in three scenarios: goodness-of-fit testing for linear regression models, nonparametric testing of a location parameter with the Wilcoxon test, and the nonparametric Kruskal-Wallis test.
Submitted 3 March, 2023; v1 submitted 14 August, 2022;
originally announced August 2022.
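For readers unfamiliar with the randomized response mechanism that the abstract above builds on, here is a minimal sketch of the classical binary version. It is not the authors' subsampled and aggregated method: it only shows the textbook mechanism, in which each respondent reports the true bit with probability $e^{\epsilon}/(1+e^{\epsilon})$ and the flipped bit otherwise, followed by the standard debiasing step. The function names, the seed, and the simulated population are assumptions made for illustration.

```python
import math
import random


def randomized_response(bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit


def debiased_proportion(reports: list, epsilon: float) -> float:
    """Unbiased estimate of the true proportion of ones (requires epsilon > 0).

    E[report] = pi * (2p - 1) + (1 - p), so we invert that linear map.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)


# Hypothetical population: 30% of 20,000 respondents hold the sensitive trait.
rng = random.Random(0)
true_bits = [1] * 6000 + [0] * 14000
reports = [randomized_response(b, epsilon=1.0, rng=rng) for b in true_bits]
estimate = debiased_proportion(reports, epsilon=1.0)
print(f"estimated proportion: {estimate:.3f}")
```

As epsilon shrinks toward zero the denominator `2p - 1` shrinks too, so the variance of the debiased estimate grows; this is one way to see the tension between privacy level and statistical accuracy that the abstract describes.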
-
Nonparametric Bayesian Approach to Treatment Ranking in Network Meta-Analysis with Application to Comparisons of Antidepressants
Authors:
Andrés F. Barrientos,
Garritt L. Page,
Lifeng Lin
Abstract:
Network meta-analysis is a powerful tool to synthesize evidence from independent studies and compare multiple treatments simultaneously. A critical task of performing a network meta-analysis is to offer ranks of all available treatment options for a specific disease outcome. Frequently, the estimated treatment rankings are accompanied by a large amount of uncertainty, suffer from multiplicity issues, and rarely permit ties. These issues make interpreting rankings problematic as they are often treated as absolute metrics. To address these shortcomings, we formulate a ranking strategy that adapts to scenarios with high order uncertainty by producing more conservative results. This improves the interpretability while simultaneously accounting for multiple comparisons. To admit ties between treatment effects, we also develop a Bayesian Nonparametric approach for network meta-analysis. The approach capitalizes on the induced clustering mechanism of Bayesian Nonparametric methods producing a positive probability that two treatment effects are equal. We demonstrate the utility of the procedure through numerical experiments and a network meta-analysis designed to study antidepressant treatments.
Submitted 13 July, 2022;
originally announced July 2022.
-
Profile Monitoring via Eigenvector Perturbation
Authors:
Takayuki Iguchi,
Andrés F. Barrientos,
Eric Chicken,
Debajyoti Sinha
Abstract:
Control charts are often used to monitor the quality characteristics of a process over time to ensure undesirable behavior is quickly detected. The escalating complexity of processes we wish to monitor spurs the need for more flexible control charts such as those used in profile monitoring. Additionally, designing a control chart that has an acceptable false alarm rate for a practitioner is a common challenge. Alarm fatigue can occur if the sampling rate is high (say, once a millisecond) and the control chart is calibrated to an average in-control run length ($ARL_0$) of 200 or 370, which is often done in the literature. As alarm fatigue may not just be an annoyance but may also result in detrimental effects to the quality of the product, control chart designers should seek to minimize the false alarm rate. Unfortunately, reducing the false alarm rate typically comes at the cost of detection delay or average out-of-control run length ($ARL_1$). Motivated by recent work on eigenvector perturbation theory, we develop a computationally fast control chart called the Eigenvector Perturbation Control Chart for nonparametric profile monitoring. The control chart monitors the $l_2$ perturbation of the leading eigenvector of a correlation matrix and requires only a sample of known in-control profiles to determine control limits. Through a simulation study we demonstrate that it is able to outperform its competition by achieving an $ARL_1$ close to or equal to 1 even when the control limits result in a large $ARL_0$ on the order of $10^6$. Additionally, non-zero false alarm rates with a change point after $10^4$ in-control observations were only observed in scenarios that are either pathological or truly difficult for a correlation-based monitoring scheme.
Submitted 30 May, 2022;
originally announced May 2022.
-
GNSS-Denied Semi Direct Visual Navigation for Autonomous UAVs Aided by PI-Inspired Inertial Priors
Authors:
Eduardo Gallo,
Antonio Barrientos
Abstract:
This article proposes a method to diminish the pose (position plus attitude) drift experienced by an SVO (Semi-Direct Visual Odometry) based visual navigation system installed onboard a UAV (Unmanned Air Vehicle) by supplementing its pose estimation nonlinear optimizations with priors based on the outputs of a GNSS (Global Navigation Satellite System) Denied inertial navigation system. The method is inspired by a PI (Proportional Integral) control system, in which the attitude, altitude, and rate of climb inertial outputs act as targets to ensure that the visual estimations do not deviate far from their inertial counterparts. The resulting IA-VNS (Inertially Assisted Visual Navigation System) achieves major reductions in the horizontal position drift inherent to the GNSS-Denied navigation of autonomous fixed wing low SWaP (Size, Weight, and Power) UAVs. Additionally, the IA-VNS can be considered as a virtual incremental position (ground velocity) sensor capable of providing observations to the inertial filter. Stochastic high fidelity Monte Carlo simulations of two representative scenarios involving the loss of GNSS signals are employed to evaluate the results and to analyze their sensitivity to the terrain type overflown by the aircraft as well as to the quality of the onboard sensors on which the priors are based. The author releases the C++ implementation of both the navigation algorithms and the high fidelity simulation as open-source software.
Submitted 22 December, 2022; v1 submitted 26 May, 2022;
originally announced May 2022.
-
A Latent Class Modeling Approach for Generating Synthetic Data and Making Posterior Inferences from Differentially Private Counts
Authors:
Michelle Pistner Nixon,
Andrés F. Barrientos,
Jerome P. Reiter,
Aleksandra Slavković
Abstract:
Several algorithms exist for creating differentially private counts from contingency tables, such as two-way or three-way marginal counts. The resulting noisy counts generally do not correspond to a coherent contingency table, so that some post-processing step is needed if one wants the released counts to correspond to a coherent contingency table. We present a latent class modeling approach for post-processing differentially private marginal counts that can be used (i) to create differentially private synthetic data from the set of marginal counts, and (ii) to enable posterior inferences about the confidential counts. We illustrate the approach using a subset of the 2016 American Community Survey Public Use Microdata Sets and the 2004 National Long Term Care Survey.
Submitted 25 January, 2022;
originally announced January 2022.
-
A Feasibility Study of Differentially Private Summary Statistics and Regression Analyses with Evaluations on Administrative and Survey Data
Authors:
Andrés F. Barrientos,
Aaron R. Williams,
Joshua Snoke,
Claire McKay Bowen
Abstract:
Federal administrative data, such as tax data, are invaluable for research, but because of privacy concerns, access to these data is typically limited to select agencies and a few individuals. An alternative to sharing microlevel data is to allow individuals to query statistics without directly accessing the confidential data. This paper studies the feasibility of using differentially private (DP) methods to make certain queries while preserving privacy. We also include new methodological adaptations to existing DP regression methods for using new data types and returning standard error estimates. We define feasibility as the impact of DP methods on analyses for making public policy decisions and the queries' accuracy according to several utility metrics. We evaluate the methods using Internal Revenue Service data and public-use Current Population Survey data and identify how specific data features might challenge some of these methods. Our findings show that DP methods are feasible for simple, univariate statistics but struggle to produce accurate regression estimates and confidence intervals. To the best of our knowledge, this is the first comprehensive statistical study of DP regression methodology on real, complex datasets, and the findings have significant implications for the direction of a growing research field and public policy.
Submitted 30 June, 2023; v1 submitted 22 October, 2021;
originally announced October 2021.
-
Mixture representations and Bayesian nonparametric inference for likelihood ratio ordered distributions
Authors:
Michael Jauch,
Andrés F. Barrientos,
Víctor Peña,
David S. Matteson
Abstract:
In this article, we introduce mixture representations for likelihood ratio ordered distributions. Essentially, the ratio of two probability densities, or mass functions, is monotone if and only if one can be expressed as a mixture of one-sided truncations of the other. To illustrate the practical value of the mixture representations, we address the problem of density estimation for likelihood ratio ordered distributions. In particular, we propose a nonparametric Bayesian solution which takes advantage of the mixture representations. The prior distribution is constructed from Dirichlet process mixtures and has large support on the space of pairs of densities satisfying the monotone ratio constraint. Posterior consistency holds under reasonable conditions on the prior specification and the true unknown densities. To our knowledge, this is the first posterior consistency result in the literature on order constrained inference. With a simple modification to the prior distribution, we can test the equality of two distributions against the alternative of likelihood ratio ordering. We develop a Markov chain Monte Carlo algorithm for posterior inference and demonstrate the method in a biomedical application.
Submitted 26 October, 2023; v1 submitted 10 October, 2021;
originally announced October 2021.
-
Differentially private methods for managing model uncertainty in linear regression models
Authors:
Víctor Peña,
Andrés F. Barrientos
Abstract:
In this work, we propose differentially private methods for hypothesis testing, model averaging, and model selection for normal linear models. We consider Bayesian methods based on mixtures of $g$-priors and non-Bayesian methods based on likelihood-ratio statistics and information criteria. The procedures are asymptotically consistent and straightforward to implement with existing software. We focus on practical issues such as adjusting critical values so that hypothesis tests have adequate type I error rates and quantifying the uncertainty introduced by the privacy-ensuring mechanisms.
Submitted 29 August, 2023; v1 submitted 8 September, 2021;
originally announced September 2021.
-
Dependent Bayesian nonparametric modeling of compositional data using random Bernstein polynomials
Authors:
Claudia Wehrhahn,
Andrés F. Barrientos,
Alejandro Jara
Abstract:
We discuss Bayesian nonparametric procedures for the regression analysis of compositional responses, that is, data supported on a multivariate simplex. The procedures are based on a modified class of multivariate Bernstein polynomials and on the use of dependent stick-breaking processes. A general model and two simplified versions of the general model are discussed. Appealing theoretical properties such as continuity, association structure, support, and consistency of the posterior distribution are established. Additionally, we exploit the use of spike-and-slab priors for choosing the version of the model that best adapts to the complexity of the underlying true data-generating distribution. The performance of the proposed model is illustrated in a simulation study and in an application to solid waste data from Colombia.
Submitted 30 August, 2021;
originally announced August 2021.
-
Minimization of GNSS-Denied Inertial Navigation Errors for Fixed Wing Autonomous Unmanned Air Vehicles
Authors:
Eduardo Gallo,
Antonio Barrientos
Abstract:
This article proposes an inertial navigation algorithm intended to lower the negative consequences of the absence of GNSS (Global Navigation Satellite System) signals on the navigation of autonomous fixed wing low SWaP (Size, Weight, and Power) UAVs (Unmanned Air Vehicles). In addition to accelerometers and gyroscopes, the filter takes advantage of sensors usually present onboard these platforms, such as magnetometers, Pitot tube, and air vanes, and aims to minimize the attitude error and reduce the position drift (both horizontal and vertical) with the dual objective of improving the aircraft GNSS-Denied inertial navigation capabilities as well as facilitating the fusion of the inertial filter with visual odometry algorithms. Stochastic high fidelity Monte Carlo simulations of two representative scenarios involving the loss of GNSS signals are employed to evaluate the results, compare the proposed filter with more traditional implementations, and analyze the sensitivity of the results to the quality of the onboard sensors. The author releases the C++ implementation of both the navigation filter and the high fidelity simulation as open-source software.
Submitted 11 August, 2021;
originally announced August 2021.
-
Bayesian inferences on uncertain ranks and orderings: Application to ranking players and lineups
Authors:
Andres F. Barrientos,
Deborshee Sen,
Garritt L Page,
David B Dunson
Abstract:
It is common to be interested in rankings or order relationships among entities. In complex settings where one does not directly measure a univariate statistic upon which to base ranks, such inferences typically rely on statistical models having entity-specific parameters. These can be treated as random effects in hierarchical models characterizing variation among the entities. In this paper, we are particularly interested in the problem of ranking basketball players in terms of their contribution to team performance. Using data from the United States National Basketball Association (NBA), we find that many players have similar latent ability levels, making any single estimated ranking highly misleading. The current literature fails to provide summaries of order relationships that adequately account for uncertainty. Motivated by this, we propose a Bayesian strategy for characterizing uncertainty in inferences on order relationships among players and lineups. Our approach adapts to scenarios in which uncertainty in ordering is high by producing more conservative results that improve interpretability. This is achieved through a reward function within a decision theoretic framework. We apply our approach to data from the 2009-10 NBA season.
Submitted 11 April, 2022; v1 submitted 10 July, 2019;
originally announced July 2019.
-
Simultaneous Edit and Imputation for Household Data with Structural Zeros
Authors:
Olanrewaju Akande,
Andrés Barrientos,
Jerome P. Reiter
Abstract:
Multivariate categorical data nested within households often include reported values that fail edit constraints---for example, a participating household reports a child's age as older than his biological parent's age---as well as missing values. Generally, agencies prefer datasets to be free from erroneous or missing values before analyzing them or disseminating them to secondary data users. We present a model-based engine for editing and imputation of household data based on a Bayesian hierarchical model that includes (i) a nested data Dirichlet process mixture of products of multinomial distributions as the model for the true latent values of the data, truncated to allow only households that satisfy all edit constraints, (ii) a model for the location of errors, and (iii) a reporting model for the observed responses in error. The approach propagates uncertainty due to unknown locations of errors and missing values, generates plausible datasets that satisfy all edit constraints, and can preserve multivariate relationships within and across individuals in the same household. We illustrate the approach using data from the 2012 American Community Survey.
Submitted 11 September, 2018; v1 submitted 13 April, 2018;
originally announced April 2018.
-
Multiple Imputation of Missing Values in Household Data with Structural Zeros
Authors:
Olanrewaju Akande,
Jerome Reiter,
Andrés F. Barrientos
Abstract:
We present an approach for imputation of missing items in multivariate categorical data nested within households. The approach relies on a latent class model that (i) allows for household level and individual level variables, (ii) ensures that impossible household configurations have zero probability in the model, and (iii) can preserve multivariate distributions both within households and across households. We present a Gibbs sampler for estimating the model and generating imputations. We also describe strategies for improving the computational efficiency of the model estimation. We illustrate the performance of the approach with data that mimic the variables collected in typical population censuses.
Submitted 4 July, 2018; v1 submitted 18 July, 2017;
originally announced July 2017.
-
Bayesian Bootstraps for Massive Data
Authors:
Andrés F. Barrientos,
Víctor Peña
Abstract:
In this article, we present data-subsetting algorithms that allow for the approximate and scalable implementation of the Bayesian bootstrap. They are analogous to two existing algorithms in the frequentist literature: the bag of little bootstraps (Kleiner et al., 2014) and the subsampled double bootstrap (SDB; Sengupta et al., 2016). Our algorithms have appealing theoretical and computational properties that are comparable to those of their frequentist counterparts. Additionally, we provide a strategy for performing lossless inference for a class of functionals of the Bayesian bootstrap, and briefly introduce extensions to the Dirichlet Process.
Submitted 21 March, 2019; v1 submitted 28 May, 2017;
originally announced May 2017.
-
Differentially private significance tests for regression coefficients
Authors:
Andrés F. Barrientos,
Jerome P. Reiter,
Ashwin Machanavajjhala,
Yan Chen
Abstract:
Many data producers seek to provide users access to confidential data without unduly compromising data subjects' privacy and confidentiality. One general strategy is to require users to do analyses without seeing the confidential data; for example, analysts only get access to synthetic data or query systems that provide disclosure-protected outputs of statistical models. With synthetic data or redacted outputs, the analyst never really knows how much to trust the resulting findings. In particular, if the user did the same analysis on the confidential data, would regression coefficients of interest be statistically significant or not? We present algorithms for assessing this question that satisfy differential privacy. We describe conditions under which the algorithms should give accurate answers about statistical significance. We illustrate the properties of the proposed methods using artificial and genuine data.
Submitted 11 June, 2018; v1 submitted 26 May, 2017;
originally announced May 2017.
-
Providing Access to Confidential Research Data Through Synthesis and Verification: An Application to Data on Employees of the U.S. Federal Government
Authors:
Andrés F. Barrientos,
Alexander Bolton,
Tom Balmat,
Jerome P. Reiter,
John M. de Figueiredo,
Ashwin Machanavajjhala,
Yan Chen,
Charley Kneifel,
Mark DeLong
Abstract:
Data stewards seeking to provide access to large-scale social science data face a difficult challenge. They have to share data in ways that protect privacy and confidentiality, are informative for many analyses and purposes, and are relatively straightforward to use by data analysts. One approach suggested in the literature is that data stewards generate and release synthetic data, i.e., data simulated from statistical models, while also providing users access to a verification server that allows them to assess the quality of inferences from the synthetic data. We present an application of the synthetic data plus verification server approach to longitudinal data on employees of the U.S. federal government. As part of the application, we present a novel model for generating synthetic career trajectories, as well as strategies for generating high dimensional, longitudinal synthetic datasets. We also present novel verification algorithms for regression coefficients that satisfy differential privacy. We illustrate the integrated use of synthetic data plus verification via analysis of differentials in pay by race. The integrated system performs as intended, allowing users to explore the synthetic data for potential pay differentials and learn through verifications which findings in the synthetic data hold up and which do not. The analysis on the confidential data reveals pay differentials across races not documented in published studies.
Submitted 16 June, 2018; v1 submitted 22 May, 2017;
originally announced May 2017.