doi 10.1162/99608f92.79d4660d

Noisy Measurements Are Important, the Design of Census Products Is Much More Important

Abstract: McCartan et al. (2023) call for "making differential privacy work for census data users." This commentary explains why the 2020 Census Noisy Measurement Files (NMFs) are not the best focus for that plea. The August 2021 letter from 62 prominent researchers asking for production of the direct output of the differential privacy system deployed for the 2020 Census signaled the engagement of the scholarly community in the design of decennial census data products. NMFs, the raw statistics produced by the 2020 Census Disclosure Avoidance System before any post-processing, are one component of that design: the query strategy output. The more important component is the query workload output: the statistics released to the public. Optimizing the query workload, specifically the Redistricting Data (P.L. 94-171) Summary File, could allow the privacy-loss budget to be more effectively managed. There could be fewer noisy measurements, no post-processing bias, and direct estimates of the uncertainty from disclosure avoidance for each published statistic.

Submitted 1 May, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

Journal ref: Harvard Data Science Review, Volume 6, Number 2 (Spring, 2024)

arXiv:2312.11283 [pdf, other]

The 2010 Census Confidentiality Protections Failed, Here's How and Why

Authors: John M. Abowd, Tamara Adams, Robert Ashmead, David Darais, Sourya Dey, Simson L. Garfinkel, Nathan Goldschlag, Daniel Kifer, Philip Leclerc, Ethan Lew, Scott Moore, Rolando A. Rodríguez, Ramy N. Tadros, Lars Vilhuber

Abstract: Using only 34 published tables, we reconstruct five variables (census block, sex, age, race, and ethnicity) in the confidential 2010 Census person records. Using the 38-bin age variable tabulated at the census block level, at most 20.1% of reconstructed records can differ from their confidential source on even a single value for these five variables. Using only published data, an attacker can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. The tabular publications in Summary File 1 thus have prohibited disclosure risk similar to the unreleased confidential microdata. Reidentification studies confirm that an attacker can, within blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with nonmodal characteristics) with 95% accuracy, the same precision as the confidential data achieve and far greater than statistical baselines. The flaw in the 2010 Census framework was the assumption that aggregation prevented accurate microdata reconstruction, justifying weaker disclosure limitation methods than were applied to 2010 Census public microdata. The framework used for 2020 Census publications defends against attacks that are based on reconstruction, as we also demonstrate here. Finally, we show that alternatives to the 2020 Census Disclosure Avoidance System with similar accuracy (enhanced swapping) also fail to protect confidentiality, and those that partially defend against reconstruction attacks (incomplete suppression implementations) destroy the primary statutory use case: data for redistricting all legislatures in the country in compliance with the 1965 Voting Rights Act.

Submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.10863 [pdf, ps, other]

Disclosure Avoidance for the 2020 Census Demographic and Housing Characteristics File

Authors: Ryan Cumings-Menon, Robert Ashmead, Daniel Kifer, Philip Leclerc, Matthew Spence, Pavel Zhuravlev, John M. Abowd

Abstract: In "The 2020 Census Disclosure Avoidance System TopDown Algorithm," Abowd et al. (2022) describe the concepts and methods used by the Disclosure Avoidance System (DAS) to produce formally private output in support of the 2020 Census data product releases, with a particular focus on the DAS implementation that was used to create the 2020 Census Redistricting Data (P.L. 94-171) Summary File. In this paper we describe the updates to the DAS that were required to release the Demographic and Housing Characteristics (DHC) File, which provides more granular tables than other data products, such as the Redistricting Data Summary File. We also describe the final configuration parameters used for the production DHC DAS implementation, as well as subsequent experimental data products to facilitate development of tools that provide confidence intervals for confidential 2020 Census tabulations.

Submitted 17 December, 2023; originally announced December 2023.

arXiv:2310.09398 [pdf, other]

doi 10.1073/pnas.2220558120

An In-Depth Examination of Requirements for Disclosure Risk Assessment

Authors: Ron S. Jarmin, John M. Abowd, Robert Ashmead, Ryan Cumings-Menon, Nathan Goldschlag, Michael B. Hawes, Sallie Ann Keller, Daniel Kifer, Philip Leclerc, Jerome P. Reiter, Rolando A. Rodríguez, Ian Schmutte, Victoria A. Velkoff, Pavel Zhuravlev

Abstract: The use of formal privacy to protect the confidentiality of responses in the 2020 Decennial Census of Population and Housing has triggered renewed interest and debate over how to measure the disclosure risks and societal benefits of the published data products. Following long-established precedent in economics and statistics, we argue that any proposal for quantifying disclosure risk should be based on pre-specified, objective criteria. Such criteria should be used to compare methodologies to identify those with the most desirable properties. We illustrate this approach, using simple desiderata, to evaluate the absolute disclosure risk framework, the counterfactual framework underlying differential privacy, and prior-to-posterior comparisons. We conclude that satisfying all the desiderata is impossible, but counterfactual comparisons satisfy the most while absolute disclosure risk satisfies the fewest. Furthermore, we explain that many of the criticisms levied against differential privacy would be levied against any technology that is not equivalent to direct, unrestricted access to confidential data. Thus, more research is needed, but in the near-term, the counterfactual approach appears best-suited for privacy-utility analysis.

Submitted 13 October, 2023; originally announced October 2023.

Comments: 47 pages, 1 table

Journal ref: PNAS, October 13, 2023, Vol. 120, No. 43

arXiv:2303.00845 [pdf, ps, other]

$21^{st}$ Century Statistical Disclosure Limitation: Motivations and Challenges

Authors: John M Abowd, Michael B Hawes

Abstract: This chapter examines the motivations and imperatives for modernizing how statistical agencies approach statistical disclosure limitation for official data product releases. It discusses the implications for agencies' broader data governance and decision-making, and it identifies challenges that agencies will likely face along the way. In conclusion, the chapter proposes some principles and best practices that we believe can help guide agencies in navigating the transformation of their confidentiality programs.

Submitted 1 March, 2023; originally announced March 2023.

Comments: Forthcoming CRC Handbook of Formally Private and Synthetic Data Approaches for Statistical Disclosure Control

arXiv:2209.03310 [pdf, other]

Bayesian and Frequentist Semantics for Common Variations of Differential Privacy: Applications to the 2020 Census

Authors: Daniel Kifer, John M. Abowd, Robert Ashmead, Ryan Cumings-Menon, Philip Leclerc, Ashwin Machanavajjhala, William Sexton, Pavel Zhuravlev

Abstract: The purpose of this paper is to guide interpretation of the semantic privacy guarantees for some of the major variations of differential privacy, which include pure, approximate, Rényi, zero-concentrated, and $f$ differential privacy. We interpret privacy-loss accounting parameters, frequentist semantics, and Bayesian semantics (including new results). The driving application is the interpretation of the confidentiality protections for the 2020 Census Public Law 94-171 Redistricting Data Summary File released August 12, 2021, which, for the first time, were produced with formal privacy guarantees.

Submitted 7 September, 2022; originally announced September 2022.

arXiv:2206.03524 [pdf, ps, other]

doi 10.1146/annurev-statistics-010422-034226

Confidentiality Protection in the 2020 US Census of Population and Housing

Authors: John M Abowd, Michael B Hawes

Abstract: In an era where external data and computational capabilities far exceed statistical agencies' own resources and capabilities, they face the renewed challenge of protecting the confidentiality of underlying microdata when publishing statistics in very granular form and ensuring that these granular data are used for statistical purposes only. Conventional statistical disclosure limitation methods are too fragile to address this new challenge. This article discusses the deployment of a differential privacy framework for the 2020 US Census that was customized to protect confidentiality, particularly the most detailed geographic and demographic categories, and deliver controlled accuracy across the full geographic hierarchy.

Submitted 27 December, 2022; v1 submitted 7 June, 2022; originally announced June 2022.

Comments: Version 2 corrects a few transcription errors in Tables 2, 3 and 5. Version 3 adds final journal copy edits to the preprint

Journal ref: Annual Review of Statistics and Its Application 2023 10:1

arXiv:2204.08986 [pdf, other]

The 2020 Census Disclosure Avoidance System TopDown Algorithm

Authors: John M. Abowd, Robert Ashmead, Ryan Cumings-Menon, Simson Garfinkel, Micah Heineck, Christine Heiss, Robert Johns, Daniel Kifer, Philip Leclerc, Ashwin Machanavajjhala, Brett Moran, William Sexton, Matthew Spence, Pavel Zhuravlev

Abstract: The Census TopDown Algorithm (TDA) is a disclosure avoidance system using differential privacy for privacy-loss accounting. The algorithm ingests the final, edited version of the 2020 Census data and the final tabulation geographic definitions. The algorithm then creates noisy versions of key queries on the data, referred to as measurements, using zero-Concentrated Differential Privacy. Another key aspect of the TDA is its invariants, statistics that the Census Bureau has determined, as a matter of policy, to exclude from the privacy-loss accounting. The TDA post-processes the measurements together with the invariants to produce a Microdata Detail File (MDF) that contains one record for each person and one record for each housing unit enumerated in the 2020 Census. The MDF is passed to the 2020 Census tabulation system to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File. This paper describes the mathematics and testing of the TDA for this purpose.

Submitted 19 April, 2022; originally announced April 2022.

arXiv:2112.05822 [pdf]

U.S. Long-Term Earnings Outcomes by Sex, Race, Ethnicity, and Place of Birth

Authors: Kevin L. McKinney, John M. Abowd, Hubert P. Janicki

Abstract: This paper is part of the Global Income Dynamics Project cross-country comparison of earnings inequality, volatility, and mobility. Using data from the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) infrastructure files, we produce a uniform set of earnings statistics for the U.S. From 1998 to 2019, we find U.S. earnings inequality has increased and volatility has decreased. The combination of increased inequality and reduced volatility suggests earnings growth differs substantially across different demographic groups. We explore this further by estimating 12-year average earnings for a single cohort of age 25-54 eligible workers. Differences in labor supply (hours paid and quarters worked) are found to explain almost 90% of the variation in worker earnings, although even after controlling for labor supply substantial earnings differences across demographic groups remain unexplained. Using a quantile regression approach, we estimate counterfactual earnings distributions for each demographic group. We find that at the bottom of the earnings distribution, differences in characteristics such as hours paid, geographic division, industry, and education explain almost all the earnings gap; above the median, however, the contribution of the differences in the returns to characteristics becomes the dominant component.

Submitted 10 December, 2021; originally announced December 2021.

Comments: 77 pages, 42 figures

arXiv:2008.00253 [pdf]

Male Earnings Volatility in LEHD before, during, and after the Great Recession

Authors: Kevin L. McKinney, John M. Abowd

Abstract: This paper is part of a coordinated collection of papers on prime-age male earnings volatility. Each paper produces a similar set of statistics for the same reference population using a different primary data source. Our primary data source is the Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) infrastructure files. Using LEHD data from 1998 to 2016, we create a well-defined population frame to facilitate accurate estimation of temporal changes comparable to designed longitudinal samples of people. We show that earnings volatility, excluding increases during recessions, has declined over the analysis period, a finding robust to various sensitivity analyses.

Submitted 1 February, 2022; v1 submitted 1 August, 2020; originally announced August 2020.

Comments: Revision submitted to JBES with figures included in the text and Appendix added

arXiv:2007.13275 [pdf, other]

doi 10.1093/jssam/smaa029

Total Error and Variability Measures for the Quarterly Workforce Indicators and LEHD Origin-Destination Employment Statistics in OnTheMap

Authors: Kevin L. McKinney, Andrew S. Green, Lars Vilhuber, John M. Abowd

Abstract: We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total flow-employment, beginning-of-quarter employment, full-quarter employment, average monthly earnings of full-quarter employees, and total quarterly payroll. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM), including OnTheMap for Emergency Management. We account for errors due to coverage; record-level non-response; edit and imputation of item missing data; and statistical disclosure limitation. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs are a transition zone, where cells may be fit for use with caution. Tabulations involving one or two jobs, which are generally suppressed on fitness-for-use criteria in the QWI and synthesized in LODES, have substantial total variability but can still be used to estimate statistics for untabulated aggregates as long as the job count in the aggregate is more than 10.

Submitted 26 July, 2020; originally announced July 2020.

Showing 1–11 of 11 results for author: Abowd, J M