-
Noisy Measurements Are Important, the Design of Census Products Is Much More Important
Authors:
John M. Abowd
Abstract:
McCartan et al. (2023) call for "making differential privacy work for census data users." This commentary explains why the 2020 Census Noisy Measurement Files (NMFs) are not the best focus for that plea. The August 2021 letter from 62 prominent researchers asking for production of the direct output of the differential privacy system deployed for the 2020 Census signaled the engagement of the schol…
▽ More
McCartan et al. (2023) call for "making differential privacy work for census data users." This commentary explains why the 2020 Census Noisy Measurement Files (NMFs) are not the best focus for that plea. The August 2021 letter from 62 prominent researchers asking for production of the direct output of the differential privacy system deployed for the 2020 Census signaled the engagement of the scholarly community in the design of decennial census data products. NMFs, the raw statistics produced by the 2020 Census Disclosure Avoidance System before any post-processing, are one component of that design-the query strategy output. The more important component is the query workload output-the statistics released to the public. Optimizing the query workload-the Redistricting Data (P.L. 94-171) Summary File, specifically-could allow the privacy-loss budget to be more effectively managed. There could be fewer noisy measurements, no post-processing bias, and direct estimates of the uncertainty from disclosure avoidance for each published statistic.
△ Less
Submitted 1 May, 2024; v1 submitted 20 December, 2023;
originally announced December 2023.
-
The 2010 Census Confidentiality Protections Failed, Here's How and Why
Authors:
John M. Abowd,
Tamara Adams,
Robert Ashmead,
David Darais,
Sourya Dey,
Simson L. Garfinkel,
Nathan Goldschlag,
Daniel Kifer,
Philip Leclerc,
Ethan Lew,
Scott Moore,
Rolando A. Rodríguez,
Ramy N. Tadros,
Lars Vilhuber
Abstract:
Using only 34 published tables, we reconstruct five variables (census block, sex, age, race, and ethnicity) in the confidential 2010 Census person records. Using the 38-bin age variable tabulated at the census block level, at most 20.1% of reconstructed records can differ from their confidential source on even a single value for these five variables. Using only published data, an attacker can veri…
▽ More
Using only 34 published tables, we reconstruct five variables (census block, sex, age, race, and ethnicity) in the confidential 2010 Census person records. Using the 38-bin age variable tabulated at the census block level, at most 20.1% of reconstructed records can differ from their confidential source on even a single value for these five variables. Using only published data, an attacker can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. The tabular publications in Summary File 1 thus have prohibited disclosure risk similar to the unreleased confidential microdata. Reidentification studies confirm that an attacker can, within blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with nonmodal characteristics) with 95% accuracy, the same precision as the confidential data achieve and far greater than statistical baselines. The flaw in the 2010 Census framework was the assumption that aggregation prevented accurate microdata reconstruction, justifying weaker disclosure limitation methods than were applied to 2010 Census public microdata. The framework used for 2020 Census publications defends against attacks that are based on reconstruction, as we also demonstrate here. Finally, we show that alternatives to the 2020 Census Disclosure Avoidance System with similar accuracy (enhanced swap**) also fail to protect confidentiality, and those that partially defend against reconstruction attacks (incomplete suppression implementations) destroy the primary statutory use case: data for redistricting all legislatures in the country in compliance with the 1965 Voting Rights Act.
△ Less
Submitted 18 December, 2023;
originally announced December 2023.
-
Disclosure Avoidance for the 2020 Census Demographic and Housing Characteristics File
Authors:
Ryan Cumings-Menon,
Robert Ashmead,
Daniel Kifer,
Philip Leclerc,
Matthew Spence,
Pavel Zhuravlev,
John M. Abowd
Abstract:
In "The 2020 Census Disclosure Avoidance System TopDown Algorithm," Abowd et al. (2022) describe the concepts and methods used by the Disclosure Avoidance System (DAS) to produce formally private output in support of the 2020 Census data product releases, with a particular focus on the DAS implementation that was used to create the 2020 Census Redistricting Data (P.L. 94-171) Summary File. In this…
▽ More
In "The 2020 Census Disclosure Avoidance System TopDown Algorithm," Abowd et al. (2022) describe the concepts and methods used by the Disclosure Avoidance System (DAS) to produce formally private output in support of the 2020 Census data product releases, with a particular focus on the DAS implementation that was used to create the 2020 Census Redistricting Data (P.L. 94-171) Summary File. In this paper we describe the updates to the DAS that were required to release the Demographic and Housing Characteristics (DHC) File, which provides more granular tables than other data products, such as the Redistricting Data Summary File. We also describe the final configuration parameters used for the production DHC DAS implementation, as well as subsequent experimental data products to facilitate development of tools that provide confidence intervals for confidential 2020 Census tabulations.
△ Less
Submitted 17 December, 2023;
originally announced December 2023.
-
An In-Depth Examination of Requirements for Disclosure Risk Assessment
Authors:
Ron S. Jarmin,
John M. Abowd,
Robert Ashmead,
Ryan Cumings-Menon,
Nathan Goldschlag,
Michael B. Hawes,
Sallie Ann Keller,
Daniel Kifer,
Philip Leclerc,
Jerome P. Reiter,
Rolando A. Rodríguez,
Ian Schmutte,
Victoria A. Velkoff,
Pavel Zhuravlev
Abstract:
The use of formal privacy to protect the confidentiality of responses in the 2020 Decennial Census of Population and Housing has triggered renewed interest and debate over how to measure the disclosure risks and societal benefits of the published data products. Following long-established precedent in economics and statistics, we argue that any proposal for quantifying disclosure risk should be bas…
▽ More
The use of formal privacy to protect the confidentiality of responses in the 2020 Decennial Census of Population and Housing has triggered renewed interest and debate over how to measure the disclosure risks and societal benefits of the published data products. Following long-established precedent in economics and statistics, we argue that any proposal for quantifying disclosure risk should be based on pre-specified, objective criteria. Such criteria should be used to compare methodologies to identify those with the most desirable properties. We illustrate this approach, using simple desiderata, to evaluate the absolute disclosure risk framework, the counterfactual framework underlying differential privacy, and prior-to-posterior comparisons. We conclude that satisfying all the desiderata is impossible, but counterfactual comparisons satisfy the most while absolute disclosure risk satisfies the fewest. Furthermore, we explain that many of the criticisms levied against differential privacy would be levied against any technology that is not equivalent to direct, unrestricted access to confidential data. Thus, more research is needed, but in the near-term, the counterfactual approach appears best-suited for privacy-utility analysis.
△ Less
Submitted 13 October, 2023;
originally announced October 2023.
-
$21^{st}$ Century Statistical Disclosure Limitation: Motivations and Challenges
Authors:
John M Abowd,
Michael B Hawes
Abstract:
This chapter examines the motivations and imperatives for modernizing how statistical agencies approach statistical disclosure limitation for official data product releases. It discusses the implications for agencies' broader data governance and decision-making, and it identifies challenges that agencies will likely face along the way. In conclusion, the chapter proposes some principles and best p…
▽ More
This chapter examines the motivations and imperatives for modernizing how statistical agencies approach statistical disclosure limitation for official data product releases. It discusses the implications for agencies' broader data governance and decision-making, and it identifies challenges that agencies will likely face along the way. In conclusion, the chapter proposes some principles and best practices that we believe can help guide agencies in navigating the transformation of their confidentiality programs.
△ Less
Submitted 1 March, 2023;
originally announced March 2023.
-
Bayesian and Frequentist Semantics for Common Variations of Differential Privacy: Applications to the 2020 Census
Authors:
Daniel Kifer,
John M. Abowd,
Robert Ashmead,
Ryan Cumings-Menon,
Philip Leclerc,
Ashwin Machanavajjhala,
William Sexton,
Pavel Zhuravlev
Abstract:
The purpose of this paper is to guide interpretation of the semantic privacy guarantees for some of the major variations of differential privacy, which include pure, approximate, Rényi, zero-concentrated, and $f$ differential privacy. We interpret privacy-loss accounting parameters, frequentist semantics, and Bayesian semantics (including new results). The driving application is the interpretation…
▽ More
The purpose of this paper is to guide interpretation of the semantic privacy guarantees for some of the major variations of differential privacy, which include pure, approximate, Rényi, zero-concentrated, and $f$ differential privacy. We interpret privacy-loss accounting parameters, frequentist semantics, and Bayesian semantics (including new results). The driving application is the interpretation of the confidentiality protections for the 2020 Census Public Law 94-171 Redistricting Data Summary File released August 12, 2021, which, for the first time, were produced with formal privacy guarantees.
△ Less
Submitted 7 September, 2022;
originally announced September 2022.
-
Confidentiality Protection in the 2020 US Census of Population and Housing
Authors:
John M Abowd,
Michael B Hawes
Abstract:
In an era where external data and computational capabilities far exceed statistical agencies' own resources and capabilities, they face the renewed challenge of protecting the confidentiality of underlying microdata when publishing statistics in very granular form and ensuring that these granular data are used for statistical purposes only. Conventional statistical disclosure limitation methods ar…
▽ More
In an era where external data and computational capabilities far exceed statistical agencies' own resources and capabilities, they face the renewed challenge of protecting the confidentiality of underlying microdata when publishing statistics in very granular form and ensuring that these granular data are used for statistical purposes only. Conventional statistical disclosure limitation methods are too fragile to address this new challenge. This article discusses the deployment of a differential privacy framework for the 2020 US Census that was customized to protect confidentiality, particularly the most detailed geographic and demographic categories, and deliver controlled accuracy across the full geographic hierarchy.
△ Less
Submitted 27 December, 2022; v1 submitted 7 June, 2022;
originally announced June 2022.
-
The 2020 Census Disclosure Avoidance System TopDown Algorithm
Authors:
John M. Abowd,
Robert Ashmead,
Ryan Cumings-Menon,
Simson Garfinkel,
Micah Heineck,
Christine Heiss,
Robert Johns,
Daniel Kifer,
Philip Leclerc,
Ashwin Machanavajjhala,
Brett Moran,
William Sexton,
Matthew Spence,
Pavel Zhuravlev
Abstract:
The Census TopDown Algorithm (TDA) is a disclosure avoidance system using differential privacy for privacy-loss accounting. The algorithm ingests the final, edited version of the 2020 Census data and the final tabulation geographic definitions. The algorithm then creates noisy versions of key queries on the data, referred to as measurements, using zero-Concentrated Differential Privacy. Another ke…
▽ More
The Census TopDown Algorithm (TDA) is a disclosure avoidance system using differential privacy for privacy-loss accounting. The algorithm ingests the final, edited version of the 2020 Census data and the final tabulation geographic definitions. The algorithm then creates noisy versions of key queries on the data, referred to as measurements, using zero-Concentrated Differential Privacy. Another key aspect of the TDA are invariants, statistics that the Census Bureau has determined, as matter of policy, to exclude from the privacy-loss accounting. The TDA post-processes the measurements together with the invariants to produce a Microdata Detail File (MDF) that contains one record for each person and one record for each housing unit enumerated in the 2020 Census. The MDF is passed to the 2020 Census tabulation system to produce the 2020 Census Redistricting Data (P.L. 94-171) Summary File. This paper describes the mathematics and testing of the TDA for this purpose.
△ Less
Submitted 19 April, 2022;
originally announced April 2022.
-
U.S. Long-Term Earnings Outcomes by Sex, Race, Ethnicity, and Place of Birth
Authors:
Kevin L. McKinney,
John M. Abowd,
Hubert P. Janicki
Abstract:
This paper is part of the Global Income Dynamics Project cross-country comparison of earnings inequality, volatility, and mobility. Using data from the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) infrastructure files we produce a uniform set of earnings statistics for the U.S. From 1998 to 2019, we find U.S. earnings inequality has increased and volatility has decreased. T…
▽ More
This paper is part of the Global Income Dynamics Project cross-country comparison of earnings inequality, volatility, and mobility. Using data from the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) infrastructure files we produce a uniform set of earnings statistics for the U.S. From 1998 to 2019, we find U.S. earnings inequality has increased and volatility has decreased. The combination of increased inequality and reduced volatility suggest earnings growth differs substantially across different demographic groups. We explore this further by estimating 12-year average earnings for a single cohort of age 25-54 eligible workers. Differences in labor supply (hours paid and quarters worked) are found to explain almost 90% of the variation in worker earnings, although even after controlling for labor supply substantial earnings differences across demographic groups remain unexplained. Using a quantile regression approach, we estimate counterfactual earnings distributions for each demographic group. We find that at the bottom of the earnings distribution differences in characteristics such as hours paid, geographic division, industry, and education explain almost all the earnings gap, however above the median the contribution of the differences in the returns to characteristics becomes the dominant component.
△ Less
Submitted 10 December, 2021;
originally announced December 2021.
-
Male Earnings Volatility in LEHD before, during, and after the Great Recession
Authors:
Kevin L. McKinney,
John M. Abowd
Abstract:
This paper is part of a coordinated collection of papers on prime-age male earnings volatility. Each paper produces a similar set of statistics for the same reference population using a different primary data source. Our primary data source is the Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) infrastructure files. Using LEHD data from 1998 to 2016, we create a well-defined popula…
▽ More
This paper is part of a coordinated collection of papers on prime-age male earnings volatility. Each paper produces a similar set of statistics for the same reference population using a different primary data source. Our primary data source is the Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) infrastructure files. Using LEHD data from 1998 to 2016, we create a well-defined population frame to facilitate accurate estimation of temporal changes comparable to designed longitudinal samples of people. We show that earnings volatility, excluding increases during recessions, has declined over the analysis period, a finding robust to various sensitivity analyses.
△ Less
Submitted 1 February, 2022; v1 submitted 1 August, 2020;
originally announced August 2020.
-
Total Error and Variability Measures for the Quarterly Workforce Indicators and LEHD Origin-Destination Employment Statistics in OnTheMap
Authors:
Kevin L. McKinney,
Andrew S. Green,
Lars Vilhuber,
John M. Abowd
Abstract:
We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total flow-employment, beginning-of-quarter employment, full-quarter employment, average monthly earnings of full-quarter employees, and total quarterly payroll. Beginning-of-quarte…
▽ More
We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total flow-employment, beginning-of-quarter employment, full-quarter employment, average monthly earnings of full-quarter employees, and total quarterly payroll. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM), including OnTheMap for Emergency Management. We account for errors due to coverage; record-level non-response; edit and imputation of item missing data; and statistical disclosure limitation. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs are a transition zone, where cells may be fit for use with caution. Tabulations involving one or two jobs, which are generally suppressed on fitness-for-use criteria in the QWI and synthesized in LODES, have substantial total variability but can still be used to estimate statistics for untabulated aggregates as long as the job count in the aggregate is more than 10.
△ Less
Submitted 26 July, 2020;
originally announced July 2020.