Confidentiality and the Census (Part 2)
by Aloni Cohen
Be sure to check out part 1 of this blog post here.
Data Quality
Confidentiality is critical for a useful, high-quality Census. Perfect confidentiality is easy to achieve if we don't care about doing anything with the Census responses. Just erase them. Or don't collect them at all. But Census data is important, and data confidentiality must be balanced with data utility. The Census Bureau solicited public feedback on the quality of the demonstration data produced by the 2020 DAS. The demonstration data was produced by running preliminary versions of the 2020 DAS on 2010 data (as published, not the raw 2010 data).
Most of the analyses of the demonstration data have focused on a simple question: How much noise does the 2020 DAS add to the total populations of various geopolitical units (e.g., Congressional districts, state legislative districts, counties, cities, and towns)? Answering it amounts to comparing the total population of each unit in the demonstration data to the reported 2010 data; the difference is the noise from the 2020 DAS.
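To make the comparison concrete, here is a minimal sketch in Python of how one might compute the per-unit noise. The column names and DataFrame layout are my own assumptions for illustration; the sample rows are taken from the Utah table below.

```python
import pandas as pd

# Hypothetical layout: one row per geographic unit, with the published
# 2010 total and the corresponding 2020 DAS demonstration total.
df = pd.DataFrame({
    "unit":     ["Alpine", "Alta", "Altamont"],
    "pop_2010": [9555, 383, 225],   # published 2010 populations
    "pop_demo": [9555, 385, 221],   # 2020 DAS demonstration populations
})

# The noise added by the 2020 DAS is just the difference.
df["noise"] = df["pop_demo"] - df["pop_2010"]

# Relative error, which matters most for small places.
df["rel_error"] = df["noise"] / df["pop_2010"]

print(df)
```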
Overall, the population noise from the 2020 DAS is small. For example, below are the results for the first ten cities/towns alphabetically in Utah using the latest data (April 2021). The noise for most of the cities and towns in Utah is in the single digits or low tens, with only a small handful of deviations in the low hundreds. [7]
| City / Town | 2010 Reported Population | Noise from 2020 DAS |
|---|---:|---:|
| Alpine | 9,555 | 0 |
| Alta | 383 | 2 |
| Altamont | 225 | -4 |
| Alton | 119 | -3 |
| Amalga | 488 | -2 |
| American Fork | 26,263 | -15 |
| Annabella | 795 | 1 |
| Antimony | 122 | 7 |
| Apple Valley | 701 | 4 |
| Aurora | 1,016 | 7 |
Critics of the use of DP in the 2020 DAS point to large relative population deviations. For example, the reported population of Vineyard, Utah in the demonstration data is over 10% lower than in the published 2010 data. But this masks what's really going on. The 2010 population of Vineyard was 139, and the noise from the 2020 DAS was 15; a 15-person error on a 139-person town works out to 15/139 ≈ 10.8%. The noise isn't large; the reference population is tiny. Even looking at relative error, the vast majority of cities and towns see less than 1% deviation.
The story is the same whether you look at Congressional districts, state legislative districts, or counties. Absolute errors are small; relative errors are small except in the smallest places.
Perhaps more important than the overall distribution of noise is how that distribution varies. The 2020 DAS tends to inflate small counts and deflate large counts. For example, the Census records people's race using six categories. The 2020 DAS appears to make racially homogeneous areas look less so: it deflates the (large) count of the dominant racial group and inflates the (small) counts of the other five. The effect isn't huge, but it is too soon to know whether it will have political consequences across the country as a whole. It's worth noting that the 2010 DAS likely also affected apparent racial homogeneity, but that effect is impossible to measure because the details of the 2010 DAS are closely guarded secrets.
Detecting Polarization
In joint work with Moon Duchin, JN Matthews, and Bhushan Suwal [9], we approach the data quality question through the lens of anti-gerrymandering litigation.
The most effective tool to fight racial and ethnic gerrymandering is the Voting Rights Act of 1965 (VRA). Section 2 of the VRA allows civil rights groups to challenge gerrymandered district plans in court. To prevail, the plaintiffs must pass the three-part test from the Supreme Court's decision in Thornburg v. Gingles.
Gingles requires demonstrating racially-polarized voting (RPV): the racial or ethnic minority group in question tends to prefer different candidates than the majority population. Measuring RPV directly is impossible, because we don’t have the right data. Election returns tell us how many votes in a given precinct were cast for candidate X and candidate Y. Census data tell us the fraction of the voting-age population in the precinct from each racial and ethnic group. But the votes aren’t reported by race and ethnicity, making RPV analysis imperfect at best.
One of the common techniques used to measure RPV is called ecological regression. It amounts to a linear regression over precinct-level data, where the independent variable (x-axis) is the fraction of the precinct’s population from the minority group, and the dependent variable (y-axis) is the fraction of votes for candidate X. A line of best fit with significant non-zero slope is used as evidence for RPV.
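Here is a minimal sketch of ecological regression in Python. The precinct numbers are invented purely to show the mechanics; a real analysis would use actual election returns and Census-derived population fractions.

```python
import numpy as np
from scipy import stats

# Hypothetical precinct-level data: for each precinct, the fraction of
# the voting-age population in the minority group (from Census data)
# and the fraction of votes cast for candidate X (from election returns).
minority_frac = np.array([0.05, 0.10, 0.22, 0.35, 0.48, 0.60, 0.75, 0.90])
vote_frac_x   = np.array([0.20, 0.25, 0.31, 0.42, 0.50, 0.58, 0.70, 0.82])

# Ecological regression: ordinary least squares of vote share on
# minority population share. A significantly non-zero slope is used
# as evidence of racially polarized voting.
result = stats.linregress(minority_frac, vote_frac_x)
print(f"slope = {result.slope:.3f}, p-value = {result.pvalue:.4f}")
```

A steep, statistically significant slope means precincts with more minority voters prefer a different candidate than precincts with fewer, which is the signature of RPV.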
Civil rights groups—including MALDEF and Asian Americans Advancing Justice—worry that the 2020 DAS will bury the RPV signal in noise [8]. It’s a reasonable worry, because noise usually attenuates the signal, making the line of best fit flatter.
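This attenuation effect (sometimes called regression dilution) is easy to see in a quick simulation. Everything here is synthetic and assumes, purely for illustration, that DAS-style noise perturbs the Census-derived population fractions on the x-axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a true RPV-like relationship: vote share rises with
# minority population share.
x_true = rng.uniform(0.0, 1.0, size=500)
y = 0.2 + 0.6 * x_true + rng.normal(0.0, 0.05, size=500)

# Noise in the independent variable attenuates the fitted slope.
x_noisy = x_true + rng.normal(0.0, 0.15, size=500)

slope_clean = np.polyfit(x_true, y, 1)[0]
slope_noisy = np.polyfit(x_noisy, y, 1)[0]
print(f"slope without noise: {slope_clean:.2f}")  # close to the true 0.60
print(f"slope with noisy x:  {slope_noisy:.2f}")  # noticeably flatter
```

For classical measurement error, the fitted slope shrinks by roughly var(x) / (var(x) + var(noise)), so heavier noise means a flatter line and a weaker apparent RPV signal.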
In our experiments, we found that the noise from the 2020 DAS does not affect the ability to detect RPV using ecological regression, so long as the effect of very small precincts is handled appropriately. By filtering out very small precincts (fewer than 10 votes) or weighting precincts by the number of votes cast, any distortions in the detection of racially polarized voting due to the 2020 DAS vanish. A sketch of both mitigations follows.
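A minimal sketch of the two mitigations, again on invented precinct data. The 10-vote threshold comes from the experiments described above; the hand-rolled weighted least squares here is just one straightforward way to weight precincts by votes cast.

```python
import numpy as np

# Hypothetical precinct data: minority population fraction, vote share
# for candidate X, and total votes cast in each precinct.
minority_frac = np.array([0.05, 0.10, 0.22, 0.35, 0.48, 0.60, 0.75, 0.90])
vote_frac_x   = np.array([0.20, 0.25, 0.31, 0.42, 0.50, 0.58, 0.70, 0.82])
votes_cast    = np.array([4, 850, 1200, 9, 640, 2100, 7, 980])

# Mitigation 1: drop very small precincts (fewer than 10 votes), where
# DAS noise can swamp the tiny counts.
keep = votes_cast >= 10
x, y, v = minority_frac[keep], vote_frac_x[keep], votes_cast[keep]

# Mitigation 2: weight each remaining precinct by its votes cast.
# Weighted least squares: minimize sum_i v_i * (y_i - a - b*x_i)^2.
X = np.column_stack([np.ones_like(x), x])         # design matrix [1, x]
W = np.diag(v.astype(float))
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # (intercept, slope)
print(f"weighted slope = {beta[1]:.3f}")
```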
We expect the final redistricting data to perform even better than our experiments show. The final version of the 2020 DAS will introduce much less noise than the 2018 version we tested. Detection of RPV may actually improve compared to current techniques, where small precincts can have an outsized effect. RPV is only one part of a VRA lawsuit, albeit an important one. It could still be that the 2020 DAS makes it harder for civil rights groups to prevail; that will depend a lot on what happens in the next few months. But our analysis leaves me optimistic.
—
Aloni Cohen is a Postdoctoral Associate at the Hariri Institute for Computing at Boston University and the Boston University School of Law. He will be joining the University of Chicago as an Assistant Professor of Computer Science starting January 2022. Aloni Cohen's research explores the interplay between theoretical cryptography, privacy, law, and policy. Visit his personal website at https://aloni.net/
“Contiguous United States, Census 2010” by Eric Fischer is licensed under CC BY 2.0
—
Citations
[7] Utah’s Analysis of the April 2021 Demonstration Data (PDF)
[8] MALDEF and AAJC’s Preliminary Report on the Impacts of DP & the 2020 Census on Latinos, Asian Americans and Redistricting
[9] A research paper of mine—with Moon Duchin, JN Matthews, and Bhushan Suwal—analyzing the core algorithm and its effects on redistricting: Census TopDown: The Impacts of Differential Privacy on Redistricting
Additional Resources
– An amicus brief that I filed along with an esteemed group of privacy experts—computer scientists and legal scholars—in the Alabama case: Amicus Brief of Data Privacy Experts
– Here’s a very accessible animated introduction to disclosure avoidance and differential privacy at the Census produced by Minute Physics: Protecting Privacy with MATH (Collab with the Census)
– A wonderful essay on “The technopolitics of the US Census”: Democracy’s Data Infrastructure | Knight First Amendment Institute, by Dan Bouk and danah boyd
– The Census Bureau’s page of resources on the 2020 Disclosure Avoidance System: 2020 Census Data Products: Disclosure Avoidance Modernization
– The Brennan Center’s collection of information about the lawsuit, including Alabama’s complaints and all amicus briefs: Alabama v. US Dep’t of Commerce
– The National Conference of State Legislatures' collection of analyses of demonstration data products put out by the Census Bureau (under Additional Resources): Differential Privacy for Census Data Explained