Confidentiality and the Census
By Aloni Cohen
A political fight is brewing around the 2020 Census. No, not about the citizenship question. This fight is about confidentiality: what's the best way to publish useful statistics without disclosing individual census responses? At the center of this fight is a lawsuit between the state of Alabama and the Department of Commerce, one that deals with the interplay of law, policy, and math.
This post is an introduction for readers learning about this issue for the first time. It’s a quickly evolving story, so things may have changed by the time you read this.
I come to this as a computer scientist who has tried to understand legal, policy, and mathematical issues. But this is a complicated subject and I welcome feedback so that I can learn more.
Confidentiality and the Census
Every 10 years, the US Census Bureau tries to count every person in the country. The original purpose, enshrined in the Constitution, is to apportion representation in Congress among the states based on their population. Today, Census data is used for so much more: to draw electoral districts with equal populations, to fight racial gerrymandering, to allocate government funds, and in social science research of all kinds.
Confidentiality of responses is essential to the success of the decennial Census. There’s a practical obligation—people won’t honestly participate if they don’t trust the Census to keep their responses confidential [1]. There’s a legal obligation—disclosing Census responses is a crime [2]. There’s also a moral obligation—the Census collects sensitive data that can be harmful if disclosed (e.g., where same-sex couples live).
Because a high quality Census depends on trust in confidentiality, the Bureau’s long-term ability to gather useful data is inseparable from privacy. Recognizing this, over 275 organizations and civic leaders signed a pledge to support Census confidentiality [3].
The Census Bureau has concluded that the disclosure avoidance techniques used in 2010 failed to meet this obligation. The Bureau conducted an internal study and found that the tables published in 2010 are detailed enough to allow some of the underlying census responses to be reconstructed, that is, recovered exactly. The Bureau has released a small handful of success metrics from the reconstruction study, mostly in filings in the ongoing Alabama lawsuit. But the Census Bureau's Disclosure Review Board has been frustratingly tight-lipped about the precise methods and results, I suspect out of fear of publicizing the ease of reconstruction. I'm looking forward to eventually seeing the code and internal documentation analyzing the 2010 disclosure avoidance system (DAS). Until then, here's a decent overview: [4].
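To get a feel for how reconstruction from published tables works, here is a toy sketch. It is my own illustration, not the Bureau's method: the block, its residents, and the published statistics are all invented. The point is only that each exact statistic rules out candidate record sets, and publishing enough of them can leave a single possibility.

```python
# Toy reconstruction sketch (my own illustration, not the Census Bureau's
# actual attack). A hypothetical block "publishes" a few exact statistics
# about its three residents; an attacker enumerates every set of records
# consistent with those statistics.
from itertools import combinations_with_replacement, product

AGES = range(0, 100)   # plausible ages for this toy example
SEXES = ("F", "M")

# The published tables for the hypothetical block: the attacker's only input.
PUBLISHED = {
    "population": 3,
    "median_age": 30,
    "sum_of_ages": 130,
    "num_over_64": 1,
    "num_female": 2,
}

def stats(records):
    """Recompute the published statistics from a candidate set of records."""
    ages = sorted(age for age, _ in records)
    return {
        "population": len(records),
        "median_age": ages[len(ages) // 2],
        "sum_of_ages": sum(ages),
        "num_over_64": sum(1 for a in ages if a > 64),
        "num_female": sum(1 for _, s in records if s == "F"),
    }

people = list(product(AGES, SEXES))
consistent = [
    recs
    for recs in combinations_with_replacement(people, PUBLISHED["population"])
    if stats(recs) == PUBLISHED
]
print(f"{len(consistent)} record sets are consistent with the published tables")
# Every additional published table shrinks this set; when it shrinks to one,
# the individual responses have been reconstructed exactly.
```

The Bureau's demonstration worked at far larger scale from the detailed 2010 tables and, as I understand it, used optimization solvers rather than brute-force search, but the underlying principle is the same.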
Differential privacy
There is an ongoing and blustery debate about whether the 2020 DAS strikes a good balance between data privacy and data quality.
As part of the 2020 DAS, the Bureau is using differential privacy (DP), a mathematical framework for quantifying and controlling the amount of individual-level information disclosed by aggregate statistics and other uses of data. Very roughly, a computation is differentially private if the probability of seeing any particular result depends only weakly on any one person's data. Without a doubt, this is the most consequential deployment of DP ever. See [5] for a readable, informal introduction.
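To make that rough description concrete, here is a minimal sketch of the Laplace mechanism, a textbook differentially private way to answer a counting query. The block, the records, and the query below are made up for illustration; this is not the 2020 DAS.

```python
# Minimal sketch of the Laplace mechanism for a counting query (a standard
# DP building block; a toy example, not the 2020 DAS itself).
import random

def dp_count(records, predicate, epsilon):
    """Noisy count of records satisfying `predicate`.

    Changing one person's record changes the true count by at most 1, so
    adding Laplace noise with scale 1/epsilon makes the released count
    epsilon-differentially private.
    """
    true_count = sum(1 for r in records if predicate(r))
    # Laplace(0, 1/epsilon) noise, sampled as a difference of two exponentials.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Hypothetical block of four residents, each recorded as (age, sex).
block = [(34, "F"), (36, "M"), (71, "F"), (5, "M")]
print(dp_count(block, lambda r: r[0] > 64, epsilon=1.0))  # true answer is 1
```

Whether or not any single resident is in the data, the distribution of the noisy output shifts by at most a factor of e^ε, which is precisely the sense in which the result "depends only weakly on any one person's data."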
This move puts the Census Bureau's confidentiality guarantees on much firmer footing. Unlike previous approaches, DP is future-proof: the guarantees do not depend on making brittle assumptions about the motives, methods, or sophistication of a hypothetical attacker who seeks to learn information about individuals. However, the actual strength or weakness of these guarantees depends on a tunable privacy parameter (the privacy-loss budget, often denoted ε) and on other settings of the algorithm. The Census Bureau only recently settled on these parameters, and it's not yet clear how to interpret the resulting privacy guarantees.
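The knob matters because it directly trades privacy for accuracy. For a single count protected with the Laplace mechanism above, the expected absolute error is 1/ε. This back-of-the-envelope calculation is mine, and the 2020 DAS spreads its privacy-loss budget across many queries and geographic levels, so its actual error behaves differently in detail.

```python
# Back-of-the-envelope illustration (mine, not an official error analysis):
# for a single Laplace-protected count, expected |error| = 1/epsilon.
for epsilon in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(f"epsilon = {epsilon:>4}: expected error on a count is about {1 / epsilon:5.1f} people")
```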
One of many sources of error
Strong privacy guarantees like DP require introducing statistical noise to fuzz the data. This means that the reported population of the city of Boston—and every other geographic unit smaller than a state—might differ from the Bureau’s enumeration of Boston’s actual population. The additional error introduced by the 2020 DAS will be relatively small. But even small errors are scary when voting rights are on the line, not to mention billions of dollars of federal funding and umpteen policymaking and research uses.
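To get a feel for why that error is relatively small for large places, here is a toy comparison using the same made-up Laplace noise as above; the 2020 DAS uses a different noise distribution and a much more elaborate post-processing step, and the populations below are invented.

```python
# Toy illustration (mine, not the 2020 DAS): the same absolute noise matters
# far less, in relative terms, for a large city than for a tiny block.
import random

def noisy_count(count, epsilon=1.0):
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return max(0, round(count + noise))  # post-process to a non-negative integer

random.seed(2020)
for name, enumerated in [("tiny rural block", 9), ("hypothetical big city", 650_000)]:
    reported = noisy_count(enumerated)
    rel_err = abs(reported - enumerated) / enumerated
    print(f"{name}: enumerated {enumerated:>7}, reported {reported:>7}, relative error {rel_err:.5%}")
```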
But error isn't new, and the error due to the 2020 DAS is drowned out by other sources of error. For example, some communities are harder to count. As a result, the Census historically undercounts racial and ethnic minorities, especially tribal communities [6]. Even noise from disclosure avoidance isn't new. For example, the published 2010 racial and ethnic population statistics were also fuzzed, but in a way that didn't offer meaningful privacy protection.
What’s new is not error, but transparency. The Census has been remarkably open about the 2020 DAS at all stages of its development, soliciting feedback early and often. But this transparency bursts a collective fiction in the minds of stakeholders of all stripes—that the Census publications are Truth. It is no longer possible to pretend that error doesn’t exist, and differential privacy is getting the blame.
Want more information?
Part 2 of this blog post can be found here! Also check out the references below for cited sources and pointers to other potentially interesting reading!
—
Aloni Cohen is a Postdoctoral Associate at the Hariri Institute for Computing at Boston University and the Boston University School of Law. He will be joining the University of Chicago as an Assistant Professor of Computer Science starting January 2022. His research explores the interplay between theoretical cryptography, privacy, law, and policy. Visit his personal website at https://aloni.net/
“Contiguous United States, Census 2010” by Eric Fischer is licensed under CC BY 2.0
—
Citations
[1] Concerns about data privacy and confidentiality are one of five barriers to participation in the Census, according to a Census Bureau study.
[2] The Census Bureau's legal obligation of confidentiality comes from Title 13 of the U.S. Code.
[3] 275 civil rights advocates, civic leaders, state and local elected officials pledge to support Census participation and confidentiality in The Census Confidentiality Pledge.
[4] Damien Desfontaines’ blog post describing the Census Bureau’s reconstruction demonstration better than the Census Bureau describes it: Demystifying the US Census Bureau’s reconstruction attack
[5] Damien Desfontaines’ excellent blog post introducing differential privacy: Why differential privacy is awesome (the whole series is great).
[6] Census Bureau Releases Estimates of Undercount and Overcount in the 2010 Census
Additional Resources
– An amicus brief that I filed along with an esteemed group of privacy experts—computer scientists and legal scholars—in the Alabama case: Amicus Brief of Data Privacy Experts
– Here’s a very accessible animated introduction to disclosure avoidance and differential privacy at the Census produced by Minute Physics: Protecting Privacy with MATH (Collab with the Census)
– A wonderful essay on “The technopolitics of the US Census”: Democracy’s Data Infrastructure | Knight First Amendment Institute, by Dan Bouk and danah boyd
– The Census Bureau’s page of resources on the 2020 Disclosure Avoidance System: 2020 Census Data Products: Disclosure Avoidance Modernization
– The Brennan Center’s collection of information about the lawsuit, including Alabama’s complaints and all amicus briefs: Alabama v. US Dep’t of Commerce
– The National Conference of State Legislatures' collection of analyses of demonstration data products put out by the Census Bureau (under Additional Resources): Differential Privacy for Census Data Explained