The views expressed by contributors are their own and not the view of The Hill

Why census data needs robust confidentiality protections

In the next few days, the Census Bureau will release the first 2020 Census data — the congressional apportionment population counts. The apportionment numbers — required by law — include population counts of each state and the U.S. as a whole. Results determine how many congressional representatives each state receives.

Other upcoming data releases will include more detail. These will be used for everything from redistricting to environmental justice to COVID recovery. This is why making sure the data is secure is paramount. Today, regulation and protection include significant punishment (fines and prison time) for any Census Bureau employee who causes a breach or leaks data. The Census Bureau takes the matter so seriously that even accessing data requires a validation process and a lifelong commitment to confidentiality, arguably the most effective outreach tool for collecting sensitive data.

Without this level of trust in ensuring confidentiality, there would be no census, and hard-to-count communities would be missed in greater proportion. But what we do today simply isn’t enough, which is why the creation of a new disclosure avoidance system is crucial.

The upcoming data releases, like past releases, include demographic breakdowns. The Census Bureau is bound by law (Title 13 of the U.S. Code) to keep individuals safe from re-identification. This becomes increasingly difficult each decade as the availability of commercial data and computing power, as well as the sophistication of nefarious and other actors, continues to increase.

For 2020, Department of Commerce Secretary Gina Raimondo endorses the Census Bureau’s use of differential privacy to protect individual responses. As discussed in a recent Data & Society paper, “differential privacy allows the Census Bureau to mathematically balance between privacy and data utility.”

Here’s why that’s important.

When published statistics are drawn from an underlying dataset, as census products are, every data product reveals a little bit of information about the actual, underlying data. In the case of the census, the data comes from individual census responses, which contain personally identifiable information (PII) such as birth dates and marital status. Research shows that census data has been used to create address look-up lists and other nefarious tools that reveal our personal information. And commercial actors, including popular social media sites, combine census data with their own and third-party datasets to glean insights and create profiles of Americans. The nefarious actors are not limited to domestic players, either.

Although our census data focuses on the U.S. population, it’s available globally once posted on the Census Bureau website. Byproducts of that data, whether derived directly, indirectly or by combining census data with other datasets, are possible, and it’s essentially impossible to monitor and track who accesses it. It is a matter of national security to keep census responses confidential. There should not be a roster of our nation available to anyone who wants it (friend or foe, foreign or domestic), but there is potential for one, and that is why a robust disclosure avoidance system is so important.

How did the Census Bureau decide on using differential privacy? In preparing for 2020, it conducted an experiment. Using a subset of the 2010 data products, Census staff tried to see if they could reveal the underlying dataset.

What they found was jarring. Information on individual census responses was discoverable for 52 million Americans — and that may be the tip of the iceberg. Some external analysts estimate a re-identification risk for nearly 80 percent of the population.

More than 50 percent of the population has a unique combination of age, sex, race and Hispanic origin. The more unique people are, the easier they are to match to third-party data. In addition, individual risk is exacerbated by small geographic units. Census blocks, the smallest geographic unit, average a few hundred people. However, my calculations of raw census data revealed 1.9 million of the nearly 7 million census blocks contain fewer than ten people. If census data is highly accurate at the block level, the people living in these small blocks are extraordinarily easy to identify.
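To make the idea of a "unique combination" concrete, here is a minimal sketch in Python. The five records are invented for illustration and stand in for a hypothetical small census block; the function name `unique_share` is likewise an assumption, not anything the Census Bureau publishes. It simply counts how many people are the only ones in their block with a given combination of age, sex, race and Hispanic origin.

```python
from collections import Counter

# Hypothetical residents of one small block:
# (age, sex, race, Hispanic origin). Invented data, for illustration only.
records = [
    (34, "F", "White", False),
    (34, "F", "White", False),
    (61, "M", "Black", False),
    (8, "F", "Asian", False),
    (45, "M", "White", True),
]


def unique_share(people):
    """Return the fraction of people whose attribute combination
    appears exactly once in the list."""
    counts = Counter(people)
    unique = sum(1 for person in people if counts[person] == 1)
    return unique / len(people)


if __name__ == "__main__":
    print(f"Share of residents with a unique combination: {unique_share(records):.0%}")
```

In a block this small, most residents stand alone in their category, which is what makes matching them against outside data so easy.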

Confidential information and the potential for re-identification are precisely why commercial data brokers and nefarious actors want census data: they can’t get it anywhere else. For example, 2020 was the first time the census questionnaire had an option for same-sex marriages, sensitive data that should remain confidential. The Census Bureau should not be in the business of outing our neighbors.

The use of differential privacy — releasing data so it cannot be linked directly to people — helps keep the trust in the Census Bureau, now and in the future.
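For readers curious about how that balance between privacy and utility works in practice, the sketch below shows the textbook Laplace mechanism applied to a single population count. It is an illustration only: it is not the Census Bureau's actual disclosure avoidance system, and the block population, the `epsilon` values and the function names are all assumptions chosen for the example. The parameter `epsilon` is the conventional name for the privacy-loss setting: smaller values add more noise (more privacy, less accuracy), and larger values add less.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two
    independent exponential draws, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon.
    A count changes by at most 1 when one person is added or removed,
    so its sensitivity is 1."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(true_count, epsilon, trials, rng):
    """Average distance between the released and the true count."""
    total = 0.0
    for _ in range(trials):
        total += abs(noisy_count(true_count, epsilon, rng) - true_count)
    return total / trials


if __name__ == "__main__":
    rng = random.Random(2020)
    block_population = 7  # hypothetical small block
    for epsilon in (0.1, 1.0, 10.0):  # illustrative settings only
        error = mean_abs_error(block_population, epsilon, 2000, rng)
        print(f"epsilon={epsilon}: average error of about {error:.2f} people")
```

Running the sketch shows the trade-off directly: the average error shrinks as `epsilon` grows, so the choice of setting is a policy decision about how much accuracy to give up for how much protection.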

Maria Filippelli is a Public Interest Technology Census fellow at New America. Filippelli has a background in urban planning, civic technology, and data science. For the past two and a half years, she has worked with dozens of civil rights organizations to navigate the technical changes to the 2020 Census.