There are thousands of databases floating around the Internet. Most contain Personally Identifiable Information (PII) about each of us: driver's license numbers, credit and debit card account numbers, and Social Security numbers, to name a few. Everyone knows that. What may not be known is that our PII is not as private as we would like to think.
Balancing act
Databases are expensive to build and maintain, so managers look for additional ways to monetize their holdings. Finding additional uses can be problematic, especially if the database contains PII and is therefore regulated by privacy laws. To address that, database administrators use a work-around called anonymization:
"The removal of person-related information that could be used for backtracking from, say, patient data to the actual patient."
The trick is figuring out what to remove. There are regulatory guidelines, but no clear definition of what PII is. So, it's left to the discretion of the database owner.
Not anonymous
Professor Paul Ohm, in his paper Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, points out that anonymization is not working. He uses the following example to show why. Tables 5 and 6 (courtesy of Professor Ohm) are both anonymized databases. Separately, very little information can be gleaned from them.
All that changes when the two databases are joined in the following table (courtesy of Professor Ohm). Potentially sensitive PII is now associated with individuals.
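The joining step can be illustrated with a minimal sketch. The rows below are invented for illustration and are not Professor Ohm's tables; the field names (`zip`, `birth_date`, `sex`, `complaint`, `name`) and the helper `join_on` are assumptions. One table holds a sensitive field without names, the other holds names without the sensitive field, and the two are matched on the fields they share.

```python
# Synthetic data only; column names are hypothetical.
# Table 1: sensitive field, no names.
hospital_visits = [
    {"zip": "02138", "birth_date": "1964-09-20", "sex": "F", "complaint": "chest pain"},
    {"zip": "02139", "birth_date": "1965-02-14", "sex": "M", "complaint": "shortness of breath"},
]
# Table 2: names, no sensitive field.
public_roster = [
    {"name": "Alice Example", "zip": "02138", "birth_date": "1964-09-20", "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_date": "1965-02-14", "sex": "M"},
]

# Fields that appear in both tables and act as the join key.
KEY = ("zip", "birth_date", "sex")


def join_on(left, right, key):
    """Return merged rows for every left/right pair that agrees on all key fields."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in key), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in key), []):
            joined.append({**match, **row})
    return joined


joined = join_on(hospital_visits, public_roster, KEY)
for row in joined:
    print(row["name"], "->", row["complaint"])
```

Neither table alone ties a person to a complaint; the merged rows do.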
In Massachusetts, a state agency, the Group Insurance Commission (GIC), decided to release anonymized data about state employee hospital visits. The professor explains:
"By removing fields containing name, address, social security number, and other 'explicit identifiers,' GIC assumed it protected patient privacy, despite the fact that 'nearly one hundred attributes per' patient and hospital visit were still included, including the critical trio, ZIP code, birth date, and sex.
At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers."
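To see why deleting explicit identifiers falls short, consider a minimal sketch of that kind of anonymization. The record, the field names, and the helper `strip_explicit_identifiers` are hypothetical and do not reflect GIC's actual schema or process.

```python
# Fields treated as "explicit identifiers" in this sketch (an assumption).
EXPLICIT_IDENTIFIERS = {"name", "address", "ssn"}


def strip_explicit_identifiers(record, identifiers=EXPLICIT_IDENTIFIERS):
    """Return a copy of the record without the explicit identifier fields."""
    return {k: v for k, v in record.items() if k not in identifiers}


# A synthetic patient record; every value is made up.
record = {
    "name": "Pat Example",
    "address": "1 Sample St",
    "ssn": "000-00-0000",
    "zip": "02138",
    "birth_date": "1960-01-01",
    "sex": "M",
    "diagnosis": "example diagnosis",
}

released = strip_explicit_identifiers(record)
print(sorted(released))
```

The name, address, and social security number are gone, but ZIP code, birth date, and sex are released alongside the diagnosis.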
Professor Ohm further explains that Dr. Latanya Sweeney, well known for her demographic studies based on ZIP code, date of birth, and gender, decided to use the GIC database to test her theories:
"She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease.
Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor's health records (which included diagnoses and prescriptions) to his office."
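The narrowing described in the quote, from six people to three to one, can be sketched as successive filters over a voter roll. The roll below is synthetic, the birth date and ZIP codes are placeholders rather than the Governor's real details, and the helper `narrow` is a hypothetical name.

```python
# Placeholder birth date for the target; not a real person's data.
TARGET_DOB = "1900-01-01"

# Synthetic voter roll built to mirror the counts in the quote.
voters = [
    {"name": "Voter A", "birth_date": TARGET_DOB, "sex": "M", "zip": "02138"},
    {"name": "Voter B", "birth_date": TARGET_DOB, "sex": "M", "zip": "02139"},
    {"name": "Voter C", "birth_date": TARGET_DOB, "sex": "M", "zip": "02140"},
    {"name": "Voter D", "birth_date": TARGET_DOB, "sex": "F", "zip": "02138"},
    {"name": "Voter E", "birth_date": TARGET_DOB, "sex": "F", "zip": "02141"},
    {"name": "Voter F", "birth_date": TARGET_DOB, "sex": "F", "zip": "02142"},
    {"name": "Voter G", "birth_date": "1950-05-05", "sex": "M", "zip": "02138"},
]

# What the attacker already knows about the target.
target = {"birth_date": TARGET_DOB, "sex": "M", "zip": "02138"}


def narrow(rows, known, fields):
    """Filter rows one field at a time, recording how many candidates remain."""
    candidates = rows
    counts = []
    for field in fields:
        candidates = [r for r in candidates if r[field] == known[field]]
        counts.append((field, len(candidates)))
    return candidates, counts


candidates, counts = narrow(voters, target, ["birth_date", "sex", "zip"])
for field, remaining in counts:
    print(f"after matching {field}: {remaining} candidate(s)")
```

Once a single candidate remains, any "anonymized" record carrying the same three values can be attributed to that person.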
Another example involves Dr. Arvind Narayanan and his advisor, Dr. Vitaly Shmatikov. They determined that one-third of Twitter users also have a Flickr account. That was all they needed. Cross-referencing anonymized Twitter social graphs with Flickr connection information allowed the researchers to identify Twitter accounts.
Professor Ohm's claim
The point Professor Ohm wants to make is that all information should be considered PII, because no one knows for sure what other database could be used for re-identification:
"Data can either be useful or perfectly anonymous but never both."
In the report, Professor Ohm writes about what he calls the "Database of Ruin". He mentions that everyone has at least one potentially injurious fact about them stored in a database. For now, those tidbits remain hidden because a majority of databases are not sharing information. As databases interact, both re-identification and misuse of PII are more likely to occur.
Final thoughts
Currently, the U.S. government is spending billions of dollars trying to get electronic health records melded into a national database. After my phone conversation with Professor Ohm, I'm wondering if that could be his "Database of Ruin".
Michael Kassner is currently a systems manager for an international company. Together with his son, he runs MKassner Net, a small IT publication consultancy.