Privacy is fragile indeed. Organizations gather private and sensitive data with gusto but have done a poor job guarding it against those who would seek to purloin it. Isn't it enough to have anonymized data, millions or billions of records, in which to seek patterns, relationships, and statistics without linking them to actual people?
In many cases, yes, but when trying to solve human problems, it helps to have data about individual humans, not clumps of it. Consider medical research. Masking or deleting any Personally Identifiable Information (PII) weakens the dataset by removing features relevant to the investigations.
- Over 14,717,618,286 data records have been lost or stolen since 2013.
- 3,353,178,708 records were compromised in the first half of 2018.
- 86% of all breaches in 2017 occurred in North America.
- In 2018, 45.9% of data breaches in the US were in the business sector.
- In a 2012 breach, 117 million LinkedIn email and password combinations were stolen; in 2016, a Russian hacker calling himself "Peace" offered them for sale.
- Crafty cybercriminals managed to collect the personal data of over 500 million guests of the Marriott International hotel chain between 2014 and 2018.
- In September 2018, a successful attack on Facebook compromised 50 million user accounts.
Many solutions have been tried: encrypting the data, sealing the data in a hacker-proof environment, and, as mentioned above, anonymizing the data. Encryption generally works too well: it broadly restricts access to the data and slows down or halts investigators. Supposedly hacker-proof environments are, as hackers prove every day, not hacker-proof. Anonymizing data still enjoys a good reputation despite an abundance of evidence that it is too easy to defeat. In 2006, Netflix offered a $1 million prize for the first algorithm that could outperform its own collaborative filtering algorithm.
The dataset Netflix supplied was anonymized, but one group de-anonymized records in it by joining it with information from the IMDb database. An anonymized database can readily expose PII when it is combined with a data source that contains PII and the two are matched on other criteria (so-called latent values).
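The joining step described above can be illustrated with a toy sketch. Everything in it is hypothetical: the records, the field names, and the `link_records` helper are invented for illustration and are not the method used against the Netflix data. It only shows the general idea of matching "anonymous" rows to named rows on shared attributes.

```python
from collections import Counter

# Hypothetical "anonymized" release: names replaced by pseudonymous IDs.
anonymized_ratings = [
    {"user_id": "u1", "movie": "Movie A", "rating": 5, "date": "2005-03-01"},
    {"user_id": "u1", "movie": "Movie B", "rating": 2, "date": "2005-03-04"},
    {"user_id": "u2", "movie": "Movie A", "rating": 3, "date": "2005-06-10"},
    {"user_id": "u2", "movie": "Movie C", "rating": 4, "date": "2005-06-12"},
]

# Hypothetical public source that carries real names.
public_reviews = [
    {"name": "Alice Example", "movie": "Movie A", "rating": 5, "date": "2005-03-01"},
    {"name": "Alice Example", "movie": "Movie B", "rating": 2, "date": "2005-03-04"},
    {"name": "Bob Example", "movie": "Movie C", "rating": 4, "date": "2005-06-12"},
]


def link_records(anonymized, public, min_matches=2):
    """Map pseudonymous IDs to names when enough (movie, rating, date) rows agree."""
    # Index the public rows by the attributes both sources share.
    public_index = {}
    for row in public:
        key = (row["movie"], row["rating"], row["date"])
        public_index.setdefault(key, set()).add(row["name"])

    # Count how many rows each (user_id, name) pair has in common.
    match_counts = Counter()
    for row in anonymized:
        key = (row["movie"], row["rating"], row["date"])
        for name in public_index.get(key, ()):
            match_counts[(row["user_id"], name)] += 1

    # Keep only pairs with enough agreeing rows to be a plausible link.
    return {uid: name for (uid, name), n in match_counts.items() if n >= min_matches}


print(link_records(anonymized_ratings, public_reviews))
```

With the default threshold of two matching rows, only the pseudonym `u1` is linked to a name; lowering `min_matches` to 1 links more records at the cost of more false matches.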
Anonymization, long treated as the gold standard of data privacy, is not adequate. Differential privacy (DP) algorithms offer a reliable alternative by injecting random noise into the data and using probability to resolve queries with high accuracy. The problem is that DP is hard to explain. Probability can be very counterintuitive. If you flip a coin, the probability of it turning up heads is 50%. But once joint probability or conditional (Bayesian) probability gets into the mix, intuition falters, and our decision-making rarely involves probability at all; it relies instead on heuristics or deterministic models (think about the annual budget). When you report the results of an investigation, does it matter whether you say 85% probability or 95%? Our management gestalt doesn't work that way.
As a demonstration, if you were asked for the probability that at least two people in a group of thirty share a birthday, you might say, "Well, 30 people, 365 days, I guess 1 in 10 or 11?" It's not that simple, and this is a simple problem. The answer equals 1 minus the probability that no two people share a birthday: 1 - (365/365 × 364/365 × ... × 336/365), or roughly 71%, assuming 365 equally likely birthdays.
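The birthday calculation can be checked in a few lines. This is a minimal sketch that assumes 365 equally likely birthdays and ignores leap years; the function name `prob_shared_birthday` is made up for this example.

```python
def prob_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday."""
    # Build up the probability that all n birthdays are distinct:
    # the k-th person must avoid the k birthdays already taken.
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


print(round(prob_shared_birthday(30), 3))
```

The loop mirrors the "imagine the steps" reasoning: add one person at a time and ask how likely it is that they miss every birthday seen so far.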
As the example above shows, probability isn't very intuitive most of the time. Another reason it is hard to explain is that it is not really mathematical. Probability uses math for efficiency, but math is not a very good fit. Again, consider the example above. The math is incidental; solving the problem is a matter of imagining the steps.
Let me explain how differential privacy works. Suppose we have a database with sensitive information, such as people's credit ratings. A hacker wants to know how many people have a poor credit rating (N). We add (bear with me) some random noise: a random number K drawn from a zero-centered Laplace distribution with standard deviation 2. The hacker gets the answer N+K.
[Figure: Laplace distribution]
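The noisy answer N+K described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not a production mechanism: the function names and the true count of 1000 are hypothetical. The sketch relies on two facts about the Laplace distribution: its standard deviation is the scale b times the square root of 2, and the difference of two independent exponential draws with rate 1/b is Laplace-distributed with scale b.

```python
import math
import random


def laplace_noise(scale, rng=random):
    """Draw one sample from a zero-centered Laplace distribution with scale b."""
    # Difference of two independent Exponential(rate=1/b) draws is Laplace(0, b).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, std_dev=2.0, rng=random):
    """Return the true count N plus Laplace noise K with the given standard deviation."""
    # Laplace standard deviation is b * sqrt(2), so b = std_dev / sqrt(2).
    scale = std_dev / math.sqrt(2)
    return true_count + laplace_noise(scale, rng)


rng = random.Random(0)   # seeded only so the demo is repeatable
N = 1000                 # hypothetical true number of poor credit ratings
print(noisy_count(N, rng=rng))
```

A real deployment would also need a cryptographically sound noise source and careful handling of floating-point effects, which this sketch ignores.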
The hacker will not know how far the returned answer is from the true one. However, there is a limit. Remember, the "noise" is a random variable: each repeated query returns a fresh noisy answer, and averaging many of those answers yields an estimate that eventually converges on the correct one. To prevent this, a "privacy budget" can be set so that once the cumulative privacy loss reaches a certain point, the algorithm simply stops responding. One of DP's primary goals is balancing privacy against utility (which can be measured as data accuracy). As Cynthia Dwork explained in Differential Privacy (PDF):
DP eliminates any potential methods a data analyst might have to distinguish a particular individual from other participants or associate particular behaviors with a previously identified individual in order to tailor content/services accordingly; consequently, DP makes sure that the privacy risk associated with participating in the study is not significantly increased.
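The averaging attack and the privacy budget mentioned earlier can be sketched together. The `BudgetedCounter` class, its parameters, and the numbers below are hypothetical. The sketch assumes a counting query (one person changes the count by at most 1), Laplace noise with scale 1/epsilon per query, and the simple rule that the epsilons of successive queries add up.

```python
import random


class BudgetedCounter:
    """Hypothetical query interface: noisy counts until the privacy budget is spent."""

    def __init__(self, true_count, total_epsilon, epsilon_per_query, seed=None):
        self._true_count = true_count
        self._remaining = total_epsilon
        self._eps = epsilon_per_query
        self._rng = random.Random(seed)

    def query(self):
        # Refuse to answer once the remaining budget cannot cover another query.
        if self._remaining + 1e-12 < self._eps:
            raise RuntimeError("privacy budget exhausted")
        self._remaining -= self._eps
        rate = self._eps  # Laplace scale is 1/epsilon, so the exponential rate is epsilon
        noise = self._rng.expovariate(rate) - self._rng.expovariate(rate)
        return self._true_count + noise


def average_attack(counter, max_queries):
    """Repeat the query and average the answers; stop when the counter refuses."""
    answers = []
    for _ in range(max_queries):
        try:
            answers.append(counter.query())
        except RuntimeError:
            break
    estimate = sum(answers) / len(answers) if answers else None
    return estimate, len(answers)


# With an effectively unlimited budget, averaging homes in on the true count.
print(average_attack(BudgetedCounter(1000, 5000.0, 0.5, seed=1), 10000))
# With a small budget, the attacker gets only a handful of noisy answers.
print(average_attack(BudgetedCounter(1000, 2.0, 0.5, seed=1), 10000))
```

The trade-off is visible in the two calls: a larger budget gives analysts more useful answers but also lets a persistent attacker average the noise away.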
Organizations strive to know very personal things about the people in their data, such as travel preferences, bank transactions, or hospital clinical data. A typical application is for a company to review whether its targeting strategies are performing. Differential privacy gives investigators access to the most private data without revealing the individuals' actual identities. Other anonymization techniques remove relevant data to protect privacy. Differential privacy is a probabilistic approach built on a mathematical definition of privacy, one that treats privacy loss as a random variable in order to quantify the level of protection. An analyst working with the dataset cannot identify individual persons, yet all of the data is preserved for analysis.
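For readers who want the mathematical definition itself, the standard statement of epsilon-differential privacy is given below. The symbols are not from this article and are introduced here for illustration: M is the randomized query mechanism, D and D' are any two datasets that differ in one person's record, S is any set of possible outputs, and epsilon is the privacy parameter (smaller means more private). The second line is the usual Laplace mechanism for a numeric query f whose answer changes by at most Delta f when one record changes.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]

\mathcal{M}(D) = f(D) + K, \qquad K \sim \mathrm{Laplace}\!\left(0, \tfrac{\Delta f}{\varepsilon}\right)
```

In words: whether or not any one person is in the dataset, the probability of seeing any given answer changes by at most a factor of e to the power epsilon.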
DP is not easy to explain. We all think of datasets for analytics as fixed: they are what they are, and querying them should give the same answer every time. It's hard to wrap your head around the idea that these answers will be "noisy" unless you can bypass the querying method.
In future articles, I will summarize conversations with technology vendors implementing DP solutions to provide some specifics on how this works. For example, Apple and Microsoft are actively employing DP.