The modern world runs on “big data,” the massive data sets used by governments, firms, and academic researchers to conduct analyses, unearth patterns, and drive decision-making. When it comes to data analysis, bigger can be better: The more high-quality data is incorporated, the more robust the analysis will be. Large-scale data analysis is becoming increasingly powerful thanks to machine learning and has a wide range of benefits, such as informing public-health research, reducing traffic, and identifying systemic discrimination in loan applications.
But there’s a downside to big data, as it requires aggregating vast amounts of potentially sensitive personal information. Whether amassing medical records, scraping social media profiles, or tracking banking and credit card transactions, data scientists risk jeopardizing the privacy of the individuals whose records they collect. And once data is stored on a server, it may be stolen, shared, or compromised.
Computer scientists have worked for years to try to find ways to make data more private, but even if they attempt to de-identify data—for example, by removing individuals’ names or other parts of a data set—it is often possible for others to “connect the dots” and piece together information from multiple sources to determine a supposedly anonymous individual’s identity (via a so-called re-identification or linkage attack).
Fortunately, in recent years, computer scientists have developed a promising new approach to privacy-preserving data analysis known as “differential privacy” that allows researchers to unearth the patterns within a data set—and derive observations about the population as a whole—while obscuring the information about each individual’s records.
The solution: differential privacy
Differential privacy (also known as “epsilon indistinguishability”) was first developed in 2006 by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith. In a 2016 lecture, Dwork defined differential privacy as being achieved when “the outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the dataset.”
How is this possible? Differential privacy works by adding a pre-determined amount of randomness, or “noise,” into a computation performed on a data set. As an example, imagine a survey that asks respondents a yes-or-no question, but before each response is recorded, the respondent has to flip a coin. If the coin lands heads, they answer the question honestly. If it lands tails, they flip again: if the second toss is tails, they respond “yes,” and if it is heads, they respond “no” — regardless of their actual answer to the question.
As a result of this process, we would expect a quarter of respondents (0.5 × 0.5, those who flip tails on both tosses) to answer “yes,” even if their actual answer would have been “no.” With sufficient data, the researcher can factor in this probability and still determine the overall population’s response to the original question, while every individual in the data set can plausibly deny that their actual response was included.
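As a rough illustration (my own sketch, not code from any project mentioned in this article), the coin-flip protocol and the de-biasing step a researcher would apply can be written in a few lines of Python:

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """One respondent's reply under the coin-flip protocol."""
    if random.random() < 0.5:        # first flip: heads -> answer honestly
        return true_answer
    # first flip was tails: the second flip decides the reported answer
    return random.random() < 0.5     # tails -> "yes", heads -> "no"

def estimate_true_yes_rate(responses: list[bool]) -> float:
    """Invert the noise: P(reported yes) = 0.5 * p_true + 0.25."""
    reported = sum(responses) / len(responses)
    return (reported - 0.25) / 0.5

random.seed(0)
true_answers = [True] * 300 + [False] * 700   # a population with a 30% "yes" rate
noisy = [randomized_response(a) for a in true_answers]
print(round(estimate_true_yes_rate(noisy), 2))  # prints an estimate near 0.30
```

No individual's reported answer can be trusted, yet the aggregate estimate recovers the population's true rate — exactly the trade described above.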
Of course, researchers don’t actually use coin tosses; instead, they rely on algorithms that, based on a pre-determined probability, similarly alter some of the responses in the data set. The more responses the algorithm changes, the more privacy is preserved for the individuals in the data set. The trade-off, of course, is that as more “noise” is added to the computation—that is, as a greater percentage of responses are changed—the accuracy of the data analysis goes down.
When Dwork and her colleagues first defined differential privacy, they used the Greek symbol ε, or epsilon, to mathematically define the privacy loss associated with the release of data from a data set. This value defines just how much differential privacy is provided by a particular algorithm: The lower the value of epsilon, the more each individual’s privacy is protected. The higher the epsilon, the more accurate the data analysis—but the less privacy is preserved.
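To make epsilon concrete, the coin-flip scheme described above has a worst-case probability ratio we can compute directly. The short calculation below is my own illustration (not notation from the original paper): epsilon is the logarithm of the largest ratio between output probabilities for two people whose true answers differ.

```python
import math

# Under the coin-flip protocol:
# P(report "yes" | true answer "yes") = 0.5 (honest) + 0.25 (forced "yes") = 0.75
# P(report "yes" | true answer "no")  = 0.25 (forced "yes")
p_yes_given_yes = 0.75
p_yes_given_no = 0.25

# Epsilon is the log of the worst-case ratio of these output probabilities
epsilon = math.log(p_yes_given_yes / p_yes_given_no)
print(round(epsilon, 3))  # ln(3) ≈ 1.099
```

A scheme that answered honestly more often would have a larger ratio, and therefore a higher epsilon: more accuracy, less privacy.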
When the data is perturbed (i.e., the “noise” is added) while still on a user’s device, it’s known as local differential privacy. When the noise is added to a computation after the data has been collected, it’s called central differential privacy. With this latter method, the more often a data set is queried, the more information risks leaking about the individual records. Therefore, the central model requires constantly searching for new sources of data to maintain high levels of privacy.
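A standard way to implement the central model is the Laplace mechanism: the data curator computes the true answer to a query, then adds noise drawn from a Laplace distribution whose scale grows as epsilon shrinks. The sketch below is a minimal, assumed example for a simple counting query (the mechanism itself is well established; the function name and parameters are my own):

```python
import random

def private_count(true_count: int, epsilon: float) -> float:
    """Answer a counting query under central differential privacy.

    A counting query has sensitivity 1: adding or removing one record
    changes the true answer by at most 1, so Laplace noise with scale
    1/epsilon suffices. Smaller epsilon -> more noise -> more privacy.
    """
    scale = 1.0 / epsilon
    # A Laplace sample is the difference of two i.i.d. exponential samples
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_count + noise

random.seed(42)
print(private_count(1000, epsilon=0.5))   # noisy answer, roughly near 1000
print(private_count(1000, epsilon=5.0))   # less noise, better accuracy
```

This also shows why repeated queries leak: each noisy answer is centered on the truth, so averaging many of them would wash the noise out — which is why the total epsilon "budget" across queries must be tracked.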
Either way, a key goal of differential privacy is to ensure that the results of a given query will not be affected by the presence (or absence) of a single record. Differential privacy also makes data less attractive to would-be attackers and can help prevent them from connecting personal data from multiple platforms.
Differential privacy in practice
Differential privacy has already been adopted by prominent governments, firms, and researchers. The U.S. Census Bureau uses it for “disclosure avoidance,” for example, and Apple uses differential privacy to analyze user data ranging from emoji suggestions to Safari crashes. Google has even released an open-source version of a differential privacy library used in many of the company’s core products.
Using a concept known as “elastic sensitivity” developed in recent years by researchers at UC Berkeley, differential privacy is being extended into real-world SQL queries. The ride-sharing service Uber adopted this approach to study everything from traffic patterns to drivers’ earnings, all while protecting users’ privacy. By incorporating elastic sensitivity into a system that requires massive amounts of user data to connect riders with drivers, the company can help protect its users from a snoop.
Consider, for example, how implementing elastic sensitivity could protect a high-profile Uber user, such as Ivanka Trump. As Andy Greenberg wrote in Wired:
“If an Uber business analyst asks how many people are currently hailing cars in midtown Manhattan—perhaps to check whether the supply matches the demand—and Ivanka Trump happens to be requesting an Uber at that moment, the answer wouldn’t reveal much about her in particular. But if a prying analyst starts asking the same question about the block surrounding Trump Tower, for instance, Uber’s elastic sensitivity would add a certain amount of randomness to the result to mask whether Ivanka, specifically, might be leaving the building at that time.”
Still, for all its benefits, differential privacy has yet to be adopted by most organizations. It requires large data sets, it is computationally intensive, and organizations may lack the resources or personnel to deploy it. They also may not want to reveal how much private information they’re using—and potentially leaking.
Another concern is that organizations that use differential privacy may be overstating how much privacy they’re providing. A firm may claim to use differential privacy, but in practice could use such a high epsilon value that the actual privacy provided would be limited.
To address whether differential privacy is being properly deployed, Dwork, together with UC Berkeley researchers Nitin Kohli and Deirdre Mulligan, have proposed the creation of an “Epsilon Registry” to encourage companies to be more transparent. “Given the importance of these implementation details there is a need for shared learning amongst the differential privacy community,” they wrote in the Journal of Privacy and Confidentiality. “To serve these purposes, we propose the creation of the Epsilon Registry—a publicly available communal body of knowledge about differential privacy implementations that can be used by various stakeholders to drive the identification and adoption of judicious differentially private implementations.”
As a final note, organizations should not rely on differential privacy alone, but should treat it as one defense in a broader arsenal, alongside other measures like encryption and access control. Organizations should disclose the sources of data they’re using for their analysis, along with the steps they’re taking to protect that data. Combining such practices with differential privacy at low epsilon values will go a long way toward realizing the benefits of “big data” while reducing the leakage of sensitive personal data.
Chuck Kapelke is a communications specialist for the Center for Long-Term Cybersecurity, an interdisciplinary research and collaboration hub in the School of Information at the University of California, Berkeley.
Acknowledgements: The video was animated by Annalise Kamegawa. The Center for Long-Term Cybersecurity would like to thank Nitin Kohli, PhD student in the UC Berkeley School of Information, and Paul Laskowski, Assistant Adjunct Professor in the UC Berkeley School of Information.
Apple and Google provide financial support to the Brookings Institution, a nonprofit organization devoted to rigorous, independent, in-depth public policy research.