Controlling Cambridge Analytica: Managing the new risks of personal data collection


Revelations continue to surface about how broadly companies share personal data. This week, The New York Times reports how Facebook shared vast amounts of its users’ personal information with phone and mobile device makers.

Collection of personal information has expanded beyond what anyone could have imagined. Information technologies such as always-on internet connections, fitness trackers, and mobile phones make it easier to automatically collect information from a wide range of human activities at frequent intervals. Increasingly, corporations and governments are collecting, analyzing, and sharing detailed information about individuals over extended periods of time. While these vast quantities of data from new sources and novel methods for large-scale data analysis promise to yield a deeper understanding of individuals’ characteristics, behavior, and relationships, they also create heightened risks of misuse.

The Cambridge Analytica controversy is a salient illustration of the risks that can arise from the widespread collection, analysis, and disclosure of vast amounts of personal information in the modern age. The political firm gained access to the personal information of more than 50 million Facebook users. Data about individual profiles, locations, and interests were reportedly collected without users’ knowledge and leveraged in various ways designed to influence voter behavior during the 2016 US presidential election. How can corporations and governments manage data privacy risks that are rapidly growing and evolving over time?

Our research analyzes how informational risks multiply as more frequent and longer-term personal data collections combine with increasingly broad accumulations of data that include dozens or even thousands of attributes. Broad, long-term data collection greatly increases risk because collecting data from the same individual repeatedly leaves behavioral “fingerprints” that make it much easier to re-identify individuals in the data. When repeated measurements are combined with broad data collections, the data can also provide insights into domains of activity that stray far from the intended purpose of the data collection.
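The re-identification effect described above can be illustrated with a minimal sketch. The records, attribute names, and the `unique_fraction` helper below are hypothetical and invented purely for illustration; they are not drawn from our research data. The sketch measures what share of records becomes unique as more attributes are combined, which is one simple way to see how broader collections leave more distinctive “fingerprints.”

```python
from collections import Counter

def unique_fraction(records, attributes):
    """Fraction of records whose combined values on `attributes` match no other record."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

# Hypothetical toy data set (not real individuals).
records = [
    {"zip": "02138", "birth_year": 1980, "gym_day": "Mon"},
    {"zip": "02138", "birth_year": 1980, "gym_day": "Tue"},
    {"zip": "02138", "birth_year": 1975, "gym_day": "Mon"},
    {"zip": "02139", "birth_year": 1980, "gym_day": "Mon"},
    {"zip": "02139", "birth_year": 1980, "gym_day": "Mon"},
    {"zip": "02139", "birth_year": 1975, "gym_day": "Wed"},
]

# Each added attribute can only split groups further, so uniqueness never decreases.
for attrs in (["zip"], ["zip", "birth_year"], ["zip", "birth_year", "gym_day"]):
    print(attrs, round(unique_fraction(records, attrs), 2))
```

In this toy data no record is unique on ZIP code alone, but most become unique once all three attributes are combined. Real collections with thousands of attributes and repeated measurements would push this much further.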

These risks have outpaced corporate data privacy safeguards in current use. For example, consider how one of the largest corporate data brokers recently evaluated a potential new data product:

“Acxiom’s analytics team had developed a model of ‘10,000 audience propensities’ derived primarily from product purchasing data. Although no healthcare data was collected, these propensities included personal predictive scores for a number of sensitive health attributes such as ‘vaginal itch scores’ and ‘erectile dysfunction scores.’ When the leadership team met to discuss whether the use of such scores would be perceived as too invasive, a member of the team, the company’s Global Policy and Privacy officer, came prepared to read the actual scores on these sensitive topics for each of the individuals in the room. When confronted with the problem in this direct, personal way, the leadership team decided that certain scores were ‘too sensitive’ and should not be made available as a product to its customers. The officer later testified that had she not brought to light her concerns in such a compelling way, the leadership team likely would have made a different decision regarding the use of the scores at issue.”

This real-world example illustrates how big data may be used in ways that were not anticipated at the time of collection, and how corporate data protection can critically depend on ad hoc, “gut” judgments by a small number of decisionmakers based on their opinions about unspecified social norms. Indeed, although Acxiom showed restraint in this particular instance, other companies have reached very different conclusions regarding the appropriateness of selling similar types of highly sensitive information about individuals. Other data brokers have made decisions to sell lists of names of rape victims, addresses of domestic violence shelters, and names of individuals suffering from various health conditions, including genetic diseases, dementia, and HIV/AIDS.

Data privacy controls in research and corporations

Consumers deserve better. In general, sensitive personal information is subject to substantially stronger protections when collected and used in the course of academic research. The table below summarizes our analysis of some of the primary types of safeguards employed in academic research and contrasts these with the protections used in the corporate sector.

Legal and Ethical Frameworks

Academic sector:
●        Activities governed by strict ethical and legal frameworks, including oversight by an IRB.
●        Clear responsibilities assigned to IRBs, hosting institutions, and principal investigators.

Corporate sector:
●        No broadly applicable regulations governing data management are in place.
●        Oversight responsibility is often unspecified.

Risk Assessment

Academic sector:
●        Systematic risk assessment by IRBs and curation by research investigators.

Corporate sector:
●        Systematic review of privacy risks and planning for long-term review, storage, use, and disclosure is rare.

Controls

Academic sector:
●        Researchers incorporate multiple layers of protection, including explicit consent, systematic design and review, statistical disclosure control, and legal/procedural controls.

Corporate sector:
●        Corporate sector privacy protection relies primarily on notice, prior consent, and de-identification.

Further, while controls are generally stronger in academia, we find that traditional approaches to safeguarding privacy are being stretched to the limit by broad and frequent data collections. To manage these risks, we recommend adopting technologies emerging from computer science research that enable sophisticated controls on computation, inference, and use.

Controls on computation limit the direct operations that can be meaningfully performed on data. Emerging approaches include secure multiparty computation, functional encryption, homomorphic encryption, and secure public ledgers (e.g., blockchain technologies). Controls on inference limit how much can be learned from computations about the constituent components of the database, e.g., records, individuals, or groups. Increasingly, differentially private mechanisms are being used to provide strong limits on inferences specific to individuals in the data. Controls on use limit the domain of human activity in which computations and inferences are used. Personal data stores and the executable policies they incorporate are emerging technical approaches to control use.
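As one illustration of a control on inference, the following is a minimal sketch of a differentially private count using the Laplace mechanism, a standard technique from the differential privacy literature. The function names, the sample records, and the choice of epsilon are assumptions made for illustration, and the sketch is not a production implementation (it ignores, for example, floating-point subtleties and privacy budget accounting across repeated queries).

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent Exponential(1) draws is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def private_count(records, predicate, epsilon, rng):
    """Release a noisy count of records matching `predicate` (Laplace mechanism)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical records, invented for illustration.
rng = random.Random(0)
people = [{"age": a} for a in (23, 35, 41, 52, 67, 70)]
noisy = private_count(people, lambda r: r["age"] >= 50, epsilon=0.5, rng=rng)
print(round(noisy, 2))
```

Smaller values of epsilon add more noise and give stronger limits on what can be inferred about any one individual, at the cost of less accurate answers.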

Loss of privacy is not the inevitable price of technology. In many cases these emerging technologies can provide better privacy than traditional privacy controls, while preserving or increasing the accuracy of data analysis. Nevertheless, there are many unresolved challenges for both traditional and emerging methods used to protect data that are broad and collected frequently. When dealing with these types of data, it is especially important to continually review and adapt practices, including a combination of legal, computational, and procedural controls, to address new risks and provide stronger data privacy protection.

For more details, see Micah Altman, Alexandra Wood, David R. O’Brien, and Urs Gasser, “Practical approaches to big data privacy over time,” International Data Privacy Law, Vol. 8, No. 1 (2018).

Facebook is a donor to the Brookings Institution. The findings, interpretations, and conclusions posted in this piece are solely those of the authors and not influenced by any donation.