Platform data access is a lynchpin of the EU’s Digital Services Act

The European Commission has put forth a first draft of the Digital Services Act (DSA), which aims to create a consistent set of rules for internet companies across the European single market. Tucked into this proposal is a requirement which would enable academic researchers to access data from the largest internet platforms. Modest as it may sound, this subtle provision is key to the success of this legislation.

While some of the DSA’s new rules cover internet access providers, most are concerned with internet services that host content. For instance, all hosting services (including cloud providers, web hosting services, and online platforms) would be required to implement a “notice and action” process for users to inform the service of illegal content, to which that service is then required to respond. Excluding the smallest companies,[1] online platforms such as social media and marketplaces face the most compliance requirements. Online platforms would need to enable complaints about their illegal content decisions, suspend service to users who repeatedly post illegal content, and be more transparent about content moderation and targeted advertising. The proposed legislation goes even further for especially large online platforms, creating a new oversight scheme for platforms whose monthly active users exceed 10 percent of the EU’s population. This is likely to include Facebook, Twitter, YouTube, Amazon, and TikTok, as well as potentially Instagram, PornHub, and Google Maps.[2]

This oversight scheme is complicated, requiring several additional layers of transparency. First, there is a range of public-facing transparency requirements, including an annual report on content moderation efforts, labels that tell users why they are seeing targeted advertisements, a public database of targeted advertisements, and more transparency on how the platforms’ recommender systems work. Second, the large platforms must conduct a self-assessment of systemic risks, which they are then required to work to mitigate (or risk large fines) under the proposed law. Third, the platforms must pay for an independent audit of compliance with the requirements of the legislation as a whole. All of these transparency requirements are potentially meaningful, but it is the fourth and last mechanism that offers the most promise: the largest internet platforms must open up their data to independent researchers approved by the European Commission.

Article 31 of the DSA states that large internet companies need to comply with data requests from researchers once each request is approved by the EU country that hosts the technology company, or by the European Commission. To reduce the risk of privacy breaches and corporate espionage, only vetted researchers with affiliations to academic institutions and relevant expertise will be granted data access, notably not including journalists and activists. Further, the researchers would not be allowed to use the data for profit-seeking purposes, or to inform political campaigns, as was the case in the Cambridge Analytica scandal. Researchers could make these requests only to conduct research related to illegal content (e.g., child pornography, terrorist content, hate speech), manipulative use of the platforms (e.g., disinformation campaigns), and a broader category of negative effects including discrimination and child protection (e.g., self-harm and suicidal behavior resulting from cyberbullying).

This is an enormously important provision, because it harnesses the capabilities and incentives of researchers to examine and challenge the decisions of the big internet platforms. It will help to close the information gap between lawmakers and technology companies, which is perhaps the most universal problem in technology policy.

Researcher data access is critical to ensuring compliance

Oversight of the large online platforms is complicated for many reasons, not least because policy makers and the public have so little insight into the comprehensive workings of these social systems. Despite the widespread impression of far right-wing news dominating Facebook, it’s actually impossible to know if that’s the case with currently available data. While Google produces a wide range of academic research, the recent dismissal of Dr. Timnit Gebru suggests those papers do not provide an unvarnished view of the search giant. The industry-academic partnership Social Science One, an ambitious effort funded by Facebook, strove to balance user privacy and researcher access, but ultimately was not able to offer researchers sufficiently complete data to answer the most pressing questions. Social Science One’s entire European advisory committee stepped down in December, saying the project did not fulfill its goals. The voluntary measures taken by the internet platforms to enable researcher access are simply not working. In perhaps the worst case, Uber has strategically allowed access to datasets to spread a favorable corporate narrative.

These circumstances explain why the DSA includes a provision on data access for independent researchers—there is no other insight into large online platforms, so this research is necessary to better inform the public and the various governments of the EU. It would offer an unvarnished understanding of the scale and spread of illegal and harmful content online, from health misinformation to the sale of counterfeit goods.

Further, the researchers would be able to examine the choices made by the technology companies and better consider the alternatives they eschewed. In 2019, YouTube claimed that it had made algorithmic changes leading to its users watching fewer fringe videos that might misinform or even radicalize them. More recently, Instagram promised to use algorithms to reduce unlabeled advertising by its influencers. Yet there is no way to verify these claims or consider whether there were alternative interventions that would have been more effective. Data access for independent researchers would change this, shedding light both on the state of the online world and the decisions that online platforms are making to shape it. In turn, this enables sharper criticisms of the platforms. A more precise conversation about what interventions they should take could lead to more impactful changes.

Independent researchers would also provide a genuinely independent check on the platforms’ self-assessments and the independent audits required by the DSA, likely making these more accurate and revealing. The platforms’ self-assessments cannot be relied upon alone if one expects this legislation to change platform behavior. The DSA also requires that the large platforms pay for an independent audit, which is potentially more impactful. However, there will undoubtedly be a large financial incentive for those auditors, chosen by the platform companies, to be amenable to their clients’ perspective. The credible possibility that an independent academic study, with the same level of data access, will publicly refute the platforms’ assessments would go a long way to keeping both reports honest.

The DSA also empowers a group of trusted flaggers, who will also benefit from the researcher data access provision. Under the DSA, trusted flaggers are entities to be approved by the EU as having expertise and competence in identifying illegal content. Internet platforms would have to prioritize responses to illegal content notices submitted by these trusted flaggers. Since they would be required to represent the “collective interests,” many trusted flaggers would be non-profit groups and journalists looking to improve the online ecosystem. Notably, the DSA requires that platforms be able to accept illegal content notices in bulk, enabling trusted flaggers to submit many illegal content notices at once. This presents an opportunity for independent researchers to use their privileged access to platform data to help guide the trusted flaggers. By finding patterns in illegal content dissemination and building tools for its detection, academic researchers can help increase the amount of illegal content reported, with expedited responses from the large platforms. In some cases, the research groups might try to become trusted flaggers themselves. This might be especially impactful because the researchers can report content not visible to the general public.

Better information directly impacts other compliance mechanisms in the DSA, too. It will enable the European Board for Digital Services to accurately and fairly levy fines, which can be as high as six percent of annual revenue for the companies. For instance, researchers can examine whether the internet platforms are appropriately suspending accounts that are frequently posting manifestly illegal content, as would be required by the DSA. In short, the independent researcher data access provision of Article 31 strengthens every other oversight and compliance provision within the DSA, and could be the lynchpin of the legislation’s efficacy.

In fact, it is important enough to the success of the legislation that it presents a potential point of failure. The researchers are reliant on the large platforms to honestly provide datasets without edits or omissions, but will not often have the means to evaluate whether this is the case. The platforms are not likely to outright mislead the regulators, but as the Volkswagen emissions scandal demonstrated, this possibility should not be entirely discounted. Still, many of the platforms’ decisions at the margins will affect the usability of the data as well as the picture that data paints. In order to make sure this compliance system works, the Commission should ensure that the required independent audit confirms that platforms are honestly and completely fulfilling the researcher data access requests.

The specifics of the implementation will matter

Currently, without the DSA provisions in place, many researchers who are greatly interested in analyzing internet platform companies are restrained by technical, legal, and statistical challenges. For instance, much of the data from Facebook is private, and even for the public websites, the outside view is restricted. While a researcher can see what tweets are liked or shared, they cannot know what tweets a user saw but did not interact with. In other fields, taking a random sample can be an effective way to study large-scale problems, but the important questions of the web depend on the networks (the connections between users), and sampling from networks is exceptionally challenging. As the statistician Andrew Gelman writes, “The fundamental difficulty of network sampling is that a small sample of a network doesn’t look like a network itself.” Even though Twitter and YouTube offer some data access through APIs, it is not nearly enough to sufficiently understand how their networks work. By providing researchers with expanded data access from the internet platforms, the DSA would resolve many of these issues.
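
To make the network sampling problem concrete, here is a minimal sketch in Python. It does not use real platform data: it assumes a purely synthetic, randomly wired network, and the sizes (10,000 users, 50,000 connections, a 10 percent sample) are hypothetical values chosen for illustration. It draws a random sample of users and keeps only the connections where both users landed in the sample.

```python
import random

def random_graph(n_nodes, n_edges, seed=0):
    """Build a hypothetical undirected network as a set of (u, v) edges."""
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(range(n_nodes), 2)   # two distinct users
        edges.add((min(u, v), max(u, v)))      # store each pair once
    return edges

def induced_sample(edges, n_nodes, fraction, seed=1):
    """Keep a random fraction of users and only the edges between kept users."""
    rng = random.Random(seed)
    kept = set(rng.sample(range(n_nodes), int(n_nodes * fraction)))
    return {(u, v) for (u, v) in edges if u in kept and v in kept}

# Illustrative sizes only; these are assumptions, not figures from any platform.
N_NODES, N_EDGES, FRACTION = 10_000, 50_000, 0.10

edges = random_graph(N_NODES, N_EDGES)
sampled = induced_sample(edges, N_NODES, FRACTION)

full_avg_degree = 2 * len(edges) / N_NODES
sample_avg_degree = 2 * len(sampled) / int(N_NODES * FRACTION)
edge_share = len(sampled) / len(edges)

print(f"connections kept: {edge_share:.1%}")
print(f"average connections per user: {full_avg_degree:.1f} -> {sample_avg_degree:.1f}")
```

In this toy setting, a 10 percent sample of users retains only about 1 percent of the connections, because a connection survives only when both of its endpoints are sampled. A user who had ten connections in the full network appears to have roughly one in the sample. Real social networks are far more clustered than this random graph, so the exact numbers would differ, but the sketch shows why a small random sample gives a misleading picture of how content travels between users.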

Of course, with this change would come new implementation challenges. The current draft suggests that the large platforms need to provide researchers with the data, and those researchers must commit to preserving data security and confidentiality requirements that will be specific to each data access request. While many academic researchers may have experience working with private and sensitive data, these datasets may pose especially high risks. They could include Facebook posts meant only for family members, YouTube videos watched in private, and even possibly searches on PornHub. This data can be kept anonymized to the researchers, as the academic research community has strong safeguards and practices for maintaining confidentiality. However, some of these datasets will be of interest to outside actors for enabling blackmail and identity theft, or stealing trade secrets. It is not clear that most academic researchers will have data privacy and security standards that are robust to skilled and motivated hackers.

The platforms will understandably be hesitant to hand over large sensitive datasets to many different researchers, even without legal liability for disclosures. More specific data is much more valuable to researchers and will lead to more impactful research, but it’s also inherently a greater privacy risk, since it contains more information to identify individuals. Platforms will likely be concerned that even if they aren’t at fault for a data breach, they still may be blamed by their users.

The European Commission is right to incorporate researcher data access into the DSA, but it may need to play a larger role in enabling this access. As an alternative to putting the security onus on researchers, the Commission could set up a centralized process that enables secure data access to researchers without this capacity. The United States has effectively implemented this through systems like the Census Bureau’s Statistical Research Data Centers and the Coleridge Initiative’s Administrative Data Research Facility. The Secure Data Access Center in France works similarly, enabling researcher access to information such as sensitive health data. Another approach would be for the Commission to provide grant funding for a small number of trusted third-party facilities to set up this capacity for researchers. This would also help alleviate the cost of analyzing these massive datasets, by preempting the need for each research group to separately create secure analytical environments.

The Commission has correctly identified how important researcher data access will be to the effective oversight of large companies. In a sense, the European Commission is outsourcing enforcement and oversight to independent researchers, rather than creating a new agency. If implemented carefully, this facet of the Digital Services Act will be key to its success, contributing to a better public debate and more responsible actions by large internet platforms. If this measure works effectively, expanding the number of companies covered will be worth considering. Even without expanding the rest of the compliance requirements, better researcher access to platform datasets is a worthwhile investment. Legislators in the United States should also pay attention, as the proposals put forth by the European Commission in the Digital Services Act may be models for responsible internet governance.

[1] The online platform requirements do not apply to micro or small businesses, and so only companies with more than 50 staff or more than €10 million in annual turnover need to comply.

[2] Messaging applications, such as WeChat and WhatsApp, are not included. Further, while Netflix and Spotify might qualify as very large online platforms, this is unlikely to significantly affect them since they allow such limited contributions from their users.

Amazon, Facebook, and Google are general, unrestricted donors to the Brookings Institution. The findings, interpretations and conclusions in this piece are solely those of the author and not influenced by any donation.