The COVID-19 pandemic has shown that, despite best intentions and efforts, real-time data emerging from global crises may be uncertain, rapidly evolving, incomplete, or even misleading. The lag between COVID-19 transmission and the onset of symptoms, as well as the lag between getting tested and receiving test results, can lead to outdated infection rate estimates and frequently changing public health guidance, which in turn reduces public understanding and compliance. Governments and academic researchers must choose whether and how to update backlogged information or retroactively correct past statistics, and those choices may lead to changed, reversed, or delayed policies.
During recent outbreaks, including the 2014-2016 Ebola epidemic in West Africa and the ongoing COVID-19 pandemic, sharing genomic sequencing data in public databases and data repositories such as GenBank and GISAID has proven extremely valuable, and is supported by international agreements such as the 1996 Bermuda Principles and the 2010 Nagoya Protocol. Meanwhile, academic interest in public health data sharing has led to innovative platforms that map disease occurrence, leverage open source and social media intelligence about public health, and even crowdsource data collection. However, these efforts are fragmented at both the global and national levels.
Building common data spaces to enhance information flows
To improve data sharing during global public health crises, it is time to explore the establishment of a common data space for highly infectious diseases. Common data spaces integrate multiple data sources, enabling a more comprehensive analysis of data based on greater volume, range, and access. In essence, a common data space is like a public library system, which has collections of different types of resources, from books to video games; processes to integrate new resources and to borrow resources from other libraries; a catalog system to organize, sort, and search through resources; a library card system to manage users and authorization; and even curated collections or displays that highlight themes among resources.
Even before the COVID-19 pandemic, there was significant momentum to make critical data more widely accessible. In the United States, Title II of the Foundations for Evidence-Based Policymaking Act of 2018, or the OPEN Government Data Act, requires federal agencies to publish their information online as open data, using standardized, machine-readable data formats. This information is now available on the federal data.gov catalog and includes 50 state- or regional-level data hubs and 47 city- or county-level data hubs. In Europe, the European Commission released a data strategy in February 2020 that calls for common data spaces in nine sectors, including healthcare, shared by EU businesses and governments.
Going further, a common data space could help identify outbreaks and accelerate the development of new treatments by compiling line list incidence data, epidemiological information and models, genome and protein sequencing, testing protocols, results of clinical trials, passive environmental monitoring data, and more.
Moreover, it could foster a common understanding and consensus around the facts, which is a prerequisite for reaching international buy-in on policies to address situations unique to COVID-19 or future pandemics, such as the distribution of medical equipment and PPE, disruption to the tourism industry and global supply chains, social distancing or quarantine, and mass closures of businesses.
Challenges of establishing a global common data space
Despite these potential advantages, setting up a common data space that is usable and secure is no simple task. Even with widespread consensus within academia on the importance of sharing public health data, there are real-world technical, geopolitical, and ethical barriers to implementation on a global scale.
First is the technical challenge of setting up a comprehensive, secure, and usable data space system. Integrating data from multiple data sources can be time-consuming and difficult, especially considering low data quality, disparate methods of data collection, lags in data reporting, and inherent uncertainties. Thus, it is important to regularly audit data in shared data spaces—flagging poor data quality and outdated information—to communicate levels of confidence or uncertainty in the data. In addition, novel “data space” approaches can help avoid the high upfront costs of cleaning, processing, and integrating data ex ante, while emerging AI and ML algorithms and data standards could automatically provide basic functionality—enabling researchers to focus their efforts on advanced integrations. The application of blockchain could improve the security and resiliency of systems against accidental or malicious data corruption.
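As a rough illustration of what a regular audit could look like, the sketch below flags records that are outdated or incomplete and maps those flags to a coarse confidence label. Everything in it is an assumption made for illustration: the `Record` fields, the seven-day freshness threshold, the flag names, and the confidence rules are hypothetical and are not drawn from any existing data space.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Record:
    """One hypothetical entry in a shared data space (fields are assumed)."""
    source: str
    reported_on: Optional[date]
    case_count: Optional[int]


def audit(record: Record, as_of: date, max_age_days: int = 7) -> List[str]:
    """Return quality flags for one record; an empty list means no issue was found."""
    flags = []
    if record.reported_on is None:
        flags.append("missing_date")
    elif (as_of - record.reported_on).days > max_age_days:
        # The threshold is an illustrative choice, not a recommended standard.
        flags.append("outdated")
    if record.case_count is None:
        flags.append("missing_count")
    elif record.case_count < 0:
        flags.append("implausible_count")
    return flags


def confidence(flags: List[str]) -> str:
    """Map flags to a coarse label that could be displayed next to the data."""
    if not flags:
        return "high"
    if len(flags) > 1 or "implausible_count" in flags:
        return "low"
    return "medium"
```

Publishing a label such as `confidence(audit(record, as_of))` alongside each record would be one simple way to communicate uncertainty to downstream users, although a real system would need far richer checks agreed upon by the contributing institutions.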
On the geopolitical front, issues of data protectionism, national security, economic competition, lack of trust, and differing privacy regulations and values impede the development of an international common data space. Pre-print publication policies have helped incentivize data sharing and reduce academic concerns about IP, data ownership, and publication rights, yet there remains a gap in translating academic research for policymakers and the general public. In the past, the exchange of epidemiological prediction models, risk maps, and disaster planning simulations has helped researchers understand country-specific concerns while navigating future uncertainty and high levels of stakeholder complexity.
Underlying these geopolitical issues are ethical questions about data access, equity, and privacy. For instance, how can we ensure that the costs and benefits of a common data space are fairly distributed? It will become necessary to fill gaps in disease detection in under-resourced areas, while simultaneously ensuring fair and affordable access to resulting medicines and treatments among communities that contribute data. On the other hand, it is essential to consider how to address “free riding” nations that benefit from a common data space without sharing their own data. We must also ask how to handle ownership and attribution when researchers share data, what ethical research and accountability standards are necessary, in which contexts to require informed consent from research participants, and how to offer widely accessible data during public health emergencies while maintaining the privacy of all individuals involved.
With these considerations in mind, the question then turns to whether a national or regional system of data sharing is the most realistic goal—or whether it is possible to achieve a truly global system of common data sharing. Paradoxically, a common data space may help increase international trust and cooperation during future pandemics—but cannot be enacted without some baseline level of international trust and cooperation.