How data science can ease the COVID-19 pandemic

A team of doctors and nurses turn over a coronavirus patient in a room in the intensive care unit of Purpan Hospital. The hospital has been receiving COVID-19 patients since the epidemic that began in China appeared in France; the country has been under lockdown since 17 March 2020. About a hundred caregivers work in shifts around the clock with the hospitalized patients who have the virus. 21 April 2020, Toulouse, France.

Social distancing and stay-at-home orders in the United States have slowed the infection rate of SARS-CoV-2, the pathogen that causes COVID-19. This has halted the immediate threat to the U.S. healthcare system, but consensus on a long-term plan or solution to the crisis remains elusive. As the reality settles in that there are no quick fixes and that therapies and vaccines will take several months if not years to invent, validate, and mass-produce, this is a good time to consider another question: How can data science and technology help us endure the pandemic while we develop therapies and vaccines?

Before policymakers reopen their economies, they must be sure that the resulting new COVID-19 cases will not force local healthcare systems to resort to crisis standards of care. Doing so requires not just prevention and suppression of the virus, but ongoing measurement of virus activity, assessment of the efficacy of suppression measures, and forecasting of near-term demand on local health systems. This demand is highly variable given community demographics, the prevalence of pre-existing conditions, and population density and socioeconomics.

Data science can already provide ongoing, accurate estimates of health system demand, which is a requirement in almost all reopening plans. We need to go beyond that to a dynamic approach of data collection, analysis, and forecasting to inform policy decisions in real time and iteratively optimize public health recommendations for reopening. While most reopening plans propose extensive testing, contact tracing, and monitoring of population mobility, almost none consider setting up such a dynamic feedback loop. Such feedback could determine what level of virus activity can be tolerated in an area, given regional health system capacity, and adjust population distancing accordingly.

We propose that by using existing technology and some nifty data science, it is possible to set up that feedback loop, which would keep healthcare demand under the threshold of what is available in a region. Just as the maker community stepped up to cover for the failures of the government to provide adequate protective gear to health workers, this is an opportunity for the data and tech community to partner with healthcare experts and provide a measure of public health planning that governments are unable to provide. Therefore, the question we invite the data science community to focus on is: How can data science help forecast regional health system resource needs given measurements of virus activity and suppression measures such as population distancing?

For the data science effort to work, first and foremost, we need to fix delays in data collection and access introduced by existing reporting processes. Currently, most departments of public health are collecting and reporting metrics that are not helpful, and are reporting them with 48-hour delays and often with errors. Although there are examples of regional excellence in such reporting, by and large, the recommendations from the health IT community around accurate and fast public health reporting remain ignored. For instance, consider the number of COVID-19 hospitalizations, which is the best indicator of the disease’s burden on the regional health system. At present, because of time lags in confirming and reporting cases and a failure to distinguish between current and cumulative hospitalizations, even regions that report hospitalization data often provide only a blurry picture of that burden. Regions should ideally report both suspected and confirmed hospital cases and indicate the date of admission, in addition to the date of report or confirmation.

Even with perfect reporting, there are fundamental delays in what such data can tell us. For example, new admissions to a hospital today reflect virus activity as of 9 to 13 days ago (which depends, in turn, on social distancing interventions from up to 17 days prior). Not factoring in such considerations has led to significant overestimation of hospitalization needs nationwide. We therefore need to measure virus activity via proxy measures that are indicative early in the lifecycle of the virus. We must benchmark these against the number of new and total COVID-19 hospitalizations as well as, ideally, the number of new infections, assuming it is accurately measured through large-scale testing. Available proxy measures include test positivity rates in health systems, case counts, deaths, and perhaps seropositivity rates. Ongoing symptom tracking via smartphone apps, daily web or phone surveys, or cough sounds can identify potential hotspots where virus transmission rates are high. Contact tracing, which currently requires significant human effort, can also help track potential cases if it can be scaled using technology under development by major American tech companies.
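To make the idea of benchmarking a proxy against hospitalizations concrete, here is a minimal Python sketch of one possible approach: shift the proxy series forward in time and keep the lag at which it correlates best with admissions. The function names (`pearson`, `best_lead_lag`) and the synthetic series are illustrative assumptions, not part of any existing reporting system, and a real analysis would also have to handle noise, day-of-week effects, and confounding.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def best_lead_lag(proxy, admissions, max_lag=21):
    """Return (lag_in_days, correlation) for the lag at which the proxy,
    shifted forward by that many days, best matches hospital admissions."""
    best_lag, best_r = 0, -2.0
    for lag in range(max_lag + 1):
        # Compare proxy[t] with admissions[t + lag] over the overlapping window.
        n = min(len(proxy), len(admissions) - lag)
        if n < 3:
            break
        r = pearson(proxy[:n], admissions[lag:lag + n])
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

# Synthetic, made-up data: admissions are the proxy delayed by 10 days.
proxy = [2.0 + math.sin(day / 5.0) for day in range(60)]
admissions = [2.0] * 10 + proxy[:50]

lag, r = best_lead_lag(proxy, admissions)
print(f"proxy leads admissions by about {lag} days (r = {r:.2f})")
```

A proxy that only correlates with admissions at a lag of a day or two offers little early warning; one that leads by a week or more is the kind of signal worth tracking.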

With reliable tracking and benchmarking in place, we can calculate infection prevalence as well as daily growth and transmission rates, which are essential for determining if policies are working. This is a problem not only of data collection but also of data analysis. Issues of sensitivity, daily variability, time lags, and confounding need to be studied before such data can be used reliably. For instance, symptom tracking is nonspecific and may have difficulty tracking virus activity at low prevalence. Other emerging data sources such as wastewater and smart thermometer data hold similar promise but will have to grapple with these same issues.
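As a simple illustration of a daily growth rate, the following Python sketch fits a straight line to the logarithm of daily counts and converts the slope into a doubling time. It assumes roughly exponential growth over a short window. The function names and the example counts are made up for illustration. This is a growth rate, not a reproduction number, which would also require an assumption about the generation interval.

```python
import math

def daily_growth_rate(counts):
    """Least-squares slope of log(count) against day index.
    A value of 0.10 means counts grow by roughly 10% per day."""
    if len(counts) < 2 or any(c <= 0 for c in counts):
        raise ValueError("need at least two strictly positive counts")
    logs = [math.log(c) for c in counts]
    n = len(logs)
    mean_day = (n - 1) / 2
    mean_log = sum(logs) / n
    cov = sum((day - mean_day) * (y - mean_log) for day, y in enumerate(logs))
    var = sum((day - mean_day) ** 2 for day in range(n))
    return cov / var

def doubling_time(rate):
    """Days for counts to double at the given daily growth rate
    (infinite if counts are flat or shrinking)."""
    return math.log(2) / rate if rate > 0 else math.inf

# Hypothetical week of daily case counts for one region.
counts = [100, 110, 121, 133, 146, 161, 177]
rate = daily_growth_rate(counts)
print(f"growth rate per day: {rate:.3f}, doubling time: {doubling_time(rate):.1f} days")
```

Comparing this rate before and after a policy change, with the reporting lag taken into account, is one rough way to check whether the policy is working.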

We then need to estimate the regional effects of policy interventions such as shelter-in-place orders (via mobility reduction) and contact tracing (via reductions in new cases), first as simple forecasts and eventually maturing to what-if analyses. Several efforts have quantified the impact of mobility on virus transmission and some have suggested “safe” forms of mobility. While there are many potential ways to quantify population mobility — such as via traffic patterns, internet bandwidth usage by address, and location of credit card swipes — the most scalable mechanism to measure mobility appears to be via tracking of smartphones. Groups such as the COVID-19 Mobility Data Network provide such data daily in anonymized, aggregated reports.

Once the ability to project from mobility to transmission to health system burden is constructed, we can “close the loop” by predicting how much mobility we can afford given measured virus activity and anticipated health system resources in the next two weeks. Researchers have already attempted to calculate “tolerable transmission” in the form of maximum infection prevalence in a given geography that would not overload health systems. Coupling such tolerable transmission estimates with daily assessments of a valid sample of the population (via testing, via daily surveys, via electronic health record-based surveillance) would allow monitoring of changes in transmission which can alert us to the need to intervene, such as by reducing mobility. As new measures such as contact tracing cut transmission rates, these same monitoring systems can tell us that it is safe to increase mobility further. Continuously analyzing current mobility as well as virus activity and projected health system capacity can allow us to set up “keep the distance” alerts that trade off tolerable transmission against allowed mobility. Doing so will allow us to intelligently balance public health and economic needs in real time.
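The decision step of such a loop can be sketched in a few lines of Python. The sketch assumes that a forecast of bed demand two weeks ahead already exists; the class name, the function name, and the 90 percent and 60 percent thresholds are hypothetical placeholders that a region would have to set with its own health experts.

```python
from dataclasses import dataclass

@dataclass
class RegionForecast:
    projected_beds_needed: float  # forecast COVID-19 bed demand about two weeks out
    beds_available: float         # beds the regional health system can staff

def keep_the_distance_alert(forecast, tighten_at=0.9, relax_below=0.6):
    """Map projected utilization to a coarse mobility recommendation.
    The thresholds are illustrative assumptions, not recommended values."""
    if forecast.beds_available <= 0:
        raise ValueError("beds_available must be positive")
    utilization = forecast.projected_beds_needed / forecast.beds_available
    if utilization >= tighten_at:
        return "reduce mobility"
    if utilization < relax_below:
        return "mobility can increase"
    return "hold current mobility"

# Hypothetical forecasts for three regions.
for needed in (95, 75, 50):
    print(needed, keep_the_distance_alert(RegionForecast(needed, 100)))
```

Leaving a gap between the two thresholds keeps the recommendation from flipping back and forth on small day-to-day changes in the forecast.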

Concretely, then, the crucial “data science” task is to learn the counterfactual function that links last week’s population mobility and today’s transmission rates to hospital demand two weeks later. Imagine taking past measurements of mobility around April 10 in a region (such as Santa Clara County’s report from the COVID-19 Community Mobility Reports), the April 20 virus transmission rate estimate for the region, and the April 25 burden on the health system (such as from the Santa Clara County Hospitalization dashboard), to learn a function that uses today’s mobility and transmission rates to anticipate needed hospital resources two weeks later. It is unclear how many days of data of each proxy measurement we need to reliably learn such a function, what mathematical form this function might take, and how we do this correctly with the observational data on hand and avoid the trap of mere function-fitting. However, this is the data science problem that needs to be tackled as a priority.
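As a starting point only, the Python sketch below shows the simplest possible version of this task: align mobility from seven days earlier and today's transmission rate with hospital demand fourteen days later, then fit a linear function by ordinary least squares. The linear form, the 7-day and 14-day offsets, the function names, and the synthetic data are all assumptions made for illustration. This is exactly the kind of plain function-fitting on observational data that we caution about, so it should be read as a baseline to improve upon, not as a counterfactual model.

```python
import math

def build_training_rows(mobility, transmission, demand, mobility_lag=7, horizon=14):
    """Pair [1, mobility[t - lag], transmission[t]] with demand[t + horizon]."""
    n = min(len(mobility), len(transmission), len(demand))
    rows, targets = [], []
    for t in range(mobility_lag, n - horizon):
        rows.append([1.0, mobility[t - mobility_lag], transmission[t]])
        targets.append(demand[t + horizon])
    return rows, targets

def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict_demand(coefs, mobility_last_week, transmission_today):
    """Projected hospital demand two weeks out under the fitted linear model."""
    return coefs[0] + coefs[1] * mobility_last_week + coefs[2] * transmission_today

# Synthetic, made-up series for 80 days in one region.
mobility = [40 + 10 * math.sin(day / 3) for day in range(80)]
transmission = [1 + 0.3 * math.cos(day / 5) for day in range(80)]
demand = [0.0] * 80
for t in range(7, 66):
    demand[t + 14] = 50 + 2 * mobility[t - 7] + 30 * transmission[t]

rows, targets = build_training_rows(mobility, transmission, demand)
coefs = fit_least_squares(rows, targets)
print("fitted coefficients:", [round(c, 3) for c in coefs])
print("projected demand:", round(predict_demand(coefs, mobility[-7], transmission[-1]), 1))
```

On real data the interesting questions start where this sketch stops: how much history is needed, whether a linear form is adequate, and how to separate the effect of mobility from everything else that changed at the same time.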

Adopting such technology and data science to keep anticipated healthcare needs under the threshold of availability in a region requires multiple privacy trade-offs, which will require thoughtful legislation so that the solutions invented for enduring the current pandemic do not lead to loss of privacy in perpetuity. However, given the immense economic as well as hidden medical toll of the shutdown, we urgently need to construct an early warning system that tells us to enhance suppression measures if the next COVID-19 outbreak peak might overwhelm our regional healthcare system. It is imperative that we focus our attention on using data science to anticipate, and manage, regional health system resource needs based on local measurements of virus activity and effects of population distancing.

Dr. Nigam Shah is an associate professor of Medicine (Biomedical Informatics) at Stanford University and Associate CIO for data science at Stanford Healthcare.
Dr. Jacob Steinhardt is an assistant professor of statistics at University of California, Berkeley.