How satellites and AI can fix development data problems

Data drives development policy. To determine aid packages and projects, policymakers need good data on everything from population to urban sprawl to economic livelihoods. Yet policymakers creating development policies, whether in response to disasters or with an eye toward the long term, face a core problem: measuring sustainable development variables.

Against the backdrop of an ever more urgent climate crisis, improving efforts to get good data has never been more important. The most recent report from the U.N. Intergovernmental Panel on Climate Change lays bare the scale of this challenge, yet even as scientists warn that time is running out to slow the warming of the planet, persistent disagreement remains about how much wealthy countries should be spending on climate assistance for lower-resourced ones.

These questions of how much to spend on aid and where to spend it raise a key issue with development broadly. In the past, poor forecasting and inefficient aid distribution have hindered the effectiveness of development programs, including on climate. If policymakers and researchers cannot get accurate information about a problem, it’s more difficult to forge effective solutions. But new technology for development analysis, driven by a combination of satellite imagery and machine learning, may hold the keys to progress.

As it stands, data can be inaccurate, costly to acquire, or hard to obtain at all. This is especially the case in low-resourced countries. In Africa alone, 34% of countries have gone more than 15 years since their last agricultural survey. Even when surveys are conducted, the data collected is often incomplete or inaccurate. And yet the need for this high-fidelity data is pressing. When a locust plague struck East Africa in 2020, one of the key challenges in responding to that crisis was simply determining the location of locust swarms. In the absence of effective tools to monitor and respond to the locusts marauding the region, 19 million farmers across East Africa lost their crops, causing widespread food shortages.

Our new paper introduces an elegant solution to the problem of measuring sustainable development: machine learning applied to satellite imagery. The explosion in commercial satellites and the public availability of satellite imagery open up new opportunities to analyze sustainable development-related variables at low cost, with high accuracy, and at great scale. When we compared satellite imagery from 200 random sample sites across multiple continents, we observed a substantial increase in the number and quality of images captured over time. Sites once imaged a couple of times a year are now captured multiple times a week, and these images detail localized activity like infrastructure growth.

Satellite imagery is one piece of the puzzle. Another is the growing use, and usefulness, of artificial intelligence (AI)-powered machine learning (ML) models to extract common patterns of information from available data. In the development context, researchers have built models increasingly capable of assessing sustainable development metrics from satellite images. One satellite image of arable land might tell the story of a village’s economic health—its crop yields, its agricultural diversity, and its infrastructure development.

In our assessment, ML models leveraging satellite imagery inputs can amplify, and possibly outperform, traditional measurement tools like ground-based surveys and censuses, offering a promising path forward. These technologies are unlikely to replace ground-based surveys altogether. But by augmenting these methods, they can help address the data problems in sustainable development policy. For example, researchers can use satellite-based estimates of buildings, nighttime lights, and other markers to equip policymakers with more accurate estimates of local population size than infrequent, traditional census methods provide, particularly in lower-resourced countries.

Techniques to “train” ML models, that is, to teach them which patterns to extract from available data, are improving. Researchers can now build models even when training data is less readily available or low in quality, as is often the case with sustainable development data. Synthetically created training data (data that is artificially created instead of generated by real-world events), we found, is another route to addressing data shortcomings, and one that is especially useful in the development context. In the agricultural setting, for example, crop model simulations trained on synthetic data to predict crop yields have performed as well as or better than approaches that calibrate directly to limited field data.
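To make the synthetic-data idea concrete, here is a toy sketch in Python. It is not the method from our paper: the simulator `simulate_yield`, its coefficients, the "greenness index" feature, and all noise levels are invented for illustration. It fits one simple model to thousands of simulated fields and another to six noisy field measurements, then scores both against held-out true yields. The sketch assumes the simulator is faithful to real fields, which is the strong assumption any such approach rests on.

```python
import random


def simulate_yield(greenness, rng):
    """Hypothetical crop simulator: yield in t/ha from a satellite greenness index."""
    return 1.0 + 6.0 * greenness + rng.gauss(0.0, 0.3)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(model, xs, ys):
    """Root-mean-square error of a fitted (slope, intercept) pair."""
    slope, intercept = model
    squared = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared) / len(squared)) ** 0.5


rng = random.Random(0)

# Large synthetic training set drawn from the simulator; no field survey needed.
sim_x = [rng.uniform(0.2, 0.9) for _ in range(5000)]
sim_y = [simulate_yield(x, rng) for x in sim_x]
synthetic_model = fit_line(sim_x, sim_y)

# Scarce field data: six plots with noisy measured yields (toy values).
field_x = [rng.uniform(0.2, 0.9) for _ in range(6)]
field_y = [1.0 + 6.0 * x + rng.gauss(0.0, 1.0) for x in field_x]
field_model = fit_line(field_x, field_y)

# Held-out fields with the true, noise-free yield, used only for scoring.
test_x = [rng.uniform(0.2, 0.9) for _ in range(1000)]
test_y = [1.0 + 6.0 * x for x in test_x]

synthetic_rmse = rmse(synthetic_model, test_x, test_y)
field_rmse = rmse(field_model, test_x, test_y)
print(f"trained on synthetic data: RMSE {synthetic_rmse:.3f} t/ha")
print(f"calibrated to 6 field plots: RMSE {field_rmse:.3f} t/ha")
```

If the simulator were biased relative to real fields, the synthetic-trained model would inherit that bias, which is one reason a small amount of ground data remains valuable for checking.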

In addition, transfer learning and semi-supervised learning can enable researchers to circumvent issues surrounding data quantity and quality. In the former approach, models leverage large quantities of readily available data to learn a task similar to the task of interest and then “transfer” extracted patterns to sustainable development metrics. In the latter, models extract patterns from unlabeled satellite data (sometimes combined with small amounts of labeled data) without substantial human input. While noisy training data is a persistent problem that distorts model performance and evaluation, we found that models trained on high volumes of noisy data but tested on un-degraded data were stable performers, indicating that ML models can still be robust.
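The point about noisy training data can be illustrated with a small Python sketch. It is a toy under stated assumptions, not our experimental setup: the single "greenness" feature, the 0.5 cropland cut-off, the 30% label-flip rate, and the threshold classifier are all hypothetical. A model is trained on a high volume of noisy labels and then scored twice, once against clean labels and once against equally noisy ones.

```python
import random

NOISE_RATE = 0.3  # assumed share of labels that are wrong


def true_label(greenness):
    """Toy ground truth: a pixel is cropland when its greenness exceeds 0.5."""
    return greenness > 0.5


def make_dataset(n, rng, noise_rate):
    """Return features, clean labels, and observed labels flipped at noise_rate."""
    xs = [rng.random() for _ in range(n)]
    clean = [true_label(x) for x in xs]
    observed = [(not y) if rng.random() < noise_rate else y for y in clean]
    return xs, clean, observed


def accuracy(threshold, xs, labels):
    """Share of points where 'greenness > threshold' matches the given labels."""
    hits = sum((x > threshold) == y for x, y in zip(xs, labels))
    return hits / len(xs)


def fit_threshold(xs, labels):
    """Pick the cut-off in {0.00, 0.01, ..., 1.00} with the best training accuracy."""
    candidates = [i / 100 for i in range(101)]
    return max(candidates, key=lambda t: accuracy(t, xs, labels))


rng = random.Random(42)
train_x, _, train_noisy = make_dataset(10_000, rng, NOISE_RATE)
test_x, test_clean, test_noisy = make_dataset(2_000, rng, NOISE_RATE)

# Train only on the noisy labels, as if clean ones were unavailable.
threshold = fit_threshold(train_x, train_noisy)

clean_accuracy = accuracy(threshold, test_x, test_clean)
noisy_accuracy = accuracy(threshold, test_x, test_noisy)
print(f"learned threshold: {threshold:.2f}")
print(f"accuracy against clean labels: {clean_accuracy:.3f}")
print(f"accuracy against noisy labels: {noisy_accuracy:.3f}")
```

Because the label flips here are random and symmetric, they do not move the best cut-off, so the learned threshold stays close to the true one. Scoring against noisy labels, however, caps the apparent accuracy near 70% and understates how well the model actually performs. Systematic (non-random) label errors would be harder to recover from.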

ML-driven, satellite-derived assessments of sustainable development variables hold evident promise, but they still face significant challenges. Trust issues loom large. Many ML models are not transparent, and it is often unclear how models arrive at a given outcome, such as predicting that a field’s crop yield will be low based on a satellite image. Policymakers are understandably wary of algorithms that cannot be fully explained. In addition to trust and explainability challenges, ML-driven estimates of sustainable development variables run into issues of scope. While some variables like crop growth can be inferred through ML-based approaches, others like educational attainment cannot be derived from satellite imagery.

Looking forward, researchers and practitioners alike can advance the use of satellite-driven assessments by focusing on explainability in models, cultivating public-private partnerships to operationalize model usage, and better understanding how satellite imagery and AI tools can address development data gaps. Policymakers, for their part, can better understand both the potential and the limitations of this emerging technology. After all, research does not occur in a vacuum, and, especially where development is concerned, policymakers have an important role in setting state policies, prioritizing investments, and drawing attention to issues.

By focusing on strategies such as the use of synthetic data, transfer learning, and testing models on a small amount of high-quality data to counter noisy data, researchers and policymakers can leverage the power of machine learning and satellite imagery to change sustainable development for the better.

Marshall Burke is an associate professor in the Department of Earth System Science and deputy director of the Center on Food Security and the Environment at Stanford University.
Anne Driscoll is a research data analyst at the Center on Food Security and the Environment at Stanford University.
David Lobell is a professor in the Department of Earth System Science and the Gloria and Richard Kushel Director of the Center on Food Security and the Environment at Stanford University.
Stefano Ermon is an assistant professor in the Department of Computer Science at Stanford University.
This post is adapted from Stanford HAI’s policy brief, “Using Satellite Imagery to Understand and Promote Sustainable Development.”