On June 24th the New York Times reported the frightful story of Detroit resident Robert Julian-Borchak Williams. Williams, who is African American, lives in the wealthy Detroit suburb of Farmington Hills and was contacted in January by the Detroit Police Department and told to turn himself in. After ignoring what he assumed was a prank, Williams was arrested by two police officers in front of his wife and two young daughters as he arrived home from work. Thirty hours after being detained, Williams was released on bail once it became clear the police had arrested the wrong man.
As the Times put it, Williams’s case is noteworthy because it may be the first known example of an American wrongfully arrested on the basis of a flawed match from a facial recognition algorithm. Williams’s story brings facial recognition technologies (FRT) squarely into the ongoing conversation in the United States around racial injustice. In May of this year, Stanford’s Institute for Human-Centered Artificial Intelligence convened a workshop to discuss emerging questions about the performance of facial recognition technologies. Although the workshop was held before the nationwide upheaval sparked by the killing of George Floyd, the issues covered are central to the ongoing reckoning with systemic inequities, discrimination, and technology.
Facial recognition technologies have grown in sophistication and adoption across American society: Consumers now use facial recognition tech to unlock their smartphones and cars, retailers use these systems for targeted advertising and to monitor stores for shoplifters, and law enforcement agencies turn to them to identify suspects. But as the popularity of facial recognition tech has grown, significant anxieties around its use have emerged—including declining expectations of privacy, worries about the surveillance of public spaces, and algorithmic bias perpetuating systemic injustices. In the wake of the public demonstrations denouncing the deaths of George Floyd, Breonna Taylor, and Ahmaud Arbery, Amazon, Microsoft, and IBM all announced they would pause their facial recognition work for law enforcement agencies. Given the potential for facial recognition algorithms to perpetuate racial bias, we applaud these moves. But the ongoing conversation around racial injustice also requires a more sustained focus on the use of these systems.
To that end, we want to describe actionable steps that regulators at the federal, state, or local level (or private actors who deploy or use FRT) can take to build an evaluative framework that ensures facial recognition algorithms are not misused. Technologies that work in controlled lab settings may not work as well under real-world conditions, for reasons that span both a data dimension and a human dimension. The former is what we call “domain shift”: models perform one way in development settings and another way in end-user applications. The latter, which we refer to as “institutional shift,” describes differences in how the output of an FRT model is interpreted across the institutions using the technology.
Policymakers can ensure that responsible protocols are in place to validate that facial recognition technology works as billed and to inform decisions about whether and how to use FRT. In building a framework for responsible testing and development, policymakers should empower regulators to use stronger auditing authority and the procurement process to prevent facial recognition applications from evolving in ways that would be harmful to the broader public.
Taking the algorithm out of the lab
The facial recognition industry in the United States is growing at a tremendous pace but lacks a sufficiently robust testing regime to validate the technology. Currently valued at $5 billion, the market for FRT systems is projected to double by 2025. While the National Institute of Standards and Technology has established the well-known Face Recognition Vendor Test (FRVT) benchmarking standard, rapid adoption across industries and complex ethical concerns about FRT’s impact on society require much more substantial testing.
Many vendors currently advertise high performance metrics for their software, but these tests are carried out in the confines of carefully calibrated test settings. Evaluating the performance of FRT for a real-world task like identifying individuals from stills of closed-circuit television in real time is a significantly more complicated task. The context in which accuracy is tested is often vastly different from the context in which the actual program is applied. FRT vendors may train their systems with clear, well-lit images, but when deployed in law-enforcement applications, for example, officers might use FRT on live footage from body cameras of far lower quality. Computer science research has established that this “domain shift” can significantly degrade model performance.
Domain shift also entails the profound problem of bias: Algorithms trained on one demographic group may perform poorly on another. One leading report found that false positive rates varied by factors of 10 to 100 across demographic groups, with such errors being “highest in West and East African and East Asian people, and lowest in Eastern European individuals.”
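Disparities of this kind surface whenever error rates are disaggregated by group rather than averaged over a whole test set. The sketch below illustrates the arithmetic with entirely invented similarity scores and group names; it does not use real FRT data or any vendor API.

```python
# Hypothetical illustration: per-group false positive rates for a face-matching
# system. All scores, labels, and group names below are invented.

def false_positive_rate(scores, labels, threshold):
    """FPR = non-matching pairs incorrectly accepted / all true non-matching pairs."""
    false_accepts = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    non_matches = sum(1 for y in labels if y == 0)
    return false_accepts / non_matches if non_matches else 0.0

# Similarity scores for image pairs, with ground truth (1 = same person, 0 = different).
trials = {
    "group_a": ([0.91, 0.40, 0.35, 0.88, 0.30], [1, 0, 0, 1, 0]),
    "group_b": ([0.93, 0.72, 0.68, 0.90, 0.25], [1, 0, 0, 1, 0]),
}

threshold = 0.6
rates = {g: false_positive_rate(s, y, threshold) for g, (s, y) in trials.items()}
# A single global threshold can produce sharply different error rates per group:
# here group_a has no false accepts while two of group_b's non-matches clear 0.6.
```

Aggregate accuracy can look excellent while one group bears nearly all of the false accepts, which is why the NIST finding reports error rates per demographic group rather than overall.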
Placing a human at the center of the action
How humans incorporate algorithmic output can also contribute to the failure of facial recognition systems. This institutional shift comes from the fact that the same system may be utilized in different ways by different companies and agencies. This type of uncertainty can stem from users selectively listening to model output that confirms their preexisting biases, ignoring model output altogether, or over-trusting the algorithm.
To take a relatively clear-cut example, two police departments in neighboring jurisdictions deploying identical systems could reach divergent conclusions about the identification of a suspect using FRT if one department insists on using a higher confidence threshold than its neighbor. What technologists would see as accurate may be interpreted quite differently by the operator using FRT algorithms in the field.
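The arithmetic behind that example is simple, which is part of the point. The sketch below uses invented candidate names and scores to show how identical model output diverges under two departmental threshold policies; neither the thresholds nor the scores come from any real system.

```python
# Hypothetical sketch of "institutional shift": two departments apply different
# confidence thresholds to identical model output. All values are invented.

candidate_scores = {"suspect_1": 0.87, "suspect_2": 0.79, "suspect_3": 0.55}

def flagged_matches(scores, threshold):
    """Return the candidates an operator would treat as a positive match."""
    return sorted(name for name, score in scores.items() if score >= threshold)

dept_a = flagged_matches(candidate_scores, threshold=0.90)  # conservative policy
dept_b = flagged_matches(candidate_scores, threshold=0.75)  # permissive policy
# Same algorithm, divergent conclusions: dept_a flags no one,
# while dept_b flags two candidates for follow-up.
```

The model is "accurate" in the same sense for both departments; what differs is the institutional rule for acting on its scores.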
Responsible FRT testing protocols
With these two overarching sources of uncertainty in mind—domain and institutional shifts—we can build recommendations for a responsible testing protocol to address these challenges. To address negative outcomes stemming from domain-specific concerns, we recommend policymakers focus on the following three pieces of a larger testing protocol.
First, vendors and developers should put greater emphasis on transparency in their training data. Ideally, this would consist of vendors making the full training and test set imagery available to the public. If that is not feasible, a less complete alternative would be to release a large random sample, allowing discrepancies between vendor-reported and user-observed metrics to be compared.
Second, vendors should provide users and third-party evaluators meaningful access to testing imagery so that they can conduct independent validation of in-domain performance. Such access should also allow users to label their own testing data, reserve holdout testing data, and define metrics that must be met prior to commercial deployment.
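A minimal version of that user-side validation can be sketched as a deployment gate: the user labels a reserved holdout set, queries the vendor model, and refuses to deploy unless a self-defined metric is met. The `vendor_predict` function below is a stand-in for a black-box vendor model, not a real API, and the holdout data is invented.

```python
# Minimal sketch of user-side holdout validation. `vendor_predict` is a
# placeholder for a vendor's black-box model; the data below is invented.

def vendor_predict(pair):
    # Stand-in for the vendor model: accept a pair as a match above 0.6.
    return pair["score"] >= 0.6

def validate_on_holdout(holdout, min_accuracy):
    """Gate deployment on a user-defined metric over reserved, user-labeled data."""
    correct = sum(1 for pair in holdout if vendor_predict(pair) == pair["same_person"])
    accuracy = correct / len(holdout)
    return accuracy, accuracy >= min_accuracy

holdout = [
    {"score": 0.92, "same_person": True},
    {"score": 0.55, "same_person": False},
    {"score": 0.70, "same_person": False},  # an in-domain false positive
    {"score": 0.85, "same_person": True},
]
accuracy, deploy_ok = validate_on_holdout(holdout, min_accuracy=0.9)
# 3 of 4 correct: accuracy 0.75 falls below the 0.9 bar, so deployment is blocked.
```

Because the holdout set is labeled and reserved by the user rather than the vendor, a gate like this measures in-domain performance directly instead of taking advertised benchmarks on faith.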
Third, vendors and users should conduct ongoing, periodic recertifications of the performance of facial recognition algorithms. Vendors should provide comprehensive release notes and documentation for each version of the model in question. These release notes should, at a minimum, document changes to the underlying model and performance metrics across subcategories like demographics and image quality, and they could be used to trigger a recertification process when performance shifts materially.
To address performance issues stemming from human decision making, policymakers should encourage A/B testing to assess performance within the human context. This would enable researchers to evaluate the effect of an FRT system on human decision making. A/B testing can be adapted to directly compare human decisions with AI-augmented decisions, to assess the human operator’s responsiveness to “confidence scores” of models, or to gauge potential over-reliance or under-reliance on model output—sometimes referred to as “automation bias” and “algorithm aversion,” respectively.
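In its simplest form, such an A/B test compares decision quality between a control arm (humans deciding alone) and a treatment arm (humans assisted by the FRT). The sketch below uses invented per-case outcomes; a real study would also need significance testing, larger samples, and oversight.

```python
# Hedged sketch of an A/B comparison between human-only and AI-assisted
# identification decisions. The outcome data below is invented.

def arm_accuracy(decisions):
    """Fraction of correct decisions in one experimental arm."""
    return sum(decisions) / len(decisions)

# 1 = correct identification, 0 = error, one entry per reviewed case.
human_only  = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # control arm
ai_assisted = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]   # treatment arm

lift = arm_accuracy(ai_assisted) - arm_accuracy(human_only)
# A positive lift suggests the tool helps on net; disaggregating results by
# model confidence score would reveal over- or under-reliance patterns.
```

Splitting the same comparison by the model's confidence score is what lets researchers distinguish automation bias (accepting low-confidence output) from algorithm aversion (rejecting high-confidence output).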
Opening up facial recognition systems to facilitate in-domain accuracy testing will empower a much wider range of parties and stakeholders to rigorously assess the technology. By following the protocols spelled out here, watchdog organizations can expand performance benchmarking on a standardized basis and audit systems across a wider range of FRT domains more rapidly. Businesses and government agencies procuring facial recognition systems through large-scale contracts should condition such purchases on rigorous in-domain accuracy tests adhering to the evaluative framework articulated above. Auditors should expand their testing datasets to cover high-priority emerging domains, and academic researchers should pursue more research on domain shift in facial recognition. Finally, media and civil society organizations should amplify the findings of this new testing framework to ensure FRT is better understood by the public.
While a moratorium on facial recognition technologies in criminal justice is a laudable step at this time, FRTs are likely to continue to be deployed in a variety of settings. As a result, standards for whether and how to adopt FRTs must be worked through now. Adopting these protocols and recommendations will not—and should not—silence legitimate scrutiny of facial recognition technology, but a conceptual framework to evaluate and test the negative effects of domain shift and institutional shift can offer a crucial next step for better understanding the operational and human impacts of this emerging technology.
Daniel E. Ho Ph.D. is a Professor at Stanford University and Associate Director at the Stanford Institute for Human-Centered Artificial Intelligence (HAI).
Emily Black is a Ph.D. student in computer science at Carnegie Mellon University.
Maneesh Agrawala Ph.D. is a Professor at Stanford University and Affiliated Faculty at the Stanford Institute for Human-Centered Artificial Intelligence (HAI).
Fei-Fei Li Ph.D. is a Professor at Stanford University and Co-Director at the Stanford Institute for Human-Centered Artificial Intelligence (HAI).
This post is adapted from Stanford HAI’s policy brief, “Domain Shift and Emerging Questions in Facial Recognition Technology.”
Amazon, Microsoft, and IBM provide financial support to the Brookings Institution, a nonprofit organization devoted to rigorous, independent, in-depth public policy research.