Commentary

Mandating fairness and accuracy assessments for law enforcement facial recognition systems

A man walks past a poster simulating facial recognition software at the Security China 2018 exhibition on public safety and security in Beijing, China, October 24, 2018. (REUTERS/Thomas Peter)

Fraser Sampson, the newly appointed Commissioner overseeing biometrics and surveillance cameras in the United Kingdom, made headlines recently when he opposed facial recognition technology (FRT) bans in favor of allowing police to “reasonably use” the new technology to do their job. This sounds sensible since the technology promises to increase the effectiveness of policing, but what exactly is reasonable use?

A requirement for prior public assessment of the technology itself is one element of a reasonable use policy. It may not be reasonable for a police agency to use FRT unless it knows how fallible the technology is and how often it makes mistakes, especially when applied to different subgroups defined by gender, race, age, and ethnicity. As part of a reasonable use policy, developers should be required to submit their FRT systems to the National Institute of Standards and Technology (NIST) for an assessment of their accuracy and fairness, and NIST must make the results of this assessment publicly available, including to potential purchasers, before these systems can be offered on the market or put into service for law enforcement purposes.

Europe is moving toward a system of prior assessment for facial recognition. The European Commission’s proposed regulation for artificial intelligence applications would require prior third-party conformity assessments for biometric systems, including FRT systems, that are used for real-time law enforcement mass surveillance. Biometric systems include other ways of identifying people besides facial recognition, such as fingerprints and retinal scans. Real-time mass surveillance means the indiscriminate scanning of public places for law enforcement purposes.

The U.S. should adopt a similar system for prior assessments of facial recognition systems in law enforcement, mandated or encouraged by the federal government, and applicable to all law enforcement agencies at the national, state or local level.

Previous studies and cases on facial recognition use

A pioneering study by MIT’s Joy Buolamwini first drew significant public attention to bias in facial recognition by demonstrating dramatically different error rates for darker-skinned women compared to lighter-skinned men in several widely used facial recognition systems. The ACLU raised policymakers’ awareness of the problem with its finding that Amazon’s facial recognition technology mistakenly matched 28 members of Congress with people in an arrest record database, disproportionately misidentifying African American and Latino members of Congress.

Despite the growing concerns, police departments around the country continued to use commercial facial recognition programs, often without anyone else in local government knowing that they were doing so. News reports revealed that a startup company, Clearview AI, was making its commercial facial recognition application available to law enforcement agencies throughout the country with little public disclosure or oversight. Consequential errors are still appearing, including a faulty facial recognition match that led to a Michigan man’s arrest for a crime he did not commit.

Vendors in the U.S. have tried to respond to the growing public concern by backing regulatory bills such as the one that passed in Washington state. That bill allowed government agencies to use facial recognition with restrictions designed to ensure it was not deployed for broad surveillance or tracking innocent people.

But it may be too late for a more balanced regulatory approach. With the spread of the national Black Lives Matter movement and mounting public concern over police abuse in the African American community, calls for a complete or temporary ban increased, including a call for suspension by the Association for Computing Machinery, the professional organization of computer scientists and engineers. More and more local governments, including San Francisco, Oakland, and Boston, have banned government use of the technology. The ban in Portland, Oregon extended beyond government agencies to include commercial use in places of public accommodation. Some police departments abandoned the technology voluntarily, including the sheriff’s office in Washington County, Oregon and the Los Angeles Police Department.

Last year, some of the biggest vendors including Microsoft, IBM, and Amazon reacted to this public backlash against the technology by delaying or abandoning their facial recognition programs. The bans and voluntary withdrawals of the technology seemed to follow legal scholar Frank Pasquale’s “second wave of algorithmic accountability” where developers and users ask whether the advantages of new software really outweigh its social costs.

In the wake of the January 6 Capitol Hill riot, the pendulum may be swinging back toward greater acceptance of facial recognition systems. News reports surfaced that law enforcement agencies were using facial recognition searches to match images to suspects’ driver’s licenses or social media profiles. Clearview AI said law enforcement was using its facial recognition system, and the Miami Police Department confirmed it. Other unnamed facial recognition software was also used. As more of the perpetrators are brought to justice, the question of the technology’s efficacy may return to public debate.

It still seems to be a good idea for the existing bans and moratoria to continue until good rules can be put into place. But what would those rules look like?

Comprehensive proposals have been offered to protect the public from improper use of facial recognition technology including measures in a Georgetown study and best practice recommendations by the United Kingdom’s former Surveillance Camera Commissioner. These guidelines include such protections for civil liberties as allowing officers to search a facial recognition database only after certifying a reasonable suspicion that the suspect in question committed a felony offense. One element of these proposals calls for assessments for fairness and accuracy.

Fairness and Accuracy in Facial Recognition Systems

NIST has already established criteria for evaluating the accuracy and fairness of facial recognition systems as part of its ongoing Facial Recognition Vendor Tests. Over the last several years, the agency has conducted and published independent assessments of systems that have been voluntarily submitted to it for evaluation, and it maintains an ongoing program of evaluating vendor systems.

In a typical law enforcement use, an agency would run a photograph of a person of interest through a facial recognition system, which can search enormous databases in a matter of seconds. The system typically returns a score indicating how similar the image presented is to one or more of the images in the database.

The law enforcement agency would want the system to indicate a match if there really is one in the database; that is, it would want a high hit rate or, conversely, a low miss rate. But the agency also wants the system to be selective and report that there is no match when in fact there is none.

There’s a mathematical trade-off between these two goals. Typically a facial recognition system will be tuned to return a match only if its score is above a certain threshold. This choice of a threshold represents a balance between the costs of a false negative—missing a lead—and the costs of a false positive—wasting time pursuing innocent people.

NIST’s tests measure both a facial recognition system’s false positive rate and its false negative rate at a certain threshold. Another way NIST measures accuracy is to ask whether the highest match score is a false match, regardless of the threshold, and to calculate a “rank one miss rate” as the rate at which the pair with the highest returned similarity score is not a genuine match.
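
To make these measures concrete, here is a minimal sketch in Python of how a miss rate at a threshold, a false match rate, and a rank one miss rate could be computed. It is only an illustration of the definitions above, not NIST’s test code, and every search and similarity score in it is invented.

```python
# Minimal sketch (not NIST's methodology) of the accuracy measures described
# above, computed over a handful of invented searches and similarity scores.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Search:
    mate_in_gallery: bool        # is the person actually in the database?
    top_id_is_mate: bool         # was the highest-scoring candidate the right person?
    top_score: float             # similarity score of the highest-ranked candidate
    mate_score: Optional[float]  # score the true mate received, if present

def miss_rate_at_threshold(searches, threshold):
    """False negative rate: mated searches whose true mate scores below the threshold."""
    mated = [s for s in searches if s.mate_in_gallery]
    misses = [s for s in mated if s.mate_score is None or s.mate_score < threshold]
    return len(misses) / len(mated)

def false_match_rate(searches, threshold):
    """False positive rate: non-mated searches that still return a candidate above the threshold."""
    non_mated = [s for s in searches if not s.mate_in_gallery]
    false_matches = [s for s in non_mated if s.top_score >= threshold]
    return len(false_matches) / len(non_mated)

def rank_one_miss_rate(searches):
    """Threshold-free measure: how often the top-ranked candidate is not the true mate."""
    mated = [s for s in searches if s.mate_in_gallery]
    misses = [s for s in mated if not s.top_id_is_mate]
    return len(misses) / len(mated)

# Toy data: two mated searches and one non-mated search.
demo = [
    Search(True,  True,  0.92, 0.92),   # correct match with a high score
    Search(True,  True,  0.55, 0.55),   # correct match, but below a 0.7 operational threshold
    Search(False, False, 0.75, None),   # person not in the gallery, yet a high-scoring lookalike
]
print(miss_rate_at_threshold(demo, 0.7))  # 0.5: one mated search falls below the threshold
print(false_match_rate(demo, 0.7))        # 1.0: the lookalike would be returned as a lead
print(rank_one_miss_rate(demo))           # 0.0: the top candidate was the mate whenever one existed
```

The second search in the toy data is a rank one hit that would nonetheless be missed at a threshold of 0.7, which is exactly the situation described in the next paragraph.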

According to NIST’s assessments, how accurate are today’s facial recognition algorithms? The agency reports that the best algorithm tested has a “rank one miss rate” of 0.1%, but only with high quality images such as those obtained from a cooperating subject in good lighting. With the lower quality images typically captured in real world settings, error rates climb as high as 20%. Moreover, algorithms vary enormously in their accuracy, with poor performers making mistakes more than half the time. In addition, in many cases the correct image in the database receives the highest similarity score, but that score is very low, below the required operational threshold. This means that in practice the real miss rate will be higher than the rank one miss rate indicates.

NIST also assesses facial recognition systems for fairness. In December 2019, NIST published a report on demographic differentials in facial recognition. It assessed the extent to which the accuracy of facial recognition systems varied across subgroups of people defined by gender, age, race, or country of origin. It defined fairness as homogeneous accuracy across groups and unfairness as the extent to which accuracy is not the same across all subgroups.

How did the tested algorithms do on fairness?

In general, the report found that African American women have higher false positive rates, that black men invariably have lower false negative identification rates than white men, and that women invariably have higher false negative rates than men. These differentials were present even when high quality images were used. The report did not use image data from the internet or from video surveillance, so it did not capture any additional demographic differentials that might occur in such photographs. The report also found that the more accurate algorithms tended to be the most equitable. A key finding from the agency’s research was that different algorithms performed differently in their equitable treatment of different subgroups.

Recommendation for Assessments

The recommendation presented here for mandated prior assessments of the accuracy and fairness of law enforcement uses of facial recognition technology builds on numerous prior proposals, including:

  • As the NIST fairness report recommended, owners and users of facial recognition systems should “know their algorithm” and use “publicly available data from NIST and elsewhere” to inform themselves.
  • Facial recognition systems used in policing should “participate in NIST accuracy tests, and…tests for racially biased error rates,” as proposed in the Georgetown study.
  • Police procurement officials should “take all reasonable steps to satisfy themselves either directly or by independent verification” whether facial recognition software presents a risk of bias before putting the system in use, as recommended by the former U.K. Surveillance Camera Commissioner.
  • “Third-party assessments” should be used to ensure facial recognition systems for law enforcement meet “very high” mandated accuracy standards as suggested in a previous Brookings study from my colleague Darrell West.
  • As recommended by the National Security Commission on Artificial Intelligence, Congress should require prior risk assessments “for privacy and civil liberties impacts” of AI systems, including facial recognition, used by the Intelligence Community, the Department of Homeland Security, and the Federal Bureau of Investigation.

A key purpose of these proposals requiring assessments prior to putting facial recognition systems into use is to allow law enforcement procurement agencies to compare competing algorithms. For this purpose, standardization of testing criteria and procedures is essential—otherwise potential purchasers would have no way of comparing accuracy scores from different vendors. In these circumstances, the best procedure would be the administration of standardized tests by an independent reviewing agency. NIST has already demonstrated the capacity for conducting these studies and has developed a widely accepted methodology for assessing both accuracy and demographic fairness. It is the natural choice as the agency to perform mandated standardized assessments.

Ideally, a federal facial recognition law would impose a uniform national policy requiring prior NIST testing of facial recognition systems used in law enforcement anywhere in the country. Failing that, the federal government has levers it can use, some of which are under the control of the administration without further authorization from Congress. For instance, federal financial assistance for facial recognition in law enforcement and police access to the FBI database could be conditioned on proof that any facial recognition tools in use participated in the NIST accuracy and fairness trials.

Further Recommendations

West’s Brookings report expressed concerns that NIST tests might not “translate into everyday scenarios.” NIST acknowledges that its assessments did not include use of facial recognition software on images from the internet and from simple video surveillance cameras.

To remedy this issue, the Georgetown study suggests “accuracy verification testing on searches that mimic the agency’s actual use of face recognition—such as on probe images that are of lower quality or feature a partially obscured face.” These improvements in NIST testing procedures might make its assessments more reflective of real-world conditions.

Another way forward is for developers to take steps to reduce demographic differentials by using more diverse data sets for training. While NIST did not investigate the cause of the demographic differentials it found, it noted that the differences in false positives between Asian and Caucasian faces seen in algorithms developed outside Asia were not present in algorithms developed in Asia, suggesting that more diverse training data might reduce demographic differentials.

Steps to mitigate demographic differentials in use are also possible. NIST investigated the idea that law enforcement agencies could use different similarity-score thresholds for different subgroups, which would have the effect of reducing demographic differentials. This is a promising mitigation step. One study found that to achieve equal false positive rates, “East Asian faces required higher identification thresholds than Caucasian faces…”
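
To see how that would work in principle, the sketch below picks, for each subgroup, the threshold at which the share of impostor (non-mate) scores exceeding it matches a common target false match rate. The impostor score distributions are invented for illustration; this is not NIST’s procedure or any vendor’s implementation.

```python
# Hedged sketch: choose a per-group threshold so each demographic group ends up
# with (approximately) the same false match rate. All scores are simulated.

import random

random.seed(0)

# Hypothetical impostor (non-mate) similarity scores for two groups. Group B's
# impostor scores run higher, mimicking the finding that some groups need a
# higher threshold to reach the same false positive rate.
impostor_scores = {
    "group_A": [random.gauss(0.30, 0.08) for _ in range(10_000)],
    "group_B": [random.gauss(0.38, 0.08) for _ in range(10_000)],
}

TARGET_FMR = 0.001  # aim for one false match per 1,000 non-mated searches

def threshold_for_fmr(scores, target_fmr):
    """Smallest threshold at which the share of impostor scores above it is at most target_fmr."""
    ordered = sorted(scores)
    cutoff_index = int((1 - target_fmr) * len(ordered))
    return ordered[min(cutoff_index, len(ordered) - 1)]

for group, scores in impostor_scores.items():
    t = threshold_for_fmr(scores, TARGET_FMR)
    achieved = sum(s >= t for s in scores) / len(scores)
    print(f"{group}: threshold ~{t:.3f}, false match rate ~{achieved:.4f}")
```

Whether adopting group-specific thresholds is legally and operationally acceptable is a separate policy question; the sketch only shows the arithmetic.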

Law enforcement agencies have an obligation to avoid bias against protected groups defined by gender, age, race and ethnicity, and this duty includes their use of facial recognition software. The Georgetown study recommends that the Civil Rights Division of the U.S. Department of Justice should investigate state and local agencies’ use of face recognition for potential disparate impacts that violate this duty to avoid bias in policing, and this seems a promising idea.

But how much deviation from statistical parity in facial recognition accuracy should be a cause of concern?

The rule of thumb used in U.S. employment law is the 80% test, and this might provide some guidance. Under that test, a selection rate for a protected group that is less than 80% of the rate for the most favored group is treated as evidence of adverse impact. Applied to facial recognition software, this rule of thumb would require that differentials in facial recognition accuracy for subgroups defined by gender, age, race, or ethnicity be no more than 20%.
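
A minimal sketch of how such a check might be applied follows, treating the 20% margin as a ratio to the best-performing subgroup in the spirit of the employment-law four-fifths rule. The accuracy figures are hypothetical and chosen only to show the flagging logic.

```python
# Hypothetical per-subgroup accuracy figures (not from any NIST report),
# checked against the four-fifths (80%) rule of thumb discussed above.

subgroup_accuracy = {
    "white_men":   0.995,
    "white_women": 0.990,
    "black_men":   0.992,
    "black_women": 0.780,   # hypothetical outlier, included to trigger the flag
}

best = max(subgroup_accuracy.values())

for group, accuracy in subgroup_accuracy.items():
    ratio = accuracy / best
    status = "OK" if ratio >= 0.80 else "FLAG: below 80% of the best-performing subgroup"
    print(f"{group}: accuracy {accuracy:.3f}, ratio to best {ratio:.2f} -> {status}")
```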

Agencies using facial recognition software with unacceptable differentials in error rates should be required to take mitigation steps, perhaps including the use of different thresholds, before initiating or continuing to use it. If such steps do not produce satisfactory equity results for the software, however, then, as the U.K.’s former Surveillance Camera Commissioner recommends, the system should not be used.

Conclusion

Policymakers are used to letting the marketplace determine the timeline for the deployment of new technology. They are reluctant to prescribe precise measures of effectiveness or fairness for emerging technology, fearing that rules will hinder innovation. But this is changing as more and more policymakers understand the dangers of unleashing a technology first and then trying later to fix the enormous problems it has created.

Increasingly, government agencies are seeking to know whether the expensive technology they are considering is worth it. This is especially true in the context of a review of expenditures on policing compared to other uses of scarce resources. Assessment of the fairness and accuracy of facial recognition systems is a key ingredient in this overdue review of resource allocation.

Some analysts suggest that simply undergoing NIST assessment is much too low a bar to justify deploying facial recognition technology, and there is something to that perspective. But it is important to start somewhere and put in place a system of assessment that can be built on over time and included in a larger system of protections. The key things at this point are review by NIST before systems are offered on the market or put into use, and public availability of the results of these assessments. More needs to be done, but this is a necessary beginning.


Amazon, IBM, and Microsoft are general, unrestricted donors to the Brookings Institution. The findings, interpretations, and conclusions posted in this piece are solely those of the author and not influenced by any donation.