Understanding risk assessment instruments in criminal justice

Editor's note: This report from The Brookings Institution’s Artificial Intelligence and Emerging Technology (AIET) Initiative is part of “AI and Bias,” a series that explores ways to mitigate possible biases and create a pathway toward greater fairness in AI and emerging technologies.

Algorithmic tools are in widespread use across the criminal justice system today. Predictive policing algorithms, including PredPol and HunchLab, inform police deployment with estimates of where crime is most likely to occur. Patternizr is a pattern recognition tool at the New York Police Department that helps detectives automatically discover related crimes. Police departments also use facial recognition software to identify possible suspects from video footage. District attorneys in Chicago and New York have leveraged predictive models to focus prosecution efforts on high-risk individuals. In San Francisco, the district attorney uses an algorithm that obscures race information from case materials to reduce bias in charging decisions.

Risk assessment instruments

One class of algorithmic tools, called risk assessment instruments (RAIs), is designed to predict a defendant’s future risk for misconduct. These predictions inform high-stakes judicial decisions, such as whether to incarcerate an individual before their trial. For example, an RAI called the Public Safety Assessment (PSA) considers an individual’s age and history of misconduct, along with other factors, to produce three different risk scores: the risk that they will be convicted for any new crime, the risk that they will be convicted for a new violent crime, and the risk that they will fail to appear in court. A decision-making framework translates these risk scores into release-condition recommendations, with higher risk scores corresponding to stricter release conditions. Judges can disregard these recommendations if they seem too strict or too lax. Other RAIs influence a wide variety of judicial decisions, including sentencing decisions and probation and parole requirements.
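To make the mechanics concrete, the sketch below shows how a decision-making framework of this kind might translate risk scores into release-condition recommendations. The score scale, thresholds, and condition labels are illustrative assumptions, not the actual PSA framework used in any jurisdiction.

```python
# A minimal, hypothetical sketch of a decision-making framework that maps RAI
# risk scores to release-condition recommendations. The 1-6 score scale and
# the thresholds below are assumptions for illustration only.

from dataclasses import dataclass


@dataclass
class RiskScores:
    new_criminal_activity: int   # e.g., 1 (lowest risk) to 6 (highest risk)
    new_violent_activity: int
    failure_to_appear: int


def recommend_release_conditions(scores: RiskScores) -> str:
    """Translate risk scores into a release-condition recommendation.

    Higher risk scores map to stricter conditions; a judge may still
    override the recommendation in either direction.
    """
    highest = max(
        scores.new_criminal_activity,
        scores.new_violent_activity,
        scores.failure_to_appear,
    )
    if highest <= 2:
        return "Release on recognizance"
    if highest <= 4:
        return "Release with supervision (e.g., check-ins, court reminders)"
    return "Refer for detention hearing / strictest conditions"


# Example: moderate risk scores yield a supervised-release recommendation.
print(recommend_release_conditions(RiskScores(2, 1, 4)))
```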

Algorithmic RAIs have the potential to bring consistency, accuracy, and transparency to judicial decisions. For example, Jung et al. simulated the use of a simple checklist-style RAI that only considered the age of the defendant and their number of prior failures to appear. The authors noted that judges in an undisclosed jurisdiction had widely varying release rates (from roughly 50% to almost 90% of individuals released). The authors found that if judges had used their proposed checklist-style model to determine pretrial release, decisions would have been more consistent across cases, and they would have detained 30% fewer defendants overall without a corresponding rise in pretrial misconduct. Other studies have found additional evidence that statistical models consistently outperform unaided human decisions. In contrast to the opacity of traditional human decision-making, the transparent nature of a checklist-style model, like the one proposed by Jung et al., would also allow courts to openly describe how they calculate risk. These benefits—along with a general belief that important decisions should be anchored in data—have compelled many jurisdictions across the country to implement RAIs.
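A checklist-style model of this kind can be written in a few lines of code. The sketch below assigns points based only on age and prior failures to appear; the point values and release cutoff are illustrative assumptions, not the weights estimated by Jung et al.

```python
# A minimal sketch of a checklist-style (point-based) risk model in the spirit
# of Jung et al., using only two factors: age and prior failures to appear.
# The point values and cutoff are assumptions for illustration only.

def checklist_risk_points(age: int, prior_ftas: int) -> int:
    """Return a simple additive risk score (more points = higher risk)."""
    points = 0
    # Younger defendants receive more points.
    if age < 25:
        points += 2
    elif age < 40:
        points += 1
    # Each prior failure to appear adds a point, capped for simplicity.
    points += min(prior_ftas, 3)
    return points


def recommend_release(age: int, prior_ftas: int, cutoff: int = 3) -> bool:
    """Recommend release when the checklist score falls below the cutoff."""
    return checklist_risk_points(age, prior_ftas) < cutoff


print(recommend_release(age=32, prior_ftas=1))  # True: 1 + 1 = 2 points
```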

The COMPAS RAI

In parallel with their expansion across the country, RAIs have also become increasingly controversial. Critics have focused on four main concerns with RAIs: their lack of individualization, their absence of transparency under trade-secret claims, their possibility of bias, and questions about their true impact. A 2016 Wisconsin Supreme Court case, State v. Loomis, grappled with many of these issues. The petitioner, Eric Loomis, made several arguments against the use of an RAI called Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) in his sentencing decision.

First, Loomis contended that his sentence was not individualized. Instead, he claimed it was informed by historical group tendencies for misconduct, as assessed by COMPAS. The court disagreed, reasoning that because the judge’s decision was not determined solely by COMPAS, Loomis’ individualization concerns did not apply. Although the court made this distinction, it is worth noting that both humans and algorithms learn from historical behavior. A risk prediction for a given individual—whether from a judge or an RAI—is, as a result, anchored in the historical behavior of similar individuals.

Second, Loomis argued that the company that created COMPAS declined to release enough details on how the algorithm calculated his risk score, preventing him from scrutinizing the accuracy of all information presented at his sentencing. Many RAIs can explain exactly how they arrive at their decisions, an advantage over traditional human decision-making. However, commercial vendors that sell RAIs often hide these details behind trade-secret claims. While the court did not strictly agree with Loomis—arguing that it was sufficient to observe the inputs and outputs of COMPAS—there are compelling reasons for transparency and interpretability in such high-stakes contexts.

For example, although Loomis did not know the full structure of the model, he knew that it incorporated gender as a factor, and he argued that this constituted discrimination. The court disagreed, emphasizing that including gender in the model helped increase its accuracy. This reflects the fact that, given similar criminal histories, recidivism rates are statistically lower for women than for men. Either way, Loomis’ knowledge of the model’s use of gender allowed him to challenge its inclusion, an example of how transparency in RAIs can help stakeholders better understand this high-stakes decision-making process.

Potential discrimination and RAI problems

Other charges of discrimination have been levied against RAIs (and machine learning algorithms in general), noting that they can perpetuate and exacerbate existing biases in the criminal justice system. Perhaps the most notable claim appeared in a 2016 ProPublica article about the use of COMPAS alongside pretrial detention decisions in Broward County, Florida. The article concluded that COMPAS was biased because it performed worse on one measure (false positive rates) for Black individuals than for white individuals. However, other researchers have noted a substantial statistical flaw in ProPublica’s findings: they can be mathematically explained by differences in underlying offense rates for each race, without requiring a biased model. When researchers apply a traditional measure of model fairness—whether individuals with the same risk score re-offend at the same rate, regardless of race—evidence of racial discrimination disappears.
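The distinction between these two fairness measures can be made concrete with a small example. The sketch below uses synthetic data to compute both a false-positive-rate comparison across groups (ProPublica's measure) and a calibration check of whether individuals with the same risk score re-offend at the same rate. The column names, scores, and high-risk threshold are assumptions for illustration only.

```python
# Two fairness measures on synthetic data: false positive rates by group, and
# calibration (re-offense rates by risk score within each group).

import pandas as pd

df = pd.DataFrame({
    "group":      ["A", "A", "A", "A", "B", "B", "B", "B"],
    "risk_score": [2, 5, 8, 8, 2, 5, 8, 8],   # assumed decile-style scores
    "reoffended": [0, 1, 1, 1, 0, 0, 1, 1],
})

# Calibration: among defendants with the same risk score, do re-offense rates
# match across groups?
calibration = (
    df.groupby(["risk_score", "group"])["reoffended"].mean().unstack("group")
)
print(calibration)

# False positive rate: among defendants who did NOT re-offend, how often were
# they labeled high risk (score >= 6, an assumed threshold)?
df["high_risk"] = df["risk_score"] >= 6
fpr = df[df["reoffended"] == 0].groupby("group")["high_risk"].mean()
print(fpr)
```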

Even so, a lack of evidence does not guarantee that discrimination is absent, and these claims should be taken seriously. One of the most concerning possible sources of bias is the set of historical outcomes that an RAI learns to predict. If these outcomes are the product of unfair practices, any derivative model may learn to replicate them rather than predict the true underlying risk for misconduct. For example, though race groups have been estimated to consume marijuana at roughly equal rates, Black Americans have historically been convicted for marijuana possession at higher rates. A model that learns to predict convictions for marijuana possession from these historical records would unfairly rate Black Americans as higher risk, even though true underlying rates of use are the same across race groups. Careful selection of outcomes that reflect true underlying crime rates may avoid this issue. For example, a model that predicts convictions for violent crime is less likely to be biased, because convictions for violent crime appear to mirror true underlying rates of victimization.

“[A] lack of evidence does not guarantee that discrimination is absent, and these claims should be taken seriously.”
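This label-bias dynamic can be illustrated with a small simulation. In the sketch below, two groups use marijuana at the same underlying rate, but one group is convicted more often when they do; a model trained on conviction records then assigns that group a higher predicted risk. All rates, group labels, and modeling choices are assumptions for illustration only.

```python
# Synthetic illustration of label bias: identical true behavior, unequal
# conviction rates, and a model that reproduces the disparity.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 10_000
group = rng.integers(0, 2, size=n)      # two equal-sized groups, 0 and 1
used = rng.binomial(1, 0.15, size=n)    # same true use rate in both groups

# Conviction can only follow use, but group 1 is convicted far more often.
p_conviction = np.where(group == 1, 0.50, 0.10)
convicted = used * rng.binomial(1, p_conviction)

# Train a model to predict convictions using group membership as a feature.
X = group.reshape(-1, 1)
model = LogisticRegression().fit(X, convicted)

# Predicted "risk" differs sharply by group despite identical true use rates
# (roughly 0.15 * 0.10 vs. 0.15 * 0.50 in expectation).
print(model.predict_proba([[0], [1]])[:, 1])
```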

Many would argue that a pure focus on algorithmic behavior is too limited, and that the more important question is how RAIs influence judicial decisions in practice, including any differences in impact by race. To illustrate this point, it is useful to think of two possible extremes. We may not be as concerned about an inaccurate RAI if it is categorically ignored by judges and does not affect their behavior. On the other hand, a perfectly fair RAI may be cause for concern if it is selectively used by judges to justify punitive treatment for communities of color.

Though many studies have simulated the impact of RAIs, research on their real-world use is limited. A study of RAIs in Virginia between 2012 and 2014 suggests that pretrial misconduct and incarceration can both be reduced at the same time. Another study examined the 2014 implementation of the PSA in Mecklenburg County, North Carolina, and found that it coincided with higher release rates, while rates of pretrial misconduct went unchanged. A third study scrutinized the implementation of RAIs across Kentucky between 2009 and 2016, finding limited evidence that the tool reduced incarceration rates. The study did find, however, that judges’ use of the RAI did not unevenly affect outcomes across race groups.

Recommendations

Anyone considering the use of algorithms in criminal justice, or in any other impactful context, should heed these concerns. This includes executive, planning, management, analysis, and software development staff who design policies that leverage algorithms, particularly those steering criminal justice decisions.

First, policymakers should preserve human oversight and careful discretion when implementing machine learning algorithms. In the context of RAIs, it is always possible that unusual factors could affect an individual’s likelihood of misconduct. As a result, a judge must retain the ability to overrule an RAI’s recommendations, even though this discretion may reduce accuracy and consistency. One way to balance these competing priorities is to require a detailed explanation any time a judge deviates from an RAI recommendation. This would encourage judges to consciously justify their decisions and would discourage arbitrary deviations. In general, humans should always make the final decision, but departing from an algorithmic recommendation should require deliberate effort and a documented rationale.

“[P]olicymakers should preserve human oversight and careful discretion when implementing machine learning algorithms.”

Second, any algorithm used in a high-stakes policy context, such as criminal sentencing, should be transparent. This ensures that any interested party can understand exactly how a risk determination is made, a distinct advantage over human decision-making processes. In this way, transparency can help establish trust, and it acknowledges the role these tools play in consequential decisions.

Third, algorithms, and the data used to generate their predictions, should be carefully examined for the potential that any group could be unfairly harmed by their outputs. Judges, prosecutors, and data scientists should critically examine each element of data provided to an algorithm—particularly the predicted outcomes—to understand whether these data are biased against any community. In addition, model predictions should be tested to ensure that individuals with similar risk scores reoffend at similar rates across groups. Finally, the use of interpretable models can help demonstrate that the scores generated by each model appear to be fair and largely conform to domain expertise about what constitutes risk.

Fourth, data scientists should work to build next-generation risk algorithms that predict reductions in risk caused by supportive interventions. For example, current RAIs only infer the risk of misconduct if an individual is released without support. They do not consider the influence of supportive interventions—such as court-date text-message reminders—even though such interventions may have a tempering effect on an individual’s risk for misconduct. Imagine an individual who is predicted by a traditional RAI to have a low likelihood of court appearance if they are released without support. With only this rating, a judge would likely choose to incarcerate the individual to ensure they appear in court. However, with a next-generation RAI, a judge might also see that text-message reminders substantially increase the likelihood of the individual’s appearance. With this additional information, the judge may instead choose to release the individual and enroll them in reminders. Next-generation risk algorithms that estimate the impact of supportive interventions could encourage judges and other decision-makers to avoid the considerable social and financial costs of punitive action in favor of more humane alternatives.
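One way such a system might be built is with a simple two-model ("T-learner") approach that predicts the outcome separately for individuals who did and did not receive an intervention, then reports both predictions for a new case. The features, synthetic data, and modeling choices below are illustrative assumptions, not a description of any deployed RAI.

```python
# A minimal sketch of estimating the effect of a hypothetical supportive
# intervention (court-date text reminders) on appearance rates.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Synthetic features (e.g., age, prior failures to appear) and a flag for
# whether the person received text-message reminders.
X = rng.normal(size=(n, 2))
reminded = rng.integers(0, 2, size=n)
# Synthetic outcome: probability of appearing in court, higher with reminders.
p_appear = 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.8 * X[:, 1] + 1.2 * reminded)))
appeared = rng.binomial(1, p_appear)

# Fit one model per scenario: released without reminders vs. with reminders.
model_no_reminder = LogisticRegression().fit(X[reminded == 0], appeared[reminded == 0])
model_reminder = LogisticRegression().fit(X[reminded == 1], appeared[reminded == 1])

# For a new individual, report both predicted appearance probabilities so a
# decision-maker can see how much the intervention is expected to help.
x_new = np.array([[0.2, 1.5]])
p_without = model_no_reminder.predict_proba(x_new)[0, 1]
p_with = model_reminder.predict_proba(x_new)[0, 1]
print(f"Predicted appearance without reminders: {p_without:.2f}")
print(f"Predicted appearance with reminders:    {p_with:.2f}")
```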

Finally—and perhaps most important—algorithms should be evaluated as they are implemented. It is possible that participants in any complicated system will react in unexpected ways to a new policy (e.g., by selectively using RAI predictions to penalize communities of color). Given this risk, policymakers should carefully monitor behavior and outcomes as each new algorithm is introduced and should continue routine monitoring once a program is established to understand longer-term effects. These studies will ultimately be key in assessing whether algorithmic innovations generate the impacts they aspire to achieve.

RAIs are only one of the algorithmic tools under consideration today. Separate challenges surround the use of other algorithms. Most notably, criminal justice agencies must explain how they plan to protect individual privacy and liberty in their use of facial recognition, public DNA databases, and other new forms of surveillance. But if used appropriately and carefully, algorithms can substantially improve impactful decisions, making them more consistent and transparent to any interested stakeholder. As with any new policy or practice, these efforts must include continued evaluation and improvement to ensure that their adoption generates effective and fair outcomes over time.


The Brookings Institution is a nonprofit organization devoted to independent research and policy solutions. Its mission is to conduct high-quality, independent research and, based on that research, to provide innovative, practical recommendations for policymakers and the public. The conclusions and recommendations of any Brookings publication are solely those of its author(s), and do not reflect the views of the Institution, its management, or its other scholars.

Microsoft provides support to The Brookings Institution’s Artificial Intelligence and Emerging Technology (AIET) Initiative. The findings, interpretations, and conclusions in this report are not influenced by any donation. Brookings recognizes that the value it provides is in its absolute commitment to quality, independence, and impact. Activities supported by its donors reflect this commitment.

Footnotes
    1. Chammah, M., & Hansen, M. (2016, February 3). Policing the future. The Verge. Retrieved from https://www.theverge.com; Mohler, G. O., Short, M. B., Malinowski, S., Johnson, M., Tita, G. E., Bertozzi, A. L., & Brantingham, P. J. (2015). Randomized controlled field trials of predictive policing. Journal of the American Statistical Association, 110(512), 1399–1411.
    2. Chohlas-Wood, A., & Levine, E. (2019). A recommendation engine to aid in identifying crime patterns. INFORMS Journal on Applied Analytics, 49(2), 154–166.
    3. O’Neill, J. (2019, June 9). How facial recognition makes you safer. The New York Times. Retrieved from https://www.nytimes.com
    4. Ferguson, A. G. (2016). Predictive prosecution. Wake Forest L. Rev. 51, 705.
    5. Sernoffsky, E. (2019, June 12). SF DA Gascón launching tool to remove race when deciding to charge suspects. The San Francisco Chronicle. Retrieved from https://www.sfchronicle.com
    6. The author helped design both Patternizr and the bias mitigation project at the SFDA.
    7. DeMichele, M., Baumgartner, P., Wenger, M., Barrick, K., Comfort, M., & Misra, S. (2018). The Public Safety Assessment: A re-validation and assessment of predictive utility and differential prediction by race and gender in Kentucky. SSRN: 3168452
    8. Jung, J., Concannon, C., Shroff, R., Goel, S. and Goldstein, D.G. (2020). Simple rules to guide expert classifications. Journal of the Royal Statistical Society: Series A (Statistics in Society). doi:10.1111/rssa.12576
    9. Ægisdóttir, S., White, M. J., Spengler, P. M., Maugherman, A. S., Anderson, L. A., Cook, R. S., … Cohen, G. et al. (2006). The meta-analysis of clinical judgment project: Fifty-six years of accumulated research on clinical versus statistical prediction. The Counseling Psychologist, 34(3), 341–382; Meehl, P. E. (1954). Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. University of Minnesota Press.
    10. Jung, J., Concannon, C., Shroff, R., Goel, S. and Goldstein, D.G. (2020). Simple rules to guide expert classifications. Journal of the Royal Statistical Society: Series A (Statistics in Society). doi:10.1111/rssa.12576
    11. Goel, S., Shroff, R., Skeem, J., & Slobogin, C. (Forthcoming). The accuracy, equity, and jurisprudence of criminal risk assessment. In Research handbook on big data law. Edward Elgar Publishing Ltd.
    12. State v. Loomis, 881 N.W.2d 749 (Wis. 2016).
    13. Wexler, R. (2018). Life, liberty, and trade secrets: Intellectual property in the criminal justice system. Stan. L. Rev. 70, 1343.
    14. Skeem, J., Monahan, J., & Lowenkamp, C. (2016). Gender, risk assessment, and sanctioning: The cost of treating women like men. Law and Human Behavior, 40(5), 580.
    15. Ferguson, A. G. (2017). The rise of big data policing: Surveillance, race, and the future of law enforcement. NYU Press; O’Neil, C. (2016). Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books.
    16. Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016, May 23). Machine bias. ProPublica. Retrieved from https://www.propublica.org
    17. Corbett-Davies, S., & Goel, S. (2018). The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv: 1808.00023 [cs.CY]
    18. Corbett-Davies, S., Pierson, E., Feller, A., & Goel, S. (2016, October 17). A computer program used for bail and sentencing decisions was labeled biased against blacks. It’s actually not that clear. The Washington Post. Retrieved from https://www.washingtonpost.com
    19. Alexander, M. (2012). The new Jim Crow: Mass incarceration in the age of colorblindness. The New Press.
    20. Skeem, J. L., & Lowenkamp, C. T. (2016). Risk, race, and recidivism: Predictive bias and disparate impact. Criminology, 54(4), 680–712.
    21. Danner, M. J., VanNostrand, M., & Spruance, L. M. (2016). Race and gender neutral pretrial risk assessment, release recommendations, and supervision: VPRAI and PRAXIS revised. Luminosity.
    22. Redcross, C., Henderson, B., Miratrix, L., & Valentine, E. (2019). Evaluation of pretrial justice system reforms that use the Public Safety Assessment. MDRC Center for Criminal Justice Research.
    23. Stevenson, M. (2018). Assessing risk assessment in action. Minn. L. Rev. 103, 303.