Challenges for mitigating bias in algorithmic hiring

A "Now Hiring" sign stands outside a business.
Editor's note:

This report from The Brookings Institution’s Artificial Intelligence and Emerging Technology (AIET) Initiative is part of “AI and Bias,” a series that explores ways to mitigate possible biases and create a pathway toward greater fairness in AI and emerging technologies.

Hiring is costly and time-consuming—and highly consequential for employers and employees alike. To improve this process, employers have begun to turn to algorithmic techniques, hoping to more efficiently hire quality candidates.

Employers have been particularly eager to figure out a way to automate the screening stage in the hiring pipeline. Broadly speaking, there are four stages in the hiring process: sourcing (attracting or curating a pool of candidates), screening, interviewing, and selection. The screening stage involves evaluating applicants—culling some and highlighting others for special attention. While vendors have emerged that offer algorithmic tools for each stage in the hiring process, algorithmic screening is the most active area of development and often the most consequential, as it represents the major filter through which applicants increasingly must pass.

This brief considers the policy issues raised by algorithmic screening. We provide an overview of techniques used in algorithmic screening, summarize the relevant legal landscape, and raise a number of pressing policy questions.

What is algorithmic screening?

Hiring in the United States has a long and troubled history of discrimination. Recent studies have shown that little has changed in the last several decades, despite increased investment in diversity and inclusion initiatives. The persistence of bias in human decision-making and the apparent failure of these established approaches to combatting discrimination explain a good deal of the recent interest in algorithmic hiring. Advocates for algorithmic screening see it as a promising way forward.

The canonical example of algorithmic screening is automated resume analysis: a candidate submits a resume, and an algorithm evaluates this resume to produce a score indicating the applicant’s quality or fit for the job. In such cases, the ultimate hiring decision typically rests with a human, even though an automated process has culled and ranked the pool of candidates. To perform this evaluation, an algorithm may, for example, assign the candidate a higher score based on the presence of specific keywords (e.g., “product manager” or “increased revenue”) in their resume. Importantly, the rules dictating which keywords merit what score may not be written by a human; instead, those rules can be developed automatically through a process called machine learning. In order to determine which keywords are used by successful employees, the machine learning system needs past data to “learn” from. For example, the machine learning system might be given the resumes of current employees and data on their on-the-job performance (e.g., their sales numbers). From this data, the system can then identify keywords that successful employees have tended to use in their resumes. Based on this, the machine learning system can produce a set of rules (commonly known as a “model” or “algorithm”; we will use the two interchangeably) to predict, given a future applicant’s resume, how good an employee they might be.
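
To make this concrete, the sketch below shows roughly how such a keyword-based screening model could be built with off-the-shelf tools. The resumes, performance labels, and modeling choices are invented for illustration; they are not drawn from any actual vendor’s system.

```python
# Illustrative sketch only: a toy keyword-based resume screener.
# All data and modeling choices below are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Resumes of past employees and whether they were rated successful (1) or not (0).
past_resumes = [
    "product manager increased revenue led cross-functional team",
    "sales associate handled customer service requests",
    "product manager launched new feature and increased revenue",
    "intern responsible for data entry and filing",
]
performance_labels = [1, 0, 1, 0]

# "Learn" which keywords tend to appear in the resumes of successful employees.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(past_resumes)
model = LogisticRegression().fit(X, performance_labels)

# Score a new applicant: a higher score indicates a predicted better fit.
new_resume = ["product manager who increased revenue at a startup"]
score = model.predict_proba(vectorizer.transform(new_resume))[0, 1]
print(f"Predicted applicant score: {score:.2f}")
```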

While resume screening has achieved some degree of public attention, leading vendors of algorithmic screening tools offer very different types of assessments. For example, the company Pymetrics sells game-based assessments, in which applicants play custom-built games, and proprietary algorithms analyze gameplay to score candidates on a number of traits like “learning ability” and “decisiveness.” In such assessments, the inputs to the algorithm may be slightly less clear than in resume screening—for example, algorithms may use candidates’ reaction times or memory ability to make predictions about other traits.

Do algorithmic screening systems reduce bias?

On their surface, algorithmic screening tools seem to be entirely evidence-based, making them an appealing alternative to biased human evaluations. However, there is mounting evidence that such tools can reproduce and even exacerbate human biases manifested in the datasets on which these tools are built. Data encode deeply subjective decisions and judgments; they are rarely neutral records. For example, employers choose who is included in the dataset—often by virtue of who they chose to hire in the past—and what constitutes a “good” employee. If an employer has never hired a candidate from a historically Black college or university, for example, would an algorithm know how to evaluate such candidates effectively? Would it learn to prefer candidates from other schools? Algorithms, by their nature, do not question the human decisions underlying a dataset. Instead, they faithfully attempt to reproduce past decisions, which can lead them to reflect the very sorts of human biases they are intended to replace.

“On their surface, algorithmic screening tools seem to be entirely evidence-based. … However, there is mounting evidence that such tools can reproduce and even exacerbate human biases.”

Vendors often point to the objectivity of algorithms as an advantage over traditional human hiring processes, frequently claiming that their assessments are unbiased or can be used to mitigate human biases. In practice, though, little is known about the construction, validation, and use of these novel algorithmic screening tools, in part because these algorithms (and the datasets used to build them) are typically proprietary and contain private, sensitive employee data. In a recent study, we (along with Jon Kleinberg and Karen Levy) completed a survey of the public statements made by vendors of algorithmic screening tools, finding that the industry rarely discloses details about its methods or the mechanisms by which it aims to achieve an unbiased assessment. In our study, we sampled 18 vendors of algorithmic assessments, documented their practices, and analyzed them in the context of U.S. employment discrimination law.

Algorithmic hiring assessments and civil rights law

Title VII of the Civil Rights Act of 1964 prohibits discrimination based on “race, color, religion, sex, or national origin.” This prohibition is understood to cover both intentional discrimination (so-called disparate treatment) and inadvertent but unjustified or avoidable discrimination (so-called disparate impact). The Equal Employment Opportunity Commission’s Uniform Guidelines on Employee Selection Procedures (hereafter simply the Uniform Guidelines) state that a selection procedure exhibits disparate treatment if it explicitly considers any of the above protected attributes when making a decision. Disparate impact, on the other hand, is more nuanced: If a selection procedure selects candidates from one protected group at a rate significantly lower than that of another (less than 80% of the higher group’s rate, as a rule of thumb), then the selection procedure exhibits a disparate impact. An employer can defend against a claim of disparate impact by showing that the selection procedure serves a justified or necessary business purpose, but would still be found liable if the plaintiff could nevertheless identify an alternative selection procedure that would have served the same purpose while generating less disparate impact.
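
The 80% rule of thumb (often called the “four-fifths rule”) is simple to compute. The sketch below, using hypothetical applicant and selection counts, compares the selection rates of two groups and flags a potential disparate impact when the ratio falls below 0.8.

```python
# Illustrative check of the four-fifths rule of thumb; all counts are hypothetical.
def impact_ratio(selected_a: int, applicants_a: int,
                 selected_b: int, applicants_b: int) -> float:
    """Ratio of the lower group's selection rate to the higher group's."""
    rate_a = selected_a / applicants_a
    rate_b = selected_b / applicants_b
    return min(rate_a, rate_b) / max(rate_a, rate_b)

# Example: group A has 50 of 100 applicants selected, group B has 30 of 100.
ratio = impact_ratio(50, 100, 30, 100)
print(f"Selection-rate ratio: {ratio:.2f}")  # 0.60
if ratio < 0.8:
    print("Below the four-fifths rule of thumb: potential disparate impact.")
```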

In our study, we find that vendors of algorithmic hiring assessments typically avoid disparate treatment simply by ensuring that protected attributes like race or gender are not used as inputs to their models. With regard to disparate impact, however, vendors fall into two camps. According to the Uniform Guidelines, one way to defend against a claim of disparate impact is to demonstrate that the assessment in question—the screening algorithm—has validity, meaning that it accurately predicts a job-related quality. Thus, even if the screening algorithm does produce a disparate impact, it can be justified as serving a legitimate business objective if it is sufficiently accurate.

“[E]ven if the screening algorithm does produce a disparate impact, it can be justified as serving a legitimate business objective if it is sufficiently accurate.”

However, some vendors take the additional step of investigating whether they can develop a different screening algorithm that performs equally well, while reducing disparities in selection rates across groups. In other words, these vendors help employers discover the existence of viable alternative business practices—practices that meaningfully reduce disparate impact without imposing significant cost on employers. Employers who fail to consider and adopt such alternative screening tools would open themselves up to liability, as plaintiffs could argue that the original screening process is not really a business necessity or justified by a legitimate business goal. In practice, we observe that many vendors ensure that assessments never produce a disparate impact in the first place, thereby heading off any charges of discrimination without having to rely on an assessment’s validity. Vendors have moved in this direction despite the fact that, to our knowledge, algorithmic assessments in employment have yet to face any legal challenges.

Technically, there are a number of “de-biasing” methods vendors can employ as part of this second strategy. One common approach is to build a model, test it for disparate impact, and if disparate impact is found, remove inputs contributing to this disparate impact and rebuild the model. Consider, for example, a resume-screening algorithm found to select men at a higher rate than women. Suppose this algorithm (like one supposedly built—but never used—by Amazon) gives higher scores to applicants who played lacrosse. Note that lacrosse-playing might legitimately have some correlation with desirable job outcomes; those with experience playing team sports might on average perform better in team settings than those without it. However, it may also be the case that lacrosse tends to be played by affluent white males, and thus, the model might be more likely to select from this group. To combat this, a vendor or employer might prohibit the algorithm from considering the word “lacrosse” on a resume, forcing the model to find alternative terms that predict success and thereby potentially mitigating the original disparate impact. The hope is that the model denied access to the word “lacrosse” will identify other predictors of success—perhaps “sport” or “team”—that apply equally well to all potential job candidates.
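
The sketch below illustrates this “remove the offending inputs and rebuild” strategy on synthetic data. The features, thresholds, and the rule for deciding which input to drop are simplified assumptions for illustration, not any vendor’s actual procedure.

```python
# Illustrative sketch of iterative de-biasing by removing inputs; synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def selection_rate_ratio(selected, group):
    """Ratio of selection rates between the two groups (four-fifths rule of thumb)."""
    rates = [selected[group == g].mean() for g in (0, 1)]
    return min(rates) / max(rates)

rng = np.random.default_rng(0)
n = 1000
group = rng.integers(0, 2, n)                            # protected attribute (never a model input)
lacrosse = (rng.random(n) < 0.4 * group).astype(float)   # far more common in group 1
teamwork = rng.random(n)
outcome = ((0.6 * teamwork + 0.3 * lacrosse + 0.2 * rng.random(n)) > 0.5).astype(int)

features = {"lacrosse": lacrosse, "teamwork": teamwork}
while True:
    X = np.column_stack([features[name] for name in features])
    model = LogisticRegression().fit(X, outcome)
    selected = model.predict(X)
    if selection_rate_ratio(selected, group) >= 0.8 or len(features) == 1:
        break
    # Disparate impact found: drop the input most correlated with the
    # protected attribute and rebuild the model without it.
    worst = max(features, key=lambda name: abs(np.corrcoef(features[name], group)[0, 1]))
    del features[worst]

print("Inputs retained after de-biasing:", list(features))
```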

Policy implications

Identifying and mitigating bias in screening algorithms raises a number of pressing policy concerns. In what follows, we identify a set of issues in need of greater and often urgent attention.

Plaintiffs may not have sufficient information to suspect or demonstrate disparate impact.

This has long been a problem with cases involving disparate impact—the plaintiff’s case is not based solely on her own experience, but instead on the aggregate impact of a selection process across a group of people. Thus, demonstrating evidence of disparate impact requires data from a sufficiently large group. With traditional assessments, applicants may have been able to infer that a particular question or requirement placed an undue or unnecessary burden on one group as compared to another; with modern algorithmic screening tools, however, candidates may not be asked to complete a traditional assessment and may not even be aware of how exactly they are being evaluated. As a result, they may lack any indication that the assessment mechanism is potentially discriminatory.

It is unclear whether predictive validity is sufficient to defend against a claim of disparate impact.

According to the Uniform Guidelines, employers can justify a disparate impact by demonstrating the predictive validity of their selection procedures. This creates a near tautology in the context of machine learning: Models produced by machine learning are, by definition, built to ensure predictive validity. While plaintiffs might challenge whether the built-in validation process is itself valid, it is unclear when traditional forms of validation are insufficient even if they have been executed properly.

“[V]alidation may report that a model performs very well overall while concealing that it performs very poorly for a minority population.”

There are a number of reasons to be suspicious of validation studies. First, validation may report that a model performs very well overall while concealing that it performs very poorly for a minority population. For example, a model that perfectly predicts certain outcomes for a majority group (e.g., 95% of the population), but always makes mistakes on a minority group (e.g., 5% of the population), could still be very accurate overall (i.e., 95% accuracy). Common ways of evaluating a model rarely look at differences in accuracy or errors across different groups. Second, employers, working with vendors, have considerable freedom in choosing the outcome that models are designed to predict (e.g., the “quality” of potential employees). Rarely does a direct or objective measure exist for these outcomes; instead, practitioners must choose some proxy (e.g., performance review scores). Because performance reviews are subjective assessments, they run the risk of being inaccurate and biased. And while it may be possible to create a model that accurately predicts performance reviews, doing so would simply reproduce the discriminatory assessments. In other words, the model would demonstrate validity in predicting a biased outcome. Finally, claims regarding validity, lack of bias, and disparate impact are dataset- and context-specific. Such claims rest on the belief that the population and circumstances captured in a dataset used to evaluate a model will be the same as the population and circumstances to which the model will be applied. But this is rarely the case in practice. A model that is a valid predictor and exhibits no disparate impact in an urban context may be neither in a rural one. Thus, a selection procedure cannot be determined universally valid or unbiased.
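
To see the arithmetic behind the 95%/5% example above, consider the toy calculation below. The population split, labels, and error pattern are all invented: the hypothetical model is always right for the majority and always wrong for the minority, yet still reports roughly 95% overall accuracy.

```python
# Toy illustration: overall accuracy can mask failure on a minority group.
# All numbers below are synthetic.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
minority = rng.random(n) < 0.05          # roughly 5% of the population
y_true = rng.integers(0, 2, n)           # actual outcomes

# Hypothetical model: perfect on the majority, always wrong on the minority.
y_pred = np.where(minority, 1 - y_true, y_true)

correct = y_pred == y_true
print(f"Overall accuracy:  {correct.mean():.1%}")             # ~95%
print(f"Majority accuracy: {correct[~minority].mean():.1%}")  # 100%
print(f"Minority accuracy: {correct[minority].mean():.1%}")   # 0%
```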

Should an employer or vendor address each of these concerns, the question still might remain: Is predictive validity enough to defend against a claim of disparate impact? In other words, would a demonstrable correlation between inputs and outcomes suffice? The Uniform Guidelines seem to allow for the possibility of validating a model accordingly; there is no obligation to identify a causal mechanism, offer theoretical justification for uncovered relationships, or even understand the relationship between model inputs and outcomes. Yet, when such models generate a disparate impact, we might struggle to accept their results if they rest on non-intuitive and thus seemingly arbitrary factors. At the same time, if the model reduces the degree of disparate impact observed in previous hiring practices, we might welcome the model as an improvement even if we cannot explain the correlations it has uncovered.

Many proposed solutions for mitigating disparities in screening decisions require knowledge of legally protected characteristics.

At a minimum, employers and vendors seeking to mitigate a disparate impact must know the legally protected classes to which people in the training data belong. Simply depriving a model of access to these characteristics at the moment of assessment cannot guarantee unbiased decisions. Yet employers and vendors fear that explicitly considering these characteristics as part of their assessments may invite charges of disparate treatment. Our study suggests that vendors have tried to circumvent this apparent tension by using protected characteristics when building models, removing the correlated factors that contribute to disparate impact, but then ensuring that the models themselves are blind to sensitive attributes. This style of bias prevention, while appealing, is not without complications.

“[T]he more sensitive the data or stigmatized the condition, the less comfortable applicants may be to share it with employers—even if the stated purpose for collecting it is to protect against disparate impact along these lines.”

To remedy this, employers will need to collect information, like race, gender, and other sensitive attributes, that proponents of fair hiring practices have long fought to keep out of the hiring process. In many cases, employers will be forced to solicit information that applicants rightly view as sensitive because such information has historically been the basis for discrimination rather than a means of mitigating it. It is impossible to apply the proposed de-biasing methodologies to models in the absence of information about, for instance, employees’ sexual orientation or disability status. Yet the more sensitive the data or stigmatized the condition, the less comfortable applicants may be to share it with employers—even if the stated purpose for collecting it is to protect against disparate impact along these lines.

A focus on mitigating disparate impact risks concealing differential validity.

Our study suggests that vendors have thus far focused on ensuring that their models exhibit minimal disparate impact, leaving aside questions about differences in model accuracy across the population. Consider a model that is perfectly accurate in predicting job outcomes for one group, but performs no better than random for another group. Such a model might not result in any disparity in selection rates, but the quality of its assessment would differ dramatically between groups—a phenomenon known as differential validity. Assessments exhibiting differential validity could easily set people up to fail, lending support to the harmful stereotypes that have justified discriminatory hiring in the past.
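
The synthetic sketch below illustrates this scenario: two groups are selected at roughly the same rate, yet the hypothetical scores track actual success almost perfectly for one group and no better than chance for the other. The data and scoring rule are invented purely for illustration.

```python
# Toy illustration of differential validity with roughly equal selection rates.
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
group = rng.integers(0, 2, n)
y_true = rng.integers(0, 2, n)                    # actual on-the-job success

# Hypothetical assessment scores: informative for group 0, pure noise for group 1.
scores = np.where(group == 0, y_true + 0.1 * rng.random(n), rng.random(n))
selected = scores > np.quantile(scores, 0.5)      # select the top half of scorers

for g in (0, 1):
    rate = selected[group == g].mean()
    accuracy = (selected == (y_true == 1))[group == g].mean()
    print(f"Group {g}: selection rate {rate:.2f}, predictive accuracy {accuracy:.2f}")
```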

Differential validity can also serve a crucial diagnostic function: A model may be performing differently for different groups because the factors that predict the outcome of interest are not the same across each group. When we observe that a model exhibits differential validity, we learn that the relationship between model inputs and actual outcomes is likely different across groups. In other words, different factors predict success for different groups.

There are a few steps creators of algorithmic assessments can take to mitigate differential validity. Fundamentally, in order to make accurate predictions for the whole population, we need (1) a wide range of model inputs that can be predictive across the whole population (i.e., not just group-specialized inputs like “lacrosse”); and (2) a diverse dataset containing examples of successful individuals from a variety of backgrounds. Importantly, neither of these can be achieved by “de-biasing” the model itself. In some cases, vendors may need to collect more data in order to reduce differential validity.

Algorithmic de-biasing techniques may have significant implications for “alternative business practices.”

Historically, the search for alternative business practices in screening has been quite expensive, requiring firms to consider a wide range of assessments and implementations. However, algorithmic de-biasing techniques promise to automate some degree of exploration, uncovering viable alternative business practices on their own. That said, using these techniques is not without cost. Contracting with vendors of such tools can be expensive. Developing the infrastructure to collect the necessary data, including candidates’ sensitive attributes, can be expensive, cumbersome, and fraught. In some cases, algorithmic de-biasing will also reduce the accuracy of an assessment, since these methods typically involve discarding some information that is genuinely predictive of the outcome of interest. And yet, many vendors encourage employers to do just that, noting that in practice mitigating disparate impact often has only a small effect on predictive accuracy. Vendors’ ability to help employers find such alternative business practices may put legal pressure on employers to work with them, as failure to do so might be seen as needlessly sticking with a hiring process that generates an avoidable disparate impact. And where there is an apparent trade-off between accuracy and disparate impact, these tools will make such tensions explicit and force employers to defend, for example, a choice to favor marginal gains in accuracy over a significant reduction in disparate impact.

Conclusion

Algorithmic hiring brings new promises, opportunities, and risks. Left unchecked, algorithms can perpetuate the same biases and discrimination present in existing hiring practices. Existing legal protections against employment discrimination do apply when these algorithmic tools are used; however, algorithms raise a number of unaddressed policy questions that warrant further attention.


The Brookings Institution is a nonprofit organization devoted to independent research and policy solutions. Its mission is to conduct high-quality, independent research and, based on that research, to provide innovative, practical recommendations for policymakers and the public. The conclusions and recommendations of any Brookings publication are solely those of its author(s), and do not reflect the views of the Institution, its management, or its other scholars.

Microsoft provides support to The Brookings Institution’s Artificial Intelligence and Emerging Technology (AIET) Initiative, and Amazon and Apple provide general, unrestricted support to the Institution. The findings, interpretations, and conclusions in this report are not influenced by any donation. Brookings recognizes that the value it provides is in its absolute commitment to quality, independence, and impact. Activities supported by its donors reflect this commitment.

Footnotes
    1. Bogen, M., & Rieke, A. (2018). Help wanted: An examination of hiring algorithms, equity, and bias. Technical report, Upturn.
    2. Quillian, L., Pager, D., Hexel, O., & Midtbøen, A. H. (2017). Meta-analysis of field experiments shows no change in racial discrimination in hiring over time. Proceedings of the National Academy of Sciences, 114(41), 10870-10875.
    3. See also HireVue, which constructs algorithmic assessments based on video interviews.
    4. Raghavan, M., Barocas, S., Kleinberg, J., & Levy, K. (2020). Mitigating Bias in Algorithmic Hiring: Evaluating Claims and Practices. In Proceedings of the Conference on Fairness, Accountability, and Transparency. ACM.
    5. Barocas, S., & Selbst, A. D. (2016). Big data’s disparate impact. Calif. L. Rev., 104, 671.
    6. Passi, S., & Barocas, S. (2019, January). Problem formulation and fairness. In Proceedings of the Conference on Fairness, Accountability, and Transparency (pp. 39-48). ACM.
    7. Grimmelmann, J., & Westreich, D. (2017). Incomprehensible discrimination; Kim, P. T. (2016). Data-driven discrimination at work. Wm. & Mary L. Rev., 58, 857.
    8. Selbst, A. D., & Barocas, S. (2018). The intuitive appeal of explainable machines. Fordham L. Rev., 87, 1085.
    9. Kim, P. T. (2017). Auditing algorithms for discrimination. U. Pa. L. Rev. Online, 166, 189.