Can we create better algorithms for screening candidates - and reduce hiring bias?

By Neil Raden, August 30, 2019
Summary:
A new research paper from Georgia Tech takes a surprising position on algorithmic bias in hiring. Their view: we can reduce screening bias if algorithms take the impacted demographic groups into account. Here's my critique.


In a previous article, Are hiring decisions ready for AI? How repeatable algorithms can harm people, I described my concerns about the ethics of using AI, at its current level of effectiveness, to screen candidates - and I concluded that it was not ready.

If an algorithm is used to screen resumes or submissions (or, in some cases I’ve found, scraped social media) for acceptable candidates, the criteria are almost surely going to be superficial.

Do they have a PhD? What is their FICO score? Standardized test score? Where did they go to college?

The assessment becomes even more subjective with emerging AI systems that scan videos of job applicants and evaluate their speech and facial expressions, then claim to measure qualities such as collaboration, direct communication, persuasion, and empathy.

These algorithms will always score a candidate in a way that is consistent with the model - and that’s the problem. If an applicant somehow makes it through, the interview process can compensate for the shortcomings that led to a lower score; the trouble is that a low-scoring applicant never gets to the interview. I think we can all agree that we know people who do not meet some of those qualifying criteria but are nonetheless good choices.

As I wrote:

Repeatability of an algorithm can work against people. Algorithms "fire" at a much higher cadence than people, and repeat a bias at scale (part of the appeal of algorithms is how cheap they are to use).

Cathy O'Neil, in her bestseller, Weapons of Math Destruction, gives an example:

A college student with bipolar disorder wanted a summer job bagging groceries. Every store he applied to was using the same psychometric evaluation software to screen candidates, and he was rejected by every one of them. Humans may share similar biases, but not all humans will make the same decision; given the same inputs, an inferencing algorithm will always make the same decision. That consistency sounds desirable, but only if we are willing to give up on humans judging for themselves. Perhaps that college student would have been able to find one place to hire him, even if some of the people making the decisions had biases about mental health.

That’s part of the problem. The other part of the problem is how the algorithm handles bias.

I came across a recently published paper, Closing the GAP: Group-Aware Parallelization for Online Selection of Candidates with Biased Evaluation, by two Georgia Tech researchers, Jad Salem and Swati Gupta. The premise is that the algorithms can be improved by considering each candidate's membership in certain groups that typically experience bias. What? Swati was gracious enough to take the time to explain:

Why this research? I’m passionate about fairness, and the fact that I can attempt to solve some of these problems using math is really exciting!

One would assume that any programmatic vetting process would need to be blind to race, gender, age, and other protected attributes. But in conversation with Swati Gupta, I learned something remarkable. How do you build algorithms to overcome the inherent biases in this kind of selection process? In particular, how do you TAKE INTO ACCOUNT the attributes most likely to lead to bias when designing those algorithms?

In the paper, the authors write:

Although scoring candidates provides a numeric scale to compare them, it is often unclear what the impact of their training or their socioeconomic background is on the observed score. Numerous studies have shown that experiences with racial and gender stereotyping can have adverse effects on test performance.

And the ML models themselves, Swati continues, “are often trained on real-world data that is not free from bias and thus suffer from similar pitfalls.” In other words, masking attributes like age and sex can still produce models that are significantly biased, because the training sets themselves may not be inclusive.
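To make that concrete, here is a minimal sketch of my own - not from the paper, and built on made-up synthetic data - showing how a model that is "blind" to a protected attribute can still reproduce historical bias through a correlated proxy feature (the group variable, proxy feature, and bias sizes below are all assumptions for illustration):

```python
# Sketch: masking a protected attribute does not remove bias when a proxy
# feature (think zip code) is correlated with group membership and the
# historical hiring labels themselves were biased.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
group = rng.integers(0, 2, n)                 # hypothetical protected attribute (0 or 1)
proxy = group + rng.normal(0, 0.3, n)         # feature correlated with group
skill = rng.normal(0, 1, n)                   # true, group-independent qualification

# Historical labels are biased: group 1 was under-hired at equal skill.
hired = (skill - 0.8 * group + rng.normal(0, 0.5, n)) > 0

# Train "blind" to group: only skill and the proxy are used as features.
X = np.column_stack([skill, proxy])
model = LogisticRegression(max_iter=1000).fit(X, hired)

scores = model.predict_proba(X)[:, 1]
print("mean score, group 0:", round(scores[group == 0].mean(), 3))
print("mean score, group 1:", round(scores[group == 1].mean(), 3))
# The proxy lets the model reproduce the historical bias despite the masking.
```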

This is where it gets interesting. Instead of dismissing algorithmic candidate selection, the authors go on to reach a reasonable conclusion: make the algorithms better.

The rest of the paper describes a number of models and derived algorithms proving, mathematically, that the best approach for fairness in online selection of candidates is for the algorithm to KNOW what group a candidate belongs to, so that it can take into account potential bias in the other selection criteria.
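Here is a toy illustration of that idea - my own sketch, not the authors' algorithm - in which observed scores for one group carry a systematic penalty, and a screener that knows the group can add back an estimated bias before applying its cutoff. The group sizes, the bias magnitude, and the cutoff are all assumptions I made up:

```python
# Sketch: group-aware vs. group-agnostic screening under biased evaluations.
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
group = rng.integers(0, 2, n)            # hypothetical protected-group membership
true_utility = rng.normal(0, 1, n)       # what we would like to select on
bias = 0.7                               # systematic penalty applied to group 1's scores
observed = true_utility - bias * group   # the biased evaluation the screener sees

cutoff = 1.0
est_bias = 0.7                           # assume the screener can estimate the bias

# Group-agnostic screening: apply the cutoff to raw observed scores.
agnostic_pass = observed > cutoff
# Group-aware screening: add back the estimated bias for group 1 first.
aware_pass = (observed + est_bias * group) > cutoff

for name, passed in [("agnostic", agnostic_pass), ("aware", aware_pass)]:
    share = (group[passed] == 1).mean()
    quality = true_utility[passed].mean()
    print(f"{name:9s}: group-1 share of passes = {share:.2f}, "
          f"mean true utility = {quality:.2f}")
```

In this toy setup, the group-aware rule passes roughly the same share of each group and a higher average true utility; the group-agnostic rule quietly filters out the penalized group.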

Now, these are fairly simplified models and do not take into account all of the complexity and subjectivity in hiring decisions, and they only apply to the serial vetting of candidates. Once an applicant is approved, the interview process is something else.

Still, it is intriguing. The idea that algorithms can prove to be fairer when they know this kind of data, rather than being masked from it (and, obviously, can effectively adjust for it, though that is a harder problem), is a little counter-intuitive, but not unreasonable.

So, in the end, the authors say:

Our mathematical analysis shows parallelization to some extent might be a good intervention when there exists bias in evaluations since it increases the provable utility of candidates hired under many settings.

An important takeaway from our paper is that if the utilities of people are generated in random, agnostic to their group membership, and there is a significant bias against a certain group of people, then the group-aware parallelization is much better at closing the gap from optimal than using group-agnostic algorithms.
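As I read it, "parallelization" means evaluating each group in its own parallel track rather than pooling biased scores into one ranked list - that reading, and the proportional quotas below, are my assumptions, not the paper's formal procedure:

```python
# Sketch: pooled (group-agnostic) top-k selection vs. a rough group-aware
# parallelization that ranks within each group and hires proportional quotas.
import numpy as np

rng = np.random.default_rng(2)
n, k = 5_000, 100
group = rng.integers(0, 2, n)
true_utility = rng.normal(0, 1, n)
observed = true_utility - 0.7 * group          # group 1 is scored with a penalty

# Group-agnostic: one pooled ranking on the biased observed scores.
pooled_hires = np.argsort(observed)[-k:]

# Group-aware parallelization: rank within each group, hire proportional quotas.
hires = []
for g in (0, 1):
    idx = np.flatnonzero(group == g)
    quota = round(k * len(idx) / n)
    hires.extend(idx[np.argsort(observed[idx])[-quota:]])
parallel_hires = np.array(hires)

for name, chosen in [("pooled", pooled_hires), ("parallel", parallel_hires)]:
    print(f"{name:9s}: group-1 share = {(group[chosen] == 1).mean():.2f}, "
          f"mean true utility = {true_utility[chosen].mean():.2f}")
```

In this simplified setting, the parallel tracks recover the genuinely strongest candidates from the penalized group, which is the "closing the gap from optimal" effect the authors describe.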


My take

This is very interesting, but it has some issues in application. First, if the model applies too much of an adjustment for those in a group subject to bias, the opposite effect could arise: those not in such a group become the victims of bias. Second, the proofs depend on a priori knowledge of the number of applicants that will be evaluated. That isn't realistic, and I suspect the authors will continue to research and refine the models.

Then there is Title VII of the Civil Rights Act of 1964, which prohibits discrimination based on race, color, religion, sex (including pregnancy and childbirth) or national origin. It is illegal to discriminate in hiring, discharging, compensating, or providing the terms, conditions, and privileges of employment. In fact, there are many federal and state laws in addition to the 1964 act, such as the Age Discrimination in Employment Act (ADEA), the Americans with Disabilities Act, and the Equal Pay Act.

And here’s a big one for the near future: the Genetic Information Nondiscrimination Act of 2008.

Title VII even specifically calls out employment agencies. It would not take a very creative lawyer to make the case that a candidate selection algorithm was operating like an employment agency.

Swati is optimistic though:

These are complex interdisciplinary problems. In our research, we find that taking group membership into account is a promising mechanism for counteracting bias, but there is a lot more work left to be done. To be able to find meaningful solutions, we need to have a continued dialogue between lawyers, policy makers, computer scientists and mathematicians.

I will stay in touch with the authors to see how this develops.