Can we measure fairness? A fresh look at a critical AI debate

By Neil Raden, December 21, 2020
By now, most AI practitioners acknowledge the universal prevalence of bias, and the problem of bias in AI modeling. But what about fairness? Can fairness be measured via quantifiable metrics? Some say no - but this is where the debate gets interesting.

Bias is a universal trait among humans. It can be meaningless, or it can be dangerous. For example, I am biased about the Philadelphia Phillies baseball team. If a Phillies pitcher competes with a pitcher from another team for the Cy Young Award at the end of the season, I am likely to extol the former's virtues and denigrate those of the latter. Unless I'm a member of the Baseball Writers' Association of America, whose members vote on such things, it makes no difference. It affects no one. It's harmless, and we all have biases like it.

However, the phenomenon of bias and how it manifests itself is the same, whether it's about baseball or matters of gender, age, race, etc. If I am a hiring manager and am biased about older (or younger) people, it makes a significant difference. When the success of those hiring decisions is predicted by a machine learning model, based on historical data that already reflects those biases, that is not only an ethical problem; it is harmful to the community. 

In contrast to the concept of bias, the idea of fairness is a derivative of the actions of discrimination. In AI, bias imposes itself in the data, in the model (and the modeler), and some would say in the algorithm. Still, algorithmic bias is typically a function of the data. Fairness is after-the-fact. The results of the model are either fair or not, or somewhere in-between. Can it be measured? There is no unanimity on that question. 

There are many points of view about measuring fairness, justice, and unbiased results. In a recent post on LinkedIn, Reid Blackman, Ph.D., said: "Fairness and justice cannot simply be captured by notions of statistical parity...the vast majority of the talk and research regarding 'fair AI' is about aiming for statistical parity in how various sub-populations are treated... There is no mathematical formula for fairness, and there never will be." His advice is to "look beyond the numbers."

There is a lot to unpack here. In fairness to Blackman, I agree with him about parity. When parity is defined as the same relative bias against every subgroup (my model harms subgroups evenly), it is not an acceptable way to judge your model. It is nonetheless a widespread technique, and it needs to stop.

However, I do not agree with the premise that fairness and justice can't be measured by statistical means. In my classical statistical training, I was taught that you can't make predictions with numbers. But statistics help you understand what's happening on the ground so you can make better models. "Prediction is very difficult, especially if it's about the future," said Niels Bohr.

Fairness and justice can be evident without measurement, but making progress requires measurement. I disagree with Blackman's comment that "There is no mathematical formula for fairness, and there never will be," and I have some reputable company, which I'll reveal below. Discrimination and bias didn't appear with AI, but combined with AI's infinitely scalable resources, they have an unimaginable opportunity to escalate and propagate uncontrollably.

Fairness must be measured contextually. As Blackman wrote, fairness is often based upon a legal standard of disparate impact and is characterized by predicted outcomes that differ across groups. We saw this in some iconic AI disasters such as Amazon (hiring), COMPAS (recidivism), and any NLP model pre-trained naïvely on Common Crawl, Google News, or any other corpus, since Word2Vec. Parity of disparate impact is no kind of goal. It measures only how the model predicts a person in a classification should benefit.

Fairness metrics

In their 2018 paper, Fairness Definitions Explained, Verma and Rubin present a rich selection of fairness metrics. Here is a high-level summary of one example:

There are two types of fairness definitions. "Individual" definitions of fairness employ constraints with semantic content about individuals. Unless these constraints are defined/known, they cannot be resolved. Conversely, when constraints are defined for unknown labels, they cannot be solved. "Statistical" fairness metrics measure equality of false-positive rates evaluated over "protected" populations. The upside is they are straightforward to satisfy but don't offer assurance to individuals. For example, one approach is to equalize false-positive rates among pairs, and the "rate" is an average over replicates and not an average of people. There are algorithms for optimizing classification error subject to this constraint, which gives credibility to both distributions, making promises both to new individuals and over new problems.
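
To make the statistical definition concrete, here is a minimal Python sketch of one such metric: the false-positive rate for each protected group and the gap between groups. The function names, the toy labels, and the groups "A" and "B" are my own illustrative assumptions, not something taken from the paper.

```python
from collections import defaultdict


def false_positive_rates(y_true, y_pred, groups):
    """Return {group: FP / (FP + TN)}, computed over actual negatives only."""
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 0:
            negatives[group] += 1
            if pred == 1:
                false_pos[group] += 1
    # Groups with no actual negatives are skipped: their rate is undefined.
    return {g: false_pos[g] / negatives[g] for g in negatives}


def fpr_gap(rates):
    """Largest difference in false-positive rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: 1 = positive outcome, 0 = negative outcome.
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0, 0, 1]
groups = ["A"] * 5 + ["B"] * 5

rates = false_positive_rates(y_true, y_pred, groups)
gap = fpr_gap(rates)
print(rates, gap)
```

A gap of zero would satisfy this particular statistical definition. As the summary notes, it would still say nothing about how any one individual is treated.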

In A New Metric for Quantifying Machine Learning Fairness in Healthcare, the authors develop another example:

We propose a new method for measuring fairness called "Group Benefit Equality." Group benefit equality aims to measure the rate at which a particular event is predicted to occur within a subgroup compared to the rate at which it actually happens. No single metric is a silver bullet against algorithmic bias. Practitioners should calculate several bias metrics, like group benefit equality and equality of opportunity, in conjunction and think critically about the impacts of any difference that may be observed. With that said, it is helpful to have a single metric that is easy to explain, benefits from a transparent procedure, and has well-defined target values and thresholds for when a model is potentially biased. Group benefit equality uniquely satisfies all of these requirements and is the best healthcare metric for quantifying algorithmic fairness.
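
One plausible reading of that description is a simple ratio: the predicted event rate within a subgroup divided by the observed event rate in the same subgroup. The Python sketch below follows that reading. It is my own assumption about the formula, the function name and toy data are hypothetical, and it does not reproduce the target values or thresholds the authors mention.

```python
def group_benefit_ratio(y_true, y_pred, groups, group):
    """Predicted event rate divided by actual event rate within one subgroup.

    Both rates share the same denominator (the subgroup size), so the ratio
    of rates equals the ratio of counts.
    """
    predicted = sum(p for p, g in zip(y_pred, groups) if g == group)
    actual = sum(t for t, g in zip(y_true, groups) if g == group)
    if actual == 0:
        raise ValueError(f"no actual events observed in group {group!r}")
    return predicted / actual


# Hypothetical toy data: 1 = event (predicted or observed), 0 = no event.
y_true = [1, 1, 0, 0, 1, 1, 1, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
groups = ["A"] * 4 + ["B"] * 4

ratio_a = group_benefit_ratio(y_true, y_pred, groups, "A")
ratio_b = group_benefit_ratio(y_true, y_pred, groups, "B")
print(ratio_a, ratio_b)
```

Under this reading, a ratio near 1 means the model predicts the event about as often as it actually occurs in that subgroup. In the toy data the model predicts the event for group B only half as often as it happens, which is the kind of difference the authors ask practitioners to think critically about.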

My take

I am still suspicious of the enduring value of AI, machine learning in particular. Most ML models use one or another form of regression. As Judea Pearl so colorfully puts it, "It's just curve fitting." What he means by that is that whatever the ML model does, it's looking for patterns and correlations. There is an infinite number of ways to screw that up because the algorithms only know what they know. There is this thing called confounding factors.

Hypothetical example: "Children who drank a lot of milk become heroin addicts." How many hundreds of factors did the model NOT consider before coming up with that spurious correlation? Those factors are confounding. But there's hope. Judea Pearl leads a group of researchers working to bring cause and effect to AI, and they've created a new causal calculus. Pearl asks this question: "There is an almost perfect correlation between eating oranges and not getting scurvy. But the causal relationship is Vitamin C, not oranges. If you didn't know this, and you ran out of oranges, what would you do? Eat bananas."
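
A small simulation shows how a hidden factor can manufacture a correlation of this kind. In the Python sketch below, a confounder z drives both x and y, and x has no effect on y at all. The variable names, the probabilities, and the seed are arbitrary assumptions of mine, and the sketch illustrates plain confounding, not Pearl's causal calculus.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
rows = []
for _ in range(10_000):
    z = rng.random() < 0.5              # hidden confounder
    p = 0.9 if z else 0.1               # z raises the chance of both x and y
    x = 1 if rng.random() < p else 0    # x depends only on z
    y = 1 if rng.random() < p else 0    # y depends only on z, never on x
    rows.append((z, x, y))

# Correlation a model sees when it ignores z.
overall = pearson([x for _, x, _ in rows], [y for _, _, y in rows])

# Correlation within each stratum of z, i.e. after controlling for z.
within = {
    z: pearson(
        [x for zz, x, _ in rows if zz == z],
        [y for zz, _, y in rows if zz == z],
    )
    for z in (False, True)
}
print(overall, within)
```

Ignoring z, x and y look strongly related. Within each value of z the relationship all but disappears, which is what a pattern-matching model cannot tell you unless the confounder is in the data.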

In my next installment on measuring fairness, I'll go a little deeper into the methodology and introduce some promising new options. In particular, I'll review how Amazon adopted an anti-bias program based on research conducted at Oxford University.