Think AI can aid decision-making? Only if you address the risk of bias in the last mile

Neil Raden, May 31, 2022
AI has the potential to aid decision-making at scale - but there are obstacles. One of the trickiest to unravel is bias, which tends to show up just where you don't want it: during the last mile of AI modeling and project delivery.

Risk concept - man balancing on rope over precipice in mountains at sunset © PHOTOCREO Michal Bednarek - shutterstock

Any decision-aiding or decision-making system that utilizes quantitative or probabilistic modeling is subject to the risk of bias. It is not a new phenomenon. However, previously, systems were built with rules and understandable logic. When problems arose, they could be assessed and cured (provided no one lost the source code).

The problems with AI are two-fold: scale and opacity. A Machine Learning (ML) system can produce far more decisions than can be monitored manually, and its inner workings are largely opaque.

A significant source of bias in AI is the data used for the models. Human biases are historically manifested in the systems through the data and passed through to the ML model. Software tools to "scrub" data for bias are only partially helpful. They scan for obvious Personally Identifiable Information (PII) such as name, gender, zip code, or Social Security number, and then remove, mask, or encrypt it.

Unfortunately, the tools are immature and insufficient because the problem is more complex. Hiding these identifiable values diminishes the analytic power of the data, which creates a conundrum. Consider: it may be forbidden to use gender or race in credit decisions, but in modeling innovative healthcare solutions, these criteria are essential.

Cleaning the data to remove the column or columns containing the characteristics likely to skew the model towards bias is an incomplete approach. It doesn't work because of the nature of ML processing. Seemingly innocuous attributes can, when combined, lead the model in an unintended direction. This may not be evident to the data scientist or AI engineer because these weak indicators, called proxy variables, have relationships that are unknown to the modeler.

ML is always on the prowl for connections and patterns. For example, first name, address, education, smoking and drinking habits, pet ownership, diet, height, and BMI (information that is all readily available from third-party, unregulated data brokers) can, in combination, serve as proxy variables for age, which in most cases is forbidden as a decision criterion. When the model isn’t converging on a solution to the cost function, it may turn to proxy values and provide incorrect inferences, a behavior known as "Shortcut Learning." A simple example is a model trained on pictures of dogs and cats. Rather than discriminating on the features of dogs, it finds an easy path by focusing on a leash, so when a cat has a leash, it is classified as a dog.

So what’s there to do? There is a view that ML is a reliable black box: just design the test data, and run the model. When people make decisions and record them in digital systems, their biases can often go undetected. For example, loan applications, particularly residential mortgages, involve some subjective opinion by a loan officer and a committee evaluating the applicant's ability to pay or the likelihood of default. These deliberations are not encoded in the administrative systems, so the recorded data may not capture these subjective (biased) group decisions. The data scientist or AI engineer has to test the system's performance against some “fairness” criteria to establish whether there is hidden bias in the data.
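One simple criterion of this kind compares approval rates between groups. The following is a minimal sketch in Python, not a complete audit. The loan records and group labels are made up for illustration, and the 0.8 cutoff is an assumption borrowed from the "four-fifths" rule of thumb used in US employment guidance; a real project would choose its own criterion.

```python
from collections import defaultdict

def approval_rates(decisions):
    """Return the approval rate per group from (group, approved) pairs."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    return {g: approved[g] / totals[g] for g in totals}

def disparate_impact_ratio(decisions):
    """Lowest group approval rate divided by the highest group approval rate."""
    rates = approval_rates(decisions)
    return min(rates.values()) / max(rates.values())

# Hypothetical loan decisions: (group label, approved?)
decisions = (
    [("A", True)] * 60 + [("A", False)] * 40
    + [("B", True)] * 30 + [("B", False)] * 70
)

ratio = disparate_impact_ratio(decisions)
flagged = ratio < 0.8  # assumed threshold; flags the data for closer review
print(f"approval-rate ratio: {ratio:.2f}, flagged: {flagged}")
```

A low ratio does not prove discrimination by itself. It only signals that the historical decisions, or the model trained on them, deserve a closer look.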

Most of the "AI Ethics" discussion involves people, but bias is not limited to people. Providing a poor customer experience can damage a company's prestige and brand value. For example, in route optimization, a model may converge on a characteristic not suitable for the objective, resulting in a poor solution that deprives otherwise acceptable customers of timely deliveries. The same is true in supply chain optimization. A model can run in unsupervised mode and make wrong pricing decisions because of bias that comes not from people but from contradictory model assumptions not caught in time.

The complexity of testing ML Models

Machine learning algorithms are based on statistics and curve-fitting, not psychology or neuroscience. Operationally, ML computes derivatives (gradients) of a cost function in a gradient descent process. The main objective of a gradient descent algorithm is to minimize the cost function through iteration. In other words, its goal is to find the most efficient way to satisfy the objective.
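To make the idea concrete, here is a minimal sketch of gradient descent fitting a straight line by minimizing mean squared error. The data, learning rate, and step count are illustrative assumptions; production ML frameworks use far more elaborate variants of the same loop.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by repeatedly stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current parameters
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Move a small step downhill on the cost surface
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

The loop has no notion of which features are appropriate to use. It simply follows whatever direction lowers the cost, which is why it will exploit a proxy if one is available.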

Even if the training data does not contain any protected characteristics like gender or race, these issues can occur. A variety of features in the training data are often closely correlated with protected characteristics, e.g., occupation. These ‘proxy variables’ enable the model to reproduce patterns of discrimination associated with those characteristics, even if its designers did not intend this.
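One rough way to look for a proxy is to ask how well a supposedly neutral feature predicts the protected characteristic. The sketch below is an illustration under stated assumptions: the rows, the `occupation` and `gender` field names, and the helper `proxy_strength` are all hypothetical, and the check is done in-sample on a single feature, so it will miss proxies that only emerge from combinations of features.

```python
from collections import Counter, defaultdict

def proxy_strength(rows, feature, protected):
    """Return (accuracy, baseline).

    accuracy: share of rows where the protected value is guessed correctly
              using the most common protected value for each feature value.
    baseline: share guessed correctly by always picking the overall most
              common protected value, ignoring the feature.
    """
    overall = Counter(r[protected] for r in rows)
    baseline = overall.most_common(1)[0][1] / len(rows)

    by_value = defaultdict(Counter)
    for r in rows:
        by_value[r[feature]][r[protected]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(rows), baseline

# Hypothetical training rows with the protected column kept aside for auditing
rows = (
    [{"occupation": "nurse", "gender": "F"}] * 9
    + [{"occupation": "nurse", "gender": "M"}] * 1
    + [{"occupation": "engineer", "gender": "F"}] * 2
    + [{"occupation": "engineer", "gender": "M"}] * 8
)

accuracy, baseline = proxy_strength(rows, "occupation", "gender")
print(f"guess from occupation: {accuracy:.2f}, baseline: {baseline:.2f}")
```

A large gap between the two numbers suggests the feature carries information about the protected characteristic even though that column was never given to the model.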

These problems can occur in any statistical model. However, they are more likely to occur in AI systems because they can include a much larger number of features and may identify complex combinations of features that are proxies for protected characteristics. Many modern ML methods are more powerful than traditional statistical approaches because they are better at uncovering non-linear patterns in high dimensional data. However, these also include patterns that reflect discrimination.

Testing and vetting a model before it moves into production is essential for the Last Mile methodology. Testing for bias is only part of the process. The model has to be proofed for scale, security, how the various APIs and data pipelines operate, and, through simulation, how the model reacts to new data. Synthetic Data is a relatively new technique for testing, which provides generated data that is logically equivalent to the experience data, though there is a fair amount of controversy about where it is appropriate and where it’s not.

The inferences of an ML model have to be applied intelligently. Machine Learning is a probabilistic method and is not ideally suited to deterministic policies and rules. Applying business rules after ML-based predictions or classifications will deliver better conformance and transparency.
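A minimal sketch of that layering might look like the following. Everything here is hypothetical: `model_score` is a stand-in for a trained model, and the field names, threshold, and rules are invented for illustration. The point is only the ordering, where the probabilistic score comes first and explicit, auditable rules get the final say.

```python
from dataclasses import dataclass

@dataclass
class Applicant:
    income: float
    requested_amount: float
    has_open_default: bool

def model_score(applicant):
    """Stand-in for an ML model: returns a pseudo-probability of repayment."""
    ratio = applicant.income / max(applicant.requested_amount, 1.0)
    return min(1.0, ratio / 2)

def decide(applicant, threshold=0.6):
    """Score with the model, then let deterministic business rules override."""
    score = model_score(applicant)
    outcome = "approve" if score >= threshold else "refer"
    reason = f"model score {score:.2f}"

    # Deterministic policy rules, applied after the prediction
    if applicant.has_open_default:
        outcome, reason = "reject", "rule: open default on file"
    elif outcome == "approve" and applicant.requested_amount > 500_000:
        outcome, reason = "refer", "rule: amount above automated approval limit"
    return outcome, reason

print(decide(Applicant(100_000, 50_000, False)))
print(decide(Applicant(100_000, 50_000, True)))
print(decide(Applicant(2_000_000, 600_000, False)))
```

Because every override carries a named rule as its reason, a reviewer can see which decisions came from the model and which came from policy.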

Testing for discrimination and “fairness” is still somewhat tricky because there are no widely accepted quantifiable definitions for discrimination and fairness at this point. Still, you can test against your own definitions until “best practices,” laws, and regulations codify them. Some available tests give quantitative measures but do not, as yet, give a clear picture of the root cause of defects. For example, the prevailing method for reverse-engineering the output of a predictive algorithm for explainability is SHAP (SHapley Additive exPlanations). It was invented by Lundberg and Lee and published in a paper in 2017. Briefly, SHAP calculates how much each variable affects the outcome.
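SHAP builds on Shapley values from cooperative game theory. The sketch below computes exact Shapley values for a tiny, made-up model by enumerating every subset of features. It does not use the SHAP library, which relies on approximations to stay tractable for real models. The `predict` function, feature names, and baseline values are assumptions for illustration, and "missing" features are simply replaced by their baseline value.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley attribution of predict(instance) - predict(baseline).

    Features outside a coalition take their baseline value.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        x = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return predict(x)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):  # coalition sizes 0 .. n-1 drawn from the other features
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of `name` when added to this coalition
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi

def predict(x):
    """Hypothetical scoring model with an interaction term."""
    return 0.5 * x["income"] + 0.3 * x["tenure"] + 0.2 * x["income"] * x["tenure"]

instance = {"income": 2.0, "tenure": 3.0}
baseline = {"income": 0.0, "tenure": 0.0}

phi = shapley_values(predict, instance, baseline)
print(phi)
```

The attributions add up to the difference between the prediction for the instance and the prediction for the baseline. They say how much each variable moved the score, not why the underlying data looks the way it does.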

Work by researchers at the Oxford Internet Institute and Alan Turing Institute has examined the mismatch between technical fairness measures and the law in detail. They found that technical work on AI ethics rarely aligns with legal and philosophical notions of ethics. “We found a fairly significant gap between the majority of the work out there on the technical side and how the law is applied,” said Brent Mittelstadt, one of the researchers. The researchers instead advocate the use of ‘bias transforming’ metrics that better match the aims of non-discrimination law. They have proposed a new metric of algorithmic fairness, conditional demographic disparity (CDD), informed by legal notions of fairness. This metric has been incorporated into the bias and explainability software offered by Amazon Web Services.
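As a rough illustration of how CDD behaves, the sketch below follows the formulation commonly documented for this metric: demographic disparity for a group is its share of rejections minus its share of acceptances, and CDD is the average of that disparity within each stratum, weighted by stratum size. Treat the formula as an assumption here rather than a quotation from the researchers, and note that the applicant data, field names, and helper functions are invented.

```python
from collections import defaultdict

def demographic_disparity(rows, group):
    """Group's share of rejected outcomes minus its share of accepted outcomes."""
    rejected = [r for r in rows if not r["accepted"]]
    accepted = [r for r in rows if r["accepted"]]
    if not rejected or not accepted:
        return 0.0  # assumption: treat an undefined disparity as zero
    rejected_share = sum(r["group"] == group for r in rejected) / len(rejected)
    accepted_share = sum(r["group"] == group for r in accepted) / len(accepted)
    return rejected_share - accepted_share

def conditional_demographic_disparity(rows, group, stratum_key):
    """Size-weighted average of demographic disparity within each stratum."""
    strata = defaultdict(list)
    for r in rows:
        strata[r[stratum_key]].append(r)
    weighted = sum(len(s) * demographic_disparity(s, group) for s in strata.values())
    return weighted / len(rows)

def make_rows(dept, group, n_accepted, n_rejected):
    """Build hypothetical applicant records for one department and group."""
    return (
        [{"dept": dept, "group": group, "accepted": True}] * n_accepted
        + [{"dept": dept, "group": group, "accepted": False}] * n_rejected
    )

# Group B mostly applies to the more selective department Y
rows = (
    make_rows("X", "A", 60, 20) + make_rows("X", "B", 15, 5)
    + make_rows("Y", "A", 5, 15) + make_rows("Y", "B", 20, 60)
)

dd = demographic_disparity(rows, "B")
cdd = conditional_demographic_disparity(rows, "B", "dept")
print(f"DD = {dd:.2f}, CDD by department = {cdd:.2f}")
```

In this made-up data the overall disparity against group B is large, but it vanishes once outcomes are compared within each department. Which conditioning variables are legitimate is a legal and policy question, not something the metric can settle.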

In fact, the majority of metrics for defining fairness in machine learning clash with EU law, they argue. These metrics are often ‘bias preserving’ because they assume the current state of society to be a neutral starting point from which to measure inequality. “Obviously, this is a problem if we want to use machine learning and AI not simply to uphold the status quo, but to actively make society fairer by rectifying existing social, economic, and other inequalities,” tweeted Mittelstadt.

My take

I’m skeptical of statements like “use machine learning and AI not simply to uphold the status quo, but to actively make society fairer by rectifying existing social, economic, and other inequalities.” I’d be satisfied with making society less unfair. People have to do the rest.
