The data science conundrum - why do commercial businesses eschew causal analysis?

By Neil Raden, February 21, 2020
Summary:
When we talk about the limits of data science, we often revert to issues like scalability, or the lack of talent. But there's another burning question that data science projects overlook at their peril: just how important is causation?


Is an almost perfect correlation enough? If so, why bother with causation? Judea Pearl, in The Book of Why, poses a hypothetical question: we know that oranges prevent scurvy, so isn't it enough to have that irrefutable correlation? But what if you ran out of oranges? Not knowing the real causal mechanism, vitamin C, you might try bananas.

We all know that smoking causes lung cancer (and lung disease and heart disease), and the correlation between smoking and these diseases is irrefutable. Why bother with causation? Because by understanding the actual causes, the actual mechanisms of action, it may be possible to learn about other causes we don't yet know. We know how to detect cancer, and we know how to treat it, successfully or not, but the truth is, no one knows what causes it.

Why did the Challenger blow up? Because the O-rings failed. But why did they fail? A disaster of that magnitude prompted an inquiry to find out. What caused the O-rings to fail? The temperature: it was too cold for launch. Finding the cause prevented similar disasters.

As Pearl wrote:

Causation is not merely an aspect of statistics; it is an addition to statistics, an enrichment that allows statistics to uncover workings of the world that traditional methods cannot.

With that backdrop, I posted an article on LinkedIn a few years ago, Data Science is Nice, but Where is Causation, which attracted many comments, mostly negative. One comment was: "They will tell you that trying to be certain of causation without running a controlled experiment is difficult or impossible." That's part of the problem: a lack of understanding of what causal analysis is. It's not about certainty; it's probabilistic.

Another reader gave a flat-earth comment: "Maybe causation isn't essential? If you know that something happens with a certain degree of predictability and regularity, do you really need to know why?" I call this the "Death of Why," something I wrote about in a couple of blog posts many moons ago.

I guess if I hit my head with a hammer a few times, I can reasonably assume that this is the cause of my brain injury. But in other situations, causality is hardly obvious. Back to Pearl's analogy of oranges and scurvy: eating oranges and not getting scurvy is something that "happens with a certain degree of predictability and regularity." It doesn't prepare you for what to do when you run out of oranges. And many important real-world problems fall into that category.

The application of quantitative methods, both correlative and causative, is much broader than business and management: scientific, intelligence, humanitarian, public health and safety, poverty, war, violence. What's the root cause of ISIS, for example? What's the root cause of income inequality between haves and have-nots, or between men and women? The commercialization of big data and analytics, especially the writings of people like Tom Davenport with a management/strategy consulting focus, diverts our attention from the potential (and often unnoticed) uses of these skills beyond, as I like to say, selling more shoes to fashionable ladies.

On my LinkedIn piece, another commenter took the correlation-is-good-enough approach: "it may not be worth the time and cost to do so when sufficient correlation can be found in the data to allow a good decision to be made."

I don't know why some see causation as "not worth the time and cost" when you consider how many terrible mistakes are made without it. Clinical trials for drugs are a good example: an inference drawn from the data turns out to be wrong almost half the time, with disastrous effects. Example: the assumption that drugs that raise HDL prevent heart attacks (in reality, HDL is an indicator of heart health, not a cause).

Example 2: in 2012, Amgen scientists tried to replicate 53 high-profile basic research findings in cancer and could only replicate six. A dirty little secret in data science: models decay in the computer, some very rapidly. Causal models don't: smoking causes heart disease; speeding causes deaths and property claims. I also have to take issue with the notion of "sufficient correlation." Any way to measure it (p-value, F-statistic, other statistics) is only a mathematical calculation from the data, completely devoid of any direction about cause and effect, except for the modeler's input (read: bias).

Here is another one that is repeated all the time: "Correlation over causation. Perfection is pragmatic, if combined with test-learn-adapt. Trust but verify!" I don't know how perfection crept into the picture. The causation models I've worked with are all couched in probabilities. That's a subtle but essential difference from 0-to-1 statistics like the p-value or R-squared, which are "statistics" computed from the data, not from the joint probabilities of a causal model.

My take

It comes down to how important your findings are. If you are building machine learning models for sales and marketing purposes, understanding what-drives-what at some level of accuracy is good enough. But if you're trying to understand why your supply chain seems to have hiccups once in a while, some causative analysis is what you need.

We are sort of drunk with data at the moment, which may explain why people are so uninterested in causation. What do you do when you explore causation? Counterfactual reasoning means thinking about alternative possibilities for past or future events: what might happen, or might have happened, if...? In other words, you imagine the consequences of something that is contrary to what actually happened ("counter to the facts").

Pearl, among others, formulated a theory of causal inference and developed a causal calculus to get at the problem of causation. At its center, it uses a DAG (directed acyclic graph) to reason across posterior and joint probabilities and arrive at an answer couched in probabilities. This is nothing like running regression models to see what influences what in the data. Honestly, it will take some time for commercial organizations to see the benefit of this.
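
To make that a little more concrete, here is a minimal sketch of one piece of that calculus, the backdoor adjustment, on a toy three-node DAG. The variables and numbers are made up purely for illustration: a hypothetical confounder Z drives both a treatment X and an outcome Y, so the observational P(Y | X) that a regression would report differs from the interventional P(Y | do(X)) you get by adjusting for the confounder.

```python
# A minimal sketch of Pearl-style backdoor adjustment on a toy DAG:
#   Z (confounder) -> X (treatment), Z -> Y (outcome), X -> Y.
# All probabilities below are invented for illustration only.

# P(Z): the confounder
p_z = {0: 0.6, 1: 0.4}

# P(X=x | Z=z): treatment depends on the confounder
p_x_given_z = {
    (1, 0): 0.2, (0, 0): 0.8,
    (1, 1): 0.7, (0, 1): 0.3,
}

# P(Y=1 | X=x, Z=z): outcome depends on both
p_y1_given_xz = {
    (0, 0): 0.10, (1, 0): 0.30,
    (0, 1): 0.40, (1, 1): 0.60,
}

def p_y1_conditional(x):
    """Observational P(Y=1 | X=x): what a correlation or regression sees."""
    num = sum(p_y1_given_xz[(x, z)] * p_x_given_z[(x, z)] * p_z[z] for z in p_z)
    den = sum(p_x_given_z[(x, z)] * p_z[z] for z in p_z)
    return num / den

def p_y1_do(x):
    """Interventional P(Y=1 | do(X=x)) via the backdoor adjustment:
    sum over z of P(Y=1 | X=x, Z=z) * P(Z=z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)

if __name__ == "__main__":
    # The observational contrast overstates the causal effect because
    # Z pushes both X and Y in the same direction.
    print("Observational lift:", p_y1_conditional(1) - p_y1_conditional(0))
    print("Causal (do) lift:  ", p_y1_do(1) - p_y1_do(0))
```

Running this toy example shows the observed lift in Y from X (about 0.35) is well above the causal lift (0.20), because Z pushes X and Y up together. That gap is exactly what a pure correlation model cannot see.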