How causal analysis and AI intersect - methods of causal inference

Neil Raden, October 24, 2023
Our foray into causal analysis is not yet complete. Until we define the methods of causal inference, we can't get to the deeper insights that causal analysis can provide.


This article details many of the methods and techniques of causal inference and is a companion to my prior piece, Can causal analysis change business? Applying causality in AI and beyond.

In that article, I detailed industry use cases where causality can have real impact. But to move ahead properly, we need to achieve a deeper grasp of causal inference.

Causal inference is the process of determining the effect of one variable on another beyond mere association. It's fundamental in many scientific disciplines, from epidemiology to economics, where understanding cause-and-effect relationships is essential. This article is somewhat technical and is best read with a basic understanding of statistical techniques.

Here's an overview of methods and approaches for causal inference:

Randomized Controlled Trials (RCTs)

RCTs involve experiments where units (individuals, businesses, or regions) are randomly assigned to one of two groups: one that experiences a particular intervention and one that does not. Due to this random assignment, any outcome differences between these groups can typically be ascribed to the intervention. However, RCTs can be logistically challenging, costly, or sometimes ethically complex. Though RCTs are often considered the "gold standard" in research due to their ability to demonstrate causality, they come with drawbacks, like all research methodologies. Here are some of the main limitations of RCTs:

  • Generalizability (External Validity): Results from RCTs might not always apply to broader populations or different settings due to the controlled nature of the trials or specific selection criteria for participants.
  • Cost and Time: RCTs can be expensive and time-consuming, especially if they require long follow-up periods, large sample sizes, or both.
  • Ethical Concerns: It can be unethical to withhold a potential benefit from participants in the control group, especially if existing evidence suggests the intervention is effective.
  • Attrition: Participants may drop out of the study, leading to imbalances between the control and intervention groups and potentially biased results.
  • Limited Scope: RCTs typically focus on single or limited interventions, which may not capture the complexity of real-world scenarios where multiple factors or interventions interact.
  • Potential for Hawthorne Effect: Participants' knowledge that they are part of a study might influence their behavior, potentially skewing the results.
  • Difficulty in Blinding: In some RCTs, it may be challenging to blind participants or providers to the intervention, which can introduce biases.
  • Non-adherence: Participants might not always adhere to the assigned intervention, leading to contamination between groups and potentially diluting the observed effects.
  • Publication Bias: RCTs with significant results are more likely to be published than those with null results, which can skew the perceived effectiveness of interventions in the literature.
  • Lack of Flexibility: Once an RCT begins, there is little flexibility to modify the intervention or study protocol, even if early results suggest that modifications would be beneficial.
  • Potential for Random Imbalances: Even with randomization, there's a chance that confounding variables (third variables that influence both the independent and dependent variables) could be unevenly distributed between the intervention and control groups, especially in smaller trials.

RCTs offer robust evidence about causality, but their limitations must be weighed in the context of the research question and the practicalities of the study. They are one tool in a broader toolkit of research methodologies, each with its strengths and weaknesses.
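
To make the random-assignment logic concrete, here is a minimal Python sketch of a simulated two-arm trial. The sample size, the baseline outcome distribution, and the true effect of 2.0 are illustrative assumptions of mine, not figures from any real study.

```python
import random
import statistics

def simulate_rct(n_units=10_000, true_effect=2.0, seed=0):
    """Simulate a two-arm trial and estimate the effect as a difference in means."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_units):
        # Assumed outcome the unit would show without the intervention
        baseline = rng.gauss(10.0, 3.0)
        # Random assignment: a fair coin flip, unrelated to the baseline
        if rng.random() < 0.5:
            treated.append(baseline + true_effect)
        else:
            control.append(baseline)
    return statistics.mean(treated) - statistics.mean(control)

estimate = simulate_rct()
print(f"estimated effect: {estimate:.2f} (true effect: 2.0)")
```

Because the coin flip ignores every characteristic of the unit, the two groups differ only by chance, and the simple difference in means lands close to the true effect.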

Counterfactual/potential outcomes framework

This approach revolves around imagining alternate scenarios. Specifically, what would the outcome have been for a unit if it hadn't experienced the intervention? Though conceptually straightforward, estimating these hypothetical scenarios often requires strong assumptions or advanced methods. Drawbacks:

  • Unobservable Counterfactuals: The primary challenge with this framework is that for any given unit (e.g., an individual, company, or region), we can only observe one potential outcome: either with or without the specific event or intervention in question. We must infer the unobserved alternative scenario, often based on strong assumptions.
  • Strong Assumptions: Estimating causal effects often requires assumptions like "ignorability" or exchangeability. These assumptions are untestable and may not always be valid in every context.
  • Positivity Assumption: Another core assumption is that every unit has a positive probability of being subjected to each possible scenario. Violations can complicate the identification of causal effects.
  • External Validity: The inferences derived from the potential outcomes of a particular study might not apply to broader contexts, different populations, or other times.
  • Sensitivity to Model Specification: When models adjust for potential confounders or handle missing data, the specific model used can influence the results.
  • Challenges with Continuous or Time-Varying Events: The framework is most straightforward for binary events. Handling continuous events or those that change over time introduces added complexity.
  • Difficulty with Interactions and Mediators: Identifying variations in effects for different subgroups or understanding the pathways through which an effect operates can be challenging and may require added assumptions.
  • Requirement for Large Sample Sizes: A substantial amount of data may be necessary to achieve precise estimates, especially when considering nuanced effects or rare events.
  • Complexity of Multiple Potential Outcomes: Analyzing situations with more than two potential scenarios or conditions can increase the complexity of the approach.
  • Communication Challenges: While the framework is conceptually straightforward, its nuances and subtleties can be difficult to convey to a general audience, leading to potential misunderstandings.

While the Counterfactual/Potential Outcomes Framework provides a clear conceptual basis for considering causality, it's crucial to be aware of its limitations and the contexts in which it is applied.
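
A small simulation can illustrate why the unobserved counterfactual matters. In the sketch below, both potential outcomes are generated for every unit, which is only possible in a simulation. The `health` confounder, the assignment probabilities, and the true effect of 1.5 are all assumptions made up for illustration.

```python
import random
import statistics

rng = random.Random(1)
units = []
for _ in range(20_000):
    health = rng.gauss(0.0, 1.0)   # hypothetical confounder
    y0 = 5.0 + health              # potential outcome without the intervention
    y1 = y0 + 1.5                  # potential outcome with it (assumed effect: 1.5)
    # Healthier units are more likely to be treated, so assignment is not ignorable
    treated = rng.random() < (0.8 if health > 0 else 0.2)
    observed = y1 if treated else y0   # only one potential outcome is ever observed
    units.append((y0, y1, treated, observed))

# Knowable only inside a simulation: the average of the unit-level effects
true_ate = statistics.mean(y1 - y0 for y0, y1, _, _ in units)

# What an analyst could compute from observed data alone
naive = (statistics.mean(obs for _, _, t, obs in units if t)
         - statistics.mean(obs for _, _, t, obs in units if not t))

print(f"true average effect: {true_ate:.2f}, naive comparison: {naive:.2f}")
```

The naive comparison overstates the effect because treated units were healthier to begin with, which is exactly the situation the ignorability assumption rules out.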

Causal Graphs and Directed Acyclic Graphs (DAGs)

DAGs visually represent and analyze relationships among variables, helping to map out and understand potential causal structures in various domains. While they aid conceptual clarity and are powerful tools for reasoning about causal relationships, empirical data is still needed for conclusive causal insights.

Development of Causal Graphs and DAGs:

  • Conceptualization: The first step in developing a DAG is conceptualizing the problem or system. This involves determining relevant variables and hypothesizing about potential causal relationships based on domain knowledge, existing research, or exploratory analysis.
  • Node Representation: Each variable or concept is represented as a node (or vertex) in a DAG. Nodes are the entities for which you're trying to establish relationships.
  • Directional Arrows: Arrows (or edges) between nodes represent direct causal relationships. The direction of the arrow indicates the direction of the causal effect. For example, an arrow pointing from Node A to Node B suggests that A has a causal impact on B.
  • No Cycles: The "acyclic" nature of DAGs means they do not have any feedback loops or cycles. In other words, starting from any node and following the arrows, you should never return to the same node.
  • Adjusting for Confounding: In DAGs, confounders can be visually identified as common causes of two or more variables.
  • Testing Assumptions: While DAGs visually represent assumptions about causal structures, these assumptions should be tested, where possible, using empirical data. This often involves statistical methods or experiments.
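
The steps above can be sketched in a few lines of Python using only the standard library. The diagram itself (age, gym access, exercise, heart disease) is a hypothetical example, and the helper names `is_acyclic`, `descendants`, and `common_causes` are my own, not part of any causal inference package.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical diagram: each node maps to the nodes it directly causes
children = {
    "age": ["exercise", "heart_disease"],
    "gym_access": ["exercise"],
    "exercise": ["heart_disease"],
    "heart_disease": [],
}

def is_acyclic(children):
    """Return True if following the arrows never leads back to the starting node."""
    # TopologicalSorter expects node -> predecessors, so invert the edge map
    parents = {node: set() for node in children}
    for node, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(node)
    try:
        list(TopologicalSorter(parents).static_order())
    except CycleError:
        return False
    return True

def descendants(children, node, blocked=frozenset()):
    """Nodes reachable from `node` along arrows, without passing through `blocked`."""
    seen, stack = set(), [node]
    while stack:
        for kid in children.get(stack.pop(), []):
            if kid not in seen and kid not in blocked:
                seen.add(kid)
                stack.append(kid)
    return seen

def common_causes(children, treatment, outcome):
    """Nodes that reach the treatment and also reach the outcome by a path
    that does not run through the treatment (candidate confounders)."""
    return {
        n for n in children
        if n not in (treatment, outcome)
        and treatment in descendants(children, n)
        and outcome in descendants(children, n, blocked={treatment})
    }

print(is_acyclic(children))
print(common_causes(children, "exercise", "heart_disease"))
```

In this toy diagram, age is flagged as a common cause of exercise and heart disease, while gym access is not, because it only affects heart disease through exercise.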

Technology Employed:

  • Software for DAG Creation: Numerous software packages and tools are available for drawing and working with DAGs. For a complete description of DAG creation and current tools, see Online Causal Diagram (and DAG) drawing/editing tools.
  • Statistical Software: Once a DAG is conceptualized, statistical software can be used to analyze it against data.
  • Machine Learning and Automated Causal Inference: There's growing interest in using machine learning techniques to automate causal discovery. While DAGs are traditionally drawn based on expert knowledge, these algorithms try to learn the causal structure directly from data. However, this task is challenging, and the results often require interpretation and validation.

A Bayesian network is a type of DAG in which each node has a conditional probability table describing its distribution given its parents. Software like BayesiaLab or the `bnlearn` package in R can be used to model and compute with Bayesian networks.
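
For readers who want to see the mechanics without a dedicated package, here is a minimal sketch of a three-node Bayesian network computed by brute-force enumeration in plain Python. The rain/sprinkler/wet-grass structure and every probability in it are illustrative assumptions; tools like BayesiaLab or `bnlearn` use far more efficient algorithms.

```python
from itertools import product

# Hypothetical network: rain -> wet grass <- sprinkler
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: 0.3, False: 0.7}
# Conditional probability table: P(wet = True | rain, sprinkler)
p_wet_given = {
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.00,
}

def joint(rain, sprinkler, wet):
    """Joint probability as the product of each node's table entry."""
    p_true = p_wet_given[(rain, sprinkler)]
    return p_rain[rain] * p_sprinkler[sprinkler] * (p_true if wet else 1.0 - p_true)

# Marginal probability that the grass is wet, summing over the parents
p_wet = sum(joint(r, s, True) for r, s in product([True, False], repeat=2))

# Conditional probability of rain given wet grass, by Bayes' rule
p_rain_given_wet = sum(joint(True, s, True) for s in [True, False]) / p_wet

print(f"P(wet) = {p_wet:.4f}, P(rain | wet) = {p_rain_given_wet:.4f}")
```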

Once a DAG is formulated, it's essential to visualize it effectively. Tools like Graphviz or network visualization libraries in Python (e.g., NetworkX) or R can be handy.

DAGs offer a visual and conceptual representation of causal structures, making complex relationships more understandable. Combining expert knowledge with modern software tools and statistical methods allows for a more rigorous exploration and validation of causal hypotheses.

Propensity Score Methods

Propensity score methods are primarily used to reduce confounding effects in observational studies. They involve matching or weighting units based on their propensity scores, that is, the probability of receiving the treatment given observed characteristics. Despite their popularity, these methods have several drawbacks:

  • Unobserved Confounding: Propensity scores adjust for observed confounders but cannot account for unobserved or hidden confounders. If significant unmeasured variables affect treatment assignment and the outcome, bias remains.
  • Model Dependency: The accuracy of propensity score methods heavily relies on the correct specification of the propensity score model. Misspecifying the model can lead to biased results.
  • Overfitting: The propensity score model can overfit the data, especially with many covariates, leading to imbalances in the matched or weighted sample.
  • Support Issues: Propensity score methods require "common support" or overlapping propensity scores between treated and untreated groups. If there's limited overlap, it can be challenging to make credible comparisons, especially at the tails of the propensity score distribution.
  • Matching Challenges: Even with a well-estimated propensity score, achieving suitable matches can be difficult, especially with high-dimensional data. Also, some treated units may have no appropriate match, so those observations are excluded.
  • Quality of Matches: Nearest neighbor matching, a common technique, does not ensure exact matches on the propensity score or the covariates. Some matched pairs may still differ substantially in observed characteristics.
  • Assumptions: The primary premise behind propensity score methods is the "ignorability" or exchangeability assumption. This posits that, conditional on the propensity score, treatment assignment is independent of the potential outcomes. This assumption is strong and untestable.
  • Lack of Transparency: While propensity score methods can simplify the causal analysis by reducing multiple confounders to a single score, this simplification can sometimes obscure imbalances and limit the transparency of the study.
  • Weighting Variance: Weighting methods, like inverse probability of treatment weighting (IPTW), can introduce high variance in the estimates if some weights are very large. This can lead to less precise estimates.
  • Complexity: Implementing propensity score methods requires a deep understanding of the underlying assumptions and potential pitfalls. Mistakes in implementation can lead to misleading results.
  • Generalizability: In matched analyses, generalizability may be limited if a substantial portion of the sample is excluded due to a lack of matches. The resulting matched sample might not represent the broader population.


While propensity score methods offer a valuable tool for reducing confounding in observational studies, researchers must be aware of their limitations. It's often recommended to conduct sensitivity analyses and consider alternative methods for causal inference to validate findings.
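
As a rough illustration of weighting by the propensity score, the sketch below simulates an observational dataset with a single binary confounder and applies inverse probability of treatment weighting. The variable `older`, the treatment probabilities, and the true effect of 1.0 are assumptions chosen for the example. With one binary covariate, the propensity score can be estimated as the treated share in each group; real applications typically fit a model such as logistic regression.

```python
import random

rng = random.Random(2)
rows = []
for _ in range(20_000):
    older = rng.random() < 0.5                         # hypothetical observed confounder
    treated = rng.random() < (0.7 if older else 0.3)   # older units are treated more often
    outcome = 1.0 * treated + 2.0 * older + rng.gauss(0.0, 1.0)  # assumed true effect: 1.0
    rows.append((older, treated, outcome))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def weighted_mean(pairs):
    pairs = list(pairs)
    return sum(w * y for w, y in pairs) / sum(w for w, _ in pairs)

# Propensity score: estimated probability of treatment given the covariate
propensity = {g: mean(t for o, t, _ in rows if o == g) for g in (True, False)}

# Unadjusted comparison, biased by the confounder
naive = mean(y for _, t, y in rows if t) - mean(y for _, t, y in rows if not t)

# IPTW with normalized weights: 1/e(x) for treated, 1/(1 - e(x)) for untreated
iptw = (weighted_mean((1 / propensity[o], y) for o, t, y in rows if t)
        - weighted_mean((1 / (1 - propensity[o]), y) for o, t, y in rows if not t))

print(f"naive: {naive:.2f}, IPTW: {iptw:.2f}, true effect: 1.0")
```

The weighting only corrects for `older` because that variable is observed. An unmeasured confounder would leave the bias in place, as the first drawback in the list notes.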

Some other methods

The Synthetic Control Method builds a "synthetic" comparator by blending various units that did not experience the intervention, aiming to closely mirror the characteristics of a unit that did. The method attempts to infer the intervention's effect by comparing the outcomes of the treated unit and its synthetic counterpart.
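
Here is a deliberately tiny sketch of the blending idea, with two donor units and a grid search over a single weight. All numbers are made up for illustration, and real applications use many donors, additional covariates, and a proper constrained optimizer rather than a grid.

```python
# Made-up outcome series: four pre-intervention periods, two post-intervention periods
donor_a_pre, donor_a_post = [10.0, 12.0, 14.0, 16.0], [18.0, 20.0]
donor_b_pre, donor_b_post = [20.0, 19.0, 18.0, 17.0], [16.0, 15.0]
treated_pre, treated_post = [14.0, 14.8, 15.6, 16.4], [22.2, 23.0]

def blend(w, a, b):
    """Weighted average of two donor series, with weights w and 1 - w."""
    return [w * x + (1.0 - w) * y for x, y in zip(a, b)]

def pre_period_error(w):
    """Squared distance between the treated unit and the blend before the intervention."""
    synthetic = blend(w, donor_a_pre, donor_b_pre)
    return sum((t - s) ** 2 for t, s in zip(treated_pre, synthetic))

# Choose the weight in [0, 1] that best reproduces the treated unit's pre-period path
best_w = min((i / 100 for i in range(101)), key=pre_period_error)

# Estimated effect: average post-period gap between the treated unit and the blend
synthetic_post = blend(best_w, donor_a_post, donor_b_post)
effect = sum(t - s for t, s in zip(treated_post, synthetic_post)) / len(treated_post)

print(f"weight on donor A: {best_w:.2f}, estimated effect: {effect:.2f}")
```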

You might turn to Observational Studies with Matching when you can't randomize. Here, the goal is to pair units that experienced an intervention with those that didn't, based on similar observed characteristics. This helps reduce biases from confounding factors, variables that influence both the dependent variable and independent variable, causing a spurious association. However, there can still be concerns about biases from unobserved factors.
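
A minimal sketch of the pairing step, assuming a single observed characteristic (age) and nearest-neighbor matching with replacement. The units and outcomes are invented for illustration.

```python
# Each unit is (age, outcome); all values are hypothetical
treated = [(30, 12.0), (45, 15.0), (60, 19.0)]
controls = [(28, 10.0), (33, 10.5), (44, 13.0), (50, 14.0), (61, 17.0), (70, 18.0)]

def nearest_control(age, pool):
    """Pick the untreated unit whose age is closest to the given age."""
    return min(pool, key=lambda unit: abs(unit[0] - age))

# Pair every treated unit with its closest control (controls may be reused)
pairs = [(t, nearest_control(t[0], controls)) for t in treated]

# Average outcome difference within pairs: the effect on the treated units
att = sum(t[1] - c[1] for t, c in pairs) / len(pairs)

for t, c in pairs:
    print(f"treated age {t[0]} matched to control age {c[0]}")
print(f"estimated effect on the treated: {att:.2f}")
```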

Difference-in-differences (DiD) measures the changes in outcomes of two groups (those with and without an intervention) before and after the intervention's implementation. It assumes that both groups would have followed similar trends if neither had experienced the intervention, an assumption that may not always hold.
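
The arithmetic is simple enough to show directly. The figures below are invented for illustration.

```python
from statistics import mean

# Hypothetical outcomes for each group and period
outcomes = {
    ("treated", "before"): [20.0, 22.0, 18.0],
    ("treated", "after"): [28.0, 30.0, 26.0],
    ("control", "before"): [17.0, 19.0, 18.0],
    ("control", "after"): [20.0, 22.0, 21.0],
}

# Change over time within each group
change_treated = mean(outcomes[("treated", "after")]) - mean(outcomes[("treated", "before")])
change_control = mean(outcomes[("control", "after")]) - mean(outcomes[("control", "before")])

# The control group's change stands in for the trend the treated group
# would have followed without the intervention
did = change_treated - change_control

print(f"treated change: {change_treated}, control change: {change_control}, DiD: {did}")
```

Here the treated group improved by 8 and the control group by 3, so the estimate attributes the remaining 5 to the intervention, provided the similar-trends assumption holds.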

Regression Discontinuity (RD) comes into play when a specific threshold determines who experiences an intervention (e.g., grants to projects scoring above a particular grade). Comparing outcomes just above and below this threshold can provide insights about the intervention's effect, primarily focusing on units near the threshold.
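
One common way to make this comparison is to fit a straight line on each side of the threshold and measure the gap between the two lines at the cutoff. The sketch below does this on simulated data. The threshold of 60, the bandwidth of 10, the slope, and the true jump of 3.0 are all assumptions for the example.

```python
import random

rng = random.Random(3)
threshold, bandwidth = 60.0, 10.0
data = []
for _ in range(20_000):
    score = rng.uniform(0.0, 100.0)
    funded = score >= threshold   # the threshold alone decides who gets the grant
    outcome = 0.05 * score + (3.0 if funded else 0.0) + rng.gauss(0.0, 1.0)
    data.append((score, outcome))

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return mean_y - slope * mean_x, slope

# Keep only units close to the threshold
below = [(s, y) for s, y in data if threshold - bandwidth <= s < threshold]
above = [(s, y) for s, y in data if threshold <= s <= threshold + bandwidth]

b0, b1 = fit_line(below)
a0, a1 = fit_line(above)

# Gap between the two fitted lines exactly at the threshold
rd_estimate = (a0 + a1 * threshold) - (b0 + b1 * threshold)
print(f"estimated jump at the threshold: {rd_estimate:.2f} (true jump: 3.0)")
```

The estimate speaks only to units near the threshold. Nothing in this comparison says what the grant would do for a project scoring 20 or 95.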

Structural Equation Modeling (SEM) is a versatile statistical method that combines factor analysis and multiple regression. It's used to analyze structural relationships between measured variables and latent constructs. While SEM is a powerful tool for understanding complex relationships and underlying constructs, researchers must know its limitations. Proper model specification, adequate sample size, and a thorough understanding of the underlying assumptions and their implications are essential for valid SEM applications.

My take

Selecting the correct method depends on the research question, data availability, and the feasibility of each approach. In many instances, a combination of techniques can offer more robust insights.
