Causal AI Foundational Concepts
Fundamentally, confounders are common causes for multiple variables. These variables are hence confounded. This poses two major problems for modeling: spurious correlations and biased effect estimation.
A common cause of confounded variables often leads to spurious correlations. This is especially problematic if the confounder is latent. In this case, this correlation can be mistaken for a causal relationship. There are many examples of spurious correlations, for example ice cream sales and shark attacks confounded by the time of the year (a more extensive list of spurious correlations is available here). Mistaking spurious correlations for causal relationship can lead to incorrect conclusions and ultimately bad decisions. In particular, you may decide to make an intervention on a variable spuriously correlated with another variable. Since in reality there is no causal relationship between the two variables, this intervention would not have the desired effect.
Confounding can be identified during causal discovery, since by definition, any variables that share at least one ancestor in a causal graph are confounded. Some causal discovery algorithms can automatically detect latent confounders, while some confounding can be identified through domain expertise during human guided causal discovery.
Even when confounding relationships are known (for example, by defining an explicit causal graph), they can pose problems for causal inference. In particular, if multiple causal drivers of a variable of interest are themselves confounded, it may be difficult to calculate the effect of the causal drivers on the variable of interest. This is because the correlational association between causal drivers will be entangled with their causal associations to the variable of interest. This can in turn lead to inaccurate conclusions, including the famous Simpson’s Paradox. Special tools such as Doubly Robust estimation methods can safeguard against this.
Mediators are intermediate variables that help explain the relationship between a cause and an effect. In the context of causal inference, identifying mediators is important because it can help us understand the underlying mechanism that explains how a particular cause leads to a particular effect. By identifying mediators, we can gain a deeper understanding of the causal relationships between variables and potentially identify new ways to intervene and promote positive outcomes.
A classic example of a mediator is the relationship between smoking and lung cancer. Smoking is a well-known risk factor for lung cancer, but the mechanism by which smoking causes lung cancer is not fully understood. However, researchers have identified several potential mediator variables that help explain this relationship. One of these mediator variables is the presence of harmful chemicals in cigarettes, such as tar and nicotine. When a person smokes, they inhale these harmful chemicals, which can damage the cells in their lungs and increase the risk of cancer.
Another mediator variable in the relationship between smoking and lung cancer is chronic inflammation. Smoking can cause inflammation in the lungs, which can lead to DNA damage and an increased risk of cancer. Chronic inflammation can also impair the immune system, making it less effective at detecting and destroying cancer cells.
By identifying mediator variables, researchers can gain a more complete understanding of the causal pathway between treatment and effects. This can help guide the development of more effective interventions to prevent or treat negative outcomes, or optimize for positive ones.
Conditional independence refers to the relationship between two variables, given a third variable. If two variables are conditionally independent given a third variable, then their relationship disappears after controlling for the third variable. Conditional independence is extremely important in understanding cause-and-effect relationships because it can help us identify whether two variables are truly related, or whether their relationship is merely a coincidence or the result of a third, confounding variable. It is the basis of many causal discovery algorithms - which take us from data to causal graphs.
Let's use the example of shark attacks and ice cream sales. Suppose we observe a correlation between the two variables: on days with high ice cream sales, there also tend to be more shark attacks. Without understanding the concept of conditional independence, we might be tempted to conclude that ice cream sales cause shark attacks. However, there could be a third variable, such as temperature, that is causing both ice cream sales and shark attacks. For example, on hot days, people may be more likely to buy ice cream and also more likely to go swimming, increasing the likelihood of shark attacks. In this case, ice cream sales and shark attacks are correlated, but they are not causally related.
Conditional independence helps us identify whether a correlation between two variables is likely to be causal or not. Specifically, two variables are conditionally independent given a third variable if, after controlling for the third variable, the relationship between the two variables disappears. In the case of shark attacks and ice cream sales, if we control for temperature, we might find that the correlation between ice cream sales and shark attacks disappears. This would suggest that temperature is a confounding variable that is driving the correlation between ice cream sales and shark attacks, and that ice cream sales themselves are not causally related to shark attacks.
In summary, understanding conditional independence is important in causal inference because it helps us distinguish between correlation and causation. By controlling for confounding variables and identifying mediator variables, we can gain a better understanding of the true causal relationships between variables and make more accurate predictions and interventions.
Interventions refer to actions or manipulations that are directly made to a system or process.
Take for instance a company that wants to optimize for customer retention. There are many variables that can affect retention, some of which may be changed by the business, in this context interventions can take many forms, depending on the nature of the product or service being offered and the reasons why the customer is considering not renewing. For example, a business may offer the customer a discount or other incentive to renew their subscription, or they may offer additional features or services to entice the customer to stay.
A counterfactual is a statement about what would have happened if a certain event had not occurred, or if a certain intervention had been made. It is a way of reasoning about causality, by comparing what actually happened to what would have happened under different conditions.
Counterfactuals are an inherently causal concept. It implies the knowledge of what is the impact of doing something, even if that was never measured before. ML and classical statistics cannot deal with counterfactuals since they’re just extrapolating from the data that was already seen.
Some examples why are counterfactuals important:
● In any attribution or root cause analysis problem, we need to “imagine” a parallel world where we fail to perform a certain action (let’s say, buy a TV ad), and measure the change in the revenue from not doing that.
● To do scenario modeling or stress testing: we need to imagine a world in which we take a certain factor (let’s say the FED funds rate) and stress it to a level which hasn’t been seen before - then measure the impact on our KPIs.
● To generate a next-best-action, we need to understand what is the best out of many possible interventions - to do so we also need to extrapolate the impact of interventions that are possibly not present in the training data, and select the best one of them.