Machine learning models cannot, by themselves, distinguish between causal and environmental features.

Shortcut learning

Often we observe shortcut learning: the model learns some dataset-dependent shortcut (e.g. which machine was used to take the X-ray) to make predictions. This is very brittle and usually does not generalize.

Shortcut learning happens when there are correlations in the training set between causal and non-causal features that do not hold at test time. In most cases our object of interest should be the main focus, not the environment around it: a camel in a grassland should still be recognized as a camel, not a cow. One possible solution is to engineer invariant representations that are independent of the environment, i.e. an encoder that produces such representations.
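The camel/grassland idea can be sketched numerically. The following is a toy illustration (all data, features and thresholds are synthetic assumptions, not from the lecture): the background feature correlates with the label at training time but not at test time, so a classifier that reads the background looks good in training and collapses at test time, while one that reads the causal feature keeps working.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

def make_split(p_bg_matches_label):
    """Label y = animal (1 = camel); the background matches the label
    (1 = desert, 0 = grass) with probability p_bg_matches_label."""
    y = rng.integers(0, 2, n)
    animal = y + rng.normal(0, 0.5, n)                       # weak causal signal
    match = rng.random(n) < p_bg_matches_label
    bg = np.where(match, y, 1 - y) + rng.normal(0, 0.1, n)   # strong spurious signal
    return np.stack([animal, bg], axis=1), y

X_train, y_train = make_split(0.95)   # camels almost always on sand in training
X_test, y_test = make_split(0.05)     # camels mostly on grass at test time

shortcut = lambda X: (X[:, 1] > 0.5).astype(int)   # reads only the background
causal = lambda X: (X[:, 0] > 0.5).astype(int)     # reads only the animal

train_acc = (shortcut(X_train) == y_train).mean()  # high: shortcut works here
test_acc = (shortcut(X_test) == y_test).mean()     # collapses on the new environment
causal_test_acc = (causal(X_test) == y_test).mean()
print(train_acc, test_acc, causal_test_acc)
```

The causal feature is noisier than the background, which is exactly why a model trained naively tends to prefer the shortcut.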

Counterfactual Invariance

A first notion of counterfactual invariance 🟨

Suppose we have a function $f : \mathcal{X} \to \mathcal{Y}$. Let’s define a counterfactual for a random variable $X$. Let $W$ be a random variable representing the non-causal features, e.g. the background. We write $X(\omega)$ for the value of $X$ when we force $W=\omega$, i.e. we force the background to be some specific thing. We would like to formalize the following idea: the output of $f$ should depend only on the causal content of $X$, not on $\omega$.

We say $f$ is counterfactually invariant if the following holds: $f(X(\omega)) = f(X(\omega'))$ for any $\omega, \omega'$ in the support of $W$. How does this happen in practice? Ideally we would like to train on every counterfactual, but this is practically impossible (it would take too many resources to bring camels to the Himalayas just to create that counterfactual!)
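The definition can be checked directly on a toy generator. A minimal sketch, assuming a hypothetical counterfactual generator `X(content, omega)` that simply concatenates object features with background features: a function $f$ that reads only the object coordinates satisfies $f(X(\omega)) = f(X(\omega'))$ for every background we try.

```python
import numpy as np

rng = np.random.default_rng(0)

def X(content, omega):
    """Hypothetical counterfactual generator: the same object `content`
    rendered on background `omega` (here just concatenated features)."""
    return np.concatenate([content, omega])

# f looks only at the first k coordinates (the object), ignoring the background
k = 3
f = lambda x: float(np.sum(x[:k]))

content = rng.normal(size=k)
backgrounds = [rng.normal(size=2) for _ in range(5)]  # several choices of omega
outputs = [f(X(content, w)) for w in backgrounds]
print(all(np.isclose(o, outputs[0]) for o in outputs))  # True: counterfactually invariant
```

Of course, in real data the background is entangled with the object pixels, which is why constructing such an $f$ is the hard part.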

Causal Graphs 🟩

In causal scenarios our input features $X$ indeed have a causal relationship with the output $Y$.

Suppose we want to classify cancer and we have three features, $X = (location, CO_{2}, smoke)$, where $location \in \mathbb{R}^{2}$, $CO_{2}\in \mathbb{R}$, and $smoke$ is a boolean (or categorical) variable; the background variable is $W = city$. We can build a causal graph of the possible relations between these variables. Causality-20241013221047812

Anti-Causal 🟩

Let’s consider an anti-causal scenario: we want to predict celiac disease, and we have stomach ache, fatigue and income as features. Our background variable is the job. In these kinds of scenarios, the output random variable $Y$ has a causal relationship with the features in $X$.

Causality-20241013221420597

Why non-causal associations arise

There are two main causes of non-causal associations in causal graphs:

  1. Confounding variables: there exists another random variable $U$ that affects both variables of interest.
  2. Selection bias: a variable $S$ filters the dataset based on the features we want.

Here $X$ denotes the input features, $Y$ the output, and $W$ the environment. We will keep this nomenclature throughout these notes.
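Selection bias in particular can conjure an association out of nothing. A small sketch (all numbers are illustrative assumptions): two traits are independent in the population, but once a selection variable $S$ keeps only samples where their sum is large, a strong negative correlation appears between them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent traits in the full population
x = rng.normal(size=50_000)
w = rng.normal(size=50_000)
corr_pop = np.corrcoef(x, w)[0, 1]
print(round(corr_pop, 2))  # near 0: no association in the population

# Selection: S keeps only samples where x + w is large
keep = x + w > 1.5
corr_sel = np.corrcoef(x[keep], w[keep])[0, 1]
print(round(corr_sel, 2))  # clearly negative among the selected samples
```

Intuitively, among selected samples a low value of one trait must be compensated by a high value of the other, which is exactly the non-causal association selection bias produces.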

Simpson’s Paradox 🟨

We have treatments A and B and would like to know which one is better. We send 350 patients to A and 350 to B. Suppose we observe that with treatment A $78\%$ recovered, while with B $83\%$ recovered; treatment B seems better. But if there is another variable at play, e.g. the severity of the illness, the picture can change completely! Such variables are called confounding variables, and we should keep them in mind when designing an experiment.

Causality-20241013221959683
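The flip can be computed directly. The counts below are from the classic kidney-stone study, whose aggregate recovery rates happen to match the $78\%$/$83\%$ figures above (I am assuming that is the example being referenced; take the numbers as illustrative): in aggregate B looks better, yet A is better within *both* severity groups.

```python
# (recovered, total) per treatment and severity
counts = {("A", "mild"): (81, 87), ("A", "severe"): (192, 263),
          ("B", "mild"): (234, 270), ("B", "severe"): (55, 80)}

def rate(treatment, severities):
    """Recovery rate of a treatment, aggregated over the given severities."""
    r = sum(counts[(treatment, s)][0] for s in severities)
    n = sum(counts[(treatment, s)][1] for s in severities)
    return r / n

# Aggregated over both groups: B looks better
print(round(rate("A", ["mild", "severe"]), 2))  # 0.78
print(round(rate("B", ["mild", "severe"]), 2))  # 0.83

# Stratified by severity: A is better in BOTH groups
print(rate("A", ["mild"]) > rate("B", ["mild"]))      # True
print(rate("A", ["severe"]) > rate("B", ["severe"]))  # True
```

The trick is that severe cases were disproportionately assigned to A, so severity confounds the treatment/recovery association.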

A formal definition for counterfactual invariance 🟥

This was proven by Veitch et al. a few years ago; it is still an active area of research.

For an estimator $f$ to be counterfactually invariant we need:

  • Anti-causal scenario: $f(X) \perp W \mid Y$
  • Causal scenario without selection: $f(X) \perp W$
  • Causal scenario with selection: we need $f(X) \perp W \mid Y$, as long as $(X_{W \& Y}, X_{Y}^{\perp})$ do not influence $Y$, i.e. we have $Y \perp X \mid X_{W}^{\perp}, W$.

Therefore, connecting back to the intuitive notion of counterfactual invariance, we would like $f(X) \mid W = w, Y = y$ to have the same distribution as $f(X) \mid W = w', Y = y$ for any $w, w'$.
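The anti-causal condition $f(X) \perp W \mid Y$ can be probed empirically. A rough sketch on a synthetic anti-causal model (the model, the predictors and the correlation-based check are all illustrative assumptions, not from Veitch et al.): within each stratum of $Y$, a predictor that ignores the environment shows no correlation with $W$, while one that leaks environment features does.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic anti-causal model: Y causes one feature, the environment W another
Y = rng.integers(0, 2, n)
W = rng.integers(0, 2, n)
X1 = Y + rng.normal(0, 1, n)   # part of X caused by Y
X2 = W + rng.normal(0, 1, n)   # part of X caused by the environment

def max_cond_corr(a, b, y):
    """Largest |corr(a, b)| within the strata of y -- a crude proxy
    for checking the conditional independence a ⊥ b | y."""
    return max(abs(np.corrcoef(a[y == v], b[y == v])[0, 1]) for v in (0, 1))

f_inv = X1        # predictor that ignores the environment
f_bad = X1 + X2   # predictor that leaks the environment

print(max_cond_corr(f_inv, W, Y))  # near 0: f(X) ⊥ W | Y plausibly holds
print(max_cond_corr(f_bad, W, Y))  # clearly positive: the condition is violated
```

Zero conditional correlation is of course weaker than conditional independence, but it is a cheap first diagnostic.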

V-structures 🟩

The criterion behind these independences is called d-separation. Causality-20241017153344164

Let’s take for example three random variables $A, B, C$ forming a chain $A \to B \to C$ (a Markov chain). The nice property we get is that, conditioned on $B$, the variables $A$ and $C$ are independent:

$$ p(A, C \mid B) = p(A \mid B)\,p(C \mid B) $$

A v-structure (or collider) is a graph of the form $A \to B \leftarrow C$: here $A$ and $C$ are marginally independent, but if I know $B$, then $A$ and $C$ become related to each other.
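Both facts can be verified exactly on small discrete distributions (the probability tables below are arbitrary illustrative choices): for a chain, $p(A, C \mid B) = p(A \mid B)\,p(C \mid B)$ holds for every value of $B$; for a collider with $B = A \oplus C$, learning $C$ changes the distribution of $A$ once $B$ is known.

```python
import itertools

# Chain A -> B -> C: p(a, b, c) = p(a) p(b|a) p(c|b)  (tables are arbitrary)
pA = [0.6, 0.4]
pB_A = [[0.7, 0.3], [0.2, 0.8]]   # p(b | a)
pC_B = [[0.9, 0.1], [0.4, 0.6]]   # p(c | b)
joint = {(a, b, c): pA[a] * pB_A[a][b] * pC_B[b][c]
         for a, b, c in itertools.product((0, 1), repeat=3)}

# Check p(A, C | B) = p(A | B) p(C | B) for every value of B
for b in (0, 1):
    pb = sum(v for (a, bb, c), v in joint.items() if bb == b)
    for a, c in itertools.product((0, 1), repeat=2):
        p_ac_b = joint[(a, b, c)] / pb
        p_a_b = sum(joint[(a, b, cc)] for cc in (0, 1)) / pb
        p_c_b = sum(joint[(aa, b, c)] for aa in (0, 1)) / pb
        assert abs(p_ac_b - p_a_b * p_c_b) < 1e-12  # A ⊥ C | B holds

# Collider A -> B <- C with B = A XOR C and A, C fair independent coins
jointV = {(a, a ^ c, c): 0.25 for a in (0, 1) for c in (0, 1)}
p_A1_given_B1 = (sum(v for (a, b, c), v in jointV.items() if a == 1 and b == 1)
                 / sum(v for (a, b, c), v in jointV.items() if b == 1))
p_A1_given_B1_C1 = (jointV.get((1, 1, 1), 0.0)
                    / sum(v for (a, b, c), v in jointV.items() if b == 1 and c == 1))
print(p_A1_given_B1, p_A1_given_B1_C1)  # 0.5 vs 0.0: conditioning on B couples A and C
```

Given $B = 1$ and $C = 1$, the value $A = 1$ is impossible (since $1 \oplus 1 = 0$), which is exactly the collider effect: conditioning on the common child opens the path between its parents.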