In this course, we have often used libraries to sample random variables
import numpy as np
[...]
# note: np.random.exponential takes the scale parameter 1/lambda, not the rate lambda
random_exp_values = np.random.exponential(1 / my_lambda)
random_normal_values = np.random.normal(average, std)
Inverse transform sampling method
Wikipedia example for exponential distribution $P(X \leq x) = F_{X}(x)=1-e^{-\lambda x}$ with inverse $x=F_X^{-1}(r)=-{\frac {1}{\lambda }}\ln(1-r)$:
Any probability distribution $P(Y)$ can be described as data generated by inverse transform sampling:
$$Y=F_Y^{-1}(U),$$
where
$$F_Y^{-1} \text{ is a deterministic function}$$
and $U \sim \text{Uniform}(0,1)$ is independent uniform random noise.
Notation: $a \sim b$ means $a$ is "sampled from distribution" $b$
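As a small sketch of the Wikipedia example above: pushing uniform noise through $F_X^{-1}(r) = -\frac{1}{\lambda}\ln(1-r)$ should match NumPy's built-in exponential sampler (which is parameterized by the scale $1/\lambda$, not the rate $\lambda$). The rate value below is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0          # rate parameter lambda (arbitrary choice)
n = 100_000

# Inverse transform sampling: push uniform noise through F_X^{-1}
r = rng.uniform(0.0, 1.0, size=n)
x = -np.log(1.0 - r) / lam                 # F_X^{-1}(r) = -ln(1-r)/lambda

# Compare with NumPy's built-in sampler (note: scale = 1/lambda)
y = rng.exponential(scale=1.0 / lam, size=n)

print(x.mean(), y.mean())                  # both close to 1/lambda = 0.5
```

Both sample means should agree with the exponential mean $1/\lambda$, confirming the two samplers draw from the same distribution.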
Consider the following supervised learning task. Doctors prescribe two different treatments (A and B) to patients with kidney stones. Our goal is to predict which treatment we should prescribe to a patient (even a Naïve Bayes classifier can do this simple task). Let $X \in \{A,B\}$ denote the prescribed treatment, and let $Y \in \{0, 1\}$ be the success (1) or failure (0) of the treatment. Our dataset contains 700 patients prescribed a treatment, equally balanced between A and B.
Alice, the person in charge of applying machine learning at the hospital, investigated the data a little further and identified that doctors find treatment A more invasive and tend to only prescribe it in more severe cases. Her new data shows the following
Let $Y,T,S \in \{0,1\}$ be three binary random variables.
Consider the following interpretation.
Suppose we use hospital information as training data for our statistical model: $\mathcal{D}=\{(y_i,t_i,s_i)\}_{i=1}^{n}$ for each patient $i$.
The chain rule of probability states $$ P(A,B) = P(A|B)P(B) = P(B|A)P(A).$$
Hence, the joint probability distribution of $(Y,T,S)$ is $$ P(Y,T,S) $$ and can be decomposed as $$ P(Y|T,S)P(T|S)P(S) $$ or $$ P(S|Y,T)P(Y|T)P(T) $$ or $$ P(S|T,Y)P(T|Y)P(Y) $$ or ...
A joint probability distribution is simply a way to assign probabilities to joint events
We should never use conditional distributions to interpret how the data was generated.
Q: Why?
A: Because any of the following data generation processes describes the training data equally well:
where $U_S, U_T, U_Y \sim \text{Uniform}(0,1)$ are uniform variables sampled independently.
where $U_S, U_T, U_Y \sim \text{Uniform}(0,1)$ are sampled independently.
where $U_S, U_T, U_Y \sim \text{Uniform}(0,1)$ are sampled independently.
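A minimal numerical sketch of this point: the two sampling orders below ($S,T,Y$ versus $Y,T,S$), each driven by independent uniforms through inverse transforms, produce the very same joint distribution, so data alone cannot distinguish them. The 2×2×2 probability table is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# An arbitrary joint distribution P(S,T,Y) over three binary variables,
# stored as a 2x2x2 table indexed as P[s, t, y] (entries sum to 1).
P = np.array([[[0.10, 0.15], [0.05, 0.20]],
              [[0.12, 0.08], [0.10, 0.20]]])

def sample_S_T_Y(n):
    """S = F_S^{-1}(U_S), T = F_T^{-1}(S, U_T), Y = F_Y^{-1}(T, S, U_Y)."""
    U_S, U_T, U_Y = rng.uniform(size=(3, n))
    S = (U_S < P.sum(axis=(1, 2))[1]).astype(int)              # P(S=1)
    pT1 = (P[:, 1, :].sum(axis=1) / P.sum(axis=(1, 2)))[S]     # P(T=1|S)
    T = (U_T < pT1).astype(int)
    pY1 = (P[:, :, 1] / P.sum(axis=2))[S, T]                   # P(Y=1|S,T)
    Y = (U_Y < pY1).astype(int)
    return S, T, Y

def sample_Y_T_S(n):
    """Y = F_Y^{-1}(U_Y), T = F_T^{-1}(Y, U_T), S = F_S^{-1}(T, Y, U_S)."""
    U_S, U_T, U_Y = rng.uniform(size=(3, n))
    Y = (U_Y < P.sum(axis=(0, 1))[1]).astype(int)              # P(Y=1)
    pT1 = (P[:, 1, :].sum(axis=0) / P.sum(axis=(0, 1)))[Y]     # P(T=1|Y)
    T = (U_T < pT1).astype(int)
    pS1 = (P[1] / P.sum(axis=0))[T, Y]                         # P(S=1|T,Y)
    S = (U_S < pS1).astype(int)
    return S, T, Y

def empirical_joint(S, T, Y):
    J = np.zeros((2, 2, 2))
    np.add.at(J, (S, T, Y), 1)
    return J / len(S)

J1 = empirical_joint(*sample_S_T_Y(n))
J2 = empirical_joint(*sample_Y_T_S(n))
print(np.abs(J1 - P).max(), np.abs(J2 - P).max())  # both tiny
```

Both empirical joints match the table `P` up to sampling noise, even though the two processes have opposite "execution orders".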
Hence, we cannot predict what happens if we force $T = A$ in the data generation process (force treatment to be "A"): $$ P(Y,S|do(T=A)) $$
The do() notation asks what would happen if we forced a variable to take a certain value. This notation was developed by Judea Pearl.

Alternative notation: an alternative notation for $P(Y,S|do(T=A))$ is $$ P(Y(T=A),S), $$ which is the notation used by Guido Imbens.
the "do" operation is forcing $T=A$, hence the data is generated as $$\begin{align} S&=F^{-1}_S(U_S),\\ T&=A,\\ Y&=F^{-1}_Y(T,S,U_Y).\end{align}$$
In a different data generation process, the "do(T=A)" operation gets the following data $$\begin{align} T&=A,\\ S&=F^{-1}_S(T,U_S),\\ Y&=F^{-1}_Y(T,S,U_Y).\end{align}$$
In yet another data generation process, the "do(T=A)" operation gets the following data $$\begin{align} T&=A,\\ S&=F^{-1}_S(U_S),\\ Y&=F^{-1}_Y(T,S,U_Y).\end{align}$$
Q: Which data generation process is more likely to describe our hospital data?
In our data, we found that given treatment B ($T=B$), patients are more likely to recover ($Y=1$) than with treatment $T=A$: $$ P(Y=1|T=B) > P(Y=1|T=A) $$
Is the above enough evidence to say that treatment B is better than A?
The "Simple Statistical Model" fallacy
We could describe the data generation process of this problem using the following random variables:
The above data generation can be described by an execution graph, called the causal Directed Acyclic Graph (DAG):
Kidney stone size ($S$) is a confounder variable: a common cause of both treatment $T$ and outcome $Y$.

Another data generation process for $P(Y,T,S)$:
$$\begin{align} S&=F^{-1}_S(U_\text{Zeus}),\\ T&=F^{-1}_T(U_\text{Zeus}),\\ Y&=F^{-1}_Y(U_\text{Zeus}),\end{align}$$where $U_\text{Zeus} \sim \text{Uniform}(0,1)$ is a decision of the Greek god Zeus.
Q: From data alone, can we tell which data generation process is the correct one?
No. From data alone, we cannot tell which data generation process is the correct one.
Structural Causal Modeling
Structural Causal Modeling (SCM) is a formal way to describe what we know about the data generation process.
What happens to $Y$ if we force $do(T=B)$, i.e., we "force" treatment B on patients (regardless of their kidney stone condition)?
Consider the SCM:
$$\begin{align} S&=F^{-1}_S(U_\text{Zeus}),\\
T&=F^{-1}_T(U_\text{Zeus}),\\
Y&=F^{-1}_Y(U_\text{Zeus}),\end{align}$$
where $U_\text{Zeus} \sim \text{Uniform}(0,1)$ is a decision of the Greek god Zeus.
Now let's see what happens to $Y$ if we set $do(T=B)$: $$\begin{align} S&=F^{-1}_S(U_\text{Zeus}),\\ T&=B,\\ Y&=F^{-1}_Y(U_\text{Zeus}).\end{align}$$
Q: Under this data generation process, does forcing treatment B change the probability of a favorable outcome?
A: No.
Q: Could forcing treatment B (that is, $do(T=B)$) change the probability of patient outcome?
$$\begin{align} S&=F^{-1}_S(U_\text{Zeus}),\\ T&=B,\\ Y&=F^{-1}_Y(T,U_\text{Zeus}).\end{align}$$

A: Yes.
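A quick numerical sketch of the two SCMs above, using made-up threshold values for the structural functions (all probabilities below are illustrative assumptions): in the first SCM, $Y$ ignores $T$, so $do(T=B)$ leaves $P(Y=1)$ untouched; in the second, $Y$ depends on $T$, so the intervention shifts it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
U_zeus = rng.uniform(size=n)

# SCM 1: Y = F_Y^{-1}(U_Zeus) ignores T entirely (hypothetical threshold)
def Y_scm1(T, U):
    return (U < 0.6).astype(int)          # P(Y=1) = 0.6 regardless of T

# SCM 2: Y = F_Y^{-1}(T, U_Zeus) depends on T (hypothetical thresholds)
def Y_scm2(T, U):
    p = np.where(T == 1, 0.8, 0.6)        # treatment B (coded 1) helps
    return (U < p).astype(int)

T_obs = (U_zeus < 0.5).astype(int)        # observed T = F_T^{-1}(U_Zeus)
T_do = np.ones(n, dtype=int)              # do(T=B): force T to B (coded 1)

# SCM 1: the intervention changes nothing about Y
print(Y_scm1(T_obs, U_zeus).mean(), Y_scm1(T_do, U_zeus).mean())
# SCM 2: the intervention raises P(Y=1)
print(Y_scm2(T_obs, U_zeus).mean(), Y_scm2(T_do, U_zeus).mean())
```

Under SCM 1 the two means are identical; under SCM 2 the interventional mean moves from about 0.6 to about 0.8.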
Galles & Pearl (1998) show that any data generation process can be described through a causal DAG.
A simple way to describe the above data generation process is through its "execution" graph (Causal DAG):
Example of the Causal DAG from $$\begin{align} S&=F^{-1}_S(U_\text{Zeus}),\\ T&=F^{-1}_T(U_\text{Zeus}),\\ Y&=F^{-1}_Y(T,U_\text{Zeus}).\end{align}$$
We could also include all variables in the causal DAG:
Galles & Pearl (1998) show that any data generation process can always be represented by a DAG:
U = np.random.uniform(0, 1)
# circular definitions: Chicken needs Egg and Egg needs Chicken -- not a DAG
Chicken = F_Chicken(Egg, U)
Egg = F_Egg(Chicken, U)
U_Dino = np.random.uniform(0, 1)
Dinosaur = F_Dino(U_Dino)
U = np.random.uniform(0, 1)
Egg = F_Egg(Dinosaur, U)
Chicken = F_Chicken(Egg, U)
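To make the pseudocode above runnable, here is a sketch with hypothetical placeholder structural functions (the thresholds are arbitrary assumptions); the only point is that the DAG executes in topological order: Dinosaur → Egg → Chicken.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical structural functions, just to make the DAG executable:
def F_Dino(u):            # Dinosaur = F_Dino(U_Dino)
    return u < 0.9        # a dinosaur exists with probability 0.9

def F_Egg(dino, u):       # Egg = F_Egg(Dinosaur, U)
    return dino and (u < 0.8)

def F_Chicken(egg, u):    # Chicken = F_Chicken(Egg, U)
    return egg and (u < 0.7)

# Execute the DAG in topological order: Dinosaur -> Egg -> Chicken
U_dino = rng.uniform()
dino = F_Dino(U_dino)
U = rng.uniform()         # same noise U feeds both Egg and Chicken, as above
egg = F_Egg(dino, U)
chicken = F_Chicken(egg, U)
print(dino, egg, chicken)
```

By construction a chicken requires an egg, and an egg requires a dinosaur, so the sampled values always respect the ancestral ordering.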
Goal: We want to predict $P(Y|\text{do}(X=x))$
Causal Adjustment Formula (Adjustment for Direct Causes, Theorem 3.2.2 of (Pearl 2009)): $$ P(y \mid do(x)) = \sum_{z} P(y \mid x, z)\, P(z), $$ where $z$ ranges over the values of the direct causes (parents) of $X$ in the causal DAG.
Causal Adjustment Formula Example:
S = kidney stone size

Let's assume the following causal DAG:
$P(Y=1|T=A)= 0.78$
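A worked sketch of the adjustment formula on this example. The per-stratum counts below are the classic kidney-stone numbers (Charig et al., 1986) commonly quoted with this example; they reproduce $P(Y=1|T=A)=0.78$ above, but substitute the course's actual counts if they differ.

```python
# S: stone size (0 = small, 1 = large); T: treatment; Y: success.
# Counts: classic kidney-stone study (assumed to match the course data).
n_small, n_large = 357, 343                 # 700 patients total
P_S = {0: n_small / 700, 1: n_large / 700}  # P(S)

# P(Y=1 | T, S) from per-stratum success counts
P_Y1 = {('A', 0): 81 / 87,   ('A', 1): 192 / 263,
        ('B', 0): 234 / 270, ('B', 1): 55 / 80}

# Observed conditionals P(Y=1 | T): these weight S by P(S|T), not P(S)
P_Y1_given_A = (81 + 192) / 350     # = 0.78, as in the slide
P_Y1_given_B = (234 + 55) / 350     # ~ 0.83: B looks better observationally

# Adjustment formula: P(Y=1 | do(T=t)) = sum_s P(Y=1 | T=t, S=s) P(S=s)
def p_do(t):
    return sum(P_Y1[(t, s)] * P_S[s] for s in (0, 1))

print(P_Y1_given_A, P_Y1_given_B)
print(p_do('A'), p_do('B'))         # A is better once we adjust for S
```

The conditional comparison favors B, but adjusting for the confounder $S$ reverses the conclusion: $P(Y=1|do(T=A)) > P(Y=1|do(T=B))$, i.e., Simpson's paradox resolved by the causal DAG.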
From cnn.com:
In February 2021, Zillow appeared so confident in its ability to use artificial intelligence to estimate the value of homes that it announced a new option: for certain homes, its so-called "Zestimate" would also represent an initial cash offer from the company to purchase the property.
The move, touted by company executives at the time as "an exciting advancement," was intended to streamline the process for homeowners considering selling to Zillow as part of its home-flipping business. Zillow promoted this option as a way to make it convenient to sell a home while minimizing interactions with others during the pandemic. Just eight months later, however, the company is shutting down that business, Zillow Offers, entirely.

The decision, announced in early November 2021, marks a stunning defeat for Zillow. The real estate listing company took a \$304 million inventory write-down in the third quarter, which it blamed on having recently purchased homes for prices that are higher than it thinks it can sell them. The company saw its stock plunge and it now plans to cut 2,000 jobs, or 25\% of its staff.
Modeling: