A data-fusion example


In the blog post on correlations, I ended with the well-known mantra: "correlation does not imply causation". Most of empirical science has been focusing on correlations, because determining causation is hard. However, Pearl & Mackenzie (2018) urge scientists to search for mechanisms, because it "is critical for science, as well as to everyday life, because different mechanisms call for different actions when circumstances change". For example, we know that a lack of Vitamin C causes us to get scurvy. We can depict this as $\text{Vitamin C} \rightarrow \text{scurvy}$. Knowing this is very important, because without it we might think that bananas can cure scurvy (Pearl & Mackenzie (2018)) and send a trade ship on its way with bananas instead of oranges. Alternatively, we could have found that crews on ships with strawberries don't get scurvy (strawberries contain Vitamin C) and concluded that all red fruits can cure scurvy.

So, it is important to know the exact cause of things to make good decisions. One of the easiest ways to determine cause and effect is via an intervention, denoted by the do-operator (Pearl (2009)). Looking back at the scurvy example, it can be rewritten as follows: let Vitamin C be denoted by $X$ (the cause) and scurvy by $Y$ (the effect). Then, we test whether scurvy is caused by a lack of Vitamin C by figuring out $P(Y | do(X))$. The do-operator means that you do an experiment where you change $X$ and only $X$ and then measure the effect on $Y$. The "and only" part is important and can be obtained by doing a randomized controlled trial. The idea of such a trial is to take two completely random subsets of your sample. Then, for one subset you change $X$ and for the other subset you do not change $X$. Apart from $X$, you don't try to change anything, so in medical settings the doctors are not even allowed to know who gets the medication, since that could affect $Y$. (See the story of Clever Hans for an example where observers affect the outcome by accident.) If done correctly, a randomized trial removes the causal arrows coming into $X$. For example, without randomization, it could be that the patients who get the medicine are actually cured by swimming in the sea $Z$. We can depict this in a causal graph as
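To make this concrete, here is a small simulation of the scurvy setting. All the numbers (the 0.5 effect size, the strength of the sea-swimming confounder $Z$) are made up for illustration; the point is only that the naive observational contrast is biased by $Z$, while the coin-flip assignment of a randomized trial recovers the causal effect.

```python
import random

random.seed(0)
n = 100_000

def bern(p):
    """Draw a Bernoulli sample with success probability p."""
    return 1 if random.random() < p else 0

# Hypothetical confounder Z (e.g. swimming in the sea) affecting both
# treatment X (vitamin C) and outcome Y; all probabilities are invented.
data_obs, data_rct = [], []
for _ in range(n):
    z = bern(0.5)
    x_obs = bern(0.2 + 0.6 * z)   # Z -> X: confounded treatment choice
    x_rct = bern(0.5)             # coin flip: the Z -> X arrow is removed
    data_obs.append((x_obs, bern(0.1 + 0.5 * x_obs + 0.3 * z)))
    data_rct.append((x_rct, bern(0.1 + 0.5 * x_rct + 0.3 * z)))

def diff(data):
    """Mean outcome difference between the X=1 and X=0 groups."""
    m1 = [y for x, y in data if x == 1]
    m0 = [y for x, y in data if x == 0]
    return sum(m1) / len(m1) - sum(m0) / len(m0)

print(f"observational difference: {diff(data_obs):.3f}")  # biased upward by Z
print(f"randomized difference:    {diff(data_rct):.3f}")  # near the true 0.5
```

With these (made-up) parameters the observational contrast lands well above the true effect of 0.5, because the treated group contains more sea-swimmers.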

(Thanks to Kumor (2018) for code examples of these graphs.) If the randomization is representative of the whole population and executed correctly, then we can say that the graph changed to

which assumes that the two subsets are not fundamentally different. Unfortunately, it is often not possible to run a randomized trial, since it could be too expensive, too time-consuming or unethical. For instance, it is a bad plan to split a group in half and force one half of the group to smoke while tracking the group for 40 years to see whether smoking causes cancer. Luckily, solutions exist to determine causation from observations alone (Bollen & Pearl (2013)). Even better, Bareinboim & Pearl (2016) summarize how data from observational studies can be combined with randomized trials to find cause and effect; this can be useful to learn a causal effect in one population and apply it to another population. This problem is similar to meta-analyses. However, meta-analyses typically "'[average] out' differences (e.g., using inverse-variance weighting), which, in general, tends to blur, rather than exploit design distinctions among the available studies" (Bareinboim & Pearl (2016)). For a longer discussion of the lack of effectiveness of meta-analyses, see de Vrieze (2018).
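One such solution for the graph above is back-door adjustment: when the confounder $Z$ is measured, $P(y|do(x)) = \sum_z P(y|x,z)P(z)$ is estimable from observational data alone. A minimal sketch, reusing the invented scurvy probabilities from before (none of these numbers come from the literature):

```python
import random

random.seed(1)
n = 200_000

def bern(p):
    """Draw a Bernoulli sample with success probability p."""
    return 1 if random.random() < p else 0

# Purely observational data, but with the confounder Z *measured*.
samples = []
for _ in range(n):
    z = bern(0.5)
    x = bern(0.2 + 0.6 * z)             # Z confounds the treatment choice
    y = bern(0.1 + 0.5 * x + 0.3 * z)   # true causal effect of X is 0.5
    samples.append((x, y, z))

def p_y_do(x_val):
    """Back-door adjustment: sum_z P(y | x, z) P(z)."""
    total = 0.0
    for z_val in (0, 1):
        cell = [y for x, y, z in samples if x == x_val and z == z_val]
        p_z = sum(1 for _, _, z in samples if z == z_val) / n
        total += (sum(cell) / len(cell)) * p_z
    return total

ate = p_y_do(1) - p_y_do(0)
print(f"adjusted effect from observational data: {ate:.3f}")  # near 0.5
```

No intervention was performed, yet stratifying on $Z$ and re-weighting by $P(z)$ recovers the effect that the randomized trial would have measured.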

In this blog post, my aim is to look at some examples of combining observational and randomized controlled trial data in an attempt to figure out how and when it can be applied.


Thanks to the do-calculus presented by Pearl (2009), we can rewrite our graph without thinking. (By that, I mean that the calculus allows you to easily verify equivalences, just like $a + b = b + a$ holds for any $a$ and $b$.) For arbitrary disjoint sets of nodes $X, Y, Z$ and $W$ in a causal DAG $G$, with $G_{\overline{X}}$ and $G_{\underline{X}}$ denoting the graph obtained by deleting all arrows pointing to, respectively emerging from, nodes in $X$, we have (Bareinboim & Pearl (2016))

Rule 1 (insertion/deletion of observations):

$$P(y|do(x),z,w) = P(y|do(x),w) \: \text{if} \: (Y \perp Z|X, W)_{G_{\overline{X}}}.$$

Rule 2 (action/observation exchange):

$$P(y|do(x), do(z), w) = P(y|do(x), z, w) \: \text{if} \: (Y \perp Z|X, W)_{G_{\overline{X} \underline{Z}}}.$$

Rule 3 (insertion/deletion of actions):

$$P(y|do(x), do(z), w) = P(y|do(x), w) \: \text{if} \: (Y \perp Z|X, W)_{G_{\overline{X} \overline{Z^*}}},$$

where $Z^*$ is the set of $Z$-nodes that are not ancestors of any $W$-node in $G_{\overline{X}}$.
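The conditions on the right-hand sides of these rules are d-separation statements, which can be checked mechanically. Below is a minimal sketch of such a check via the standard moralization criterion (restrict to the ancestors of $X \cup Y \cup Z$, marry co-parents, drop directions, delete the conditioning set, test connectivity); the dictionary-of-parents representation is my own choice, not from any paper. It is applied to the scurvy graph $Z \rightarrow X \rightarrow Y$, $Z \rightarrow Y$, where Rule 2 needs $(Y \perp X | Z)$ in $G_{\underline{X}}$.

```python
from itertools import combinations

def ancestors(parents, nodes):
    """All ancestors of `nodes` (inclusive) in a DAG given as node -> parent set."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, xs, ys, zs):
    """Check (X ⫫ Y | Z) in the DAG via the moralization criterion."""
    keep = ancestors(parents, set(xs) | set(ys) | set(zs))
    # Build the moral (undirected) graph on the ancestral subgraph.
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in parents.get(v, ()) if p in keep]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for a, b in combinations(ps, 2):   # marry co-parents
            adj[a].add(b); adj[b].add(a)
    # Delete the conditioning nodes and search for any path X -- Y.
    blocked = set(zs)
    stack, seen = [x for x in xs if x not in blocked], set()
    while stack:
        v = stack.pop()
        if v in ys:
            return False
        seen.add(v)
        stack += [w for w in adj[v] - seen if w not in blocked]
    return True

# G underline-X for the scurvy graph: arrows out of X removed.
parents_underline_x = {"X": {"Z"}, "Y": {"Z"}, "Z": set()}
print(d_separated(parents_underline_x, {"X"}, {"Y"}, {"Z"}))  # True
print(d_separated(parents_underline_x, {"X"}, {"Y"}, set()))  # False
```

Since $(Y \perp X | Z)$ holds in $G_{\underline{X}}$ but not unconditionally, Rule 2 licenses $P(y|do(x), z) = P(y|x, z)$ for this graph, while $P(y|do(x)) = P(y|x)$ is not licensed.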


Scientific results are meant to be used across different populations. This, according to Pearl & Bareinboim (2014), is called transportability. Specifically, it is about transferring causal effects from experimental studies to observational settings. Here, I will mostly copy the example of Bareinboim & Pearl (2016) and add some clarifications at the steps which I find unclear. Consider an experimental source population $\pi$ and an observational target population $\pi^*$. The variables are treatment $X$, outcome $Y$, age $Z$ and a set of unaccounted factors $S$.

The unaccounted factors create differences in age between $\pi$ and $\pi^*$. These unaccounted factors are unknown, but it is known that they cause the differences in age between $\pi$ and $\pi^*$; that is why $S$ is denoted with a black box in the graph. Note that this means that the graph is an overlay of the graphs of the source and target populations. Now, the query $Q$, the causal effect of $X$ on $Y$ in the target population, can be rewritten as (Bareinboim & Pearl (2016))

$$
\begin{aligned}
& Q \\
= \hspace{3mm} & \hspace{5mm} \{ \: \text{By definition of $Q$.} \: \} \\
& \sum_z P(y|do(x), S=s^*,z)P(z|S=s^*,do(x)) \\
= \hspace{3mm} & \hspace{5mm} \{ \: \text{By \textit{S}-admissibility.} \: \} \\
& \sum_z P(y|do(x), z)P(z|S=s^*,do(x)) \\
= \hspace{3mm} & \hspace{5mm} \{ \: \text{By the 3rd rule of the \textit{do}-calculus.} \: \} \\
& \sum_z P(y|do(x), z) P(z|S = s^*) \\
= \hspace{3mm} & \hspace{5mm} \{ \: \text{By definition of the \textit{S}-node.} \: \} \\
& \sum_z P(y|do(x), z) P^*(z).
\end{aligned}
$$

This is called a transport formula, because it explains "how experimental findings in $\pi$ are transported over to $\pi^*$; the first factor is estimable from $\pi$ and the second one from $\pi^*$" (Bareinboim & Pearl (2016)).
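A toy numeric sketch of the transport formula, with invented numbers: suppose age $Z$ is binary and the experimental source population $\pi$ is younger than the observational target $\pi^*$. Every probability below is made up purely to show the re-weighting mechanics.

```python
# Hypothetical age strata Z = 0 (young), 1 (old) with different
# distributions in the source pi and the target pi* (invented numbers).
p_z_source = [0.7, 0.3]   # P(z),  estimable from pi
p_z_target = [0.3, 0.7]   # P*(z), estimable from pi*

# Stratum-specific experimental findings P(y | do(x), z), estimable from pi.
p_y_do_x1 = [0.8, 0.4]    # treated
p_y_do_x0 = [0.5, 0.3]    # untreated

def transported_effect(p_z):
    """Transport formula: sum_z [P(y|do(x1), z) - P(y|do(x0), z)] * p(z)."""
    return sum((a - b) * w for a, b, w in zip(p_y_do_x1, p_y_do_x0, p_z))

print(f"effect in pi : {transported_effect(p_z_source):.2f}")
print(f"effect in pi*: {transported_effect(p_z_target):.2f}")
```

Because the treatment helps the young stratum more, the effect transported to the older target population is smaller than the effect measured in the source trial; simply averaging the source-level result, meta-analysis style, would miss this.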


Bareinboim & Pearl. (2016). Causal inference and the data-fusion problem.

Bollen & Pearl. (2013). Eight Myths About Causality and Structural Equation Models.

Kumor, D. (2018). Causal Graphs in LaTeX.

Pearl, J. (2009). Causality. Cambridge University Press.

Pearl & Bareinboim. (2014). External Validity: From Do-Calculus to Transportability Across Populations. Statistical Science.

Pearl, J., & Mackenzie, D. (2018). The book of why: the new science of cause and effect. Basic Books.

de Vrieze, J. (2018, September 18). Meta-analyses were supposed to end scientific debates. Often, they only cause more controversy. Science Magazine. Accessed on 2020-11-01.